{"id":328,"date":"2021-07-15T17:11:43","date_gmt":"2021-07-15T15:11:43","guid":{"rendered":"https:\/\/stage.usercentrics.com\/?post_type=knowledge&p=10990"},"modified":"2024-09-03T19:39:22","modified_gmt":"2024-09-03T17:39:22","slug":"optimizing-your-bigquery-tables-using-clustering","status":"publish","type":"knowledge","link":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/","title":{"rendered":"Optimizing your BigQuery tables using clustering"},"content":{"rendered":"\n\n<p>Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called <em>clustering<\/em>.<\/p>\n<p><strong>A little heads-up:<\/strong> Clustering and partitioning don&#8217;t preempt each other. And in most cases, it makes sense to use both in conjunction.<\/p>\n\n\n<h2 class=\"wp-block-heading\">What are clustered tables?<\/h2>\n<p><span style=\"font-weight: 400;\">Using clustered tables, BigQuery automatically organizes and colocates related data based on one or multiple selected columns. Important to note here is that the order of columns you are specifying for clustering is essential. But more on this later.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Since BigQuery does almost everything automatically \u2013\u2013 in regards to clustering \u2013\u2013 the only thing you have to do is to tell BigQuery on which columns and in which order you want BigQuery to cluster your data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">BigQuery then stores the data in multiple blocks inside the internal BigQuery storage. Storing the data in pre-sorted blocks means that BigQuery can \u2013\u2013 based on your query filter clause \u2013\u2013 use these blocks to prevent scans of unnecessary data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">According to the official docs using clustering will significantly improve performance when the scanned table \u2013\u2013 or table partition \u2013\u2013 exceeds 1 GB.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">BigQuery offers automatic re-clustering, which means that even when you add new data to the tables, BigQuery will automatically sort them into the existing blocks.<\/span><\/p>\n<p><b>A small example of a clustered table compared to a non-clustered one<\/b><\/p>\n<p><span style=\"font-weight: 400;\">You can see a table with randomly inserted data on the left side and an on the <\/span><em><span style=\"font-weight: 400;\">name<\/span><\/em><span style=\"font-weight: 400;\"> and <\/span><em><span style=\"font-weight: 400;\">event<\/span><\/em><span style=\"font-weight: 400;\"> column clustered table on the right side in the following image.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As you can see, rows with the same <\/span><em><span style=\"font-weight: 400;\">name<\/span><\/em><span style=\"font-weight: 400;\"> are located right beneath each other. And in the range of the same <\/span><em><span style=\"font-weight: 400;\">name&#8217;s<\/span><\/em><i><span style=\"font-weight: 400;\"> the<\/span><\/i><span style=\"font-weight: 400;\"> rows with the same <\/span><em><span style=\"font-weight: 400;\">event<\/span><\/em><span style=\"font-weight: 400;\"> are also colocated.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-10992 aligncenter\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/IMAGE_CLUSTERED_TABLE-1-300x292.png\" alt=\"IMAGE_CLUSTERED_TABLE\" width=\"300\" height=\"292\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Taking this example, you can already imagine the possible improvement when querying data. As an example, let&#8217;s assume that you want to query all values where the <\/span><em><span style=\"font-weight: 400;\">name<\/span><\/em><span style=\"font-weight: 400;\"> equals <\/span><em><span style=\"font-weight: 400;\">&#8220;1&#8221;<\/span><\/em><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Without clustering, BigQuery would have to scan the whole table because each row could contain the value <\/span><em><span style=\"font-weight: 400;\">&#8220;1&#8221;<\/span><\/em><span style=\"font-weight: 400;\"> in its <\/span><em><span style=\"font-weight: 400;\">name<\/span><\/em><span style=\"font-weight: 400;\"> column.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With clustering, BigQuery knows that only the first three columns will match this filter condition. That results in 50% fewer scanned columns since all the possible resulting rows are right beneath each other, and BigQuery can skip the other 50%. Also, 50% fewer checked rows also means 50% fewer costs since BigQuery charges you on the bytes read. Even though this might not be 100% accurate, as we will see later, it&#8217;s accurate enough to get the point of clustering data in tables.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The process of skipping specific blocks on scanning is also called <\/span><em><span style=\"font-weight: 400;\">block pruning<\/span><\/em><span style=\"font-weight: 400;\">. Even though BigQuery doesn&#8217;t necessarily create one block per one distinct value in the clustered column. The amount of blocks BigQuery creates depends highly on the amount of data stored. For 1 MB of data, even though it might contain 100 different values in the clustered column, BigQuery won&#8217;t create 100 blocks and save 99% of bytes read when filtering for one specific value. But the more (distinct) data is available inside the table; the more effective clustering will become. As already mentioned above, Google suggested a minimum table size of 1 GB to see significant improvements using clustering. But we will see in the next section that improvements will start way earlier.<\/span><\/p>\n\n\n<h2 class=\"wp-block-heading\">Hands-On<\/h2>\n<p><span style=\"font-weight: 400;\">Now it&#8217;s time for some actual hands-on coding and seeing some real-world examples. I split the hands-on part into four small portions.<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Easy script to create some test data we then can import and query.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Import the test data into BigQuery as a normal and as a clustered table.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Query both tables with the same query and compare results.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Making an assumption on Google&#8217;s clustering block size based on a real example.<\/span><\/li>\n<\/ol>\n<p><b>Create test data<\/b><\/p>\n<p><span style=\"font-weight: 400;\">We will orient ourselves by the first image in this article to create test data, where we showed how BigQuery organizes a clustered table compared to a normal one.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">That means that we will have two columns. One column is representing a <\/span><em><span style=\"font-weight: 400;\">name<\/span><\/em><span style=\"font-weight: 400;\">, the other one representing an <\/span><em><span style=\"font-weight: 400;\">event<\/span><\/em><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Unlike <a href=\"https:\/\/usercentrics.com\/knowledge-hub\/optimizing-your-bigquery-tables-using-partitioning-time-unit-column-partitioned-tables\/\">the last article<\/a><\/span><span style=\"font-weight: 400;\">, we will dynamically create the data via a JavaScript script and write it into a CSV file. We need a minimum amount of data \u2013\u2013 we will create 5,000,000 rows resulting in a 28 MB CSV file \u2013\u2013 so we can see some effect of the clustering.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">So, let&#8217;s check out the actual code for generating our test data.<\/span><\/p>\n<div id=\"attachment_11002\" style=\"width: 929px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11002\" class=\"size-full wp-image-11002\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/Code_csv4.png\" alt=\"Code_csv#4\" width=\"919\" height=\"587\" \/><p id=\"caption-attachment-11002\" class=\"wp-caption-text\">Code_csv#4<\/p><\/div>\n<p><span style=\"font-weight: 400;\">As you can see, we are using the <\/span><em><span style=\"font-weight: 400;\">csv-writer<\/span><\/em><span style=\"font-weight: 400;\"> NPM module for writing the CSV file. So make sure you install it via <\/span><em><span style=\"font-weight: 400;\">npm i &#8211;save csv-writer<\/span><\/em><span style=\"font-weight: 400;\"> before running the script.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The script itself should be easy to understand. First, we define some constants that determine the number of rows, amount of different names, and events that the script will generate randomly. Due to simplicity&#8217;s sake, I used plain numbers to represent the names.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When you use my default values, the script will produce a 28 MB CSV file containing exactly 5,000,000 rows of data \u2013 plus one additional for the column header \u2013 with 1,000 different names and ten different events. The script defines a name as a value between 0 and 999 and an event as a value between 0 and 9.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Of course, the produced CSV file will contain a little bit of different data each time you run it. But due to the high number of occurrences, you can expect each name to occur approximately 5.000 times.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Now that the test data is ready, it&#8217;s time to ship it to BigQuery.<\/span><\/p>\n<p><b>Create a new table with data via CSV import<\/b><\/p>\n<p><span style=\"font-weight: 400;\">If you look at the following snippet, you will see many similarities to <a href=\"https:\/\/usercentrics.com\/knowledge-hub\/optimizing-your-bigquery-tables-using-partitioning-time-unit-column-partitioned-tables\/\">the last article&#8217;s CSV import script<\/a><\/span><span style=\"font-weight: 400;\">. And true, most of it is the same code. The main difference is the fourth parameter of the <\/span><em><span style=\"font-weight: 400;\">importCSV()<\/span><\/em><span style=\"font-weight: 400;\"> method. Before we used <\/span><em><span style=\"font-weight: 400;\">timePartitioning<\/span><\/em><i><span style=\"font-weight: 400;\">,<\/span><\/i><span style=\"font-weight: 400;\"> we now use <\/span><em><span style=\"font-weight: 400;\">clustering<\/span><\/em><span style=\"font-weight: 400;\"> to define the fields that BigQuery should cluster.<\/span><\/p>\n<div id=\"attachment_11003\" style=\"width: 661px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11003\" class=\"size-full wp-image-11003\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/Code_Import-clustered_csv.png\" alt=\"Code_Import-clustered_csv\" width=\"651\" height=\"914\" \/><p id=\"caption-attachment-11003\" class=\"wp-caption-text\">Code_Import-clustered_csv<\/p><\/div>\n<p><span style=\"font-weight: 400;\">In the script, we define the clustering fields as <\/span><em><span style=\"font-weight: 400;\">[&#8220;name&#8221;, &#8220;event&#8221;]<\/span><\/em><span style=\"font-weight: 400;\">, which means BigQuery first clusters for the <\/span><em><span style=\"font-weight: 400;\">name<\/span><\/em><span style=\"font-weight: 400;\"> and then for the <\/span><em><span style=\"font-weight: 400;\">event<\/span><\/em><span style=\"font-weight: 400;\"> column.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">After you run the script successfully \u2013\u2013 it might take a bit to import the ~28 MB CSV file to BigQuery and wait for the clustering to finish \u2013\u2013 we can verify via the CLI if BigQuery added clustering correctly.<\/span><\/p>\n<p><em><span style=\"font-weight: 400;\">bq show &#8211;format=prettyjson\u00a0<\/span><\/em><\/p>\n<p><em><span style=\"font-weight: 400;\">YOUR_GCP_PROJECT:YOUR_DATASET_ID.my_clustered_table<\/span><\/em><\/p>\n<p><span style=\"font-weight: 400;\">This command will print the metadata of your BigQuery table to the CLI. There you should find some entry for <\/span><i><span style=\"font-weight: 400;\">clustering<\/span><\/i><span style=\"font-weight: 400;\"> looking like the following:<\/span><\/p>\n<p><em><span style=\"font-weight: 400;\">&#8220;clustering&#8221;: {<\/span><\/em><\/p>\n<p><em><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0&#8220;fields&#8221;: [<\/span><\/em><\/p>\n<p><em><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0 \u00a0&#8220;name&#8221;,<\/span><\/em><\/p>\n<p><em><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0 \u00a0&#8220;event&#8221;<\/span><\/em><\/p>\n<p><em><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0]<\/span><\/em><\/p>\n<p><em><span style=\"font-weight: 400;\">},<\/span><\/em><\/p>\n<p><span style=\"font-weight: 400;\">If you run the same command for the <\/span><em><span style=\"font-weight: 400;\">non_clustered_table<\/span><\/em><span style=\"font-weight: 400;\">, you should not see any entry for <\/span><em><span style=\"font-weight: 400;\">clustering<\/span><\/em><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><b>Run queries on the imported data<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Now that we imported all the data to the normal and the clustered table and verified that BigQuery enabled the clustering correctly on our clustered table, it&#8217;s time to run some queries against it to compare their statistics.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-11008 aligncenter\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/run-queries-1-676x1030.png\" alt=\"\" width=\"676\" height=\"1030\" \/> <img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-11009 aligncenter\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/run-queries-2-672x1030.png\" alt=\"Code for running queries - Usercentrics\" width=\"672\" height=\"1030\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Again, we are using quite a lot of code from <a href=\"https:\/\/usercentrics.com\/knowledge-hub\/optimizing-your-bigquery-tables-using-partitioning-time-unit-column-partitioned-tables\/\">the previous article<\/a> <\/span><span style=\"font-weight: 400;\">because most of it is independent of any clustering or partitioning.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Compared to before, we added some <\/span><em><span style=\"font-weight: 400;\">printConfig<\/span><\/em><span style=\"font-weight: 400;\"> to our <\/span><em><span style=\"font-weight: 400;\">runQuery<\/span><\/em><span style=\"font-weight: 400;\"> method to define what it should print \u2014 either the query result or the job statistics, or both.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In our standard search query \u2013 <\/span><em><span style=\"font-weight: 400;\">createQuery<\/span><\/em><span style=\"font-weight: 400;\"> \u2013 we are filtering the rows by a specific <\/span><i><span style=\"font-weight: 400;\">name<\/span><\/i><span style=\"font-weight: 400;\"> with the value <\/span><em><span style=\"font-weight: 400;\">&#8220;1&#8221;<\/span><\/em><span style=\"font-weight: 400;\">. In theory, this should affect around 5,000 rows because we have created 5,000.000 rows with 1,000 different names resulting in approximately 5,000 entries per name.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We added one additional query \u2013 <\/span><em><span style=\"font-weight: 400;\">createCountQuery <\/span><\/em><span style=\"font-weight: 400;\">\u2013 that we will run and print its result. We run this query to verify our assumption that around 5.000 rows should be affected by our filer condition.<\/span><\/p>\n<p><b>Query results<\/b><\/p>\n<p><span style=\"font-weight: 400;\">When you run the above script, you should get a similar result to the following:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-11004\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/Index_result.png\" alt=\"index results code - Usercentrics\" width=\"910\" height=\"646\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Because we don&#8217;t have a lot of data here, the <\/span><em><span style=\"font-weight: 400;\">Time<\/span><span style=\"font-weight: 400;\"> taken<\/span><\/em><span style=\"font-weight: 400;\"> is similar in both queries. As explained in the beginning, significant performance improvements will be visible when approaching the table size of 1 GB and more.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But what you can see is that the <\/span><em><span style=\"font-weight: 400;\">Rows<\/span><span style=\"font-weight: 400;\"> read<\/span><\/em><span style=\"font-weight: 400;\"> and the <\/span><em><span style=\"font-weight: 400;\">Bytes processed<\/span><\/em><span style=\"font-weight: 400;\"> \u2013 which Google charges you for \u2013 has significantly decreased.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">From 5,000,000 rows read and 64,450,259 bytes processed, we decreased down to 384,853 rows read and 4,947,899 bytes processed. That is a theoretically cost saving of <\/span><b>92.3%<\/b><span style=\"font-weight: 400;\">! Just because we used clustering.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can also see this big difference in the following graph:<\/span><\/p>\n<div id=\"attachment_10996\" style=\"width: 510px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10996\" class=\"wp-image-10996\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/Rows-Read-and-Bytes-Processed-1030x637.png\" alt=\"Rows-Read-and-Bytes-Processed\" width=\"500\" height=\"309\" \/><p id=\"caption-attachment-10996\" class=\"wp-caption-text\">Rows-Read-and-Bytes-Processed<\/p><\/div>\n<p><span style=\"font-weight: 400;\">The attentive audience will also notice that the <\/span><em><span style=\"font-weight: 400;\">Bytes billed<\/span><\/em><span style=\"font-weight: 400;\"> are not 92.3% smaller, even though one could expect this. And yes, absolutely right. But that is because Google charges a minimum of 10 MB per query regardless of the actual bytes processed. When the <\/span><em><span style=\"font-weight: 400;\">Bytes processed<\/span><\/em><span style=\"font-weight: 400;\"> are higher than 10 MB, the <\/span><em><span style=\"font-weight: 400;\">Bytes billed<\/span><\/em><span style=\"font-weight: 400;\"> will be almost always the same as the <\/span><em><span style=\"font-weight: 400;\">Bytes processed<\/span><span style=\"font-weight: 400;\">.<\/span><\/em><\/p>\n<p><b>Google&#8217;s clustering block size, an assumption<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Besides this, you might also notice that the <\/span><em><span style=\"font-weight: 400;\">Rows read<\/span><\/em><span style=\"font-weight: 400;\"> count is way higher than the 5,000 rows we expect to only scan for our query. That is due to some Google internal configuration of BigQuery where a cluster range&#8217;s minimum size exists. Even though Google&#8217;s &#8220;magic&#8221; is not only about the plain size of the colocated data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You could assume that when you increase the amount of data up to a certain point, the <\/span><em><span style=\"font-weight: 400;\">Rows read<\/span><\/em><span style=\"font-weight: 400;\"> should have the same value as the affected rows. Since we saw above that even though only 5,000 rows should be affected \u2013 the query read 384,853 rows \u2013 we could think that somewhere around this value, there could be some &#8220;magic number&#8221;.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As a small example \u2013 I don&#8217;t want to go too deep into this topic yet \u2013 I changed our above test data set like the following:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">I increased the total entries to 10,000,000.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">I decreased the different names to 20.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">These changes should, in theory, result in 20 different names, each occurring approximately 500,000 in the table. With this in place, we have more occurrences than the query from before has rows read.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">So let&#8217;s check the result:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-11005\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/Cluster_research1.png\" alt=\"Code for clustering research - Usercentrics\" width=\"926\" height=\"366\" \/><\/p>\n<p><span style=\"font-weight: 400;\">At first glance, the result might look somewhat confusing. Even though we now have 500,000 entries that match our filter criteria, BigQuery just read close to 400k rows? And we also verified with our <\/span><em><span style=\"font-weight: 400;\">createCountQuery<\/span><\/em><span style=\"font-weight: 400;\"> that there are really around 500k rows matching our filter.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But wait! If you remember, we are using a <\/span><em><span style=\"font-weight: 400;\">LIMIT 1<\/span><\/em><span style=\"font-weight: 400;\"> in our query because we were never interested in the actual result but only looked at the job statistics.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let&#8217;s change this <\/span><em><span style=\"font-weight: 400;\">LIMIT 1<\/span><\/em><span style=\"font-weight: 400;\"> to a <\/span><em><span style=\"font-weight: 400;\">LIMIT 600000<\/span><\/em><span style=\"font-weight: 400;\"><em>,<\/em> so we are retrieving all affected rows and look again at the results of our script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-11006\" src=\"https:\/\/usercentrics.com\/wp-content\/uploads\/sites\/7\/2021\/07\/Cluster_research.png\" alt=\"Code for clustering research - Usercentrics\" width=\"926\" height=\"369\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We can now clearly see that the job read a total of 799,675. That pretty close to two times the read rows of the <\/span><em><span style=\"font-weight: 400;\">LIMIT 1<\/span><\/em><span style=\"font-weight: 400;\"> case, which is 798,948. There might be something there&#8230;<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Even though I haven&#8217;t found any official documentation, it might provide you some hints regarding the possible block size for clustered tables. But this is purely my investigation so far, and there are for sure also other factors involved. I want to move this even further, but I think that would be too much for this article&#8217;s scope since it should be rather some introduction to clustering and not a deep dive into its internal structures.<\/span><\/p>\n\n\n<h2 class=\"wp-block-heading\">Clustering limitations and common pitfalls to be aware of<\/h2>\n<p><b>Maximum number for clustered columns<\/b><span style=\"font-weight: 400;\"> \u2013\u2013 BigQuery supports up to four columns for clustering.<\/span><\/p>\n<p><b>Clustering of <\/b><em><b>string<\/b><\/em><b> types<\/b><span style=\"font-weight: 400;\"> \u2013\u2013 When you use <\/span><em><span style=\"font-weight: 400;\">string<\/span><\/em><span style=\"font-weight: 400;\"> types for clustering, it&#8217;s essential to be aware that BigQuery only uses the first 1,024 characters of the cell value for clustering. Everything beyond that limit will not be considered by BigQuery&#8217;s clustering algorithm even though it&#8217;s valid to have longer values written into the cells.<\/span><\/p>\n<p><b>Order of clustered columns<\/b><span style=\"font-weight: 400;\"> \u2013\u2013 The order in which you define the columns for clustering is essential for good performance. If you want to benefit from the clustering mechanism, it&#8217;s necessary to use all the clustered columns or a subset of them in the left-to-right sort order in your filter expression. If you have the clustered columns <\/span><em><span style=\"font-weight: 400;\">A<\/span><span style=\"font-weight: 400;\">, <\/span><span style=\"font-weight: 400;\">B<\/span><\/em><span style=\"font-weight: 400;\"><em>,<\/em> and <\/span><em><span style=\"font-weight: 400;\">C<\/span><\/em><span style=\"font-weight: 400;\"><em>,<\/em> you will have to filter for all three of them, just <\/span><em><span style=\"font-weight: 400;\">A, or A<\/span><\/em><span style=\"font-weight: 400;\"> and <\/span><em><span style=\"font-weight: 400;\">B<\/span><\/em><span style=\"font-weight: 400;\">. Just filtering for <\/span><em><span style=\"font-weight: 400;\">B<\/span><\/em><span style=\"font-weight: 400;\"> and <\/span><em><span style=\"font-weight: 400;\">C<\/span><\/em><span style=\"font-weight: 400;\"> will not result in the expected performance boost. As a best practice, you should always specify the most frequently filtered or aggregated column first. The order in the actual SQL filter expression doesn&#8217;t affect the performance. It&#8217;s just essential on which columns you are filtering.<\/span><\/p>\n\n\n<p><strong>What&#8217;s next?<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">After we now know how we can stream, load, and query data in BigQuery and use partitioned and clustered tables to improve our performance and reduce costs, we will learn how to set up scheduled queries and typical use cases in the following article.<\/span><\/p>\n\n\n<h2 class=\"wp-block-heading\">Author<\/h2>\n<p><strong>Pascal Zwikirsch<\/strong><\/p>\n<p>Pascal Zwikirsch is a Technical Team Lead at Usercentrics. He manages a cross-functional team of several frontend and backend developers while providing a seamless connection between the technical world and the product managers and stakeholders.<\/p>\n<p>Pascal was a former frontend developer for several years and now specializes in backend services, cloud architectures, and DevOps principles<b>.<\/b>\u00a0He experienced both worlds in detail.\u00a0This experience and the collaboration with product managers made him capable of constructing full-fledged scalable software.<\/p>\n<p>Besides his job as a Team Lead of the B2B developer team, he likes to share his in-depth knowledge and insights in articles he creates as an independent technical writer.<\/p>\n<p><strong>For more content, follow me on\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/pascal-zwikirsch-3a95a1177\/\" target=\"_blank\" rel=\"noopener\">LinkedIn<\/a><\/strong><\/p>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.<\/p>\n","protected":false},"featured_media":2080,"template":"","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"tags":[],"magazine_issue":[],"magazine_tag":[],"resource_tag":[14],"class_list":["post-328","knowledge","type-knowledge","status-publish","has-post-thumbnail","hentry","resource_tag-privacy"],"acf":[],"yoast_head":"<title>Optimizing your BigQuery tables using clustering<\/title>\n<meta name=\"description\" content=\"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.\" \/>\n<meta name=\"robots\" content=\"noindex, follow\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing your BigQuery tables using clustering\" \/>\n<meta property=\"og:description\" content=\"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/\" \/>\n<meta property=\"og:site_name\" content=\"Usercentrics - US\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/usercentrics\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-03T17:39:22+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/usercentrics.com\/wp-content\/uploads\/2021\/03\/Google-BigQuery.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Optimizing your BigQuery tables using clustering\" \/>\n<meta name=\"twitter:description\" content=\"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/usercentrics.com\/wp-content\/uploads\/2021\/03\/Google-BigQuery.png\" \/>\n<meta name=\"twitter:site\" content=\"@usercentrics\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/\",\"url\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/\",\"name\":\"Optimizing your BigQuery tables using clustering\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/wp-content\\\/uploads\\\/sites\\\/7\\\/2021\\\/03\\\/Google-BigQuery.png\",\"datePublished\":\"2021-07-15T15:11:43+00:00\",\"dateModified\":\"2024-09-03T17:39:22+00:00\",\"description\":\"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/\"}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/#primaryimage\",\"url\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/wp-content\\\/uploads\\\/sites\\\/7\\\/2021\\\/03\\\/Google-BigQuery.png\",\"contentUrl\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/wp-content\\\/uploads\\\/sites\\\/7\\\/2021\\\/03\\\/Google-BigQuery.png\",\"copyrightNotice\":\"\u00a9 Copyright 2026 Usercentrics GmbH\",\"creator\":{\"@type\":\"Organization\",\"name\":\"Usercentrics GmbH\"},\"creditText\":\"Image: Usercentrics GmbH\",\"license\":\"\\\/terms-and-conditions\\\/\",\"acquireLicensePage\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/contact\\\/\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Resources\",\"item\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/resources\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Blog\",\"item\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Optimizing your BigQuery tables using clustering\",\"item\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/knowledge-hub\\\/optimizing-your-bigquery-tables-using-clustering\\\/\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/#website\",\"url\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/\",\"name\":\"Usercentrics - US\",\"description\":\"Consent Management Platform (CMP) Usercentrics\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/?s={search_term_string}\"}}],\"inLanguage\":\"en-US\",\"publisher\":{\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/#organization\"}},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/#organization\",\"name\":\"Usercentrics\",\"legalName\":\"Usercentrics GmbH\",\"url\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/\",\"foundingDate\":\"2012\",\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/company\\\/usercentrics\",\"https:\\\/\\\/twitter.com\\\/usercentrics\",\"https:\\\/\\\/www.facebook.com\\\/usercentrics\\\/\",\"https:\\\/\\\/instagram.com\\\/usercentrics\\\/\",\"https:\\\/\\\/www.tiktok.com\\\/@usercentrics\",\"https:\\\/\\\/www.youtube.com\\\/@Usercentrics\"],\"contactPoint\":{\"@type\":\"ContactPoint\",\"contactType\":\"customer support\",\"url\":\"https:\\\/\\\/usercentrics.com\\\/us\\\/contact\\\/\"}}]}<\/script>","yoast_head_json":{"title":"Optimizing your BigQuery tables using clustering","description":"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.","robots":{"index":"noindex","follow":"follow"},"og_locale":"en_US","og_type":"article","og_title":"Optimizing your BigQuery tables using clustering","og_description":"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.","og_url":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/","og_site_name":"Usercentrics - US","article_publisher":"https:\/\/www.facebook.com\/usercentrics","article_modified_time":"2024-09-03T17:39:22+00:00","og_image":[{"url":"https:\/\/usercentrics.com\/wp-content\/uploads\/2021\/03\/Google-BigQuery.png","type":"","width":"","height":""}],"twitter_card":"summary_large_image","twitter_title":"Optimizing your BigQuery tables using clustering","twitter_description":"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.","twitter_image":"https:\/\/usercentrics.com\/wp-content\/uploads\/2021\/03\/Google-BigQuery.png","twitter_site":"@usercentrics","twitter_misc":{"Est. reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/","url":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/","name":"Optimizing your BigQuery tables using clustering","isPartOf":{"@id":"https:\/\/usercentrics.com\/us\/#website"},"primaryImageOfPage":{"@id":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/#primaryimage"},"image":{"@id":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/#primaryimage"},"thumbnailUrl":"https:\/\/usercentrics.com\/us\/wp-content\/uploads\/sites\/7\/2021\/03\/Google-BigQuery.png","datePublished":"2021-07-15T15:11:43+00:00","dateModified":"2024-09-03T17:39:22+00:00","description":"Besides the table partitioning to improve performance and costs in BigQuery, there is also another technique available called clustering.","breadcrumb":{"@id":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/"}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/#primaryimage","url":"https:\/\/usercentrics.com\/us\/wp-content\/uploads\/sites\/7\/2021\/03\/Google-BigQuery.png","contentUrl":"https:\/\/usercentrics.com\/us\/wp-content\/uploads\/sites\/7\/2021\/03\/Google-BigQuery.png","copyrightNotice":"\u00a9 Copyright 2026 Usercentrics GmbH","creator":{"@type":"Organization","name":"Usercentrics GmbH"},"creditText":"Image: Usercentrics GmbH","license":"\/terms-and-conditions\/","acquireLicensePage":"https:\/\/usercentrics.com\/us\/contact\/"},{"@type":"BreadcrumbList","@id":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Resources","item":"https:\/\/usercentrics.com\/us\/resources\/"},{"@type":"ListItem","position":2,"name":"Blog","item":"https:\/\/usercentrics.com\/us\/knowledge-hub\/"},{"@type":"ListItem","position":3,"name":"Optimizing your BigQuery tables using clustering","item":"https:\/\/usercentrics.com\/us\/knowledge-hub\/optimizing-your-bigquery-tables-using-clustering\/"}]},{"@type":"WebSite","@id":"https:\/\/usercentrics.com\/us\/#website","url":"https:\/\/usercentrics.com\/us\/","name":"Usercentrics - US","description":"Consent Management Platform (CMP) Usercentrics","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/usercentrics.com\/us\/?s={search_term_string}"}}],"inLanguage":"en-US","publisher":{"@id":"https:\/\/usercentrics.com\/us\/#organization"}},{"@type":"Organization","@id":"https:\/\/usercentrics.com\/us\/#organization","name":"Usercentrics","legalName":"Usercentrics GmbH","url":"https:\/\/usercentrics.com\/us\/","foundingDate":"2012","sameAs":["https:\/\/www.linkedin.com\/company\/usercentrics","https:\/\/twitter.com\/usercentrics","https:\/\/www.facebook.com\/usercentrics\/","https:\/\/instagram.com\/usercentrics\/","https:\/\/www.tiktok.com\/@usercentrics","https:\/\/www.youtube.com\/@Usercentrics"],"contactPoint":{"@type":"ContactPoint","contactType":"customer support","url":"https:\/\/usercentrics.com\/us\/contact\/"}}]}},"_links":{"self":[{"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/knowledge\/328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/knowledge"}],"about":[{"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/types\/knowledge"}],"version-history":[{"count":0,"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/knowledge\/328\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/media\/2080"}],"wp:attachment":[{"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/media?parent=328"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/tags?post=328"},{"taxonomy":"magazine_issue","embeddable":true,"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/magazine_issue?post=328"},{"taxonomy":"magazine_tag","embeddable":true,"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/magazine_tag?post=328"},{"taxonomy":"resource_tag","embeddable":true,"href":"https:\/\/usercentrics.com\/us\/wp-json\/wp\/v2\/resource_tag?post=328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}