Determine Query Cost in BigQuery

Jatin
4 min read · Aug 26, 2021

In general, queries that do less work perform better. When evaluating query performance in BigQuery, the amount of work a query does depends on many factors:

a. Input data and data sources (I/O)

b. Communication between nodes (shuffling)

c. Computation

d. Outputs (materialization)

e. Query anti-patterns

In this article, let's talk about computation alone. Compute is one of the building blocks of BigQuery. Compute is Dremel, a large multi-tenant cluster that executes Standard SQL queries. Dremel is a system that scales to thousands of CPUs and petabytes of data; it is a scalable, interactive ad-hoc analysis system that complements MapReduce-based computing.

Without going into much detail, just think of it as a machine with some predefined configuration that is capable of executing tasks or queries. The machine will complete a task quickly if the task is small; on the other hand, if the task is huge, it is going to take some time to complete.

Now, as BigQuery's on-demand cost is determined by the number of bytes processed, it is very important to write efficient queries that scan less data. The lower the number of bytes processed, the lower the utilization of compute resources (called slots in BigQuery), which in turn lowers your bill.

Let's discuss how to find queries that are costly, i.e. consuming more resources, in BigQuery. We can utilize the information available in INFORMATION_SCHEMA. INFORMATION_SCHEMA exposes a set of views with different metadata. If you want to see the query history of a particular project, use INFORMATION_SCHEMA.JOBS_BY_PROJECT; on the other hand, if you want to see all the query history in your organization, use INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION.

There are other views which you can use to get metadata at different levels, i.e. table/view/dataset/column/reservation, etc.

For now, let's discuss INFORMATION_SCHEMA.JOBS_BY_PROJECT. You can get the detailed schema from the official documentation (second link in the references below). This view returns all jobs submitted in the current project and can be queried as below (change your region accordingly):
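A minimal sketch of such a query, assuming your datasets live in the US multi-region (`region-us`) and a seven-day lookback window:

```sql
-- One row per job submitted in the current project over the last 7 days.
-- Replace `region-us` with the region your datasets live in.
SELECT
  *
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
                    AND CURRENT_TIMESTAMP();
```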

Let's use the above view to find the queries that scan more data, i.e. find the number of bytes billed by each query.

Query to derive different performance metrics from INFORMATION_SCHEMA
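A sketch of such a query, assuming `region-us`, a seven-day window and the on-demand rate of 5 USD per TB:

```sql
-- Per-query performance and cost metrics derived from INFORMATION_SCHEMA.
SELECT
  job_id                                                         AS JobId,
  parent_job_id                                                  AS ParentJobId,
  TIMESTAMP_DIFF(end_time, creation_time, SECOND)                AS creation_duration_s,
  TIMESTAMP_DIFF(end_time, start_time, SECOND)                   AS execution_duration_s,
  project_id,
  user_email,
  job_type,
  statement_type,
  query,
  total_bytes_billed,
  total_bytes_billed / POWER(2, 30)                              AS total_bytes_billed_GB,
  total_bytes_billed / POWER(2, 40) * 5                          AS query_cost_usd,
  total_slot_ms,
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots,
  cache_hit
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
                    AND CURRENT_TIMESTAMP()
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY
  total_bytes_billed DESC;
```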

The above script has a number of useful columns, which include:

JobId — the unique ID of the BigQuery job.

ParentJobId — the job ID of the parent job (if any). This is mostly useful when you want to search for a particular stored procedure within the INFORMATION_SCHEMA views, since each statement of a procedure runs as a child job of the script job.

creation_duration_s — total time in seconds the query ran for from the time of its creation (creation_time to end_time).

execution_duration_s — total time in seconds the query spent executing (start_time to end_time).

project_id — the ID of the project the query ran in.

user_email, job_type, statement_type, query and total_bytes_billed are basic columns from the view and are self-explanatory.

total_bytes_billed_GB — number of bytes billed, expressed in GB. We divide total_bytes_billed by POWER(2,30), which gives the output in GB. If you want TB, use POWER(2,40); for MB, use POWER(2,20).

query_cost_usd — calculated using the on-demand rate of 1 TB = 5 USD: convert total_bytes_billed to TB and multiply by 5 to get the amount in USD. For example, 18 GB billed is roughly 0.0176 TB, i.e. about 0.088 USD.

total_slot_ms — total slot-milliseconds consumed by the query over its lifetime.

avg_slots — the average number of slots utilized by the query over its whole execution duration. This is derived by dividing total_slot_ms by the total duration of the query in milliseconds, i.e. TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), so both numerator and denominator are in milliseconds. This is a very important field when you want to analyze slot consumption per query in a project.

cache_hit — whether the query results were served from cache. Use this column to filter for only those queries that are not cached.

Using the above script as a base, you can check the amount of data scanned and the query cost over a period of time. The script below gives the total number of bytes scanned and the total query cost per day, for all queries or for one specific query:

Performance of script over time
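A sketch of such a rollup, again assuming `region-us` and 5 USD per TB; the LIKE filter on a stored-procedure name is a hypothetical placeholder you would replace with your own:

```sql
-- Daily totals of executions, data scanned and query cost.
SELECT
  DATE(creation_time)                                   AS query_date,
  COUNT(*)                                              AS total_executions,
  ROUND(SUM(total_bytes_billed) / POWER(2, 30), 2)      AS total_GB_scanned,
  ROUND(SUM(total_bytes_billed) / POWER(2, 40) * 5, 3)  AS total_query_cost_usd
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
                    AND CURRENT_TIMESTAMP()
  AND job_type = 'QUERY'
  AND cache_hit = FALSE
  -- AND query LIKE '%your stored-proc name%'  -- hypothetical: track one script
GROUP BY
  query_date
ORDER BY
  query_date;
```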

The output will look like this:
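(illustrative values, taken from the example discussed below)

```
query_date    total_executions   total_GB_scanned   total_query_cost_usd
2021-08-25    48                 18.0               0.088
```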

Right now we are filtering for a particular script, and the above output tells us that the script ran 48 times on 08/25/2021, scanned 18 GB and cost 0.088 USD in total. You can find a particular query using the query column of the INFORMATION_SCHEMA view. Let's assume we want to look for a particular stored procedure; then we can use something like WHERE query LIKE ‘%your stored-proc name%’. Otherwise the script gives you the total cost of all queries, and total_executions tells you the total number of queries run on a particular day.

We can use the above information to create Data Studio dashboards, which can be very helpful in analyzing costly queries, queries scanning more data than usual, the execution duration of each query, etc.

Below are some links which can be useful to get more information on this topic:

  1. https://cloud.google.com/bigquery/docs/slots
  2. https://cloud.google.com/bigquery/docs/information-schema-jobs
  3. https://cloud.google.com/bigquery/docs/information-schema-jobs-timeline

There is also a GitHub repo where you can find queries related to the performance optimization of BigQuery scripts.

Thanks for Reading..!!
