Divide the query to make it dirt cheap
Let’s take a look at a typical query that selects spatial objects within some region. E.g. consider the first query in Dekart.xyz’s Overture samples, which selects roads in Nevada (slightly adapted for the current schema):
-- Step 1: Get the geometry of Nevada
WITH nevada_geometry AS (
  SELECT
    geometry
  FROM
    `bigquery-public-data.overture_maps.division_area`
  WHERE
    country = 'US'
    AND region = 'US-NV'
    AND subtype = 'region'
)
-- Step 2: Select roads within Nevada
SELECT
  s.geometry,
  s.class,
  s.speed_limits.list[SAFE_OFFSET(0)].element.max_speed AS max_speed
FROM
  `bigquery-public-data.overture_maps.segment` AS s,
  nevada_geometry AS ng
WHERE
  s.subtype = 'road'
  AND s.class NOT IN ('track', 'driveway', 'path', 'footway', 'sidewalk',
    'pedestrian', 'cycleway', 'steps', 'crosswalk', 'bridleway', 'alley')
  AND ST_WITHIN(s.geometry, ng.geometry)
BigQuery estimates this query will process 78 GB; it actually billed 75.47 GB (the number that matters if you use on-demand pricing) and consumed ~26 minutes of slot time (the metric to watch if you use a reservation).
It is not expensive: you can run about a dozen such queries a month within the BigQuery free tier (1 free TiB of queries per month). But can we make it even cheaper?
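As an aside, if you want to check these numbers for your own jobs, both metrics are available in INFORMATION_SCHEMA. A minimal sketch, assuming your jobs run in the US multi-region (adjust the region qualifier otherwise):

-- Bytes billed and slot time for your recent jobs.
SELECT
  job_id,
  total_bytes_billed / POW(2, 30) AS gib_billed,
  total_slot_ms / 1000 AS slot_seconds
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  creation_time DESC;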
The query plan shows that even though this looks like a two-stage query (first select the Nevada polygon, then filter by it), we actually got a spatial join:
Why is that, and can we avoid it? Let’s dig in. We’ll find some things that make it difficult for BigQuery to optimize the query, some query planning features that are missing so far, and ways to work around them.
First, even though we know the first query selects a single value, BigQuery does not know it. For the query planner, we are selecting some rows, potentially thousands of them, so it has to use a join.
Let’s try to explicitly tell the planner there can be only a single row there, by changing the cross join into a subquery. We just need to remove nevada_geometry AS ng from the FROM clause, and use its value directly in the filter expression as ST_WITHIN(s.geometry, (SELECT geometry FROM nevada_geometry)):
-- Step 1: Get the geometry of Nevada
WITH nevada_geometry AS (
  SELECT
    geometry
  FROM
    `bigquery-public-data.overture_maps.division_area`
  WHERE
    country = 'US'
    AND region = 'US-NV'
    AND subtype = 'region'
)
-- Step 2: Select roads within Nevada
SELECT
  s.geometry,
  s.class,
  s.speed_limits.list[SAFE_OFFSET(0)].element.max_speed AS max_speed
FROM
  `bigquery-public-data.overture_maps.segment` AS s
WHERE
  s.subtype = 'road'
  AND s.class NOT IN ('track', 'driveway', 'path', 'footway', 'sidewalk',
    'pedestrian', 'cycleway', 'steps', 'crosswalk', 'bridleway', 'alley')
  AND ST_WITHIN(s.geometry, (SELECT geometry FROM nevada_geometry))
OK, we no longer have a join:
But we still processed 75 GB, and the slot-seconds actually increased (this plan’s execution apparently has some room for improvement; there is no reason for it to be more expensive).
What can we try next? Let’s actually split the query in two: the first computes the Nevada polygon and saves it to a variable, and the second does the actual filtering using that variable. The benefit is that in the second query the geography is a constant value.
-- Step 1: Get the geometry of Nevada
DECLARE nevada_geometry GEOGRAPHY;

SET nevada_geometry = (
  SELECT
    geometry
  FROM
    `bigquery-public-data.overture_maps.division_area`
  WHERE
    country = 'US'
    AND region = 'US-NV'
    AND subtype = 'region'
);

-- Step 2: Select roads within Nevada
SELECT
  s.geometry,
  s.class,
  s.speed_limits.list[SAFE_OFFSET(0)].element.max_speed AS max_speed
FROM
  `bigquery-public-data.overture_maps.segment` AS s
WHERE
  s.subtype = 'road'
  AND s.class NOT IN ('track', 'driveway', 'path', 'footway', 'sidewalk',
    'pedestrian', 'cycleway', 'steps', 'crosswalk', 'bridleway', 'alley')
  AND ST_WITHIN(s.geometry, nevada_geometry);
We have two queries now, but the first cost only 4.58 GB / 370 ms of slot time (yes, milliseconds!), and the second cost 1.16 GB / 28 slot-seconds. The total is just ~6 GB and ~30 slot-seconds: about 12x fewer scanned bytes, and about 50x less slot time (the original ~26 minutes is roughly 1,560 slot-seconds). The end-to-end time did not change; I typically get about 5 to 7 seconds.
How could the second query be so cheap? It is cheap thanks to the clustering of the segment table by the geometry column. Since nevada_geometry is now a constant value, BigQuery can utilize this clustering and read only the specific tablets that are spatially close to Nevada. (You might ask why BigQuery did not use the clustering in the second version of the query. That’s a valid question; in theory it could, but this optimization is not yet implemented.)
If we care about bytes scanned, and the first part of the query still seems too expensive, you can go further and extract the polygons for the division levels you care about into a new table, pay the cost of this query once, and later use this tiny table to get the division area polygons. E.g. all US states are just 11.5 MB:
CREATE TABLE tmp.us_states AS
SELECT
  country, region, geometry
FROM
  `bigquery-public-data.overture_maps.division_area`
WHERE
  country = 'US' AND subtype = 'region';
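With that table in place, the SET statement in the script above can read from it instead, scanning only this tiny table:

SET nevada_geometry = (
  SELECT
    geometry
  FROM
    tmp.us_states
  WHERE
    country = 'US'
    AND region = 'US-NV'
);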
Now some sad news: BigQuery GeoViz cannot run visualizations using the last version of the SQL. GeoViz wraps the query with extra SQL to do the WKT to GeoJSON conversion, and this breaks for multi-statement queries. You can either materialize the query results to a temporary table, or use another tool.
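If you go the materialization route, one way (a sketch; tmp.nevada_roads is a hypothetical table name) is to append a CREATE TABLE statement to the script above, and then give GeoViz a plain single-statement query over the result:

-- Appended after the DECLARE/SET steps of the script above.
CREATE OR REPLACE TABLE tmp.nevada_roads AS
SELECT
  s.geometry,
  s.class,
  s.speed_limits.list[SAFE_OFFSET(0)].element.max_speed AS max_speed
FROM
  `bigquery-public-data.overture_maps.segment` AS s
WHERE
  s.subtype = 'road'
  AND s.class NOT IN ('track', 'driveway', 'path', 'footway', 'sidewalk',
    'pedestrian', 'cycleway', 'steps', 'crosswalk', 'bridleway', 'alley')
  AND ST_WITHIN(s.geometry, nevada_geometry);

-- GeoViz can then visualize the result with:
-- SELECT geometry, class, max_speed FROM tmp.nevada_roads;

Here I rendered the results using Dekart.xyz, which does not have this issue: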