BigQuery GIS Performance Tips
Random BQ GIS performance tips
1. Store Geography objects in the table
We sometimes see users who store spatial objects in the table, but keep them as lat/lon pairs or as WKT / GeoJSON text and construct Geography objects “on the fly” in a query. This prevents many BQGIS optimizations. Instead, add a Geography-typed column to the table (or replace existing columns completely). It has several benefits:
- it avoids the cost of constructing Geography objects at runtime,
- it allows BigQuery to use spatial index when querying this table.
E.g.
SELECT * FROM T WHERE
ST_DWITHIN(geo, ST_GeogPoint(122.3, 47.6), 100)
is typically much faster than
SELECT * FROM T WHERE
ST_DWITHIN(ST_GeogPoint(lon, lat), ST_GeogPoint(122.3, 47.6), 100)
To replace lat/lon pair with Geography-typed column, run
SELECT * EXCEPT(lat, lon), ST_GeogPoint(lon, lat) AS geog FROM T
To replace WKT text with Geography-typed column,
SELECT * EXCEPT(wkt) ST_GeogFromText(wkt) AS geog FROM T
2. Spatially cluster tables
See more about performance and cost benefits of storage clustering in BigQuery: https://cloud.google.com/bigquery/docs/clustered-tables
If you have a large amount of geospatial data, and frequently query it with spatial filters like SELECT ... WHERE ST_*(geo_column, geo_constant)
, it might be beneficial for query performance and to reduce query cost to spatially cluster stored data.
[Update] BigQuery now natively supports storage clustering by GEOGRAPHY columns. You can create spatially clustered tables using CREATE TABLE ... [PARTITION BY ...] CLUSTER BY <geog_column>
If you have any proxy column that provides reasonable spatial clustering — you can also cluster by this column. You can use zipcode for US locations, a combination of country code, state and zipcode if the data is global, geohash string, etc.
3. Maybe combine spatial predicates
This one is not a hard rule. Depending on specific data distribution it might help performance or degrade it. Experiment and test both approaches.
Say you want to find all the records that are within 100 meters of at least one of 10 points. You might write this query as
SELECT ... FROM T
WHERE ST_DWITHIN(location, ST_GeogFromText('POINT(123 45)'), 100)
OR ST_DWITHIN(location, ST_GeogFromText('POINT(145 23)'), 100)
OR ...
It might be more efficient to combine the points into a single object, and check whether the location is within 100 meters of the union:
SELECT ... FROM T
WHERE ST_DWITHIN(location,
ST_GeogFromText('MULTIPOINT((123 45), (145 23), ...)'),
100)
4. Avoid mixing equality and spatial join conditions
This one is also not a hard rule, but on all occasions I’ve observed it improved query performance.
The issue mostly affects advanced users, who know the spatial join is complex and expensive, and try to improve the query by adding equality condition to the join. So they might write something like
SELECT ... FROM T1 a JOIN T2 b
ON CAST(a.lat*100 AS INT64) = CAST(b.lat*100 AS INT64)
AND CAST(a.lon*100 AS INT64) = CAST(b.lon*100 AS INT64)
AND ST_DWITHIN(a.geo, b.geo, 10)
Another common version of this idea is to use some existing spatial hierarchy:
SELECT ... FROM T1 a JOIN T2 b
ON a.juristiction = b.jurisdiction
AND ST_DWITHIN(a.geo, b.geo, 10)
This query adds proximity of latitude and longitude, which is much cheaper to test, before checking spatial predicate ST_DWITHIN
(or ST_CONTAINS
, etc). In most cases the addition has harmful effect on query correctness and performance:
- the lat/lon checks would incorrectly filter out some close neighbors,
- the lat/lon checks prevents BigQuery from using optimized geospatial join. When both equality and geospatial conditions are present, BigQuery executes query using equality join, with post-join evaluation of spatial condition. Geospatial join is usually faster.
Use simpler version in most cases:
SELECT ... FROM T1 a JOIN T2 b
WHERE ST_DWITHIN(a.geo, b.geo, 10)
If you really need the second condition not for performance, but for correctness, hide it from the query optimizer behind some opaque function of both sides of the join, e.g. for integers SAFE_SUBTRACT(a.geo_id, b.geo_id) = 0
, for strings maybe STARTS_WITH(a.id, b.id) AND STARTS_WITH(b.id, a.id)
.
5. Avoid spatial OUTER JOIN
At the time of this writing, BigQuery does not optimize LEFT/RIGHT/FULL spatial joins, so they are much slower than INNER join on spatial condition. The following post has a few ideas how to rewrite outer join using inner join:
Next in performance series:
- Which spatial predicate should I use?
- Did my query get optimized join?