BigQuery GIS performance tips

Random BQ GIS performance tips

1. Store Geography objects in the table

We sometimes see users who store spatial objects in the table, but keep them as lat/lon pairs or as WKT / GeoJSON text and construct Geography objects “on the fly” in a query. This prevents many BQGIS optimizations. Instead, add a Geography-typed column to the table (or replace existing columns completely). It has several benefits:

E.g.

SELECT * FROM T WHERE
ST_DWITHIN(geo, ST_GeogPoint(122.3, 47.6), 100)

is typically much faster than

SELECT * FROM T WHERE 
ST_DWITHIN(ST_GeogPoint(lon, lat), ST_GeogPoint(122.3, 47.6), 100)

To replace lat/lon pair with Geography-typed column, run

SELECT * EXCEPT(lat, lon), ST_GeogPoint(lon, lat) AS geog FROM T

To replace WKT text with Geography-typed column,

SELECT * EXCEPT(wkt) ST_GeogFromText(wkt) AS geog FROM T

2. Spatially cluster tables

See more about performance and cost benefits of storage clustering in BigQuery: https://cloud.google.com/bigquery/docs/clustered-tables

If you have a large amount of geospatial data, and frequently query it with spatial filters like SELECT ... WHERE ST_*(geo_column, geo_constant), it might be beneficial for query performance and cost to spatially cluster stored data.

BigQuery GIS does not support clustering by GEOGRAPHY column yet, but if you have any proxy column that provides reasonable spatial clustering — cluster by this column. You can use zipcode for US locations, a combination of country code, state and zipcode if the data is global, geohash string, etc.

[Update] BigQuery native storage clustering by GEOGRAPHY columns is supported now, you can create spatially clustered tables using
CREATE TABLE ... [PARTITION BY ...] CLUSTER BY <geog_column>

3. Maybe combine spatial predicates

This one is not a hard rule. Depending on specific data distribution it might help performance or degrade it. Experiment and test both approaches.

Say you want to find all the records that are within 100 meters of at least one of 10 points. You might write this query as

SELECT ... FROM T 
WHERE ST_DWITHIN(location, ST_GeogFromText('POINT(123 45)'), 100)
OR ST_DWITHIN(location, ST_GeogFromText('POINT(145 23)'), 100)
OR ...

It might be more efficient to combine the points into a single object, and check whether the location is within 100 meters of the union:

SELECT ... FROM T 
WHERE ST_DWITHIN(location,
ST_GeogFromText('MULTIPOINT((123 45), (145 23), ...)'),
100)

4. Avoid mixing equality and spatial join conditions

This one is also not a hard rule, but in all occasions I’ve observed it improved query performance.

The issue mostly affects advanced users, who know the spatial join is complex and expensive, and try to improve the query by adding equality condition to the join. So they might write something like

SELECT ... FROM T1 a JOIN T2 b 
ON CAST(a.lat*100 AS INT64) = CAST(b.lat*100 AS INT64)
AND CAST(a.lon*100 AS INT64) = CAST(b.lon*100 AS INT64)
WHERE ST_DWITHIN(a.geo, b.geo, 10)

This query adds proximity of latitude and longitude, which is much cheaper to test, before checking spatial predicate ST_DWITHIN (or ST_CONTAINS, etc). In most cases the addition has harmful effect on query correctness and performance:

Use simpler version in most cases:

SELECT ... FROM T1 a JOIN T2 b
WHERE ST_DWITHIN(a.geo, b.geo, 10)

Next in performance series:

Hi, I'm TL of BigQuery GIS project. Posting small recipes and various notes for BQ GIS users.

Hi, I'm TL of BigQuery GIS project. Posting small recipes and various notes for BQ GIS users.