BigQuery GIS Performance Tips

Random BQ GIS performance tips

1. Store Geography objects in the table

  • it avoids the cost of constructing Geography objects at runtime,
  • it allows BigQuery to use spatial index when querying this table.

E.g.

SELECT * FROM T WHERE
ST_DWITHIN(geo, ST_GeogPoint(122.3, 47.6), 100)

is typically much faster than

SELECT * FROM T WHERE 
ST_DWITHIN(ST_GeogPoint(lon, lat), ST_GeogPoint(122.3, 47.6), 100)

To replace lat/lon pair with Geography-typed column, run

SELECT * EXCEPT(lat, lon), ST_GeogPoint(lon, lat) AS geog FROM T

To replace WKT text with Geography-typed column,

SELECT * EXCEPT(wkt) ST_GeogFromText(wkt) AS geog FROM T

2. Spatially cluster tables

If you have a large amount of geospatial data, and frequently query it with spatial filters like SELECT ... WHERE ST_*(geo_column, geo_constant), it might be beneficial for query performance and cost to spatially cluster stored data.

BigQuery GIS does not support clustering by GEOGRAPHY column yet, but if you have any proxy column that provides reasonable spatial clustering — cluster by this column. You can use zipcode for US locations, a combination of country code, state and zipcode if the data is global, geohash string, etc.

[Update] BigQuery native storage clustering by GEOGRAPHY columns is supported now, you can create spatially clustered tables using
CREATE TABLE ... [PARTITION BY ...] CLUSTER BY <geog_column>

3. Maybe combine spatial predicates

Say you want to find all the records that are within 100 meters of at least one of 10 points. You might write this query as

SELECT ... FROM T 
WHERE ST_DWITHIN(location, ST_GeogFromText('POINT(123 45)'), 100)
OR ST_DWITHIN(location, ST_GeogFromText('POINT(145 23)'), 100)
OR ...

It might be more efficient to combine the points into a single object, and check whether the location is within 100 meters of the union:

SELECT ... FROM T 
WHERE ST_DWITHIN(location,
ST_GeogFromText('MULTIPOINT((123 45), (145 23), ...)'),
100)

4. Avoid mixing equality and spatial join conditions

The issue mostly affects advanced users, who know the spatial join is complex and expensive, and try to improve the query by adding equality condition to the join. So they might write something like

SELECT ... FROM T1 a JOIN T2 b 
ON CAST(a.lat*100 AS INT64) = CAST(b.lat*100 AS INT64)
AND CAST(a.lon*100 AS INT64) = CAST(b.lon*100 AS INT64)
AND ST_DWITHIN(a.geo, b.geo, 10)

Another common version of this idea is to use some existing spatial hierarchy:

SELECT ... FROM T1 a JOIN T2 b 
ON a.juristiction = b.jurisdiction
AND ST_DWITHIN(a.geo, b.geo, 10)

This query adds proximity of latitude and longitude, which is much cheaper to test, before checking spatial predicate ST_DWITHIN (or ST_CONTAINS, etc). In most cases the addition has harmful effect on query correctness and performance:

  • the lat/lon checks would incorrectly filter out some close neighbors,
  • the lat/lon checks prevents BigQuery from using optimized geospatial join. When both equality and geospatial conditions are present, BigQuery executes query using equality join, with post-join evaluation of spatial condition. Geospatial join is usually faster.

Use simpler version in most cases:

SELECT ... FROM T1 a JOIN T2 b
WHERE ST_DWITHIN(a.geo, b.geo, 10)

Next in performance series:

--

--

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Michael Entin

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.