Two BigQuery Geospatial query tricks

Today we’ll explore two useful tricks that came up during a discussion on how to implement some complex query.

First one is about OUTER spatial JOIN. As you might know, BigQuery does not yet implement optimized spatial OUTER JOIN. What if we do need one, something like

Update: BigQuery now allows such queries, but would use regular join query plan rather than optimized spatial join, so it might work but would be very slow for larger inputs.

One can do a generic rewrite, using TWO joins. We start with a spatial INNER join first. This drops polygons that don’t intersect any points. Then we use LEFT OUTER (equality) JOIN on geo_id column, it will keep all left rows and bring the computed column, like:

This is a reasonable approach, and often results in good performance. But it is also possible to do a spatial-specific trick in use cases like this, where we don’t care about all the joined pairs, but only about one side of the join and some measure aggregated over the other side of the join. E.g. in this case we care about geo_id from LEFT side, and count of points on the RIGHT side.

The idea is to use inner JOIN, but make sure all polygons have some match and are never dropped by our (inner) JOIN. The insight is that, assuming all poly_geom fields are non-NULL and not the empty geography, any polygon intersects the whole Earth! Let’s just add it to the second dataset, and tweak aggregations to ignore this extra row:

This version has another benefit, important when you don’t have a unique ID in the OUTER table, or you are working on generic system and don’t know if a column is unique. You might try to use ROW_NUMBER() to uniquely identify rows. But the previous solution references polygon_dataset twice, and if you add ROW_NUMBER() to each of them — the two functions might produce different result unless you sort all input data, which is expensive and again requires a unique column. Even if you use CTE (aka Common Table Expression) via WITH statement, BigQuery is somewhat different from many other databases here, and would compute ROW_NUMBER() independently for our two instances of polygon_dataset2. You thus should not join two instances of the table on ROW_NUMBER(), the following query could produce wrong results:

Our second solution avoids this problem, and can safely use ROW_NUMBER() as a unique identifier, since it is only computed once.

Second idea applies when we need additional fields from the table that we use in GROUP BY. You might "aggregate" each of them using ANY_VALUE function, but sometimes you are building a generic stored procedure that takes the left-side table or query as an input, and you might not even have a list of all of its fields in advance. The trick is you can capture the whole row as a struct using the table name or alias. You can later unpack this row into individual fields using STRUCT.* operator:

--

--

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Michael Entin

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.