Two BigQuery Geospatial query tricks
Today we’ll explore two useful tricks that came up during a discussion on how to implement some complex query.
First one is about OUTER spatial JOIN. As you might know, BigQuery does not yet implement optimized spatial OUTER JOIN. What if we do need one, something like
SELECT
a.geo_id,
COUNT(b.point_geom) AS count
FROM
`polygon_dataset` a
LEFT OUTER JOIN -- ERROR
`point_dataset` b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id
Update: BigQuery now allows such queries, but would use regular join query plan rather than optimized spatial join, so it might work but would be very slow for larger inputs.
One can do a generic rewrite, using TWO joins. We start with a spatial INNER join first. This drops polygons that don’t intersect any points. Then we use LEFT OUTER (equality) JOIN on geo_id
column, it will keep all left rows and bring the computed column, like:
SELECT
aa.geo_id,
COALESCE(bb.count, 0) AS count
FROM
`polygon_dataset` aa
LEFT OUTER JOIN
(
SELECT
a.geo_id,
COUNT(b.point_geom) AS count
FROM
`polygon_dataset` a
JOIN
`point_dataset` b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id
) bb
ON aa.geo_id = bb.geo_id
This is a reasonable approach, and often results in good performance. But it is also possible to do a spatial-specific trick in use cases like this, where we don’t care about all the joined pairs, but only about one side of the join and some measure aggregated over the other side of the join. E.g. in this case we care about geo_id
from LEFT side, and count of points on the RIGHT side.
The idea is to use inner JOIN, but make sure all polygons have some match and are never dropped by our (inner) JOIN. The insight is that, assuming all poly_geom
fields are non-NULL and not the empty geography, any polygon intersects the whole Earth! Let’s just add it to the second dataset, and tweak aggregations to ignore this extra row:
SELECT
a.geo_id,
COUNTIF(real_data) AS count
FROM
`polygon_dataset` a
JOIN
(SELECT point_geom, TRUE AS real_data FROM `point_dataset`
UNION ALL
-- add whole Earth, and flag it with real_data = FALSE
SELECT ST_GeogFromText('FULLGLOBE'), FALSE) b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id
This version has another benefit, important when you don’t have a unique ID in the OUTER table, or you are working on generic system and don’t know if a column is unique. You might try to use ROW_NUMBER()
to uniquely identify rows. But the previous solution references polygon_dataset
twice, and if you add ROW_NUMBER()
to each of them — the two functions might produce different result unless you sort all input data, which is expensive and again requires a unique column. Even if you use CTE (aka Common Table Expression) via WITH
statement, BigQuery is somewhat different from many other databases here, and would compute ROW_NUMBER()
independently for our two instances of polygon_dataset2
. You thus should not join two instances of the table on ROW_NUMBER()
, the following query could produce wrong results:
--- WARNING: INCORRECT QUERY, do not copy ---WITH polygon_dataset2 AS (
(SELECT ROW_NUMBER() row_id, * FROM `polygon_dataset`)
)
SELECT
aa.row_id, bb.count
FROM
`polygon_dataset2` aa
LEFT OUTER JOIN
(
SELECT
a.row_id,
COUNT(b.point_geom) AS count
FROM
`polygon_dataset2` a
JOIN
`point_dataset` b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.row_id
) bb
-- BAD JOIN, the two row_id are not guaranteed to match
ON aa.row_id = bb.row_id--- WARNING: INCORRECT QUERY, do not copy ---
Our second solution avoids this problem, and can safely use ROW_NUMBER()
as a unique identifier, since it is only computed once.
Second idea applies when we need additional fields from the table that we use in GROUP BY. You might "aggregate" each of them using ANY_VALUE
function, but sometimes you are building a generic stored procedure that takes the left-side table or query as an input, and you might not even have a list of all of its fields in advance. The trick is you can capture the whole row as a struct using the table name or alias. You can later unpack this row into individual fields using STRUCT.*
operator:
SELECT
-- unpacks the row into individual fields
ANY_VALUE(a.properties).*,
COUNTIF(real_data) AS count
FROM
(SELECT poly_geom, geo_id,
-- packs whole row
p AS properties
FROM `polygon_dataset` p) a
JOIN
(SELECT point_geom, TRUE AS real_data FROM `point_dataset`
UNION ALL
SELECT ST_GeogFromText('FULLGLOBE'), FALSE) b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id