Two BigQuery Geospatial query tricks

Today we’ll explore two useful tricks that came up during a discussion on how to implement some complex query.

First one is about OUTER spatial JOIN. As you might know, BigQuery does not yet implement optimized spatial OUTER JOIN. What if we do need one, something like

SELECT
a.geo_id,
COUNT(b.point_geom) AS count
FROM
`polygon_dataset` a
LEFT OUTER JOIN -- ERROR
`point_dataset` b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id

One can do a generic rewrite, using TWO joins. We start with a spatial INNER join first. This drops polygons that don’t intersect any points. Then we use LEFT OUTER (equality) JOIN on column, it will keep all left row and bring the computed column, like:

SELECT 
aa.geo_id,
COALESCE(bb.count, 0) AS count
FROM
`polygon_dataset` aa
LEFT OUTER JOIN
(
SELECT
a.geo_id,
COUNT(b.point_geom) AS count
FROM
`polygon_dataset` a
JOIN
`point_dataset` b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id
) bb
ON aa.geo_id = bb.geo_id

This is a reasonable approach, and often results in good performance. But it is also possible to do a spatial-specific trick here: let’s make sure all polygons have some match and are never dropped by our (inner) JOIN. The insight is that, assuming all fields are non-NULL and not the empty geography, any polygon intersects the whole Earth! Let’s just add it to the second dataset, and tweak aggregations to ignore this extra row:

SELECT
a.geo_id,
COUNTIF(real_data) AS count
FROM
`polygon_dataset` a
JOIN
(SELECT point_geom, TRUE AS real_data FROM `point_dataset`
UNION ALL
-- add whole Earth, and flag it with real_data = FALSE
SELECT ST_GeogFromText('FULLGLOBE'), FALSE) b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id

This version has another benefit, important when you don’t have a unique ID in the OUTER table, and have to use to uniquely identify rows. The previous solution references twice, and if you add to each of them — the two functions might produce different result unless you sort all input data, which is expensive and requires a unique column. Even if you use CTE (aka Common Table Expression) via statement, BigQuery is somewhat different from many other databases here, and would compute independently for our two instances of . You thus should not join two instances of the table on , the following query could produce wrong results:

--- WARNING: INCORRECT QUERY, do not copy ---WITH polygon_dataset2 AS (
(SELECT ROW_NUMBER() row_id, * FROM `polygon_dataset`)
)
SELECT
aa.row_id, bb.count
FROM
`polygon_dataset2` aa
LEFT OUTER JOIN
(
SELECT
a.row_id,
COUNT(b.point_geom) AS count
FROM
`polygon_dataset2` a
JOIN
`point_dataset` b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.row_id
) bb
-- BAD JOIN, the two row_id are not guaranteed to match
ON aa.row_id = bb.row_id
--- WARNING: INCORRECT QUERY, do not copy ---

Our second solution avoids this problem, and can safely use as a unique identifier, since it is only computed once.

Second idea applies when we need additional fields from the table that we use in GROUP BY. You might "aggregate" each of them using function, but sometimes you are building a generic stored procedure that takes the left-side table or query as an input, and you might not even have a list of all of its fields in advance. The trick is you can capture the whole row as a struct using the table name or alias. You can later unpack this row into individual fields using operator:

SELECT
-- unpacks the row into individual fields
ANY_VALUE(a.properties).*,
COUNTIF(real_data) AS count
FROM
(SELECT poly_geom, geo_id,
p AS properties -- packs whole row
FROM `polygon_dataset` p) a
JOIN
(SELECT point_geom, TRUE AS real_data FROM `point_dataset`
UNION ALL
SELECT ST_GeogFromText('FULLGLOBE'), FALSE) b
ON
st_intersects(a.poly_geom, b.point_geom))
GROUP BY
a.geo_id

Hi, I'm TL of BigQuery GIS project. Posting small recipes and various notes for BQ GIS users.