HowTo: Group By Geography column
Let’s look at how to aggregate data by a geography column. BigQuery refuses to GROUP BY
using geography as key, so let’s consider alternatives, and learn about under-appreciated ANY_VALUE
function.
Lacking the ability to GROUP BY a geography column, what can one use right now? A very common solution is to do GROUP BY
by some other unique id that you might have in the table. E.g. you might often have some geo_id
. Another common solution is to convert geography to text, typically using ST_AsText
, and create new unique column suitable as GROUP BY
key.
What if you need the geography value in the aggregation too? Here are a few solutions I’ve seen, that you should not use:
SELECT geo_id, ST_Union(geog) as geog -- don't use this!
FROM table GROUP BY geo_id
The idea here is that ST_Union of identical geography values would produce the same value, no matter how many of those you have. But it is very computationally expensive to do this union, so don’t.
Another popular solution, especially when using ST_AsText
as key, is to use ST_GeogFromText
to convert text back to geography:
SELECT ST_GeogFromText(geog), s -- don't use this either!
FROM (SELECT ST_AsText(geog), SUM(something) s
FROM table GROUP BY 1)
This is much better than ST_Union
above, but nevertheless parsing a geography back from text is not optimal, and might also produce slightly different geography (it still ST_Equal's
to the original, but might differ as described below). My recommendation is to use ANY_VALUE
function. It is an aggregation function that returns some arbitrary value from the values in a group, choosing non-NULL value if one exists. In our case, we assume all the geog
values in the group are identical, so any value works:
SELECT geo_id, ANY_VALUE(geog) g -- no formatting or parsing needed
FROM table GROUP BY geo_id
Or if you don’t have unique id and use GROUP BY
with ST_AsText
:
SELECT ANY_VALUE(geog), SUM(something) s
FROM table GROUP BY ST_AsText(geog)
For the curious, why does BigQuery have this limitation? To allow GROUP BY, we need to define what makes two geographies identical when they form the same group. And we are still debating about the proper rule.
Normally, one compares geographies using ST_Equals
function. However, it returns true
for many geographies that are a little bit different internally. E.g. it does not care about the order of points and vertices, so MULTIPOINT(1 2, 3 4)
compares equal to MULTIPOINT(3 4, 1 2)
, and POLYGON(1 1, 2 1, 2 2, 1 1)
compares equal to POLYGON(2 2, 1 1, 2 1, 2 2)
. It also allows vertices and edges to diverge by ~1 micron (this is called snapping distance and is used to make spatial algorithms robust), and due to this snapping it is impossible to make it a transitive relation. We can see this in the example below, where center point b
is closer than 1 micron to both a
and c
, but corners a
and c
are further away from each other):
with points as (
select st_geogfromtext('point(1 1)') as a,
st_geogfromtext('point(1 1.000000000006)') as b,
st_geogfromtext('point(1 1.000000000012)') as c
)
select st_equals(a, b), st_equals(b, c), st_equals(a, c)
from points+--------+--------+--------+
| true | true | false |
+--------+--------+--------+
Lacking transitivity, ST_Equals
cannot be used for GROUP BY
, and we need a different rule. Such a rule cannot ignore even these minor differences in position. However we are still free to either ignore vertex order or make vertex order affect grouping. The internal debate about proper rule is still ongoing, and we want to make sure the rule would not be confusing. The industry typically uses of grouping on binary representation, thus taking into account both full precision and order of elements and vertices. It seems likely BigQuery might one day adopt this solution too, but this is not yet settled.