HowTo: Group By Geography column
Let’s look at how to aggregate data by a geography column. BigQuery refuses to
GROUP BY using geography as key, so let’s consider alternatives, and learn about under-appreciated
Lacking the ability to GROUP BY a geography column, what can one use right now? A very common solution is to do
GROUP BY by some other unique id that you might have in the table. E.g. you might often have some
geo_id. Another common solution is to convert geography to text, typically using
ST_AsText, and create new unique column suitable as
GROUP BY key.
What if you need the geography value in the aggregation too? Here are a few solutions I’ve seen, that you should not use:
SELECT geo_id, ST_Union(geog) as geog -- don't use this!
FROM table GROUP BY geo_id
The idea here is that ST_Union of identical geography values would produce the same value, no matter how many of those you have. But it is very computationally expensive to do this union, so don’t.
Another popular solution, especially when using
ST_AsText as key, is to use
ST_GeogFromText to convert text back to geography:
SELECT ST_GeogFromText(geog), s -- don't use this either!
FROM (SELECT ST_AsText(geog), SUM(something) s
FROM table GROUP BY 1)
This is much better than
ST_Union above, but nevertheless parsing a geography back from text is not optimal, and might also produce slightly different geography (it still
ST_Equal's to the original, but might differ as described below). My recommendation is to use
ANY_VALUE function. It is an aggregation function that returns some arbitrary value from the values in a group, choosing non-NULL value if one exists. In our case, we assume all the
geog values in the group are identical, so any value works:
SELECT geo_id, ANY_VALUE(geog) g -- no formatting or parsing needed
FROM table GROUP BY geo_id
Or if you don’t have unique id and use
GROUP BY with
SELECT ANY_VALUE(geog), SUM(something) s
FROM table GROUP BY ST_AsText(geog)
For the curious, why does BigQuery have this limitation? To allow GROUP BY, we need to define what makes two geographies identical when they form the same group. And we are still debating about the proper rule.
Normally, one compares geographies using
ST_Equals function. However, it returns
true for many geographies that are a little bit different internally. E.g. it does not care about the order of points and vertices, so
MULTIPOINT(1 2, 3 4) compares equal to
MULTIPOINT(3 4, 1 2), and
POLYGON(1 1, 2 1, 2 2, 1 1) compares equal to
POLYGON(2 2, 1 1, 2 1, 2 2). It also allows vertices and edges to diverge by ~1 micron (this is called snapping distance and is used to make spatial algorithms robust), and due to this snapping it is impossible to make it a transitive relation. We can see this in the example below, where center point
b is closer than 1 micron to both
c, but corners
c are further away from each other):
with points as (
select st_geogfromtext('point(1 1)') as a,
st_geogfromtext('point(1 1.000000000006)') as b,
st_geogfromtext('point(1 1.000000000012)') as c
select st_equals(a, b), st_equals(b, c), st_equals(a, c)
| true | true | false |
ST_Equals cannot be used for
GROUP BY, and we need a different rule. Such a rule cannot ignore even these minor differences in position. However we are still free to either ignore vertex order or make vertex order affect grouping. The internal debate about proper rule is still ongoing, and we want to make sure the rule would not be confusing. The industry typically uses of grouping on binary representation, thus taking into account both full precision and order of elements and vertices. It seems likely BigQuery might one day adopt this solution too, but this is not yet settled.