HowTo: Group By Geography column

Let’s look at how to aggregate data by a geography column. BigQuery refuses to GROUP BY using geography, so let’s consider alternatives, and learn about under-appreciated ANY_VALUE function.

Lacking the ability to GROUP BY a geography column, what can one use right now? A very common solution is to do GROUP BY by some other unique id that you might have in the table. E.g. you might often have some geo_id. Another common solution is to convert geography to text, typically using ST_AsText, and create new unique column suitable as GROUP BY key.

What if you need the geography value in the aggregation too? Here are a few solutions I’ve seen, that you should not use:

SELECT geo_id, ST_Union(geog) as geog  -- don't use this!
FROM table GROUP BY geo_id

The idea here is that ST_Union of identical geography values would produce the same value, no matter how many of those you have. But it is very computationally expensive to do this union, so don’t.

Another popular solution, especially when using ST_AsText as key, is to use ST_GeogFromText to convert text back to geography:

SELECT ST_GeogFromText(geog), s  -- don't use this either!
FROM (SELECT ST_AsText(geog), SUM(something) s
FROM table GROUP BY 1)

This is much better than ST_Union above, but nevertheless parsing a geography back from text is not optimal, and might also produce slightly different geography (it still ST_Equal's to the original, but might differ as described below). My recommendation is to use function. It is an aggregation function that returns some arbitrary value from the values in a group. In our case, we assume all the geog values in the group are identical, so any value works:

SELECT geo_id, ANY_VALUE(geog) g  -- no formatting or parsing needed
FROM table GROUP BY geo_id

Or if you don’t have unique id and use GROUP BY with ST_AsText:

SELECT ANY_VALUE(geog), SUM(something) s 
FROM table GROUP BY ST_AsText(geog)

For the curious, why does BigQuery have this limitation? To allow GROUP BY, we need to define what makes two geographies identical when they form the same group. And we are still debating about the proper rule.

Normally, one compares geographies using ST_Equals function. However, it returns true for many geographies that are a little bit different internally. E.g. it does not care about the order of points and vertices, so MULTIPOINT(1 2, 3 4) compares equal to MULTIPOINT(3 4, 1 2), and POLYGON(1 1, 2 1, 2 2, 1 1) compares equal to POLYGON(2 2, 1 1, 2 1, 2 2). It also allows vertices and edges to diverge by ~1 micron, and due to this snapping it is necessarily not a transitive relation. We can see this in the example below, where center point b is closer than 1 micron to both a and c, but corners a and c are further away from each other):

with points as (
select st_geogfromtext('point(1 1)') as a,
st_geogfromtext('point(1 1.000000000006)') as b,
st_geogfromtext('point(1 1.000000000012)') as c
)
select st_equals(a, b), st_equals(b, c), st_equals(a, c)
from points
+--------+--------+--------+
| true | true | false |
+--------+--------+--------+

Lacking transitivity, ST_Equals cannot be used for GROUP BY, and we need a different rule. Such a rule cannot ignore even these minor differences in position. However we are still free to ignore vertex order or make vertex order affect grouping. The internal debate about proper rule is still ongoing, and we want to make sure the rule would not be confusing. The industry typically uses of grouping on binary representation, thus taking into account both full precision and order of elements and vertices. It seems likely BigQuery might one day adopt this solution too, but this is not yet settled.

Hi, I'm TL of BigQuery GIS project. Posting small recipes and various notes for BQ GIS users.