Creating spatial hierarchy from imperfect data
Last time we observed how planar-to-spherical conversion can lead to a slightly inconsistent conversion of related shapes, and how this can lead to unexpected result of spatial predicates. E.g. starting with flat map geometries a
and b
satisfying ST_Covers(a, b)
, when we translate them to spherical geographies a'
and b'
, the expression ST_Covers(a', b')
might be false.
Another common case when ST_Covers
or ST_Contains
gives unexpected result is inconsistent data preparation. E.g. the datasets for states and counties could have been independently prepared by different organizations, or with different precision, and then ST_Covers(state, county)
might be false when we expected true, regardless of planar or spherical version.
Let’s discuss some ways to restore spacial hierarchy from fuzzy data.
First, let’s prepare demo data: like in the previous post we’ll use perfect squares (on flat map), but they are distorted when translated to the globe.
We cannot use ST_MakePolygon, since it operates on sphere, let’s create GeoJson version using BigQuery GeoJson translator:
create function demo.geojson_polygon(
points ARRAY<STRUCT<x int64, y int64>>) AS (
ST_GeogFromGeoJson(CONCAT(
'{"type": "Polygon", "coordinates": [[',
ARRAY_TO_STRING(
ARRAY(SELECT FORMAT('[%d, %d]', x, y) FROM UNNEST(points)),
','),
']]}')));
We can now create outer shapes:
create or replace table demo.a AS
SELECT
x + 100 * y AS id,
demo.geojson_polygon([STRUCT(x,y), STRUCT(x+30,y),
STRUCT(x+30,y+30), STRUCT(x,y+30), STRUCT(x,y)]) AS geog
FROM unnest(generate_array(0, 50, 30)) x,
unnest(generate_array(0, 50, 30)) y
This creates four polygons, with an approximate width of 30 degrees in both lon and lat, sitting between 0 and 60 degrees.
demo.b
was created by replacing 30 with 10 in the script above to get 36 polygons of width ~10 degrees in the same space.
Let’s check if determining the hierarchy using ST_Covers
works:
select count(*) from demo.a a, demo.b b
where st_covers(a.geog, b.geog) -- 18
Ouch, this missed half of our rows in table b
. Bonus points if you can guess which 18 we missed :).
Can we use ST_Intersects
instead?
select count(*) from demo.a a, demo.b b
where st_intersects(a.geog, b.geog) -- 64
This returns too much. But some of this is due to nested polygon just touching outer polygon. Let’s exclude these:
select count(*) from demo.a a, demo.b b
where st_intersects(a.geog, b.geog) and
not st_touches(a.geog, b.geog) -- 48
Closer, but we want 36 rows, when each shape in b
joins only once.
One idea is to sort potential parents by intersection area and take the largest. ARRAY_AGG
with LIMIT
clause does the trick:
select
b.id,
array_agg(a.id
order by st_area(st_intersection(a.geog, b.geog)) desc
limit 1)[offset(0)] as parent_id,
any_value(b.geog) geog
from demo.a a, demo.b b
where st_intersects(a.geog, b.geog)
group by b.id
After the JOIN, we aggregate by id of b
shape, take all the shapes in a
intersecting our b
, ordering them by intersection area, and take largest one.
If the query above looks too complex, I agree. For simple cases like this, we can take the potential parent if the intersection area is more than half of the child:
select a.id as parent_id, b.id, b.geog from demo.a a, demo.b b
where st_intersects(a.geog, b.geog) and
st_area(st_intersection(a.geog, b.geog)) > 0.5 * st_area(b.geog)
This returns 36 rows. Let’s check them in BQ GeoViz. I’ll draw all b.geog
and use parent_id
to color them. Perfect now:
One more way to do it is to replace shape b
with its centroid, i.e. use condition ST_Intersects(a.geog, ST_Centroid(b.geog))
. This works reliably if at least one of the shapes a
or b
is convex. It might give wrong results if both a
and b
are concave, but that’s rather rare.