Creating spatial hierarchy from imperfect data

Michael Entin
3 min readAug 25, 2019

--

Last time we observed how planar-to-spherical conversion can lead to a slightly inconsistent conversion of related shapes, and how this can lead to unexpected result of spatial predicates. E.g. starting with flat map geometries a and b satisfying ST_Covers(a, b), when we translate them to spherical geographies a' and b', the expression ST_Covers(a', b') might be false.

Another common case when ST_Covers or ST_Contains gives unexpected result is inconsistent data preparation. E.g. the datasets for states and counties could have been independently prepared by different organizations, or with different precision, and then ST_Covers(state, county) might be false when we expected true, regardless of planar or spherical version.

Let’s discuss some ways to restore spacial hierarchy from fuzzy data.

First, let’s prepare demo data: like in the previous post we’ll use perfect squares (on flat map), but they are distorted when translated to the globe.

We cannot use ST_MakePolygon, since it operates on sphere, let’s create GeoJson version using BigQuery GeoJson translator:

create function demo.geojson_polygon(
points ARRAY<STRUCT<x int64, y int64>>) AS (
ST_GeogFromGeoJson(CONCAT(
'{"type": "Polygon", "coordinates": [[',
ARRAY_TO_STRING(
ARRAY(SELECT FORMAT('[%d, %d]', x, y) FROM UNNEST(points)),
','),
']]}')));

We can now create outer shapes:

create or replace table demo.a AS
SELECT
x + 100 * y AS id,
demo.geojson_polygon([STRUCT(x,y), STRUCT(x+30,y),
STRUCT(x+30,y+30), STRUCT(x,y+30), STRUCT(x,y)]) AS geog
FROM unnest(generate_array(0, 50, 30)) x,
unnest(generate_array(0, 50, 30)) y

This creates four polygons, with an approximate width of 30 degrees in both lon and lat, sitting between 0 and 60 degrees.

demo.b was created by replacing 30 with 10 in the script above to get 36 polygons of width ~10 degrees in the same space.

Let’s check if determining the hierarchy using ST_Covers works:

select count(*) from demo.a a, demo.b b 
where st_covers(a.geog, b.geog) -- 18

Ouch, this missed half of our rows in table b. Bonus points if you can guess which 18 we missed :).

Can we use ST_Intersects instead?

select count(*) from demo.a a, demo.b b
where st_intersects(a.geog, b.geog) -- 64

This returns too much. But some of this is due to nested polygon just touching outer polygon. Let’s exclude these:

select count(*) from demo.a a, demo.b b
where st_intersects(a.geog, b.geog) and
not st_touches(a.geog, b.geog) -- 48

Closer, but we want 36 rows, when each shape in b joins only once.

One idea is to sort potential parents by intersection area and take the largest. ARRAY_AGG with LIMIT clause does the trick:

select 
b.id,
array_agg(a.id
order by st_area(st_intersection(a.geog, b.geog)) desc
limit 1)[offset(0)] as parent_id
,
any_value(b.geog) geog
from demo.a a, demo.b b
where st_intersects(a.geog, b.geog)
group by b.id

After the JOIN, we aggregate by id of b shape, take all the shapes in a intersecting our b, ordering them by intersection area, and take largest one.

If the query above looks too complex, I agree. For simple cases like this, we can take the potential parent if the intersection area is more than half of the child:

select a.id as parent_id, b.id, b.geog from demo.a a, demo.b b
where st_intersects(a.geog, b.geog) and
st_area(st_intersection(a.geog, b.geog)) > 0.5 * st_area(b.geog)

This returns 36 rows. Let’s check them in BQ GeoViz. I’ll draw all b.geog and use parent_id to color them. Perfect now:

One more way to do it is to replace shape b with its centroid, i.e. use condition ST_Intersects(a.geog, ST_Centroid(b.geog)). This works reliably if at least one of the shapes a or b is convex. It might give wrong results if both a and b are concave, but that’s rather rare.

--

--

Michael Entin

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.