# Creating spatial hierarchy from imperfect data

--

Last time we observed how planar-to-spherical conversion can lead to a slightly inconsistent conversion of related shapes, and how this can lead to unexpected result of spatial predicates. E.g. starting with flat map geometries `a` and `b` satisfying `ST_Covers(a, b)`, when we translate them to spherical geographies `a'` and `b'`, the expression `ST_Covers(a', b')` might be false.

Another common case when `ST_Covers` or `ST_Contains` gives unexpected result is inconsistent data preparation. E.g. the datasets for states and counties could have been independently prepared by different organizations, or with different precision, and then `ST_Covers(state, county)` might be false when we expected true, regardless of planar or spherical version.

Let’s discuss some ways to restore spacial hierarchy from fuzzy data.

First, let’s prepare demo data: like in the previous post we’ll use perfect squares (on flat map), but they are distorted when translated to the globe.

We cannot use ST_MakePolygon, since it operates on sphere, let’s create GeoJson version using BigQuery GeoJson translator:

`create function demo.geojson_polygon(    points ARRAY<STRUCT<x int64, y int64>>) AS (  ST_GeogFromGeoJson(CONCAT(    '{"type": "Polygon", "coordinates": [[',    ARRAY_TO_STRING(       ARRAY(SELECT FORMAT('[%d, %d]', x, y) FROM UNNEST(points)),       ','),    ']]}')));`

We can now create outer shapes:

`create or replace table demo.a ASSELECT   x + 100 * y AS id,   demo.geojson_polygon([STRUCT(x,y), STRUCT(x+30,y),      STRUCT(x+30,y+30), STRUCT(x,y+30), STRUCT(x,y)]) AS geogFROM unnest(generate_array(0, 50, 30)) x,      unnest(generate_array(0, 50, 30)) y`

This creates four polygons, with an approximate width of 30 degrees in both lon and lat, sitting between 0 and 60 degrees.

`demo.b` was created by replacing 30 with 10 in the script above to get 36 polygons of width ~10 degrees in the same space.

Let’s check if determining the hierarchy using `ST_Covers` works:

`select count(*) from demo.a a, demo.b b where st_covers(a.geog, b.geog) -- 18`

Ouch, this missed half of our rows in table `b`. Bonus points if you can guess which 18 we missed :).

Can we use `ST_Intersects` instead?

`select count(*) from demo.a a, demo.b bwhere st_intersects(a.geog, b.geog) -- 64`

This returns too much. But some of this is due to nested polygon just touching outer polygon. Let’s exclude these:

`select count(*) from demo.a a, demo.b bwhere st_intersects(a.geog, b.geog) and       not st_touches(a.geog, b.geog) -- 48`

Closer, but we want 36 rows, when each shape in `b` joins only once.

One idea is to sort potential parents by intersection area and take the largest. `ARRAY_AGG` with `LIMIT` clause does the trick:

`select   b.id,   array_agg(a.id             order by st_area(st_intersection(a.geog, b.geog)) desc            limit 1)[offset(0)] as parent_id,   any_value(b.geog) geogfrom demo.a a, demo.b bwhere st_intersects(a.geog, b.geog)group by b.id`

After the JOIN, we aggregate by id of `b` shape, take all the shapes in `a` intersecting our `b`, ordering them by intersection area, and take largest one.

If the query above looks too complex, I agree. For simple cases like this, we can take the potential parent if the intersection area is more than half of the child:

`select a.id as parent_id, b.id, b.geog from demo.a a, demo.b bwhere st_intersects(a.geog, b.geog) and  st_area(st_intersection(a.geog, b.geog)) > 0.5 * st_area(b.geog)`

This returns 36 rows. Let’s check them in BQ GeoViz. I’ll draw all `b.geog` and use `parent_id` to color them. Perfect now:

One more way to do it is to replace shape `b` with its centroid, i.e. use condition `ST_Intersects(a.geog, ST_Centroid(b.geog))`. This works reliably if at least one of the shapes `a` or `b` is convex. It might give wrong results if both `a` and `b` are concave, but that’s rather rare.

--

--

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.