Nearest neighbor in BigQuery GIS

--

Let’s discuss how to find an item from some dataset nearest to each item in another dataset. Say we have set of people locations and set of restaurants, and we want to find nearest restaurant for each person.

BigQuery does not have a dedicated Nearest Neighbor TVF, so let’s look at available options.

The simplest solution: we can do full cross join, and then group by people. We select the nearest neighbor for each person by sorting all the restaurants according to distance, and selecting first one:

`SELECT   a.id,   ARRAY_AGG(b.id ORDER BY ST_Distance(a.geog, b.geog) LIMIT 1)      [ORDINAL(1)] as neighbor_idFROM people_table a JOIN restaurant_table bGROUP BY a.id`

But that would be a very expensive join, and tons of computations of distances. While very simple, this solution does not scale.

Can we limit the number of pairs we need to consider? BigQuery has fast spatial join, here we can use`ST_DWithin` condition that returns subset of pairs within some distance. This means we’ll have to define some bounding search radius, in the example above this might mean we only search for restaurants closer than say 10 km. The smaller the limit, the less pairs are matched by join, and the faster the algorithm runs, but it potentially leaves more points with no neighbor at all. Larger search radius means slower query, but we find neighbors for more points.

`SELECT   a.id,   ARRAY_AGG(b.id ORDER BY ST_Distance(a.geog, b.geog) LIMIT 1)      [ORDINAL(1)] as neighbor_idFROM people_table a JOIN restaurant_table bON ST_DWithin(a.geog, b.geom, 123.45) -- 123.45 is search radiusGROUP BY a.id`

Note this is regular JOIN, so it will drop all points in `a` for which it does not find any close-enough neighbor. Don’t try to make it LEFT JOIN, BQ GIS doesn’t have spatial optimizations for outer join yet.

Some more hints:

• If you need more fields from table `b`, replace expression `b.id` inside `ARRAY_AGG` with `STRUCT(b.id, b.extra_field, b.one_more_field)`
• If you need more fields from table `a`, especially in case of `GEOGRAPHY` that you cannot use as aggregation key, use `ANY_VALUE` aggregation function:
`SELECT ... , ANY_VALUE(a.geog) ... GROUP BY a.id`

If needed, we can then repeat this query with points missed by original search, now using larger search radius, and append it to original result.

`WITH missed_people_table AS (  SELECT a.* FROM people_table a  WHERE NOT EXISTS(      SELECT 1       FROM people_with_neighbors n -- result of first SELECT      WHERE n.id = a.id)SELECT   ... -- the same query as above only      -- 1) using missed_people_table      -- 2) larger search radius`