Nearest neighbor in BigQuery GIS

Michael Entin
2 min readAug 29, 2019

--

Let’s discuss how to find an item from some dataset nearest to each item in another dataset. Say we have set of people locations and set of restaurants, and we want to find nearest restaurant for each person.

BigQuery does not have a dedicated Nearest Neighbor TVF, so let’s look at available options.

The simplest solution: we can do full cross join, and then group by people. We select the nearest neighbor for each person by sorting all the restaurants according to distance, and selecting first one:

SELECT 
a.id,
ARRAY_AGG(b.id ORDER BY ST_Distance(a.geog, b.geog) LIMIT 1)
[ORDINAL(1)] as neighbor_id

FROM people_table a JOIN restaurant_table b
GROUP BY a.id

But that would be a very expensive join, and tons of computations of distances. While very simple, this solution does not scale.

Can we limit the number of pairs we need to consider? BigQuery has fast spatial join, here we can useST_DWithin condition that returns subset of pairs within some distance. This means we’ll have to define some bounding search radius, in the example above this might mean we only search for restaurants closer than say 10 km. The smaller the limit, the less pairs are matched by join, and the faster the algorithm runs, but it potentially leaves more points with no neighbor at all. Larger search radius means slower query, but we find neighbors for more points.

SELECT 
a.id,
ARRAY_AGG(b.id ORDER BY ST_Distance(a.geog, b.geog) LIMIT 1)
[ORDINAL(1)] as neighbor_id
FROM people_table a JOIN restaurant_table b
ON ST_DWithin(a.geog, b.geom, 123.45) -- 123.45 is search radius
GROUP BY a.id

Note this is regular JOIN, so it will drop all points in a for which it does not find any close-enough neighbor. Don’t try to make it LEFT JOIN, BQ GIS doesn’t have spatial optimizations for outer join yet.

Some more hints:

  • If you need more fields from table b, replace expression b.id inside ARRAY_AGG with STRUCT(b.id, b.extra_field, b.one_more_field)
  • If you need more fields from table a, especially in case of GEOGRAPHY that you cannot use as aggregation key, use ANY_VALUE aggregation function:
    SELECT ... , ANY_VALUE(a.geog) ... GROUP BY a.id

If needed, we can then repeat this query with points missed by original search, now using larger search radius, and append it to original result.

WITH missed_people_table AS (
SELECT a.* FROM people_table a
WHERE NOT EXISTS(
SELECT 1
FROM people_with_neighbors n -- result of first SELECT
WHERE n.id = a.id)
SELECT
... -- the same query as above only
-- 1) using missed_people_table
-- 2) larger search radius

--

--

Michael Entin

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.