Migrating Geospatial Data to BigQuery
After working with BigQuery Geospatial customers for a few years, I think I learned common hurdles one may stumble at when moving from another GIS or SQL system to BigQuery. Let’s have a single page intro that might be useful to new users.
Please comment if I missed something major, made a mistake about some system, or if I should add a comparison to some popular system.
Is My World Flat or Spherical?
The first major differentiator is projected SRS vs geodesic SRS. Former systems operate on a flat map, latter on spherical / spheroid globe.
It is common in SQL world to have two types, Geometry for flat map, and Geography for spherical world. MySql is the exception here, it uses a single type, and chooses behavior depending on whether the geometry SRID is projected or geographic SRS. BigQuery currently only supports spherical Geography type.
- Planar: PostGIS Geometry, Microsoft SqlServer Geometry, MySQL with projected SRS, most GIS software.
- Spherical: PostGIS Geography, Microsoft SqlServer Geography, MySQL 8+ with geographic SRS (e.g. 4326), BigQuery Geography.
If you already work with spherical world, you should be quite comfortable with BigQuery. If you work with planar world, a few things will be different in BigQuery, but your life will also be simpler: you don’t need to think about projections, all measures return meters, you don’t need ST_DistanceSphere
, etc.
Coordinate Order
Coordinate order varies between systems and sometimes within a system. Geospatial applications like Google Maps usually put latitude first, longitude second (humanity invented how to measure longitude much later than latitude). Most databases and formats put longitude first, latitude second, following (x, y) math convention. But some systems use different conventions in different contexts, be careful.
- Database Systems that always use longitude-latitude order: PostGIS for both Geometry and Geography, BigQuery.
- Formats that require longitude-latitude order: GeoJSON, GeoPackage, GeoParquet.
- Microsoft SqlServer uses SRS order (thus lat-lon order for 4326) for
Point
constructor, but lon-lat order when reading or writing WKT. I.e. two queries below construct the same point:
-- SqlServer constructors for a point with longitude:-122, latitude:47
SELECT geography::STGeomFromText('POINT(-122 47)', 4326);
SELECT geography::Point(47, -122, 4326);
-- both print 'POINT (-122 47)'
- MySQL 8.0 does the opposite of SqlServer in each of these cases, i.e.
Point
function accepts lon-lat, while it uses SRS order (lat-lng for 4326) for WKT. These represent the same point again:
-- MySql constructors for a point with longitude:-122, latitude:47
SELECT ST_GeomFromText('POINT(47 -122)', 4326));
SELECT ST_SRID(Point(-122, 47), 4326);
-- both print '{"x":-122,"y":47}'
Also, MySQL 5.7 behavior was different from MySQL 8.0.
BigQuery and PostGIS always use longitude-latitude order:
SELECT ST_GeogFromText('POINT(-122 47)');
SELECT ST_GeogPoint(-122, 47);
Lines and Edges
In planar world, lines and edges of polygons follow straight line on the currently used projection. On a sphere, lines and polygon edges follow geodesic lines.
What happens when you feed planar data (in a proper coordinate system, usually WSG84) to spherical system, or vice versa? Most of the time, the destination system just treats the input data as if it were its own, e.g. a spherical system will consider edges geodesic regardless of source. BTW, similar thing happens in most tools when you use ST_Transform
or tools like ogr2ogr
to translate geometries between projections — this reprojects the vertices, but lines and edges start to follow straight lines in the new projection, which usually differs from straight line between same points in previous projection.
This might have bad effect on the geometries: geometries now describe different shapes, some geometries might become invalid. Still, for small enough line segments and edges this is often good-enough approximation.
BigQuery has another option: when it reads GeoJson format (the RFC specifies a line follows straight Cartesian line), or parses WKT string given ST_GeogFromText
option planar=>TRUE
, BigQuery checks if it needs to interpolate those straight lines in WGS84 space, by adding additional vertices if needed to achieve precision of ~10 meters (currently not configurable):
select ST_GeogFromText(l) geo, ST_GeogFromText(l, planar=>TRUE) interpolated
from unnest(['Linestring(0 5, 1 5)', 'Linestring(0 10, 1 10)']) l
Row geo interpolated
1 LINESTRING(0 5, 1 5) LINESTRING(0 5, 0.5 5, 1 5)
2 LINESTRING(0 10, 1 10) LINESTRING(0 10, 0.25 10, 0.5 10, 0.75 10, 1 10)
For deeper technical discussion, see
- https://mentin.medium.com/limits-of-planar-spherical-edge-conversion-56cbba1ffebd
- https://mentin.medium.com/long-lines-b65ad9fa8e14
Polygons Orientation
Polygons loaded with wrong orientation is probably the most common problem users encounter. I’ll point to the official docs describing what it is about: https://cloud.google.com/bigquery/docs/geospatial-data#polygon_orientation
Wrong orientation may manifest itself in dozens of ways: you compute ST_Area
and get a huge number, GeoViz fills the whole map or does not fill the polygon, results of other computations are wrong.
A common way to get polygon orientation wrong is by loading geometries as WKT/WKB data. In this case, BigQuery assumes the input might potentially contain huge polygons, and respects polygon orientation. You don’t get this problem when loading GeoJson strings or if converting geometries using ST_GeogFromText
(it ignores input orientation by default, and chooses smaller polygon; one has to pass explicit oriented=>TRUE
argument if data might include huge polygons).
The most reliable way to avoid this problem, if you bring data from a planar world, is to load geometries as GeoJson strings. As noted above, the GeoJson uses planar SRS, where polygon orientation does not matter, so BigQuery can load it reliably (and approximate planar edges if needed).
If WKT is more convenient, and you are not sure about polygon orientation (but all polygons are small), the best option right now is probably to load it as string into a temporary table, and then build Geography using ST_GeogFromText(wkt)
or ST_GeogFromText(wkt, planar => TRUE)
if data comes from planar system.
If you already loaded the geometries, and want to check if you did it right — compute ST_Area
of the polygons. If all the polygons are supposed to cover
“small” area (smaller than hemisphere), but ST_Area
returns values over 1e14 — these polygons are likely inverted.
Invalid Geometries
Often the real world data you need to use is “dirty” and contains invalid geometries. Most systems deal with it by allowing one to load arbitrary geometric data, even when it does not follow OGC standards. Using such invalid geometries can lead to surprises later, giving unexpected and inconsistent results. Always check your geometries with ST_IsValid
and fix them with ST_MakeValid
to avoid surprises.
BigQuery made a less common choice, and to avoid such surprises or having to deal with invalid data internally, it validates all data when creating Geography objects, rejecting invalid ones. Instead of using ST_MakeValid,
you can fix the geometries at creation time, by passing make_valid => TRUE
argument, e.g. ST_GeogFromText(wkt, make_valid => TRUE)
.
Indexing
Most DBMS offer some sort of indices including GiST/spatial indices. BigQuery instead offers partitioning and clustering, both can break huge tables into smaller pieces, making it faster and cheaper to find data you need. For Geography
type, only clustering is supported, it helps with queries that select Geographies close to some constant geography object.
Note that if you want to find all geographies close to a location that comes from another table, i.e. this constant used in filter is dynamic, you might have to break the query into two statements in order for the clustering to be useful:
DECLARE center geography DEFAULT (
SELECT boundary FROM `my_table` where id = 42
);
SELECT * FROM another_table WHERE ST_DWithin(geo, center, 100);
Spatial joins (only inner joins as of early 2024) build and use spatial indices regardless of storage clustering.