Let’s look at how to aggregate data by a geography column. BigQuery refuses to
GROUP BY using geography, so let’s consider alternatives, and learn about under-appreciated
Lacking the ability to GROUP BY a geography column, what can one use right now? A very common solution is to do
GROUP BY by some other unique id that you might have in the table. E.g. you might often have some
geo_id. Another common solution is to convert geography to text, typically using
ST_AsText, and create new unique column suitable as
GROUP BY key.
What if you need the geography value…
If you looked at BigQuery native visualization options some time ago, you might have tried BigQuery GeoViz and found it limited. But a lot of things have changed over the time, and I want to summarize major things that might help you to visualize Geospatial data in BigQuery.
First, BigQuery GeoViz can handle much more data now. It is still a client-side app and is limited by the client memory, but can handle up to about a quarter gigabyte of data. This is not huge, but it allows it to show e.g. choropleth map of all US counties. …
Today’s story is inspired by a feature request opened for BigQuery GIS. The request, as I understand it, is to provide aggregate version of
ST_Intersection. BigQuery GIS has
ST_Union_Agg, but no
ST_Intersection_Agg. Frankly, I still wonder about a use case when this would be needed, but let’s create such a function anyway.
The reason this is interesting is we’ll use complementary polygons here. Usually they cause troubles when a polygon with incorrect orientation is loaded to BigQuery, but here we’ll use them to our advantage.
Short refresher: on a sphere, every polygon has a complementary polygon. For example…
Today’s note is not specific for BQ GIS, but is handy when doing geospatial or other types of visualization and the dimension you are interested in has big outliers. Such extreme values might make charts ugly, or squeeze the choropleth map’s effective color range.
As an example, you might have most values in 0 to 100 range, and a few values reach 1000. Sometimes these outliers are interesting, but in my case they were noise, and I wanted to ignore them. I used to adjust the chart’s range manually or add another color interval, but decided to automate it. This…
Last time I showed how to use a PostgreSQL instance with PostGIS to fix invalid geometries, now I’m glad to introduce new BigQuery native options that simplify loading geospatial data, including invalid geometries.
The new signatures of
ST_GeogFromGeoJson support two common requests
If you worked with BigQuery GIS, you know it treats
WKT as describing an object on spherical Earth, with geodesic edges, matching semantics of internal
GEOGRAPHY type. And if you had planar data, loading it using
WKT format sometimes is OK, but could lead to problems if the geometries are spatially large. …
Today we’ll have a complex script that ties together new features of BigQuery to transparently talk from BigQuery Script to a PostgreSQL instance. With Dynamic SQL we can build BigQuery SQL queries dynamically, and here we’ll build and execute PostgreSQL queries dynamically in BigQuery.
A practical motivating example here is being able to fix invalid polygons that don’t conform to OGC spec and cannot be loaded to BigQuery. Unfortunately, such bad data can be found in various datasets one has to work with. You can often fix it using ST_MakeValid function in PostgreSQL, but there is no such facility yet…
You might have already seen many ways to convert GeoJson files to something BigQuery can understand. Let’s invent one more wheel!
E.g. this python script from Lak Lakshmanan:
Or using org2org tools to produce CSV file:
But not everyone knows how to run Python and installing Node.js might be a mess, so here is a simpler way to do it, using
jq command line tool.
Jq (https://stedolan.github.io/jq/) is single-executable command line tool for processing json…
Today’s recipe discusses the following problem: you have some geospatial statistics collected for one set of spatial hierarchy, but you need to do some data analysis using another hierarchy. Say we have very detailed US census statistics, collected for census tracts, but we need statistics for zip codes, and we don’t have any.
Now a huge disclaimer: you should NOT use zip codes for your geospatial analysis, they are bad for that purpose. See this excellent article in Carto blog that demonstrates well why this is bad:
But well, sometimes you need to do something even if you should…
Today’s short notice is not about GIS, just about BigQuery. Let’s fix a reliability bug!
I’ve just noticed the following pattern used in various posts, to convert any column value to a small integer number in a fixed range. It is used for partitioning, random sampling, etc:
The idea is to compute hash, and then bring it to a fixed range we need, in this case [0, 10).
FARM_FINGERPRINT returns a 64-bit signed integer, we compute it’s absolute value to avoid issues with negative values, and then take modulo 10 to bring it to the range we…
We’ve already discussed Nearest Neighbors problem:
The general idea is to start with small search radius, and increase it if we fail to find anything. Since that article, we got BigQuery Scripting, so let’s use it to do this. This time we’ll solve single neighbor search — find nearest neighbor for a specific point, rather than finding neighbors for a set of points.
For demo purposes, we’ll search nearest road in US Tiger database. Note it is not a very good road dataset for actual use, but fine for demo. The search with a specific radius will look like this: