Puppies & BigQuery: Analyzing Geospatial Data
Let’s try to repeat the following analysis in BigQuery and see how easy it is to do geospatial analysis in BigQuery GIS and BigQuery public datasets. It uses a few publicly available datasets to find highest density of pets in Seattle area.
We’ll follow the general path of this article, with modifications as needed.
First, we’ll skip all the installation steps, nothing needs to be installed! OK, I’m cheating a bit here — I’ll use bq
command line to upload data, and this requires Google Cloud SDK. But you can upload data in UI too, and with some extra mouse clicks avoid any installation completely.
As preparation step, lets open BigQuery UI and create demo
dataset for intermediate results:
create schema if not exists demo;
Second, the article describes loading shape file. We can do this — see https://medium.com/@mentin/loading-large-spatial-features-to-bigquery-geography-2f6ceb6796df but we’ll skip it for this article, as BigQuery public dataset already has the tables we need, bigquery-public-data.geo_us_boundaries.zip_codes
for zip codes, and bigquery-public-data.geo_us_census_places.places_washington
for city boundaries:
SELECT place_geom
FROM `bigquery-public-data.geo_us_census_places.places_washington`
WHERE place_name = 'Seattle'
To find out zip codes within Seattle area, we’ll join and compute area of intersection, just like the original article:
CREATE TABLE demo.seattle_zip_codes AS
SELECT zip_code, zip_geom_within_seattle,
ST_Area(zip_geom_within_seattle)/1e6 AS area_km2
FROM (
SELECT
zip_code,
ST_Intersection(place_geom, zip_code_geom)
AS zip_geom_within_seattle
FROM `bigquery-public-data.geo_us_census_places.places_washington` p
JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` z
ON ST_Intersects(place_geom, zip_code_geom)
WHERE place_name = 'Seattle')
Third, we’ll need pet licenses data — download from https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb as CSV, and upload to BigQuery using UI or with the following command:
bq load --autodetect --skip_leading_rows=1 --schema="Issue_Date, License_Number, Animal_Name, Species, Primary_Breed, Secondary_Breed, ZIP" demo.pets Seattle_Pet_Licenses.csv
Let’s examine our uploaded data, the particular column of interest is ZIP — interestingly it was uploaded as STRING! Let’s figure out why --autodetect
did not load it as INT64:
SELECT CAST(ZIP AS INT64) FROM demo.pets;
We’ll get an error similar to one below (your specific error may vary):
Bad int64 value: 98105-3010
Aha, some zip codes in the pets database were entered in ZIP+4 form. We’ll only need first part, so let’s do a simple ETL with pets data:
CREATE TABLE demo.pets2 AS
SELECT *, split(ZIP, '-')[ordinal(1)] as zip5
FROM demo.pets;
Finally, we are ready to join this data and visualize it in BQ GeoViz:
select
zip5,
cnt,
SAFE_DIVIDE(cnt, area_km2) as cnt_per_km2,
zip_geom_within_seattle geom
FROM
(SELECT zip5, count(*) AS cnt
FROM demo.pets2 GROUP BY zip5) p
JOIN
demo.seattle_zip_codes z
ON p.zip5 = z.zip_code
We now get a picture similar to the one obtained in the article: