Today’s recipe discusses the following problem: you have some geospatial statistics collected for one set of spatial hierarchy, but you need to do some data analysis using another hierarchy. Say we have very detailed US census statistics, collected for census tracts, but we need statistics for zip codes, and we don’t have any.
Now a huge disclaimer: you should NOT use zip codes for your geospatial analysis, they are bad for that purpose. See this excellent article in Carto blog that demonstrates well why this is bad:
But well, sometimes you need to do something even if you should not. Or rather let’s say we’ll use zip codes as convenient demonstrations, but you promise to use it for something better, maybe map to river basin areas?
So for today we want to count people and find average incomes in each zip code, but we only have census data. Note that there is no common hierarchy between postal codes and census tracts. A zip code can contain several tracts, or vice versa, or they may partially overlap.
Let’s see them in BigQuery GeoViz. It does not support multiple layers, so I’ll use the following
UNION ALL query to visualize them together in GeoViz, adding
census column to be able to distinguish them:
select * from (
1 as census, tract_geom geom
`bigquery-public-data.geo_census_tracts.census_tracts_washington` union all select
0 as census, zip_code_geom geom
where st_dwithin(geom, st_geogpoint(-122.191667, 47.685833), 5000)
With the added column, we can draw zip code boundaries green, and tract boundaries red to distinguish them. And we see they are completely independent polygons.
What can we do? A simple idea is to split census tracts on zip code boundaries. We then pretend each census tract is uniform, so if 40% of tract surface area is in a zip code A, we assign 40% of the population to that zip code.
How can we compute average incomes? We cannot assign 40% of average to a zip code, like we do with people, this makes no sense. But we can assign 40% of total tract’s income, computed as
average * population to a zip code! Note this only works with averages, but we cannot use this simple trick with medians — they cannot be mapped this simple way.
OK, let’s start coding. First, we need to find all pairwise intersections between tracts and zip codes, and for each pair compute tract’s percentage of area that we assign to zip code it intersects.
CREATE OR REPLACE TABLE
WITH zip_tract_join AS (
zips.functional_status as zip_functional_status,
tracts.geo_id as tract_geo_id,
tracts.functional_status as tract_functional_status,
/ ST_Area(tracts.tract_geom) as tract_pct_in_zip_code
SELECT * FROM zip_tract_join WHERE tract_pct_in_zip_code > 0;
This creates a table
zip_tract_join with all non-empty intersections, where
tract_pct_in_zip_code is share of the tract that belongs to particular intersection, computed as area of intersection divided by total tract area. BTW, this query took less than 30 seconds for the whole US!
Next, remember that we need additive values, like total income, rather than averages that are hard to work with. We’ll use this query:
SELECT geo_id, total_pop,
total_pop * income_per_capita as total_income
Then we join this with
demo.zip_tract_join table created in previous step, aggregate by zip code, and convert back to averages. Whole query is:
CREATE OR REPLACE TABLE demo.zip_pop_income AS
WITH census_totals AS (
-- convert averages to additive totals
total_pop * income_per_capita AS total_income
joined AS (
-- join with precomputed census/zip pairs,
-- compute zip's share of tract
total_pop * tract_pct_in_zip_code AS zip_pop,
total_income * tract_pct_in_zip_code AS zip_income
FROM census_totals c
JOIN demo.zip_tract_join ztj
ON c.geo_id = ztj.tract_geo_id
sums AS (
-- aggregate all "pieces" of zip code
SUM(zip_pop) AS zip_pop,
SUM(zip_income) AS zip_total_inc
GROUP BY zip_code
-- convert to averages
zip_total_inc/zip_pop AS income_per_capita
We got a zip code statistics table
zip_pop_income for the whole U.S. It does not contain zip geometry information, only statistics. If we need a specific zip code, we can query this table alone. For area query, wee can join it with zip boundaries table:
with zipcodes_within_distance as (
1609 * 5)
on area.zip_code = stats.zip_code
For sanity check, let’s compare this with direct query of original census stats:
t.geo_id = s.geo_id
1609 * 5)
The maps are similar, but a lot of fine grained detail is lost in zip code version. This of course supports that one should try to avoid using zip codes for geospatial analysis. But if you need to —we now have a recipe how to do this.