Simple automated data cleaning in BigQuery
Today’s note is not specific for BQ GIS, but is handy when doing geospatial or other types of visualization and the dimension you are interested in has big outliers. Such extreme values might make charts ugly, or squeeze the choropleth map’s effective color range.
As an example, you might have most values in 0 to 100 range, and a few values reach 1000. Sometimes these outliers are interesting, but in my case they were noise, and I wanted to ignore them. I used to adjust the chart’s range manually or add another color interval, but decided to automate it.
To trim outliers, let’s just compute quantiles, and then trim the most extreme values. The query might look like this:
WITH caps_ as (
SELECT
approx_quantiles(m, 100)[offset(99)] as max_cap,
approx_quantiles(m, 100)[offset(1)] as min_cap
FROM my_table
)
SELECT *,
greatest((select min_cap from caps_),
least((select max_cap from caps_), m)) as m_capped
FROM my_table
The caps
subquery computes approximate quantiles, and uses first and last quantiles as value caps. Next we ensure our value is within [min_cap, max_cap] range. But writing this for every measure was tedious, so I decided to automate it using BigQuery scripting. The result allows creating a view like above with a single call to a stored procedure:
CALL gislib.sp.CreateCappedView(
'my_project.views.table_capped', -- view name to create
'my_project.dataset.table', -- base table name
['amount', 'coolness', 'value'] -- list of measures
);
This creates a view with name specified by view_name
argument, for a base table (or view) specified by table_name
. The view has all original columns, plus for each of the string in measures
list, the view will have a column named <measure_name>_capped
. E.g. in the example above we’ll have extra columns amound_capped
, coolness_capped
, value_capped
.
We can go even further and avoid specifying the list of columns, this creates capped columns for all available columns:
CALL gislib.sp.CreateAutoCappedView(
'my_project.views.table_capped', -- view name to create
'my_project.dataset.table', -- base table name
);
Internals
Here is the script to create the text of our first stored procedure:
CREATE OR REPLACE FUNCTION gislib.sp.RepeatPattern(
pattern STRING, x STRING, items ARRAY<STRING>)
AS ((
SELECT STRING_AGG(REPLACE(pattern, x, i), ', \n')
FROM UNNEST(items) i
));CREATE OR REPLACE PROCEDURE `gislib.sp.CreateCappedView`(
view_name STRING,
table_name STRING,
measures ARRAY<STRING>)
BEGIN
EXECUTE IMMEDIATE
'CREATE OR REPLACE VIEW `' || view_name || '` AS \n'
|| 'WITH caps_ as (SELECT \n'
|| gislib.sp.RepeatPattern('approx_quantiles($, 100)[offset(99)] as $_max_cap_', '$', measures)
|| ', \n'
|| gislib.sp.RepeatPattern('approx_quantiles($, 100)[offset(1)] as $_min_cap_', '$', measures)
|| ' FROM `' || table_name || '` )\n'
|| ' SELECT *, \n'
|| gislib.sp.RepeatPattern('greatest((select $_min_cap_ from caps_), least((select $_max_cap_ from caps_), $)) as $_capped', '$', measures)
|| ' FROM `' || table_name || '`';
END;
RepeatPattern is a small helper function to repeat a specified patten, second is a procedure that builds a SQL text for CREATE VIEW
DDL command. Note that the code makes no efforts to prevent SQL injection or handle table names that need quotes, so use it carefully.
We can then create a second procedure that uses INFORMATION_SCHEMA
to get all the interesting columns, and creates the view fully automatically:
CREATE OR REPLACE PROCEDURE gislib.sp.CreateAutoCappedView(view STRING, schema_table STRING)
BEGIN DECLARE project STRING DEFAULT SPLIT(schema_table, '.')[offset(0)];
DECLARE schema STRING DEFAULT SPLIT(schema_table, '.')[offset(1)];
DECLARE table STRING DEFAULT SPLIT(schema_table, '.')[offset(2)]; DECLARE sql STRING DEFAULT
"SELECT ARRAY( SELECT column_name FROM `" || project || '.' || schema || "`.INFORMATION_SCHEMA.COLUMNS " ||
"WHERE table_name = '" || table || "' AND data_type IN ('FLOAT64', 'NUMERIC', 'INT64') );"; DECLARE columns ARRAY<STRING>;
EXECUTE IMMEDIATE sql INTO columns;
CALL gislib.sp.CreateCappedView(view, schema_table, columns);END;