Today’s short notice is not about GIS, just about BigQuery. Let’s fix a reliability bug!
I’ve just noticed the following pattern used in various posts, to convert any column value to a small integer number in a fixed range. It is used for partitioning, random sampling, etc:
The idea is to compute hash, and then bring it to a fixed range we need, in this case [0, 10).
FARM_FINGERPRINT returns a 64-bit signed integer, we compute it’s absolute value to avoid issues with negative values, and then take modulo 10 to bring it to the range we need.
Is there an issue here? Well, yes. If
FARM_FINGERPRINT result happens to be exactly minimum int64 value,
ABS function fails because there is no corresponding positive value, with error
int64 overflow: --9223372036854775808. (This is also why BigQuery has
SAFE_NEGATE function.) It is not likely to happen, but with big data, even extremely unlikely events do happen some day.
To make this safe, take
MOD first and then
You might be concerned about equal probabilities —
MOD returns values from -9 to 9; when we take
ABS of this, only 0 is converted to 0, while both 1 and -1 are converted to 1, etc. Does this make 0 half as probable? Not really, since when we compute
MOD, 0 is twice as probable as other values, because both positive and negative numbers divisible by 10 produce
MOD == 0. We can check quality of distribution with the following test:
with random_data as (
select cast(RAND() as string) v
from unnest(GENERATE_ARRAY(0, 1000000)) x
abs(mod(farm_fingerprint(v), 10)) k, count(*) c
group by k order by k;
In other cases where you need
ABS but don’t have
MOD, you can use e.g. expression