Be careful with ABS function
Today’s short notice is not about GIS, just about BigQuery. Let’s fix a reliability bug!
I’ve just noticed the following pattern used in various posts, to convert any column value to a small integer number in a fixed range. It is used for partitioning, random sampling, etc:
MOD(ABS(FARM_FINGERPRINT(field)), 10)
The idea is to compute hash, and then bring it to a fixed range we need, in this case [0, 10).
The FARM_FINGERPRINT
returns a 64-bit signed integer, we compute it’s absolute value to avoid issues with negative values, and then take modulo 10 to bring it to the range we need.
Is there an issue here? Well, yes. If FARM_FINGERPRINT
result happens to be exactly minimum int64 value, ABS
function fails because there is no corresponding positive value, with error int64 overflow: --9223372036854775808
. (This is also why BigQuery has SAFE_NEGATE
function.) It is not likely to happen, but with big data, even extremely unlikely events do happen some day.
To make this safe, take MOD
first and then ABS
:
ABS(MOD(FARM_FINGERPRINT(field), 10))
You might be concerned about equal probabilities — MOD
returns values from -9 to 9; when we take ABS
of this, only 0 is converted to 0, while both 1 and -1 are converted to 1, etc. Does this make 0 half as probable? Not really, since when we computeMOD
, 0 is twice as probable as other values, because both positive and negative numbers divisible by 10 produce MOD
== 0. We can check quality of distribution with the following test:
with random_data as (
select cast(RAND() as string) v
from unnest(GENERATE_ARRAY(0, 1000000)) x
)
select
abs(mod(farm_fingerprint(v), 10)) k, count(*) c
from random_data
group by k order by k;
In other cases where you need ABS
but don’t have MOD
, you can use e.g. expression IFNULL(SAFE.ABS(…), 0)
.