Be careful with ABS function

Michael Entin
2 min readDec 17, 2019

--

Today’s short notice is not about GIS, just about BigQuery. Let’s fix a reliability bug!

I’ve just noticed the following pattern used in various posts, to convert any column value to a small integer number in a fixed range. It is used for partitioning, random sampling, etc:

MOD(ABS(FARM_FINGERPRINT(field)), 10)

The idea is to compute hash, and then bring it to a fixed range we need, in this case [0, 10).

The FARM_FINGERPRINT returns a 64-bit signed integer, we compute it’s absolute value to avoid issues with negative values, and then take modulo 10 to bring it to the range we need.

Is there an issue here? Well, yes. If FARM_FINGERPRINT result happens to be exactly minimum int64 value, ABS function fails because there is no corresponding positive value, with error int64 overflow: --9223372036854775808. (This is also why BigQuery has SAFE_NEGATE function.) It is not likely to happen, but with big data, even extremely unlikely events do happen some day.

To make this safe, take MOD first and then ABS:

ABS(MOD(FARM_FINGERPRINT(field), 10))

You might be concerned about equal probabilities — MOD returns values from -9 to 9; when we take ABS of this, only 0 is converted to 0, while both 1 and -1 are converted to 1, etc. Does this make 0 half as probable? Not really, since when we computeMOD, 0 is twice as probable as other values, because both positive and negative numbers divisible by 10 produce MOD == 0. We can check quality of distribution with the following test:

with random_data as (
select cast(RAND() as string) v
from unnest(GENERATE_ARRAY(0, 1000000)) x
)
select
abs(mod(farm_fingerprint(v), 10)) k, count(*) c
from random_data
group by k order by k;

In other cases where you need ABS but don’t have MOD, you can use e.g. expression IFNULL(SAFE.ABS(…), 0).

--

--

Michael Entin

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.