Friday, May 30, 2025

Masking out Sensitive Data on Nested Column in the Google BigQuery Table


In Google BigQuery, nested column is column with STRUCT or RECORD type that put group related fields into a single, structured object within a row. It creates hierarchy of data instead of having separate of flat columns in every piece of information there.

When BigQuery Export is enabled in Google Workspace, logs are sent to BigQuery in two main big tables, named "usage" and "activity" there.

Google Workspace logs contain a lot of sensitive data. And with that condition, the data masking feature in the BigQuery provides many benefits to ease masking out those sensitive data for reporting purposes. 

Unfortunately, inside the "usage" and "activity" tables, they have so many nested columns. 

Example on the "activity" table, if we'd like to mask out all columns that contain email sensitivity, we need to identify all email columns (e.g. search with keyword "user_email"), and then apply mask function/policy respectively there. Those columns are mostly inside their parent field with column type RECORD (so called "nested"):






















The policy tags and data policies ( https://cloud.google.com/bigquery/docs/column-data-masking ) cannot be applied to the nested columns there. As a result, we need to create custom masking routines ( https://cloud.google.com/bigquery/docs/user-defined-functions#custom-mask ), so called "UDF" (User-Defined Function) with some complex SQL syntaxes there.

Suppose we have created following user defined function to mask out username in the email address using SHA functions like this:

CREATE OR REPLACE FUNCTION `projectid-abcde.dataset-xyz.my_custom_email_mask`(email STRING)
RETURNS STRING
AS (
  CASE
    WHEN email IS NULL THEN NULL
    ELSE LEFT(TO_HEX(SHA1(CAST(email AS BYTES))), 7) || '-' || RIGHT(TO_HEX(SHA256(CAST(email AS BYTES))), 7) || '@' || SUBSTR(email, INSTR(email, '@') + 1)
  END
);

So, if the "activity" table has column email with type STRING to be masked out as sensitive data, we simply can execute the SQL syntax like this:

SELECT
  (dataset-xyz.my_custom_email_mask(email)) as email,
   t.* EXCEPT (email) 
FROM
  `projectid-abcde.dataset-xyz.activity` t
WHERE
  email IS NOT NULL;

Problem comes when the column is not STRING but inside nested RECORD.

If the column is nested, the syntax will give error as the BigQuery doesn't allow nested record to be executed directly like this there:

  SELECT
    (dataset-xyz.my_custom_email_mask(email)) as email,
    (dataset-xyz.my_custom_email_mask(mobile.user_email)) as mobile.user_email,
    (dataset-xyz.my_custom_email_mask(meet.identifier)) as meet.identifier,
    (dataset-xyz.my_custom_email_mask(meet.organizer_email)) as meet.organizer_email,
     t.* EXCEPT (email, mobile.user_email, meet.identifier, meet.organizer_email) 
  FROM
    `projectid-abcde.dataset-xyz.activity` t
  WHERE
    mobile is not NULL or meet is not NULL;

So, to mask the nested columns like "mobile.user_email", "meet.identifier", "meet.organizer_email", we need to use STRUCT like this:

  SELECT
    (dataset-xyz.my_custom_email_mask(email)) as email,
     t.* EXCEPT (email) REPLACE(
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(mobile.user_email)) as user_email, mobile.* EXCEPT (user_email)) AS mobile,
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(meet.identifier)) as identifier, (dataset-xyz.my_custom_email_mask(meet.organizer_email)) as organizer_email, meet.* EXCEPT (identifier, organizer_email)) AS meet
     )
  FROM
    `projectid-abcde.dataset-xyz.activity` t
  WHERE
    mobile is not NULL or meet is not NULL;

Complicated and can result to having very long syntaxes, but that's how those work so far 😐














To simplify the query, we may want to create a VIEW like this:

CREATE OR REPLACE VIEW `projectid-abcde.dataset-xyz.activity_with_masked_email_view` AS
  SELECT
    (dataset-xyz.my_custom_email_mask(email)) as email,
     t.* EXCEPT (email) REPLACE(
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(mobile.user_email)) as user_email, mobile.* EXCEPT (user_email)) AS mobile,
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(meet.identifier)) as identifier, (dataset-xyz.my_custom_email_mask(meet.organizer_email)) as organizer_email, meet.* EXCEPT (identifier, organizer_email)) AS meet
     )
  FROM
    `projectid-abcde.dataset-xyz.activity` t
  WHERE
    mobile is not NULL or meet is not NULL;

Hopefully in future Google can make those syntaxes to be shorter and simpler 😘