Friday, May 30, 2025

Masking out Sensitive Data on Nested Column in the Google BigQuery Table


In Google BigQuery, nested column is column with STRUCT or RECORD type that put group related fields into a single, structured object within a row. It creates hierarchy of data instead of having separate of flat columns in every piece of information there.

When BigQuery Export is enabled in Google Workspace, logs are sent to BigQuery in two main big tables, named "usage" and "activity" there.

Google Workspace logs contain a lot of sensitive data. And with that condition, the data masking feature in the BigQuery provides many benefits to ease masking out those sensitive data for reporting purposes. 

Unfortunately, inside the "usage" and "activity" tables, they have so many nested columns. 

Example on the "activity" table, if we'd like to mask out all columns that contain email sensitivity, we need to identify all email columns (e.g. search with keyword "user_email"), and then apply mask function/policy respectively there. Those columns are mostly inside their parent field with column type RECORD (so called "nested"):






















The policy tags and data policies ( https://cloud.google.com/bigquery/docs/column-data-masking ) cannot be applied to the nested columns there. As a result, we need to create custom masking routines ( https://cloud.google.com/bigquery/docs/user-defined-functions#custom-mask ), so called "UDF" (User-Defined Function) with some complex SQL syntaxes there.

Suppose we have created following user defined function to mask out username in the email address using SHA functions like this:

CREATE OR REPLACE FUNCTION `projectid-abcde.dataset-xyz.my_custom_email_mask`(email STRING)
RETURNS STRING
AS (
  CASE
    WHEN email IS NULL THEN NULL
    ELSE LEFT(TO_HEX(SHA1(CAST(email AS BYTES))), 7) || '-' || RIGHT(TO_HEX(SHA256(CAST(email AS BYTES))), 7) || '@' || SUBSTR(email, INSTR(email, '@') + 1)
  END
);

So, if the "activity" table has column email with type STRING to be masked out as sensitive data, we simply can execute the SQL syntax like this:

SELECT
  (dataset-xyz.my_custom_email_mask(email)) as email,
   t.* EXCEPT (email) 
FROM
  `projectid-abcde.dataset-xyz.activity` t
WHERE
  email IS NOT NULL;

Problem comes when the column is not STRING but inside nested RECORD.

If the column is nested, the syntax will give error as the BigQuery doesn't allow nested record to be executed directly like this there:

  SELECT
    (dataset-xyz.my_custom_email_mask(email)) as email,
    (dataset-xyz.my_custom_email_mask(mobile.user_email)) as mobile.user_email,
    (dataset-xyz.my_custom_email_mask(meet.identifier)) as meet.identifier,
    (dataset-xyz.my_custom_email_mask(meet.organizer_email)) as meet.organizer_email,
     t.* EXCEPT (email, mobile.user_email, meet.identifier, meet.organizer_email) 
  FROM
    `projectid-abcde.dataset-xyz.activity` t
  WHERE
    mobile is not NULL or meet is not NULL;

So, to mask the nested columns like "mobile.user_email", "meet.identifier", "meet.organizer_email", we need to use STRUCT like this:

  SELECT
    (dataset-xyz.my_custom_email_mask(email)) as email,
     t.* EXCEPT (email) REPLACE(
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(mobile.user_email)) as user_email, mobile.* EXCEPT (user_email)) AS mobile,
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(meet.identifier)) as identifier, (dataset-xyz.my_custom_email_mask(meet.organizer_email)) as organizer_email, meet.* EXCEPT (identifier, organizer_email)) AS meet
     )
  FROM
    `projectid-abcde.dataset-xyz.activity` t
  WHERE
    mobile is not NULL or meet is not NULL;

Complicated and can result to having very long syntaxes, but that's how those work so far 😐














To simplify the query, we may want to create a VIEW like this:

CREATE OR REPLACE VIEW `projectid-abcde.dataset-xyz.activity_with_masked_email_view` AS
  SELECT
    (dataset-xyz.my_custom_email_mask(email)) as email,
     t.* EXCEPT (email) REPLACE(
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(mobile.user_email)) as user_email, mobile.* EXCEPT (user_email)) AS mobile,
       (SELECT AS STRUCT (dataset-xyz.my_custom_email_mask(meet.identifier)) as identifier, (dataset-xyz.my_custom_email_mask(meet.organizer_email)) as organizer_email, meet.* EXCEPT (identifier, organizer_email)) AS meet
     )
  FROM
    `projectid-abcde.dataset-xyz.activity` t
  WHERE
    mobile is not NULL or meet is not NULL;

Hopefully in future Google can make those syntaxes to be shorter and simpler 😘

Saturday, April 19, 2025

Apply Network Policy in the Kubernetes Cluster


When we deploy GenAI app on-prem like ollama with Open WebUI, we may get question: "how can we ensure our GenAI is fully processed on-prem internally and doesn't connect/go out to public services like huggingface, OpenAI or Groq?"

Simple answer maybe "just disconnect your wifi on laptop and see how it goes".

It sounds simple. But when the apps are deployed in data center, we cannot anyhow try and do something like that as it can pose risk on getting connectivity lost when we modify the firewall rules that currently connect to our end thru site-to-site VPN.

Fortunately if the apps are deployed in kubernetes cluster, this can be a relieve because there is a so called "Network Policies" in kubernetes ( https://kubernetes.io/docs/concepts/services-networking/network-policies/ ) that can be applied at namespace level without affecting the whole cluster in the case if we mistakenly apply policy with wrong rules.

This is the example of network policy yaml to block Internet to all pods in the "test" namespace:

$ vi netpol-block-internet-on-test-ns.yaml

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: block-internet-only
  namespace: test
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8
  - to:
    - ipBlock:
        cidr: 192.168.0.0/16
  - to:
    - ipBlock:
        cidr: 172.16.0.0/20

$ kubectl create -f ./netpol-block-internet-on-test-ns.yaml


Done, all pods at "test" namespace are no longer able to access Internet anymore. If we block wrongly, we won't get disconnected from the cluster but only to the pods in that specific "test" namespace.

When that happens, just simply perform edit netpol and retest as per needed:

$ kubectl get netpol -n test
NAME                  POD-SELECTOR   AGE
block-internet-only   &ltnone&gt         5d20h

$ kubectl edit netpol -n test block-internet-only

$ kubectl get pods -n test
NAME                                     READY   STATUS    RESTARTS   AGE
virt-launcher-rhel9-test4-5x4cl          1/1     Running   0          28h
virt-launcher-ubuntu2204-lg4c5           1/1     Running   0          7d6h
web-test-7bb8d8fdc8-bwwjz                1/1     Running   0          8d
web2-test-db86ffffb-stsqq                1/1     Running   0          8d
web3-test-564d95cb8f-s8ptl               1/1     Running   0          8d
chroma-chromadb-0                        1/1     Running   0          8d
ollama-6866946df5-w5vc9                  1/1     Running   0          31h
open-webui-855594f59b-cj8v2              1/1     Running   0          8d

$ kubectl exec -it -n test ollama-6866946df5-w5vc9 -- /bin/bash
root@ollama-6866946df5-w5vc9:/# ping 8.8.8.8
bash: ping: command not found

root@ollama-6866946df5-w5vc9:/# apt update ; apt install -y iputils-ping    #(Oops, I cannot connect to Internet so I need to remove back the above netpol temporarily in order to perform this installation :))
root@ollama-6866946df5-w5vc9:/# ping 8.8.8.8   #(now here I can try to apply the netpol again and check the effect on the Internet connectivities)


Kubernetes is complicated and complex with full of yamls and command lines. But when we know how to use those yamls and command lines, it is so convenient to deploy/apply any capabilities we need them there.👍

Even Redhat has its premium product with Openshift and can simplify most of those yamls and command lines thru clicks on web-based portal, it is still not as convenient as applying/executing them thru classic terminal and "oc" command lines there. 😃

















Which one do you prefer to use: 
- Typing command lines thru terminal?
- Or, clicking the link and button thru web portal?😏

Don't you know that kubectl and oc commands can perform auto-completion in the bash terminal to speed up typing the command with its parameters? 

And don't you know as well that when we install Redhat CoreOS, the oc auto-completion has already been turned-on by default on the bash shell there? 😀