augusto kiniama rosa

Posted on Nov 7, 2023 • Originally published at blog.infostrux.com on Sep 30, 2023

New Features: Snowflake International Tags & Shared Tags

#snowflake #governance #technology #data

Newly Released Tag Support for International Tag Standard ISO-3166–1 alpha-2 and Ability to Share Tags

Photo by Brett Jordan on Unsplash

In August, Snowflake released two new features related to Data Governance, which included support for the International Tag Standard ISO-3166–1 alpha-2 and shared tags.

Let's look at a use case on how we can implement both of these features and how they can help with the data governance of your Snowflake implementation.

What is the standard ISO 3166–1 alpha-2?

The ISO 3166–1 alpha-2 code is a two-letter code uniquely identifying a country or a significant geographic area. It’s part of the ISO 3166 standard adopted by the International Organization for Standardization (ISO). This standard defines codes for the names of countries, dependent territories, and particular areas of geographical interest.

For example:

US for the United States
CA for Canada
FR for France
CN for China
IN for India … and etc etc.

These codes are used in various international contexts, including in vehicle registration plates, domain names (like .us, .ca, .uk), and international shipping labels, among other uses.

This a working standard, and version 3, for example, uses a 3-letter country code, and version 2 includes states and provinces.

What is Snowflake implementation?

Snowflake used the standard to augment the classification system, which helps create metadata to identify PII data with a focus and support for international values. Keep in mind that it is still an alpha standard.

Snowflake offers SEMANTIC_CATEGORY tags for various countries including Australia, Canada, the United Kingdom, and the United States. These tags are grouped under specific parent categories to support the respective countries, and parent categories provide details about the classification outcome, such as whether a column primarily comprises values from a particular country.

Lets create a quick test and its capabilities, just remember, all I did run is the SQL scripts below, and I did not create other tags.

CREATE OR REPLACE TABLE customers (
    first_name VARCHAR,
    last_name VARCHAR,
    email_address VARCHAR,
    street_address VARCHAR,
    city VARCHAR,
    province_state VARCHAR,
    postal_code VARCHAR,
    drivers_license VARCHAR
    );

INSERT INTO customers (first_name, last_name, email_address, street_address, city, province_state, postal_code, drivers_license) 
    VALUES 
        ('Augusto','Rosi','1@fake.com','7A - 6th Avenue','New York','NY','12346','R72727272772'),
        ('Kilo', 'Start','2@fake.com','7A - 6th Avenue','Markham','ON','L7T 1T1','344866431'),
        ('Audrey','Curty','3@fake.com','7A - 6th Avenue','New York','NY','12346','344866432'),
        ('Albertino','Moty','4@fake.com','7A - 6th Avenue','New Jersey','NJ','12346','344866434'),
        ('Wong', 'August','5@fake.com','7A - 6th Avenue','Los Angeles','CA','12346','344866432'),
        ('King', 'Maria','6@fake.com','7A - 6th Avenue','New York','NY','12346','344866439');

Here is results, noticed how it was able to automatically find what was a Canadian or american address

SELECT EXTRACT_SEMANTIC_CATEGORIES('customers');

RESULTS:
{
  "CITY": {
    "alternates": [],
    "recommendation": {
      "confidence": "HIGH",
      "coverage": 0.8333333,
      "details": [
        {
          "coverage": 0.8333333,
          "semantic_category": "US_CITY"
        }
      ],
      "privacy_category": "QUASI_IDENTIFIER",
      "semantic_category": "CITY"
    },
    "valid_value_ratio": 1
  },
  "DRIVERS_LICENSE": {
    "alternates": [],
    "recommendation": {
      "confidence": "MEDIUM",
      "coverage": 0.5,
      "details": [
        {
          "coverage": 0.5,
          "semantic_category": "US_DRIVERS_LICENSE"
        },
        {
          "coverage": 0.41666666,
          "semantic_category": "AU_DRIVERS_LICENSE"
        },
        {
          "coverage": 0.41666666,
          "semantic_category": "CA_DRIVERS_LICENSE"
        }
      ],
      "privacy_category": "IDENTIFIER",
      "semantic_category": "DRIVERS_LICENSE"
    },
    "valid_value_ratio": 1
  },
  "EMAIL_ADDRESS": {
    "alternates": [],
    "recommendation": {
      "confidence": "HIGH",
      "coverage": 1,
      "details": [],
      "privacy_category": "IDENTIFIER",
      "semantic_category": "EMAIL"
    },
    "valid_value_ratio": 1
  },
  "FIRST_NAME": {
    "alternates": [],
    "recommendation": {
      "confidence": "HIGH",
      "coverage": 1,
      "details": [],
      "privacy_category": "IDENTIFIER",
      "semantic_category": "NAME"
    },
    "valid_value_ratio": 1
  },
  "LAST_NAME": {
    "alternates": [],
    "recommendation": {
      "confidence": "MEDIUM",
      "coverage": 0.5,
      "details": [],
      "privacy_category": "IDENTIFIER",
      "semantic_category": "NAME"
    },
    "valid_value_ratio": 1
  },
  "POSTAL_CODE": {
    "alternates": []
  },
  "PROVINCE_STATE": {
    "alternates": [],
    "recommendation": {
      "confidence": "HIGH",
      "coverage": 0.8333333,
      "details": [
        {
          "coverage": 0.8333333,
          "semantic_category": "US_STATE_OR_TERRITORY"
        }
      ],
      "privacy_category": "QUASI_IDENTIFIER",
      "semantic_category": "ADMINISTRATIVE_AREA_1"
    },
    "valid_value_ratio": 1
  },
  "STREET_ADDRESS": {
    "alternates": [],
    "recommendation": {
      "confidence": "HIGH",
      "coverage": 1,
      "details": [
        {
          "coverage": 1,
          "semantic_category": "US_STREET_ADDRESS"
        },
        {
          "coverage": 1,
          "semantic_category": "CA_STREET_ADDRESS"
        }
      ],
      "privacy_category": "IDENTIFIER",
      "semantic_category": "STREET_ADDRESS"
    },
    "valid_value_ratio": 1
  }
}

Another way to look at the data

SELECT
    f.key::varchar as column_name,
    f.value:"recommendation":"privacy_category"::varchar as privacy_category,  
    f.value:"recommendation":"semantic_category"::varchar as semantic_category,
    f.value:"recommendation":"confidence"::varchar as confidence,
    f.value:"recommendation":"coverage"::number(10,2) as coverage,
    f.value:"details"::variant as details,
    f.value:"alternates"::variant as alternates
  FROM
  TABLE(FLATTEN(EXTRACT_SEMANTIC_CATEGORIES('customers')::VARIANT)) AS f;

Results:

DRIVERS_LICENSE field did not right away detect it as I had simply put fake data that did not follow the standard, but when I fixed it to match what a NY DMV number looks like or an Ontario, it was able to detect and classify it. I further added a tag to force things.

  --Lets try to fix Drivers License
  alter table customers
  modify column drivers_license
  set tag snowflake.core.semantic_category = 'DRIVERS_LICENSE';

Things to Note

Snowflake supports classifying data stored in all types of tables and views, including external tables, materialized views, and secure views. You can classify table and view columns for all supported data types except for the following data types: GEOGRAPHY, BINARY and VARIANT.

It does use compute power to classify. Use the following general guidelines to select a warehouse size:

No concern for processing time: x-small warehouse.
Up to 100 columns in a table: small warehouse.
101 to 300 columns in a table: medium warehouse.
301 columns or more in a table: large warehouse.

Shared Tags

This neat feature allows Tags to be set and seen by the consumer of shares; it becomes helpful in many contexts, including proper governance. The data-sharing provider can now share the tag and tagged object with the consumer. Tagging objects that require compliance, for example, Personal Identifiable-Information (PII) or Personal Healthcare-Information (PHI).

You can see how to implement here, for today, I will avoid showing you a technical demo as I feel it pretty straight forward on the documentation.

Let's say that contractually, you are forcing a BAA into the data consumer; this would be an easier way to propagate PHI data to the consumer without determining it manually. This way, the consumer can take their mandated precautions. This can be table or column-based tag values.

Imagine that you have data controls and governance using tags that flow from the raw data stage and continue through as data is shared to whatever partner, departments and teams that require it. The ability to share tags allow a much stronger posture across the data platform.

Conclusion

The benefits for an International Tag system are broad, you can influence the RBAC model, what data gets shared, and can drive policies, aka, what data to protect for example. You can see it is does a really good job at identifying data that fits the criteria but also it is a work in progress, so test things properly. For example, I could not get it to identify my drivers license field.

Coupled with Shared Tags, that can bring these tags into Snowflake Consumer accounts, and data sharing consumers. It is a super powerful way to track and maintain compliance needs for PII and PHI.

I’m Augusto Rosa, VP of Engineering for Infostrux Solutions. Thanks for reading my blog post. You can follow me on LinkedIn. Subscribe to Infostrux Medium Blogs https://medium.com/infostrux-solutions for the most interesting Data Engineering and Snowflake news.

Sources:

https://docs.snowflake.com/en/user-guide/data-sharing-provider#label-share-tag-references

DEV Community

New Features: Snowflake International Tags & Shared Tags

Newly Released Tag Support for International Tag Standard ISO-3166–1 alpha-2 and Ability to Share Tags

What is the standard ISO 3166–1 alpha-2?

What is Snowflake implementation?

Things to Note

Shared Tags

Conclusion

Sources:

Top comments (0)

Read next

Quick Tip: Secret Scopes For Azure Key Vaults

Handling Outliers 101: Why the IQR Method is Your Go-To Tool

The Importance of Data Quality in Effective Analytics

Transforming Data Linkage: An In-Depth Look at IntaLink