At re:Invent 2022, AWS announced its new data governance solution, Amazon DataZone. The tool is currently in preview and is available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions.
Data governance is one of the hottest topics on the market today. Companies around the world have pointed to challenges in this area, such as:
- Efficient data cataloging;
- Discovery and documentation of data that give users autonomy in decision making;
- Data literacy — sharing data knowledge across the organization so that the operation becomes increasingly data-driven;
- Data quality — correct, reliable, and consistent data;
- Data availability — data available at all times, with fail-safe processing and serving pipelines.
In response, a pool of tools has appeared on the market with features that cover some of the challenges cited, especially those related to data cataloging. Informatica's tool is perhaps the best known among the licensed options. Among the open-source tools, I highlight DataHub (www.datahubproject.io), developed at LinkedIn, OpenMetadata (https://open-metadata.org/), and Amundsen (https://www.amundsen.io/), powered by Lyft. In addition to cataloging and discovering data artifacts, these tools provide a view of data lineage, include technical documentation and business terms, and build relationships between data artifacts. It is also possible to register data owners (the people responsible for the data) in these tools, which greatly facilitates the access request and evaluation process (today a major bottleneck).
I always tell my friends: "It was about time for AWS to launch its data governance tool!". DataZone comes with the main features needed for a good data catalog, in addition to security features and access management integrated into the AWS environment. Let's take a look.
The first step in using DataZone is creating a data domain (oriented to a business vertical, for example, sales, logistics, or finance). It is worth noting that AWS took great care to show all the access policies granted to DataZone, as the tool will have great power over the environment.
Policy details are available by clicking View permission details.
At this point, you can leverage IAM Identity Center (formerly AWS SSO) to grant domain access permissions. However, the domain must be deployed in the same Region as the Identity Center instance. In my case, as you can see, this was not possible.
After a few seconds, the domain is created and its access link becomes available. The tool's interface impresses with a beautiful design. The next step is to create a project within the domain. The tool suggests a default profile that comes with a native connection to Amazon Athena, AWS Glue, and Amazon S3.
Once the project is ready, we will publish data so it can be consulted in the tool. When the project was created, DataZone automatically created two databases in Amazon Athena: one with the _pub_db suffix and one with the _sub_db suffix. The "pub" database is used by data-producing teams to publish their tables. In my case, I already had some Glue crawlers configured to automatically map tables in S3: three tables from Brazil's 2019 Higher Education Census (Censo da Educação Superior) and one table from Brazil's 2017 National Student Performance Exam, all public datasets from the Ministry of Education. I simply edited the crawlers so that these tables landed in the "pub" database. After that, it is necessary to publish the data in DataZone: in the project interface, we click Publish Data and see the screen below:
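The crawler edit described above can also be done programmatically. Below is a minimal sketch with boto3; the crawler name and project prefix are hypothetical, and the only DataZone convention used is the _pub_db/_sub_db suffix on the databases the tool creates for a project.

```python
"""Point an existing Glue crawler at a DataZone project's "pub" database.

A minimal sketch. The crawler name and project prefix are hypothetical;
the only convention taken from DataZone is the _pub_db / _sub_db suffix
on the databases it creates for each project.
"""


def datazone_databases(project_db_prefix: str) -> dict:
    """Derive the publisher/subscriber database names DataZone creates."""
    return {
        "pub": f"{project_db_prefix}_pub_db",  # producers publish tables here
        "sub": f"{project_db_prefix}_sub_db",  # approved subscriptions land here
    }


def retarget_crawler(crawler_name: str, database_name: str, dry_run: bool = True) -> dict:
    """Build (and optionally apply) an update so the crawler writes its
    discovered tables into the given Glue database."""
    params = {"Name": crawler_name, "DatabaseName": database_name}
    if not dry_run:
        import boto3  # deferred import: only needed when actually calling AWS
        boto3.client("glue").update_crawler(**params)
    return params


if __name__ == "__main__":
    dbs = datazone_databases("censo2019")  # hypothetical project prefix
    print(retarget_crawler("censo-superior-crawler", dbs["pub"]))
```

With dry_run=False (and AWS credentials configured), the same parameters are sent to the Glue UpdateCrawler API, which is what editing the crawler in the console does under the hood.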
After filling in the settings, and once the job has executed successfully, we can access the imported tables from the "publishing" menu on the project page.
Note that all tables are still marked as drafts; that is, they are not yet visible in the catalog. Clicking a table name shows general information such as the S3 location where the data is stored, the Region, the data format, the table schema, and subscribers (if any). On the schema screen, we can edit the columns, giving each one a more readable name (the real column name is not changed and is still displayed in the interface) and a detailed description. A readable title and description can also be added to the table itself. After editing, it is necessary to click Set asset to active so that the table can be consulted by other users of the catalog.
Once activated, the tables are available for consultation on the Data catalog page and also via search.
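Since the published tables live in regular Athena databases, an active table can be queried like any other. A minimal sketch, assuming hypothetical database, table, and S3 results-bucket names:

```python
"""Query a table published in a DataZone project through Amazon Athena.

A minimal sketch; the database, table, and S3 output location below are
hypothetical placeholders for the names in your own project.
"""


def build_athena_query(database: str, table: str, limit: int = 10) -> dict:
    """Build the parameter set for athena.start_query_execution."""
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT {limit}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            # Athena requires an S3 location for query results.
            "OutputLocation": "s3://my-athena-results/datazone/",
        },
    }


def run_query(params: dict, dry_run: bool = True):
    """Submit the query via boto3 unless dry_run is set."""
    if not dry_run:
        import boto3  # deferred import: only needed when actually calling AWS
        return boto3.client("athena").start_query_execution(**params)
    return params


if __name__ == "__main__":
    q = build_athena_query("censo2019_pub_db", "censo_superior_2019")
    print(q["QueryString"])
```

Consumers would run the same kind of query against the tables in their project's "sub" database once their subscription is approved.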
To request access to these artifacts, users registered in other projects can open a subscription request as consumers (subscribers). From there, data publishers can approve the requests and set granular, per-table permissions for consumption.
DataZone is still in preview and is not intended by Amazon for production use. The tool already has several important data governance features, but there is still a lot to develop. Below, I note some pros and cons of the tool so far.
Pros:
- Easy integration with the AWS environment;
- Security and access management integrated with IAM Identity Center or IAM;
- Easy metadata ingestion;
- Creation of automated projects with CloudFormation (transparent to the user);
- Efficient data search;
- Glossary of business terms.
Cons:
- The tool does not yet implement data lineage, a very important feature for understanding how KPIs are built;
- It does not integrate with any data quality framework;
- It is not possible to register other data artifacts, such as dashboards, charts, and pipelines.
Amazon DataZone is still a tool under development, and it has enormous potential. I look forward to the next steps in the evolution of this promising tool.