I sat the current AWS Certified Big Data - Specialty (BDS) and the new beta AWS Certified Data Analytics - Specialty (DAS-C01) exams within a 10 day period in December 2019. I passed the Big Data exam and I am still awaiting the results of the Data Analytics exam. Data Analytics was available to sit as a beta exam from December 10, 2019 to January 10, 2020. The Big Data exam is still current but will be discontinued and replaced by the Data Analytics exam in April 2020. Results for the beta exam will not be available until near this time.
There were many reasons for me to do this, personal challenge and technical development come to mind but the main reason was as an information gathering exercise. I work as a data engineering manager for a team that is currently migrating our on-premise stack to AWS. These certifications are on all our development paths. There are currently no generally available training materials for the Data Analytics exam. Therefore, the only way to give guidance to my team was to sit the exam myself and compare it to the materials available for the Big Data exam. This article reflects my own personal experience with the exams. It should not be used as a definitive guide to the differences between the two.
As published in the official AWS exam guides, here we can compare the domains covered by both exams.
| Domain | Data Analytics | Big Data | Difference |
|---|---|---|---|
| Domain 1: Collection | 18% | 17% | +1% |
| Domain 2: Storage and Data Management | 22% | 17% | +5% |
| Domain 3: Processing | 24% | 17% | +8% |
| Domain 4: Analysis and Visualization* | 18% | 29% | -11% |
| Domain 5: Security | 18% | 20% | -2% |
The Big Data exam guide lists Analysis and Visualisation as separate domains weighted as 17% and 12% respectively. The Data Analytics exam guide merges them into one domain.
The biggest shift is the reduction of weighting on the analysis domain. I was not asked any questions on Machine Learning, which I feel accounts for much of this reduction. The original Big Data exam was launched in 2017 and reflected the 3 types of ML models available in the AWS ML service at that time. That was a limited service compared to their current offering. Their Machine Learning services became so comprehensive that AWS was able to release a new AWS Certified Machine Learning - Specialty beta exam at re:Invent 2018. All Machine Learning topics have shifted to this exam, which allows the new Data Analytics exam to focus deeper on 5 consolidated domains rather than spread across 6. The upshot for anyone sitting the new Data Analytics exam is that they now need a deeper knowledge of a smaller set of domains to achieve a pass.
The beta exam consisted of 85 questions over 4 hours. It was easy to book, just follow the same process on https://www.aws.training/ as for booking all other AWS exams. As it's a beta exam, it was available to book at a 50% discount up to January 10th. My result won't be available until 90 days after this date. Standard availability for the new exam is expected in April 2020.
In the following sections, I will document my experience with the Data Analytics exam and the differences with the Big Data exam. I will summarize my thoughts under each domain.
Based on the questions I got in the exam, you absolutely need to know Kinesis. I would say there were easily more than 10 questions that required in-depth knowledge of Kinesis. While Kinesis Data Streams and Kinesis Data Firehose both featured prominently in the materials for the Big Data exam, a big difference in the Data Analytics exam was the focus on Kinesis Data Analytics (KDA). There were several questions on KDA and where and how it integrates with Firehose, Data Streams and Lambda. Knowing when to use KDA versus Lambda is important.
The collection of application logs came up. Understanding how you can integrate the Cloudwatch logs agent with other services like Kinesis and the use case for storing logs in S3 would also gain you a couple of points.
I got 2 questions on Managed Streaming for Apache Kafka (MSK), one on a specific setting that needed to be changed after a new topic was added. The other was around integrating KDA with MSK. You can't.
The best way to load data into Redshift also came up in a number of questions. It's a columnar database so you need to look for options around bulk loading of data, not single row inserts. Copy from S3 is always the recommended option. Compressing the files in S3 will speed up ingestion into Redshift. And loading multiple files instead of one big file is better. Redshift consists of nodes and those nodes consist of slices. One file per slice or a multiple thereof is the fastest way to load data. How to use the manifest file with the Copy command also came up.
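To illustrate the manifest point above: a COPY manifest is just a JSON document in S3 listing the exact objects to load. A minimal sketch (bucket and key names are hypothetical):

```python
import json

# A Redshift COPY manifest lists the exact S3 objects to load.
# Bucket and key names here are hypothetical.
manifest = {
    "entries": [
        # "mandatory": true makes the COPY fail if the file is missing.
        {"url": "s3://my-bucket/data/part-0000.gz", "mandatory": True},
        {"url": "s3://my-bucket/data/part-0001.gz", "mandatory": True},
    ]
}

# Upload this JSON to S3, then point COPY at it with the MANIFEST keyword:
#   COPY my_table FROM 's3://my-bucket/data.manifest'
#   IAM_ROLE '...' GZIP MANIFEST;
print(json.dumps(manifest, indent=2))
```

Splitting the load into multiple files like this, ideally a multiple of the cluster's total slice count, is what lets every slice ingest in parallel.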
Collection of data via the IoT service came up a lot in the training materials for the Big Data exam. I did not get any questions on IoT in the Data Analytics exam.
- For windowing use cases, pick KDA over Lambda.
- Kinesis Data Streams does not integrate directly with any data storage. If one of the options in a question has Data streams writing data directly to S3/Redshift/Elasticsearch/Splunk, it is a wrong option. Firehose does integrate directly with these data destinations and would be the only option in this case.
- KDA works only with Data Streams and Firehose as a source. It integrates with Data Streams, Firehose and Lambda as a destination.
- Copy from S3 is always the best method to load data into Redshift.
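The takeaways above can be condensed into a small lookup table. This is my personal exam cheat sheet encoded as code, not an exhaustive AWS integration matrix:

```python
# Which destinations each Kinesis component can write to directly.
# Reflects the exam takeaways above, not an exhaustive AWS matrix.
DIRECT_DESTINATIONS = {
    "Kinesis Data Streams": {"Kinesis Data Analytics", "Kinesis Data Firehose", "Lambda"},
    "Kinesis Data Firehose": {"S3", "Redshift", "Elasticsearch", "Splunk"},
    "Kinesis Data Analytics": {"Kinesis Data Streams", "Kinesis Data Firehose", "Lambda"},
}

def writes_directly(source, destination):
    """Return True if `source` can deliver to `destination` without extra services."""
    return destination in DIRECT_DESTINATIONS.get(source, set())

# Data Streams cannot write straight to S3 -- that's Firehose's job.
assert not writes_directly("Kinesis Data Streams", "S3")
assert writes_directly("Kinesis Data Firehose", "S3")
```

If an exam option has Data Streams writing directly to a storage destination, this lookup would flag it as wrong.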
There were a number of questions on the best file format for storage in S3. I always went with Parquet over any other option, the columnar option over the row-based options. You should also flatten JSON files before storing them for use with Athena and the Glue Data Catalog.
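To illustrate what flattening means here, a minimal sketch in plain Python (at scale this is the kind of thing Glue's Relationalize transform handles for you):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted top-level keys,
    e.g. {"a": {"b": 1}} becomes {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

nested = {"user": {"id": 7, "geo": {"lat": 53.3, "lon": -6.2}}, "event": "click"}
print(flatten(nested))
# {'user.id': 7, 'user.geo.lat': 53.3, 'user.geo.lon': -6.2, 'event': 'click'}
```

A flat schema maps cleanly onto the columns Athena and the Glue Data Catalog expect, which is the point of doing this before converting to Parquet.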
Using the Glue Data Catalog came up in a number of questions, both as part of the scenario and as an answer option. In my opinion, the Glue Data Catalog should always be used over the Hive Data Catalog. You also need to understand how the Data Catalog gets refreshed either automatically via Glue crawlers or manually via API calls to refresh catalog or create new partitions.
Choosing an option for storing application logs came up. If the scenario called for real-time data with visualisations, there was more than likely an Elasticsearch with Kibana option. If a more cost-effective option was needed, storing in S3 with analysis via Athena would work. I saw the Glue crawlers come up in this scenario where the schema of the logs was not fixed. Crawlers could be used to crawl the logs and update the Glue Data Catalog.
Redshift is a great option for storing fixed-schema data if you need to perform complex aggregations and analytics on the data. Understanding how to structure your tables in terms of distribution and sort keys for performance will help with a lot of questions. Redshift Spectrum allows you to query data stored in S3 directly from your Redshift cluster via external tables. This lets you keep newer, hotter data within Redshift and still archive data off to S3 without losing access to it. This scenario came up a couple of times and it's easy enough to answer if you keep this in mind: newer data in Redshift, older data in S3 accessible with Redshift Spectrum.
EMR as an option for storing data came up. I think there are a number of reasons for choosing EMR over S3 or Redshift:
- You have a large scale on-premise Hadoop cluster and you don't want to rewrite all your code. Use EMR with EMRFS in this instance. Use Consistent View if it makes sense. It generally does.
- You are using a very specific Apache project that needs tight integration with HDFS.
- Scale. If you are processing terabytes of data per day or storing multiple petabytes of data in general, EMR will be a better option than Redshift.
I don't recall a scenario where you wouldn't use EMRFS but if you had to use HDFS on EBS volumes, it would be for high throughput, low latency processing on direct attached storage.
- When facing a choice between S3 or Redshift for storing data, watch out for phrases like complex analytics/queries or aggregations. This should indicate Redshift as the answer.
- Redshift distribution keys. Use Key for fact tables and All for dimension tables that don't update regularly.
- Redshift Spectrum is a good option if you want to analyse data stored in Redshift and S3. It is not a standalone service in that you cannot call it without having a Redshift cluster. Therefore if the option is to only use Redshift Spectrum without utilising a Redshift cluster, it's the wrong option.
- In the previous Big Data exam, there was a big focus on DynamoDB. In the Data Analytics exam, there was one specific question out of 85 on DynamoDB. It was in reference to using DynamoDB streams to fork a feed of data out to Redshift. For myself, while I enjoyed learning about DynamoDB, it always felt a bit strange to be in a Big Data exam. It is covered extensively in the Developer exam and maybe AWS have realised that it fits better there.
- EMRFS Consistent View uses DynamoDB in the background to track the state of S3 objects and detect consistency issues. There was one question around how to deal with a consistency issue that was actually a DynamoDB throughput problem: too many objects were being listed for the default provisioned capacity to handle.
There were a lot of questions on AWS Glue in relation to transforming data files in S3 in the following scenarios:
- Too many small files in S3. Use Glue to merge them into bigger files.
- Flattening JSON files and converting them to Parquet format. Glue is the perfect option for this if the files are already at rest in S3.
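In a Glue job, merging small files usually comes down to repartitioning before the write, but the underlying grouping idea is simple enough to sketch on its own. The sizes and the 128 MB target below are illustrative, not Glue defaults:

```python
def plan_merges(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of roughly target_mb each.
    In Glue/Spark this is effectively what repartition/coalesce achieves;
    this function just demonstrates the batching idea."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 1000 x 1 MB files collapse into 8 output files of up to 128 MB each.
print(len(plan_merges([1] * 1000)))  # 8
```

Fewer, larger files means fewer S3 GET requests and less per-file overhead for Athena, Redshift Spectrum and downstream Glue jobs.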
Also with Glue, you need to understand the scenarios for when to use PySpark or the Python Shell. There were some low-level questions that would have required hands-on experience with Glue. These were in relation to the use of DynamicFrames for holding data from multiple sources in memory and joining them together. There was also a question on performance tuning a Glue process by tuning specific named parameters.
There were a lot of questions on Redshift performance. Basically, there were a number of scenario based questions where multiple use cases are competing for resources on a Redshift cluster. Based on the scenario, you needed to pick between WLM (auto or manual), Short Query Acceleration, Concurrency Scaling or even an Elastic Resize to add and remove capacity as the best option. The scenario could be Data Scientists running large complex queries, competing with Data Analysts who just want to run short queries. Or your batch jobs that are running at the same time every day when your CEO needs his latest report. To be honest, I struggled with these questions and I am not sure if I picked correctly. I would say there were at least 5 questions in this area so it's definitely worth diving into.
Separately but related, there were two questions on when to use classic versus elastic resize for Redshift. One was around changing node type due to a change in business requirements. The other was on expanding the cluster online without any downtime.
- Glue is a batch tool and cannot be used for immediate transformation of row level data. If the scenario calls for that, you're looking at the Lambda option in that case. This can be specifically for Kinesis or trigger events on S3. Glue doesn't integrate directly with Kinesis but could consume data written to S3 by Kinesis Data Firehose.
- Use classic resize if changing Redshift node types. Use elastic resize if you are not changing node type and need to maintain uptime of cluster.
- Glue is a good option for calling ML APIs. Redshift is not.
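The resize rule of thumb above can be reduced to a single check. This encodes my exam heuristic as it stood at the time, not official AWS guidance:

```python
def redshift_resize_type(changing_node_type):
    """Exam heuristic for Redshift resizes: classic resize when the node
    type changes; elastic resize when only adding or removing nodes of
    the same type, which keeps the cluster available during the resize."""
    return "classic" if changing_node_type else "elastic"

assert redshift_resize_type(changing_node_type=True) == "classic"
assert redshift_resize_type(changing_node_type=False) == "elastic"
```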
There were a lot of questions on Quicksight, some specific to the types of visualisations that are available. I had a question on the visualisation of geo data points for which the Quicksight map graph was an option. Tableau on EC2 also came up in one question.
Presto was an option for one of the questions I was asked where you needed to run a short query against multiple data sources.
There was one question where the scenario was to use EMR for analysis. While it didn't say it specifically, I took this as the use of either Jupyter Notebook or Apache Zeppelin within EMR as the option.
- The previous Big Data exam and training materials would have covered non-AWS topics like D3.js and Highcharts. They did not come up in my Data Analytics exam.
Understand the encryption options for data in S3 and which ones give the customer more control and/or management. Also understand where you would use S3-managed encryption versus KMS. SSE = Server Side Encryption = encryption of data directly on AWS. CSE = Client Side Encryption = encryption before you send data to AWS. How to manage KMS keys between regions also came up.
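As a sketch of how those options surface in practice, here is how the server-side choices map onto boto3 `put_object` keyword arguments (the parameter names are the real boto3 ones; the key alias is hypothetical):

```python
def s3_encryption_args(mode, kms_key_id=None):
    """Map an S3 server-side encryption choice onto boto3 put_object kwargs.
    SSE-S3 uses AWS-managed keys; SSE-KMS uses a customer-managed KMS key,
    giving more control and CloudTrail auditability. Client-side encryption
    (CSE) happens before upload, so it needs no put_object arguments."""
    if mode == "SSE-S3":
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}
    raise ValueError(f"unknown mode: {mode}")

print(s3_encryption_args("SSE-KMS", "alias/my-hypothetical-key"))
```

The rough exam logic: the further down the list from SSE-S3 to SSE-KMS to CSE with customer keys, the more control (and responsibility) the customer takes on.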
There were a number of questions on how to manage multi-region access to data in S3. This was in relation to the AWS Glue Catalog and Athena. If you had data in two regions, how do you make that accessible in one Athena or Glue Data Catalog service? Cross-region replication was generally an option but is there a way to give Athena in one region access to S3 data in another?
Securing access to S3 data for Quicksight was also a topic. If you are using Quicksight with Athena and the Glue Data Catalog for reporting on S3 data, at what layer do you enforce security?
Securing data in flight and at rest in EMR also came up. There were questions on Kerberos and integration with AD for executing specific tasks on the EMR cluster. These were very difficult and would require a more in-depth deep dive to cover this area.
There was one question on setting up encryption on the EBS root volumes used in an EMR cluster. Two of the options were manually installing HDFS on EC2 or using a bootstrapped script on each node when it starts. I went with the option of creating a new AMI that could be re-used each time a new node is added to the cluster. This way it would happen by default each time a node was added through auto-scaling or otherwise.
A scenario around auditing all actions on Redshift also came up. There were some complicated answer options for this one around Enhanced VPC Routing and VPC Flow Logs but Redshift has an audit logging feature that would fulfil it. Sometimes the obvious answer is the easiest.
As it's a beta exam, I won't get my results until 90 days from the end of the beta period in April 2020. Before I started to write all this down, I was hopeful of a pass as it did seem easier than the Big Data exam. Now, I am not so sure. However, I won't know for certain until I get my score. I promise to update once I get my score. If I don't pass, I'll definitely be taking it again.
For me, I believe the new certification is a good move by AWS. I will be recommending to my team to wait for the standard availability of the Data Analytics certification exam. There will hopefully be AWS and 3rd party training resources for the exam by then. As a data engineering team, the new exam is definitely more focused on the tools we currently use or plan to use as part of our migration to AWS this year. The differentiation between data engineering and machine learning allows teams to specialise more rather than generalise.
One suggestion I would have is to split the Data Analytics certification even further. My team focuses solely on data engineering, we work with tools like Redshift, Glue, S3 and Kinesis. We prepare data for secure consumption by services like Athena, Tableau, Quicksight and Presto but we don’t use them. I think there is scope for a further split along these lines. A separate Data Engineering Certification to focus on the Collection, Storage and Data Management and Processing domains and a Data Analytics Certification to focus on the Analysis and Visualisation domain could be an option for the future. Both should focus on Security as everybody’s watching.