Introduction
I sat the current AWS Certified Big Data - Specialty (BDS) exam and the new beta AWS Certified Data Analytics - Specialty (DAS-C01) exam within a 10-day period in December 2019. I passed the Big Data exam and am still awaiting the results of the Data Analytics exam. The Data Analytics beta was available to sit from December 10, 2019 to January 10, 2020. The Big Data exam is still current but will be discontinued and replaced by the Data Analytics exam in April 2020. Results for the beta exam will not be available until around that time.
There were many reasons for me to do this; personal challenge and technical development come to mind, but the main reason was as an information-gathering exercise. I work as a data engineering manager for a team that is currently migrating our on-premises stack to AWS. These certifications are on all our development paths. There are currently no generally available training materials for the Data Analytics exam, so the only way to give guidance to my team was to sit the exam myself and compare it to the materials available for the Big Data exam. This article reflects my own personal experience with the exams. It should not be used as a definitive guide to the differences between the two.
As published in the official AWS exam guides, here we can compare the domains covered by both exams.
Domain | Data Analytics | Big Data | Difference |
---|---|---|---|
Domain 1: Collection | 18% | 17% | +1% |
Domain 2: Storage and Data Management | 22% | 17% | +5% |
Domain 3: Processing | 24% | 17% | +7% |
Domain 4: Analysis and Visualization* | 18% | 29% | -11% |
Domain 5: Security | 18% | 20% | -2% |
The Big Data exam guide lists Analysis and Visualization as separate domains, weighted at 17% and 12% respectively. The Data Analytics exam guide merges them into a single domain.
The biggest shift is the reduction in weighting of the analysis domain. I was not asked any questions on Machine Learning, which I feel accounts for much of this reduction. The original Big Data exam was launched in 2017 and reflected the 3 types of ML models available in the Amazon Machine Learning service at that time, a limited service compared to the current offering. AWS's Machine Learning services became so comprehensive that AWS was able to release a new AWS Certified Machine Learning - Specialty beta exam at re:Invent 2018. All Machine Learning topics have shifted to that exam, which allows the new Data Analytics exam to focus deeper on 5 consolidated domains rather than spreading across 6. The upshot for anyone sitting the new Data Analytics exam is that they now need deeper knowledge of a smaller set of domains to achieve a pass.
The beta exam consisted of 85 questions over 4 hours. It was easy to book, just follow the same process on https://www.aws.training/ as for booking all other AWS exams. As it's a beta exam, it was available to book at a 50% discount up to January 10th. My result won't be available until 90 days after this date. Standard availability for the new exam is expected in April 2020.
In the following sections, I will document my experience with the Data Analytics exam and the differences with the Big Data exam. I will summarize my thoughts under each domain.
Domain 1: Collection
Based on the questions I got in the exam, you absolutely need to know Kinesis. I would say there were easily more than 10 questions that required in-depth knowledge of Kinesis. While Kinesis Data Streams and Kinesis Data Firehose both featured prominently in the materials for the Big Data exam, a big difference in the Data Analytics exam was the focus on Kinesis Data Analytics (KDA). There were several questions on KDA and where and how it integrates with Firehose, Data Streams and Lambda. Knowing when to use KDA versus Lambda is important.
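As a rough illustration of the plumbing these questions assume, here is a minimal sketch of a producer writing to a Kinesis Data Stream with boto3; the stream name and event shape are hypothetical. KDA or a Lambda consumer would then read from this stream.

```python
# Minimal sketch: writing records to a Kinesis Data Stream with boto3.
# The stream name "clickstream-events" and the event payload are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"user_id": "u-123", "action": "page_view", "ts": "2019-12-10T10:00:00Z"}

kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # controls which shard receives the record
)
```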
The collection of application logs came up. Understanding how you can integrate the CloudWatch Logs agent with other services like Kinesis, and the use case for storing logs in S3, would also gain you a couple of points.
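A hedged sketch of that integration: a CloudWatch Logs subscription filter that forwards a log group to a Kinesis stream. The log group name, stream ARN and role ARN are all hypothetical.

```python
# Minimal sketch: subscribe a CloudWatch Logs log group to a Kinesis stream.
# Assumes an IAM role that allows CloudWatch Logs to write to the stream.
import boto3

logs = boto3.client("logs", region_name="eu-west-1")

logs.put_subscription_filter(
    logGroupName="/app/orders-service",
    filterName="ship-to-kinesis",
    filterPattern="",  # an empty pattern forwards every log event
    destinationArn="arn:aws:kinesis:eu-west-1:123456789012:stream/app-logs",
    roleArn="arn:aws:iam::123456789012:role/cwl-to-kinesis",
)
```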
I got 2 questions on Amazon Managed Streaming for Apache Kafka (MSK): one on a specific setting that needed to be changed after a new topic was added, the other around integrating KDA with MSK. You can't.
The best way to load data into Redshift also came up in a number of questions. It's a columnar database, so you need to look for options around bulk loading of data, not single-row inserts. COPY from S3 is always the recommended option. Compressing the files in S3 will speed up ingestion into Redshift, and loading multiple files instead of one big file is better. Redshift consists of nodes and those nodes consist of slices; one file per slice, or a multiple thereof, is the fastest way to load data. How to use a manifest file with the COPY command also came up.
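Something like the following sketch captures the pattern the exam is looking for: a single COPY from a manifest of compressed, pre-split files rather than row-by-row inserts. The cluster, table, bucket and IAM role are hypothetical, and psycopg2 is just an example client.

```python
# Minimal sketch of a bulk load into Redshift with COPY and a manifest file.
# Connection details, table, IAM role and S3 paths are all hypothetical.
import psycopg2

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/sales/load.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    MANIFEST
    GZIP
    DELIMITER ',';
"""

# The manifest lists gzip-compressed files, ideally split into a multiple of
# the cluster's slice count so every slice loads in parallel.
with psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="example-password"
) as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)
```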
Collection of data via the IoT service came up a lot in the training materials for the Big Data exam. I did not get any questions on IoT in the Data Analytics exam.
Specific tips
- For windowing use cases, pick KDA over Lambda.
- Kinesis Data Streams does not integrate directly with any data storage. If one of the options in a question has Data Streams writing data directly to S3/Redshift/Elasticsearch/Splunk, it is a wrong option. Firehose does integrate directly with these destinations and would be the only option in this case (see the delivery stream sketch after this list).
- KDA works only with Data Streams and Firehose as a source. It integrates with Data Streams, Firehose and Lambda as a destination.
- COPY from S3 is always the best method to load data into Redshift.
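For the Firehose tip above, a hedged sketch of a delivery stream that reads from a Data Stream and lands the data in S3; all names and ARNs are hypothetical.

```python
# Minimal sketch: a Firehose delivery stream with a Kinesis Data Stream as its
# source and S3 as its destination.
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/clickstream-events",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-stream",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::my-data-lake",
        "Prefix": "raw/clickstream/",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 64},
        "CompressionFormat": "GZIP",
    },
)
```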
Domain 2: Storage and Data Management
There were a number of questions on the best file format for storage in S3. I always went with Parquet over any other option: the columnar format over the row-based ones. Also, you should flatten JSON files before storing them for use with Athena and the Glue Data Catalog.
Using the Glue Data Catalog came up in a number of questions, both as part of the scenario and as an answer option. In my opinion, the Glue Data Catalog should always be used over the Hive metastore. You also need to understand how the Data Catalog gets refreshed, either automatically via Glue crawlers or manually via API calls to refresh the catalog or create new partitions.
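A minimal sketch of both refresh paths, assuming a hypothetical crawler, database, table and Athena output location:

```python
# Minimal sketch of the two catalog refresh paths: start a Glue crawler, or run
# MSCK REPAIR TABLE through Athena to register new partitions.
import boto3

glue = boto3.client("glue", region_name="eu-west-1")
athena = boto3.client("athena", region_name="eu-west-1")

# Option 1: let a crawler rescan S3 and update the Glue Data Catalog.
glue.start_crawler(Name="clickstream-crawler")

# Option 2: pick up new partitions for a table you manage yourself.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE clickstream_raw",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```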
Choosing an option for storing application logs came up. If the scenario called for real-time data with visualisations, there was more than likely an Elasticsearch-with-Kibana option. If a more cost-effective option was needed, storing in S3 with analysis via Athena would work. I saw Glue crawlers come up in this scenario where the schema of the logs was not fixed: crawlers could be used to crawl the logs and update the Glue Data Catalog.
Redshift is a great option for storing fixed-schema data if you need to perform complex aggregations and analytics on it. Understanding how to structure your tables in terms of distribution and sort keys for performance will help with a lot of questions. Redshift Spectrum allows you to query data stored in S3 from within your Redshift cluster via external tables. This lets you keep newer, hotter data within Redshift and still archive older data off to S3 without losing access to it. This scenario came up a couple of times and it's easy enough to answer if you keep this in mind: newer data in Redshift, older data in S3 accessible via Redshift Spectrum.
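A minimal sketch of that pattern, with hypothetical table, schema and IAM role names: a hot fact table with distribution and sort keys inside the cluster, plus an external schema over the Glue Data Catalog for the colder data in S3.

```python
# Minimal sketch of the "hot data in Redshift, cold data in S3" pattern.
# Table, schema, database and IAM role names are hypothetical.
ddl_statements = [
    # Fact table held inside the cluster: KEY distribution plus a sort key.
    """
    CREATE TABLE sales_recent (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);
    """,
    # External schema over the Glue Data Catalog; older data stays in S3 but
    # remains queryable through Redshift Spectrum.
    """
    CREATE EXTERNAL SCHEMA spectrum_sales
    FROM DATA CATALOG
    DATABASE 'analytics'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
]


def apply_ddl(cursor):
    """Run the statements with any DB-API cursor connected to Redshift."""
    for stmt in ddl_statements:
        cursor.execute(stmt)
```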
EMR as an option for storing data came up. I think there are a number of reasons for choosing EMR over S3 or Redshift:
- You have a large scale on-premise Hadoop cluster and you don't want to rewrite all your code. Use EMR with EMRFS in this instance. Use Consistent View if it makes sense. It generally does.
- You are using a very specific Apache project that needs tight integration with HDFS.
- Scale. If you are processing terabytes of data per day or storing multiple petabytes of data in general, EMR will be a better option than Redshift.
I don't recall a scenario where you wouldn't use EMRFS, but if you had to use HDFS on EBS volumes, it would be for high-throughput, low-latency processing on direct-attached storage.
Specific tips
- When facing a choice between S3 or Redshift for storing data, watch out for phrases like complex analytics/queries or aggregations. This should indicate Redshift as the answer.
- Redshift distribution keys: use KEY distribution for fact tables and ALL for dimension tables that don't update regularly.
- Redshift Spectrum is a good option if you want to analyse data stored across Redshift and S3. It is not a standalone service: you cannot call it without having a Redshift cluster. Therefore, if an option suggests using only Redshift Spectrum without a Redshift cluster, it's the wrong option.
- In the previous Big Data exam, there was a big focus on DynamoDB. In the Data Analytics exam, there was one specific question out of 85 on DynamoDB. It was in reference to using DynamoDB streams to fork a feed of data out to Redshift. For myself, while I enjoyed learning about DynamoDB, it always felt a bit strange to be in a Big Data exam. It is covered extensively in the Developer exam and maybe AWS have realised that it fits better there.
- EMRFS Consistent View uses a DynamoDB table in the background to track the state of S3 objects so that listings stay consistent. There was one question about dealing with a consistency issue that looked like it was actually an issue with DynamoDB throughput: too many objects being listed for the default capacity allocated to the metadata table (see the configuration sketch after this list).
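If the fix is indeed the metadata table's capacity, a hedged sketch would be to raise it through the emrfs-site classification when launching the cluster. The property names below are the documented consistent view settings; the cluster details are illustrative only.

```python
# Hedged sketch: raise the EMRFS consistent view metadata table's provisioned
# read/write capacity via the emrfs-site configuration classification.
# Cluster name, release, instance settings and roles are hypothetical.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

emrfs_config = [{
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.consistent": "true",
        "fs.s3.consistent.metadata.read.capacity": "800",
        "fs.s3.consistent.metadata.write.capacity": "250",
    },
}]

emr.run_job_flow(
    Name="etl-cluster",
    ReleaseLabel="emr-5.28.0",
    Configurations=emrfs_config,
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```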
Domain 3: Processing
There were a lot of questions on AWS Glue in relation to transforming data files in S3 in the following scenarios:
- Too many small files in S3. Use Glue to merge them into bigger files.
- Flattening JSON files and converting them to Parquet format. Glue is the perfect option for this if the files are already at rest in S3 (see the job sketch after this list).
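A minimal sketch of such a Glue (PySpark) job, assuming a hypothetical catalog table and S3 paths: read the JSON, flatten it with Relationalize and write Parquet back to S3.

```python
# Minimal sketch of a Glue PySpark job that flattens nested JSON registered in
# the Glue Data Catalog and writes it out as Parquet. Database, table and S3
# paths are hypothetical.
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

raw = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="clickstream_json"
)

# Relationalize flattens nested JSON; "root" is the flattened top-level frame.
flattened = Relationalize.apply(
    frame=raw, staging_path="s3://my-temp-bucket/relationalize/", name="root"
).select("root")

glue_context.write_dynamic_frame.from_options(
    frame=flattened,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/clickstream/"},
    format="parquet",
)
```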
Also with Glue, you need to understand when to use PySpark versus the Python shell. There were some low-level questions that would have required hands-on experience with Glue. These were in relation to the use of DynamicFrames for holding data from multiple sources in memory and joining them together. There was also a question on performance tuning a Glue process by tuning specific named parameters.
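For the DynamicFrame join scenario, a minimal sketch; database, table and key names are hypothetical.

```python
# Minimal sketch: join two Glue Data Catalog sources in memory as DynamicFrames.
from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

orders = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="orders"
)
customers = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="customers"
)

# Join the two frames on their customer key before writing out the result.
enriched = Join.apply(
    frame1=orders, frame2=customers, keys1=["customer_id"], keys2=["customer_id"]
)
```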
There were a lot of questions on Redshift performance. Basically, there were a number of scenario-based questions where multiple use cases are competing for resources on a Redshift cluster. Based on the scenario, you needed to pick between WLM (auto or manual), Short Query Acceleration, Concurrency Scaling or even an elastic resize to add and remove capacity as the best option. The scenario could be data scientists running large complex queries competing with data analysts who just want to run short queries, or batch jobs running at the same time every day as the CEO's latest report. To be honest, I struggled with these questions and I am not sure if I picked correctly. I would say there were at least 5 questions in this area, so it's definitely worth diving into.
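To make those options more concrete, here is a hedged sketch of applying a manual WLM setup with a short-query queue and concurrency scaling through the cluster parameter group. The queue layout and parameter group name are hypothetical, and the JSON keys should be checked against the WLM documentation.

```python
# Hedged sketch: a manual WLM configuration with a short-query queue and
# concurrency scaling, applied via the cluster parameter group.
import json
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

wlm_config = [
    {"query_group": ["dashboards"], "query_concurrency": 10,
     "concurrency_scaling": "auto"},          # analysts' short queries can burst
    {"user_group": ["data_scientists"], "query_concurrency": 3},
    {"short_query_queue": True},              # Short Query Acceleration
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-wlm",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```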
Separately but related, there were two questions on when to use classic versus elastic resize for Redshift. One was around changing node type due to a change in business requirements; the other was on expanding the cluster online without any downtime.
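A hedged sketch of both calls via boto3, with a hypothetical cluster identifier and sizes:

```python
# Minimal sketch of the two resize paths. Elastic resize keeps the cluster
# available; classic resize is needed when the node type changes.
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

# Elastic resize: same node type, add capacity with minimal downtime.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NumberOfNodes=8,
    Classic=False,
)

# Classic resize: required when switching node types.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    Classic=True,
)
```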
Specific tips
- Glue is a batch tool and cannot be used for immediate transformation of row-level data. If the scenario calls for that, you're looking at the Lambda option, specifically for Kinesis or trigger events on S3. Glue doesn't integrate directly with Kinesis but can consume data written to S3 by Kinesis Data Firehose.
- Use classic resize if changing Redshift node types. Use elastic resize if you are not changing node type and need to maintain uptime of cluster.
- Glue is a good option for calling ML APIs. Redshift is not.
Domain 4: Analysis and Visualization
There were a lot of questions on QuickSight, some specific to the types of visualisations that are available. I had a question on visualising geospatial data points, for which the QuickSight map visualisation was an option. Tableau on EC2 also came up in one question.
Presto was an option for one of the questions I was asked where you needed to run a short query against multiple data sources.
There was one question where the scenario was to use EMR for analysis. While it didn't say it specifically, I took this as the use of either Jupyter Notebook or Apache Zeppelin within EMR as the option.
Specific tips
- The previous Big Data exam and training materials would have covered non-AWS topics like D3.js and Highcharts. They did not come up in my Data Analytics exam.
Domain 5: Security
Understand the encryption options for data in S3 and which ones give the customer more control and/or management responsibility. Also understand where you would use S3-managed encryption versus KMS. SSE = server-side encryption = data is encrypted by AWS after it arrives. CSE = client-side encryption = you encrypt the data before you send it to AWS. How to manage KMS keys across regions also came up.
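A minimal sketch of the server-side options when writing an object, with hypothetical bucket, key and KMS key values:

```python
# Minimal sketch of the server-side encryption choices when writing to S3.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# SSE-S3: AWS manages the keys entirely.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2019-12-10.json.gz",
    Body=b"example payload",
    ServerSideEncryption="AES256",
)

# SSE-KMS: you control (and can audit usage of) the KMS key.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2019-12-11.json.gz",
    Body=b"example payload",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:eu-west-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)
```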
There were a number of questions on how to manage multi-region access to data in S3. This was in relation to the AWS Glue Catalog and Athena. If you had data in two regions, how do you make that accessible in one Athena or Glue Data Catalog service? Cross-region replication was generally an option but is there a way to give Athena in one region access to S3 data in another?
Securing access to S3 data for Quicksight was also a topic. If you are using Quicksight with Athena and the Glue Data Catalog for reporting on S3 data, at what layer do you enforce security?
Securing data in flight and at rest in EMR also came up. There were questions on Kerberos and integration with Active Directory for executing specific tasks on the EMR cluster. These were very difficult and would require a deeper dive than I can cover here.
There was one question on setting up encryption on the EBS root volumes used in an EMR cluster. Two of the options were manually installing HDFS on EC2 or using a bootstrap script on each node when it starts. I went with the option of creating a new AMI that could be reused each time a new node is added to the cluster. That way it would happen by default whenever a node was added, through auto-scaling or otherwise.
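A hedged sketch of that approach: copy a base Amazon Linux AMI with EBS encryption enabled, then point the EMR cluster at it via CustomAmiId. AMI IDs, the KMS key and cluster settings are illustrative only.

```python
# Hedged sketch: create an encrypted copy of a base AMI and launch EMR from it,
# so every node (including auto-scaled ones) boots with an encrypted root volume.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
emr = boto3.client("emr", region_name="eu-west-1")

copy = ec2.copy_image(
    SourceImageId="ami-0abc1234def567890",   # base Amazon Linux AMI
    SourceRegion="eu-west-1",
    Name="emr-base-encrypted",
    Encrypted=True,
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)

# In practice, wait for the image copy to become available before launching.
emr.run_job_flow(
    Name="secure-cluster",
    ReleaseLabel="emr-5.28.0",
    CustomAmiId=copy["ImageId"],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```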
A scenario around auditing all actions on Redshift also came up. There were some complicated answer options for this one around Enhanced VPC Routing and VPC Flow Logs but Redshift has an audit logging feature that would fulfil it. Sometimes the obvious answer is the easiest.
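A minimal sketch of enabling that audit logging, with hypothetical cluster, bucket and prefix names:

```python
# Minimal sketch: switch on Redshift audit logging to S3 rather than reaching
# for VPC-level options.
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

redshift.enable_logging(
    ClusterIdentifier="analytics-cluster",
    BucketName="my-audit-logs",
    S3KeyPrefix="redshift/analytics-cluster/",
)
```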
Conclusion
As it's a beta exam, I won't get my results until 90 days from the end of the beta period in April 2020. Before I started to write all this down, I was hopeful of a pass as it did seem easier than the Big Data exam. Now, I am not so sure. However, I won't know for certain until I get my score, and I promise to post an update once I do. If I don't pass, I'll definitely be taking it again.
For me, I believe the new certification is a good move by AWS. I will be recommending that my team wait for the standard availability of the Data Analytics certification exam. There will hopefully be AWS and 3rd party training resources for the exam by then. As a data engineering team, the new exam is definitely more focused on the tools we currently use or plan to use as part of our migration to AWS this year. The differentiation between data engineering and machine learning allows teams to specialise more rather than generalise.
One suggestion I would have is to split the Data Analytics certification even further. My team focuses solely on data engineering, we work with tools like Redshift, Glue, S3 and Kinesis. We prepare data for secure consumption by services like Athena, Tableau, Quicksight and Presto but we don’t use them. I think there is scope for a further split along these lines. A separate Data Engineering Certification to focus on the Collection, Storage and Data Management and Processing domains and a Data Analytics Certification to focus on the Analysis and Visualisation domain could be an option for the future. Both should focus on Security as everybody’s watching.
Top comments (17)
Hi Tom,
Thanks for writing this post. I found it really interesting. I've been studying for the Big Data one, as I'm a solution Architect who focuses on the data side, and will be taking it in March or April.
I was wondering, apart from using AWS and having hands-on experience, is there any theory or documentation that you used (there's no official study guide book)?
I've been focusing more on the theory around these services:
For collection (Kinesis streams/firehose), storage (S3/Dynamodb), processing (EMR/SageMaker/Lambda/Glue), Analysis (ES Kibana/Redshift/Athena), Visualization (Quicksight) and security (IAM/KMS/Cloudtrail).
Did you focus on more or any specific services in more detail?
Thanks for your insight.
Hi Lawrie,
I used the Big Data Specialty preparation course on the acloud.guru site. There is a similar course on Linux Academy. To be honest, I couldn't imagine how you could pass the exam without the help of one of these courses.
You have a good list there. In addition, the A Cloud Guru and Linux Academy courses also cover SQS, IoT, Data Pipeline and AWS ML (multiclass vs binary vs regression models). I didn't get any questions myself on IoT or Data Pipeline but that doesn't mean you shouldn't study them.
Do you know that the Big Data exam will be closed off in April and replaced by the Data Analytics exam?
aws.amazon.com/blogs/big-data/high...
It may be worth waiting for this one and new courses to come online instead of pursuing the Big Data exam.
Thanks,
Tom
Hi Tom,
Thanks a lot for your reply and helpful links.
Lawrie
Congratulations, this article is priceless and I'm absolutely sure it will help everyone with these tips. Thanks a lot!
A question for you:
When I see a question talking about an application in "near real time", can I choose Kinesis Firehose, or, since the Kinesis Producer Library operates with "mini batches", is it also near real time?
Hi Marcel, delighted to hear it helps. That was my main reason for writing it.
A bit confused by your question. I think you're asking for the difference between Kinesis Firehose and Kinesis Data Streams. Either can be considered a near-real-time option. The Kinesis Producer Library is used to put data on Kinesis Data Streams; the Kinesis Client Library or Kinesis Firehose can then be used to consume data from the same stream. I don't believe you can use KPL with Kinesis Firehose.
Hi Tom,
I forgot to write after I took the exam in early May. So, I decided to go for the Big Data Specialty one and passed. I took it at home via Pearson VUE because of the current situation. I realise now that I should have done what you did and also taken the Data Analytics one within 2 weeks, because I would've still been in exam mode.
What helped me was a couple of resources: Stefan's and Frank's courses on Udemy, acloudguru, AWS FAQs/re:Invent and AWS online training videos, and the O'Reilly online Big Data Specialty course by Noah Gift. For the practice exams I used BrainCert, which was brilliant because the practice questions were really helpful.
What came up was: Redshift (no Spectrum), DynamoDB/DAX/Streams, EMR, Kinesis Streams/Firehose (no KDA), Lambda, ES, QuickSight, S3 and KMS/IAM.
(No Glue, SageMaker, IoT, AML, Data Pipeline, or Amazon MSK.)
Thanks for this article Tom.
Does anyone use Python, SQL, Java, open source or only aws for the data pipelines?
And has anyone taken the AWS DB specialty, SA Pro or Security?
Good Luck All,
Lawrie
Hi Nick
B - there are no associate level exams related to data/data analysis but the Solutions Architect Associate exam would give you a very good understanding of AWS architectures underpinning Redshift, EMR and other Data services. You will learn about VPCs, IAM and S3 just to name a few
C - I was referring to any training resources your employer/university/other may have available. I am lucky in that my employer makes a number of training services available to us
D - I have never used Qwiklabs, only Pluralsight, LinuxAcademy, Cloudguru and AWS's own tutorials
Thanks,
Tom
Hi Nick, thanks for reading my article. Glad you found it helpful. In answer to your questions
1) I think if you study hard and complete the practice, you could pass. The practice is crucial. You need to get hands-on with the services to ensure you pass
2) In relation to the materials, I think we will all have to wait a little longer. The AWS Certified Big Data - Specialty is being replaced in April 2020 with the AWS Certified Data Analytics - Specialty exam. I haven't seen any materials from any of the normal vendors for the Data Analytics course yet.
aws.amazon.com/blogs/big-data/high....
If you have access to any of the vendors for free, run through the AWS Certified Big Data - Specialty courses. It will definitely help but it won't be enough on its own for you to pass. Otherwise, concentrate on learning EMR, Redshift and AWS Glue and getting hands-on with those. Do you have any other certifications that you are thinking of doing? It might be worth looking at one of the Associate-level exams before taking on the Specialty
If you have any more questions, just let me know
Tom
They recently extended the expiration date for the AWS Certified Big Data - Specialty exam to July 1, 2020. When you try to schedule that exam it shows the new expiration date
Any opinion if I should take the current exam before the expiration?
Yep, I think with the current situation, taking the Big Data - Speciality exam could make sense. Supporting materials for Data Analytics could be delayed now with the lockdown in place. Let me know how it goes for you.
Thanks,
Tom
Great post, found it quite detailed and thorough. Thanks for sharing your insights. Planning to give it a shot in the coming days!
Thank you for sharing your experience. I've found it very helpful.
Thanks Vitali, good to hear
Just got my results from AWS and I passed. Despite the current global situation, I can't help but be happy with that. Stay safe everyone and wash your hands
Congrats! Tom, since the AWS Big Data Specialty certification is extended to July 1, 2020, would you recommend the Big Data or the Data Analytics certification, in terms of exam success?
Congratulations on passing the exam