I recently passed the AWS Machine Learning Specialty exam and received quite a few questions on how to prepare for it, so I thought I would write up a few tips from a data scientist's perspective. This is by no means a comprehensive AWS ML exam guide; rather, I focus on the areas where I felt I could not afford to lose points, or that I often got wrong in practice exams.
The exam guide from AWS says you will be tested on four domains: Data Engineering, Exploratory Data Analysis, Modeling, and Machine Learning Implementation and Operations. The Modeling domain is worth 36% of the entire exam; it is the section that tests your practical machine learning knowledge in applying the right ML solution to a real-world business problem. Also, this being an AWS exam, it has a lot of questions on SageMaker and other AWS ML services.
I purchased this course on Udemy by Frank Kane and Stephane Maarek and found it to be very helpful. I actually found this exam required much less preparation than the AWS Solutions Architect Associate exam that I passed last spring, mainly because I work on ML problems on a daily basis as a data scientist, and I generally do not have to worry too much about setting up networks and security. So, if you have a background similar to mine, the AWS ML exam is worth taking; the preparation will help you get a better understanding of what AWS offers. Here I summarize five tips that may help you pass the ML exam:
1. Know your distributions
This is really basic statistics/machine learning knowledge. You will be tested on it, and you should be able to answer these questions quickly. A question will describe a scenario and ask you which distribution best describes it. As a statistics major, I found these questions the easiest, but that also means you should not lose any points on them, since other exam takers probably get full scores on them too.
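To make the scenario-to-distribution mapping concrete, here is a toy sanity check (not from the exam) using scipy.stats; all the numbers are made up for illustration:

```python
from scipy import stats

# Scenario: "a support desk gets on average 3 calls per minute; what is
# the probability of exactly 5 calls in a given minute?" -> Poisson
print(stats.poisson.pmf(k=5, mu=3))        # ~0.1008

# Scenario: "10 independent yes/no trials, each succeeding with p=0.2;
# probability of exactly 2 successes?" -> Binomial
print(stats.binom.pmf(k=2, n=10, p=0.2))   # ~0.3020

# Scenario: "time between arrivals when events occur at a rate of
# 3 per minute" -> Exponential (density evaluated at 0.5 minutes)
print(stats.expon.pdf(x=0.5, scale=1/3))
```

If you can read a scenario like these and immediately name the distribution, you are in good shape for this part of the exam.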
2. The Amazon Kinesis Family
If you are like me and more used to handling "offline" datasets, you may not have experience dealing with real-time data, which is what the AWS Kinesis family of tools is for. The Kinesis family of services, together with AWS Glue, makes up the majority of the Data Engineering domain (20%) of the ML exam. Here are a few key points that confused me in the beginning:
- Within the Kinesis family, only Kinesis Data Firehose can load streaming data into S3; it also provides data compression (for S3), as well as format conversion to Parquet/ORC.
- Kinesis Data Firehose cannot transform data by itself (such as CSV -> JSON), but it can do so by invoking an AWS Lambda function (see the sketch after this list).
- Kinesis Data Analytics is mainly used for real-time analytics via SQL; it offers two custom AWS ML SQL functions: RANDOM_CUT_FOREST (for anomaly detection) and HOTSPOTS (for identifying relatively dense regions).
- Kinesis Data Analytics uses IAM permissions to access streaming sources and destinations.
- There is also S3 Analytics, which is not to be confused with Kinesis Analytics; S3 Analytics is used for storage class analysis.
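To illustrate the second bullet, here is a minimal sketch of a Firehose record-transformation Lambda. The CSV layout ("user_id,score") is a made-up example, but the recordId/result/data response contract is what Firehose expects from a transformation function:

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["data"]).decode("utf-8")
        user_id, score = payload.strip().split(",")
        transformed = json.dumps({"user_id": user_id, "score": float(score)}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Firehose then buffers the transformed records and delivers them to S3, optionally compressing them or converting them to Parquet/ORC along the way, as noted in the first bullet.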
3. AWS Glue
AWS Glue is another AWS tool that I like; I just think it is so cool that it automatically crawls my Parquet files on S3 and generates a schema. When we have a large amount of Parquet data hosted on S3, it is very convenient to query the Glue-generated Athena tables for data exploration. I discovered a few more tricks that Glue can do while studying for the exam:
- In addition to standard data transformations such as DropFields, Filter, and Join, Glue also comes with a custom AWS algorithm, the FindMatches ML transform, which identifies potentially duplicate records even when the two records do not match exactly.
- Glue sets up elastic network interfaces to enable jobs to connect to other resources securely.
- Glue can run Spark jobs, and it supports both Python 2.7 and Python 3.6.
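As a rough illustration of that crawling workflow, here is a hypothetical boto3 sketch that points a Glue crawler at Parquet files on S3 so it infers a schema into the Data Catalog (which Athena can then query); the crawler name, role ARN, database, and bucket are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans a Parquet prefix on S3 and writes the
# inferred table schema into the Glue Data Catalog.
glue.create_crawler(
    Name="my-parquet-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",  # placeholder
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-bucket/events/"}]},
)

# Kick off the crawl; once it finishes, the table shows up in Athena.
glue.start_crawler(Name="my-parquet-crawler")
```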
4. Security
Again, your mileage may vary, but I am more used to being told by IT that I can't do this or that due to security concerns than to having to worry about it myself. While the ML exam does not have as many questions on security as the solutions architect exam, you don't want to lose points here. At a minimum, you will need to know how security works with Amazon S3, as well as security around using SageMaker and how to protect your data going to and from SageMaker.
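As an illustration (not an exhaustive checklist), the sketch below shows where the common security knobs live on a SageMaker training job; every ARN, bucket, and image URI is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="demo-encrypted-job",
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo:latest",
        "TrainingInputMode": "File",
    },
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
        }},
    }],
    OutputDataConfig={
        "S3OutputPath": "s3://my-bucket/output/",
        "KmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/my-key-id",  # encrypt model artifacts
    },
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
        "VolumeKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/my-key-id",  # encrypt the ML volume
    },
    VpcConfig={  # run training inside your own VPC
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    EnableNetworkIsolation=True,                 # container gets no outbound network access
    EnableInterContainerTrafficEncryption=True,  # encrypt traffic between training hosts
)
```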
5. Amazon SageMaker
I have some experience with Amazon SageMaker and love that it makes a few things very convenient, such as providing the most popular ML frameworks pre-installed in containers. I just don't love its cost that much, but I digress 😔. Amazon SageMaker, being the flagship fully managed AWS ML service, is tested heavily in the ML exam. Other than the security aspects I mentioned above, some of the things you will need to know include:
- Understand all of SageMaker's built-in algorithms; you will be tested on these! Also understand which algorithms can be sped up by multi-core, multi-instance, or GPU.
- SageMaker can only get its training data from S3, and there are Pipe mode and File mode; Pipe mode usually speeds things up when the data is very large.
- Understand the hyperparameter tuning options in SageMaker and how automatic model tuning works (see the sketch after this list).
- Know that using Amazon Elastic Inference (EI) with a SageMaker hosted endpoint can speed up inference, but you can only attach EI at instance launch, not after deployment.
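To tie Pipe mode and automatic model tuning together, here is a hedged sketch using the SageMaker Python SDK with the built-in XGBoost image; the role ARN and S3 paths are placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

session = sagemaker.Session()

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, version="1.5-1"
    ),
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",  # stream data from S3 instead of downloading it all first
    output_path="s3://my-bucket/output/",
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Automatic model tuning: SageMaker searches the ranges below,
# launching up to max_jobs training jobs to minimize validation RMSE.
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/val/"})
```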
Additional Tip: Take two exams for the price of one
From a cost perspective, I would also recommend taking the AWS Solutions Architect Associate certification exam before you take the Machine Learning Specialty one (that's what I did). The Solutions Architect Associate exam costs $150, and after you pass it, you will receive a coupon for 50% off your next exam. The Machine Learning Specialty exam is $300 and becomes $150 after applying the discount. So for $300 you could get both AWS certificates! Good luck!