Lester Sim

Posted on Oct 10, 2022

AWS Machine Learning Certification: Exam Notes

#aws #cloud #machinelearning #ai

Disclaimer: The opinions expressed here are my own and I'm not writing on behalf of AWS or Amazon.

The AWS Machine Learning - Specialty Certification covers a wide spectrum of topics from data engineering to exploratory data analysis to model training and deployment. Here are some quick notes I've gathered to prepare for the certification:

AWS AI Services

Useful for developers who want to add AI to their applications through API calls instead of building and training their own ML models from scratch.

Amazon Textract

Extract text from scanned documents using Optical Character Recognition (OCR).

Documents

Returns text, forms, tables and query responses.

Expenses

Extracts data from invoices and receipts, e.g. vendor name, invoice/receipt date, invoice/receipt number, item name, item price, item quantity, total amount.

Amazon Comprehend

Extract entities, key phrases, language, personally identifiable information (PII), and sentiment from text.

Entities

Extract entities from text documents, e.g. people, organizations, locations.

Key Phrases

Extract the key phrases (one or more words) from text documents.

Sentiment

Predict the overall sentiment of the text - positive, negative, neutral, mixed.
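As a rough sketch of the API route: boto3's Comprehend client has a `detect_sentiment` call that returns a `Sentiment` label plus a `SentimentScore` per class. The helper name and the stub client below are illustrative assumptions so the snippet runs without AWS credentials; with credentials you would pass `boto3.client("comprehend")` instead of the stub.

```python
def dominant_sentiment(client, text, language_code="en"):
    """Return (label, confidence) for the overall sentiment of `text`."""
    response = client.detect_sentiment(Text=text, LanguageCode=language_code)
    label = response["Sentiment"]  # POSITIVE, NEGATIVE, NEUTRAL or MIXED
    # SentimentScore keys are capitalised: Positive, Negative, Neutral, Mixed
    score = response["SentimentScore"][label.capitalize()]
    return label, score


class StubSentimentClient:
    """Hypothetical stand-in that mimics the shape of the real response."""

    def detect_sentiment(self, Text, LanguageCode):
        return {
            "Sentiment": "POSITIVE",
            "SentimentScore": {
                "Positive": 0.97, "Negative": 0.01, "Neutral": 0.01, "Mixed": 0.01,
            },
        }


label, score = dominant_sentiment(StubSentimentClient(), "I love this service")
print(label, score)
```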

Language

Predict the dominant language of the entire text. Amazon Comprehend can recognize 100 languages.

Personally Identifiable Information (PII)

List the entities in your input text that contain personal information, e.g. address, bank account number, or phone number.
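One common use of the PII results is redaction. The sketch below assumes the response shape of boto3's `detect_pii_entities` (a list of entities with `Type`, `BeginOffset` and `EndOffset`, no text), and assumes `EndOffset` behaves like an exclusive slice end. The helper and the stub client are hypothetical, added only so the example runs offline.

```python
def redact_pii(client, text, language_code="en"):
    """Replace every detected PII span with its entity type in brackets."""
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    # Work from right to left so earlier offsets stay valid after each replacement.
    entities = sorted(response["Entities"], key=lambda e: e["BeginOffset"], reverse=True)
    for entity in entities:
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + "[" + entity["Type"] + "]" + text[end:]
    return text


class StubPiiClient:
    """Hypothetical stand-in: flags one hard-coded phone number."""

    def detect_pii_entities(self, Text, LanguageCode):
        phone = "555-0100"
        start = Text.index(phone)
        return {"Entities": [{
            "Type": "PHONE", "Score": 0.99,
            "BeginOffset": start, "EndOffset": start + len(phone),
        }]}


print(redact_pii(StubPiiClient(), "Call me on 555-0100 tomorrow"))
# Call me on [PHONE] tomorrow
```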

Vision

Amazon Rekognition

Analyze images and videos to identify objects, people, text, scenes, and activities.

Label Detection

Extract labels of objects, concepts, scenes, and actions in your images.

Facial Analysis

Detect faces in an image and retrieve facial attributes, e.g. facial expressions, accessories, and facial features.

Face Comparison

Compare a face in one image with the faces in another. The largest face in the source image (the reference face) is compared with up to 100 faces detected in the target image (the comparison faces), and a similarity score is generated for each match.
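A minimal sketch of reading those similarity scores, assuming boto3's Rekognition `compare_faces` call and its `FaceMatches` response list. The helper name, the stub client, and its scores are made up for illustration; a real call would use `boto3.client("rekognition")` and actual image bytes.

```python
def matching_faces(client, source_bytes, target_bytes, threshold=90.0):
    """Return similarity scores (highest first) for target faces that match."""
    response = client.compare_faces(
        SourceImage={"Bytes": source_bytes},
        TargetImage={"Bytes": target_bytes},
        SimilarityThreshold=threshold,  # matches below this are not returned
    )
    return sorted((m["Similarity"] for m in response["FaceMatches"]), reverse=True)


class StubRekognition:
    """Hypothetical stand-in with three invented candidate scores."""

    def compare_faces(self, SourceImage, TargetImage, SimilarityThreshold):
        candidates = [99.1, 42.0, 93.5]
        matches = [{"Similarity": s, "Face": {}} for s in candidates if s >= SimilarityThreshold]
        return {"FaceMatches": matches, "UnmatchedFaces": []}


print(matching_faces(StubRekognition(), b"source", b"target"))
```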

Other AWS AI Services

Amazon Lex: Build conversational interfaces using voice/text as input
Amazon Polly: Text to speech
Amazon Transcribe: Speech to text
Amazon Translate: Translate text between languages

Domain 1: Data Engineering

AWS Glue

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

Serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources.
Data Sources: S3, RDS, JDBC, DynamoDB, Kinesis Data Streams, Apache Kafka
Data Targets: S3, RDS, JDBC
Crawlers: Automatically infer database and table schema from your source data, storing the associated metadata in the AWS Glue Data Catalog.
ETL Programming Languages: PySpark (Python), Scala
FindMatches Transform: Use this machine learning transform to identify duplicate or matching records, e.g. matching customer or product records, or improving fraud detection.

Amazon Athena

https://docs.aws.amazon.com/athena/latest/ug/what-is.html

Serverless, interactive query service for analyzing big data in Amazon S3 using standard SQL.
Integration with AWS Glue: AWS Glue crawlers automatically infer database and table schema from data in S3 and store the associated metadata in AWS Glue Data Catalog. This catalog lets the Athena query engine know how to find, read, and process the data you want to query.
When to use Amazon Athena vs Redshift vs EMR: https://docs.aws.amazon.com/athena/latest/ug/when-should-i-use-ate.html

Amazon Kinesis

https://docs.aws.amazon.com/kinesis/index.html

Kinesis Video Streams

Stream live video data, optionally store it, and make the data available for consumption both in real time and on a batch or ad hoc basis.

Kinesis Data Streams

Collect and process large streams of data records in real time.

Consumers (reading from a data stream): Kinesis Data Analytics, Kinesis Data Firehose, Lambda, EC2

Kinesis Data Firehose

ETL service that captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services. Buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations.
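The size-or-time buffering described above can be pictured with a toy model. This is only an illustration of the idea, not how Firehose is implemented; the class name, thresholds, and the explicit `now` timestamps are all assumptions made to keep the example deterministic.

```python
class BufferedDelivery:
    """Toy buffer that flushes when it is big enough or old enough."""

    def __init__(self, max_bytes, max_seconds, deliver):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.deliver = deliver        # callback that receives one batch (a list)
        self.batch = []
        self.batch_bytes = 0
        self.batch_started = None     # timestamp of the first record in the batch

    def put(self, record, now):
        if self.batch_started is None:
            self.batch_started = now
        self.batch.append(record)
        self.batch_bytes += len(record)
        self._flush_if_due(now)

    def tick(self, now):
        """Called periodically so a quiet stream still gets flushed on time."""
        self._flush_if_due(now)

    def _flush_if_due(self, now):
        if not self.batch:
            return
        too_big = self.batch_bytes >= self.max_bytes
        too_old = now - self.batch_started >= self.max_seconds
        if too_big or too_old:
            self.deliver(self.batch)
            self.batch, self.batch_bytes, self.batch_started = [], 0, None


delivered = []
buffer = BufferedDelivery(max_bytes=10, max_seconds=60, deliver=delivered.append)
buffer.put(b"aaaa", now=0)      # 4 bytes buffered, nothing delivered yet
buffer.put(b"bbbbbbb", now=1)   # 11 bytes: size threshold reached, batch delivered
buffer.put(b"cc", now=2)        # a new batch starts at t=2
buffer.tick(now=62)             # 60 s since t=2: time threshold reached, batch delivered
print(delivered)
```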

Use Lambda to transform each buffered batch before delivery. Firehose can also convert the record format, e.g. from JSON to Apache Parquet, which is more efficient to query.
Delivery Stream Destination: S3, Redshift, OpenSearch Service (formerly Elasticsearch Service), Splunk, HTTP endpoint, etc.
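For the Lambda transformation step, Firehose hands the function a batch of base64-encoded records and expects each one back with the same `recordId`, a `result` of `Ok`, `Dropped` or `ProcessingFailed`, and the re-encoded `data`. Here is a minimal sketch of such a handler; the payload fields (`temperature_f`, `temperature_c`) and the sample event are invented for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: convert Fahrenheit to Celsius.
            fahrenheit = payload.pop("temperature_f")
            payload["temperature_c"] = round((fahrenheit - 32) * 5 / 9, 1)
            line = json.dumps(payload) + "\n"   # newline-delimited JSON for the destination
            data = base64.b64encode(line.encode()).decode()
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, KeyError, TypeError):
            # Hand the record back untouched and mark it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


def _encode(text):
    return base64.b64encode(text.encode()).decode()


sample_event = {"records": [
    {"recordId": "1", "data": _encode('{"sensor": "a", "temperature_f": 212}')},
    {"recordId": "2", "data": _encode("not json")},
]}
result = lambda_handler(sample_event, None)
print([r["result"] for r in result["records"]])
```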

Kinesis Data Analytics

Continuously read and analyze data from a connected streaming source in real time.

Source: Kinesis Data Streams, Kinesis Data Firehose
Destination: 1/ Kinesis Data Streams, 2/ Kinesis Data Firehose, 3/ Lambda
Runtime: SQL, Apache Flink
Aggregate/Analytical Functions: Hotspots, Random Cut Forest, etc

Domain 2: Exploratory Data Analysis

Data Labelling: Amazon SageMaker Ground Truth (data labeling service using human annotators from Amazon Mechanical Turk or your own private workforce)
Feature Engineering: one-hot encoding, binning, outlier handling, normalization, PCA dimensionality reduction. For text: TF-IDF, Bag of Words, N-Gram.
Know the different types of data visualization: Histogram, scatter plot, box plot, correlation heatmap, hierarchical plot, etc.
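To make a few of the feature engineering terms above concrete, here is a small pure-Python sketch. The function names are my own, and the TF-IDF variant shown (term frequency as count divided by document length, IDF as log(N/df)) is just one of several common definitions; libraries such as scikit-learn use smoothed versions.

```python
import math
from collections import Counter


def one_hot(values):
    """One column per category (sorted), 1 where the value matches."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def min_max(values):
    """Rescale to [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def bin_index(value, edges):
    """Number of bin edges at or below the value, i.e. the bin it falls in."""
    return sum(value >= e for e in edges)


def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


print(one_hot(["red", "green", "red"]))       # [[0, 1], [1, 0], [0, 1]]
print(min_max([10, 20, 30]))                  # [0.0, 0.5, 1.0]
print(bin_index(37, [18, 35, 65]))            # 2
print(ngrams(["the", "quick", "fox"], 2))     # ['the quick', 'quick fox']
print(tf_idf([["aws", "ml"], ["aws", "exam"]]))
```

Note that "aws" gets a weight of 0 in the last example because it appears in every document, which is the point of the IDF term.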

Domain 3: Modelling

https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

Supervised Learning Algos: XGBoost, k-NN, Linear Learner, DeepAR Forecasting, Object2Vec
Unsupervised Learning Algos: K-Means, PCA, Random Cut Forest
Text Analysis Algos: BlazingText, Sequence-to-Sequence, LDA, Neural Topic Model (NTM)
Image Processing Algos: Image Classification (MXNet, TensorFlow), Object Detection, Semantic Segmentation (pixel level)
Evaluation of ML Models: Confusion Matrix, AUC-ROC, Accuracy, Precision, Recall, F1 Score, RMSE
Overfitting Solutions: 1/ Use fewer features, 2/ Decrease n-gram size, 3/ Increase amount of regularization used, 4/ Increase amount of training data examples
Underfitting Solutions: 1/ Add new domain-specific features, 2/ Add more feature Cartesian products, 3/ Increase n-gram size, 4/ Decrease amount of regularization used, 5/ Increase amount of training data examples
Hyperparameter Tuning: Random Search, Bayesian Search
How SageMaker Studio works: https://aws.amazon.com/blogs/machine-learning/dive-deep-into-amazon-sagemaker-studio-notebook-architecture/
SageMaker Studio Notebooks vs SageMaker Notebook Instances: https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-comparison.html
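The evaluation metrics listed above all fall out of the confusion matrix counts (or, for RMSE, the prediction errors). A minimal sketch for binary labels follows; the function names and sample labels are illustrative, and in practice a library such as scikit-learn would normally be used.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def rmse(y_true, y_pred):
    """Root mean squared error for regression outputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))   # (2, 1, 1, 4)
print(classification_metrics(y_true, y_pred))
print(rmse([3, 5], [1, 5]))
```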

Domain 4: Machine Learning Implementation & Operations

Real-time Inference: Create an HTTPS endpoint if your apps need a persistent endpoint to call for inferences.
Batch Transform: Preprocess datasets and run inference on large datasets without needing a persistent endpoint.
SageMaker Neo: Automatically optimizes machine learning models for inference on cloud instances and edge devices to run faster with no loss in accuracy.
SageMaker Elastic Inference (EI): Speed up the throughput and decrease the latency of real-time inferences from deep learning models deployed as SageMaker hosted models, at a fraction of the cost of using a GPU instance for your endpoint.
Track and monitor SageMaker metrics using: 1/ AWS Console, 2/ CloudWatch, 3/ SageMaker Python SDK APIs
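For real-time inference, an app typically calls the endpoint through the `sagemaker-runtime` client's `invoke_endpoint`. The sketch below assumes a model container that accepts a CSV row and returns JSON, which depends entirely on the model you deploy. The endpoint name, the helper, and the stub client are hypothetical; a real call would use `boto3.client("sagemaker-runtime")`.

```python
import io
import json


def predict(runtime_client, endpoint_name, features):
    """Send one CSV row to the endpoint and parse a JSON reply."""
    body = ",".join(str(x) for x in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    # The response body is a stream, so it has to be read before parsing.
    return json.loads(response["Body"].read())


class StubRuntime:
    """Hypothetical stand-in: 'predicts' the sum of the features."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        total = sum(float(x) for x in Body.split(","))
        return {"Body": io.BytesIO(json.dumps({"score": total}).encode())}


print(predict(StubRuntime(), "my-endpoint", [1.5, 2.5]))
```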

This is only a brief summary of the core topics I found to be important and definitely not exhaustive. Please refer to https://aws.amazon.com/certification/certified-machine-learning-specialty/ for the full set of topics to prepare.
