DEV Community

Introduction to SageMaker Ground Truth

The truth is out there, well, most of the time. In cases where it's not, use Amazon SageMaker Ground Truth. Let's understand why.

Labeled data is an essential ingredient for particular forms of machine learning, specifically supervised learning algorithms. During the training phase, a supervised learning algorithm measures the accuracy of the model by generating predictions and comparing them to the known labels associated with the data. A typical example of this is image classification. An image classification model is trained on labeled images, where each image carries one or more labels indicating what it contains, for example a person, car, dog, or cat.

MNIST, CIFAR-10, and ImageNet are all examples of publicly available datasets that have already been labeled and are often used for training. During the training phase, each predicted class for an image is checked against the label associated with that image. Iterations, or epochs, of training continue until the predictions reach a desired level of accuracy.
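The check against known labels can be sketched in a few lines of Python. This is only an illustration, not a real training setup: `train_one_epoch` and `predict` are hypothetical callables standing in for whatever framework you use, and the target accuracy and epoch limit are arbitrary example values.

```python
from typing import Callable, Sequence, Tuple


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return the fraction of predictions that match their known labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def train_until_accurate(
    train_one_epoch: Callable[[], None],   # hypothetical: runs one pass of training
    predict: Callable[[str], str],         # hypothetical: image -> predicted label
    images: Sequence[str],
    labels: Sequence[str],
    target: float = 0.95,                  # example value only
    max_epochs: int = 50,                  # example value only
) -> Tuple[int, float]:
    """Train epoch by epoch until accuracy reaches the target or epochs run out."""
    acc = 0.0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        acc = accuracy([predict(image) for image in images], labels)
        if acc >= target:
            return epoch, acc
    return max_epochs, acc
```

In practice the accuracy would be measured on held-out labeled data rather than on the training images, but the comparison against known labels is the same.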

To date, the process of labeling has been time consuming, with limited tooling to aid the job. To help expedite and improve the experience, Amazon SageMaker Ground Truth has been added to the SageMaker portfolio. Ground Truth is a labeling service that provides both automatic and human workforce labeling features. To get started with Ground Truth, you:

  1. upload your unlabeled dataset into an S3 bucket.
  2. create your manifest file with pointers to each of the images.
  3. place the manifest file within the same S3 bucket.
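As a rough sketch of the manifest step, the snippet below builds a manifest as a JSON Lines document, with one `source-ref` entry per image pointing at its S3 location. The bucket name and object keys in the usage example are hypothetical placeholders.

```python
import json
from typing import Iterable


def build_manifest(bucket: str, image_keys: Iterable[str]) -> str:
    """Return manifest text with one JSON object per line, one line per image."""
    lines = [
        json.dumps({"source-ref": f"s3://{bucket}/{key}"})
        for key in image_keys
    ]
    # Terminate every line with a newline; an empty input yields an empty manifest.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical bucket and keys, for illustration only.
    print(build_manifest("my-labeling-bucket", ["images/cat1.jpg", "images/dog1.jpg"]), end="")
```

The resulting text can then be saved as a file and uploaded to the same bucket as the images.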

Using the Ground Truth console, create a Labeling Workforce. A Labeling Workforce represents the human workforce that performs the labeling itself. There are currently three options:

  • Public: A team of global on-demand workers, powered by Amazon Mechanical Turk.
  • Private: A team of workers from your organization.
  • Vendor: A selection of experienced vendors that specialize in providing data labeling services.

Finally, we are ready to create a labeling job. A labeling job represents the actual labeling exercise that you need performed. The key configuration settings to specify are:

  • Job Name
  • Input Dataset Location: the S3 bucket location of the manifest file
  • Output Dataset Location: an S3 bucket location to receive the labeling data
  • Dataset Object Selection: lets you label the entire dataset, a random sample, or a filtered selection of the data
  • Task Type: selected from a list that includes Image Classification, Bounding Box, Text Classification, and Semantic Segmentation, or you can use your own Custom Task Type
  • Workers: the human workforce required to perform the job
  • Bounding Box Labeling Tool: where you configure the UI labeling tool that the workers will use, including helper text in the form of instructions and guidance
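The same settings can also be supplied programmatically. The sketch below only assembles a request dictionary; the field names follow my understanding of the boto3 SageMaker `create_labeling_job` request and should be checked against the current API reference. Every ARN, S3 URI, title, and numeric value is a hypothetical placeholder.

```python
from typing import Any, Dict


def build_labeling_job_request(
    job_name: str,
    manifest_s3_uri: str,          # Input Dataset Location (manifest file)
    output_s3_uri: str,            # Output Dataset Location
    role_arn: str,                 # IAM role Ground Truth assumes
    workteam_arn: str,             # the chosen workforce
    label_categories_s3_uri: str,  # JSON file listing the label categories
    ui_template_s3_uri: str,       # worker UI template with instructions
    pre_lambda_arn: str,           # pre-annotation Lambda for the task type
    consolidation_lambda_arn: str, # annotation consolidation Lambda
) -> Dict[str, Any]:
    """Assemble the keyword arguments for a labeling job request."""
    return {
        "LabelingJobName": job_name,
        "LabelAttributeName": "label",  # assumed attribute name for the results
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_s3_uri}}
        },
        "OutputConfig": {"S3OutputPath": output_s3_uri},
        "RoleArn": role_arn,
        "LabelCategoryConfigS3Uri": label_categories_s3_uri,
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": ui_template_s3_uri},
            "PreHumanTaskLambdaArn": pre_lambda_arn,
            "TaskTitle": "Classify the image",                      # example text
            "TaskDescription": "Pick the label that best fits.",    # example text
            "NumberOfHumanWorkersPerDataObject": 3,                 # example value
            "TaskTimeLimitInSeconds": 300,                          # example value
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": consolidation_lambda_arn
            },
        },
    }


# To submit the job (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_labeling_job(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect or version before anything is sent to AWS.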

Okay, so now that the labeling job has been created, the chosen workforce will be invited to begin the process of labeling. Notifications are provided in the form of an email containing the URL to the Ground Truth labeling tool. If automated labeling has been enabled for your job, Ground Truth learns from the labels the workforce provides and labels the data it is confident about itself, sending the rest to the workforce. Otherwise, the configured human workforce will use the Ground Truth tooling to perform all of the labeling.

When the labeling job has completed, the job owner or requester can view each image with its assigned label within the Ground Truth console. Finally, each image label is written to an output manifest file in the output location, recorded against the corresponding image.
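Once the labels have been written to a manifest, they can be read back with a few lines of Python. This is a minimal sketch, assuming a single-label image classification job whose labeled manifest is a JSON Lines file in which each record has a `source-ref` and a `<label-attribute>-metadata` object containing a `class-name`. The attribute name `label` and the sample records are hypothetical.

```python
import json
from typing import Dict


def read_labels(manifest_text: str, label_attribute: str) -> Dict[str, str]:
    """Map each image's S3 URI to its class name, skipping unlabeled records."""
    labels: Dict[str, str] = {}
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        metadata = record.get(f"{label_attribute}-metadata", {})
        if "class-name" in metadata:
            labels[record["source-ref"]] = metadata["class-name"]
    return labels


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = (
        '{"source-ref": "s3://my-labeling-bucket/images/cat1.jpg", '
        '"label": 0, "label-metadata": {"class-name": "cat"}}\n'
        '{"source-ref": "s3://my-labeling-bucket/images/dog1.jpg"}\n'
    )
    print(read_labels(sample, "label"))
```

Records without label metadata are skipped rather than treated as errors, so a partially labeled manifest can still be read.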


