DEV Community

Jimmy Guerrero for Voxel51

Posted on • Originally published at Medium

Annotation is dead

Author: Jason Corso (Professor of Robotics and EECS at University of Michigan | Co-Founder and Chief Science Officer @ Voxel51)


Depiction of annotation is dead (Generated by DALL-E).

I remember manually painting semantic class labels pixel-by-pixel — “ok, this is a bicycle…still, a bicycle, now the road…”

It was mid-spring 2010 and I was sitting in my library-basement lab with a small group of my graduate students. We had decided that the motion in the earlier CamVid dataset was too simple for the general study of semantic video understanding, before this was a problem people cared about. So, we had selected a different dataset, one that was already widely used and had diverse motion. We were going to be creating the first pixel-wise labeled semantic video understanding dataset with more than just foreground-background labels (we used the 24 classes from MSRC). Exciting. But the actual work of annotating. So exhausting. Pixel-by-pixel. Hours of labor for each quite short video (see Figure 1 below for some examples).

We weren’t alone, however. The need for annotation was very strong.

Annotation has been a key need for many people in machine learning for nearly two decades. Not only is it the bread and butter of supervised machine learning, it has a compelling "social" academic element as well. Sometime around the release of the Caltech-101 dataset, maybe before, it seems to have become socially accepted that benchmarks and challenge problems, built from a mix of well-curated datasets and appropriate evaluation metrics, were important elements of academic research. This evolution ultimately led to hundreds of datasets, in computer vision alone, and has played a significant role in the current AI boom.

I often argue that annotation companies were the first wave of startups able to capitalize on the new era of unstructured AI. Annotation is a well-defined problem. It addresses a key need in supervised machine learning, and it is so onerous that no engineer or scientist wants to do it themselves (modulo me and my research group circa 2010, apparently!). And it has clearly helped catalyze much of the recent growth we are seeing in AI; data is at least half the challenge, after all.

Yet, I think that sector is in trouble. And it's not a bad thing, either. I mean, ultimately, did we really need such large annotated datasets in the first place? Humans are pretty dang capable, and we learn without anything close to that volume of labeled examples. Let me explain.

Disclaimer: This article is my opinion only. I do not work in annotation, sell annotation, or have any other stake in annotation. I am co-founder and chief scientist at a company that is adjacent to annotation, however: Voxel51. A full bio appears at the end of this article.


Figure 1 Here is a snapshot of the output of that laborious manual annotation effort for semantic video labels from 2010. The semantic dataset is still available.

What is annotation and why was it important?

Annotation describes the work of manually creating the desired output of a yet-to-be-developed machine learning system. Basically, one decides on the raw data input, such as an image or a text snippet. For a collection of these raw data inputs, one then manually creates the desired output, such as a classification of the image as a sports field or an operating room, detection-based bounding boxes of all pedestrians at an intersection, or perhaps whether the text snippet has a gerund in it. Given this "labeled data," one can then train the parameters of a machine learning method, which is essentially a function that takes an instance of the raw data input and automatically produces the desired output. This is classically known as supervised machine learning. No, it is not the only way to do machine learning, but for a long time, it has been the dominant way.
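To make that loop concrete, here is a toy sketch of supervised learning: the "annotation" is the 0/1 label attached to each raw input, and training adjusts parameters until the function reproduces those labels. This is a minimal perceptron, purely illustrative, not any particular production method.

```python
# Minimal sketch of supervised learning: labeled pairs (input, desired output)
# are used to fit the parameters of a function mapping input -> output.

def train_perceptron(data, epochs=20, lr=0.1):
    """data: list of ((x1, x2), label) pairs with label in {0, 1}."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            pred = 1 if (w1 * x1 + w2 * x2 + b) > 0 else 0
            err = y - pred  # the supervision signal: label minus prediction
            w1 += lr * err * x1
            w2 += lr * err * x2
            b += lr * err
    return w1, w2, b

def predict(params, x):
    w1, w2, b = params
    return 1 if (w1 * x[0] + w2 * x[1] + b) > 0 else 0

# The "annotation" here is the 0/1 label attached to each raw input.
labeled = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.9, 1.0), 1), ((1.0, 0.8), 1)]
params = train_perceptron(labeled)
```

Every supervised pipeline, however large, shares this shape: someone had to produce those labels first.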

The full process of annotation has three parts.

  1. Reducing the machine learning problem
    Annotation usually begins with machine learning experts reducing a machine learning problem statement (like road-scene understanding to support decision-making in autonomous vehicles), together with a pre-acquired dataset or a technical description of the apparatus that will acquire one, down to a concrete labeling task. The experts then generate a protocol that very clearly specifies what to label, how to label it, and the format of the labels (an amazing nuisance to practitioners). For example: classify each frame of a video into one of k classes, such as highway road, urban street, rural street, etc. The protocols themselves can become rather complicated as the complexity of the problem increases.

  2. Annotation
    Second, given the protocol, the actual annotation work is carried out. Often, one will first annotate a small set and check the understanding of the protocol with the machine learning experts, to avoid unnecessary labor.

  3. Quality assurance
    After the data is annotated, the third step is some form of quality assurance. For moderate datasets this is often done over each and every sample, but for larger ones it is impractical to review every sample. A common approach is to have multiple annotators label the same data and then check for consistency. That adds cost, but it's hard to argue with consensus.

In my experience, even though we’d like these three steps to be executed once in a chain, they are generally refined multiple times to achieve a usable dataset.
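The consensus check in step 3 can be sketched in a few lines: given aligned label lists from several annotators, take the majority vote per sample and track how often they fully agree. The class names and values below are made up for illustration.

```python
from collections import Counter

def consensus_labels(annotations):
    """annotations: one label list per annotator, aligned by sample index.
    Returns (majority label per sample, fraction of samples with full agreement)."""
    n = len(annotations[0])
    majority, unanimous = [], 0
    for i in range(n):
        votes = Counter(a[i] for a in annotations)
        label, _ = votes.most_common(1)[0]
        majority.append(label)
        if len(votes) == 1:  # every annotator gave the same label
            unanimous += 1
    return majority, unanimous / n

ann_a = ["car", "road", "car", "sky"]
ann_b = ["car", "road", "bike", "sky"]
ann_c = ["car", "tree", "car", "sky"]
labels, agreement = consensus_labels([ann_a, ann_b, ann_c])
```

Real QA pipelines use stronger statistics (e.g., chance-corrected agreement scores), but the principle, redundant labels checked for consistency, is the same.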

How does the annotation happen in the real world?

As you can imagine, there are many ways to annotate datasets. The most obvious way is through a service provider. Some are well known, like Scale, Labelbox, Sama, and V7; some have other core businesses but perform annotation on the side. Interestingly, some of these work through an API and can be tightly integrated into ML software systems. Others are "batch" oriented: collections of images are handed off to an annotation service and returned with labels. Some service providers instead are more aptly described as workforce managers, and are generally a bit more flexible, but risky.

If instead you have a cohort on hand (i.e., you do not want to send your data to a third party for any reason, or perhaps you have energetic undergrads), then you could alternatively consider local, open-source annotation tools such as CVAT and Label Studio. Finally, nowadays, you might instead work with Large Multimodal Models to have them annotate your data; more on this awkward angle later.

High-quality annotation is both expensive and hard to come by

I have directly consumed annotation services, worked with open source annotation tools, used LMMs to understand visual data, and talked with hundreds of company-teams that rely on high-quality annotations to get their job done. The two most striking things I find about the annotation market are that (a) high-quality annotation is both expensive and hard to come by and (b) most teams spread their annotation budget quite broadly across different providers and in-house users of open-source tools to reduce risk.


Figure 2 Here is an example of a frontier annotation problem: segmentation of weeds from bona fide crops. This is a sample from the CED-Net-Crops-and-Weeds dataset. Red is weed and blue is crop. Source: CED-Net on GitHub

What does the future of annotation look like?

I began this article with an admittedly bold claim: Annotation is dead.

Let’s first assume that annotation is still needed, leading to the simpler “How will annotation work evolve in the coming years?”

Where machines and algorithms fear to tread

There are two key areas where humans will continue to play a key role in annotation for AI.

First, human annotation is critical on the frontier: the part of our collective capabilities that we do not yet have a strong handle on methodologically. The frontier varies quite broadly across domains. In one domain with limited publicly-studied data, this may mean simple annotations like detection-boxes. In more complex domains, this moves more into fine-grained, detail-oriented annotation tasks. I include such an example of a more complex domain in Figure 2 above.

It is not yet clear whether the market size of this “frontier-work” is still expanding, has reached its max, or is already contracting. It is plausible to think the frontier may at some point be exhausted.

Gold standard evaluation datasets

Second, human annotation is needed for gold standard evaluation datasets. Whether these are for performance evaluation in-house or future government compliance vetting, there will forever be a need to define gold standard datasets. Humans will do this work. Forever. Furthermore, I expect a greater future need and perhaps even legislation around compliance for AI that involves gold standard industry and problem-specific benchmark datasets. (Now that is going to be an industry for the future…)

The uniquely human factor of data

In addition to these areas, there is an aspect of human labor that I think will continue, but it's not really annotation; it's more of the actual science of machine learning when one values the role a dataset plays in the overall capability of the resulting system (more on that in a future article).

I am talking about data quality. Data quality can have many different faces. For example, this could be specific errors in annotation. It could be the overall sampling of a set of possible images in a collection, e.g., for autonomous driving, only capturing scenes from North America but then trying to deploy in Australia. Some of these quality issues, it seems, can be safely ignored; in those cases more data beats perfect data. But distributional quality takes human labor.

Any system out there that tells you it can automatically assess and fix some notion of such distributional quality is simply lying to you. To do that, it would need a model of all data. That model doesn't exist. Even the space of 16x16 binary images, for example, has about as many possible images as there are atoms in the known universe, let alone a distribution over them, which would need to be enumerated (most are noise images that are irrelevant, while very few capture the 16x16 images we are actually likely to find in the world). To actually improve the quality of a dataset, it takes work: to find corner cases, to find gaps in the collected data, to ensure the protocol is sufficient, etc. This is the work humans do so well. But, it takes time and tools.
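A quick back-of-envelope check of that counting claim:

```python
# A 16x16 binary image has 256 pixels, each 0 or 1,
# so the space contains 2**256 distinct images.
num_images = 2 ** (16 * 16)

# A common rough estimate for atoms in the observable universe is ~10**80.
atoms_estimate = 10 ** 80

print(f"{num_images:.3e}")  # ~1.158e+77, the same ballpark as the atom count
```

And that is for tiny binary images; realistic image spaces are unimaginably larger, which is why no enumerable "model of all data" can exist.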

The Future of Annotation: No Humans

Now, however, let’s relax the assumption that human annotation is still needed. Where does that leave us?

Take a large-x-model and learn how to distill it down to a minimum-size model that maintains a certain set of capabilities and forgets the rest of its knowledge. The implications of this are just awesome.


The first thing that comes to mind is auto-annotation. In other words, drop your data off at Name-Your-Favorite-Large-Multimodal-Model and pick it up with the labels. Auto-annotation seems pretty useless to me. I mean, if an algorithm out there can autolabel your data for you, then you really want the algorithm, no? Ok, fine, there may be special cases where you need to set up a completely different system in a new operating environment, such as low-power edge applications. But these situations are fewer than the general need. What excites me about this, however, is the creation of a whole new field. Take a large-x-model and learn how to distill it down to a minimum-size model that maintains a certain set of capabilities and forgets the rest of its knowledge. The implications of this are just awesome.
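The core distillation trick can be hinted at with softened teacher targets in the style of Hinton et al.: raise the softmax temperature so the teacher's output distribution exposes relative similarities between classes ("dark knowledge") for the student to match. This is a minimal sketch of the temperature softmax only, not a full distillation pipeline, and the logits below are made up.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Soften an output distribution; higher T spreads probability mass."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 2.0, 1.0]
hard = softmax_with_temperature(teacher_logits, T=1.0)  # nearly one-hot
soft = softmax_with_temperature(teacher_logits, T=4.0)  # softened targets for a student
```

A student trained against the softened distribution (plus the usual hard labels) inherits far more of the teacher's knowledge per example than labels alone would convey.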

There is another angle to this auto-annotation gambit, which I think is reasonable. This would be the case where you are planning to leverage an existing model to decide what elements of your data to actually process. There are quite a few ways I could imagine doing this. One that comes to mind is a classifier that will help you select only relevant data samples from a set for later annotation. For example, if you are a robotics company that will deploy robots into hospitals, such as Moxi in Figure 3, then you probably want to remove potential data that are not from hospitals. A second that comes to mind is using a model that can help with searching through a data lake for similar data to one or more queries. This is more like filtering than auto-annotation, but that’s just semantics.
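The second idea, searching a data lake for samples similar to a query, can be sketched with cosine similarity over embeddings, assuming each sample already has an embedding from some pretrained encoder. The file names, 3-d embeddings, and threshold below are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_relevant(samples, query_embedding, threshold=0.9):
    """Keep only samples whose embedding is close to the query embedding."""
    return [name for name, emb in samples if cosine(emb, query_embedding) >= threshold]

# Hypothetical embeddings; a real pipeline would use a pretrained encoder.
data_lake = [
    ("hospital_corridor.jpg", [0.9, 0.1, 0.0]),
    ("warehouse_floor.jpg", [0.1, 0.9, 0.1]),
    ("hospital_ward.jpg", [0.8, 0.2, 0.1]),
]
hospital_query = [1.0, 0.1, 0.0]
kept = filter_relevant(data_lake, hospital_query, threshold=0.9)
```

Only the kept samples would move on to (human or machine) annotation, which is how an existing model shrinks the labeling bill without doing the labeling itself.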


Figure 3 An example of a state-of-the-art hospital robot, named Moxi. Source: Diligent Robotics.

Weak and self-supervision

What about weak and self-supervision? Absolutely. All in. The head of machine learning at a not-to-be-named automotive company told me last year (paraphrasing): “we decided not to spend any more money on annotation, we are pushing the boundary on what we can do without manual annotation.” This statement sums up what I think about weak- and self-supervision; i.e., they are the future. It gets even better in multimodal settings when one modality can be used as the automatic guidance signal of another modality. This, in my view, aligns much more closely with the way humans learn. Exploration, prediction, verification. At the very least, it is an existence proof.


So, is annotation dead? Ok, I admit to being a bit dramatic. Perhaps it’s not dead yet, but some parts are dying and clearly others are evolving. I think that evolution is going to be quicker than expected.

One conclusion to be drawn from this analysis is that human annotation is going to be needed for a long time, at least to create gold-standard compliance datasets. This may be, but the market size for such work is likely significantly smaller than current needs, especially given the significant advances in weak- and self-supervised learning, where only raw data is needed.

If I were in annotation, a market reported to be worth $800M in 2022, I would not be too worried. In the short term, slow followers, frontier needs, QA, and compliance are likely to sustain demand, but I would be sure to capitalize on the evolution of that demand to build the annotation capabilities of tomorrow.

So, perhaps the real question is what is the half-life of the annotation market? What do you think? Add your comments here or reach out directly.

The article grew from an original LinkedIn post that suggested the need for a longer-form article.


I want to thank all of the annotators who have traded their time for pushing machine learning forward, especially those who did it without monetary compensation. The field would not be where it is today without you. I also want to thank Jerome Pasquero, Evan Shlom and Stuart Wheaton who asked deep questions about the original LinkedIn post that led to this article.


Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees at The Johns Hopkins University in 2005 and 2002, respectively, and the BS Degree with honors from Loyola College In Maryland in 2000, all in Computer Science. He is the recipient of a U Michigan EECS Outstanding Achievement Award 2018, Google Faculty Research Award 2015, the Army Research Office Young Investigator Award 2010, National Science Foundation CAREER award 2009, SUNY Buffalo Young Investigator Award 2011, a member of the 2009 DARPA Computer Science Study Group, and a recipient of the Link Foundation Fellowship in Advanced Simulation and Training 2003. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, MAA and a senior member of the IEEE.

Copyright 2024 by Jason J. Corso. All Rights Reserved.
No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at JasonCorso.

Top comments (2)

Salika Dave

There is a lot to learn for all of us from this article! Thank you for sharing your insights here on DEV 😄
I find it interesting that we can use LLMs to annotate data. But do you think we could use LLMs when we are trying to annotate erroneous data from LLMs themselves?

Jason Corso

Thanks! This is a great point. I'd probably want to write a short follow-on to the article on LLMs. There is a lot of potential here, but it's hard to understand the risks of bias and hallucinations in LLMs. I think there are related potential angles for LLMs in direct evaluation.