
From the Kitchen to the Lab: Why Cooking Became AI’s Favorite Dish

Author: Jason Corso (Professor of Robotics and EECS at University of Michigan | Co-Founder and Chief Science Officer @ Voxel51)

An in-depth look at why the cooking domain has been a key ingredient in computer vision and machine learning for the last decade


Figure 1: With hundreds of hours of video and thousands of recipes, cooking-based datasets have secured a serious foothold in computer vision and machine learning research. Why? Source: DALL-E.

It wouldn’t be crazy if you mistakenly believed that Julia Child was actually a computer vision researcher. Seriously. In the last decade, more than a dozen computer vision and machine learning research datasets have been built around cooking, collectively containing hundreds of hours of video, many thousands of frames, and over a million diverse recipes. Sports and general human motion had the community covered the decade before, with UCF Sports and HMDB51. But cooking takes the cake for the last decade, literally.

Why has cooking become such a popular topic for computer vision and even broader research areas in AI, like Robotics? And if this is a real trend, is it a good thing? As a participant in this cooking-for-computer-vision community, I make my best effort to objectively answer these questions in the rest of this essay.

First, I review some history, making an effort to catalog the lineage of innovations that various key cooking-oriented datasets have made over the last decade.

I then discuss why I think cooking has garnered so much interest in the research community. This analysis is grouped into four categories:

  • Visual complexity — the cooking domain is steeped in visual richness: fine-grained articulation, a rich visual vocabulary that includes transparent and granular objects, significant and dynamic occlusions, and much more.
  • Procedural, goal-oriented — the cooking domain has heaps of structure, including but not limited to the stepwise nature of actual recipes, the potential for various kinds of mistakes, and the prevalence of instructional content.
  • Ambiguities in semantics — cooking itself requires a nuanced understanding of a specialized language and its associated practice; it is mired in ambiguity that spans multiple modalities, making it quite useful for the study of interaction.
  • Opportunities — far from fully tapped, the cooking domain presents numerous opportunities for future investigation that push its frontier and impact even further.

In summary, it is my view that cooking is uniquely well suited as a driving application domain in AI. It is visually complex, temporal, procedural, goal-oriented, and instructional. Yet, it is also universal to humanity and significantly easier to work with than other domains that may also have these properties, such as surgical procedures.

The History of Cooking in Computer Vision and AI

The kitchen is a communal place, often replacing the actual living room, as we call it in the US, for family gatherings. At least, this is the way it is in my family. So it was no surprise that, more than a decade ago, when I saw Ross Messing’s tracklets talk featuring people doing everyday things in the kitchen, my interest was piqued (see Figure 2 below for examples from this work). This early food-oriented work followed roughly five years of activity recognition research that primarily focused on full-body motions.


Figure 2: Examples from the Messing Activities of Daily Living Dataset (ICCV 2009).

Sidebar: Even though we call the problem activity recognition, we are really talking about video clip classification. In that era, the motion itself was often not especially relevant to how the community studied activity recognition. One would need to control the background, content, and overall context quite carefully to ensure motion mattered. This was attempted by one of the earliest activity recognition datasets, the KTH dataset (see example in Figure 3), but it was quickly superseded by datasets of “richer” sporting and everyday activities, e.g., UCF Sports and HMDB51.


Figure 3: Examples from the early KTH human actions dataset (ICPR 2004). Source: KTH dataset.

However, I recently learned that this was far from the earliest use of the kitchen as a setting for computer vision research (thank you, Dima Damen). In 1990, there was a paper that explored how to estimate the expansion of food being cooked in a microwave oven. It was not a dataset paper, and probably not even the first time a kitchen setting was used in a computer vision study, but it is nonetheless very interesting as an early flag in the ground.

Fast-forward to today, and one can easily find a decade of cooking-oriented datasets in computer vision that collectively contain hundreds of hours of visual content and over a million unique recipes. The first of these modern datasets was published at CVPR 2012 and the most recent in TPAMI 2022, with a handful of them still in active use. The interest in food-related research is not unique to cooking itself; for example, the fifth most downloaded dataset on HuggingFace is Food101 (see the short sketch below). Let me catalog many of these below, noting the unique attributes each brought to the community. I have ordered them chronologically by the first iteration of each dataset (for those with multiple iterations), and I only include video-based datasets, which are the most common.
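As a quick aside on just how accessible this kind of data has become, here is a minimal sketch (mine, not part of any paper above) for pulling Food101 from the Hugging Face Hub; it assumes the `datasets` library is installed and grabs only a small slice of the training split.

```python
# Minimal sketch: load a slice of Food101 from the Hugging Face Hub.
# Assumes `pip install datasets`; the full dataset has ~101,000 images over 101 dishes.
from datasets import load_dataset

food101 = load_dataset("food101", split="train[:100]")  # small slice for a quick look

example = food101[0]
print(example["label"], example["image"].size)  # class index and PIL image dimensions
```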

  • MPII Cooking 2 is a 2015 dataset containing about 27 hours of video with annotations for 88 verbs from an exocentric, or third-person, viewpoint. The subjects were recorded in a lab setting and executed one of 14 recipe types. This dataset replaces the original 2012 MPII Cooking dataset.

  • 50 Salads is a 2013 dataset that contains over 4 hours of video annotated with cooking-oriented action units, i.e., combinations of an action performed with or on a certain object, such as slicing-tomato. There are 51 unique action unit types in the dataset. Importantly, this is an egocentric, or first-person, viewpoint dataset. In Figure 4 below, I include a sample from 50 Salads to demonstrate the RGB-D and accelerometer data that are unique to this dataset among the list here. However, this type of multimodal sensing existed for the cooking domain before 50 Salads, e.g., CMU MMAC.


Figure 4: An example from the 50 Salads dataset showing the accelerometer data underneath the video and the depth frame. Source: YouTube example.

  • YouCookII is a 2018 dataset (from my research group) that contains nearly two dozen instances of each of 89 recipes from cuisines around the globe, totaling more than 176 hours of content. The long-running videos — some are more than ten minutes — contain diverse, unscripted content (the dataset is browsable), and the 67 most frequently occurring objects are also annotated. A unique aspect of the YouCookII dataset is that the videos were harvested from YouTube by querying the recipe name; then, each video was watched by an annotator, who wrote a natural language description of each recipe step. These types of natural language descriptions were used in an early video captioning paper at CVPR 2013, which was also the first iteration of this YouCook series.

  • Breakfast is a 2014 dataset spanning 10 recipes and 48 action units across 77 hours of video, also from an exocentric, or third-person, viewpoint. This dataset was created to facilitate the study of an action grammar underlying goal-oriented activities, given the procedural structure of the general activity of cooking (more on this later).

  • GTEA Gaze+ is a 2015 dataset that emphasizes modeling of egocentric-specific features for action understanding, enabling the study of the more nuanced elements that separate egocentric video from the more common exocentric video. It has three variations, with the largest containing about nine hours of video spanning 44 actions.

  • Epic-Kitchens was originally released in 2018 with 55 hours of custom-acquired egocentric cooking content and was later superseded by Epic-Kitchens 100, increasing the total footage to 100 hours of high-speed video. The largest video corpus of cooking to date, Epic-Kitchens contains more than 100 verb and 300 noun classes independently annotated (i.e., not in action units); see Figure 5 below for a visual example. Interestingly, despite strictly containing cooking content, there are no recipe annotations in the corpus, potentially limiting its utility for goal-oriented and instructional video understanding. Despite this observation, the rich annotation scheme used for both versions of the dataset has enabled it to be a gold standard corpus for myriad video understanding problems, such as semi-supervised video object segmentation, action detection, multi-instance retrieval, and more (for the full list, visit the challenges webpage).


Figure 5: A snapshot from the original Epic-Kitchens dataset, which remains one of computer vision’s most widely used datasets. Source: ECCV 2018 paper.

  • TastyVideos V2 is a 2022 dataset building on an earlier 2019 version, and hence the most recent one on this list. This work took the community in a different direction than the earlier ones. Estimated to have about 60 hours of content, this dataset focuses on the new problem of zero-shot prediction — given a recipe you have not seen before, carried out to a certain step, can you predict what will happen next? TastyVideos V2 has 4022 unique recipe videos with temporal boundary annotations and ingredient lists, captured from a modified third-person viewpoint (above the workspace).

The datasets I described above are all video-based and support various video-level problems in instructional, goal-oriented vision, while also being in the cooking domain. I may have missed some, including the more recent Ego4D (https://ego4d-data.org/), which has some cooking-relevant content, although it is broader. I also did not include other very relevant-to-cooking datasets that are image-based, such as Recipe1M, which includes more than 1M detailed recipes with 800K images.

As far as I know, this list includes the key datasets still in use today. Epic-Kitchens is the most widely used egocentric cooking dataset. YouCookII is the standard for exocentric cooking. Recipe1M is the largest-scale image-and-language dataset related to cooking. I await what new contributions will be made in this exciting domain!

Why Is Cooking Content So Useful to Computer Vision and AI?

It’s really not surprising that the research community at least considered cooking as a plausible domain. As I noted earlier in the essay, we all eat, most of us cook, and we tend to associate cooking with community — all positive things. In fact, Wrangham argues that cooking food is an essential element of the physiological evolution of our species in “Catching Fire: How Cooking Makes Us Human.”

But, to see so much emphasis — I’d argue it is the most impactful driving domain of the last decade; these dataset papers alone have more than 4500 citations collectively — is a surprise. It is a surprise even to me, someone who contributed to its rise. As a scientist, I thought it was time to take a high-level look at what makes cooking such a useful domain. In the sections below, I provide plausible explanations.

Visually Complex

Cooking involves manipulating ingredients with one’s hands or tools in a sequence of steps. This manipulation is fine-grained and often highly complex, and hence, it presents quite a challenging visual signal. Given the tight workspace in typical cooking videos, the rich, fine-grained articulation leads to significant self-occlusion, especially in an egocentric setting. In exocentric settings, the fine-grained objects and actions are often minuscule in pixel resolution, creating a need to leverage context to solve visual recognition tasks.

But, perhaps even more importantly, cooking is messy. Cooking involves working with various liquids and granular substances. Detecting the difference between salt and sugar is sometimes impossible from vision alone, for example. Detecting the level to which a transparent measuring cup has been filled with, say, water, is hard. Even verifying whether or not oil has yet been added to a pan is not necessarily easy (see Figure 6 below).


Figure 6: The cooking domain presents some nuanced visual recognition challenges that have placed it on the edge of capability for the last decade. Which of these pans have oil in them? It’s harder than you think. Source: https://images.google.com

In addition to the inherent visual complexity in the cooking domain, the widespread potential use of egocentric video makes it attractive, as the community has been on an upward trajectory of interest in first-person viewpoint application areas like teleoperation, augmented reality, and robotics. As you probably noticed, half of the datasets listed above (three of six, as the TastyVideos datasets use a modified ego-exo viewpoint) use an egocentric viewpoint. Egocentric video understanding is significantly different from the more common exocentric content. Even more than the obvious challenges of shakiness, jitter, and saccade-like movements of the head (where the camera is typically, but not always, mounted), the challenge of having the visual recognition model maintain a sufficiently persistent representation of the environment is critical.

Procedural, Goal-Oriented

Cooking, as an activity, inherently entails carrying out various actions in one of a small set of valid sequences. One does not cook in an instant, even with a microwave (as we saw with that historical paper from 1990), rendering the use of still images alone suspect. In other words, cooking takes time. This temporal nature of the process is less common in many of the other recognition challenges the vision community has studied, such as direct object detection, scene classification, and 3D reconstruction. The temporal nature of the problem is inherent in the choice of cooking, and methods naturally focus on it.

Yet, there is more. Not only is it temporal, but it is also procedural. The chef wants to prepare the dish. To prepare the dish, she must understand the ingredients and the steps in the recipe. Then, she must carry out the sequential actions necessary to achieve those ends. Easier said than done, of course. The procedural nature of the scenario requires the modeling and inference of preconditions, postconditions, and perceptual evidence, all good challenges from an AI perspective, and yet not well captured in many other areas of computer vision, such as recognizing which sport appears in a video. Computer vision challenges like detecting when one recipe step is over and the next has begun, or whether there are enough tomatoes in the bowl to stop that step and move on to the next one, are concrete instances of broader challenges that test the state of the art in visual perception (see the toy sketch below).
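As a toy illustration of that step-boundary challenge (a sketch of mine, not a method from any of the papers above), suppose a hypothetical per-frame classifier emits a probability over recipe steps for every frame; a simple baseline smooths those scores over time and reports a boundary wherever the most likely step changes.

```python
# Toy sketch: recover recipe-step boundaries from per-frame step probabilities.
# The classifier producing `frame_probs` is hypothetical; only numpy is assumed.
import numpy as np

def step_boundaries(frame_probs: np.ndarray, window: int = 15) -> list[int]:
    """frame_probs: (num_frames, num_steps) array of per-frame step probabilities."""
    # Moving-average smoothing along time to suppress frame-level jitter.
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, frame_probs
    )
    labels = smoothed.argmax(axis=1)            # most likely step per frame
    changes = np.flatnonzero(np.diff(labels))   # frames where the step label switches
    return (changes + 1).tolist()               # first frame of each new step

# Synthetic check: 3 "steps" of 100 noisy frames each -> boundaries near 100 and 200.
rng = np.random.default_rng(0)
frame_probs = np.vstack(
    [np.eye(3)[i] + 0.3 * rng.random(3) for i in range(3) for _ in range(100)]
)
print(step_boundaries(frame_probs))
```

Real methods are far more sophisticated, of course, but even this toy setup makes the evaluation question concrete: a predicted boundary is only as useful as its temporal precision.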

Importantly, the procedure we are talking about has an inherent goal: to prepare an edible, if not good, dish. This goal-oriented nature of cooking necessarily adds structure and guidance to any computational modeling of the cooking procedure. This is similar to some other domains, such as building furniture or repairing an engine. But it is dissimilar to many other domains that involve more continuous planning based on instantaneous observations, such as avoiding a collision with a pedestrian in autonomous driving.

What’s more, some cooking videos are also instructional, because the person in the video intends to teach the viewer how to achieve the task. Only recently has the community started to look at such instructional content as a domain of its own, such as this workshop at CVPR 2018 that I co-organized, motivated by the same observations I am describing here from working with YouCookII. Instructional video has a unique situational dialogue between the viewer and the “teacher” in the video. The dialogue is nuanced; the way the teacher interacts with the materials is shaped by the instructional setting. The sheer volume of instructional videos in the cooking domain makes it special.

Furthermore, not only is it goal-oriented in the sense that the chef wants to prepare the dish, but more so, the chef wants to prepare a good dish. Hence, the goal-oriented nature of the cooking domain makes it a suitable testbed for mistake detection. In other words, has the chef used salt instead of sugar in the cupcakes? Or has the chef diced the tomatoes instead of slicing them? Details matter. Both in space and in time! In my view, the ill-defined anomaly detection problem takes on a more concrete face when one thinks of it as mistake detection in cooking.
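To make that framing concrete, here is a tiny sketch (mine, with made-up step names) that treats mistake detection as aligning an observed action sequence, say the output of an action-recognition model, against the recipe’s expected steps and flagging any deviation.

```python
# Toy sketch: mistake detection as sequence alignment between expected recipe steps
# and observed actions. Step names are invented for illustration; only the standard
# library is used.
from difflib import SequenceMatcher

expected = ["crack eggs", "add sugar", "whisk", "pour into pan", "bake"]
observed = ["crack eggs", "add salt", "whisk", "pour into pan", "bake"]

matcher = SequenceMatcher(a=expected, b=observed)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":  # 'replace', 'delete', or 'insert' flags a potential mistake
        print(f"{op}: expected {expected[i1:i2]}, observed {observed[j1:j2]}")
# -> replace: expected ['add sugar'], observed ['add salt']
```

The hard part in practice is, naturally, producing a reliable observed sequence from pixels in the first place, which is exactly where the visual complexity discussed earlier comes back in.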

Let me close this section with a theory. Cooking has become such a popular application domain for computer vision and machine learning because it is visually complex, uniquely temporal, procedural, goal-oriented, and instructional while also being universal to humanity and significantly easier to work with than other domains that may also have these properties, such as surgical procedures.

Ambiguities in Semantics

Like language, cooking is often ambiguous. Picture the confusion when a recipe casually instructs one to “season to taste” or to select “a medium onion,” leaving the uninitiated to navigate a vast sea of subjectivity. Or consider the generality of certain dish names. Does your household put banana slices on PB&Js? Ambrosia is another one like this: depending on who makes it, it may have oranges, coconut, marshmallows, and so on.

These are natural ambiguities of a nuanced domain, which, you guessed it, is an awesome reason for its research popularity. The multimodal nature of procedural video grounded in goal-oriented natural language text makes cooking a perfect application for exploring multimodal models, whether classically through grammars, as we saw with the Breakfast dataset, or through modern large language models (one simple multimodal baseline is sketched below).
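As one simple multimodal baseline (a hedged sketch, not tied to any particular dataset above), one could score how well each recipe-step description matches a single video frame using an off-the-shelf CLIP model from the Hugging Face transformers library; the frame path and step texts below are placeholders.

```python
# Sketch: rank candidate recipe-step descriptions against one extracted video frame
# with CLIP. The model name is the public OpenAI checkpoint; the image path is a placeholder.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

steps = [
    "dice a medium onion",
    "whisk the eggs until frothy",
    "pour the batter into a greased pan",
]
frame = Image.open("frame_00421.jpg")  # placeholder path to one video frame

inputs = processor(text=steps, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape (1, num_steps)
probs = logits_per_image.softmax(dim=-1).squeeze(0)
print({step: round(p.item(), 3) for step, p in zip(steps, probs)})
```

Of course, resolving “a medium onion” or “season to taste” takes more than frame-level similarity, which is precisely why the ambiguity of the domain keeps it interesting.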

Opportunities

These three areas group the most common reasons I am aware of for the rise of cooking as a key application domain in computer vision, machine learning, and AI. Below, I note some more speculative possibilities that I have not seen in the literature. Some are more concrete than others, including one that practically begs for a certain new dataset, and I have tried to focus on what is technologically possible now rather than, for example, enabling an AI to taste or smell what’s cooking!

  • Compositionality — Cooking itself is a compositional activity. One takes ingredients, combines them, operates on them, and combines them further until the dish eventually emerges. Modeling such compositionality in machine learning is often seen as critical to attaining the flexibility and generalization needed for next-level capabilities. Whether or not that compositionality should be explicit is a key question, but, in any case, cooking provides a ripe domain for modeling and understanding compositionality.

  • RLHF — Each chef adds their unique flair, even for widely known recipes. Hence, the infinitude of possible recipes presents quite an interesting challenge for generation and adaptation using reinforcement learning from human feedback (RLHF). An interesting and real challenge here is how to get the typical feedback used in RLHF without actually being able to produce the dishes. Or, I suppose there is a future in which a robotic chef could produce the variations and operate within a feedback loop.

  • Cultural Modeling — Enhancing the ability of future AI systems to sense human cultural diversity is critical to building trust across the globe. Yet cultural diversity is so broad. Perhaps the diversity in cooking — actual recipes, common dishes, typical ways in which certain dishes are made — presents a sufficiently rich landscape to support the study of diversity from a computational perspective.

  • Skill Assessment — As Gordon Ramsay says, “However amazing a dish looks, it is always the taste that lingers in your memory.” Cooking is hard. Rather, cooking is easy, but cooking well is hard. Objective, computer- or AI-based skill assessment has been emphasized in certain domains, like laparoscopic surgery, but is still generally a nascent field. In many domains, like the surgical ones, getting a reasonable amount of data is difficult, if not impossible. Perhaps cooking could provide a proxy domain for a rigorous treatment of skill assessment. Who is going to make the dataset that includes annotations of dish quality?! That would be amazing.

Closing

The average contemporary human consumes about 90,000 meals in a lifetime. Food is central to our existence. Recall the pivotal scene from The Matrix when Neo first meets the crew of the Nebuchadnezzar and is presented with the rather unappetizing bowl of slop that is “a single-celled protein combined with synthetic aminos, vitamins and minerals; everything the body needs.” In that scene, the food acts as Neo’s vehicle, bringing him into the stark reality of the sacrifices the crew makes to stay in the real world rather than enjoy the illusions within the Matrix.

So, as I said earlier, I am not surprised that the computer vision community explored cooking as an application domain. Whether or not we fully appreciated the richness of the domain when we began this exploration, it is a certain win that cooking has become central to a wide swath of research problems. From the multimodal richness and the goal-oriented nature of cooking to the numerous not-yet explored challenges like compositionality and cultural diversity modeling, cooking is likely to be a mainstay in computer vision, machine learning and AI for years to come.

As I said earlier in the essay, cooking has become such a popular application domain for computer vision and machine learning because it is visually complex, uniquely temporal, procedural, goal-oriented, and instructional while also being universal to humanity and significantly easier to work with than other domains that may also have these properties, such as surgical procedures.

Interestingly, most of the reasons for cooking being a useful research domain are not necessarily unique to cooking. Other domains, such as surgical video, engine repair, and robotics, have many similar properties. Yet, cooking is so immediate: no fancy equipment, minimal safety concerns, everyday items, etc. — making it an excellent proxy for many other domains. Indeed, there is evidence that some of these adjacent directions, such as robotic cooking and robotics for everyday activities, are quite interesting and already being investigated.

Acknowledgements

Thank you to all of the chefs who post their videos on YouTube, to all of the subjects who were recorded during the creation of these datasets, and to all of the annotators who enriched the pixels with attributes making them more useful for research. Thank you to my colleagues Harpreet Sahota, Jacob Marks, Filipos Bellos, Dan Gural, and Michelle Brinich for reading early versions of this essay and providing insightful feedback.

Biography

Jason Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees at Johns Hopkins University in 2005 and 2002, respectively, and a BS Degree with honors from Loyola University Maryland in 2000, all in Computer Science. He is the recipient of the University of Michigan EECS Outstanding Achievement Award 2018, Google Faculty Research Award 2015, Army Research Office Young Investigator Award 2010, National Science Foundation CAREER award 2009, SUNY Buffalo Young Investigator Award 2011, a member of the 2009 DARPA Computer Science Study Group, and a recipient of the Link Foundation Fellowship in Advanced Simulation and Training 2003. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, MAA and a senior member of the IEEE.

Disclaimer

This article is provided for informational purposes only. It is not to be taken as legal or other advice in any way. The views expressed are those of the author only and not his employer or any other institution. The author does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by the content, errors, or omissions, whether such errors or omissions result from accident, negligence, or any other cause.

Copyright 2024 by Jason J. Corso. All Rights Reserved.
No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at JasonCorso.
