Raise your hand if you have never come across the “lack of data” problem while working on ML projects.
The unavailability or scarcity of training data is indeed one of the most serious challenges in ML, and in NLP in particular, and the problem gets harder when the data you need has to be labeled. When no shortcut works for you, the only alternative is to tag your data yourself... At this point, we can imagine the enthusiasm on your face!
But don’t be put off! Read the post and discover how we considerably reduced the time and cost of the tagging process.
We worked in the food domain, but the approach can easily be extended to many other domains.
The entities we want to tag are:
INGREDIENT: apples, cheese, yogurt, hot peppers…
QUANTIFIER: one, 2, ¾, a couple of…
UNIT (of measurement): oz, g, lb, liter, cups, tbsp…
We used a variant of the IOB scheme to tag the entities: the B- prefix marks the first token of an entity, the I- prefix marks the tokens inside it, and O is the default tag for every token outside an entity.
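To make the scheme concrete, here is a minimal sketch that assigns B-, I- and O tags to a tokenized ingredient line. The `iob_tags` helper and the (start, end, label) span format are assumptions made for illustration; they are not part of TagINGR.

```python
def iob_tags(tokens, spans):
    """Return one IOB tag per token.

    spans: list of (start, end, label) token indices, end exclusive.
    This span format is a hypothetical convenience for the example.
    """
    tags = ["O"] * len(tokens)           # O is the default tag
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


tokens = ["2", "cups", "of", "hot", "peppers"]
spans = [(0, 1, "QUANTIFIER"), (1, 2, "UNIT"), (3, 5, "INGREDIENT")]
for token, tag in zip(tokens, iob_tags(tokens, spans)):
    print(token, tag)
```

Here "hot peppers" is a two-token entity, so "hot" gets B-INGREDIENT and "peppers" gets I-INGREDIENT, while "of" keeps the default O.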
We sped up the ingredient tagging process with TagINGR, a semi-automatic tool that works by:
1. matching items in the recipes with those in a list of ingredients;
2. adding the tag INGREDIENT when an item is both in the list and in the recipe.
In part 1, the recipe_tagger function tokenizes the ingredient list and builds, for each ingredient, a search pattern and its tagged replacement:
import re

def recipe_tagger(lang, desc_ingr_list, recipe):
    # Part 1: tokenize every ingredient description
    # (tokenize is the project's own helper, not shown here)
    tokenized_ingr_list = [tokenize(lang, el) for el in desc_ingr_list]
    for ingr_token_list in tokenized_ingr_list:
        ingr, ingr_tag = "", ""
        for ingr_token in ingr_token_list:
            for n, token in enumerate(ingr_token):
                if n == 0:
                    # first token: start the search pattern and the B- replacement
                    ingr, ingr_tag = str(token).lower()+r'\t[A-Z]+\tO\n', str(token).lower()+" B-INGREDIENT\n"
                else:
                    # following tokens: extend both with an I- line
                    ingr, ingr_tag = ingr+str(token).lower()+r'\t[A-Z]+\tO\n', ingr_tag+str(token).lower()+" I-INGREDIENT\n"
In part 2, it tags the ingredients:
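            # Part 2: replace the matched lines with their tagged version
            if len(ingr_token) == 1:
                # single-token ingredient: also require a POS tag starting with N
                if re.search(ingr, recipe) and re.search(r'\t[NN][A-Z]*\tO', re.search(ingr, recipe).group()):
                    recipe = re.sub(r'\n'+str(ingr), "\n"+str(ingr_tag), str(recipe))
            else:
                if re.search(ingr, recipe):
                    recipe = re.sub(r'\n'+str(ingr), "\n"+str(ingr_tag), str(recipe))
    # drop the POS column from the lines that were not tagged
    recipe = re.sub(r'\n(.*)\t.*\t(.*)', "\n\\1 \\2", recipe)
    return recipe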
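To see what the substitution does, here is a small self-contained sketch on a toy recipe. It assumes, based on the patterns in the function, that the recipe is stored one token per line as token, tab, POS tag, tab, O. The `build_patterns` helper, the sample text and the use of `re.escape` are our own additions for illustration, and the noun check for single-token ingredients is left out.

```python
import re

# Assumed input format: one token per line, as token<TAB>POS<TAB>O.
recipe = "\nadd\tVB\tO\nhot\tJJ\tO\npeppers\tNNS\tO\nand\tCC\tO\ncheese\tNN\tO\n"


def build_patterns(ingredient_tokens):
    """Build the search pattern and the tagged replacement for one ingredient."""
    pattern, tagged = "", ""
    for n, token in enumerate(ingredient_tokens):
        prefix = "B" if n == 0 else "I"
        # re.escape is an extra safety step, not present in the original code
        pattern += re.escape(token.lower()) + r"\t[A-Z]+\tO\n"
        tagged += f"{token.lower()} {prefix}-INGREDIENT\n"
    return pattern, tagged


for ingredient in (["hot", "peppers"], ["cheese"]):
    pattern, tagged = build_patterns(ingredient)
    recipe = re.sub(r"\n" + pattern, "\n" + tagged, recipe)

# Drop the POS column from the lines that were not tagged.
recipe = re.sub(r"\n(.*)\t.*\t(.*)", r"\n\1 \2", recipe)
print(recipe)
```

After the last substitution every line has the form "token TAG": the ingredient lines carry B-INGREDIENT or I-INGREDIENT, and all the other tokens keep the default O.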
Once the ingredients were tagged, we could easily tag quantities and units. We first identified some entity patterns and then tagged them using a set of regular expressions.
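As a rough idea of what such regex-based tagging can look like, here is a minimal sketch. The two patterns below are hypothetical (this post does not list the actual expressions), and the sketch assumes the recipe is already in the one-token-per-line "token TAG" format, with O on the untagged lines.

```python
import re

# Hypothetical patterns, for illustration only.
QUANTIFIER = r"\d+(?:[.,/]\d+)?|[¼½¾]|one|two|three|half"
UNIT = r"oz|g|lb|liters?|cups?|tbsp"


def tag_quantities_and_units(tagged_recipe):
    """Retag 'token O' lines whose token matches a quantifier or unit pattern."""
    tagged_recipe = re.sub(
        rf"^({QUANTIFIER}) O$", r"\1 B-QUANTIFIER", tagged_recipe,
        flags=re.MULTILINE,
    )
    tagged_recipe = re.sub(
        rf"^({UNIT}) O$", r"\1 B-UNIT", tagged_recipe,
        flags=re.MULTILINE,
    )
    return tagged_recipe


sample = "2 O\ncups O\nof O\nyogurt B-INGREDIENT\n¾ O\nlb O\ncheese B-INGREDIENT\n"
print(tag_quantities_and_units(sample))
```

This sketch only handles single-token entities; multi-token quantifiers such as "a couple of" would need patterns that span several lines and assign I- tags as well.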
All very well, but… how did we build the list? What assures us it is complete? What does NN mean in the code? These and other questions are answered in the full article on Medium. Go read it!
When Food meets AI: the Smart Recipe Project
a series of 6 amazing articles
Table of contents
Part 1: Cleaning and manipulating food data
Part 1: A smart method for tagging your datasets
Part 2: NER for all tastes: extracting information from cooking recipes
Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier
Part 3: FoodGraph: a graph database to connect recipes and food data
Part 3: FoodGraph: Loading data and Querying the graph with SPARQL