DEV Community

Lorraine Lee
Lorraine Lee

Posted on • Updated on

Playing with a public dataset, and playing with food

One rather large and rather useful dataset is called FoodData Central, provided by the United States Department of Agriculture.
I downloaded the December 2019 release of the full download of all data types in CSV format, which comes to 67MB. I imported this data into Sqlite with this quick and dirty script

#!/bin/bash

for file_name in $(ls *.csv); do
  table_name="${file_name/.csv/}"
  $(sqlite3 $1 '.mode csv' ".import $file_name $table_name" '.exit')
  echo "supposedly added $table_name to database in $1"
done
Enter fullscreen mode Exit fullscreen mode

This of course resulted in a database in which all columns are of TEXT type. I changed the primary keys to INTEGER PRIMARY KEY AUTOINCREMENT and foreign keys to INTEGER using a nifty GUI tool called DB Browser for SQLite. (No, you can't change column types retroactively from the Sqlite command prompt.) I also changed the type of the amount column in the food_nutrient table (a many-many relation between foods and nutrients) to REAL. I suppose I could have used post-GRE-SQL (i.e. graduate-level SQL). I once set up a post-GRE-SQL database, but it was all cargo cult CLI use on my part, and a little scary.

Anyhoo, FoodData Central (like its predecessor, the Food Composition Database) lists 328543 foods and lab assays for amounts of 233 nutrients. Not all foods were tested for all nutrients, but it seems a pretty dense grid from what I've looked at of it. This analysis includes breakdown of protein content by amino acid, so I decided to see how this dataset might help with protein complementarity math. I wanted to do some calculations based on recommended daily intake of the essential amino acids, and for that I need to treat combined amount of phenylalanine and tyrosine as if it were a single nutrient. So I made a backup copy of the Sqlite data file I had prepared, and undertook to add some features to my working copy, starting with the creation of a table of intake recommendations, a row in the nutrient table to represent Phenylalanine+Tyrosine, entries in food_nutrient to represent that sum (I know, Boyce Codd, but bear with me).

create table essential_aa (
  id integer primary key autoincrement,
  nutrient_id integer,
  who_recommendation real,
  usda_recommendation real
);

insert into nutrient(id, name, unit_name, rank) values
  (1230, 'Phenylalanine+Tyrosine', 'G', 17050);

insert into food_nutrient(fdc_id, nutrient_id, amount)
  select a.fdc_id, 1230, a.amount+b.amount
  from food_nutrient a, food_nutrient b
  where a.fdc_id = b.fdc_id
  and a.nutrient_id = 1217
  and b.nutrient_id = 1218;

-- source https://en.wikipedia.org/wiki/Essential_amino_acid#Recommended_daily_intake
insert into essential_aa(nutrient_id, who_recommendation, usda_recommendation) values
  (1221, 10, 14),
  (1212, 20, 19),
  (1213, 39, 42),
  (1214, 30, 38),
  (1215, 10.4, 13.173),
  (1216, 4.1, 5.827),
  (1211, 15, 20),
  (1210, 4, 5),
  (1219, 26, 24),
  (1230, 25, 33);
Enter fullscreen mode Exit fullscreen mode

Now to have some fun with queries. A chain is as strong as its weakest link, and a dietary protein (a polypeptide chain) is as strong as the amount of whatever essential amino acid is supplied in the least sufficient amount:

-- limiting amino acid
select food.description food,
    min(food_nutrient.amount/essential_aa.who_recommendation) pct,
    nutrient.name limiting_aa
  from food, food_nutrient, essential_aa, nutrient
  where food.fdc_id=food_nutrient.fdc_id
  and food_nutrient.nutrient_id = nutrient.id
  and essential_aa.nutrient_id = nutrient.id
  and food_nutrient.nutrient_id=essential_aa.nutrient_id
  group by food
  order by pct desc
  limit 10;
Enter fullscreen mode Exit fullscreen mode
Egg, white, dried, stabilized, glucose reduced|0.188666666666667|Leucine
Egg, white, dried|0.175333333333333|Leucine
Egg, white, dried, powder, stabilized, glucose reduced|0.1692|Lysine
Egg, white, dried, flakes, stabilized, glucose reduced|0.157933333333333|Lysine
SPORT CERTIFIED GRASS FED WHEY PREMIUM PROTEIN ISOLATE, CHOCOLATE|0.145673076923077|Methionine
SUGAR FREE 100% WHEY PROTEIN POWDER|0.1417|Histidine
NUTRIBIOTIC, RAW ORGANIC RICE PROTEIN PLAIN|0.136433333333333|Lysine
ORGANIC RAW PROTEIN POWDER|0.136433333333333|Lysine
100% WHEY PROTEIN POWDER|0.1344|Histidine
WHEY PROTEIN POWDER|0.1333|Histidine
Enter fullscreen mode Exit fullscreen mode

No real surprises here; I came looking for protein and found egg whites and protein supplements.

Then I decided to look into net protein utilization, which I'm assuming to be the ratio of the amount of the limiting amino acid to the total amount of protein.

1003 is the nutrient id representing total protein. In theory, this should be the sum of the amounts of the various amino acids, and therefore a function of other data in the database. So, in my own defense, when I created Phenylalanine+Tyrosine above, it's not as if I was despoiling a perfectly normalized database.

-- most balanced protein
select food.description food, min(a.amount/essential_aa.who_recommendation/b.amount) pct
  from food, food_nutrient a, food_nutrient b, essential_aa
  where food.fdc_id=a.fdc_id
  and food.fdc_id=b.fdc_id
  and a.nutrient_id=essential_aa.nutrient_id
  and b.nutrient_id=1003
  group by food
  order by pct desc
  limit 10;
Enter fullscreen mode Exit fullscreen mode
Bratwurst, pork, cooked|0.00410219032668012
Butter, whipped, with salt|0.0039820806371329
Sausage, smoked link sausage, pork|0.0034287915757031
Sweet potato leaves, raw|0.00305220883534137
Dutch brand loaf, chicken, pork and beef|0.00294230769230769
Cream, sour, cultured|0.00289884046381447
Fish, whitefish, eggs (Alaska Native)|0.00286843670199741
Toppings, butterscotch or caramel|0.00282201169119129
Babyfood, cereal, oatmeal, with bananas, dry|0.00278860569715142
Sausage, pork, turkey, and beef, reduced sodium|0.0027077497665733
Enter fullscreen mode Exit fullscreen mode

Like it says in Wikipedia, "A complete protein contains all the essential amino acids in the right balance; meat, milk and eggs are complete protein sources for humans."

Now I'm not one to venture into the Diet Wars, but I must admit to being curious as to what are the foods that offer the most net protein utilization per calorie, so I ran a query for that:

-- highest % calories from net protein
select food.description food, min(a.amount/essential_aa.who_recommendation/b.amount) pct
  from food, food_nutrient a, food_nutrient b, essential_aa
  where food.fdc_id=a.fdc_id
  and food.fdc_id=b.fdc_id
  and a.nutrient_id=essential_aa.nutrient_id
  and b.nutrient_id=1008
  group by food
  order by pct desc
  limit 10;
Enter fullscreen mode Exit fullscreen mode
Egg, white, dried, stabilized, glucose reduced|0.000528478057889823
Fish, haddock, cooked, dry heat|0.00050982905982906
Fish, haddock, raw|0.000506237006237006
Egg, white, raw, fresh|0.000500986193293886
Egg, white, raw, frozen, pasteurized|0.0005
Egg substitute, liquid or frozen, fat free|0.000495138888888889
Elk, free range, roast, eye of round, cooked (Shoshone Bannock)|0.000471933471933472
Eggs, Grade A, Large, egg white|0.000467272727272727
Beef, round, top round roast, boneless, separable lean only, trimmed to 0" fat, select, cooked, roasted|0.000466666666666667
Elk, free range, roast, eye of round, raw (Shoshone Bannock)|0.000460491889063318
Enter fullscreen mode Exit fullscreen mode

Basically what I was looking for here is foods that get almost all their calories from protein.
I'd expect foods high in protein and very low in fat and carbs, so not surprisingly, egg whites and fish, and apparently elk are pretty lean, mean creatures.

I'm also not one to venture into debates over the advisability of vegan eating, but I'm nevertheless curious about such things, and the database does come with some classification into categories, so I tried this:

-- highest net protein per calorie, vegan
select food.description food, min(a.amount/essential_aa.who_recommendation/b.amount) pct
  from food, food_nutrient a, food_nutrient b, essential_aa
  where food.fdc_id=a.fdc_id
  and food.fdc_id=b.fdc_id
  and a.nutrient_id=essential_aa.nutrient_id
  and b.nutrient_id=1008
  and food.food_category_id in (2, 9, 11, 12, 20)
  group by food
  order by pct desc
  limit 10;
Enter fullscreen mode Exit fullscreen mode
Seaweed, spirulina, raw|0.0004
Seaweed, spirulina, dried|0.000347701149425287
Pumpkin leaves, raw|0.000263157894736842
Turnip greens, frozen, cooked, boiled, drained, with salt|0.000250574712643678
Turnip greens, frozen, cooked, boiled, drained, without salt|0.000250574712643678
Turnip greens, frozen, unprepared|0.000243939393939394
Alfalfa seeds, sprouted, raw|0.00024247491638796
Seeds, cottonseed flour, low fat (glandless)|0.000234302594995366
Spinach, cooked, boiled, drained, with salt|0.000229933110367893
Spinach, cooked, boiled, drained, without salt|0.000229933110367893
Enter fullscreen mode Exit fullscreen mode

Admittedly I was expecting more seeds and less greens. Maybe there's something off in my query. Then again maybe seeds would float to the top had I order by total protein to calorie ratio (descending). Maybe high protein and high net protein are two different animals in the vegetable kingdom.

Top comments (0)