
My machine-learning pet project. Part 2. Preparing my dataset

My pet project is about food recognition. More info here.

[Image: data sets in tutorials vs. data sets in the wild. From Towards AI on Twitter]

The first thing that came to my mind was to scrape stuff from some Instagram account. Have you seen how many recipes are there? Millions. And they have descriptions from which I could extract labels. I thought it would be easy. I managed to scrape about 10 posts using this:

instaloader profile nytcooking --no-videos --no-metadata-json --slide 1 --post-filter='date_utc >= datetime(2012, 5, 21)' --sanitize-paths
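
For reference, the same thing via instaloader's Python API looks roughly like this (a sketch following the takewhile pattern from the library's own docs; it runs into the same rate limits described below):

from datetime import datetime
from itertools import takewhile

import instaloader

# mirror the --no-videos and --no-metadata-json flags from the CLI call
L = instaloader.Instaloader(download_videos=False, save_metadata=False)
profile = instaloader.Profile.from_username(L.context, 'nytcooking')

# get_posts() yields newest-first, so stop once we pass the cutoff date
for post in takewhile(lambda p: p.date_utc >= datetime(2012, 5, 21), profile.get_posts()):
    L.download_post(post, target=profile.username)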

So far so good. But then I tried to scrape a bit more, e.g. posts for three months, and started getting 429 Too Many Requests errors. Creating new Instagram profiles didn't help. How could I, alone, beat an army of well-paid developers? I needed another approach.

I chose one of the recipe websites I used to visit. It has good photos and descriptions, and it's easy to scrape. I picked Scrapy to do the job: it's actively maintained (last commit: 5 days ago), has good documentation and readable code.

I saved one sample webpage to my desktop and launched scrapy shell:

scrapy shell ../Desktop/test.html

This helped me prepare a bunch of selectors like this one:

recipe.xpath('./p[@class="material-anons__des"]//text()').get()
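
Inside the shell you get a response object to experiment with. A session looked roughly like this (the material-anons class names are specific to the site I scraped, so treat them as placeholders):

# grab the first recipe block on the page
recipe = response.xpath('//div[@class="material-anons"]')[0]

# then poke at it until the selectors return what you expect
recipe.xpath('./p[@class="material-anons__des"]//text()').get()  # description text
recipe.xpath('.//img/@src').get()                                # image URL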

Then I created a project template using a command like this:

scrapy startproject myproject [project_dir]

I also took a look at their example repo on GitHub.
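
My spider ended up looking roughly like this (a sketch, not the exact code: the spider name, start URL and page structure are placeholders):

import scrapy


class RecipesSpider(scrapy.Spider):
    name = 'recipes'
    start_urls = ['https://example.com/recipes']  # placeholder for the real site

    def parse(self, response):
        # one block per recipe card on the listing page
        for recipe in response.xpath('//div[@class="material-anons"]'):
            image_url = recipe.xpath('.//img/@src').get()
            yield {
                'title': recipe.xpath('.//h2//text()').get(),
                'description': recipe.xpath('./p[@class="material-anons__des"]//text()').get(),
                # image_urls is the field ImagesPipeline reads (see settings below)
                'image_urls': [response.urljoin(image_url)] if image_url else [],
            }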

The only thing I struggled with for a while was item pipelines. At one point I copied some snippets from Stack Overflow into settings.py, and Scrapy started to complain, something like "file A now has an error". I went to the Scrapy repo on GitHub and found a nice little comment in that file explaining what should actually go into the settings file nowadays. So this is what I added to settings.py:

FILES_STORE = './csv'      # where FilesPipeline would store downloaded files
IMAGES_STORE = './images'  # where ImagesPipeline stores downloaded images
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
FEEDS = {
    './csv/items-%(batch_id)d': {
        'format': 'csv',
    },
}
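
One gotcha worth remembering: ImagesPipeline only downloads images for items that carry an image_urls field (a list of URLs), and it writes the download results back into an images field. The dict yielded by the spider sketch above already has image_urls; if you define an Item class instead, it needs both fields:

import scrapy


class RecipeItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    image_urls = scrapy.Field()  # input: URLs for ImagesPipeline to download
    images = scrapy.Field()      # output: filled in by ImagesPipeline

After that, scrapy crawl recipes (with the placeholder spider name from the sketch above) downloads the images and writes the CSV in one run.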

These settings let me download images and produce a *.csv file with the parsed titles, descriptions and image paths and URLs without writing any extra lines of code. In my opinion, Scrapy is a very powerful tool, and I totally recommend it when you want to scrape some data (just not from Instagram).

The next post is going to be about cleaning up the scraped data and labelling it.
