
My machine-learning pet project. Part 2. Preparing my dataset

My pet project is about food recognition. More info here.

[Image: data sets in tutorials vs. data sets in the wild. From Towards AI on Twitter]

The first thing that came to my mind was to scrape stuff from some Instagram account. Have you seen how many recipes are there? Millions. And they have descriptions from which I could extract labels. I thought it would be easy. I managed to scrape about 10 posts using this:

instaloader profile nytcooking --no-videos --no-metadata-json --slide 1 --post-filter='date_utc >= datetime(2012, 5, 21)' --sanitize-paths
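
For reference, the same thing via instaloader's Python API looks roughly like this (a sketch following the takewhile pattern from the library's own docs; it runs into the same rate limits described below):

from datetime import datetime
from itertools import takewhile

import instaloader

# mirror the --no-videos and --no-metadata-json flags from the CLI call
L = instaloader.Instaloader(download_videos=False, save_metadata=False)
profile = instaloader.Profile.from_username(L.context, 'nytcooking')

# get_posts() yields newest-first, so stop once we pass the cutoff date
for post in takewhile(lambda p: p.date_utc >= datetime(2012, 5, 21), profile.get_posts()):
    L.download_post(post, target=profile.username)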

So far so good. But then I tried to scrape a bit more, e.g. posts for three months, and started getting 429 Too Many Requests errors. Creating new Instagram profiles didn't help. How could I, alone, beat an army of well-paid developers? I needed another approach.

I chose one of the recipe websites I used to visit. It has good photos and descriptions, and it's easy to scrape. I picked Scrapy to do the job: it's actively maintained (last commit: 5 days ago), has good documentation and readable code.

I saved one sample webpage to my desktop and launched scrapy shell:

scrapy shell ../Desktop/test.html

This helped me prepare a bunch of selectors like this one:

recipe.xpath('./p[@class="material-anons__des"]//text()').get()
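
Inside the shell you get a response object to experiment with. A session looked roughly like this (the material-anons class names are specific to the site I scraped, so treat them as placeholders):

# grab the first recipe block on the page
recipe = response.xpath('//div[@class="material-anons"]')[0]

# then poke at it until the selectors return what you expect
recipe.xpath('./p[@class="material-anons__des"]//text()').get()  # description text
recipe.xpath('.//img/@src').get()                                # image URL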

Then I created a project template using a command like this:

scrapy startproject myproject [project_dir]

I also took a look at their example repo on GitHub.
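
My spider ended up looking roughly like this (a sketch, not the exact code: the spider name, start URL and page structure are placeholders):

import scrapy


class RecipesSpider(scrapy.Spider):
    name = 'recipes'
    start_urls = ['https://example.com/recipes']  # placeholder for the real site

    def parse(self, response):
        # one block per recipe card on the listing page
        for recipe in response.xpath('//div[@class="material-anons"]'):
            image_url = recipe.xpath('.//img/@src').get()
            yield {
                'title': recipe.xpath('.//h2//text()').get(),
                'description': recipe.xpath('./p[@class="material-anons__des"]//text()').get(),
                # image_urls is the field ImagesPipeline reads (see settings below)
                'image_urls': [response.urljoin(image_url)] if image_url else [],
            }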

The only thing I struggled with for a while was item pipelines. At one point I copied some snippets from Stack Overflow into settings.py, and Scrapy started to complain, something like "file A now has an error". I went to the Scrapy repo on GitHub and found a nice little comment in that file explaining what should actually go into the settings file nowadays. So this is what I added to settings.py:

FILES_STORE = './csv'      # where FilesPipeline would store downloaded files
IMAGES_STORE = './images'  # where ImagesPipeline stores downloaded images
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
FEEDS = {
    './csv/items-%(batch_id)d': {
        'format': 'csv',
    },
}
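
One gotcha worth remembering: ImagesPipeline only downloads images for items that carry an image_urls field (a list of URLs), and it writes the download results back into an images field. The dict yielded by the spider sketch above already has image_urls; if you define an Item class instead, it needs both fields:

import scrapy


class RecipeItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    image_urls = scrapy.Field()  # input: URLs for ImagesPipeline to download
    images = scrapy.Field()      # output: filled in by ImagesPipeline

After that, scrapy crawl recipes (with the placeholder spider name from the sketch above) downloads the images and writes the CSV in one run.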

These settings let me download images and produce a *.csv file with the parsed titles, descriptions and image paths and URLs without writing any extra lines of code. In my opinion, Scrapy is a very powerful tool, and I totally recommend it when you want to scrape some data (just not from Instagram).

The next post is going to be about cleaning up the scraped data and labelling it.
