DEV Community

The Easy Way to Scrape Instagram Using Python Scrapy & GraphQL

Ian Kerins on August 06, 2020

After e-commerce monitoring, building social media scrapers to monitor accounts and track new trends is the next most popular use case for web scra...
 
Vlad

Looks like Instagram doesn't work via Scraper API anymore. But it still works on webscraping.ai

djk50

instagram.com/explore/tags//

Do you know how to get the posts for a tagged username? Simply replacing the former URL with this new one doesn't seem to work.

Vlad

instagram.com/explore/tags/sport/?... works for hashtags, and instagram.com/nike/tagged/?__a=1 works for usernames, but the latter requires login.

karisjochen

Do you mind sharing how you adjusted the code to use webscraping.ai instead? Thanks!

Vlad • Edited

Sure, here it is: gist.github.com/Drakula2k/035cc5bd...
I also fixed a couple of bugs there.

karisjochen

Thanks so much for sharing! After making the changes I am unfortunately still getting blocked by the robots.txt file. Is this code still working for you?

Vlad

Yes, it's working. You can disable the robots.txt check by setting ROBOTSTXT_OBEY = False in your settings.py. The requests go through an API, so there is no need for the robots.txt check.
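For reference, a minimal sketch of that setting as it would appear in a Scrapy project's settings.py. Everything else in the file is assumed to stay unchanged.

```python
# settings.py (Scrapy project settings)
# Skip the robots.txt check: the spider talks to a scraping API endpoint,
# not to arbitrary sites discovered while crawling.
ROBOTSTXT_OBEY = False
```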

karisjochen

Incredible, thank you! It worked! So is it always a good idea to set ROBOTSTXT_OBEY = False, considering we don't want to be stopped?

Vlad

Not always. Obeying robots.txt (ROBOTSTXT_OBEY = True) is useful when you're building something like a search engine that may request all sorts of random URLs posted on the Internet. In that case, robots.txt is a good way to skip non-public pages.

But if you're requesting specific, predefined URLs or using an API, the robots.txt check is not so useful and may block access to the API.
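To make the difference concrete, here is a small sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies, the example.com hosts, and the `is_allowed` helper are hypothetical illustrations, not taken from any real service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that disallows everything, as an API host might serve.
API_ROBOTS = "User-agent: *\nDisallow: /\n"
# Hypothetical robots.txt that only hides a private section of a site.
SITE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

api_ok = is_allowed(API_ROBOTS, "https://api.example.com/html?url=x")
public_ok = is_allowed(SITE_ROBOTS, "https://example.com/public/page")
private_ok = is_allowed(SITE_ROBOTS, "https://example.com/private/page")
print(api_ok, public_ok, private_ok)
```

With a blanket `Disallow: /` on the API host, a crawler that obeys robots.txt would refuse to call the API at all, which is the situation described above.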

kaiwangyu

Thanks a lot, I learned a ton from your code... but I'm still confused by the query_hash. May I ask how you get this constant for this type of query, please?

Vlad

Open the Inspector in Chrome, visit Instagram, and scroll through the posts. You'll see the same GraphQL queries with query_hash in the network requests.
I'm not sure what the query_hash value means exactly, but it seems to be static for each type of query.
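As a rough sketch of how such a request URL can be put together once you have copied a query_hash from the Inspector: the endpoint path, the `id` variable, and the `build_graphql_url` helper below are assumptions for illustration and should be checked against the requests your own browser shows. The hash is the one quoted in this thread, and the user id is a placeholder.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; confirm it against the requests in the browser's Network tab.
GRAPHQL_ENDPOINT = "https://www.instagram.com/graphql/query/"


def build_graphql_url(query_hash: str, variables: dict) -> str:
    """Build a GraphQL GET URL from a query_hash and a variables dict."""
    params = {
        "query_hash": query_hash,
        # The variables travel as a compact JSON string in the query string.
        "variables": json.dumps(variables, separators=(",", ":")),
    }
    return f"{GRAPHQL_ENDPOINT}?{urlencode(params)}"


url = build_graphql_url(
    "e769aa130647d2354c40ea6a439bfc08",  # hash quoted in this thread
    {"id": "123456", "first": 12},  # hypothetical user id and page size
)
print(url)
```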

kaiwangyu • Edited

Ohhh, I see, it's a constant (the same every time I scroll down the profile), but for me it's a different value, not 'e769aa130647d2354c40ea6a439bfc08'. By the way, thank you so much. I'm a beginner with Scrapy. Do you suggest any book or tutorial for learning advanced Scrapy-based projects? I already bought this book.

Kai
Merry Christmas
Regards

Vlad

They may have changed something, but the old value still works too, it seems.
I'm not a specialist in Scrapy, but generally, I'd read official docs (docs.scrapy.org/en/latest/) and then start doing some projects using it and learn from them.

abbas53333 • Edited

It works like a charm. Thank you sooooooo much. But I have 2 questions:
1) How do we include the username, so we can tell which user each post belongs to?
2) How can we get basic information such as name, bio, handle, number of followers, number of following, and media count?

If this works for all that information, I might need to subscribe to Scraper API 1+ Million xD

Thank you in advance

Comment deleted
 
abbas53333

Hey there, you have to download Python and install something called Scrapy, which is a framework for Python. I would recommend watching some videos on YouTube to learn it, and I suggest starting with this 25-episode tutorial:
youtube.com/watch?v=ve_0h4Y8nuI&li... This channel is very good. Follow it and you'll be able to get started!
Have a good day

GhostGardens

Hi there! Great post...it answers a lot of questions.

Small thing, though: the "likes" count and comment count aren't working properly. I'm assuming it's because Instagram's page is a near-constant moving target. Any hints on how to resolve this?

Thanks very much for your time!

jacksonbull87

The likes count isn't working for me either. It's just giving me NaN values. Any idea how to fix this?
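One possible cause, which is an assumption and not confirmed anywhere in this thread, is that the like count now sits under a different key than the code expects, so the lookup comes back empty and ends up as NaN. A defensive lookup along these lines can at least make that visible. The key names and both helpers (`dig`, `like_count`) are hypothetical, so inspect a live response to confirm the real structure.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts by key, returning `default` if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def like_count(node: dict):
    """Try several candidate locations for the like count of one post node."""
    # Candidate key names are assumptions; check them against a real response.
    for edge in ("edge_media_preview_like", "edge_liked_by"):
        count = dig(node, edge, "count")
        if count is not None:
            return count
    return None  # Explicitly missing, rather than a silent NaN later on.


sample = {"edge_liked_by": {"count": 42}}  # hypothetical post node
likes = like_count(sample)
missing = like_count({"unrelated": {}})
print(likes, missing)
```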

Mayank Bali

Hey, this code is giving me an error:

Ignoring response <403 https://api.webscraping.ai/html?api_key=45299f85b2302dd84a9f53e5a799114e&proxy=residential&timeout=20000&url=https%3A%2F%2Fwww.instagram.com%2Fnike%2F%3Fhl%3Den>: HTTP status code is not handled or not allowed

Can anyone help me out here?

Ian Kerins

The code in the article is designed to use scraperapi.com as the proxy, but you are using webscraping.ai. You need to adapt the code to that proxy: the 403 error suggests that they use a different authentication method for their API.
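A sketch of one way to keep the provider-specific part in a single place, so switching proxies only means changing one argument. The endpoint shapes are assumptions: the webscraping.ai one simply mirrors the URL in the error message above, and the scraperapi.com one should be checked against that provider's documentation. The `PROVIDERS` table and `get_url` helper are hypothetical names.

```python
from urllib.parse import urlencode

# Assumed endpoint shapes; verify both against each provider's documentation.
PROVIDERS = {
    "scraperapi": ("http://api.scraperapi.com/", {}),
    "webscraping_ai": (
        "https://api.webscraping.ai/html",
        {"proxy": "residential", "timeout": 20000},
    ),
}


def get_url(url: str, api_key: str, provider: str = "scraperapi") -> str:
    """Wrap a target URL in the chosen provider's API URL."""
    base, extra = PROVIDERS[provider]
    params = {"api_key": api_key, **extra, "url": url}
    return f"{base}?{urlencode(params)}"


wrapped = get_url(
    "https://www.instagram.com/nike/?hl=en",
    "YOUR_API_KEY",  # placeholder, not a real key
    provider="webscraping_ai",
)
print(wrapped)
```

If the provider still answers with a 403 for a correctly shaped URL, the API key itself or the account's plan is the next thing to check.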

vasana12

Hi. This is a very helpful article.
What does the variable "first" in the dictionary mean? I am making a hashtag-based crawler, and I'm having trouble setting the value of "first". Can you explain the criteria for setting it?

karisjochen

It appears I am getting stopped by Instagram's robots.txt file. Any ideas on how to adjust the code to circumvent this?

thedukeofnada

Saved my life with this script. Is there a way to extract the actual user comments and not just the count, i.e. username, text, date, and time?