DEV Community

Scraping Facebook groups using Python? Avoid getting blocked with ProxyCrawl

iankalvin ・5 min read

Scraping Facebook may sound easy at first, but I've tried crawling and scraping different Facebook groups several times and ended up with errors and CAPTCHAs most of the time or, worse, banned. For a beginner like me, this is frustrating and eats up a lot of time that could have gone to something more productive.

There are ways to solve or avoid such hindrances when scraping, like solving CAPTCHAs manually or setting a timer in your script to scrape more slowly. Another way around them is to switch your IP every couple of minutes, which can be done via proxy servers or a VPN, but that takes considerably more time and effort.

Luckily, I’ve found a solution that handles most of the issues we normally encounter when scraping. It is easy to use and can be integrated into any of your scraping projects. ProxyCrawl offers an API that lets you scrape the web easily and protects your web crawler against blocked requests, proxy failures, IP leaks, browser crashes, and more. In my experience it is one of the best APIs available, and it works for both small and big projects.

Getting Started

In this article, I want to share how I used ProxyCrawl to crawl Facebook groups with their Crawling API and built-in scraper. We will also cover some useful parameters, such as automatic scrolling, that extract more data per API request.

I will provide a very basic sample API call and code for Python 3 and discuss each part, so you can use it as a baseline for your existing or future projects. The scraper that I will be using can extract information like member count, usernames, members' posts, and much more from a public Facebook group.

Before we start, let’s have a list of things that we will use for this project:

Simple API Call

Now that you have an idea of what we will need to accomplish this task, we can get started.

First, it is important to know that every request to ProxyCrawl’s API starts with the following base URL:

https://api.proxycrawl.com

You will also need an authentication token for every request. ProxyCrawl provides two kinds of tokens when you sign up: the normal token for generic requests, and the JavaScript token, which acts like a real browser.

In this case, we will be using the JavaScript token, since the page must be rendered via JavaScript to scrape Facebook groups properly. A token is added to the request as shown below:

https://api.proxycrawl.com/?token=USER_TOKEN

To make an API call, add the URL-encoded address of the page you wish to crawl, as in this example:

https://api.proxycrawl.com/?token=JS_TOKEN&url=https%3A%2F%2Fwww.facebook.com%2FBreakingNews

This simple line will instruct the API to fetch the full HTML source code of any website that you are trying to crawl. You can make this API request using cURL on your terminal or just open a browser and paste it into the address bar.
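If you would rather build the request URL in code than encode it by hand, here is a minimal sketch using only Python's standard library. The helper name build_api_url is my own and is not part of ProxyCrawl's library; the base URL and the token and url parameters are the ones shown above.

```python
from urllib.parse import urlencode

API_BASE = "https://api.proxycrawl.com/"


def build_api_url(token, target_url, **params):
    """Return a request URL with the target URL percent-encoded.

    build_api_url is a hypothetical helper for illustration only.
    """
    query = {"token": token, "url": target_url}
    query.update(params)  # extra API parameters, if any
    # urlencode percent-encodes every value, so ':' becomes %3A and '/' becomes %2F
    return API_BASE + "?" + urlencode(query)


print(build_api_url("JS_TOKEN", "https://www.facebook.com/BreakingNews"))
```

The printed URL should match the hand-encoded example above, and you can paste it into cURL or a browser the same way.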

Now that I have explained the very basics of making an API call, we can then try to use this knowledge to scrape Facebook groups.

Depending on your project, getting the full HTML source code may not be efficient if you only want to extract a particular data set. You can build your own scraper, but if you are just starting out, or you don’t want to spend your time and resources on building one, ProxyCrawl has several ready-made data scrapers that we can use to scrape data from supported websites like Facebook.

Using their data scraper, we can easily retrieve the following information on most Facebook groups:

title
type
membersCount
url
description
feeds including username, text, link, likesCount, commentsCount
comments including username and text

To get all the information mentioned above, we just need to pass two parameters: &scraper=facebook-group and &scroll=true. With these, the API returns the result in JSON format.

https://api.proxycrawl.com/?token=JS_TOKEN&url=https%3A%2F%2Fwww.facebook.com%2Fgroups%2F198722650913932&scraper=facebook-group&scroll=true

Example output:
[Image: example output]
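Once the JSON comes back, you will probably want just a few of its fields. The sketch below uses the field names from the list above (title, membersCount, feeds, username, text), but it assumes they sit at the top level of the JSON object, which I have not verified against a live response. The sample data is made up for illustration and is not real API output, and summarize_group is a hypothetical helper.

```python
import json


def summarize_group(body):
    """Pull a few headline fields out of a facebook-group scraper response."""
    data = json.loads(body)
    # Assumption: the fields are top-level keys. Adjust the lookups
    # if your response nests them differently.
    feeds = data.get("feeds") or []
    return {
        "title": data.get("title"),
        "membersCount": data.get("membersCount"),
        "posts": [(feed.get("username"), feed.get("text")) for feed in feeds],
    }


# Hypothetical sample for illustration, not real API output.
sample = json.dumps({
    "title": "Example Group",
    "membersCount": 1234,
    "feeds": [{"username": "alice", "text": "Hello", "likesCount": 3}],
})
print(summarize_group(sample))
```

Using .get() everywhere keeps the sketch from crashing when a field is missing, which seems safer while you are still exploring what a given group returns.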

Scraping with Python

ProxyCrawl provides a Python library that anyone can use freely to make this simple API call from Python. The example below shows how we can use it in this project.

First, make sure to download and install the ProxyCrawl API Python class. You can either download it from GitHub or install it from PyPI with pip:

pip install proxycrawl

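from proxycrawl.proxycrawl_api import ProxyCrawlAPI

# Use your JavaScript token here
api = ProxyCrawlAPI({'token': 'YOUR_TOKEN'})

# Pass the plain group URL; the scraper and scroll options go in the second argument
response = api.get('https://www.facebook.com/groups/381067052051677',
                   {'scraper': 'facebook-group', 'scroll': 'true'})

# Print the scraped result only if the request succeeded
if response['status_code'] == 200:
    print(response['body'])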


Note that in this case we do not need to encode the URL, since the library already does it for us.

From this point on, using other parameters would be as easy as adding another option to the GET request.

Let us use scroll_interval in the next example. This parameter makes the scraper scroll for a set amount of time, which in turn gives us more data, as if we were scrolling down the page in a real browser. For example, if we set it to 20, the browser will scroll for 20 seconds after loading the page. The maximum is 60 seconds, after which the API captures the data and returns it to us.

from proxycrawl.proxycrawl_api import ProxyCrawlAPI

api = ProxyCrawlAPI({'token': 'YOUR_TOKEN'})

# scroll_interval: scroll for 20 seconds after the page loads
response = api.get('https://www.facebook.com/groups/381067052051677',
                   {'scraper': 'facebook-group', 'scroll': 'true', 'scroll_interval': 20})

if response['status_code'] == 200:
    print(response['body'])
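Since the interval tops out at 60 seconds, it can be handy to cap the value before sending it. This is only a sketch: scroll_options is a hypothetical helper of mine, not part of the ProxyCrawl library, and it simply builds the same options dictionary used in the example above.

```python
MAX_SCROLL_INTERVAL = 60  # seconds; the maximum stated in this article


def scroll_options(seconds, scraper="facebook-group"):
    """Build the options dict for a scrolling request, capping the interval.

    scroll_options is a hypothetical helper for illustration only.
    """
    if seconds <= 0:
        raise ValueError("scroll interval must be positive")
    return {
        "scraper": scraper,
        "scroll": "true",
        # Never request more than the maximum interval
        "scroll_interval": min(int(seconds), MAX_SCROLL_INTERVAL),
    }


print(scroll_options(20))
print(scroll_options(90))
```

You would then pass the result as the second argument, for example api.get(group_url, scroll_options(20)), where group_url is whichever group you are scraping.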

As you may have noticed in the code, we get a status code back each time we send a request to ProxyCrawl. The request is a success if both pc_status and original_status are 200. A failed request returns a different status code, such as 503. ProxyCrawl does not charge for failed requests, so if a request fails for some reason, you can simply retry the call.
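Retrying by hand gets tedious, so here is a minimal sketch of a retry wrapper. The function name, the attempt count, and the growing delay are all my own assumptions rather than anything ProxyCrawl prescribes. It only relies on the response being a dictionary with a status_code key, as in the examples above.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, delay_seconds=2.0):
    """Call fetch() until it returns status_code 200 or attempts run out.

    fetch is any zero-argument callable returning a dict with 'status_code'.
    Returns the last response received, successful or not.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = None
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        if response["status_code"] == 200:
            return response
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt
            time.sleep(delay_seconds * attempt)
    return response
```

With the library from the earlier examples, a call could look like fetch_with_retry(lambda: api.get(group_url, options)), where group_url and options are whatever you were already passing to api.get. Check the status code of the returned response afterwards, since the last attempt may still have failed.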

The example output below shows a successfully scraped public Facebook group.

[Image: example output]

Conclusion

There you have it: scraping Facebook content in just a few lines of code. At the moment, ProxyCrawl only offers a Facebook scraper for groups, but you can use the Crawling API if you wish to crawl other pages.

Remember, you can use any programming language that you are familiar with, and this can be integrated into any of your existing systems. I have found the ProxyCrawl API stable and reliable enough to serve as the backbone of any of your apps. They also offer great support for all their products, which is why I’m happy using their service.

I hope you have learned something new from this article. Do not forget to sign up at ProxyCrawl to get your token if you want to test this on your end. The first 2000 requests are free of charge; just make sure to use the links found in this guide. :)

Discussion

AlphaSierra

You could have tried Selenium (+ chromedriver) with BeautifulSoup and requests.

yellow1912

The problem is that most services like Facebook will try to block you if you go over the rate limit. In the end you may still have to pay for a proxy service like this one to make sure you don't get blocked. There are also free proxies out there, but in my experience they are unreliable.

AlphaSierra

That's why I said to use Selenium. It fools servers into thinking that an actual user is browsing, although this will not be feasible if your internet connection is too slow. Check out this project of mine where I used Selenium to scrape Amazon: github.com/Shetty073/amazon-top-de...

Edit: Also, instead of scraping, please check out Facebook's API; you might get what you want easily without scraping.

yellow1912

I'm not sure. Perhaps my use case is different. I scrape Instagram images for my users (from their own accounts). Since there are so many users and so many accounts, I always end up going over the rate limit.

Haris Secic

I'm guessing they block you because it's against the T&C, so this post might be illegal 🙂

NAS

I'm guessing proxycrawl isn't free

🦄N B🛡

Filed under "Further Evidence that the Internet is Forever."