How to Scrape Craigslist at Scale Using Python

Mohan Ganesan. Originally published at proxiesapi.com.

This quick tutorial will show you how to do two things.

How to scrape jobs, events, services, gigs, resumes, community, housing, and other listings on Craigslist.
How to avoid getting IP blocked by Craigslist.

To scrape Craigslist, we will be using python-craigslist, an open-source Craigslist wrapper for Python.

First, we need to download and extract the python-craigslist module source from https://github.com/juliomalegria/python-craigslist

Do not install it with pip as the project page recommends. We are going to modify the source later so that it can scale, and that requires a local copy of the code.

Once extracted, open a terminal, cd to the folder, and install the code manually.

Now import the module into your code, for example the CraigslistHousing class. Note that this imports only the Housing section crawler.
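As a minimal sketch, the import can be guarded so that a missing install fails with a readable message. `CraigslistHousing` is the class name used in the python-craigslist README; the helper `craigslist_available` is a hypothetical name introduced here for illustration.

```python
import importlib.util


def craigslist_available() -> bool:
    """Return True if the manually installed `craigslist` package is importable."""
    return importlib.util.find_spec("craigslist") is not None


if craigslist_available():
    # Housing-section crawler, as named in the python-craigslist README.
    from craigslist import CraigslistHousing
    print("Loaded", CraigslistHousing.__name__)
else:
    print("craigslist package not found; install it from the extracted source first")
```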

Here are more subclasses you can import depending on your needs.
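The subclasses can be summarized as a lookup table. The class names below are the ones listed in the python-craigslist README, so treat them as assumptions to check against your checkout; `SECTION_CLASSES` and `load_section_class` are hypothetical helpers, not part of the library.

```python
import importlib

# Section name -> class name, as listed in the python-craigslist README.
# Treat the names as assumptions and verify them against your copy of the source.
SECTION_CLASSES = {
    "community": "CraigslistCommunity",
    "events": "CraigslistEvents",
    "for_sale": "CraigslistForSale",
    "gigs": "CraigslistGigs",
    "housing": "CraigslistHousing",
    "jobs": "CraigslistJobs",
    "resumes": "CraigslistResumes",
    "services": "CraigslistServices",
}


def load_section_class(section: str):
    """Import and return the crawler class for a section (requires the package)."""
    module = importlib.import_module("craigslist")
    return getattr(module, SECTION_CLASSES[section])
```

For example, `load_section_class("jobs")` would return the jobs crawler class once the package is installed.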
Going back to the housing subclass, here is how you can scrape housing listings in San Francisco.
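A sketch of such a scraper, modeled on the example in the python-craigslist README. The site, area, category, and filter values are illustrative, and `SF_ROOM_SEARCH` and `scrape_sf_rooms` are hypothetical names chosen for this example.

```python
# Search settings modeled on the python-craigslist README example;
# the filter values are illustrative.
SF_ROOM_SEARCH = {
    "site": "sfbay",      # San Francisco Bay Area site
    "area": "sfc",        # city of San Francisco
    "category": "roo",    # rooms and shares
    "filters": {"max_price": 1200, "private_room": True},
}


def scrape_sf_rooms(limit=10):
    """Print up to `limit` housing listings. Needs network access."""
    # Imported lazily so this file can be loaded even before the package is installed.
    from craigslist import CraigslistHousing

    crawler = CraigslistHousing(**SF_ROOM_SEARCH)
    for result in crawler.get_results(sort_by="newest", limit=limit):
        print(result)
```

To use it as a script, add a call to `scrape_sf_rooms()` at the bottom of the file.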
Save this in a file called craigslist_scraper.py and run it with python craigslist_scraper.py.

This will return all the listings that match the filters you have set.
Notice that the filter names and values are specific to the category. The library has a show_filters method that lists all available filters for a category.
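A minimal sketch of that call, assuming the `show_filters` classmethod described in the python-craigslist README. Depending on the library version, it may fetch the filter list over the network. The wrapper name `print_housing_filters` is hypothetical.

```python
def print_housing_filters():
    """Print the filters that the housing crawler accepts."""
    # Imported lazily; requires the python-craigslist package.
    from craigslist import CraigslistHousing

    # show_filters is a classmethod documented in the python-craigslist README.
    CraigslistHousing.show_filters()
```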
So far, so good. But this library works only for small-scale projects. If you scrape a lot of data, Craigslist will likely block your IP address. In that case, you need a rotating proxy service that handles IP rotation and User-Agent spoofing so that you can download large quantities of data.

Our rotating proxy service, Proxies API, provides a simple API designed to take care of IP blocking:

Millions of high-speed rotating proxies located all over the world
Automatic IP rotation
Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions)
Automatic CAPTCHA solving

Hundreds of our customers have solved the headache of IP blocks this way, and the whole service is accessed through a single API.
Register for free and get your free API Key here before proceeding with the next steps.

Great. Now that you have a Proxies API AuthKey, you are all set.

Let's now edit the source code so that the module routes all of its requests through the Proxies API endpoint.

To do that, cd to the craigslist directory and open the base.py file for editing.

The rerouting needs to happen whenever the requests_get function is called. It is called at two important points in the code.

Here is how we are going to change it. The original function calls requests.get directly; the modified version sends the request to the proxy endpoint instead.
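The change can be sketched as follows. The endpoint URL and the `auth_key` and `url` parameter names are assumptions about the Proxies API request format, so confirm them in your dashboard. `proxied_url` is a hypothetical helper, and this `requests_get` is a simplified stand-in for the function in base.py, not the library's actual code.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; confirm both in your Proxies API dashboard.
PROXY_ENDPOINT = "http://api.proxiesapi.com/"
AUTH_KEY = "YOUR_AUTH_KEY"  # placeholder, replace with your own key


def proxied_url(target_url, params=None, auth_key=AUTH_KEY):
    """Wrap a Craigslist URL (plus its query filters) in a proxy endpoint URL."""
    if params:
        # Attach the filters to the target URL so they reach Craigslist,
        # not the proxy endpoint.
        target_url = target_url + "?" + urlencode(params, doseq=True)
    return PROXY_ENDPOINT + "?" + urlencode({"auth_key": auth_key, "url": target_url})


def requests_get(url, params=None, **kwargs):
    """Simplified, hypothetical stand-in for requests_get in base.py."""
    import requests  # third-party HTTP library

    # Drop a `logger` keyword if the caller supplies one; requests.get does not accept it.
    kwargs.pop("logger", None)
    return requests.get(proxied_url(url, params), **kwargs)
```

Folding the query filters into the target URL before encoding matters here: if they were passed on as `params`, they would be attached to the proxy request rather than to the Craigslist search.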
The same change is needed in a couple of other places, and the new dependencies need to be imported.

The simplest approach is to replace the entire original code in base.py with the modified version.

Once base.py is updated, run the scraper again. With requests going out through rotating IPs, you should no longer run into IP blocks from Craigslist.

You can verify that the requests are passing through the Proxies API endpoint by going to your dashboard at https://app.proxiesapi.com/index.php and checking the number of successful requests logged.
