
Waylon Walker

Posted on • Originally published at waylonwalker.com

Set User Agent on pandas read_csv

I keep a small cars.csv on my website for quickly trying out different pandas operations. It's very handy to keep around to see what an unfamiliar method does, or to give a teammate an example they can replicate.

Hosts switched

I recently switched hosting from Netlify over to Cloudflare. Well, Cloudflare does some work to block requests that it does not think come from a real user. One of these checks is to ensure there is a real user agent on the request.

Not my go-to dataset 😭

This breaks my go-to example dataset.

import pandas as pd

pd.read_csv("https://waylonwalker.com/cars.csv")

# HTTPError: HTTP Error 403: Forbidden

But requests works???

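What's weird is, requests still works just fine! I'm not sure exactly why going through urllib the way pandas does breaks the request, but it does. My best guess is the default user agent: urllib announces itself as Python-urllib, while requests sends python-requests, and Cloudflare seems to block only the first.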

import requests

requests.get("https://waylonwalker.com/cars.csv")

<Response [200]>
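If you want to see what urllib sends when nobody sets a user agent, here is a minimal sketch using only the standard library. It assumes pandas goes through urllib's default opener for http(s) URLs, which is my understanding and not something I have traced through the pandas source. The names `opener` and `default_headers` are just for illustration.

```python
import urllib.request

# build_opener() returns the same kind of opener that urlopen() uses by default.
opener = urllib.request.build_opener()

# addheaders holds the headers attached to every request made by this opener.
default_headers = dict(opener.addheaders)

# Prints something like Python-urllib/3.11, depending on your Python version.
print(default_headers["User-agent"])
```

That value is what the server sees on the request, and it clearly does not look like a browser.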

Setting the User Agent in pandas.read_csv

This fixed the issue for me!

After a bit of googling I realized that this is a common thing, and that setting the user agent fixes it. This is the point where I remembered seeing in the Cloudflare dashboard that they protect against a lot of different attacks. Apparently it treats pd.read_csv as an attack on my Cloudflare Pages site.

pd.read_csv("https://waylonwalker.com/cars.csv", storage_options={"User-Agent": "Mozilla/5.0"})

# success
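As far as I can tell from the pandas docs, for http(s) URLs the storage_options key-value pairs are forwarded to urllib.request.Request as headers (I believe this needs pandas 1.3 or newer). Here is a rough sketch of the equivalent request object, built by hand with the standard library. It only constructs the request and does not hit the network, and it is an approximation of what pandas does, not its actual code.

```python
import urllib.request

url = "https://waylonwalker.com/cars.csv"
storage_options = {"User-Agent": "Mozilla/5.0"}

# Roughly what happens for an http(s) URL: the dict becomes the request headers.
req = urllib.request.Request(url, headers=storage_options)

# urllib stores header names capitalized as "User-agent" internally.
print(req.get_header("User-agent"))  # Mozilla/5.0
```

So the only thing changing is the user agent header, which replaces the urllib default.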

Now my data is back

Now this works again, but it feels like just a bit more typing than I want to do by hand. I might need to look into my Cloudflare settings to see if I can allow this dataset to be accessed by pd.read_csv.
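Until then, one way to save the typing could be a tiny wrapper that fills in the header for me. This is only a sketch: `read_csv_ua` and `DEFAULT_UA` are names I made up for this post, not part of pandas, and it assumes a pandas version where storage_options works for http(s) URLs.

```python
import pandas as pd

# Hypothetical default header, the same value used in the fix above.
DEFAULT_UA = {"User-Agent": "Mozilla/5.0"}


def read_csv_ua(url, **kwargs):
    """Hypothetical helper: call pd.read_csv with a browser-like User-Agent.

    Any storage_options passed by the caller win over the default.
    """
    kwargs.setdefault("storage_options", dict(DEFAULT_UA))
    return pd.read_csv(url, **kwargs)
```

With that in a scratch module, read_csv_ua("https://waylonwalker.com/cars.csv") should behave like the longer call, and every other read_csv keyword still passes straight through.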
