Are you interested in web scraping, or do you want to get more experience with it?
This step-by-step guide explains how to use cURL, or simply curl, with proxy servers. It covers everything from installation to the various options for setting a proxy.
This tutorial does not target any specific proxy service, so it should work with any proxy server. All you need are the server details and credentials.
Let’s get started!
What is cURL?
cURL is a command line tool for sending and receiving data using URL syntax. Let’s look at the simplest example of using curl. Open your terminal or command prompt, type in this command, and press Enter:
curl https://www.google.com
This will get the HTML of the page and print it on the console.
curl https://www.google.com -I
This will print only the response headers, for example:
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Installation
cURL ships with many Linux distributions and with macOS. Windows 10 now includes it as well.
If your Linux distribution does not include it, you can install it with the distribution's package manager. For example, on Ubuntu, open Terminal and run this command:
sudo apt install curl
If you are running an older version of Windows, or if you want to install an alternate version, you can download curl from the official download page.
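To confirm that curl is available before going further, a quick check along these lines may help. This is a minimal sketch: the variable name curl_status is only illustrative, and the version line printed will differ from system to system.

```shell
# Check whether curl is on the PATH and, if so, print its version line.
if command -v curl >/dev/null 2>&1; then
    curl_status="installed"
    # The first line of --version names the curl and libcurl versions.
    curl --version | head -n 1
else
    curl_status="missing"
    echo "curl was not found on the PATH"
fi
echo "curl is $curl_status"
```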
What you need to connect to a proxy
Irrespective of which proxy service you use, you will need the following information:
- proxy server address
- port
- protocol
- username (if authentication is required)
- password (if authentication is required)
In this tutorial, we are going to assume that the proxy server is 127.0.0.1, the port is 1234, the user name is user, and the password is pwd. We will look into multiple examples covering various protocols.
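The pieces listed above combine into a single proxy URL of the form protocol://user:password@host:port. The helper below is a small sketch of that assembly; build_proxy_url is a hypothetical function name, and the values are the placeholder details assumed in this tutorial.

```shell
# Assemble a proxy URL from its parts.
# Arguments: protocol, host, port, then optional user and password.
build_proxy_url() {
    if [ -n "$4" ]; then
        # Credentials supplied: place them before the host.
        printf '%s://%s:%s@%s:%s\n' "$1" "$4" "$5" "$2" "$3"
    else
        # No authentication required.
        printf '%s://%s:%s\n' "$1" "$2" "$3"
    fi
}

proxy_url=$(build_proxy_url http 127.0.0.1 1234 user pwd)
echo "$proxy_url"
```

The sketch does no escaping. If the user name or password contains characters such as @ or :, they would need to be percent-encoded (for example, %40 for @), because curl URL-decodes credentials given in the proxy string.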
NOTE. If you are on a network that uses NTLM authentication, you can use the switch --proxy-ntlm while running curl. Similarly, --proxy-digest can be used for digest authentication. You can look at all the available options by running curl --help. This tutorial will have examples for the scenario when a username and password have to be specified.
The next section will cover the first curl proxy scenario, which happens to be the most common one – HTTP and HTTPS proxy with curl.
Using cURL with HTTP/HTTPS proxy
If you recall, we looked at using curl without a proxy. The same kind of request looks like this:
curl "http://httpbin.org/ip"
This particular website is especially useful for testing out proxies, as the output of this page is the origin IP address. If you are using a proxy correctly, the page will return an IP address that is different from your machine’s, that is, the proxy’s IP address.
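The comparison described above can be scripted. The following is a hedged sketch: it assumes the page returns a JSON body of the form {"origin": "1.2.3.4"}, the helper names origin_ip and proxy_changes_ip are hypothetical, and the addresses are documentation placeholders rather than real results.

```shell
# Extract the address from a JSON body of the form {"origin": "1.2.3.4"}.
# Assumption: the response keeps this shape; adjust the pattern if it differs.
origin_ip() {
    sed -n 's/.*"origin"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeeds when both addresses are non-empty and differ,
# i.e. the proxy changed the origin address seen by the server.
proxy_changes_ip() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Offline demonstration using sample bodies instead of live requests.
direct=$(printf '{\n  "origin": "198.51.100.10"\n}\n' | origin_ip)
proxied=$(printf '{\n  "origin": "203.0.113.7"\n}\n' | origin_ip)

if proxy_changes_ip "$direct" "$proxied"; then
    echo "proxy in use: $direct -> $proxied"
else
    echo "proxy not in effect"
fi
```

With real requests, the two bodies would come from curl -s "http://httpbin.org/ip" run once without a proxy and once with the proxy options covered in the following sections.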
There are multiple ways to run curl with a proxy. The next section will cover sending proxy details as a command line argument.
NOTE. All the command line options, or switches, are case sensitive. For example, -f instructs curl to fail silently, while -F denotes a form to be submitted.
Command line argument to set proxy in cURL
Open the terminal, type the following command, and press Enter:
curl --help
The output is going to be a huge list of options. One of them is going to look like this:
-x, --proxy [protocol://]host[:port]
Note that the x is lowercase, and the switch is case-sensitive. The proxy details can be supplied using the -x or --proxy switch. Both mean the same thing, so the following two commands are equivalent:
curl -x "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
or
curl --proxy "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
NOTE. If there are SSL certificate errors, add -k (note the small k) to the curl command. This will allow insecure server connections when using SSL.
curl --proxy "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip" -k
You may have noticed that both the proxy URL and the target URL are surrounded by double quotes. This is a recommended practice for handling special characters in the URL.
Another interesting thing to note here is that the default proxy protocol is http. Thus, the following two commands do exactly the same thing:
curl --proxy "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
curl --proxy "user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
Using environment variables
Another way to use a proxy with curl is to set the environment variables http_proxy and https_proxy.
Note that the export commands shown here are for macOS and Linux shells. For Windows, see the next section, which explains how to use the _curlrc file.
If you look at the first part of these variable names, it clearly shows the protocol for which these proxies will be used. It has nothing to do with the protocol used for the proxy server itself.
- http_proxy – the proxy will be used to access addresses that use http protocol
- https_proxy – the proxy will be used to access addresses that use https protocol
Simply set http_proxy to the proxy address to use for http targets and https_proxy to the proxy address to use for https targets. Open the terminal and run these two commands:
export http_proxy="http://user:pwd@127.0.0.1:1234"
export https_proxy="http://user:pwd@127.0.0.1:1234"
After running these two commands, run curl normally.
curl "http://httpbin.org/ip"
If you see SSL Certificate errors, add -k to ignore these errors.
Another thing to note here is that exported variables are not limited to curl: they are inherited by every program started from the same shell session. If this behavior is not desired, turn off the proxy by unsetting these two variables:
unset http_proxy
unset https_proxy
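A variable can also be set for a single command only, which avoids having to unset it afterwards. The sketch below uses sh -c in place of curl so that it runs without a real proxy; the variable names child_view and parent_view are illustrative.

```shell
# Start from a clean state so the demonstration is predictable.
unset http_proxy

# Prefixing a command with VAR=value sets the variable for that command only.
# sh -c stands in for curl here; curl would receive the variable the same way.
child_view=$(http_proxy="http://user:pwd@127.0.0.1:1234" sh -c 'printf "%s" "$http_proxy"')

# Back in the calling shell, the variable was never set.
parent_view=${http_proxy:-unset}

echo "command saw: $child_view"
echo "shell sees:  $parent_view"
```

The same pattern applied to curl would be http_proxy="http://user:pwd@127.0.0.1:1234" curl "http://httpbin.org/ip".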
See the next section to set a default proxy for curl only, without affecting other programs.
Configure cURL to always use proxy
If you want a proxy for curl but not for other programs, this can be achieved by creating a curl config file.
For Linux and macOS, open the terminal and navigate to your home directory. If there is already a .curlrc file, open it. If there is none, create a new file. Here is the set of commands that can be run:
cd ~
nano .curlrc
In this file, add this line:
proxy="http://user:pwd@127.0.0.1:1234"
Save the file. Now curl with proxy is ready to be used. Simply run curl normally and it will read the proxy from the .curlrc file.
curl "http://httpbin.org/ip"
On Windows, the file is named _curlrc. This file can be placed in the %APPDATA% directory.
To find the exact path of %APPDATA%, open the command prompt and run the following command:
echo %APPDATA%
This directory will be something like C:\Users\<your_user>\AppData\Roaming. Now go to this directory, create a new file named _curlrc, and set the proxy by adding this line:
proxy="http://user:pwd@127.0.0.1:1234"
This works exactly the same way on Linux, macOS, and Windows.
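The proxy line does not have to live in the default config file. As a hedged sketch, the same line can be written to a separate file and passed to curl with its -K/--config option. The file name proxy.curlrc and the temporary directory are assumptions made for illustration.

```shell
# Create a separate config file, for example one per project.
cfg_dir=$(mktemp -d)
cfg_file="$cfg_dir/proxy.curlrc"

# Same proxy line as in .curlrc; lines starting with # are comments.
cat > "$cfg_file" <<'EOF'
# Hypothetical per-project curl config
proxy="http://user:pwd@127.0.0.1:1234"
EOF

# The file holds credentials, so restrict it to the current user.
chmod 600 "$cfg_file"
cat "$cfg_file"
```

A request using this file would look like curl --config "$cfg_file" "http://httpbin.org/ip". Conversely, passing -q as the first argument makes curl skip the default config file for that one invocation.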
Ignore or override proxy for one request
If the proxy is set globally, or by modifying the .curlrc file, this can still be overridden to set another proxy or even bypass it.
To override the proxy for one request, set the new proxy using the -x or --proxy switch as usual:
curl --proxy "http://user:pwd@1.0.0.1:8090" "http://httpbin.org/ip"
If you want to bypass the proxy altogether for a request, you can pass --noproxy followed by "*". This instructs curl not to use a proxy for any URL.
curl --noproxy "*" "http://httpbin.org/ip"
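If the bypass is needed often, it can be wrapped in a small shell function. This is only a sketch: curl_direct is a hypothetical helper name, not something curl provides.

```shell
# Wrapper that forwards all of its arguments to curl
# while bypassing any proxy configured elsewhere.
curl_direct() {
    curl --noproxy "*" "$@"
}

# Example call (requires network access):
#   curl_direct "http://httpbin.org/ip"
```

Instead of "*", --noproxy also accepts a comma-separated list of hosts, so the bypass can be limited to specific targets.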
If you need to run many curl requests without a proxy but do not want to change the global proxy settings, the following section shows one way to do that.
Bonus tip – turning proxies on and off quickly
This tip is intended for advanced users. If you do not know what a .bashrc file is, you may skip this section.
You can create aliases in your .bashrc file to set and unset the proxies. For example, open your .bashrc file using any editor and add these lines:
alias proxyon="export http_proxy='http://user:pwd@127.0.0.1:1234';export https_proxy='http://user:pwd@127.0.0.1:1234'"
alias proxyoff="unset http_proxy;unset https_proxy"
After adding these lines, save the .bashrc and update the shell to read this .bashrc. To do this, run this command in the terminal:
. ~/.bashrc
Now, whenever you need the proxy, you can turn it on, run one or more curl commands, and then turn it off like this:
proxyon
curl "http://httpbin.org/ip"
curl "http://google.com"
proxyoff
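The same switch can be written with shell functions instead of aliases, which avoids nested quoting and keeps the proxy address in one place. This is a minimal sketch under the tutorial's placeholder details; PROXY_URL and proxystatus are names introduced here for illustration.

```shell
# Single place to change the proxy address (placeholder value).
PROXY_URL="http://user:pwd@127.0.0.1:1234"

# Export both variables so programs started from this shell pick them up.
proxyon() {
    export http_proxy="$PROXY_URL" https_proxy="$PROXY_URL"
}

# Remove both variables again.
proxyoff() {
    unset http_proxy https_proxy
}

# Show the current state of both variables.
proxystatus() {
    echo "http_proxy=${http_proxy:-unset} https_proxy=${https_proxy:-unset}"
}

proxyon
proxystatus
proxyoff
proxystatus
```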
cURL socks proxy
If the proxy server uses the SOCKS protocol, the syntax remains the same:
curl -x "socks5://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
Similarly, socks4://, socks4a://, socks5:// or socks5h:// can be used depending on the SOCKS version.
Alternatively, a SOCKS proxy can be set using the --socks5 switch instead of -x. The command is otherwise the same, except that the username and password are sent using the --proxy-user switch:
curl --socks5 "127.0.0.1:1234" "http://httpbin.org/ip" --proxy-user user:pwd
Again, --socks4, --socks4a or --socks5 can be used, depending on the version.
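The URL schemes and the dedicated switches correspond to each other. The lookup below is a small sketch of that mapping. socks_switch_for is a hypothetical helper, and --socks5-hostname, which is not mentioned above, is the switch counterpart of socks5://'s sibling socks5h://, where the proxy rather than the local machine resolves the host name.

```shell
# Map a SOCKS URL scheme to the equivalent curl switch.
socks_switch_for() {
    case "$1" in
        socks4)  echo "--socks4" ;;
        socks4a) echo "--socks4a" ;;
        socks5)  echo "--socks5" ;;
        socks5h) echo "--socks5-hostname" ;;
        *)       echo "unknown scheme: $1" >&2; return 1 ;;
    esac
}

# Print the full table.
for scheme in socks4 socks4a socks5 socks5h; do
    printf '%s:// -> %s\n' "$scheme" "$(socks_switch_for "$scheme")"
done
```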
Summary
As a closing note, cURL is a very powerful tool for automation, and its proxy support is arguably among the best of any command line tool. As libcurl works very well with PHP, many web applications use it for web scraping projects, making it a must-have for any web scraper.