Now we can get to the fun part: web scraping with axios. This is part three of the Learn to Web Scrape series. It builds on the previous two posts, which covered using Cheeriojs to parse html and saving the data to csv.
The tools and getting started
I'll include this section in every post of this series. It goes over the tools that you will need to have installed. I'm going to try to keep the list to a minimum so you don't have to install a bunch of extra things.
Nodejs – This runs javascript. It's very well supported and generally installs in about a minute. You'll want to download the LTS version, which is 12.13.0 at this time. I would recommend just hitting next through everything. You shouldn't need to check any boxes, and you don't need to do anything further with it for now.
Visual Studio Code – This is just a text editor. 100% free, developed by Microsoft. It should install very easily and does not come with any bloatware.
You will also need the demo code referenced at the top and bottom of this article. You will want to hit the “Clone or download” button and download the zip file and unzip it to a preferred location.
Once you have it downloaded and with Nodejs installed, you need to open Visual Studio Code and then go File > Open Folder and select the folder where you downloaded the code.
We will also be using the terminal to execute the commands that will run the script. In order to open the terminal in Visual Studio Code, you go to the top menu again and go Terminal > New Terminal. The terminal will open at the bottom looking something like (but probably not exactly like) this:
It is important that the terminal is opened to the actual location of the code or it won't be able to find the scripts when we try to run them. In your side navbar in Visual Studio Code, without any folders expanded, you should see a > src folder. If you don't see it, you are probably at the wrong location and you need to re-open the folder at the correct location.
After you have the package downloaded and you are at the terminal, your first command will be npm install. This will download all of the necessary libraries required for this project.
Web scraping with axios
It’s web scrapin’ time! Axios is an extremely simple package that allows us to call to other web pages from our location. In our previous two posts in the series we used a sampleHtml file from which we would parse the data. In this post, we use sample html no longer!
So…after doing all the setup steps above, we just type…
const axiosResponse = await axios.get('http://pizza.com');
const $ = cheerio.load(axiosResponse.data);
And we’re done. Post over.
Didn’t believe that it was really over? Okay, fine, you’re right. There is more that I want to talk about, but essentially that is it. Get the axiosResponse and pull the data from it, and you’ve got the html that you want to parse with Cheeriojs.
Javascript promises
I don’t want to go into this in depth, but a large tenet of Axios is that it is promise-based. Nodejs is asynchronous by default. That means that it doesn’t wait for the code before it to finish before it starts the next bit of code. Because web requests can take varying amounts of time to complete (if a web page loads slowly, for example), we need to do something to make sure our web request is complete before we try to parse html that wouldn’t be there yet.
Enter promises. The promise is given (web request starts) and, when using the right keywords (async/await in this case), the code pauses there until the promise is fulfilled (web request completes). This is very high level, and if you want to learn more about promises, google it. There are a ton of guides about them.
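To see the idea without any network involved, here's a toy sketch. The fakeRequest function is invented just for this demo — it stands in for axios.get by resolving with some html after a short delay, the way a real request eventually does:

```javascript
// fakeRequest is made up for illustration -- it imitates axios.get by
// returning a promise that resolves with html after a short delay.
function fakeRequest(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve('<html><title>Fake page for ' + url + '</title></html>'), 50);
  });
}

(async () => {
  // Without await here, `html` would be a pending Promise, not the page text
  const html = await fakeRequest('http://pizza.com');
  console.log(html.includes('Fake page')); // true -- the code paused until the promise resolved
})();
```

Drop the await and you'd be calling .includes on a Promise object, which is exactly the "html that wouldn't be there yet" problem.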
The last part about promises is the keyword part. There are other ways to signal to Nodejs that we are awaiting a promise's completion, but the one I want to talk quickly about here is async/await.
Our code this time is surrounded in an async block. It looks like this:
(async () => {
// awesome code goes here
})();
This tells node that within this block there will be asynchronous code and it should be prepared to handle promises that will block that code. Important! At the very end of the block, notice the (). That makes this function call itself. If you don’t have that there, you’ll run your script and nothing will happen.
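If the trailing () seems strange, here's a tiny sketch comparing the two forms (the function name main is mine, just for illustration). A named async function does nothing until you call it; the self-invoking form defines and calls in one step:

```javascript
// Version 1: a named async function. Defining it runs nothing -- you must call it.
async function main() {
  return 'ran';
}
main(); // delete this line and the script exits having done nothing

// Version 2: the self-invoking form from above -- that trailing () IS the call
(async () => {
  console.log('this block runs immediately');
})();
```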
Once we have a block, then inside that we can use awaits liberally. Like this:
(async () => {
const axiosResponse = await axios.get('http://pizza.com');
const $ = cheerio.load(axiosResponse.data);
// Search by element
const title = $('title').text();
console.log('title', title);
})();
Now you can just enter any url you are looking for within that axios.get('some url here') and you’re good to go! Doing web scraping and stuff!
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Axios. Jordan Teaches Web Scraping appeared first on JavaScript Web Scraping Guy.