Front-end developer, keen on JavaScript, HTML5, CSS3, JAMstack, React, Gatsby, GraphQL, Web Accessibility, and UX/UI design principles.
Trained with Vets Who Code https://vetswhocode.io
Since there is no native way to do it, here is a small bash script:
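A minimal sketch of what such a script could look like with curl (the function name and URL handling are illustrative; it simply dumps the static markup):

```shell
#!/usr/bin/env bash
# Sketch: print a page's static HTML to stdout with curl.
# No JavaScript is executed, so you only get the server-rendered markup.
get_html() {
  curl -s "$1"
}

# Usage (hypothetical URL):
# get_html https://example.com
```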
Then you have the document, yes. But you haven't parsed or scraped anything, and you don't interpret the JavaScript on the page; you just get the static HTML. That's not what this tutorial is about ;)
I understand. I'm still a noob. I did a code challenge recently where, without using any modules, I had to figure out how to print the HTML of a webpage to stdout. You're right, it works great on a static site. I tried with
curl -s "$1" | grep -Po '(?<=href=")[^"]*'
and I almost got everything. Thanks for the tutorial.
It's a different use case. With Puppeteer you can also scrape content that's rendered with JavaScript on the client. A lot of applications are client-side only, and scraping those with curl is not possible.
Also, it's way easier to write DOM selectors than regular expressions. Imagine, instead of just getting all links as in this simple example, getting all links of every first paragraph inside a div, if that div is inside an article tag. Good luck writing a regular expression for that. The selector is still easy to write and can be used within the page context with Puppeteer.
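As a rough sketch, that kind of selector-based scrape might look like this in Puppeteer (the URL is a placeholder, the selector is one reading of "first paragraph inside a div inside an article", and it assumes puppeteer is installed via npm):

```javascript
// Sketch: collect links with a DOM selector in Puppeteer.
// Assumes `npm install puppeteer`; https://example.com is a placeholder URL.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // "All links of every first paragraph inside a div, inside an article tag"
  // expressed as a CSS selector instead of a regular expression:
  const links = await page.$$eval(
    'article div p:first-of-type a',
    anchors => anchors.map(a => a.href)
  );

  console.log(links);
  await browser.close();
})();
```

The selector runs inside the page context, so it sees the DOM after client-side JavaScript has rendered, which is exactly what curl cannot do.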