What about scraping single-page apps, such as Angular or React apps? Does Goutte support this? Is it even possible using PHP? Is there anything that can do this? I've been looking for information on scraping client-side rendered pages, but there is very little out there.
Yes, it is in fact possible with PHP. The tools used for this are called headless browsers. A headless browser behaves like a regular browser (it runs JavaScript, etc.), so it can be used to scrape JavaScript-rendered pages. We take the rendered HTML from a headless browser (driven through Selenium, for example, or PhantomJS), pass it to Goutte's crawler, and are then able to use all of Goutte's crawling functions. This is what I personally use for scraping those types of sites.
At scale, you're almost always better off avoiding headless browsers. Try plain HTTP requests and parse the HTML instead: the data in SPAs is usually loaded from a JSON object embedded in a tag somewhere in the page. I wrote this extension that extracts the data for you:
https://chromewebstore.google.com/detail/kjlhnflincmlpkgahnidgebbngieobod
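To make the "JSON object in a tag" idea concrete, here is a minimal PHP sketch using only the built-in DOM extension. The helper name `extractEmbeddedJson`, the sample markup, and the `__NEXT_DATA__` id (the tag Next.js apps typically use) are assumptions for illustration; the site you are scraping may keep its data under a different id, or in an inline `window.*` assignment that needs different handling.

```php
<?php
// Hypothetical helper: returns the decoded JSON from the first <script> tag
// with the given id, or null if the tag is missing or does not hold a JSON
// object/array.
function extractEmbeddedJson(string $html, string $scriptId): ?array
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query(sprintf('//script[@id="%s"]', $scriptId));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $data = json_decode($nodes->item(0)->textContent, true);
    return is_array($data) ? $data : null;
}

// Stand-in for a response body fetched with a plain HTTP request
// (cURL, Guzzle, Goutte, ...). The markup is made up for this example.
$html = <<<'HTML'
<!DOCTYPE html>
<html><head><title>Demo</title></head>
<body>
<div id="root"></div>
<script id="__NEXT_DATA__" type="application/json">{"props":{"products":[{"name":"Widget","price":9.99}]}}</script>
</body></html>
HTML;

$data = extractEmbeddedJson($html, '__NEXT_DATA__');
echo $data['props']['products'][0]['name'], PHP_EOL; // prints "Widget"
```

If the page does not embed its data this way, check the browser's network tab: the SPA often loads the same data from a JSON endpoint that you can request directly.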