I read a lot of articles, and when I don't have the time to go through one I do three things:
- Save to Medium
- Save to Instapaper
- Save to Notion
What happens after that is that Notion/Medium/Instapaper scrape the page, so when I do have the time I don't have to hunt for the link; I just open one of those platforms and the article is saved under my profile.
So my question is how are they able to achieve that?
Because each website's HTML document is different, and my understanding of scraping is that you have to know what the document structure looks like to collect the relevant info. For example:
- You might opt to use a class name like `.article`, which houses the article body, but if another developer uses a CSS-in-JS library (which generates hashed class names) then that won't be effective.
- Another way: the developer might use an HTML5 semantic tag like `<article>` to house the article body (Chrome's reader mode works when you do it this way), but another developer might use generic `<div>`s instead.
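To make the structure-based approach concrete, here is a minimal sketch using only Python's stdlib `html.parser`. The markup, the `ArticleExtractor` name, and the choice of targeting `<article>` (or a literal `class="article"`) are all my own illustrations, not how any of those services actually work:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collect text inside an <article> tag (or a tag with class="article")."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or tag == "article" or "article" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

html = '<div><nav>Menu</nav><article><h1>Title</h1><p>Body text.</p></article></div>'
p = ArticleExtractor()
p.feed(html)
print(" ".join(p.chunks))  # Title Body text.
```

This works only when the author cooperates by using `<article>` or a predictable class name, which is exactly the fragility you're describing.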
I asked one of my friends, and he mentioned they might be doing some analysis on the page to figure out where an article starts and ends.
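Your friend is describing what tools like Mozilla's Readability (the library behind Firefox's reader mode) do: score each block of the page by how much prose it contains versus boilerplate like navigation links, then keep the highest-scoring block. A toy version of that idea, with a made-up scoring rule (character count of prose, minus character count of link text), might look like this:

```python
from html.parser import HTMLParser
from collections import defaultdict

class BlockScorer(HTMLParser):
    """Crude readability-style scorer: credit each container element for
    the prose it holds, penalising text that sits inside links."""
    def __init__(self):
        super().__init__()
        self.stack = []                 # open containers, as (tag, id) pairs
        self.in_link = 0
        self.scores = defaultdict(int)
        self.texts = defaultdict(list)
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        if tag in ("div", "article", "section"):
            self.stack.append((tag, self._next_id))
            self._next_id += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if tag in ("div", "article", "section") and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            container = self.stack[-1]
            # Long prose scores positively; link text counts against.
            self.scores[container] += -len(text) if self.in_link else len(text)
            if not self.in_link:
                self.texts[container].append(text)

    def best(self):
        """Return the text of the highest-scoring container."""
        winner = max(self.scores, key=self.scores.get)
        return " ".join(self.texts[winner])

html = """
<div><a href="/">Home</a> <a href="/about">About</a></div>
<div>This paragraph is the actual story, with enough prose
to outweigh the navigation links elsewhere on the page.</div>
"""
s = BlockScorer()
s.feed(html)
print(s.best())
```

The nav block scores negatively (all link text) while the prose block scores highly, so `best()` returns the story. Real implementations add many more signals (paragraph density, comma counts, class-name hints like `comment` or `sidebar`), but this is the core trick: no per-site knowledge needed, just statistics about which part of the DOM looks like an article.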