It didn’t take long to realize that the APIs weren’t the only limitation: the RSS feed itself was limited too. After publishing a bunch of stories to our publication, we noticed that stories were being removed from the Sense search. Would you like to guess why? The RSS feed ONLY GIVES THE LATEST 10 STORIES!!! There’s a lesson in load testing for you.
I was now stuck with two problems:
I decided to take another look at possible solutions. My brain wouldn’t stop iterating “there’s got to be a better way than this.”
After doing some more digging and searching, I learned about PubSubHubbub. Yes, that’s really the name. You might assume it’s some sort of company or product name, but it’s actually a protocol that’s pretty similar to webhooks.
Basically, you subscribe to a PubSubHubbub feed by giving the provider a callback URL. The provider then sends a GET request containing a challenge value to the callback URL, and your endpoint confirms the subscription by echoing that challenge back.
Once this process is done, the provider will send a POST request to the callback URL any time new content is published to the feed. Sounds perfect!
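To make the handshake concrete, here is a minimal sketch of a callback endpoint in Node. It follows the PubSubHubbub 0.4 flow, in which the hub verifies a subscription with a GET request carrying `hub.challenge` and delivers updates with a POST. The function names, the topic URL, and the choice of status codes for rejected requests are assumptions made for illustration, not code from the original project.

```javascript
const http = require('http');
const { URL } = require('url');

// Decide how to answer a single request from the hub.
// Returns { status, body } so the logic can be exercised without a network.
function handleHubRequest(method, requestUrl, body, expectedTopic) {
  // req.url is only a path, so a dummy base is needed to parse the query.
  const url = new URL(requestUrl, 'http://localhost');

  if (method === 'GET') {
    // Verification of intent: echo hub.challenge if the topic is one we asked for.
    const mode = url.searchParams.get('hub.mode');
    const topic = url.searchParams.get('hub.topic');
    const challenge = url.searchParams.get('hub.challenge');
    if (mode === 'subscribe' && topic === expectedTopic && challenge) {
      return { status: 200, body: challenge };
    }
    return { status: 404, body: '' };
  }

  if (method === 'POST') {
    // Content notification: `body` holds whatever the hub chose to send.
    // Any 2xx response tells the hub the delivery succeeded.
    return { status: 204, body: '' };
  }

  return { status: 405, body: '' };
}

// Wrap the handler in a plain HTTP server (hypothetical wiring).
function createCallbackServer(expectedTopic) {
  return http.createServer((req, res) => {
    let body = '';
    req.setEncoding('utf8');
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      const result = handleHubRequest(req.method, req.url, body, expectedTopic);
      res.writeHead(result.status, { 'Content-Type': 'text/plain' });
      res.end(result.body);
    });
  });
}
```

Calling `createCallbackServer('https://example.com/feed').listen(3000)` would start the endpoint locally. Keeping the decision logic in `handleHubRequest` means the handshake can be checked without a real hub.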
This time, however, I was skeptical. APIs didn’t help. RSS was limiting. Surely there was going to be something wrong. My assumption proved correct: after a test run of the process, I learned that the notification from Superfeedr (Medium’s PubSubHubbub provider) didn’t actually contain the content of the article, which made it useless for our purposes.
Having learned this, I decided the best thing to do at this point was to stick with the RSS feed and simply remove the code that deleted stories from the db.
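As a rough illustration of that change, the sync step only ever adds or updates stories and never deletes anything that has dropped out of the 10-item feed. This is a sketch under assumptions: the `Map` stands in for the real database, and `mergeFeedIntoStore` and the item fields are hypothetical names.

```javascript
// store: Map keyed by story URL (a stand-in for the real db).
// feedItems: array of objects such as { link, title } parsed from the RSS feed.
function mergeFeedIntoStore(store, feedItems) {
  for (const item of feedItems) {
    // Insert new stories and refresh existing ones.
    const existing = store.get(item.link) || {};
    store.set(item.link, { ...existing, ...item });
  }
  // Deliberately no cleanup pass: a story missing from the feed
  // may simply be older than the latest 10, not deleted.
  return store;
}
```

The trade-off is that stories deleted on Medium stay in the store until something else removes them.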
Since the RSS feed only shows the last 10 stories and the PubSubHubbub method only covers future stories, I was left with two possible options for everything older. The first was to manually enter the information into the database, which, as a coder, went against every impulse in my brain.
So the only option left was to write a script to download the data through HTTP requests. That involved doing the following:
- Find out where Medium lists stories for a publication (https://medium.com///)
- Analyze the HTML to determine how to pull out the link for each story
- Analyze the HTML of the story page to determine how to best get the data from it
The first two steps were pretty easy by themselves, and Medium was very helpful with the third by providing most of the data we needed in meta tags. As for the content itself, it was simply a matter of finding the right div. With all this information, I just needed to find an npm package that would help me parse the HTML (thank you cheerio) and write the script.
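The parsing steps can be sketched roughly as follows. The real script used cheerio. To stay dependency-free, this sketch swaps in regular expressions, which are far more fragile: they only handle double-quoted attributes and do not decode HTML entities. The function names, the URL prefix, and meta properties such as `og:title` are assumptions for illustration, not a description of Medium's actual markup.

```javascript
// Collect attribute name/value pairs from the source text of one tag.
function parseAttributes(tag) {
  const attrs = {};
  const attrPattern = /([\w:-]+)\s*=\s*"([^"]*)"/g;
  let match;
  while ((match = attrPattern.exec(tag)) !== null) {
    attrs[match[1].toLowerCase()] = match[2];
  }
  return attrs;
}

// Map each <meta property="..."> or <meta name="..."> to its content value.
function extractMeta(html) {
  const meta = {};
  for (const tag of html.match(/<meta\b[^>]*>/gi) || []) {
    const attrs = parseAttributes(tag);
    const key = attrs.property || attrs.name;
    if (key && attrs.content !== undefined) meta[key] = attrs.content;
  }
  return meta;
}

// Collect unique links that start with a given prefix, ignoring query strings.
function extractStoryLinks(html, prefix) {
  const links = new Set();
  for (const tag of html.match(/<a\b[^>]*>/gi) || []) {
    const href = parseAttributes(tag).href;
    if (href && href.startsWith(prefix)) links.add(href.split('?')[0]);
  }
  return [...links];
}
```

With cheerio, the same ideas would be expressed through selectors, for example `$('meta[property="og:title"]').attr('content')` after `cheerio.load(html)`, which copes with messy markup much better than these patterns.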
We could be, or we could probably make things better. There still isn’t any code to remove stories that have been deleted on Medium, and there’s probably a way to use a PubSubHubbub subscription instead of running a script every once in a while.
If you’d like to have some fun, I encourage you to fork the repo, try to solve these problems and submit a pull request if you do.