I have been working on filtering feed URLs for Telescope (issue 3688). I started by looking at the main blog hosts mentioned in the issue -
wordpress.com. For most of them, I was able to find existing blog URLs to test with and check which feed URLs the feed discovery service returned.
wordpress.com, I had to create an account and play around with the UI to find out how posts are created and how to visit the site. Basically, once you add a post and publish it, you can click Visit Site to be redirected to the site URL. I used this URL to get a list of the feed URLs for
To find out which feed URLs could be used for viewing posts, I simply opened each one in the browser and looked at its contents. If the response was unreadable there, I opened the URL in a new Firefox tab, saved the response to a file, and opened that file in VS Code. Then I used an XML formatter to make the contents more readable and confirmed that the response contained the posts for the blog.
Once I had collected a list of valid feed URLs for various hosts, I noticed that there were three patterns
I also found that these blog hosts offer an option to set up a custom domain. Initially, my plan was to set up some sort of whitelist that would only allow valid feed URLs. However, with custom domains, this could cause false positives or false negatives, because the host can no longer be recognized from the domain name. So, I decided to use a blacklist filtering method instead. Only a couple of the returned feed URLs were unwanted, such as the WordPress comments feed:
The idea was simply to add a function that filters out any feed URL matching one of the patterns in the blacklist. For example, a feed URL that ends with /comments/feed should not be returned.
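Here is a rough sketch of what such a filter could look like. This is not the actual Telescope code: the names BLOCKED_FEED_SUFFIXES, normalizeFeedUrl, isBlockedFeedUrl and filterFeedUrls are made up for illustration, the example.wordpress.com URLs are placeholders, and only the /comments/feed pattern comes from what I found above. I am also assuming that a trailing slash or query string on the feed URL should be ignored when matching.

```typescript
// Hypothetical blacklist of path suffixes for feeds that do not contain posts.
// Only '/comments/feed' comes from the post; other entries would be added later.
const BLOCKED_FEED_SUFFIXES: string[] = ['/comments/feed'];

// Drop any query string or fragment, then any trailing slashes, so that
// '.../comments/feed/?x=1' is compared as '.../comments/feed'.
function normalizeFeedUrl(feedUrl: string): string {
  return feedUrl.replace(/[?#].*$/, '').replace(/\/+$/, '');
}

// True when the normalized URL ends with one of the blacklisted suffixes.
function isBlockedFeedUrl(feedUrl: string): boolean {
  const normalized = normalizeFeedUrl(feedUrl);
  return BLOCKED_FEED_SUFFIXES.some(
    (suffix) => normalized.slice(-suffix.length) === suffix
  );
}

// Return only the feed URLs that do not match the blacklist.
function filterFeedUrls(feedUrls: string[]): string[] {
  return feedUrls.filter((feedUrl) => !isBlockedFeedUrl(feedUrl));
}

// Example with placeholder URLs: the comments feed is dropped.
const discoveredFeedUrls = [
  'https://example.wordpress.com/feed/',
  'https://example.wordpress.com/comments/feed/',
];
console.log(filterFeedUrls(discoveredFeedUrls));
```

Matching on the path suffix rather than on the host name is what lets the same check work for blogs on custom domains, since the domain never has to be recognized.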
Thus, I added a function to filter the feed URLs before returning them. Next, I need to test the sign-up process with various blog hosts to confirm that the feed URLs are returned correctly and that posts can be pulled successfully. I also need to write some tests for the new function.