Recently, I ran into a problem while building my personal blog, which uses Markdown for all of its written content. All of my posts were on Medium, and converting them to Markdown files by hand was taking a lot of time. So I built this project.
This application is a simple way to generate Markdown files from your Medium posts. It fetches the HTML content of a URL and converts it to Markdown using an AWS Bedrock agent with any foundation model (FM) of your choice. It consists of two main components:
- A Flask-based API for fetching HTML content
- A Next.js application (using App Router) that handles the Bedrock integration for converting HTML to Markdown
Step 1 — Flask API for HTML Fetching
We use Selenium to scrape the HTML because a plain ‘GET’ request for a Medium story page returns a partial result that doesn’t contain the complete story. So we render the page in headless Chrome, wait for it to load, and then read the page source, which contains the complete story.
The code below uses Selenium and Flask to create a small API that takes a URL and returns the HTML of the rendered page.
from flask import Flask, jsonify, request
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

app = Flask(__name__)


def get_website_html(url):
    # Set up headless Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    # Initialize the WebDriver with ChromeDriverManager
    driver = webdriver.Chrome(
        service=ChromeService(ChromeDriverManager().install()),
        options=chrome_options,
    )
    try:
        driver.get(url)
        print("Pausing for 5 seconds to allow the page to load...")
        time.sleep(5)  # Pause for 5 seconds
        # Get the HTML content of the page
        return driver.page_source
    finally:
        # Close the WebDriver, even if loading the page failed
        driver.quit()


@app.route('/get_html', methods=['POST'])
def get_html():
    print('Fetching the HTML content of a website...')
    # Get the URL from the request body (None if the body is not valid JSON)
    data = request.get_json(silent=True)
    if not data or 'url' not in data:
        return jsonify({'error': 'URL is required in the request body'}), 400
    url = data['url']
    html_content = get_website_html(url)
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Return the HTML content as a JSON response
    return jsonify({'html': str(soup)})


if __name__ == '__main__':
    app.run(debug=True)
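To try the endpoint, a client only needs to POST a JSON body with a `url` field and read the `html` field from the reply. Here is a minimal sketch using only the Python standard library. The base address is an assumption (Flask's default development address), and the helper names `build_get_html_request` and `fetch_story_html` are made up for this example.

```python
import json
import urllib.request

# Assumption: Flask's default development address; change it for a hosted server.
API_BASE = "http://127.0.0.1:5000"


def build_get_html_request(story_url, api_base=API_BASE):
    """Build the POST request that the /get_html route expects."""
    payload = json.dumps({"url": story_url}).encode("utf-8")
    return urllib.request.Request(
        f"{api_base}/get_html",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_story_html(story_url, api_base=API_BASE, timeout=60):
    """Send the request and return the 'html' field of the JSON response."""
    req = build_get_html_request(story_url, api_base)
    # The server sleeps while the page renders, so allow a generous timeout.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["html"]
```

With the Flask server running, `fetch_story_html("<your Medium story URL>")` should return the rendered HTML as a string.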
You can host this Flask server on an AWS EC2 instance or with any other cloud provider. Just make sure the API sits behind some form of authorization before you expose it.
Once we have the HTML, the next step is to turn it into clean Markdown. We will use a Bedrock agent for this, with Claude as the FM.
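One simple way to add the authorization mentioned above is a shared API key sent in a request header. The sketch below is only one possible approach, not what the original project does. The header name `X-API-Key` and the helper `is_authorized` are assumptions for illustration.

```python
import hmac

# Hypothetical header name; any header agreed on with the client would work.
API_KEY_HEADER = "X-API-Key"


def is_authorized(headers, expected_key):
    """Return True when the request headers carry the expected API key."""
    supplied = headers.get(API_KEY_HEADER, "")
    if not expected_key or not supplied:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_key.encode("utf-8")
    )
```

Inside the Flask route, you could call `is_authorized(request.headers, key_from_environment)` at the top of `get_html()` and return a 401 response when it is false. Keeping the key in an environment variable rather than in the source is the usual choice.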
Step 2 — Setup AWS Bedrock "Agent"
I created a basic agent in the Bedrock console with the prompt below. You can modify the prompt further for better results.
You are a helpful assistant who takes HTML as input and then parses it
and returns the blog as a Markdown blog, and the blog should just contain
the main content body without the title and subtitle.
Also, remove the first image of the article from the markdown body,
as we are putting that in the header.
* In the main content markdown, just keep the main body, remove the title
and subtitle published date, etc.
On top of the markdown, add these things:
* decide on the title and description of the content
* categories can be travel or engineering
* remove the title and description from the main markdown body
* The image will be the first URL of the markdown blog
Sample to put on the top of the markdown
---
title: The Time When I Got Scammed in Georgia
description: A Reminder to Dodge Scams… Or Collect Them Like Souvenirs?
image: /images/blog/blog-post-4.1.png
date: 2024/6/28
authors:
- nomadic_bug
categories:
- travel
---
This prompt gets us a Markdown file in the format we want, with the front matter block on top.
Step 3 — Next.js App with Bedrock Integration
I used Next.js with the App Router to build a basic UI for this project. Below is an outline of the main API route, which runs the agent we created earlier on AWS Bedrock.
// Outline of what we are doing.

// Initialize Bedrock Agent Client with AWS credentials
Initialize BedrockAgentClient with:
    region = "us-east-1"
    access_key = AWS_ACCESS_KEY_ID from environment
    secret_key = AWS_SECRET_ACCESS_KEY from environment

// Set up Agent details
agent_id = "your-agent-id"
agent_alias_id = "your-agent-alias-id"

// Function to invoke Bedrock Agent
Function InvokeBedrockAgent(session_id, input_text):
    Create new InvokeAgentCommand with:
        agent_id = agent_id
        agent_alias_id = agent_alias_id
        session_id = session_id
        input_text = input_text
    Send command to BedrockAgentClient
    Return the completion from the response

// Main API Handler
Function HandlePostRequest(request):
    Extract message and session_id from request body
    If message is missing OR session_id is missing:
        Return error response:
            status = 400
            message = "Please provide both a message and a sessionId."
    response = InvokeBedrockAgent(session_id, message)
    Return success response:
        status = 200
        data = {
            response: response,
            sessionId: session_id
        }
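For readers who prefer to stay in Python, the same flow can be sketched with boto3. This is a translation of the outline above, not the Next.js code from the project. It assumes boto3 is installed and that AWS credentials are available in the environment, where boto3 picks them up automatically. The function names are made up for this example. The agent's reply arrives as a stream of events, so the text chunks have to be joined.

```python
def collect_completion(events):
    """Join the text chunks from an InvokeAgent completion event stream."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"])
    # Decode once at the end so characters split across chunks stay intact.
    return b"".join(parts).decode("utf-8")


def invoke_bedrock_agent(session_id, input_text, agent_id, agent_alias_id,
                         region="us-east-1"):
    """Call the agent and return its full text reply."""
    import boto3  # third-party; imported here so the other helpers run without it

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=input_text,
    )
    return collect_completion(response["completion"])


def handle_post(body, invoke=invoke_bedrock_agent, **agent):
    """Mirror the outlined handler: validate, invoke, return (status, payload)."""
    message = body.get("message")
    session_id = body.get("sessionId")
    if not message or not session_id:
        return 400, {"error": "Please provide both a message and a sessionId."}
    reply = invoke(session_id, message, **agent)
    return 200, {"response": reply, "sessionId": session_id}
```

A call would look like `handle_post({"message": html, "sessionId": "abc"}, agent_id="your-agent-id", agent_alias_id="your-agent-alias-id")`. The `invoke` parameter exists only so the handler can be exercised with a stand-in function instead of a real AWS call.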
The complete code is available on my GitHub.
Conclusion
This HTML Fetcher and Markdown Converter is a prototype that converts web content into readable, editable Markdown. My goal was to make it work, and it does. There is room for improvement, but the project gave me a solid starting point.
Thanks for reading my story. If you want to read more stories like this, I invite you to follow me.
Till then, Sayonara! I wish you the best in your learning journey.