Hello Coder,
This article presents a few practical code snippets that extract and process HTML information using BeautifulSoup (BS4), an HTML parsing library for Python. The following topics are covered:
- ✅ Load the HTML
- ✅ Scan the file for assets: images, JavaScript files, CSS files
- ✅ Change the path of an existing asset
- ✅ Update existing elements: change the src attribute of an image
- ✅ Locate an element based on the id
- ✅ Remove an element from the DOM tree
- ✅ Process an existing component: remove hardcoded text
- ✅ Save the processed HTML to a file
What is an HTML Parser
According to Wikipedia, parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means loading the HTML, extracting and processing the relevant information (head title, page assets, main sections) and, later on, saving the processed file.
Parser Environment
The code uses BeautifulSoup, the well-known parsing library written in Python. To start coding, we need a few modules installed on our system:
$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here
Load the HTML content
The file is loaded like any other file, and its content is passed to a BeautifulSoup object:
from bs4 import BeautifulSoup as bs
# Load the HTML content
with open('index.html', 'r') as html_file:
    html_content = html_file.read()  # the file is closed automatically
# Initialize the BS object
soup = bs(html_content,'html.parser')
# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by BS library
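For quick experiments, the same object can also be built from an in-memory string instead of a file. The sketch below assumes a made-up `sample_html` snippet (not part of the original page) and reads the head title and a heading:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
sample_html = """
<html>
  <head><title>My Page</title></head>
  <body><h1 id="main">Hello</h1></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Read the page title and the text of the first h1 element
page_title = soup.title.string
heading = soup.find("h1").get_text()

print(page_title)
print(heading)
```

Working on a string like this keeps each experiment self-contained, which helps when testing a snippet before running it on a real theme.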
Parse the HTML for assets
At this point, we have the DOM tree loaded in the BeautifulSoup object. Let's scan the DOM tree for JavaScript files, i.e. the script nodes:
...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...
The code snippet that locates the JavaScript files takes only a few lines. The BS library returns a list of tag objects, and we can inspect or mutate each script node with ease:
# recursive=False limits the search to direct children of <body>
for script in soup.body.find_all('script', recursive=False):
    # Print the src attribute (missing for inline scripts)
    print(' JS source = ' + script.get('src', ''))
    # Print the type attribute (optional in HTML5)
    print(' JS type = ' + script.get('type', ''))
In a similar way, we can select and process the CSS nodes:
...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...
And the code:
for link in soup.find_all('link'):
    # Print the href attribute
    print(' CSS file = ' + link['href'])
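The two scans above can be combined into one helper that gathers all asset paths at once. This is only a sketch: `collect_assets` and `sample_html` are hypothetical names introduced here, and the filter on `rel="stylesheet"` is an added assumption, so that icons and other `<link>` tags are not reported as CSS files.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
sample_html = """
<html>
  <head>
    <link rel="stylesheet" href="css/app.css">
    <link rel="icon" href="favicon.ico">
  </head>
  <body>
    <img src="images/pic01.jpg" alt="demo">
    <script src="js/custom.js"></script>
    <script>console.log("inline");</script>
  </body>
</html>
"""

def collect_assets(soup):
    """Return the JS, CSS and image paths referenced by the page."""
    return {
        # src=True skips inline <script> blocks that have no src attribute
        "js": [s["src"] for s in soup.find_all("script", src=True)],
        # keep only <link> tags whose rel list contains "stylesheet"
        "css": [l["href"] for l in soup.find_all("link", href=True)
                if "stylesheet" in l.get("rel", [])],
        "img": [i["src"] for i in soup.find_all("img", src=True)],
    }

assets = collect_assets(BeautifulSoup(sample_html, "html.parser"))
print(assets)
```

Passing `src=True` or `href=True` to `find_all` keeps only the tags that carry the attribute, which avoids a `KeyError` on inline scripts.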
Parse the HTML for images
In this code snippet, we mutate each image node and change its src attribute:
...
<img src="images/pic01.jpg" alt="Bred Pitt">
...
for img in soup.body.find_all('img'):
    # Print the current path
    print(' IMG src = ' + img['src'])
    img_path = img['src']
    img_file = img_path.split('/')[-1]  # extract the last segment, aka the image file
    img['src'] = '/assets/img/' + img_file
    # the new path is set
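The same idea can be wrapped in a small function so the target directory becomes a parameter. A minimal sketch, assuming a hypothetical `move_images` helper and sample markup; images without a src attribute are left untouched:

```python
import posixpath
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
sample_html = (
    '<body><img src="images/pic01.jpg"><img src="pic02.png">'
    '<img alt="no source"></body>'
)

def move_images(soup, new_dir="/assets/img/"):
    """Point every <img> that has a src to the same file name under new_dir."""
    for img in soup.find_all("img", src=True):
        file_name = posixpath.basename(img["src"])  # last path segment
        img["src"] = new_dir + file_name
    return soup

soup = move_images(BeautifulSoup(sample_html, "html.parser"))
new_paths = [img.get("src") for img in soup.find_all("img")]
print(new_paths)
```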
Locate an element based on the ID
This can be achieved with a single line of code. Let's imagine that we have an element (div or span) with the id 1234:
...
<div id="1234" class="handsome">
Some text
</div>
And the code:
mydiv = soup.find("div", {"id": "1234"})
print(mydiv)
# delete the element
mydiv.decompose()
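Two other lookups give the same result, shown here as a sketch on hypothetical sample markup. Note that a CSS selector such as `#1234` is not valid because a CSS identifier cannot start with a digit, so the attribute form `[id="1234"]` is used instead. Checking for `None` before calling `decompose()` avoids an error when the element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
sample_html = (
    '<body><div id="1234" class="handsome">Some text</div>'
    '<p>Keep me</p></body>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# Lookup by keyword argument, regardless of the tag name
by_keyword = soup.find(id="1234")
# Lookup by CSS attribute selector
by_selector = soup.select_one('[id="1234"]')
found_text = by_selector.get_text()

# Remove the element only if it was found
if by_keyword is not None:
    by_keyword.decompose()

remaining = str(soup)
print(remaining)
```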
Remove the hard-coded text
This code snippet is useful for extracting components and translating them to different template engines. Let's imagine that we have this simple component:
<div id="1234" class="cool">
<span>Html Parsing</span>
<span>the practical guide</span>
</div>
If we want to use this component in PHP, the component becomes:
<div id="1234" class="cool">
<span><?php echo $title ?></span>
<span><?php echo $info ?></span>
</div>
Or for Jinja2 (a Python template engine):
<div id="1234" class="cool">
<span>{{ title }}</span>
<span>{{ info }}</span>
</div>
To avoid the manual work, we can use a code snippet that automatically replaces the hard-coded texts and prepares the component for a specific template engine:
from bs4 import NavigableString
# locate the div
mydiv = soup.find("div", {"id": "1234"})
print(mydiv)  # print before processing
# iterate over the div descendants (copied to a list, since we mutate the tree)
for tag in list(mydiv.descendants):
    # NavigableString is the text inside the tag,
    # not the tag itself
    if not isinstance(tag, NavigableString):
        print('Found tag = ' + tag.name + ' -> ' + tag.text)
        # this will print:
        # Found tag = span -> Html Parsing
        # Found tag = span -> the practical guide
        # tag.text is read-only; assign to tag.string instead
        # replace the text for PHP
        tag.string = '<?php echo $title ?>'
        # or replace the text for Jinja
        # tag.string = '{{ title }}'
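To give each span its own variable name (title, info), as in the target components shown earlier, the replacement can be driven by a list of names. This is a sketch only: `templatize` and its `pattern` parameter are hypothetical, and it assumes the spans appear in the same order as the names.

```python
from bs4 import BeautifulSoup

# The component from the article, as an in-memory string
sample_html = """
<div id="1234" class="cool">
<span>Html Parsing</span>
<span>the practical guide</span>
</div>
"""

def templatize(component, names, pattern="{{ %s }}"):
    """Replace the text of each <span> with a placeholder, in document order."""
    spans = component.find_all("span")
    for span, name in zip(spans, names):
        # Assigning to .string replaces the tag's content
        span.string = pattern % name
    return component

soup = BeautifulSoup(sample_html, "html.parser")
mydiv = templatize(soup.find("div", id="1234"), ["title", "info"])
placeholders = [span.string for span in mydiv.find_all("span")]
print(placeholders)
```

For PHP, the same function could be called with `pattern="<?php echo $%s ?>"`.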
To use the component, we can save it to a file:
# mydiv is the processed component
# php_component is the string representation
# formatter=None writes the strings as-is; formatter="html" would escape "<?php" into "&lt;?php"
php_component = mydiv.prettify(formatter=None)
with open('component.php', 'w') as file:
    file.write(php_component)
At this point, the original div is extracted from the DOM, with the hard-coded texts removed, and is ready to be used in a PHP or Python project.
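One detail worth checking when saving PHP placeholders is the output formatter, since BeautifulSoup escapes angle brackets inside text by default. The sketch below, on a hypothetical one-span component, compares the "html" formatter with `formatter=None`, which leaves strings unmodified:

```python
from bs4 import BeautifulSoup

# Hypothetical one-span component, used only for illustration
soup = BeautifulSoup("<div><span>Html Parsing</span></div>", "html.parser")
soup.span.string = "<?php echo $title ?>"

# The "html" formatter converts < and > inside text to entities
escaped = soup.div.decode(formatter="html")
# formatter=None writes the text exactly as stored
raw = soup.div.decode(formatter=None)

print(escaped)
print(raw)
```

With `formatter=None` the output is no longer guaranteed to be valid HTML, which is the intent here because the file is a PHP template.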
Save the new HTML
Now we have the mutated DOM in a BeautifulSoup object, in memory. To save the content, we call prettify() and write the result to a new HTML file:
new_dom_content = soup.prettify(formatter="html")
file = open( 'index_parsed.html', 'w+')
file.write( new_dom_content )
file.close()
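A quick way to verify the saved file is to load it again and inspect a known element. The following sketch writes to a temporary directory (an assumption made so the example does not touch a real project) and uses a context manager so the file is closed automatically:

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration
soup = BeautifulSoup("<html><body><p>Saved</p></body></html>", "html.parser")

# Write the prettified DOM to a file in a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "index_parsed.html")
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(soup.prettify())

# Load the file again and check that the content survived
with open(out_path, encoding="utf-8") as in_file:
    reloaded = BeautifulSoup(in_file.read(), "html.parser")

saved_text = reloaded.p.get_text(strip=True)
print(saved_text)
```

Because prettify() adds line breaks and indentation around text nodes, the check uses `get_text(strip=True)` rather than comparing raw strings.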
HTML Parser - Use Cases
I use HTML parsing quite a lot, especially for tasks that involve manual work:
- process HTML themes to be used in a new project
- extract hard-coded texts and extract components
- translate flat HTML themes to Jinja, Mustache or PUG templates
From time to time, I publish free samples in this public repository.
Resources
- HTML Parser - supported by AppSeed
- HTML Parser - How to use Python BS4 to work less
- Developer Tools - Open-Source HTML Parser - related article
- BeautifulSoup Html Parser documentation
- HTML Parser - Convert HTML to Jinja2 and Php components - related blog article
Thank you! Btw, my (nick) name is Sm0ke and I'm pretty active also on Twitter.
Top comments (4)
hello, can anyone have a solution for how to parse multiple content of html pages to another html pages with the same link address?
please see this link:
stackoverflow.com/questions/661012...
Wow, BeautifulSoup makes that super easy! Do you ever find edge cases where it doesn't work well at all? Or does it manage to handle most sites that you've tried? Thanks!
Hello @chris ,
Based on my experience, BS failed when I didn't respect the syntax or something similar. I remember a dummy case where I initialized the BS object using the lxml parser and the saved HTML always had a closing tag:
Sample:
<meta ...></meta>
It was my fault all the way :). Now I'm using html.parser to construct the BS objects.
Thank you for your interest.
Ah, makes sense. Thanks!