Beautiful Soup is a Python library for working with HTML and XML files. You can use it to navigate an HTML document, search it, extract data from it and even change the document structure. Let’s see how it works:
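from bs4 import BeautifulSoup

# A small HTML document stored in a Python string
html = '''
<html>
<head>
<title>Beautifulsoup Hello World</title>
</head>
<body>
<h1>Header</h1>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
</body>
</html>
'''

# Parse the document with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)       # the whole <title> element
print(soup.title.name)  # the tag name
print(soup.title.text)  # the text inside the tag
print(soup.p.text)      # text of the first <p>

# Text of every <p> in the document
for paragraph in soup.find_all('p'):
    print(paragraph.text)

# All text in the document, without tags
print(soup.get_text())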
It’s a really basic example, but before you can run it you first need to install Beautiful Soup:
pip install beautifulsoup4
While you’re at it, install another library as well:
pip install lxml
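It’s an HTML/XML parser. Don’t worry about it for now: the example here uses Python’s built-in html.parser, and lxml is a faster alternative you can pass to BeautifulSoup instead.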
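If you do want to see where the parser comes into play, here is a minimal sketch that prefers lxml and falls back to Python's built-in html.parser when lxml is not installed. The helper name make_soup is made up for this post; it is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser.

    make_soup is a hypothetical helper for illustration, not a bs4 function.
    """
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Beautiful Soup raises this when the requested parser is missing.
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>Hello</p>')
print(soup.p.text)
```

Either way the rest of your code stays the same, which is why you can safely ignore the parser choice at this stage.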
Let’s start — import Beautiful Soup:
from bs4 import BeautifulSoup
Next, we need some HTML to work with:
html = '''
<html>
<head>
<title>Beautifulsoup Hello World</title>
</head>
<body>
<h1>Header</h1>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
</body>
</html>
'''
It is a basic HTML document stored in a Python string. Of course, working with HTML stored in a Python script is not very exciting, but this is a Hello, World, so hey.
Create an instance of the BeautifulSoup object, specifying the HTML document and the parser to be used (I said don’t worry about it):
soup = BeautifulSoup(html, 'html.parser')
Now we have our HTML parsed and stored in a variable named soup and we can play with it:
Use soup.title to access the HTML document’s <title> element:
print(soup.title)
This prints:
<title>Beautifulsoup Hello World</title>
Sometimes you don’t want the HTML tag. In that case, use soup.title.text:
print(soup.title.text)
and get just the element text:
Beautifulsoup Hello World
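A tag object carries a bit more than its text. As a small sketch (the variable names below are just for illustration), these are three attributes you will probably reach for most often:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title>Beautifulsoup Hello World</title>', 'html.parser')
title = soup.title

tag_name = title.name      # the tag's name, without angle brackets
tag_text = title.text      # all text inside the tag, as a plain str
tag_string = title.string  # the tag's single child string (a NavigableString)

print(tag_name)
print(tag_text)
print(tag_string)
```

For a tag with a single piece of text like this one, .text and .string hold the same characters; they start to differ once a tag contains other tags.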
Our document has just one <title> element, so Beautiful Soup appropriately returns it if we use soup.title. But the document has three <p> elements (paragraphs), so what happens when we try to pull the same trick?
print(soup.p.text)
It returns the first <p> element in the document:
Paragraph 1
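To make the "first match" behaviour concrete, here is a short sketch on a trimmed-down document. It also shows what you get when the tag is not there at all, which is worth knowing before you chain .text onto it.

```python
from bs4 import BeautifulSoup

html = '<body><p>Paragraph 1</p><p>Paragraph 2</p></body>'
soup = BeautifulSoup(html, 'html.parser')

first_by_attribute = soup.p     # shortcut: first <p> in the document
first_by_find = soup.find('p')  # the explicit way to ask for the same thing
print(first_by_attribute.text)
print(first_by_find.text)

# A tag that is not in the document gives None, not an exception.
missing = soup.table
if missing is None:
    print('no <table> in this document')
```

Because a missing tag comes back as None, something like soup.table.text would fail with an AttributeError, so check for None first when you are not sure the tag exists.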
If you want to get all paragraphs in a document, well, just use find_all():
for paragraph in soup.find_all('p'):
    print(paragraph.text)
find_all() returns all paragraphs in the document and you can iterate over them using a simple for loop.
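Since the result of find_all() behaves like a list, you are not limited to a for loop. A quick sketch, again on a shortened document, with variable names picked for this example:

```python
from bs4 import BeautifulSoup

html = '<body><p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p></body>'
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all('p')
print(len(paragraphs))  # how many <p> elements were found

# Collect just the text of each paragraph
texts = [p.text for p in paragraphs]
print(texts)

# Stop searching after the first two matches
first_two = soup.find_all('p', limit=2)
print([p.text for p in first_two])
```

The limit argument is handy on big documents where you only need the first few matches.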
This is just scratching the surface of Beautiful Soup. To finish, let’s see how simple it is to get all the text (and only the text) in the document:
print(soup.get_text())
As expected, this prints
Beautifulsoup Hello World Header Paragraph 1 Paragraph 2 Paragraph 3
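One thing to be aware of: get_text() keeps the whitespace that sits between the tags, so the raw result usually contains a lot of newlines. Here is a sketch, on a slightly shorter document, of how the separator and strip arguments give you a tidy single line instead:

```python
from bs4 import BeautifulSoup

html = '''
<html>
<head>
<title>Beautifulsoup Hello World</title>
</head>
<body>
<h1>Header</h1>
<p>Paragraph 1</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

# Raw text, including the newlines between tags
raw = soup.get_text()
print(repr(raw))

# Strip each piece of text and join the non-empty pieces with a space
compact = soup.get_text(separator=' ', strip=True)
print(compact)
```

The compact version is usually what you want when you feed the text into something else, like a search index or a word counter.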
You can find the full script on my GitHub. ttfn.