Vicente Maldonado

Posted on Aug 2, 2019 • Originally published at Medium on Aug 2, 2019

Beautiful Soup Hello World

#webscraping #html #beautifulsoup #parsing

Beautiful Soup is a Python library for working with HTML and XML files. You can use it to navigate an HTML document, search it, extract data from it, and even change the document structure. Let’s see how it works:

from bs4 import BeautifulSoup

html = '''
    <html>
    <head>
        <title>Beautifulsoup Hello World</title>
    </head>
    <body>
        <h1>Header</h1>
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
        <p>Paragraph 3</p>
    </body>
    </html>
'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.title)
print(soup.title.name)
print(soup.title.text)

print(soup.p.text)

for paragraph in soup.find_all('p'):
    print(paragraph.text)

print(soup.get_text())

It’s a really basic example, but before you can run it you first need to install Beautiful Soup:

pip install beautifulsoup4

While you’re at it, install another library as well:

pip install lxml

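It’s an HTML/XML parser that Beautiful Soup can use. The script in this post uses Python’s built-in html.parser, so lxml is optional here. Don’t worry about it.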

Let’s start by importing Beautiful Soup:

from bs4 import BeautifulSoup

Next, we need some HTML to work with:

html = '''
    <html>
    <head>
        <title>Beautifulsoup Hello World</title>
    </head>
    <body>
        <h1>Header</h1>
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
        <p>Paragraph 3</p>
    </body>
    </html>
'''

It is a basic HTML document stored in a Python string. Of course, working with HTML stored in a Python script is not very exciting, but this is a Hello, World, so hey.

Create an instance of the BeautifulSoup class, passing the HTML document and the name of the parser to use. Here that is 'html.parser', the parser from Python’s standard library (I said don’t worry about it):

soup = BeautifulSoup(html, 'html.parser')
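
If you did install lxml, you can pass 'lxml' as the parser name instead. The following is a minimal sketch, not part of the original script: make_soup is a hypothetical helper that tries a preferred parser and falls back to html.parser when Beautiful Soup reports that the preferred one is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = (
    "<html><head><title>Beautifulsoup Hello World</title></head>"
    "<body><p>Paragraph 1</p></body></html>"
)


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with `preferred`, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when no installed parser matches the requested name.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
print(soup.title.text)
```

For a well-formed document like this one, both parsers give the same title text, so the rest of the walkthrough works either way.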

Now we have our HTML parsed and stored in a variable named soup and we can play with it:

print(soup.title)

Use soup.title to access the HTML document’s <title> element. This prints:

<title>Beautifulsoup Hello World</title>

Sometimes you don’t want the HTML tag, just the text inside it:

print(soup.title.text)

This prints just the element text:

Beautifulsoup Hello World
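
The full script at the top also prints soup.title.name, which the walkthrough skips. As a small sketch (the variable names here are mine), this is how the name, text and string attributes of a tag relate to each other:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Beautifulsoup Hello World</title></head>"
    "<body><h1>Header</h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title = soup.title
tag_name = title.name        # the tag's name, without angle brackets
title_text = title.text      # all text inside the tag, as a plain str
title_string = title.string  # the tag's single child string

print(tag_name)
print(title_text)
print(title_string == title_text)
```

For a tag that contains only text, like this title, text and string hold the same characters.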

Our document has just one <title> element, so Beautiful Soup returns it when we use soup.title. But the document has three <p> elements (paragraphs), so what happens when we try the same trick?

print(soup.p.text)

It returns the first <p> element in the document:

Paragraph 1
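
Attribute access such as soup.p is a shortcut for soup.find('p'). One thing to watch for: if the document has no such tag, you get None back, and calling .text on it fails. A short sketch, using a table tag that our document does not contain:

```python
from bs4 import BeautifulSoup

html = "<body><h1>Header</h1><p>Paragraph 1</p><p>Paragraph 2</p></body>"
soup = BeautifulSoup(html, "html.parser")

first = soup.p          # shortcut form
same = soup.find("p")   # explicit form, returns the first match
print(first == same)

missing = soup.table    # there is no <table> in this document
# Guard against None before reading .text
table_text = missing.text if missing is not None else ""
print(repr(table_text))
```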

If you want to get all paragraphs in a document, well, just use find_all():

for paragraph in soup.find_all('p'):
    print(paragraph.text)

find_all() returns all paragraphs in the document and you can iterate over them using a simple for loop.
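
The result of find_all() behaves like a list, so you can also take its length or build a new list from it. A brief sketch (variable names are mine), including the limit argument for stopping after a given number of matches:

```python
from bs4 import BeautifulSoup

html = """
<body>
    <h1>Header</h1>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
    <p>Paragraph 3</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")
texts = [p.text for p in paragraphs]  # collect the text of each paragraph
first_two = [p.text for p in soup.find_all("p", limit=2)]  # stop after 2 matches
no_tables = soup.find_all("table")    # no match gives an empty result, not None

print(len(paragraphs))
print(texts)
print(first_two)
```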

This just scratches the surface of Beautiful Soup. To finish, let’s see how simple it is to get all the text (and only the text) in the document:

print(soup.get_text())

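As expected, this prints the text of every element (the real output also keeps the indentation and blank lines from the source string, which are trimmed here):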

Beautifulsoup Hello World

Header
Paragraph 1
Paragraph 2
Paragraph 3
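
By default get_text() keeps the whitespace that sits between the tags, so the raw output carries the indentation of the HTML string. If you want one clean line per element, the separator and strip arguments help. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '''
    <html>
    <head>
        <title>Beautifulsoup Hello World</title>
    </head>
    <body>
        <h1>Header</h1>
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
        <p>Paragraph 3</p>
    </body>
    </html>
'''
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each piece of text and drops whitespace-only pieces;
# separator="\n" joins the remaining pieces with newlines.
clean = soup.get_text(separator="\n", strip=True)
print(clean)
```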

You can find the full script on my GitHub. ttfn.
