In this article, we will write a small script to extract all the text from a website with Python.
To do this, we will use two great Python libraries: Beautiful Soup and requests.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
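For instance, parsing a tiny hard-coded HTML snippet and searching the tree looks like this (a minimal illustration, separate from the script we will build below):
from bs4 import BeautifulSoup

# A small, hard-coded snippet just to show navigating and searching the parse tree
html = "<html><body><h1>Hello</h1><p>First paragraph</p><p>Second paragraph</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                           # Hello
print([p.text for p in soup.find_all("p")])   # ['First paragraph', 'Second paragraph']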
So let's get started...
First of all, we will create a virtual environment:
mkdir TextExtractor && cd TextExtractor
python3 -m venv .venv
Then we activate this environment
source .venv/bin/activate
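If you are on Windows, the activation command is different; from cmd it is:
.venv\Scripts\activate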
We install the Beautiful Soup library:
pip install beautifulsoup4==4.11.1
We install the requests library:
pip install requests
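Optionally, you can record both dependencies in a requirements.txt so the setup is reproducible (the requests version below is only an example, not something this tutorial pins):
beautifulsoup4==4.11.1
requests>=2.28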
Then we create a file called main.py
In this file, we first import the requests and Beautiful Soup libraries:
import requests
from bs4 import BeautifulSoup
As an example, we will use the website Medium.com.
First, we create a function that fetches the content of a page with the requests library:
def get_page_content(page_url):
    response = requests.get(page_url)
    if response.status_code == 200:
        return response.content
    return None
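In a real script you may also want a timeout and some basic error handling, because requests.get can hang or raise an exception on network problems. A possible variant (the 10-second timeout is an arbitrary choice):
def get_page_content(page_url):
    try:
        # The timeout keeps the request from hanging indefinitely
        response = requests.get(page_url, timeout=10)
    except requests.RequestException:
        # Covers connection errors, timeouts, invalid URLs, etc.
        return None
    if response.status_code == 200:
        return response.content
    return None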
We can get the content like this
content = get_page_content('https://medium.com/')
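Note that get_page_content returns None when the request fails, so a robust script would check for that before parsing, for example:
if content is None:
    raise SystemExit("Could not fetch the page")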
To parse the HTML, we create the soup object from this content, using Python's built-in html.parser:
soup = BeautifulSoup(content, "html.parser")
We find all the text nodes (the strings themselves, not the tags) like this:
tags_with_text = soup.find_all(text=True)
Then we can get the text list:
texts = [tag.text for tag in tags_with_text]
This will return a list of texts like this:
['', 'Medium – Where good ... find you.', '{"@context":"http:\\u...ght":168}}',...]
Here we see that we get a lot of script content and other texts that we don't want.
We need to ignore the text inside some tags:
TAGS_TO_IGNORE = ['script', 'style', 'meta']
Since find_all(text=True) returns the text nodes themselves, we check the name of each node's parent tag and keep only the non-empty, stripped strings:
texts = [tag.text.strip() for tag in tags_with_text if tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE]
Putting this together, we can create a function that gets all the texts from a page:
def get_texts_from_page(page_url):
    content = get_page_content(page_url)
    soup = BeautifulSoup(content, "html.parser")
    tags_with_text = soup.find_all(text=True)
    TAGS_TO_IGNORE = ['script', 'style', 'meta']
    texts = [tag.text.strip() for tag in tags_with_text if tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE]
    return list(set(texts))
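As an aside, if you just need one block of visible text rather than a list of strings, Beautiful Soup's get_text() method can do most of the work, as long as you remove the script and style elements first. A sketch of that alternative (the name get_plain_text is just for illustration):
def get_plain_text(page_url):
    content = get_page_content(page_url)
    soup = BeautifulSoup(content, "html.parser")
    # Remove script/style elements so their contents don't end up in the text
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)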
And here is the complete main.py file:
import requests
from bs4 import BeautifulSoup


def get_page_content(page_url):
    response = requests.get(page_url)
    if response.status_code == 200:
        return response.content
    return None


def get_texts_from_page(page_url):
    content = get_page_content(page_url)
    soup = BeautifulSoup(content, "html.parser")
    tags_with_text = soup.find_all(text=True)
    TAGS_TO_IGNORE = ['script', 'style', 'meta']
    texts = [tag.text.strip() for tag in tags_with_text if tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE]
    return list(set(texts))


# USAGE
texts = get_texts_from_page('https://medium.com/')
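If you run python main.py as it is, nothing is printed. To actually see the extracted texts, you can add a small loop at the end of the file, for example:
for text in texts:
    print(text)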
ENJOY!