Welcome to the third project in this series. In this project, you will automate the searching of user-supplied text patterns within text files.
To follow along with this project, you can access the project files and READMEs here.
Prerequisites
Before embarking on this project, make sure you have completed Part 2 of the series.
Project Structure:
This project follows a logical structure aimed at helping you grasp the concepts and build the skills progressively. Here is how we will approach it:
- Start Simple: We will begin with a straightforward approach to achieve basic functionality. Our primary goal is to make the code work as intended, even if it is not perfect initially.
- Refine and Enhance: Once your code is functional, we will start improving it. This phase will involve optimizing the code for better performance and usability.
- Challenge Yourself: As usual, there will be an exercise at the end of the project.
Open the project file in a separate tab and follow along with these steps.
Step 1: Understand the Project.
# Provide the absolute path to the directory
# Search the directory for any .txt files
# Open each file
# Search for lines that match the user-supplied regex
# Print the matching result
Step 2: Import the Required Modules.
import os, re
Step 3: Create a Function.
Create a function with two parameters: dir_path and user_supplied_regex. This function will be used to perform the desired regex search.
def regex_search(dir_path, user_supplied_regex):
    # Your code goes here
    pass
Step 4: Get the Absolute Path of the Directory.
Use os.path.abspath(dir_path) to get the absolute path of the directory. Write dir_path as a raw string (prefix the literal with r) so that the backslashes in a Windows path are not treated as escape characters.
if __name__ == '__main__':
    dir_path = r'C:\Users\Praise Idowu\Documents\blog-projects\Automating-the-boring-stuff-with-Python-blog-project\chapter-8\regex_search\folder1'
    user_supplied_regex = None
    regex_search(dir_path, user_supplied_regex)
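To see why the raw string matters, here is a minimal sketch. The names folder\new and folder1 are made up for illustration and do not need to exist on disk. In a normal string literal, \n becomes a single newline character, while the raw string keeps the backslash. A normal literal containing \U, as in C:\Users, would not even compile.

```python
import os

# A normal string literal turns "\n" into one newline character.
normal = 'folder\new'
# A raw string literal keeps the backslash and the "n" as two characters.
raw = r'folder\new'

print('\n' in normal)  # True: the path now contains a newline
print('\\' in raw)     # True: the backslash survived

# os.path.abspath resolves a relative path against the current working directory.
# 'folder1' is a hypothetical name used only for this sketch.
absolute = os.path.abspath('folder1')
print(os.path.isabs(absolute))  # True
```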
Step 5: Loop Through the Files.
Use a for loop over os.listdir() to iterate through the filenames in the specified directory.
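A minimal sketch of this loop, assuming a throwaway directory created with tempfile so that it runs on any machine. The file names are hypothetical. Note that os.listdir() returns bare names in arbitrary order, so the sketch sorts them before printing.

```python
import os
import tempfile

# Build a temporary directory with two sample files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('notes.txt', 'data.csv'):
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write('sample')

    # os.listdir returns names only, without the directory part.
    found = sorted(os.listdir(tmp_dir))
    for filename in found:
        print(filename)
```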
Step 6: Check if it is a File.
Use os.path.isfile(filepath) to check whether the current item is a file.
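The sketch below illustrates the check on a temporary directory that holds one file and one subfolder (both names are made up). Because os.listdir() returns bare names, each name is joined with the directory first. Otherwise os.path.isfile() would look for it relative to the current working directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # One subfolder and one file, both hypothetical.
    os.mkdir(os.path.join(tmp_dir, 'subfolder'))
    with open(os.path.join(tmp_dir, 'notes.txt'), 'w', encoding='utf-8') as f:
        f.write('sample')

    files_only = []
    for filename in os.listdir(tmp_dir):
        # Join the directory and the name before testing it.
        filepath = os.path.join(tmp_dir, filename)
        if os.path.isfile(filepath):
            files_only.append(filename)

print(files_only)  # ['notes.txt']
```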
Step 7: Check the File Extension.
Use os.path.splitext(filepath) to split the extension from the rest of the path. It returns a (root, extension) tuple, so you can access the extension at index 1.
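As a quick illustration, here is what os.path.splitext() returns for a few hypothetical names. Only the last extension is split off, and a name without a dot yields an empty extension.

```python
import os

# 'folder1' and 'notes.txt' are made-up names for this sketch.
root, extension = os.path.splitext(os.path.join('folder1', 'notes.txt'))
print(extension)  # .txt

# Only the final extension is separated.
split_file = os.path.splitext('archive.tar.gz')
print(split_file)     # ('archive.tar', '.gz')
print(split_file[1])  # .gz

# No dot in the name means an empty extension.
no_extension = os.path.splitext('README')
print(no_extension)   # ('README', '')
```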
Step 8: Check if it is a .txt File and Print the Filename.
Use a conditional statement to check if the file has a .txt extension. If it does, print the filename. Once your code is working correctly, comment out the print statement and move on to the next step.
Step 9: Open the File and Read its Content.
Open the file in read mode and store its content in a variable.
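A minimal sketch of this step, assuming a sample file written to a temporary directory first so there is something to read. The file name and its contents are invented for the example. The with statement closes the file automatically once the block ends.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'notes.txt')

    # Create a sample file to read back (hypothetical content).
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')

    # Open in read mode and store the whole file in one string.
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

print(repr(content))
```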
Step 10: Compile the User-Supplied Regex and Find Matches.
- Compile the user-supplied regex pattern using re.compile(user_supplied_regex, re.VERBOSE).
- Use the compiled pattern's findall() method to find all occurrences of the regex pattern in the file's content.
- If matches are found, print them as the results.
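The steps above can be sketched on a short, made-up string. One detail worth knowing: with re.VERBOSE, whitespace inside the pattern is ignored, so the pattern r'web | scraping' behaves like web|scraping. Without the flag, the spaces are part of the pattern.

```python
import re

# Hypothetical file content for this sketch.
content = 'web scraping is fun. Learn web development too.'

# With re.VERBOSE the spaces around | are ignored.
user_regex = re.compile(r'web | scraping', re.VERBOSE)
matches = user_regex.findall(content)
print(matches)  # ['web', 'scraping', 'web']

# Without re.VERBOSE the spaces must match literally.
literal_matches = re.findall(r'web | scraping', content)
print(literal_matches)  # ['web ', 'web ']
```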
import os, re

def regex_search(dir_path, user_supplied_regex):
    # provide abs path
    dir = os.path.abspath(dir_path)
    # print(dir)
    for filename in os.listdir(dir):
        # print(filename)
        if os.path.isfile(os.path.join(dir, filename)):
            # print(filename)
            split_file = os.path.splitext(os.path.join(dir, filename))
            # print(split_file)
            # searches the directory for any .txt files
            if split_file[1] == '.txt':
                print(filename)
                # It opens the file
                with open(os.path.join(dir, filename), 'r', encoding='utf-8') as f:
                    content = f.read()
                    # print(content)
                # It searches for a line that matches the user-supplied regex
                user_regex = re.compile(user_supplied_regex, re.VERBOSE)
                for groups in user_regex.findall(content):
                    # it prints the result
                    print(groups)

if __name__ == '__main__':
    dir_path = r'C:\Users\Praise Idowu\Documents\blog-projects\Automating-the-boring-stuff-with-Python-blog-project\chapter-8\regex_search\folder1'
    # user_supplied_regex = r'web | scraping'
    user_supplied_regex = r'''(
    (https?://)[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?
    )'''
    regex_search(dir_path, user_supplied_regex)
The previous code works well, but it can be improved. Here are the updates that have been made:
- Removed the repeated os.path.join calls and created a file_path variable to replace them.
- Used filename.endswith to check for the '.txt' extension. It returns a boolean.
import os, re

def regex_search(dir_path, user_supplied_regex):
    # Provide an absolute path
    dir = os.path.abspath(dir_path)
    # Create a regex pattern from the user-supplied regex
    user_regex = re.compile(user_supplied_regex, re.VERBOSE)
    for filename in os.listdir(dir):
        file_path = os.path.join(dir, filename)
        # Check if the file is a .txt file
        if os.path.isfile(file_path) and filename.endswith('.txt'):
            print(filename)
            # Open and read the file
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            # Search for matches in the content
            matches = user_regex.findall(content)
            # Print the matched line
            for match in matches:
                print(match)

if __name__ == '__main__':
    dir_path = r'C:\Users\Praise Idowu\Documents\blog-projects\Automating-the-boring-stuff-with-Python-blog-project\chapter-8\regex_search\folder1'
    # user_supplied_regex = r'web | scraping'
    user_supplied_regex = r'''(
    (https?://)[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?
    )'''
    regex_search(dir_path, user_supplied_regex)
    # TODO: use input() instead
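A side note on the endswith check used in the refined version: it is case-sensitive, just like the earlier comparison against split_file[1]. The sketch below compares the two approaches on a few hypothetical filenames. It also shows one possible case-insensitive variant, which is an optional tweak and not part of the project code.

```python
import os

# Hypothetical filenames for this comparison.
filenames = ['notes.txt', 'REPORT.TXT', 'data.csv', 'txt']

# Both checks agree here, and both are case-sensitive.
by_endswith = [name for name in filenames if name.endswith('.txt')]
by_splitext = [name for name in filenames if os.path.splitext(name)[1] == '.txt']

# Optional variant: lowercase the name first to also accept '.TXT'.
case_insensitive = [name for name in filenames if name.lower().endswith('.txt')]

print(by_endswith)       # ['notes.txt']
print(by_splitext)       # ['notes.txt']
print(case_insensitive)  # ['notes.txt', 'REPORT.TXT']
```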
Exercise 1
Instead of hard-coding dir_path and user_supplied_regex, use the input() function to prompt the user to enter these values interactively.
Exercise 2
Depending on the regex the user supplies, findall() returns a list of strings or a list of tuples, so the code can print either strings or tuples.
To make the code more flexible and handle different output scenarios, Tope has already implemented some code but needs assistance in completing it. You can improve the functionality by allowing the user to choose the output format (default, returns a string, list of tuples, or tuples) and providing an option for indexing or slicing if the user chooses to do that.
Additionally, Tope has hard-coded match[0], which isn't what we want. Instead, enable the user to specify the index they want to access in the results (e.g., 0, 1, etc.). After completing the code, don't forget to share your repository link in the comments section for feedback.
# Print the matched line
for match in matches:
    # Extract and print all groups in the match
    # print(type(match))
    if isinstance(match, tuple):
        print(match[0])
    elif isinstance(match, list):
        pass
    else:
        print(match)
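As background for Exercise 2, this sketch shows how the shape of the findall() result depends on the number of capture groups in the pattern. The sample text and the simplified URL patterns are assumptions made for illustration. findall() always returns a list, but each item in it is a string or a tuple, so the isinstance(match, list) branch above would not be expected to run.

```python
import re

# Hypothetical sample text.
text = 'Visit https://example.com or http://test.org today.'

# No groups: a list of full-match strings.
no_groups = re.findall(r'https?://\w+\.\w+', text)
print(no_groups)   # ['https://example.com', 'http://test.org']

# One group: a list of strings holding that group only.
one_group = re.findall(r'https?://(\w+)\.\w+', text)
print(one_group)   # ['example', 'test']

# Two or more groups: a list of tuples, one item per group.
two_groups = re.findall(r'(https?)://(\w+)\.\w+', text)
print(two_groups)  # [('https', 'example'), ('http', 'test')]
```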
Conclusion
In this project, you have gained valuable skills in text processing and pattern matching. We have come to the end of this series. Moving forward, you will write code to organize files.
If you have any questions, want to connect, or just fancy a chat, feel free to reach out to me on LinkedIn and Twitter.