Project: Regex Search

#python #automation #programming

Welcome to the third project in this series. In this project, you will automate searching text files for a user-supplied regex pattern.
To follow along with this project, you can access the project files and READMEs here.

Prerequisites

Before embarking on this project, make sure you have completed Part 2 of the series.

Project Structure:

This project follows a logical structure aimed at helping you grasp the concepts and build the skills progressively. Here is how we will approach it:

Start Simple: We will begin with a straightforward approach to achieve basic functionality. Our primary goal is to make the code work as intended, even if it is not perfect initially.
Refine and Enhance: Once your code is functional, we will start improving it. This phase will involve optimizing the code for better performance and usability.
Challenge Yourself: As usual, there will be an exercise at the end of the project.

Open the project file in a separate tab and follow along with these steps.

Step 1: Understand the Project.

# Provide the absolute path to the directory
# Search the directory for any .txt files
# Open each file
# Search for lines that match the user-supplied regex
# Print the matching results

Step 2: Import the Required Modules.

import os, re

Step 3: Create a Function.
Create a function with two parameters: dir_path and user_supplied_regex. This function will be used to perform the desired regex search.

def regex_search(dir_path, user_supplied_regex):
    # Your code goes here
    pass

Step 4: Get the Absolute Path of the Directory.
Use os.path.abspath(dir_path) to get the absolute path of the directory. Write dir_path as a raw string (prefix it with r) so that the backslashes in a Windows path are not interpreted as escape sequences.

if __name__ == '__main__':
    dir_path = r'C:\Users\Praise Idowu\Documents\blog-projects\Automating-the-boring-stuff-with-Python-blog-project\chapter-8\regex_search\folder1'
    user_supplied_regex = None
    regex_search(dir_path, user_supplied_regex)
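
To see why the raw string matters, here is a minimal sketch. The paths in it are made up for illustration and are not part of the project files.

```python
import os

# Without the r prefix, "\t" becomes a tab and "\n" becomes a newline.
plain = 'C:\temp\new_folder'
raw = r'C:\temp\new_folder'

print(len(plain), len(raw))        # 16 18
print('\t' in plain, '\t' in raw)  # True False

# abspath() resolves a relative path against the current working directory.
relative = os.path.join('folder1', 'notes.txt')  # hypothetical path
absolute = os.path.abspath(relative)
print(os.path.isabs(relative), os.path.isabs(absolute))  # False True
```

The raw string keeps every backslash as a literal character, which is what a Windows path needs.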

Step 5: Loop Through the Files.
Use a for loop to iterate through the filenames in the specified directory.

Step 6: Check if it is a File.
Use os.path.isfile(filepath) to check if the current item is a file. Because os.listdir() returns names only, build filepath with os.path.join() first.
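
Steps 5 and 6 can be tried in isolation with a rough sketch like the one below. It builds a throwaway directory with tempfile so it runs anywhere; the names notes.txt and subfolder are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create one file and one subfolder to loop over.
    open(os.path.join(tmp_dir, 'notes.txt'), 'w').close()
    os.mkdir(os.path.join(tmp_dir, 'subfolder'))

    files_only = []
    for name in sorted(os.listdir(tmp_dir)):
        # listdir() gives bare names, so rebuild the full path before testing it.
        full_path = os.path.join(tmp_dir, name)
        if os.path.isfile(full_path):
            files_only.append(name)

print(files_only)  # ['notes.txt']
```

Passing the bare name to os.path.isfile() would check relative to the current working directory instead, which is an easy mistake to make here.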

Step 7: Check the File Extension.
Use os.path.splitext(filepath) to split the file extension from the filename. It returns a (root, extension) tuple, so the extension is at index 1.
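
Here is a small sketch of what os.path.splitext() returns. The filenames are invented for the example.

```python
import os

# Only the last dot counts as the start of the extension.
root, extension = os.path.splitext('report.final.txt')
print(root)       # report.final
print(extension)  # .txt

# A name without a dot gets an empty extension.
no_extension = os.path.splitext('README')
print(no_extension)  # ('README', '')

# Indexing works too, which is the form the project code uses.
is_txt = os.path.splitext('notes.txt')[1] == '.txt'
print(is_txt)  # True
```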

Step 8: Check if it is a .txt File and Print the Filename.
Use a conditional statement to check if the file has a .txt extension. If it does, print the filename. Once your code is working correctly, comment out the print statement and move on to the next step.

Step 9: Open the File and Read its Content.
Open the file in read mode and store its content in a variable.
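
As a quick illustration of this step, the sketch below writes a temporary file and reads it back. The file name and its contents are assumptions for the example only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'sample.txt')

    # Write some text so there is something to read.
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')

    # The with block closes the file automatically when it ends.
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

print(content.splitlines())  # ['first line', 'second line']
```

Passing encoding='utf-8' explicitly avoids depending on the platform's default encoding.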

Step 10: Compile the User-Supplied Regex and Find Matches.

Compile the user-supplied regex pattern using re.compile(user_supplied_regex, re.VERBOSE).
Call findall() on the compiled pattern to find all occurrences of the regex in the file's content.
If matches are found, print them as the results.
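
To see what re.VERBOSE changes, here is a minimal sketch. The phone-number pattern and the sample strings are assumptions chosen for illustration.

```python
import re

# With re.VERBOSE, whitespace and # comments inside the pattern are ignored.
pattern = r'''
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
'''
phone_regex = re.compile(pattern, re.VERBOSE)
numbers = phone_regex.findall('Call 555-1234 or 555-9876.')
print(numbers)  # ['555-1234', '555-9876']

# The spaces around | only count as part of the pattern without re.VERBOSE.
verbose_matches = re.compile(r'web | scraping', re.VERBOSE).findall('web scraping')
plain_matches = re.compile(r'web | scraping').findall('web scraping')
print(verbose_matches)  # ['web', 'scraping']
print(plain_matches)    # ['web ']
```

If a pattern needs to match a literal space under re.VERBOSE, escape it or put it inside a character class.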

import os, re

def regex_search(dir_path, user_supplied_regex):
    # provide abs path
    abs_dir = os.path.abspath(dir_path)
    # print(abs_dir)
    for filename in os.listdir(abs_dir):
        # print(filename)
        if os.path.isfile(os.path.join(abs_dir, filename)):
            # print(filename)
            split_file = os.path.splitext(os.path.join(abs_dir, filename))
            # print(split_file)
            # searches the directory for any .txt files
            if split_file[1] == '.txt':
                print(filename)

                # It opens the file
                with open(os.path.join(abs_dir, filename), 'r', encoding='utf-8') as f:
                    content = f.read()
                # print(content)

                # It finds every match of the user-supplied regex in the content
                user_regex = re.compile(user_supplied_regex, re.VERBOSE)
                for groups in user_regex.findall(content):
                    # it prints the result
                    print(groups)



if __name__ == '__main__':
    dir_path = r'C:\Users\Praise Idowu\Documents\blog-projects\Automating-the-boring-stuff-with-Python-blog-project\chapter-8\regex_search\folder1'
    # user_supplied_regex = r'web | scraping'
    user_supplied_regex = r'''(
        (https?://)[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?
        )'''
    regex_search(dir_path, user_supplied_regex)

The previous code works well, but it can be improved. Here are the updates that have been made:

Removed the repeated os.path.join calls by storing the result in a file_path variable.
Used filename.endswith('.txt') to check the extension. It returns a boolean.
Compiled the regex once, before the loop, instead of once per file.

import os, re

def regex_search(dir_path, user_supplied_regex):
    # Provide an absolute path
    abs_dir = os.path.abspath(dir_path)

    # Compile the user-supplied regex once, before the loop
    user_regex = re.compile(user_supplied_regex, re.VERBOSE)

    for filename in os.listdir(abs_dir):
        file_path = os.path.join(abs_dir, filename)

        # Process only regular files with a .txt extension
        if os.path.isfile(file_path) and filename.endswith('.txt'):
            print(filename)

            # Open and read the file
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            # Search for matches in the content
            matches = user_regex.findall(content)

            # Print each match
            for match in matches:
                print(match)

if __name__ == '__main__':
    dir_path = r'C:\Users\Praise Idowu\Documents\blog-projects\Automating-the-boring-stuff-with-Python-blog-project\chapter-8\regex_search\folder1'
    # user_supplied_regex = r'web | scraping'
    user_supplied_regex = r'''(
        (https?://)[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?
        )'''
    regex_search(dir_path, user_supplied_regex)
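
One detail of the endswith check is worth trying on its own: it is case-sensitive. The sketch below uses made-up filenames to show this, and lowercasing the name first is one possible workaround rather than something the project code does.

```python
filenames = ['notes.txt', 'NOTES.TXT', 'archive.txt.bak', 'txt']

# endswith() compares the end of the string exactly, including case.
strict = [name for name in filenames if name.endswith('.txt')]
print(strict)  # ['notes.txt']

# Lowercasing first also accepts upper-case extensions.
relaxed = [name for name in filenames if name.lower().endswith('.txt')]
print(relaxed)  # ['notes.txt', 'NOTES.TXT']
```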

# TODO: use input() instead

Exercise 1

Instead of hard-coding dir_path and user_supplied_regex, use the input() function to prompt the user to enter these values interactively.
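
If you want a starting point, here is one possible sketch. The helper name prompt_for_inputs and its ask parameter are hypothetical; ask defaults to the built-in input() and exists only so the function can be tried without typing.

```python
def prompt_for_inputs(ask=input):
    """Ask for the directory and the regex, and return them as a tuple."""
    # strip() removes stray spaces that are easy to paste in with a path.
    dir_path = ask('Enter the directory to search: ').strip()
    # The regex is returned unchanged, since spaces may be meaningful in it.
    user_supplied_regex = ask('Enter the regex to search for: ')
    return dir_path, user_supplied_regex


# Simulate a user typing two answers instead of reading from the keyboard.
answers = iter(['  folder1 ', r'\d+'])
result = prompt_for_inputs(lambda prompt: next(answers))
print(result)  # ('folder1', '\\d+')
```

In the real script you would call prompt_for_inputs() with no arguments and pass the two values to regex_search(). Text typed at an input() prompt is not processed for escape sequences, so no raw-string prefix is needed.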

Exercise 2

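Depending on the regex the user supplies, findall() returns either a list of strings (when the pattern has no groups or exactly one group) or a list of tuples (when it has two or more groups), so the printed results vary in shape.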
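
The following sketch shows how the number of groups changes what findall() returns. The sample text and the simplified URL patterns are assumptions for illustration, not the pattern used in the project.

```python
import re

text = 'Visit https://example.com or http://test.org today.'

# No groups: each result is the whole match as a string.
no_groups = re.findall(r'https?://\w+\.\w+', text)
print(no_groups)  # ['https://example.com', 'http://test.org']

# One group: each result is that group's text as a string.
one_group = re.findall(r'https?://(\w+)\.\w+', text)
print(one_group)  # ['example', 'test']

# Two or more groups: each result is a tuple with one item per group.
two_groups = re.findall(r'(https?)://(\w+)\.\w+', text)
print(two_groups)  # [('https', 'example'), ('http', 'test')]
```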

To make the code more flexible and handle these different output shapes, Tope has already implemented some code but needs help completing it. You can improve it by letting the user choose the output format (the default, a string, a list of tuples, or tuples) and by offering an option to index or slice the results.

Additionally, Tope has hard-coded match[0], which isn't what we want. Instead, let the user specify the index they want to access in the results (e.g., 0, 1, etc.). After completing the code, don't forget to share your repository link in the comments section for feedback.

# Print each match
for match in matches:
    # Extract and print all groups in the match
    # print(type(match))
    if isinstance(match, tuple):
        print(match[0])
    elif isinstance(match, list):
        pass
    else:
        print(match)

Conclusion

In this project, you have gained valuable skills in text processing and pattern matching. We have come to the end of this series. Moving forward, you will write code to organize files.

If you have any questions, want to connect, or just fancy a chat, feel free to reach out to me on LinkedIn and Twitter.