Daniel Diaz for Developer Road

Posted on • Edited on • Originally published at developerroad.herokuapp.com

📂 Merge common words of two files in Python, practice file handling

Hi there 😁

In this tutorial you will learn to select words from files, compare them and merge the common words in a new file.

You will practice:

  • File handling in Python
  • Sets in Python
  • Using functions to reuse code
  • Some basic command-line instructions

Let's start 🐍

Code: Remember that all the code shown during these tutorials will be pushed to GitHub.

Developer-road / merge-common-words

Program to merge the common words of two files into another one

First of all, create a Python file. I'll use a long and descriptive file name: merge.py 😆

Let's create that file with the terminal:

Note: if you are using Windows, some of these commands (such as touch) may not be available in your default shell, so feel free to do these steps with a graphical interface, like a file manager or your code editor. If you are using Mac 🍎 or Linux 🐧, stay relaxed and learn the terminal as best as you can.

Python scripts are usually run from the command line, so you will be using the terminal a lot. My advice is to use it whenever you can 😀

touch merge.py

The touch command will create an empty file merge.py, in the directory you are in.
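
If touch isn't available in your shell, the same step can be done from Python itself. This is only a minimal sketch using the standard pathlib module; the create_empty_file name is made up for illustration and isn't part of the project's code.

```python
import tempfile
from pathlib import Path


def create_empty_file(path):
    """Create an empty file at `path` if it doesn't exist yet, like `touch`."""
    file_path = Path(path)
    file_path.touch(exist_ok=True)  # keeps the content if the file already exists
    return file_path


# Demo in a temporary folder so the example doesn't clutter your project;
# in the tutorial you would simply call create_empty_file("merge.py")
with tempfile.TemporaryDirectory() as folder:
    new_file = create_empty_file(Path(folder) / "merge.py")
    file_was_created = new_file.exists()
    print(file_was_created)  # True
```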

Now I'm going to open a graphical editor from the terminal 😱. In this case I'll use VS Code.

code ./

Remember that in Unix-like systems (don't be scared, that just means Unix-based operating systems like Linux and macOS), the dot in ./ refers to the current directory.

Now that I have VS Code set up and running, let's start with the code.

Open a file in Python

First of all, we will use some text files in this tutorial. You can download them easily from GitHub, by cloning the repository or just following the links below.

text_one.txt
text_two.txt

In Python you open a file with the open() function, which takes the path of the file as a parameter.

Make sure you've downloaded the text files, and type the following code in the merge.py file.

file1 = open("text_one.txt")

This will create a variable file1, which is a file object, with the mode of "read" (by default).

If you try to print that variable you will get something like this:


print(file1)

# Result
# <_io.TextIOWrapper name='text_one.txt' mode='r' encoding='UTF-8'>

That's because that variable is just a file object for the text file named "text_one.txt".

But be careful: if the file doesn't exist, you will get an error 😱.

file1 = open("text_oness.txt")

# FileNotFoundError: [Errno 2] No such file or directory: 'text_oness.txt'

So each time we work with files, we should use some defensive programming to handle the error in case the file doesn't exist:

try:
    file1 = open("text_oness.txt")
except FileNotFoundError:
    print("Sorry that file doesn't exist")
    exit()

We're using the exit() function because if the file doesn't exist we won't have the variable to work with, so in that case we will terminate the execution.

If you want to get the content of a file in Python, you can use the {file_object}.read() method, which, as its name says, reads the content of that file for us.

print(file1.read())

# Results
# Why should you learn to write programs?

# Writing programs (or programming) is a very creative 
# and rewarding activity. MORE TEXT .....
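
A common variation on opening and reading is the with statement, which closes the file for you when the block ends. The sketch below is just one way to combine it with the try/except idea from above. The read_text_or_exit name and the temporary sample file are assumptions made for the example, and it uses sys.exit() (the standard-library counterpart of exit()) with an error code.

```python
import sys
import tempfile
from pathlib import Path


def read_text_or_exit(path):
    """Return the content of `path`, or stop the script if it doesn't exist."""
    try:
        # `with` closes the file automatically, even if read() raises an error
        with open(path, encoding="utf-8") as file:
            return file.read()
    except FileNotFoundError:
        print(f"Sorry, the file {path} doesn't exist")
        sys.exit(1)  # a non-zero exit code signals an error to the shell


# Demo with a temporary file so the example runs anywhere
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_text("Hello files", encoding="utf-8")
    content = read_text_or_exit(sample)
    print(content)  # Hello files
```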

Now that you know how to open files and read them, you can build the comparison functionality.

First make sure we are able to open the two files.

try:
    file1 = open("text_one.txt")
    file2 = open("text_two.txt")
except FileNotFoundError:
    print("Sorry that file doesn't exist")
    exit()

Now we will use the content of the files and the power of sets to get all the words of each file.

Sets in Python

A set in Python is an unordered data structure with a really special characteristic (among others): it doesn't allow repeated elements 🤫.

We will use sets to collect the words without repetition.

Creating a set in Python

To create a set in Python we will use the set() function.

empty_set = set()

A set vs. a list:

mylist = list("122333")
myset = set("122333")

print(mylist)
print(myset)

# List:
# ['1', '2', '2', '3', '3', '3']

# Set
# {'3', '2', '1'}

As you can see, the list kept every element in order, while the set kept only one copy of each element, in no particular order.
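
The same idea works with words instead of single characters. Here is a small sketch with a made-up list of words (the words variable is only sample data):

```python
words = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]

unique_words = set(words)

print(len(words))         # 8 items, duplicates included
print(len(unique_words))  # 5 items, each word only once

# Sets have no guaranteed order, so sort them when you need stable output
print(sorted(unique_words))
# ['and', 'bird', 'cat', 'dog', 'the']
```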

Getting the words of a file

We are going to iterate through the words of each file with a for loop, and add them to a set for each file.

file1_words = set()

file2_words = set()


for word in file1.read().split():
    file1_words.add(word.lower())

for word in file2.read().split():
    file2_words.add(word.lower())

print(file1_words)
print(file2_words)

Maybe the above code is a little bit confusing, so let's take a quick look at it.

First we initialize two sets, one for the first file and another for the second file.

Then we iterate with a for loop over each word of the file by calling file1.read().split().

In that part we use the read() method, which gives us the content of the file as a string, and then the string method split(), which gives us a list of the words in the file by splitting the string at every run of whitespace (spaces, tabs and newlines).

So basically we are iterating over:

for word in list_of_words_of_the_file:
    code...

Then we take the word, make it lowercase so the same word with different capitalization isn't counted twice, and add it to the file1_words set. Remember that a set can't contain repeated elements: if a word is already in the set, it won't be added again.

We do the same thing for both files.

Running that piece of code prints two sets with all the words of both files.
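
If you want to see what split() and lower() do without opening any file, here is a tiny sketch that runs the same loop on a string. The sample_text value is invented for the example.

```python
sample_text = "Python is fun\nand PYTHON is   powerful"

# split() with no arguments splits on any whitespace: spaces, tabs, newlines
words = sample_text.split()
print(words)
# ['Python', 'is', 'fun', 'and', 'PYTHON', 'is', 'powerful']

unique_words = set()
for word in words:
    unique_words.add(word.lower())

print(sorted(unique_words))
# ['and', 'fun', 'is', 'powerful', 'python']
```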

Challenge

But there is a problem, and I challenge you 🔥 to solve it.
Some words carry punctuation, for example "(on" and "on", and your task is to figure out how to remove the punctuation characters, so the same word doesn't appear more than once just because of its punctuation.

Reach out to me on Twitter or Instagram if you achieve it.
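
If you get stuck, here is one possible approach (spoiler alert, and not necessarily the solution the author had in mind). It is a sketch that strips punctuation from both ends of each word with string.punctuation from the standard library. The clean_word and get_clean_words names are hypothetical helpers, not part of the repository.

```python
import string


def clean_word(word):
    """Lowercase a word and remove punctuation from both ends."""
    return word.lower().strip(string.punctuation)


def get_clean_words(text):
    """Return the set of cleaned, non-empty words in `text`."""
    clean_words = set()
    for word in text.split():
        cleaned = clean_word(word)
        if cleaned:  # skip tokens that were only punctuation, like "--"
            clean_words.add(cleaned)
    return clean_words


print(sorted(get_clean_words("(On and on, they went -- on!)")))
# ['and', 'on', 'they', 'went']
```

Keep in mind that strip() only touches the start and the end of the word, so punctuation in the middle (like the apostrophe in "don't") stays where it is.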

Using sets to intersect common words

We are looking for common words in the sets we just created, and for that we will use intersection.

Yeah, that word may look scary since you've probably seen it in math, but don't worry. The intersection is just the common part of two sets.

Intersection

Finding the common elements could be tedious 🙄, since with most data structures you would have to iterate through the variables that contain the elements, compare them, select those that appear in both and append them to a new variable.

But with sets in Python, we have a special method that does all of that for us: {set1}.intersection({set2})

common_words = file1_words.intersection(file2_words)

print(common_words)

There it is. Now we can access the common words of the two files with a one-liner.
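
To make the comparison concrete, here is a short sketch that finds the common words the manual way and then with sets. The two sets are made-up sample data, and the & operator is shown as an equivalent spelling of intersection().

```python
file1_words = {"python", "is", "fun", "and", "powerful"}
file2_words = {"learning", "python", "is", "rewarding"}

# The manual way: loop over one collection and check the other
common_manual = []
for word in file1_words:
    if word in file2_words and word not in common_manual:
        common_manual.append(word)

# The set way: one method call, or the equivalent & operator
common_method = file1_words.intersection(file2_words)
common_operator = file1_words & file2_words

print(sorted(common_manual))             # ['is', 'python']
print(sorted(common_method))             # ['is', 'python']
print(common_method == common_operator)  # True
```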

Writing files in Python

Writing files in Python is not that hard. We open a file in write mode by using open("file/path", mode = "w").

This allows us to write to an existing file (overwriting its previous content) or, in case the file doesn't exist, creates a new one at the given file path.

So let's open our merge file.

merge_file = open("merge.txt", mode = "w")

# code_here ....

merge_file.close()

Remember that every time we open a file in write mode, we must close it, after making the desired operations.
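
If you are worried about forgetting close(), a common alternative is the with statement, which closes the file automatically when the block ends. This is just a sketch of that variation, written against a temporary folder so it doesn't touch your real merge.txt; the rest of the tutorial keeps using open() and close().

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    merge_path = Path(folder) / "merge.txt"

    # The file is closed automatically when the `with` block ends
    with open(merge_path, mode="w", encoding="utf-8") as merge_file:
        merge_file.write("hello, world")

    print(merge_file.closed)  # True
    written = merge_path.read_text(encoding="utf-8")
    print(written)  # hello, world
```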

Now we're going to write all the common words to the new merge.txt file.

merge_file = open("merge.txt", mode = "w")

for word in common_words:
    word = word + ", "
    merge_file.write(word)

merge_file.close()

We append a comma and a space to the end of each word, so we can tell the words apart in the file.

If you run this code, you will get the desired result.

try:
    file1 = open("text_one.txt")
    file2 = open("text_two.txt")
except FileNotFoundError:
    print("Sorry that file doesn't exist")
    exit()


file1_words = set()

file2_words = set()


for word in file1.read().split():

    file1_words.add(word.lower())

for word in file2.read().split():

    file2_words.add(word.lower())


common_words = file1_words.intersection(file2_words)

merge_file = open("merge.txt", mode="w")

for word in common_words:
    word = word + ", "
    merge_file.write(word)

merge_file.close()


If you check the new "merge.txt" file, you will see all the common words between the files text_one.txt and text_two.txt.
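
You may notice two small details in merge.txt: the loop leaves a trailing ", " after the last word, and since sets are unordered the words can come out in a different order on each run. If that bothers you, one possible variation (a sketch, not what the repository does) is to sort the words and build the text with str.join(). The common_words value here is sample data.

```python
common_words = {"python", "is", "fun"}

# sorted() gives a stable alphabetical order; join() puts ", " only between words
merged_text = ", ".join(sorted(common_words))

print(merged_text)  # fun, is, python
```

With this approach you would call merge_file.write(merged_text) once, instead of writing inside the loop.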

Congratulations 🎉, you just created a simple merge algorithm. Merge algorithms (much more complex ones, for sure) are part of how Git works when it compares and combines code files.

But wait a minute, this code is clunky and there are many parts where we repeat the same process.

So let's use the power of functions to make our code reusable and more scalable.

First let's make a function that opens a file and handles exceptions. That function will take 2 arguments: the first will be the file path, and the second an optional argument with the open mode of the file.

# Python function that handles exceptions while opening files

def open_file(file_path, open_mode="r"):

    try:
        file_handler = open(file_path, mode=open_mode)

    except FileNotFoundError:
        print(f"Sorry the file {file_path} doesn't exist")
        exit()

    except ValueError:
        print(f"Sorry the file {file_path} can't be opened with mode {open_mode}")
        exit()

    return file_handler            
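
To see where those two exceptions come from, here is a small sketch that triggers each of them on purpose. The missing.txt path and the "z" mode are made up for the demo.

```python
import tempfile
from pathlib import Path

raised = []

with tempfile.TemporaryDirectory() as folder:
    missing = Path(folder) / "missing.txt"

    try:
        open(missing)  # read mode on a file that doesn't exist
    except FileNotFoundError as error:
        raised.append(type(error).__name__)

    try:
        open(missing, mode="z")  # "z" is not a valid open mode
    except ValueError as error:
        raised.append(type(error).__name__)

print(raised)  # ['FileNotFoundError', 'ValueError']
```

Other errors, such as PermissionError, aren't handled by open_file(), so they would still stop the script with a traceback.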

Then a function that lets us collect the words of a file.

def get_file_words(file_path):

    file_words = set()

    read_file = open_file(file_path)

    for word in read_file.read().split():

        file_words.add(word.lower())

    # Close the file once all its words have been collected
    read_file.close()

    return file_words


Here, as you may notice, we take as a parameter the path of the file we're going to get the words from, and get the file handler through the open_file() function we just created.

Lastly, let's create a merge function, which will get the words of each file, intersect them to find the common words, and write those words to a file.

def merge(*filenames, merge_file="merge.txt"):

    list_of_file_words = []

    for filename in filenames:

        file_words = get_file_words(filename)
        list_of_file_words.append(file_words)

    common_words = set.intersection(*list_of_file_words)

    merge_write_file = open_file(merge_file, "w")

    for word in common_words:

        word = word + ", "

        merge_write_file.write(word)

    merge_write_file.close()


Here we use the power of *args in Python functions, which allows us to pass multiple filenames to the function if we want to merge more than two files.

You may notice that I used a for loop to iterate over the filenames argument. That's because we can now receive any number of filenames, which makes our merge algorithm more powerful.
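
The star shows up twice in merge(): once to collect the arguments, and once to unpack the list of sets into set.intersection(). Here is a small sketch of both uses with made-up data; count_filenames is a hypothetical helper written only for this demo.

```python
list_of_file_words = [
    {"python", "is", "fun"},
    {"python", "is", "fast"},
    {"python", "is", "everywhere"},
]

# The * unpacks the list, so this is the same as
# set.intersection(set_one, set_two, set_three)
common_words = set.intersection(*list_of_file_words)
print(sorted(common_words))  # ['is', 'python']


def count_filenames(*filenames):
    """*filenames collects every positional argument into a tuple."""
    return len(filenames)


print(count_filenames("a.txt", "b.txt", "c.txt"))  # 3
```

One thing to keep in mind: set.intersection() needs at least one set, so calling merge() with no filenames at all would raise a TypeError.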

Compacting our code in a main() function

As a best practice in Python, you can use a main() function that calls and performs every operation your script does.

def main():

    file1 = "text_one.txt"

    file2 = "text_two.txt"

    merge(file1, file2, merge_file="merge_main.txt")

This main function calls the merge function and passes the file1 and file2 variables as arguments. We also specify the merge_file parameter, which tells the merge function which file it has to write to.

But as you may have noticed, we haven't called any function yet, so let's call the main function.

if __name__ == "__main__":
    main()

The __name__ variable deserves another blog post, but basically here you're telling Python:

  • Hey dude 🥸, if this is the Python file you're executing from the terminal, run the main() function.

So the final code of this cool algorithm is this:

def open_file(file_path, open_mode="r"):

    try:
        file_handler = open(file_path, mode=open_mode)

    except FileNotFoundError:
        print(f"Sorry the file {file_path} doesn't exist")
        exit()

    except ValueError:
        print(f"Sorry the file {file_path} can't be opened with mode {open_mode}")
        exit()

    return file_handler            


def get_file_words(file_path):

    file_words = set()

    read_file = open_file(file_path)

    for word in read_file.read().split():

        file_words.add(word.lower())

    # Close the file once all its words have been collected
    read_file.close()

    return file_words

def merge(*filenames, merge_file="merge.txt"):

    list_of_file_words = []

    for filename in filenames:

        file_words = get_file_words(filename)
        list_of_file_words.append(file_words)

    common_words = set.intersection(*list_of_file_words)

    merge_write_file = open_file(merge_file, "w")

    for word in common_words:

        word = word + ", "

        merge_write_file.write(word)

    merge_write_file.close()



def main():

    file1 = "text_one.txt"

    file2 = "text_two.txt"

    merge(file1, file2, merge_file="merge_main.txt")


if __name__ == "__main__":
    main()

Conclusions:

In this tutorial you practiced:

  • File handling in Python
  • Sets and lists in Python
  • Iteration with for loops
  • Function definition and usage
  • Python main function best practices

If you find any error in this tutorial, don't hesitate to contact me or make a pull request in the GitHub repo.

Follow me on my blog
to get more awesome tutorials like this one.
Please consider supporting me on Ko-fi. You help me a lot to
continue building these tutorials!