DEV Community

loading...
Cover image for Flexible CSV Handling in Python with DictReader and DictWriter

Flexible CSV Handling in Python with DictReader and DictWriter

bowmanjd profile image Jonathan Bowman ・9 min read

The code in this article is also available in a public Github repo here.

Python is great for data wrangling. Do you find yourself engaging in the process of extracting data from one system, transforming it, and then loading it into another (sometimes called ETL)?. You may have encountered CSV files a time or two, then.

CSV files are often used as a vehicle to carry large simple tables of data. Python has built-in CSV handling capability. In this article, we will explore a flexible way of reading and saving CSV files using Python's csv.DictReader and csv.DictWriter.

Flexible field handling

In this instance, I use flexible to mean resiliency in the handler when dealing with dynamic columns.

Imagine this scenario: the system you are extracting from changes (a product upgrade, for instance) adding fields to the CSV export file. And, conveniently, adds them as columns, not at the end, but in the middle, or at the leftmost column. Oh, the agony, if you expect certain data to be in the 3rd column, and that data is now in the 4th! Is there a way to construct a handler in Python that can handle this sort of moving target gracefully?

Yes.

A CSV example

Let's take a look at some tabular data (in other words, data in a table):

Last name First name Middle name Role School Grade
Herman Preston Elliott Teacher Southside
Randolph Finnegan Braylon Student Eastside 12
Hood Easton Fernando Student Southside 7
Kemp Logan Kinsley Teacher Southside

The above data in CSV format, with a header row, should look like this:

Last name,First name,Middle name,Role,School,Grade
Herman,Preston,Elliott,Teacher,Southside,
Randolph,Finnegan,Braylon,Student,Eastside,12
Hood,Easton,Fernando,Student,Southside,7
Kemp,Logan,Kinsley,Teacher,Southside,
Enter fullscreen mode Exit fullscreen mode

If you would like, you can download the above CSV sample here.

Place the above in a file called sample.csv in your current working directory. You may also clone the repo for this article and follow along in that way.

Loading the CSV file

To load a CSV file in Python, we first open the file. For file handling in Python, I always like to use pathlib for its convenience and flexibility.

from pathlib import Path

inpath = Path("sample.csv")
Enter fullscreen mode Exit fullscreen mode

The above simply assigns the inpath variable to point to our file.

A convenient and safe way to open a file for reading or writing is by using the with statement, because it creates a "context manager" in which the file can be handled, and will automatically close when finished. In other words, Python will happily clean up after you:

with inpath.open("r") as infile:
    contents = infile.read()
Enter fullscreen mode Exit fullscreen mode

And that will work well for loading the entire contents of the file into the contents variable, then closing the file. We indicate that we intend to read from the file with the "r" mode (as opposed to "w" for writing).

Admittedly, it would be even easier to write contents = inpath.read_text(). However, that would not have demonstrated with and open very well! And it isn't relevant for CSV handling, as you will see in the next sentence.

With CSV handling in Python, however, we do not need to load the contents of the file into memory. Instead, we pass the file handle (infile in our example above) to a csv.DictReader. This gives us advanced row-parsing ability, and the memory savings that comes from loading the file line by line.

import csv
from pathlib import Path


inpath = Path("sample.csv")

with inpath.open("r", newline="", encoding="utf-8-sig") as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        fullname= f"{row['First name']} {row['Middle name']} {row['Last name']}"
        print(fullname)
Enter fullscreen mode Exit fullscreen mode

You may also download the above code from here.

If you save the above code in a file called simplereader.py, assuming it is in the same directory as sample.csv, the following should be possible:

$ python simplereader.py
Preston Elliott Herman
Finnegan Braylon Randolph
Easton Fernando Hood
Logan Kinsley Kemp
Enter fullscreen mode Exit fullscreen mode

The python simplereader.py command runs the script we just wrote with Python.

Depending on your system, you may need to use python3 instead of python. If python -V doesn't show something like Python 3.8.5 (the 3 is the important part!), and python3 -V does, then use the latter instead.

Two additions in the above code are worth noting. When opening the file, we specify newline="" to enable universal newline support. The rationale is complex, and we won't dive into the many ways different operating systems historically handle newlines. Suffice it to say that you should generally specify newline="" when opening files for CSV handling. For more details, you may read about newline parameter in the offical docs.

The second addition is the character encoding.

The importance of character encoding

We dare not neglect/ignore/assume character encoding, to put it mildly. It only works in the above case because our sample.csv happens to only have only ASCII characters. As soon as we add people with names such as "Miloš" and "María", and especially 张伟, then we need to think more carefully. Literally care-fully, because these choices do indeed convey care for real people, as well as the quality and functioning of real data and user interfaces.

Saying "I won't use a character encoding" is not only technically impossible, it is akin to complaining, "Why can't everyone just speak American?"

In the above example, we specified utf-8-sig which will open and translate both vanilla UTF-8 text encodings, as well as files in which the UTF-8 encoding is signaled with a Beginning Of Message (BOM) character, such as in Microsoft Excel's new "CSV UTF-8" format.

With files saved in the older, and more common, Microsoft Excel CSV format, the character encoding is likely "cp1252" which is a latin encoding.

In summary, always designate the character encoding. If you don't know the encoding of a CSV file you did not create, encoding options may be "cp1252", "utf-8", "utf-8-sig", possibly "iso-8859-1" or others. You may need to experiment.

When necessary, automatic detecting of a file's character encoding is possible by using a Python library such as chardet or the speedier cchardet. But that deserves a separate article.

Writing a CSV file

Creating a new CSV file is similar to reading one, in that we open the file in a context manager using the with statement, and we need to pay careful attention to character encoding. This time, however, we will:

  • Open the file in "w" mode, for writing
  • Pass in the field (column) names when instantiating csv.DictWriter
  • write the header with writeheader() before writing any rows
  • write the row(s) with new dictionaries

The following code demonstrates the above:

import csv
from pathlib import Path


outpath = Path("out/output.csv")
outpath.parent.mkdir(exist_ok=True)  # ensure the "out" directory exists

with outpath.open("w", newline="", encoding="utf-8-sig") as outfile:
    writer = csv.DictWriter(outfile, ["First", "Last"])
    writer.writeheader()
    new_row = {"First": "Jane", "Last": "Smith"}
    writer.writerow(new_row)
Enter fullscreen mode Exit fullscreen mode

You may also download the above code here.

(Note that we used utf-8-sig as the encoding, so that it opens well as a UTF-8 encoded CSV in Microsoft Excel. If you do not want the BOM character at the beginning of the file, you might change the encoding to just utf-8.)

After saving the above as simplewriter.py and running it with python simplewriter.py there should be an out directory with output.csv in it. That file will have two rows: the header, and one data row:

First Last
Jane Smith

To add subsequent rows, you may call writer.writerow() again with new dictionaries.

Transforming data by reading from one file while writing to another

When handling CSV files, I find that I often want to extract data from one CSV file, transform it in some way, and then immediately create and populate a second CSV file with the reorganized/changed/cleaned data. I like to do this in a way that is simple, memory-efficient, and safe.

The solution: open two files (two context managers) in one with statement:

inpath = Path("sample.csv")
outpath = Path("out/transformed.csv")
args = {"newline": "", "encoding": "utf-8-sig"}
with inpath.open("r", **args) as infile, outpath.open("w", **args) as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, ["Firstname", "Lastname", "Username"])
Enter fullscreen mode Exit fullscreen mode

As you may already be aware, it is possible to pass a dictionary of keyword arguments to a Python function. Because both open() functions needed the same newline and encoding arguments, I built a dictionary with those then pass it to each open(), prefixing the necessary **. That way I don't repeat myself.

The above (incomplete) example can be filled out with row handling logic so that each row can be read, transformed, then written.

Let's look at a complete example:

import csv
from pathlib import Path


def transform():
    """Filters and transforms each row of input CSV, with output to a separate file."""
    inpath = Path("sample.csv")
    outpath = Path("out/transformed.csv")
    outpath.parent.mkdir(exist_ok=True)
    args = {"newline": "", "encoding": "utf-8-sig"}
    with inpath.open("r", **args) as infile, outpath.open("w", **args) as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, ["Firstname", "Lastname", "Username"])

        writer.writeheader()

        for row in reader:
            if row["Role"] == "Student":
                new_row = {
                    "Firstname": row["First name"],
                    "Lastname": row["Last name"],
                    "Username": f"{row['Last name']}{row['First name']}".lower(),
                }
                writer.writerow(new_row)


if __name__ == "__main__":
    transform()
Enter fullscreen mode Exit fullscreen mode

You can also download the code here.

Saving the above code as simpletransform.py and running it with python simpletransform.py should quickly create a new transformed.csv file in the out/ directory, with contents like this:

Firstname Lastname Username
Finnegan Randolph randolphfinnegan
Easton Hood hoodeaston

This demonstrates several common transformation tasks:

  • We renamed the fields (column headers) to our preference (no spaces!)
  • We filtered on a value: we only transformed and wrote out the rows that had the role of "Student"
  • We added values with a "Username" formula (Last name concatenated with First name, all lowercase

Clearly, your imagination is already considering much more complex and useful transformations of your data. Go for it.

Better reusability and command line processing

I don't think our CSV transformer is complete without better parameterization and command line support. Often, I want such a tool to be used repeatedly and flexibly, with meaningful feedback if I forget one of the inputs, and meaningful documentation when I utilize it again after months of languishing.

Python includes a command line parser called argparse. We will use that in the following final example:

"""Python CSV handling demo."""

__version__ = "0.1.0"

import argparse
import csv
import sys
from pathlib import Path


def transform(
    infilename, outfilename, inencoding="utf-8-sig", outencoding="utf-8-sig",
):
    """Filters and transforms each row, with output to a separate file."""
    inpath = Path(infilename)
    outpath = Path(outfilename)
    args = {"newline": ""}
    inargs = {"encoding": inencoding, **args}
    outargs = {"encoding": outencoding, **args}
    with inpath.open("r", **inargs) as infile, outpath.open("w", **outargs) as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, ["Firstname", "Lastname", "Username"])

        writer.writeheader()

        for row in reader:
            if row["Role"] == "Student":
                new_row = {
                    "Firstname": row["First name"],
                    "Lastname": row["Last name"],
                    "Username": f"{row['Last name']}{row['First name']}".lower(),
                }
                writer.writerow(new_row)


def run(args=None):
    """Execute as command line script."""
    if not args:
        args = sys.argv[1:]
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "inputfile", help="read from this CSV file",
    )
    parser.add_argument(
        "outputfile", help="save to this CSV file",
    )
    parser.add_argument(
        "-e",
        "--encoding",
        default="utf-8-sig",
        help="character encoding of input and output files",
    )
    parser.add_argument(
        "-ie", "--inputencoding", help="character encoding of input file",
    )
    parser.add_argument(
        "-oe", "--outputencoding", help="character encoding to use for output file",
    )
    args = parser.parse_args(args)

    inputencoding = args.inputencoding or args.encoding
    outputencoding = args.outputencoding or args.encoding

    transform(args.inputfile, args.outputfile, inputencoding, outputencoding)


if __name__ == "__main__":
    run()
Enter fullscreen mode Exit fullscreen mode

You might save the above file as csvdemo.py and run it with python csvdemo.py -h to see the command line options.

The following will parse our sample file again and write the results to out/sample_transformed.csv in a format easily readable by newer versions of Microsoft Excel:

python csvdemo.py -e utf-8-sig sample.csv out/sample_transformed.csv
Enter fullscreen mode Exit fullscreen mode

If you want to write it in Latin (Western European) encoding instead, you could try a different output encoding, while keeping the input encoding as UTF-8:

python csvdemo.py -ie utf-8-sig -oe cp1252 sample.csv out/sample_transformed_western.csv
Enter fullscreen mode Exit fullscreen mode

Feel free to read my tutorial on using argparse to explore that command line parsing tool further.

Adapt as you like, and consult the Python CSV docs as needed.

Enjoy building your own tooling for CSV handling! Good data to you.

Discussion (0)

pic
Editor guide