The code in this article is also available in a public Github repo here.
Python is great for data wrangling. Do you find yourself engaging in the process of extracting data from one system, transforming it, and then loading it into another (sometimes called ETL)?. You may have encountered CSV files a time or two, then.
CSV files are often used as a vehicle to carry large simple tables of data. Python has built-in CSV handling capability. In this article, we will explore a flexible way of reading and saving CSV files using Python's csv.DictReader
and csv.DictWriter
.
Flexible field handling
In this instance, I use flexible to mean resiliency in the handler when dealing with dynamic columns.
Imagine this scenario: the system you are extracting from changes (a product upgrade, for instance) adding fields to the CSV export file. And, conveniently, adds them as columns, not at the end, but in the middle, or at the leftmost column. Oh, the agony, if you expect certain data to be in the 3rd column, and that data is now in the 4th! Is there a way to construct a handler in Python that can handle this sort of moving target gracefully?
Yes.
A CSV example
Let's take a look at some tabular data (in other words, data in a table):
Last name | First name | Middle name | Role | School | Grade |
---|---|---|---|---|---|
Herman | Preston | Elliott | Teacher | Southside | |
Randolph | Finnegan | Braylon | Student | Eastside | 12 |
Hood | Easton | Fernando | Student | Southside | 7 |
Kemp | Logan | Kinsley | Teacher | Southside |
The above data in CSV format, with a header row, should look like this:
Last name,First name,Middle name,Role,School,Grade
Herman,Preston,Elliott,Teacher,Southside,
Randolph,Finnegan,Braylon,Student,Eastside,12
Hood,Easton,Fernando,Student,Southside,7
Kemp,Logan,Kinsley,Teacher,Southside,
If you would like, you can download the above CSV sample here.
Place the above in a file called sample.csv
in your current working directory. You may also clone the repo for this article and follow along in that way.
Loading the CSV file
To load a CSV file in Python, we first open the file. For file handling in Python, I always like to use pathlib for its convenience and flexibility.
from pathlib import Path
inpath = Path("sample.csv")
The above simply assigns the inpath
variable to point to our file.
A convenient and safe way to open a file for reading or writing is by using the with
statement, because it creates a "context manager" in which the file can be handled, and will automatically close when finished. In other words, Python will happily clean up after you:
with inpath.open("r") as infile:
contents = infile.read()
And that will work well for loading the entire contents of the file into the contents
variable, then closing the file. We indicate that we intend to read from the file with the "r" mode (as opposed to "w" for writing).
Admittedly, it would be even easier to write
contents = inpath.read_text()
. However, that would not have demonstratedwith
andopen
very well! And it isn't relevant for CSV handling, as you will see in the next sentence.
With CSV handling in Python, however, we do not need to load the contents of the file into memory. Instead, we pass the file handle (infile
in our example above) to a csv.DictReader
. This gives us advanced row-parsing ability, and the memory savings that comes from loading the file line by line.
import csv
from pathlib import Path
inpath = Path("sample.csv")
with inpath.open("r", newline="", encoding="utf-8-sig") as infile:
reader = csv.DictReader(infile)
for row in reader:
fullname= f"{row['First name']} {row['Middle name']} {row['Last name']}"
print(fullname)
You may also download the above code from here.
If you save the above code in a file called simplereader.py
, assuming it is in the same directory as sample.csv
, the following should be possible:
$ python simplereader.py
Preston Elliott Herman
Finnegan Braylon Randolph
Easton Fernando Hood
Logan Kinsley Kemp
The python simplereader.py
command runs the script we just wrote with Python.
Depending on your system, you may need to use
python3
instead ofpython
. Ifpython -V
doesn't show something likePython 3.8.5
(the3
is the important part!), andpython3 -V
does, then use the latter instead.
Two additions in the above code are worth noting. When opening the file, we specify newline=""
to enable universal newline support. The rationale is complex, and we won't dive into the many ways different operating systems historically handle newlines. Suffice it to say that you should generally specify newline=""
when opening files for CSV handling. For more details, you may read about newline
parameter in the offical docs.
The second addition is the character encoding.
The importance of character encoding
We dare not neglect/ignore/assume character encoding, to put it mildly. It only works in the above case because our sample.csv
happens to only have only ASCII characters. As soon as we add people with names such as "Miloš" and "María", and especially 张伟, then we need to think more carefully. Literally care-fully, because these choices do indeed convey care for real people, as well as the quality and functioning of real data and user interfaces.
Saying "I won't use a character encoding" is not only technically impossible, it is akin to complaining, "Why can't everyone just speak American?"
In the above example, we specified utf-8-sig
which will open and translate both vanilla UTF-8 text encodings, as well as files in which the UTF-8 encoding is signaled with a Beginning Of Message (BOM) character, such as in Microsoft Excel's new "CSV UTF-8" format.
With files saved in the older, and more common, Microsoft Excel CSV format, the character encoding is likely "cp1252" which is a latin encoding.
In summary, always designate the character encoding. If you don't know the encoding of a CSV file you did not create, encoding options may be "cp1252", "utf-8", "utf-8-sig", possibly "iso-8859-1" or others. You may need to experiment.
When necessary, automatic detecting of a file's character encoding is possible by using a Python library such as chardet or the speedier cchardet. But that deserves a separate article.
Writing a CSV file
Creating a new CSV file is similar to reading one, in that we open the file in a context manager using the with
statement, and we need to pay careful attention to character encoding. This time, however, we will:
- Open the file in "w" mode, for writing
- Pass in the field (column) names when instantiating
csv.DictWriter
- write the header with
writeheader()
before writing any rows - write the row(s) with new dictionaries
The following code demonstrates the above:
import csv
from pathlib import Path
outpath = Path("out/output.csv")
outpath.parent.mkdir(exist_ok=True) # ensure the "out" directory exists
with outpath.open("w", newline="", encoding="utf-8-sig") as outfile:
writer = csv.DictWriter(outfile, ["First", "Last"])
writer.writeheader()
new_row = {"First": "Jane", "Last": "Smith"}
writer.writerow(new_row)
You may also download the above code here.
(Note that we used utf-8-sig
as the encoding, so that it opens well as a UTF-8 encoded CSV in Microsoft Excel. If you do not want the BOM character at the beginning of the file, you might change the encoding to just utf-8
.)
After saving the above as simplewriter.py
and running it with python simplewriter.py
there should be an out
directory with output.csv
in it. That file will have two rows: the header, and one data row:
First | Last |
---|---|
Jane | Smith |
To add subsequent rows, you may call writer.writerow()
again with new dictionaries.
Transforming data by reading from one file while writing to another
When handling CSV files, I find that I often want to extract data from one CSV file, transform it in some way, and then immediately create and populate a second CSV file with the reorganized/changed/cleaned data. I like to do this in a way that is simple, memory-efficient, and safe.
The solution: open two files (two context managers) in one with
statement:
inpath = Path("sample.csv")
outpath = Path("out/transformed.csv")
args = {"newline": "", "encoding": "utf-8-sig"}
with inpath.open("r", **args) as infile, outpath.open("w", **args) as outfile:
reader = csv.DictReader(infile)
writer = csv.DictWriter(outfile, ["Firstname", "Lastname", "Username"])
As you may already be aware, it is possible to pass a dictionary of keyword arguments to a Python function. Because both
open()
functions needed the samenewline
andencoding
arguments, I built a dictionary with those then pass it to eachopen()
, prefixing the necessary**
. That way I don't repeat myself.
The above (incomplete) example can be filled out with row handling logic so that each row can be read, transformed, then written.
Let's look at a complete example:
import csv
from pathlib import Path
def transform():
"""Filters and transforms each row of input CSV, with output to a separate file."""
inpath = Path("sample.csv")
outpath = Path("out/transformed.csv")
outpath.parent.mkdir(exist_ok=True)
args = {"newline": "", "encoding": "utf-8-sig"}
with inpath.open("r", **args) as infile, outpath.open("w", **args) as outfile:
reader = csv.DictReader(infile)
writer = csv.DictWriter(outfile, ["Firstname", "Lastname", "Username"])
writer.writeheader()
for row in reader:
if row["Role"] == "Student":
new_row = {
"Firstname": row["First name"],
"Lastname": row["Last name"],
"Username": f"{row['Last name']}{row['First name']}".lower(),
}
writer.writerow(new_row)
if __name__ == "__main__":
transform()
You can also download the code here.
Saving the above code as simpletransform.py
and running it with python simpletransform.py
should quickly create a new transformed.csv
file in the out/
directory, with contents like this:
Firstname | Lastname | Username |
---|---|---|
Finnegan | Randolph | randolphfinnegan |
Easton | Hood | hoodeaston |
This demonstrates several common transformation tasks:
- We renamed the fields (column headers) to our preference (no spaces!)
- We filtered on a value: we only transformed and wrote out the rows that had the role of "Student"
- We added values with a "Username" formula (Last name concatenated with First name, all lowercase
Clearly, your imagination is already considering much more complex and useful transformations of your data. Go for it.
Better reusability and command line processing
I don't think our CSV transformer is complete without better parameterization and command line support. Often, I want such a tool to be used repeatedly and flexibly, with meaningful feedback if I forget one of the inputs, and meaningful documentation when I utilize it again after months of languishing.
Python includes a command line parser called argparse
. We will use that in the following final example:
"""Python CSV handling demo."""
__version__ = "0.1.0"
import argparse
import csv
import sys
from pathlib import Path
def transform(
infilename, outfilename, inencoding="utf-8-sig", outencoding="utf-8-sig",
):
"""Filters and transforms each row, with output to a separate file."""
inpath = Path(infilename)
outpath = Path(outfilename)
args = {"newline": ""}
inargs = {"encoding": inencoding, **args}
outargs = {"encoding": outencoding, **args}
with inpath.open("r", **inargs) as infile, outpath.open("w", **outargs) as outfile:
reader = csv.DictReader(infile)
writer = csv.DictWriter(outfile, ["Firstname", "Lastname", "Username"])
writer.writeheader()
for row in reader:
if row["Role"] == "Student":
new_row = {
"Firstname": row["First name"],
"Lastname": row["Last name"],
"Username": f"{row['Last name']}{row['First name']}".lower(),
}
writer.writerow(new_row)
def run(args=None):
"""Execute as command line script."""
if not args:
args = sys.argv[1:]
parser = argparse.ArgumentParser()
parser.add_argument(
"inputfile", help="read from this CSV file",
)
parser.add_argument(
"outputfile", help="save to this CSV file",
)
parser.add_argument(
"-e",
"--encoding",
default="utf-8-sig",
help="character encoding of input and output files",
)
parser.add_argument(
"-ie", "--inputencoding", help="character encoding of input file",
)
parser.add_argument(
"-oe", "--outputencoding", help="character encoding to use for output file",
)
args = parser.parse_args(args)
inputencoding = args.inputencoding or args.encoding
outputencoding = args.outputencoding or args.encoding
transform(args.inputfile, args.outputfile, inputencoding, outputencoding)
if __name__ == "__main__":
run()
You might save the above file as csvdemo.py
and run it with python csvdemo.py -h
to see the command line options.
The following will parse our sample file again and write the results to out/sample_transformed.csv
in a format easily readable by newer versions of Microsoft Excel:
python csvdemo.py -e utf-8-sig sample.csv out/sample_transformed.csv
If you want to write it in Latin (Western European) encoding instead, you could try a different output encoding, while keeping the input encoding as UTF-8:
python csvdemo.py -ie utf-8-sig -oe cp1252 sample.csv out/sample_transformed_western.csv
Feel free to read my tutorial on using argparse
to explore that command line parsing tool further.
Adapt as you like, and consult the Python CSV docs as needed.
Enjoy building your own tooling for CSV handling! Good data to you.
Top comments (0)