CSV files are extremely common. I run into them all the time at work when transferring data. At home, a CSV file is almost always an option when exporting a file from the bank or out of Google Sheets.
CSV = C omma S eparated V alues
We see them all the time but there seems to be some confusion about what a CSV file really is. When I talk to people about CSV files, I realize that most of us equate them with Microsoft Excel documents. This is probably because Excel is usually the default program for opening CSV files.
Here is the thing. A CSV file is simply a text file. That's it, nothing more. The file extension is usually
.csv but it doesn't have to be. Any file extension is simply a hint to what is contained in the file and doesn't actually determine or control the content type. The Wikipedia Article about CSV files states that:
"CSV" is not a single, well-defined format
As a result, in practice the term "CSV" might refer to any file that:
- is plain text using a character set such as ASCII, various Unicode character sets
- consists of records (typically one record per line), with the records divided into fields separated by delimiters
- where every record has the same sequence of fields
Here is an example. I have a file called
users.csv sitting on my desktop. In a spreadsheet program (Excel, Google Sheets, LibreOffice, Numbers, etc) this file would look like this.
Those nice columns and rows are nice and fancy. We think there is something magical going one. However, the contents of this file are really just:
user_id,first_name,last_name,email 234,Kevin,Crommer,email@example.com 235,Carl,Marx,firstname.lastname@example.org
You can open a CSV file with a plaintext editor to see the true contents. It is very similar to viewing the source of a web page and looking at the underlying HTML.
A spreadsheet programs reads the CSV file and displays to the user (you and me) a pretty version that makes it easy to see the columns and rows.
As you can see in my example, A CSV file typically has one row per record. In this case there are 4 columns. The headers of the columns are
- Each row in a CSV file is separated by line breaks.
- Columns are separated by a known character. They are almost always comma-separated but I have seen commas, tabs, pipes (
|) and other wierd stuff.
- If a field contains a comma, it should be surrounded by quotes
- If a field contains single quotes then it should be surrounded by double quotes
- If a field contains double quotes then it should be surrounded by single quotes
user_id,citation 987,'The Man said,"Who goes there?"'
The moral of the story here is:
Don't try to parse CSV files by manually reading a line of the file, splitting by commas into a list.
There are way too many ways to get it wrong. Instead, use an existing library to do this part. In Python, the
csv module is essential. It takes care of reading all of these possible things.
Here are the basic steps for reading a CSV file with the builtin
csv Python module.
- Open a CSV file for reading
- Pass the file handler to the CSV reader
- Iterate through the rows of the file
I will walk through each one in order, showing how I would read my example file from earlier.
ALl during this example, we are going to use the following CSV data. We will assume it is saved a file called
quotes.csv sitting next to our script.
quote,speaker "Don’t you call me a mindless philosopher, you overweight glob of grease!",C-3PO "We’re doomed.",C-3PO "But I was going into Tosche Station to pick up some power converters!",Luke Skywalker "Help me, Obi-Wan Kenobi. You’re my only hope.",Leia Organa "An elegant weapon for a more civilized age.",Obi-Wan Kenobi "I find your lack of faith disturbing.",Darth Vader "Mos Eisley spaceport. You will never find a more wretched hive of scum and villainy.",Obi-Wan Kenobi "You don’t need to see his identification…These aren’t the droids you’re looking for.",Obi-Wan Kenobi "It’s the ship that made the Kessel Run in less than 12 parsecs.",Han Solo "Sorry about the mess.",Han Solo "She may not look like much, but she’s got it where it counts, kid.",Han Solo "Governor Tarkin, I should’ve expected to find you holding Vader’s leash. I recognized your foul stench when I was brought on board.",Leia Organa "I felt a great disturbance in the Force. As if millions of voices suddenly cried out in terror and were suddenly silenced.",Obi-Wan Kenobi "I suggest a new strategy, Artoo: Let the Wookiee win.",C-3PO "Hokey religions and ancient weapons are no match for a good blaster at your side, kid.",Han Solo "That’s no moon. It’s a space station.",Obi-Wan Kenobi "Who’s the more foolish? The fool or the fool who follows him?",Obi-Wan Kenobi "Aren’t you a little short for a stormtrooper?",Leia Organa "Somebody has to save our skins. Into the garbage chute, flyboy!",Leia Organa "I got a bad feeling about this.",Han Solo "When I left you I was but the learner. Now I am the master.",Darth Vader "If you strike me down I shall become more powerful than you can possibly imagine.",Obi-Wan Kenobi "It’s not impossible. I used to bullseye womp rats in my T-16 back home, they’re not much bigger than 2 meters.",Luke Skywalker "Cover me, Porkins!",Biggs Darklighter "Use the Force, Luke.",Obi-Wan Kenobi "What an incredible smell you've discovered", Han Solo
Opening a file for reading is pretty simple, but I always use the
with command to do it. Doing it this way automatically closes the file. Without it you have to call
_file.close() yourself or you risk loosing data or corrupting the file. Python is forgiving but I just don't like taking the risk. Here is the first part.
import csv with open('./quotes.csv', 'r') as _filehandler:
With the filehandler opened, pass it to the csv reader. At this point you have two ways of reading rows. You can use
csv.reader to read the rows as simple lists. This method:
- reads every row as a unique object
- has no notion of column headers. The first line of the file is just another row of data
- gives you indexed access to each row
The second option, and the one I personally use the most, is
csv.DictReader. This reader:
- reads headers from first row
- returns each row as a dictionary where the keys come from first row
If you read it using
csv.reader then the example becomes:
import csv with open('./quotes.csv', 'r') as _filehandler: csv_file_reader = csv.reader(_filehandler)
If you read it using
csv.DictReader then we get:
import csv with open('./quotes.csv', 'r') as _filehandler: csv_file_reader = csv.DictReader(_filehandler)
There isn't a huge difference there but you get to decide. I personally prefer using dictionaries that remembering the proper index for the column I am looking for.
The final step is to iterate over each row, doing something with it. The pattern
for row in csv_file_reader: is very common. This creates an iterator 
Our example then becomes:
import csv with open('./example.csv', 'r') as _filehandler: csv_file_reader = csv.DictReader(_filehandler) for row in csv_file_reader: # Do something here print(row['quote']) print(row['speaker'])
row in this script is a dictionary. We can access the quote with
row['quote'] and the person who is speaking with
Et voila! There you go! I suspect that this pattern will become the basis of many scripting projects in the future.
I recorded a quick video. In it, I show how to create the file in PyCharm and then run it.
Iterators are pretty awesome when you understand what is happening. They allow for very efficient memory usage with large files, remote data, and cases where you don't know how much information will be processed. In the case of the example given here, it causes the reading to only read the file one row at a time. At no point is the entire CSV file loaded into memory. ↩︎