Filtering a CSV file is one of the first steps when trying to interpolate data for data science. This article will show how to use Leopards to filter a CSV file.
We will use this csv file to do our filtrations.
Install leopards
pip install leopards
Open csv file
We will use the build-in Python module csv
to load CSV
import csv
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
Example: Count How many of each gender is in the set
Let's see how many males/females are in the dataset
from leopards import Count
res = Count(rows,['Sex'])
print(res)
output:
dict_values([{'Sex': 'Male', 'count': 47}, {'Sex': 'Female', 'count': 53}])
Lets run the same code again.
res = Count(rows,['Sex'])
print(res)
output:
dict_values([])
Explanation
Why did it return result in the first run but not the second run?
DictReader is an iterator so with the first run, we were on the first item, so leopards was able to iterate over it but with the second run, the iterator was already consumed so it is like an empty list. so to solve the issue either load the rows in memory like below rows=[r for r in rows]
or recreate the iterator by rows = csv.DictReader(open('people-100.csv'), delimiter=",")
Example: Find people whose email is '@example.com'
from leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
res = Q(rows,Email__icontains='example.com')
for item in res:
print(item)
output:
{'Index': '2', 'User Id': 'f90cD3E76f1A9b9', 'First Name': 'Phillip', 'Last Name': 'Summers', 'Sex': 'Female', 'Email': 'bethany14@example.com', 'Phone': '214.112.6044x4913', 'Date of birth': '1910-03-24', 'Job Title': 'Phytotherapist'}
{'Index': '3', 'User Id': 'DbeAb8CcdfeFC2c', 'First Name': 'Kristine', 'Last Name': 'Travis', 'Sex': 'Male', 'Email': 'bthompson@example.com', 'Phone': '277.609.7938', 'Date of birth': '1992-07-02', 'Job Title': 'Homeopath'}
{'Index': '4', 'User Id': 'A31Bee3c201ef58', 'First Name': 'Yesenia', 'Last Name': 'Martinez', 'Sex': 'Male', 'Email': 'kaitlinkaiser@example.com', 'Phone': '584.094.6111', 'Date of birth': '2017-08-03', 'Job Title': 'Market researcher'}
23 more records
Example: from the people with Email '@example.com' Return how many are 'Male'
from leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
res = Q(rows,Email__icontains='example.com',Sex='Male')
print(len(list(res)))
output:
10
The above code is eqivalent to this
from leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs={"Email__icontains":'example.com' ,"Sex":'Male'}
res = Q(rows,**kwargs)
print(len(list(res)))
Example: from the people with email '@example.com' Return how many are 'Male' and not born in 1992.
from leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs={"Email__icontains":'example.com' ,"Sex":'Male', "Date of birth__nstartswith":"1992"}
res = Q(rows,**kwargs)
print(len(list(res)))
output:
9
Note: Index '3' is removed.
This can be written as follows using NOT
:
from leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs={"Email__icontains":'example.com' ,"Sex":'Male', "NOT": {"Date of birth__startswith":"1992"}}
res = Q(rows,**kwargs)
print(len(list(res)))
Example: Find the Males who emails are '@example.com' OR '@example.net'
from leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs = kwargs = {"Sex": "Male", "OR":[{"Email__icontains":'example.com'},{"Email__icontains":"example.net"}]}
res = Q(rows,**kwargs)
print(list(res)[:3])
output:
[{'Index': '1', 'User Id': '88F7B33d2bcf9f5', 'First Name': 'Shelby', 'Last Name': 'Terrell', 'Sex': 'Male', 'Email': 'elijah57@example.net', 'Phone': '001-084-906-7849x73518', 'Date of birth': '1945-10-26', 'Job Title': 'Games developer'},
{'Index': '3', 'User Id': 'DbeAb8CcdfeFC2c', 'First Name': 'Kristine', 'Last Name': 'Travis', 'Sex': 'Male', 'Email': 'bthompson@example.com', 'Phone': '277.609.7938', 'Date of birth': '1992-07-02', 'Job Title': 'Homeopath'},
{'Index': '4', 'User Id': 'A31Bee3c201ef58', 'First Name': 'Yesenia', 'Last Name': 'Martinez', 'Sex': 'Male', 'Email': 'kaitlinkaiser@example.com', 'Phone': '584.094.6111', 'Date of birth': '2017-08-03', 'Job Title': 'Market researcher'}]
Conculsion
This tutorial showed how easy it is to filter CSV and do some aggregations quickly using Leopards.
If you have any question, you can ask me.
Oldest comments (0)