DEV Community

Mohamed M El-Kalioby
Mohamed M El-Kalioby

Posted on

Work on CSV files with Leopards

Filtering a CSV file is one of the first steps when trying to interpolate data for data science. This article will show how to use Leopards to filter a CSV file.

We will use this csv file to do our filtrations.

Install leopards

pip install leopards
Enter fullscreen mode Exit fullscreen mode

Open csv file

We will use the build-in Python module csv to load CSV

import csv
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
Enter fullscreen mode Exit fullscreen mode

Example: Count How many of each gender is in the set

Let's see how many males/females are in the dataset

from  leopards import Count
res = Count(rows,['Sex'])
print(res)
Enter fullscreen mode Exit fullscreen mode

output:

dict_values([{'Sex': 'Male', 'count': 47}, {'Sex': 'Female', 'count': 53}])
Enter fullscreen mode Exit fullscreen mode

Lets run the same code again.

res = Count(rows,['Sex'])
print(res)
Enter fullscreen mode Exit fullscreen mode

output:

dict_values([])
Enter fullscreen mode Exit fullscreen mode

Explanation

Why did it return result in the first run but not the second run?
DictReader is an iterator so with the first run, we were on the first item, so leopards was able to iterate over it but with the second run, the iterator was already consumed so it is like an empty list. so to solve the issue either load the rows in memory like below rows=[r for r in rows] or recreate the iterator by rows = csv.DictReader(open('people-100.csv'), delimiter=",")

Example: Find people whose email is '@example.com'

from  leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
res = Q(rows,Email__icontains='example.com')
for item in res:
    print(item)
Enter fullscreen mode Exit fullscreen mode

output:

{'Index': '2', 'User Id': 'f90cD3E76f1A9b9', 'First Name': 'Phillip', 'Last Name': 'Summers', 'Sex': 'Female', 'Email': 'bethany14@example.com', 'Phone': '214.112.6044x4913', 'Date of birth': '1910-03-24', 'Job Title': 'Phytotherapist'}
{'Index': '3', 'User Id': 'DbeAb8CcdfeFC2c', 'First Name': 'Kristine', 'Last Name': 'Travis', 'Sex': 'Male', 'Email': 'bthompson@example.com', 'Phone': '277.609.7938', 'Date of birth': '1992-07-02', 'Job Title': 'Homeopath'}
{'Index': '4', 'User Id': 'A31Bee3c201ef58', 'First Name': 'Yesenia', 'Last Name': 'Martinez', 'Sex': 'Male', 'Email': 'kaitlinkaiser@example.com', 'Phone': '584.094.6111', 'Date of birth': '2017-08-03', 'Job Title': 'Market researcher'}
Enter fullscreen mode Exit fullscreen mode

23 more records

Example: from the people with Email '@example.com' Return how many are 'Male'

from  leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
res = Q(rows,Email__icontains='example.com',Sex='Male')
print(len(list(res)))
Enter fullscreen mode Exit fullscreen mode

output:

10
Enter fullscreen mode Exit fullscreen mode

The above code is eqivalent to this

from  leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs={"Email__icontains":'example.com' ,"Sex":'Male'}
res = Q(rows,**kwargs)
print(len(list(res)))
Enter fullscreen mode Exit fullscreen mode

Example: from the people with email '@example.com' Return how many are 'Male' and not born in 1992.

from  leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs={"Email__icontains":'example.com' ,"Sex":'Male', "Date of birth__nstartswith":"1992"}
res = Q(rows,**kwargs)
print(len(list(res)))
Enter fullscreen mode Exit fullscreen mode

output:

9
Enter fullscreen mode Exit fullscreen mode

Note: Index '3' is removed.

This can be written as follows using NOT:

from  leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs={"Email__icontains":'example.com' ,"Sex":'Male', "NOT": {"Date of birth__startswith":"1992"}}
res = Q(rows,**kwargs)
print(len(list(res)))
Enter fullscreen mode Exit fullscreen mode

Example: Find the Males who emails are '@example.com' OR '@example.net'

from  leopards import Q
rows = csv.DictReader(open('people-100.csv'), delimiter=",")
kwargs = kwargs = {"Sex": "Male", "OR":[{"Email__icontains":'example.com'},{"Email__icontains":"example.net"}]}
res = Q(rows,**kwargs)
print(list(res)[:3])
Enter fullscreen mode Exit fullscreen mode

output:

[{'Index': '1', 'User Id': '88F7B33d2bcf9f5', 'First Name': 'Shelby', 'Last Name': 'Terrell', 'Sex': 'Male', 'Email': 'elijah57@example.net', 'Phone': '001-084-906-7849x73518', 'Date of birth': '1945-10-26', 'Job Title': 'Games developer'}, 
{'Index': '3', 'User Id': 'DbeAb8CcdfeFC2c', 'First Name': 'Kristine', 'Last Name': 'Travis', 'Sex': 'Male', 'Email': 'bthompson@example.com', 'Phone': '277.609.7938', 'Date of birth': '1992-07-02', 'Job Title': 'Homeopath'},
{'Index': '4', 'User Id': 'A31Bee3c201ef58', 'First Name': 'Yesenia', 'Last Name': 'Martinez', 'Sex': 'Male', 'Email': 'kaitlinkaiser@example.com', 'Phone': '584.094.6111', 'Date of birth': '2017-08-03', 'Job Title': 'Market researcher'}]
Enter fullscreen mode Exit fullscreen mode

Conculsion

This tutorial showed how easy it is to filter CSV and do some aggregations quickly using Leopards.

If you have any question, you can ask me.

Top comments (0)