Recently I came across a requirement to compare a column in one CSV file with a column in another. These files contain 13 columns and about 65 million rows.
Initially I thought it was a simple task and used basic bash scripting to process the files line by line. Then I realized it took hours to finish. I moved to Node.js, and then to Python; no matter which language I used, the result was the same. It took too long, and the regular line-by-line approach wasn't worth it. Then I turned to the pandas library and tried it. To my surprise, it processed everything within 10 seconds.
The requirement is basically to match the emailId column in file1.csv against the md5 email hash column in file2.csv.
Here's the snippet I used to produce the matched dataset.
import hashlib

import pandas as pd

def md5hash(mail_id):
    # md5 operates on bytes, so encode the string first
    return hashlib.md5(mail_id.encode()).hexdigest()

# Store the data in a dataframe, keeping the email column as a string
df1 = pd.read_csv("file1.csv", dtype={"emailId": "string"}, low_memory=False)

# First get the hash of the emailId and append it as a new column
df1["md5_hashed"] = [md5hash(val) for val in df1["emailId"]]
df1.to_csv("file1_with_hashed_email.csv", index=False)

# file2.csv already carries the hashed emails
df2 = pd.read_csv("file2.csv")
df2_hash = df2["md5_emailId"]

# Rows of file1 whose hash appears in file2
matched = df1.loc[df1["md5_hashed"].isin(df2_hash)]
matched.to_csv("matched.csv", index=False)

# Just add a ~ to filter the unmatched rows
unmatched = df1.loc[~df1["md5_hashed"].isin(df2_hash)]
unmatched.to_csv("non-matched.csv", index=False)
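With 65 million rows, loading both files entirely into memory may not always be possible. As a variant, the same matching can be done with pandas' chunked reader. This is just a minimal sketch under the same column-name assumptions as above (emailId in file1.csv, md5_emailId in file2.csv); the chunk size is an arbitrary choice.

import hashlib

import pandas as pd

# Load the reference hashes once; a set makes membership checks fast
hashes = set(pd.read_csv("file2.csv")["md5_emailId"])

first_chunk = True
for chunk in pd.read_csv("file1.csv", dtype={"emailId": "string"},
                         chunksize=1_000_000):
    # Hash each email in the current chunk
    chunk["md5_hashed"] = [
        hashlib.md5(v.encode()).hexdigest() for v in chunk["emailId"]
    ]
    mask = chunk["md5_hashed"].isin(hashes)
    # Append each chunk's results, writing the header only once
    chunk[mask].to_csv("matched.csv", mode="a",
                       index=False, header=first_chunk)
    chunk[~mask].to_csv("non-matched.csv", mode="a",
                        index=False, header=first_chunk)
    first_chunk = False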
I had worked with pandas before, but this is when I realized its real power: it saved me hours compared to parsing with Node.js or database-based solutions.
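Incidentally, the lookup can also be expressed as a join, which is the pandas idiom closest to the database-style solutions I had considered. A small sketch reusing the df1 and df2 frames from the snippet above:

# An inner join keeps only the rows whose hash exists in file2
matched_via_join = df1.merge(
    df2, left_on="md5_hashed", right_on="md5_emailId", how="inner"
)

# indicator=True adds a _merge column showing where each row came from,
# so the unmatched rows are the ones found only on the left side
both = df1.merge(df2, left_on="md5_hashed", right_on="md5_emailId",
                 how="left", indicator=True)
unmatched_via_join = both[both["_merge"] == "left_only"]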
Hope this helps someone facing a similar scenario.