Cris Crawford

Final project part 3

I'm ready to start putting my files from the temporary directory back into Google Cloud Storage, with only the data I want in them, in Parquet format.

First I had to define the columns, because the data is stored without column headers. There's another file provided by the Secretary of State that tells me what the columns actually are. I decided to keep nine of the columns for the final project and drop all of the columns having to do with the voter's name and address. I kept the election date, the election type (primary, general, etc.), the voter ID, the town and the ZIP code, the party affiliation, the voter status (active or inactive), and the ward and precinct.

# Column layout from the file provided by the Secretary of State;
# the tmp* columns are the ones I drop later
column_names = ['election_date', 'election_type', 'voter_id', \
                'tmp1', 'tmp2', 'tmp3', 'tmp4', 'tmp5', 'tmp6', 'tmp7', 'tmp8', \
                'zip', 'city', 'party_affiliation', \
                'tmp9', 'tmp10', \
                'ward', 'precinct', 'voter_status', \
                'tmp11', 'tmp12', 'tmp13', 'tmp14', 'tmp15', 'tmp16']

# Read every column as a string ('object')
schema = {
  'election_date': 'object',
  'election_type': 'object',
  'voter_id': 'object',
  'tmp1': 'object',
  'tmp2': 'object',
  'tmp3': 'object',
  'tmp4': 'object',
  'tmp5': 'object',
  'tmp6': 'object',
  'tmp7': 'object',
  'tmp8': 'object',
  'zip': 'object',
  'city': 'object',
  'party_affiliation': 'object',
  'tmp9': 'object',
  'tmp10': 'object',
  'ward': 'object',
  'precinct': 'object',
  'voter_status': 'object',
  'tmp11': 'object',
  'tmp12': 'object',
  'tmp13': 'object',
  'tmp14': 'object',
  'tmp15': 'object',
  'tmp16': 'object'
}

Here's the code I used to read one text file:

import pandas as pd

# Read one pipe-delimited text file, using the column names and string dtypes defined above
df = pd.read_csv('extracted_files/001_voter_act.txt', delimiter='|', names=column_names, dtype=schema)
# Convert election_date from a string to a plain date
df['election_date'] = pd.to_datetime(df['election_date'])
df['election_date'] = df['election_date'].dt.date

I ran some tests to check that I had read the file correctly, for example printing the number of rows and finding the date of the earliest election (a quick sketch of those checks appears after the loop below). Then I wrote a loop to read the other files, copy only the columns I wanted into Parquet, and write each file to Google Cloud Storage. I passed in a start and end file number so I could do a batch at a time:

from google.cloud import storage

client = storage.Client()

bucket = client.get_bucket('cris-voter-data')

selected_columns = ['election_date', 'election_type', 'voter_id', 'zip', 'city', 'party_affiliation', 'ward', 'precinct', 'voter_status']

def text_to_parquet(start, end):
    for n in range(start, end):
        filename = f"{n:03d}_voter_act.txt"
        df = pd.read_csv('extracted_files/' + filename, delimiter='|', names=column_names, dtype=schema)
        # Keep only the nine columns I want
        new_df = df[selected_columns]
        # Write a local Parquet file, then upload it to the bucket under output/
        new_df.to_parquet(filename[:-4] + '.parquet', engine='pyarrow')
        blob = bucket.blob('output/' + filename[:-4] + '.parquet')
        blob.upload_from_filename(filename[:-4] + '.parquet')
        print(filename, len(new_df))
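
The quick checks I mentioned above were along these lines (a sketch from memory, not the exact code I ran):

print(len(df))                        # number of rows read from the file
print(df['election_date'].min())      # date of the earliest election
print(df.head())                      # eyeball the first few rows against the column layout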

Then I ran text_to_parquet with a range, for example text_to_parquet(1, 50). Unfortunately, I discovered that some of the data was bad. I got the error message UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 10395: invalid continuation byte. I asked ChatGPT how to solve this, but nothing worked. I had 16 files with this error, out of 351. But I needed all the data, especially because some of the affected files were significant, for example the files representing the voters of Boston and Cambridge.
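
One way to at least locate the offending lines is to read a file in binary and try to decode each line. This is a hypothetical sketch, not part of the workflow I actually used, which is described below.

# Hypothetical helper: report which lines of a file are not valid UTF-8
def find_bad_lines(path):
    with open(path, 'rb') as f:
        for i, raw in enumerate(f, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as e:
                print(f"line {i}: {e}")

find_bad_lines('extracted_files/001_voter_act.txt')  # the file name here is just an example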

What I eventually did was set up a file transfer system with Cyberduck, transferred the files to my computer, opened them in Excel as text imports, saved them as CSV files, and transferred them back. This worked for 14 of the 16 files. The last two were too big for Excel to read, and the instructions I had for reading them into Excel didn't work. ChatGPT suggested several editors that I could use, so I downloaded the text editor Sublime Text, which worked very well. I could read the files and cut and paste sections, then read the sections into Excel and save them as .csv files. In only one of the files did I actually find the lines with bad data. I cut out the records for that voter, because one voter out of 1.4 million wouldn't make a difference, but one city out of 351 cities and towns would.

For these files, I wrote "csv_to_parquet" as follows:

def csv_to_parquet(filename):
    # Same steps as text_to_parquet, but for a single repaired file
    df = pd.read_csv('extracted_files/' + filename, delimiter='|', names=column_names, dtype=schema)
    new_df = df[selected_columns]
    new_df.to_parquet(filename[:-4] + '.parquet', engine='pyarrow')
    blob = bucket.blob('output/' + filename[:-4] + '.parquet')
    blob.upload_from_filename(filename[:-4] + '.parquet')
    print(filename, len(new_df))
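
I called this once per repaired file, for example csv_to_parquet('123_voter_act.csv') (the file number here is just an illustration).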

This task was labor-intensive and took about an hour to complete. Once I finished, I looked at the files in my Google Cloud Storage bucket. I was missing one, so I went back, downloaded it, turned it into .csv, uploaded it, and processed it. Then I shut down for the night.
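
For reference, checking which files actually made it into the bucket is just a listing with the same client as above (a sketch; the Cloud Console browser works just as well):

# List the Parquet files under the output/ prefix in the bucket
for blob in client.list_blobs('cris-voter-data', prefix='output/'):
    print(blob.name)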
