Dennis O'Keeffe

Originally published at blog.dennisokeeffe.com

Generating Fake CSV Data With Python

I write content for AWS, Kubernetes, Python, JavaScript and more. To view all the latest content, be sure to visit my blog and subscribe to my newsletter. Follow me on Twitter.

This is Day 24 of the #100DaysOfPython challenge.

This post will use the Faker library to generate fake data and export it to a CSV file.

We will be emulating one of the free datasets from Kaggle, the Netflix original films IMDB score dataset, to generate something similar.

The final code can be found here.

Prerequisites

  1. Familiarity with Pipenv. See here for my post on Pipenv.
  2. Familiarity with JupyterLab. See here for my post on JupyterLab.

Getting started

Let's create the generating-fake-csv-data-with-python directory and install Faker.

# Make the `generating-fake-csv-data-with-python` directory
$ mkdir generating-fake-csv-data-with-python
$ cd generating-fake-csv-data-with-python
# Create a folder to hold the notebook
$ mkdir docs

# Init the virtual environment
$ pipenv --three
$ pipenv install faker
$ pipenv install --dev jupyterlab

At this stage, we have the packages that we need.

Now we can start up the notebook server.

# Startup the notebook server
$ pipenv run jupyter-lab
# ... Server is now running on http://localhost:8888/lab

The server will now be up and running.

Creating the notebook

Once on http://localhost:8888/lab, select to create a new Python 3 notebook from the launcher.

Ensure that this notebook is saved in generating-fake-csv-data-with-python/docs/generating-fake-data.ipynb.

The notebook will handle two parts of this mini project:

  1. Importing Faker and generating data.
  2. Importing the CSV module and exporting the data to a CSV file.

Before generating our data, we need to look at what we are trying to emulate.

Emulating The Netflix Original Movies IMDB Scores Dataset

Looking at the preview for our dataset, we can see that it contains the following columns and example rows:

Title | Genre | Premiere | Runtime | IMDB Score | Language
Enter the Anime | Documentary | August 5, 2019 | 58 | 2.5 | English/Japanese
Dark Forces | Thriller | August 21, 2020 | 81 | 2.6 | Spanish

We only have two example rows, but from here we can make a few assumptions about how we want to emulate the dataset.

  1. In our languages, we will stick to a single language (unlike the example English/Japanese).
  2. We will assume IMDB scores fall between 1 and 5. We won't be too harsh on any movie, so scores start from 1 rather than 0.
  3. Runtimes should emulate a real movie, so we can set them to be between 50 and 150 minutes.
  4. Genres may be something we need to write our own Faker provider for.
  5. We are going to be okay with nonsense data, so we can just use a string generator for the names.
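
The assumptions listed above can also be written down as a small check. This is only a sketch: the constant names and the `is_plausible_row` helper are hypothetical and are not part of the original notebook, and the row layout is assumed to match the column order of the dataset preview.

```python
# Hypothetical constants capturing the assumptions listed above
MIN_RUNTIME, MAX_RUNTIME = 50, 150  # minutes
MIN_SCORE, MAX_SCORE = 1.0, 5.0     # score range assumed in this post


def is_plausible_row(row):
    """Return True if a [title, genre, premiere, runtime, score, language] row fits the assumptions."""
    title, genre, premiere, runtime, score, language = row
    return (
        MIN_RUNTIME <= runtime <= MAX_RUNTIME
        and MIN_SCORE <= score <= MAX_SCORE
        and "/" not in language  # single language only
    )


# The two preview rows: the second fails only because it lists two languages
print(is_plausible_row(["Dark Forces", "Thriller", "August 21, 2020", 81, 2.6, "Spanish"]))  # True
print(is_plausible_row(["Enter the Anime", "Documentary", "August 5, 2019", 58, 2.5, "English/Japanese"]))  # False
```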

With this said, let's look at how we can fake this.

Emulating a value for each column

We will create seven cells - one to import Faker and one for each column.

For the first cell, we will import Faker.

from faker import Faker

fake = Faker()

Second, we will fake a movie name with words:

def capitalize(word):
    # Upper-case the first letter and lower-case the rest
    return word.capitalize()
# fake.words() returns a list of random words (three by default)
words = fake.words()
capitalized_words = list(map(capitalize, words))
movie_name = ' '.join(capitalized_words)
print(movie_name) # Serve Fear Consider
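
The same title-casing step can be sketched without the separate helper and `map`. The `build_title` name is hypothetical, and a fixed word list stands in for `fake.words()` so the result is predictable.

```python
def build_title(words):
    """Join words into a title, capitalizing each one."""
    return " ".join(word.capitalize() for word in words)


# Stand-in for fake.words(); a fixed list keeps the result predictable
sample_words = ["serve", "fear", "consider"]
print(build_title(sample_words))  # Serve Fear Consider
```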

Third, we will generate a date from this decade and use the same format as the example:

from datetime import datetime

date = datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")
print(date) # April 30, 2020
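
One small difference from the Kaggle preview: `%d` pads the day with a leading zero ("April 05, 2020"), while the preview shows "August 5, 2019". If you want to match that exactly, one portable option is to format the day number yourself. The sketch below uses a hypothetical `format_premiere` helper and fixed dates in place of Faker's output, and it assumes the default English month names.

```python
from datetime import date


def format_premiere(d):
    """Format a date like 'August 5, 2019' (no leading zero on the day)."""
    return f"{d:%B} {d.day}, {d.year}"


print(format_premiere(date(2019, 8, 5)))   # August 5, 2019
print(format_premiere(date(2020, 8, 21)))  # August 21, 2020
```

The helper also accepts the `datetime` objects returned by `fake.date_time_this_decade()`, since they support the same formatting.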

Fourth, we will create our own fake data generator for the genre:

# creating a provider for genre
from faker.providers import BaseProvider
import random

# create new provider class
class GenreProvider(BaseProvider):
    def movie_genre(self):
        return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])

# then add new provider to faker instance
fake.add_provider(GenreProvider)

# now you can use:
movie_genre = fake.movie_genre()
print(movie_genre) # Horror

Fifth, we will do the same for a language:

# creating a provider for language
from faker.providers import BaseProvider
import random

# create new provider class
class LanguageProvider(BaseProvider):
    def language(self):
        return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])

# then add new provider to faker instance
fake.add_provider(LanguageProvider)

# now you can use:
language = fake.language()
print(language) # Spanish

Sixth, we need to generate a runtime:

# Getting random movie length (randrange excludes the stop value, so this yields 50 to 149)
movie_len = random.randrange(50, 150)
print(movie_len) # 143
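
If 150 itself should be a possible runtime, `random.randint` includes both endpoints, whereas `random.randrange(50, 150)` stops at 149. A minimal sketch, with `get_runtime` as a hypothetical helper name:

```python
import random


def get_runtime(rng=random):
    """Runtime in minutes, with both 50 and 150 allowed."""
    return rng.randint(50, 150)


runtimes = [get_runtime() for _ in range(1000)]
print(min(runtimes) >= 50 and max(runtimes) <= 150)  # True
```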

Lastly, we need a rating with one decimal place between 1.0 and 5.0:

# Movie rating
random_rating = round(random.uniform(1.0, 5.0), 1)
print(random_rating) # 2.2
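
Because the ratings come from Python's `random` module, they change on every run. If you want repeatable values, for example in a test, one option is to draw them from a seeded `random.Random` instance. This is a sketch rather than part of the original notebook: `make_rating` is a hypothetical helper, the seed value is arbitrary, and it only covers values drawn from the generator you pass in.

```python
import random


def make_rating(rng):
    """Rating with one decimal place between 1.0 and 5.0."""
    return round(rng.uniform(1.0, 5.0), 1)


# Two generators with the same (arbitrary) seed produce the same sequence
first = random.Random(24)
second = random.Random(24)
ratings_a = [make_rating(first) for _ in range(5)]
ratings_b = [make_rating(second) for _ in range(5)]
print(ratings_a == ratings_b)  # True
```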

Now that we have all our information together, it is time to generate a CSV with 100 entries.

Generating the CSV

We can place everything we know into a final cell to generate some data:

from datetime import datetime
from faker import Faker
from faker.providers import BaseProvider
import random
import csv

class GenreProvider(BaseProvider):
    def movie_genre(self):
        return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])

class LanguageProvider(BaseProvider):
    def language(self):
        return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])

fake = Faker()

fake.add_provider(GenreProvider)
fake.add_provider(LanguageProvider)

# Some of this is a bit verbose now, but doing so for the sake of completion

def capitalize(word):
    return word.capitalize()

def get_movie_name():
    words = fake.words()
    capitalized_words = list(map(capitalize, words))
    return ' '.join(capitalized_words)

def get_movie_date():
    return datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")

def get_movie_len():
    return random.randrange(50, 150)

def get_movie_rating():
    return round(random.uniform(1.0, 5.0), 1)

def generate_movie():
    return [get_movie_name(), fake.movie_genre(), get_movie_date(), get_movie_len(), get_movie_rating(), fake.language()]

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('movie_data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Genre', 'Premiere', 'Runtime', 'IMDB Score', 'Language'])
    # range(100) writes exactly 100 movie rows (range(1, 100) would write only 99)
    for _ in range(100):
        writer.writerow(generate_movie())

Running the cell will output the CSV file movie_data.csv in the notebook's working directory. The values are random, so yours will differ, but it will look something like this:

Title,Genre,Premiere,Runtime,IMDB Score,Language
Discuss According Model,Horror,"February 09, 2020",107,2.6,Japanese
People Conference Be,Comedy,"April 25, 2020",84,1.8,Chinese
Forget Great Kind,Drama,"May 22, 2021",128,3.3,Chinese
Trial Employee Cover,Drama,"February 24, 2020",90,3.6,Spanish
Choose System We,Drama,"June 29, 2020",102,3.3,Spanish
Range Laugh Reach,Comedy,"August 09, 2021",92,3.9,Spanish
Increase Fire Popular,Romance,"May 03, 2020",107,4.1,Japanese
Show Job Believe,Thriller,"March 13, 2021",62,1.6,English
Or Power Century,Comedy,"February 29, 2020",146,2.3,Spanish
Ago Ability Within,Drama,"July 23, 2020",120,4.8,Italian
Foreign Always Sing,Mystery,"May 16, 2021",112,1.9,English
Once Movie Artist,Documentary,"February 09, 2020",79,4.1,Hindi
Near Explain Process,Action,"July 17, 2021",134,2.0,Spanish
Big Information Grow,Romance,"February 25, 2020",64,4.4,Spanish
Wind Project Heavy,Drama,"February 20, 2021",128,4.8,English
Child Form Theory,Mystery,"January 12, 2021",91,3.0,Spanish
Bring Sport Present,Drama,"March 02, 2021",87,2.7,Hindi
Themselves That Activity,Action,"August 20, 2020",148,3.0,Spanish
City Threat Almost,Thriller,"February 16, 2020",107,3.9,Spanish
See Main Student,Drama,"January 17, 2020",125,1.4,Chinese
Population Impact Season,Action,"March 19, 2020",109,2.3,Italian
Manager Thank Truth,Documentary,"February 12, 2021",124,4.1,Hindi
Child South Believe,Thriller,"April 18, 2020",65,3.9,Italian
Present Main Themselves,Romance,"September 08, 2020",89,3.8,Hindi
Maintain Order Old,Drama,"December 14, 2020",110,1.8,Hindi
Difficult Town Hair,Documentary,"October 12, 2020",51,4.9,Japanese
Page Hold Discussion,Drama,"November 01, 2020",139,1.9,Chinese
Style True Car,Comedy,"July 03, 2021",84,5.0,Japanese
Care Item Sing,Comedy,"November 16, 2020",100,4.9,Japanese
Do Car Organization,Romance,"February 28, 2021",129,1.1,Japanese
Learn Service Figure,Documentary,"March 04, 2020",50,2.0,Italian
Forget Situation Fact,Comedy,"January 22, 2020",52,3.9,English
Order International Report,Documentary,"December 17, 2020",101,2.2,Chinese
Another Black Teach,Mystery,"December 08, 2020",96,4.2,Italian
Professor Watch Throughout,Action,"September 15, 2020",111,4.0,English
Which Quickly Son,Documentary,"July 02, 2021",98,2.4,Chinese
Change East Article,Comedy,"March 28, 2020",61,2.4,English
Partner Individual Local,Romance,"May 07, 2020",149,5.0,English
Instead Watch Particular,Horror,"May 04, 2020",115,2.3,Hindi
Democratic Someone Available,Romance,"July 26, 2021",98,1.4,Italian
Place Would Mind,Drama,"May 09, 2021",141,2.4,Italian
Likely Economy Weight,Mystery,"February 03, 2021",106,3.1,Hindi
Could Certain More,Drama,"January 31, 2021",137,4.9,Hindi
Source Operation Sure,Action,"March 03, 2020",81,3.3,Hindi
Really Share Treat,Documentary,"August 05, 2020",99,2.2,English
Edge When Data,Drama,"July 27, 2020",115,1.6,Italian
Huge Imagine Federal,Romance,"August 08, 2021",141,3.0,Chinese
Tend Often Collection,Documentary,"June 25, 2020",73,3.2,Chinese
Wait Major Move,Action,"June 17, 2021",120,2.5,Spanish
Firm Reason With,Thriller,"July 16, 2021",67,2.6,Spanish
Significant Fall Travel,Romance,"March 14, 2021",123,2.0,Hindi
Send Size Eye,Comedy,"June 18, 2021",74,3.5,Spanish
Describe Hospital She,Drama,"March 14, 2021",90,1.4,Spanish
Give Drive Better,Mystery,"March 15, 2020",106,1.2,Spanish
Their Measure Choose,Action,"April 28, 2021",86,2.8,Italian
Resource Sell Agent,Thriller,"February 08, 2020",50,3.1,Hindi
Next Plan Soon,Action,"May 16, 2021",93,3.7,Hindi
Land Allow Simply,Mystery,"May 23, 2021",144,1.0,Hindi
Friend Total Few,Mystery,"June 12, 2021",93,4.1,Italian
Role Might Bad,Drama,"December 08, 2020",100,3.5,Japanese
Opportunity Public Certainly,Horror,"August 07, 2020",76,2.0,Italian
Else Play Politics,Drama,"August 01, 2021",145,2.5,Italian
Staff Main West,Documentary,"May 09, 2021",76,2.5,Japanese
Ready Treat Everything,Drama,"July 24, 2021",121,1.6,Hindi
Ahead Yourself Crime,Horror,"February 09, 2021",80,4.9,Italian
Next These Night,Comedy,"February 20, 2020",65,3.4,Hindi
Line Else Along,Comedy,"February 05, 2020",83,1.8,Hindi
Degree Continue Green,Documentary,"March 10, 2020",73,3.8,Hindi
Marriage Until Cover,Thriller,"November 26, 2020",147,4.8,English
Republican Way Mission,Drama,"April 04, 2021",57,2.9,Chinese
Prepare Rich Street,Romance,"February 26, 2021",94,2.6,Japanese
Term Five On,Horror,"September 06, 2020",62,2.7,English
Sister Manage Relate,Documentary,"August 17, 2020",76,4.4,Hindi
Scientist Beat Wonder,Horror,"June 23, 2021",137,1.5,Chinese
Fast Staff If,Romance,"February 05, 2021",148,2.7,Hindi
Ready Campaign Field,Comedy,"October 25, 2020",147,2.7,Chinese
Worker State Every,Mystery,"May 17, 2021",104,1.7,English
Bar Wind Story,Action,"January 28, 2021",108,3.2,Hindi
At Total Half,Thriller,"December 03, 2020",79,4.4,Spanish
One Something Focus,Thriller,"June 29, 2020",59,1.2,Japanese
Play We Impact,Comedy,"March 19, 2020",88,1.3,Hindi
Message After Again,Comedy,"May 28, 2021",75,4.1,Chinese
Such Something Information,Comedy,"June 01, 2021",145,2.2,Spanish
Power Organization Myself,Action,"January 29, 2021",119,1.4,Hindi
Apply Boy Success,Documentary,"August 06, 2020",93,1.4,Italian
Evening Production Bar,Romance,"April 13, 2020",102,2.5,Chinese
Work For Form,Drama,"September 19, 2020",80,4.4,Hindi
Occur Billion Cover,Documentary,"December 03, 2020",56,3.7,Chinese
Budget Wall Tv,Horror,"January 02, 2021",135,1.0,English
Share Beyond Loss,Action,"January 23, 2021",55,1.5,Italian
Professional Source Make,Horror,"December 08, 2020",107,4.1,Japanese
To Protect Improve,Mystery,"July 30, 2020",100,3.6,Japanese
Democratic Hundred Appear,Horror,"August 18, 2020",84,4.3,Hindi
Face Central Summer,Documentary,"November 25, 2020",63,1.8,Spanish
Involve Clearly At,Documentary,"November 25, 2020",56,1.5,Italian
Fall Term Drug,Horror,"April 05, 2020",52,2.2,Chinese
Fly Language Where,Romance,"May 18, 2021",102,4.4,Chinese
Service Local Door,Drama,"August 04, 2020",63,1.9,Italian
Son Avoid Himself,Drama,"July 30, 2020",53,1.8,Hindi
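
Note that the Premiere values are wrapped in double quotes because they contain a comma. The `csv` module adds those quotes when writing and removes them again when reading. As a rough sketch of working with the file afterwards, the snippet below reads three rows copied from the output above (standing in for `movie_data.csv`) and averages the score per genre. The variable names are illustrative only.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the output above, standing in for movie_data.csv
sample = '''Title,Genre,Premiere,Runtime,IMDB Score,Language
Discuss According Model,Horror,"February 09, 2020",107,2.6,Japanese
People Conference Be,Comedy,"April 25, 2020",84,1.8,Chinese
Range Laugh Reach,Comedy,"August 09, 2021",92,3.9,Spanish
'''

rows = list(csv.DictReader(io.StringIO(sample)))

# The quoted date comes back as a single field, comma included
print(rows[0]["Premiere"])  # February 09, 2020

scores_by_genre = defaultdict(list)
for row in rows:
    # Every CSV value is read as a string, so convert the score explicitly
    scores_by_genre[row["Genre"]].append(float(row["IMDB Score"]))

averages = {genre: round(sum(s) / len(s), 2) for genre, s in scores_by_genre.items()}
print(averages)
```

To run this against the real file, replace `io.StringIO(sample)` with a file object opened via `open('movie_data.csv', newline='')`.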

Success!

Summary

Today's post demonstrated how to use the Faker package to generate fake data and the CSV library to export that data to file.

In future, we may use this approach to make our own data sets to work with and do some data science around.

Kaggle and Open Data are great resources for data and data visualization when you are not generating your own data.

This "100 Days in Python" series will move towards data science and machine learning from here on out.

Resources and further reading

  1. The ABCs of Pipenv
  2. Hello, JupyterLab
  3. Pipenv
  4. Open Data
  5. Faker
  6. Kaggle
  7. Netflix original films IMDB score
  8. Final code

Photo credit: pawel_czerwinski

