We can calculate all of these statistics with Python. We will use the Python package numpy. We will use numpy more later for manipulating arrays, but for now we will just use a few of its functions for statistical calculations: mean, median, percentile, std, and var.
import numpy as np
Let's initialize the variable data to have the list of ages.
data = [15, 16, 18, 19, 22, 24, 29, 30, 34]
Now we can use the numpy functions. For the mean, median, standard deviation, and variance functions, we just pass in the data list. For the percentile function, we pass in the data list and the percentile (as a number between 0 and 100).
Make sure you have downloaded Anaconda Navigator from https://www.anaconda.com/. After installation, select Jupyter Lab. The screen for Jupyter Lab appears as pictured below.
# Age array
data = [15, 16, 18, 18, 19, 22, 24, 29, 30, 34]
# Import the numpy library
import numpy as np
print("mean:", np.mean(data))
print("median:", np.median(data))
print("50th percentile (median):", np.percentile(data, 50))
print("25th Percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))
print("Standard Deviation:", np.std(data))
print("Variance:", np.var(data))
NumPy is a Python library that allows fast and easy mathematical operations to be performed on arrays.
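As a rough cross-check of what the variance and standard deviation functions compute, the sketch below works them out by hand for the same ages list and compares the results with numpy. It relies on the fact that np.var and np.std divide by the number of values by default (the population formulas); passing ddof=1 switches to the sample formulas. The variable names are only illustrative.

```python
import numpy as np

# Same ages list as in the code above
data = [15, 16, 18, 18, 19, 22, 24, 29, 30, 34]

# Mean: sum of the values divided by how many there are
mean = sum(data) / len(data)

# Population variance: average squared distance from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance
std = variance ** 0.5

# Compare the hand calculation with numpy's defaults (ddof=0)
print("variance matches:", np.isclose(variance, np.var(data)))
print("std matches:", np.isclose(std, np.std(data)))

# Sample standard deviation divides by (n - 1) instead of n
print("sample std:", np.std(data, ddof=1))
```

The sample version is always a little larger than the population version for the same data, because it divides by a smaller number.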
Reading Data with Pandas
What is Pandas?
This course is in Python, one of the most commonly used languages for Machine Learning.
One of the reasons it is so popular is that there are numerous helpful Python modules for working with data. The first one we will introduce is called Pandas.
Pandas is a Python module that helps us read and manipulate data. What's cool about pandas is that you can take in data and view it as a table that's human readable, but it can also be interpreted numerically so that you can do lots of computations with it.
We call the table of data a DataFrame.
Python will satisfy all of our Machine Learning needs. We'll use the Pandas module for data manipulation.
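To make the idea of a DataFrame concrete, here is a minimal sketch that builds one by hand from a dictionary of columns. The three rows and the variable name passengers are made up for illustration and are not read from the Titanic file.

```python
import pandas as pd

# A tiny, hand-made table: each key is a column name,
# each list holds that column's values (illustrative data only)
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Viewed as a human-readable table
print(passengers)

# Interpreted numerically: compute on a single column
print("average age:", passengers["Age"].mean())
```

The same object serves both purposes: printing it gives a table, and selecting a column gives numbers we can compute with.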
Reading in Your Data
We need to start by importing Pandas. It's standard practice to nickname it pd so that it's faster to type later on.
import pandas as pd
We'll be working with a dataset of Titanic passengers. For each passenger, we'll have some data on them as well as whether or not they survived the sinking.
Our data is stored as a CSV (Comma Separated Values) file. The beginning of the Titanic.csv file is shown below. The first line is the header, and each subsequent line is the data for a single passenger.
Survived, Pclass, Sex, Age, Siblings/Spouses, Parents/Children, Fare
0, 3, male, 22.0, 1, 0, 7.25
1, 1, female, 38.0, 1, 0, 71.2833
1, 3, female, 26.0, 0, 0, 7.925
1, 1, female, 35.0, 1, 0, 53.1
We're going to pull the data into pandas so we can view it as a DataFrame.
The read_csv function takes a file in CSV format and converts it into a Pandas DataFrame.
df = pd.read_csv("Titanic.csv")
The object df is now our Pandas DataFrame holding the Titanic dataset. Now we can use the head method to look at the first rows of the data.
print(df.head())
Run this code to see the results:
import pandas as pd
df = pd.read_csv("Titanic.csv")
print(df.head())
Data is often stored in CSV (Comma Separated Values) files, which we can easily read with pandas' read_csv function. The head method returns the first 5 rows.
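If you do not have Titanic.csv on hand, the following sketch reads the four sample rows shown earlier from a string instead of a file. It assumes the spacing after each comma is exactly as printed in this article, which is why skipinitialspace=True is passed; the real file may not need it. The names csv_text and sample are only illustrative.

```python
import io
import pandas as pd

# The sample rows from the text, held in a string so no file is needed
csv_text = """Survived, Pclass, Sex, Age, Siblings/Spouses, Parents/Children, Fare
0, 3, male, 22.0, 1, 0, 7.25
1, 1, female, 38.0, 1, 0, 71.2833
1, 3, female, 26.0, 0, 0, 7.925
1, 1, female, 35.0, 1, 0, 53.1
"""

# io.StringIO makes the string behave like an open file.
# skipinitialspace=True drops the space that follows each comma.
sample = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# head accepts the number of rows to return (the default is 5)
print(sample.head(2))

# shape gives (number of rows, number of columns)
print(sample.shape)
```

Passing a number to head is handy when 5 rows is more, or less, than you want to look at.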
Summarize the Data
Usually our data is much too big for us to display all of it. Looking at the first few rows is the first step to understanding our data, but then we want to look at some summary statistics.
In pandas, we can use the describe method. It returns a table of statistics about the columns.
print(df.describe())
We add a line in the code below to force Python to display all 6 columns. Without that line, the output is abbreviated.
import pandas as pd
pd.options.display.max_columns = 6
df = pd.read_csv("Titanic.csv")
print(df.describe())
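The table that describe returns is itself a DataFrame, so individual statistics can be picked out of it. The sketch below shows this on a small hand-built frame that reuses a few of the sample values from this article (the name sample is illustrative, and the real dataset will give different numbers). It also uses pd.option_context as one possible alternative to setting the display option globally.

```python
import pandas as pd

# Small stand-in for the Titanic data (four sample rows only)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# describe summarizes only the numeric columns by default,
# so the text column "Sex" is left out
summary = sample.describe()
print(list(summary.columns))

# Rows are statistics, columns are the original column names
print("mean age:", summary.loc["mean", "Age"])

# Temporarily lift the column limit just for this print
with pd.option_context("display.max_columns", None):
    print(summary)
```

Using option_context keeps the display setting from affecting the rest of the notebook, which can be convenient when different tables need different widths.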