Kiran U Kamath

Posted on Apr 16, 2021 • Originally published at blog.learnwithdata.me

Peep into the basics of Numpy and Pandas

#machinelearning

This blog was written in a Jupyter notebook, so you can experiment and learn by editing the notebook.

Click here for notebook.

Just change the input and check the output.

Learning by experiment and hands-on exercises is always better.

The purpose of this notebook is simply to revise Python basics.

Let's get started.

1. NUMPY BASICS

NumPy is a linear algebra library for Python, built around fast multidimensional arrays.

NumPy brings the best of two worlds:

the computational efficiency of C/Fortran,
the easy syntax of Python.
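As a small illustration of that point, here is a minimal sketch that squares a handful of numbers twice: once with a plain Python loop and once with a single NumPy expression, where the loop runs inside NumPy's compiled code. The variable names are made up for this example and are not part of the notebook.

```python
import numpy as np

values = [1, 2, 3, 4, 5]            # plain Python list

# Pure-Python approach: visit the elements one by one
squares_loop = [v * v for v in values]

# NumPy approach: one vectorized expression over the whole array
arr = np.array(values)
squares_vec = arr * arr

print(squares_loop)   # [1, 4, 9, 16, 25]
print(squares_vec)    # [ 1  4  9 16 25]
```

On arrays this small the difference is not noticeable; the benefit usually shows up with thousands or millions of elements.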

import numpy as np 

# Let's define a Python list (we will turn it into a one-dimensional array next)
my_list = [10, 20, 30, 40, 50, 60, 70, 80]
my_list

[10, 20, 30, 40, 50, 60, 70, 80]

Let's create a numpy array from the list "my_list"

x = np.array(my_list)
x

array([10, 20, 30, 40, 50, 60, 70, 80])

Get the shape of the array

x.shape

(8,)

Let's create a multi-dimensional numpy array from a nested list


matrix = np.array([[5, 8], [9, 13]])
matrix

array([[ 5,  8],
       [ 9, 13]])
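To see how the 2 x 2 matrix differs from the one-dimensional array above, you can inspect a few of its attributes. This is a small sketch that rebuilds the same matrix so it can run on its own.

```python
import numpy as np

matrix = np.array([[5, 8], [9, 13]])

print(matrix.shape)  # (2, 2) -> 2 rows, 2 columns
print(matrix.ndim)   # 2      -> number of dimensions
print(matrix.size)   # 4      -> total number of elements
```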

# "rand()" draws samples from a uniform distribution over [0, 1)
xy = np.random.rand(7)
xy

array([0.40408966, 0.12527144, 0.04465052, 0.39450693, 0.93339664,
       0.14009694, 0.94461679])

You can also create a matrix of random numbers with random.rand by passing the number of rows and columns


xy = np.random.rand(2, 2)
xy

array([[0.86152202, 0.22526627],
       [0.41562272, 0.33467273]])

# "randn()" draws samples from the standard normal distribution (mean 0, standard deviation 1)
xy = np.random.randn(7)
xy

array([-1.27678101,  1.20667812,  0.7945132 ,  0.62421099, -0.44447512,
       -0.57038096,  2.19949273])
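A quick way to convince yourself of the difference between the two distributions is to draw many samples and look at their summary statistics. The sketch below uses NumPy's newer `default_rng` generator with a fixed seed, which is an assumption of this example (the notebook itself uses `np.random.rand` and `np.random.randn`); the seed value and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, for repeatable runs

uniform = rng.random(100_000)           # uniform samples in [0, 1)
normal = rng.standard_normal(100_000)   # standard normal samples

# Uniform samples never leave the [0, 1) interval
print(uniform.min() >= 0.0, uniform.max() < 1.0)

# Normal samples can be negative or larger than 1;
# their mean is close to 0 and their standard deviation close to 1
print(round(normal.mean(), 2), round(normal.std(), 2))
```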

"randint" generates random integers between a lower bound (inclusive) and an upper bound (exclusive)


xy = np.random.randint(1, 10)
xy
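The bounds are easy to get wrong, so here is a minimal sketch that draws many integers at once (using the `size` argument) and checks the range they fall in. The sample size of 1000 is an arbitrary choice for this example.

```python
import numpy as np

# Draw 1000 integers with low=1 (inclusive) and high=10 (exclusive)
samples = np.random.randint(1, 10, size=1000)

print(samples.min() >= 1)   # True: the lower bound can be drawn
print(samples.max() <= 9)   # True: 10 itself is never drawn
```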

Create evenly spaced values from 1 up to (but not including) 50 with a step of 7

xy = np.arange(1, 50, 7)
xy

array([ 1,  8, 15, 22, 29, 36, 43])

# Array of ones
xy = np.ones(7)
xy

array([1., 1., 1., 1., 1., 1., 1.])

# Matrix of ones
xy = np.ones((2, 2))
xy

array([[1., 1.],
       [1., 1.]])

# Array of zeros
xy = np.zeros(5)
xy

array([0., 0., 0., 0., 0.])

Reshape 1D array into a matrix

z = x.reshape(2,4)
print(x)
print(z)

[10 20 30 40 50 60 70 80]
[[10 20 30 40]
 [50 60 70 80]]
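Two details of `reshape` are worth knowing. You can pass `-1` for one dimension and let NumPy work it out, and for a simple array like this one the result is a view that shares its data with the original. The sketch below rebuilds `x` so it runs on its own.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# -1 asks NumPy to infer the number of columns (8 elements / 2 rows = 4)
z = x.reshape(2, -1)
print(z.shape)                   # (2, 4)

# z is a view of x here, so both names refer to the same data
print(np.shares_memory(x, z))    # True
z[0, 0] = 99
print(x[0])                      # 99
```

If you need an independent matrix, call `.copy()` on the reshaped result.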

Obtain the maximum element (value)

x.max()

Obtain the minimum element (value)

x.min()

Obtain the location of the max element

x.argmax()

# Obtain the location of the min element
x.argmin()

# Access specific index from the numpy array
x[0]
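The cells above do not show their output, so here is a self-contained sketch with the values you should expect for the original array.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80])

print(x.max())     # 80 -> largest value
print(x.min())     # 10 -> smallest value
print(x.argmax())  # 7  -> index of the largest value
print(x.argmin())  # 0  -> index of the smallest value
print(x[0])        # 10 -> first element (indexing starts at 0)
print(x[-1])       # 80 -> last element (negative indices count from the end)
```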

# Slice starting at index 0 up to, but NOT including, index 3
x[0:3]

array([10, 20, 30])

# Broadcasting: assign a single value to several elements of a numpy array at once
x[0:2] = 10
x

array([10, 10, 30, 40, 50, 60, 70, 80])
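Note that this assignment changes `x` in place. If you want to keep the original values around, one option is to take a copy first, as in this small sketch (the name `backup` is made up for the example).

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80])
backup = x.copy()      # independent copy, not affected by later changes to x

x[0:2] = 10            # the scalar 10 is broadcast over the two selected slots

print(x)        # [10 10 30 40 50 60 70 80]
print(backup)   # [10 20 30 40 50 60 70 80]
```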

2. PANDAS BASICS

Pandas is a data manipulation and analysis tool that is built on NumPy.

Pandas uses a data structure known as a DataFrame (think of it as Microsoft Excel in Python).

DataFrames let programmers store and manipulate data in a tabular fashion (rows and columns).

Series vs. DataFrame? A Series can be thought of as a single column of a DataFrame.
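The relationship between the two is easy to check in code. In this minimal sketch the column names and values are invented for illustration; selecting one column with single brackets gives a Series, while double brackets keep the result as a DataFrame.

```python
import pandas as pd

# Hypothetical two-column table, just for this example
df = pd.DataFrame({'ticker': ['AAA', 'BBB'], 'price': [10.5, 20.0]})

single_column = df['price']      # single brackets -> Series
sub_table = df[['price']]        # double brackets -> DataFrame with one column

print(type(single_column).__name__)   # Series
print(type(sub_table).__name__)       # DataFrame
```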

import pandas as pd

# Let's define two lists as shown below:
stock_list = ['Reliance', 'AMZN', 'facebook']
stock_list

['Reliance', 'AMZN', 'facebook']

label   = ['stock#1', 'stock#2', 'stock#3']
label

['stock#1', 'stock#2', 'stock#3']

Let's create a one-dimensional Pandas Series

Note that a Series is formed of data and associated labels


x_series = pd.Series(data = stock_list, index = label)

# Let's view the series
x_series

stock#1    Reliance
stock#2        AMZN
stock#3    facebook
dtype: object
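Because every value has a label, you can look items up either by label or by position. A minimal sketch, rebuilding the same Series so it runs on its own:

```python
import pandas as pd

x_series = pd.Series(data=['Reliance', 'AMZN', 'facebook'],
                     index=['stock#1', 'stock#2', 'stock#3'])

print(x_series['stock#2'])       # AMZN     -> lookup by label
print(x_series.loc['stock#1'])   # Reliance -> explicit label-based lookup
print(x_series.iloc[2])          # facebook -> lookup by integer position
```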

Let's obtain the datatype

type(x_series)

pandas.core.series.Series

Let's define a two-dimensional Pandas DataFrame

Note that you can create a Pandas DataFrame from a Python dictionary


bank_client_df = pd.DataFrame({'Bank client ID':[1111, 2222, 3333, 4444], 
                               'Bank Client Name':['Kiran', 'Chaitanya', 'dheeraj', 'shreyas'], 
                               'Net worth [$]':[3500, 29000, 10000, 2000], 
                               'Years with bank':[3, 4, 9, 5]})
bank_client_df

	Bank client ID	Bank Client Name	Net worth [$]	Years with bank
0	1111	Kiran	3500	3
1	2222	Chaitanya	29000	4
2	3333	dheeraj	10000	9
3	4444	shreyas	2000	5

Let's obtain the data type


type(bank_client_df)

pandas.core.frame.DataFrame
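Once the data is in a DataFrame, you can select columns and filter rows with short expressions. The sketch below rebuilds the same table; the 5000 threshold and the name `wealthy` are arbitrary choices for this example.

```python
import pandas as pd

bank_client_df = pd.DataFrame({
    'Bank client ID': [1111, 2222, 3333, 4444],
    'Bank Client Name': ['Kiran', 'Chaitanya', 'dheeraj', 'shreyas'],
    'Net worth [$]': [3500, 29000, 10000, 2000],
    'Years with bank': [3, 4, 9, 5],
})

# Keep only the rows where the net worth is above 5000
wealthy = bank_client_df[bank_client_df['Net worth [$]'] > 5000]
print(wealthy['Bank Client Name'].tolist())      # ['Chaitanya', 'dheeraj']

# Column-wise statistics work on a single column (a Series)
print(bank_client_df['Years with bank'].mean())  # 5.25
```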

You can view just the first few rows using .head()

bank_client_df.head(2)

	Bank client ID	Bank Client Name	Net worth [$]	Years with bank
0	1111	Kiran	3500	3
1	2222	Chaitanya	29000	4

You can view just the last few rows using .tail()

bank_client_df.tail(1)

	Bank client ID	Bank Client Name	Net worth [$]	Years with bank
3	4444	shreyas	2000	5

Pandas can read a CSV file and store its data in a DataFrame

bank_df = pd.read_csv('sample.csv')

Write to a CSV file without the index

bank_df.to_csv('sample_output.csv', index = False)
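If you do not have a `sample.csv` at hand, you can still try both calls with a round trip through a temporary folder. This is a sketch under that assumption; the small table and the file names are made up for the example. It also shows why `index = False` is useful: without it, the row index is written as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Bank client ID': [1111, 2222],
                   'Net worth [$]': [3500, 29000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write without the index, then read the file back
    clean_path = os.path.join(tmp_dir, 'sample_output.csv')
    df.to_csv(clean_path, index=False)
    restored = pd.read_csv(clean_path)

    # Write with the default index to see the extra column it creates
    indexed_path = os.path.join(tmp_dir, 'sample_output_with_index.csv')
    df.to_csv(indexed_path)
    with_index = pd.read_csv(indexed_path)

print(restored.columns.tolist())     # ['Bank client ID', 'Net worth [$]']
print(with_index.columns.tolist())   # ['Unnamed: 0', 'Bank client ID', 'Net worth [$]']
```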

CONCATENATING AND MERGING WITH PANDAS

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df1

	A	B	C	D
0	A0	B0	C0	D0
1	A1	B1	C1	D1
2	A2	B2	C2	D2
3	A3	B3	C3	D3

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df2

	A	B	C	D
4	A4	B4	C4	D4
5	A5	B5	C5	D5
6	A6	B6	C6	D6
7	A7	B7	C7	D7

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

df3

	A	B	C	D
8	A8	B8	C8	D8
9	A9	B9	C9	D9
10	A10	B10	C10	D10
11	A11	B11	C11	D11

pd.concat([df1, df2, df3])
