Karen Ngala
Python 101: Introduction to Python for Data Science

A big dilemma many techies face when picking up a new skill is "what language or tool should I use, and why?" This dilemma of choice is popularly known as "analysis paralysis" or "choice overload": the options available can feel so overwhelming that they lead to indecision and a feeling of being stuck. I've been there.

Going into data science, you have the option of learning many languages: Python, R, Java, and Julia, just to name a few. The choice you make should be individual to you, based on your specific goals, background, and preferences, not on peer influence. So, why Python?

  1. It has a simple and intuitive syntax.
  2. Python has developed a deep ecosystem around Data Science. It has a large and active community of volunteers who create and contribute to a wealth of data science libraries such as matplotlib, sklearn, pandas, and numpy.
  3. Python can be applied widely beyond Data Science, in areas such as web development.

Setting up a Python environment

Before jumping into the deep-end, you need to set up your computer in a way that allows you to write and run code. First, check that you have python installed using the following command:

python --version

If you have Python installed, the output will be the version you have, e.g.:

Python 3.8.5

If you do not, you will get an error. You can download the latest Python version from the official Python website.

A good place to start for beginners is using Anaconda as the environment for your Data Science workflow. Package conflicts in a Python environment can be a nightmare to deal with. Anaconda helps you navigate this and houses required tools, such as Jupyter Notebook. You can later move on to using virtual environments.

TOOLS:

Jupyter Notebook
is an open-source web application that allows data scientists, like yourself, to create and share documents containing live code and visualizations. Each notebook contains executable cells and text descriptions, which makes it easy for others to read, run, and understand the code from start to end.

Google Colab
Also known as Colaboratory, this is a Jupyter notebook environment that runs entirely in the cloud and requires no setup. It allows users to load notebooks from public GitHub repositories as well as save them back to GitHub. A copy of each notebook is saved to your Google Drive.

Python Basics

Learning the language entails first understanding the syntax and rules of Python as a programming language. I will summarize some of the fundamentals of working with Python; absolute beginners would benefit from seeking out further resources and materials.

1. Variables & Data types

A variable is a named reference to a value that can be changed during program execution. Assigning a value to a variable is done using the assignment operator (=).
A data type describes the kind of value assigned to a variable. Python supports the following basic data types:

  • integer (a whole number with no decimal part)
  • string (alphanumeric text)
  • float (a number with a decimal part)
  • boolean (a value that can only be True or False)
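As a quick sketch (the variable names and values here are illustrative):

```python
# Assign values of different types using the assignment operator (=)
age = 30           # integer
name = "Alice"     # string
height = 1.68      # float
is_student = True  # boolean

# type() reveals the data type of a value
print(type(age))   # <class 'int'>
print(type(name))  # <class 'str'>
```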

Data Structures:

  • list - an ordered, changeable collection of values. Syntax-wise, it uses square brackets: my_list = [1, 2, 3, 4]
  • tuple - similar to a list, but its values cannot be changed once created. Syntax-wise, it uses parentheses: my_tuple = (1, 2, 3, 4)
  • dictionary - a changeable collection of key-value pairs. Syntax-wise, it uses curly braces: my_dict = {'name': 'John', 'age': 30}
  • set - an unordered collection of unique values. Example: my_set = {1, 2, 3, 4}. Values in a set never repeat.
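A short sketch of how these structures behave (the values are made up):

```python
my_list = [1, 2, 3, 4]
my_list.append(5)      # lists are mutable
print(my_list)         # [1, 2, 3, 4, 5]

my_tuple = (1, 2, 3, 4)
# my_tuple.append(5)   # would raise AttributeError: tuples are immutable

my_dict = {'name': 'John', 'age': 30}
my_dict['age'] = 31    # update a value by its key
print(my_dict['age'])  # 31

my_set = {1, 2, 2, 3}
print(my_set)          # {1, 2, 3} - duplicates are dropped
```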

2. Operators

The symbols used for mathematical and logical operations are pretty straightforward in Python: + for addition, - for subtraction, * for multiplication, and / for division; == for checking value equality, != for not equal, < for less than, and > for greater than.
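A few of these operators in action:

```python
# Arithmetic operators
print(7 + 3)   # 10
print(7 - 3)   # 4
print(7 * 3)   # 21
print(7 / 3)   # 2.333... (division always returns a float)

# Comparison operators evaluate to booleans
print(7 == 3)  # False
print(7 != 3)  # True
print(7 < 3)   # False
print(7 > 3)   # True
```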

3. Logic & Process Flow

The first thing to note here is indentation. Python follows a strict indentation rule when it comes to blocks of code. While other languages use markers such as curly braces, Python relies on indentation level when executing code.
Conditions
They are used to execute a block of code based on whether a certain condition is true or false. For example, if... else is a conditional statement that executes the first block of statements if the condition is true and the statements after else if the condition is false. For multiple conditions, the if... elif statement can be used.
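For example (the temperature thresholds are made up):

```python
temperature = 25

if temperature > 30:
    print("It's hot outside")
elif temperature > 20:
    print("It's warm outside")   # this branch runs for temperature = 25
else:
    print("It's cold outside")
```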
Loops
They are used to repeat a certain block of code multiple times. Python has the for loop and the while loop. for loops iterate over a sequence, while while loops repeat a block of code for as long as a condition remains true.
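For example:

```python
# for loop: iterate over a sequence
for fruit in ["apple", "banana", "cherry"]:
    print(fruit)

# while loop: repeat as long as the condition holds
count = 0
while count < 3:
    print("count is", count)
    count += 1
```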
Functions
They are used to group together a set of instructions that can be called multiple times elsewhere in a program. Functions are defined using the def keyword, followed by the function name and the input parameters. They can also return a value or simply perform an action.
For example:

def greet_user(name):
    if name == "Alice":
        print("Hello, Alice!")
    else:
        print("Hello, stranger!")

Classes and objects
Python is an object-oriented programming language. This is a programming paradigm that organizes code into reusable and modular components.
A class is a blueprint for creating objects that share the same attributes and behaviours.
Objects are instances of a class that are created using the class constructor. They can have attributes, which are variables that store data, and methods, which are functions that can be called on the object.
In the following example,

Class: Rectangle
Object: my_rectangle

class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width

my_rectangle = Rectangle(4, 5)
print(my_rectangle.area())  # Output: 20

Understanding OOP will be important when interacting with the libraries used in data science.

4. File Handling

This is an important part of data science. Reading from and writing to files is a common task in data science and data analysis.
Reading a File
The open() function is used to open a file (file.txt in this case) in 'r' mode. This mode specifies that the file should be opened in read-only mode. The read() method reads the contents of file.txt into the contents variable.

The with keyword is used to ensure that the file is closed once it is read.

with open('file.txt', 'r') as f:
    contents = f.read()

Writing to a File
The 'w' denotes write mode, while the write() method is used to write "Hello, world!" to the file.

with open('file.txt', 'w') as f:
    f.write('Hello, world!')

Other modes include 'a', which specifies that the file should be opened in append mode. This allows additional text to be written to the end of the file without overwriting its existing contents.
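For example, writing to a file and then appending to it (file.txt is a throwaway example name, as above):

```python
# Create the file with an initial line ('w' overwrites any existing content)
with open('file.txt', 'w') as f:
    f.write('Hello, world!\n')

# Append a second line without touching the first ('a' mode)
with open('file.txt', 'a') as f:
    f.write('Appended line\n')

# Read it back to confirm both lines are present
with open('file.txt', 'r') as f:
    print(f.read())
```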

Loading and manipulating data in Python

Data Science often requires working with large amounts of data. Therefore, you need to load the data. There are several ways to load data in Data Science with the most common method being the Pandas library.

Pandas

It is an open-source data analysis and manipulation library for Python. It offers fast and flexible data structures for working with structured and time series data.

Install the pandas library by running the following command in your terminal or command prompt:

pip install pandas


Pandas offers two primary data structures: Series and DataFrame. A Series is a one-dimensional labelled array.
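A small sketch of a Series (the labels and values are made up):

```python
import pandas as pd

# A Series is a 1D labelled array; here we supply explicit index labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Values can be accessed by label
print(s['b'])  # 20
```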

A DataFrame is a 2D table-like data structure in Pandas. It is similar to a spreadsheet or SQL table in that it consists of rows and columns. You access data in a DataFrame by its row and column labels. Rows are labelled with an index, and the columns are labelled with column names. You can then load data into a pandas DataFrame as follows:

import pandas as pd

# Replace 'data.csv' with the name of your file
df = pd.read_csv("data.csv")

There are many methods that you can apply to manipulate your data using Pandas. Pandas offers an array of data manipulation tools such as filtering, grouping, merging, reshaping, pivoting data, as well as time series analysis. It is worth reading further on these. Below are a few examples:

# Print the first few rows of the DataFrame
print(df.head())

# Output:
      name  age gender
0    Alice   25      F
1      Bob   30      M
2  Charlie   35      M

# Filter the DataFrame to only include rows where the 'age' column is greater than 30
filtered_df = df[df['age'] > 30]

# Group the filtered DataFrame by 'gender' and compute the mean of the 'salary'
# column for each group (assuming the dataset also has a 'salary' column)
grouped_df = filtered_df.groupby('gender')['salary'].mean()

Numpy

NumPy is also a data analysis and manipulation library. However, it differs from pandas in that NumPy arrays hold homogeneous data (all elements share one type), while pandas structures can hold heterogeneous data. Read about homogeneous vs heterogeneous data types.
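A small sketch of the difference (the values are illustrative):

```python
import numpy as np
import pandas as pd

# NumPy coerces mixed inputs to a single (homogeneous) type
arr = np.array([1, 2.5, 3])
print(arr.dtype)  # float64 - the integers were upcast to floats

# A pandas DataFrame can hold a different type per column (heterogeneous)
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
print(df.dtypes)  # 'name' is object (strings), 'age' is an integer type
```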

Install the numpy library by running the following command in your terminal or command prompt:

pip install numpy


Numpy is the foundation for many other scientific computing and data science libraries in Python, such as Pandas.

Numpy is a great library for statistical and mathematical operations. For example, generating mean, median and standard deviation:

import numpy as np

# Create a dataset
data = [1, 2, 3, 4, 5]

# Calculate the mean, median, and standard deviation
mean = np.mean(data)
median = np.median(data)
std = np.std(data)

print("Mean:", mean)
print("Median:", median)
print("Standard deviation:", std)


Data Visualizations using Matplotlib

Data visualization is a critical part of data science. It allows you to understand and communicate the insights derived from your data. Matplotlib provides a wide range of tools for creating different types of charts and plots, including line charts, bar charts, histograms, scatter plots, and more. It also offers customization through styles, shapes, and colors.

Install matplotlib by running the following command in your terminal or command prompt:

pip install matplotlib


To demonstrate the different capabilities of Matplotlib, let's create a simple line plot.

# Import the library
import matplotlib.pyplot as plt

# Some random data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Plot the data to create a line chart
plt.plot(x, y)

# Add labels and title
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Line Plot')

# Display the chart
plt.show()

To represent the relationship between the variables, you can create a scatter plot instead. The only difference from the code above is in the plotting call (and the title, of course). Replace plt.plot(x, y) with:

plt.scatter(x, y)

NumPy could be used in the above example to generate sample data:

import numpy as np

x = np.linspace(0, 10, 100) # This generates 100 data points for the x-axis 
y = np.sin(x) # This calculates the corresponding y-axis values 

For a bar graph, on the other hand, you need labels and their corresponding values:

import matplotlib.pyplot as plt

# Data to be used
labels = ['A', 'B', 'C', 'D', 'E']
values = [5, 3, 7, 2, 8]

# Create a bar chart
plt.bar(labels, values)

# Add labels and title
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart')

# Display the chart
plt.show()

There are many other charts that you can create using matplotlib such as histograms, scatter plots, pie charts, and more. It is worth exploring the matplotlib documentation to familiarize yourself with the different charts.

Optimisation with SciPy

SciPy is a scientific computing library built on top of NumPy. It provides additional functionality for optimization, integration, interpolation, linear algebra, and more.

The example below uses SciPy to perform a simple optimization problem:

from scipy.optimize import minimize_scalar

# Define the objective function (a quadratic function)
def objective(x):
    return x**2 + 3*x + 4

# Find the minimum of the objective function 
result = minimize_scalar(objective)

# Print the minimum value and the corresponding value of x
print("Minimum value:", result.fun)
print("Value of x at minimum:", result.x)

The minimize_scalar() function is an optimization algorithm that finds the minimum of a scalar function. This code prints the minimum value of the function (result.fun) and the value of x at which that minimum occurs (result.x).

This concept can be applied to more complex optimization problems, including those with multiple variables and constraints. SciPy is a powerful and versatile library with many scientific and engineering applications.

Statistical analysis in Python

This involves interpreting, analyzing, and presenting the collected data. There are several libraries that support statistical analysis in Python. These libraries can perform various statistical analysis tasks such as:

  • Hypothesis testing — testing claims about the population based on a sample of data. This can be done using libraries such as SciPy
  • Regression analysis — modelling the relationship between two or more variables. For example, Statsmodels can be used to perform a linear regression on a dataset.
  • Descriptive statistics — simple and quick summaries of a dataset. NumPy is used for summaries such as calculating the mean, median, and standard deviation of a dataset.
  • Time series analysis — modelling and forecasting time-dependent data. This can be done using libraries such as Statsmodels.
  • Predictive Modelling — libraries such as Scikit-learn provide a range of machine learning algorithms, including linear and logistic regression, decision trees, random forests, support vector machines, and neural networks.
  • Probability distribution — modelling the uncertainty in a dataset using common probability distributions such as normal distribution, binomial distribution, and Poisson distribution. This can be done using SciPy.
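As an illustration of hypothesis testing, here is a two-sample t-test with SciPy (the sample values are made up):

```python
from scipy import stats

# Two small, made-up samples
group_a = [2.1, 2.5, 2.8, 3.0, 2.4]
group_b = [3.1, 3.5, 3.2, 3.8, 3.4]

# Test whether the two groups have the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# A small p-value (commonly < 0.05) suggests the means genuinely differ
if p_value < 0.05:
    print("The difference in means is statistically significant")
```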

Further reading:

Python Statistics Fundamentals: How to Describe Your Data by Mirko Stojiljković

An Introduction to Statistical Analysis and Modelling with Python by Roberto

Conclusion

In this article, we have covered some of the key features and concepts of Python, including data types, operators, control flow, functions, and file reading/writing. We have also introduced some of the most commonly used libraries in Python for data analysis, such as NumPy, Pandas, Matplotlib, and SciPy.

Python is a powerful language for Data Science. Its readability and its popularity within the data science community make it easy for beginners to dive into Data Science. There are numerous resources available for learning and development.

As an aspiring data scientist, learning Python is only the beginning of building your skillset. This article is a great starting point for beginners looking to learn Python and its applications in data analysis. Keep practising and exploring the wonderful world of Data Science. The possibilities are endless.
