A big dilemma many techies face when picking up a new skill is: "what language or tool should I use, and why?" This dilemma of choice is popularly known as "analysis paralysis" or "choice overload": the options available can feel so overwhelming that they lead to indecision and a feeling of being stuck. I've been there.
Going into data science, you have the option of learning many languages, ranging from Python, R, and Java to Julia, just to name a few. The choice you make should be individual to you: your specific goals, background, and preferences, not peer influence. So, why Python?
- It has a simple and intuitive syntax.
- Python has developed a deep ecosystem around Data Science. It has a large and active community of volunteers that create and contribute to the wealth of data science libraries such as matplotlib, sklearn, pandas, and numpy.
- Python can be applied widely beyond Data Science, in areas such as web development.
Setting up a Python environment
Before jumping into the deep end, you need to set up your computer in a way that allows you to write and run code. First, check whether you have Python installed using the following command:
python --version
If you have Python installed, the output should be the version you have, e.g.:
Python 3.8.5
If you do not, you will get an error. You can download the latest Python version from the official Python website.
A good place to start for beginners is using Anaconda as the environment for your Data Science workflow. Package conflicts in a Python environment can be a nightmare to deal with. Anaconda helps you navigate this and houses required tools, such as Jupyter Notebook. You can later move on to using virtual environments.
TOOLS:
Jupyter Notebook
An open-source web application that allows data scientists, like yourself, to create and share documents containing live code and visualizations. Each notebook contains executable cells and text descriptions, which makes it easy for others to interact with and understand the code from start to end.
Google Colab
Also known as Colaboratory, this is a Jupyter notebook environment that runs entirely in the cloud and requires no setup. It allows users to load notebooks from public GitHub repos as well as save to GitHub. A copy of each notebook is saved to your Google Drive.
Python Basics
Learning the language starts with understanding the syntax and rules of Python as a programming language. I will summarize some of the fundamentals of working with Python. For absolute beginners, it would be beneficial to seek out further resources and materials. The following are great places to start:
- Getting Started with Python on Programiz
- Python For Beginners on python.org
- How to Use Python: Your First Steps by Leodanis Pozo Ramos on Real Python
1. Variables & Data types
A variable is a named reference to a value that can be changed during program execution. Assigning a value to a variable is done using the assignment operator (=).
A data type is the nature of the value assigned to a variable. Python supports the following data types:
- integer (a whole number with no decimal part)
- string (alphanumeric text)
- float (a number with a decimal part)
- boolean (the value can only be True or False)
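A quick sketch of these types in action (the variable names below are just for illustration):

```python
# Assigning values of each basic type to variables
count = 10             # integer
name = "Alice"         # string
price = 19.99          # float
is_active = True       # boolean

# type() reveals the data type of a value
print(type(count))     # <class 'int'>
print(type(name))      # <class 'str'>
print(type(price))     # <class 'float'>
print(type(is_active)) # <class 'bool'>

# Variables can be reassigned, even to a value of a different type
count = "ten"
print(type(count))     # <class 'str'>
```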
Data Structures:
- list - a collection of values that is ordered and changeable. Syntax-wise, it uses square brackets: my_list = [1, 2, 3, 4]
- tuple - similar to a list, but its values cannot be changed once created. Syntax-wise, it uses parentheses: my_tuple = (1, 2, 3, 4)
- dictionary - a collection of key-value pairs that is unordered and changeable. Syntax-wise, it uses curly braces: my_dict = {'name': 'John', 'age': 30}
- set - an unordered collection of unique values, e.g. my_set = {1, 2, 3, 4}. Values in a set never repeat.
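A short sketch of how these structures behave differently, reusing the examples above:

```python
# Lists are mutable: items can be changed in place
my_list = [1, 2, 3, 4]
my_list[0] = 99
print(my_list)          # [99, 2, 3, 4]

# Tuples are immutable: assigning to an item raises a TypeError
my_tuple = (1, 2, 3, 4)
try:
    my_tuple[0] = 99
except TypeError:
    print("tuples cannot be modified")

# Dictionaries are read and updated by key
my_dict = {'name': 'John', 'age': 30}
my_dict['age'] = 31
print(my_dict['age'])   # 31

# Sets silently drop duplicate values
my_set = {1, 2, 2, 3, 3, 3}
print(my_set)           # {1, 2, 3}
```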
2. Operators
The symbols used for mathematical and logical operations are pretty straightforward in Python: + for addition, - for subtraction, * for multiplication, and / for division; == for checking value equality, != for not equal, < for less than, and > for greater than.
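These operators in action:

```python
# Arithmetic operators
print(7 + 3)    # 10
print(7 - 3)    # 4
print(7 * 3)    # 21
print(7 / 3)    # 2.333... (division always returns a float)

# Comparison operators evaluate to booleans
print(7 == 3)   # False
print(7 != 3)   # True
print(7 < 3)    # False
print(7 > 3)    # True
```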
3. Logic & Process Flow
The first thing to note here is indentation. Python follows a strict indentation rule for blocks of code. While other languages use markers such as curly braces, Python relies on indentation level when executing code.
Conditions
They are used to execute a block of code based on whether a certain condition is true or false. For example, if... else is a conditional statement that executes the first block of statements if the condition is true and the statements after else if the condition is false. For multiple conditions, the if... elif statement can be used.
Loops
They are used to repeat a certain block of code multiple times until a specific condition is met. Python has the for loop and the while loop. For loops are used to iterate over a sequence, while while loops repeat a block of code until a specific condition is met.
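The pieces above fit together in a short sketch (the age thresholds here are arbitrary):

```python
# if / elif / else chooses one branch based on conditions
age = 20
if age < 13:
    group = "child"
elif age < 20:
    group = "teenager"
else:
    group = "adult"
print(group)   # adult

# A for loop iterates over a sequence
for n in [1, 2, 3]:
    print(n)

# A while loop repeats until its condition becomes false
count = 0
while count < 3:
    count += 1
print(count)   # 3
```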
Functions
They are used to group together a set of instructions that can be called multiple times elsewhere in a program. Functions are defined using the def keyword, followed by the function name and the input parameters. They can also return a value or simply perform an action.
For example:
def greet_user(name):
    if name == "Alice":
        print("Hello, Alice!")
    else:
        print("Hello, stranger!")
Classes and objects
Python is an object-oriented programming language. This is a programming paradigm that organizes code into reusable and modular components.
A class is a blueprint for creating objects that share the same attributes and behaviours.
Objects are instances of a class that are created using the class constructor. They can have attributes, which are variables that store data, and methods, which are functions that can be called on the object.
In the following example, Rectangle is the class and my_rectangle is the object:
class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width

my_rectangle = Rectangle(4, 5)
print(my_rectangle.area())  # Output: 20
Understanding OOP will be important when interacting with the libraries used in data science.
4. File Handling
This is an important part of data science: reading from and writing to files is a common task in data analysis.
Reading a File
The open() function is used to open a file (file.txt in this case) in 'r' mode. This mode specifies that the file should be opened read-only. The read() method reads the contents of file.txt into the contents variable. The with keyword ensures that the file is closed once it has been read.
with open('file.txt', 'r') as f:
    contents = f.read()
Writing to a File
The 'w' denotes write mode, while the write() method is used to write "Hello, world!" to the file.
with open('file.txt', 'w') as f:
    f.write('Hello, world!')
Other modes include 'a', which specifies that the file should be opened in append mode. This allows additional text to be written to the end of the file.
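A brief sketch of append mode, continuing the file.txt example above:

```python
# 'a' opens the file for appending; new text is added at the end
# (the file is created if it does not already exist)
with open('file.txt', 'a') as f:
    f.write('\nAnother line.')

# Reading the file back shows the original and the appended text
with open('file.txt', 'r') as f:
    print(f.read())
```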
Loading and manipulating data in Python
Data Science often requires working with large amounts of data, which first needs to be loaded. There are several ways to load data in Data Science, with the most common method being the Pandas library.
Pandas
It is an open-source data analysis and manipulation library for Python. It offers fast and flexible data structures for working with structured and time series data.
Install the pandas library by running the following command in your terminal or command prompt:
pip install pandas
Pandas offers two primary data structures: Series and DataFrame. A Series is a one-dimensional labelled array.
A DataFrame is a 2D table-like data structure in Pandas. It is similar to a spreadsheet or SQL table in that it consists of rows and columns. You access data in a DataFrame by its row and column labels. Rows are labelled with an index, and the columns are labelled with column names. You can then load data into a pandas DataFrame as follows:
import pandas as pd
# Replace 'data.csv' with the name of your file
df = pd.read_csv("data.csv")
There are many methods that you can apply to manipulate your data using Pandas. Pandas offers an array of data manipulation tools such as filtering, grouping, merging, reshaping, pivoting data, as well as time series analysis. It is worth reading further on these. Below are a few examples:
# Print the first few rows of the DataFrame
print(df.head())
# Output:
name age gender
0 Alice 25 F
1 Bob 30 M
2 Charlie 35 M
# Filter the DataFrame to only include rows where the 'age' column is greater than 30
filtered_df = df[df['age'] > 30]
# Group the DataFrame by the 'gender' column and compute the mean of the 'salary' column for each group
grouped_df = filtered_df.groupby('gender')['salary'].mean()
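To give a taste of one more of the tools mentioned above, merging, here is a minimal sketch using two small made-up DataFrames that share a 'name' column:

```python
import pandas as pd

# Two small, made-up DataFrames sharing a 'name' column
people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
salaries = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [50000, 60000]})

# merge() joins the two tables on their common column
merged = pd.merge(people, salaries, on='name')

# The result has one row per name, with columns from both tables
print(merged)
```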
Numpy
NumPy is also a data analysis and manipulation library. However, it differs from Pandas in that NumPy supports homogeneous data types while Pandas supports heterogeneous data types. Read about Homogeneous vs Heterogeneous data types.
Install the numpy library by running the following command in your terminal or command prompt:
pip install numpy
Numpy is the foundation for many other scientific computing and data science libraries in Python, such as Pandas.
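At the core of NumPy is the ndarray, a homogeneous array that supports fast element-wise operations without explicit loops; a minimal sketch:

```python
import numpy as np

# Create an array: every element shares a single data type
a = np.array([1, 2, 3, 4])
print(a.dtype)   # e.g. int64 (the exact type is platform dependent)

# Arithmetic applies element-wise, with no explicit loop
print(a * 2)     # [2 4 6 8]
print(a + a)     # [2 4 6 8]
print(a.sum())   # 10
```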
Numpy is a great library for statistical and mathematical operations. For example, generating mean, median and standard deviation:
import numpy as np
# Create a dataset
data = [1, 2, 3, 4, 5]
# Calculate the mean, median, and standard deviation
mean = np.mean(data)
median = np.median(data)
std = np.std(data)
print("Mean:", mean)
print("Median:", median)
print("Standard deviation:", std)
Data Visualizations using Matplotlib
Data visualization is a critical part of data science. It allows you to understand and communicate the insights derived from your data. Matplotlib provides a wide range of tools for creating different types of charts and plots, including line charts, bar charts, histograms, scatter plots, and more. It also offers customization through styles, shapes, and colors.
Install matplotlib by running the following command in your terminal or command prompt:
pip install matplotlib
To demonstrate the different capabilities of Matplotlib, let's create a simple line plot.
# Import the library
import matplotlib.pyplot as plt
# Some random data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Plot the data to create a line chart
plt.plot(x, y)
# Add labels and title
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Line Plot')
# Display the chart
plt.show()
To represent the relationship between the variables, you can create a scatter plot. The only difference in the above code will be in the plotting call (and the title, of course). Replace plt.plot(x, y) with:
plt.scatter(x, y)
NumPy could also be used in the above example to generate the data.
import numpy as np
x = np.linspace(0, 10, 100) # This generates 100 data points for the x-axis
y = np.sin(x) # This calculates the corresponding y-axis values
For a bar graph, on the other hand, you need labels and their corresponding values:
import matplotlib.pyplot as plt
# Data to be used
labels = ['A', 'B', 'C', 'D', 'E']
values = [5, 3, 7, 2, 8]
# Create a bar chart
plt.bar(labels, values)
# Add labels and title
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart')
# Display the chart
plt.show()
There are many other charts that you can create using matplotlib such as histograms, scatter plots, pie charts, and more. It is worth exploring the matplotlib documentation to familiarize yourself with the different charts.
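For instance, a histogram follows the same pattern; a minimal sketch using randomly generated data:

```python
import matplotlib.pyplot as plt
import numpy as np

# 1000 random values drawn from a normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)

# hist() groups the values into bins and plots the count per bin
plt.hist(data, bins=30)

# Add labels and title
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')

# Display the chart
plt.show()
```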
Optimisation with SciPy
SciPy is a scientific computing library built on top of NumPy. It provides additional functionality for optimization, integration, interpolation, linear algebra, and more.
The example below uses SciPy to perform a simple optimization problem:
from scipy.optimize import minimize_scalar
# Define the objective function (a quadratic function)
def objective(x):
    return x**2 + 3*x + 4
# Find the minimum of the objective function
result = minimize_scalar(objective)
# Print the minimum value and the corresponding value of x
print("Minimum value:", result.fun)
print("Value of x at minimum:", result.x)
The minimize_scalar() function is an optimization algorithm used to find the minimum of the function. This code finds the minimum value of the function (result.fun) and the value of x at which the function is at its minimum (result.x).
This concept can be applied to more complex optimization problems, including those with multiple variables and constraints. SciPy is a powerful and versatile library with many scientific and engineering applications.
Statistical analysis in Python
This involves interpreting, analyzing, and presenting the collected data. There are several libraries that support statistical analysis in Python. These libraries can perform various statistical analysis tasks such as:
- Hypothesis testing — testing claims about the population based on a sample of data. This can be done using libraries such as SciPy
- Regression analysis — modelling the relationship between two or more variables. For example, Statsmodels can be used to perform a linear regression on a dataset.
- Descriptive statistics — a simple and quick summary of a dataset. NumPy is used for summaries such as calculating the mean, median, and standard deviation of a dataset.
- Time series analysis — modelling and forecasting time-dependent data. This can be done using libraries such as Statsmodels.
- Predictive Modelling — libraries such as Scikit-learn provide a range of machine learning algorithms, including linear and logistic regression, decision trees, random forests, support vector machines, and neural networks.
- Probability distribution — modelling the uncertainty in a dataset using common probability distributions such as normal distribution, binomial distribution, and Poisson distribution. This can be done using SciPy.
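As a small taste of the first item, a hypothesis test with SciPy (the two samples below are made up for illustration):

```python
from scipy import stats

# Two made-up samples, e.g. test scores from two groups
group_a = [85, 90, 78, 92, 88, 76, 95, 89]
group_b = [72, 75, 80, 68, 74, 71, 77, 73]

# An independent two-sample t-test: are the group means different?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t-statistic:", t_stat)
print("p-value:", p_value)

# A small p-value (commonly < 0.05) suggests the difference in
# means is unlikely to be due to chance alone
if p_value < 0.05:
    print("The difference between the groups is statistically significant")
```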
Further reading:
- Python Statistics Fundamentals: How to Describe Your Data by Mirko Stojiljković
- An Introduction to Statistical Analysis and Modelling with Python by Roberto
Conclusion
In this article, we have covered some of the key features and concepts of Python, including data types, operators, control flow, functions, and file reading/writing. We have also introduced some of the most commonly used libraries in Python for data analysis, such as NumPy, Pandas, Matplotlib, and SciPy.
Python is a powerful language for Data Science. Its readability and its popularity within the data science community makes it easy for beginners to dive into Data Science. There are numerous resources available for learning and development.
As an aspiring data scientist, learning Python is only the beginning of building your skillset. This article is a great starting point for beginners looking to learn Python and its applications in data analysis. Keep practising and exploring the wonderful world of Data Science. The possibilities are endless.