DEV Community

Purva Masurkar

Day 4 of 100 Days of ML Code: Pandas Library

"Programs must be written for people to read, and only incidentally for machines to execute." - Harold Abelson and Gerald Jay Sussman, Structure and Interpretation of Computer Programs

Introduction to Pandas

Pandas is an open-source Python library that is widely used for data manipulation and analysis. With Pandas, we can:

  • Summarize data.
  • Read and write files in different formats, such as CSV, JSON, Excel, and HTML.
  • Filter and modify data based on multiple conditions.
  • Merge data from multiple files.
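
As a rough illustration of the filtering and merging points above, here is a minimal sketch. The two small tables and their column names are hypothetical and invented for this example; real data would normally come from files.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
people = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina"],
    "age": [25, 32, 41, 19],
    "country_code": ["IN", "US", "CN", "US"],
})
countries = pd.DataFrame({
    "country_code": ["IN", "US", "CN"],
    "country": ["India", "United States", "China"],
})

# Filter on multiple conditions: combine boolean masks with & (and) or | (or),
# wrapping each condition in parentheses.
adults_in_us = people[(people["age"] >= 21) & (people["country_code"] == "US")]

# Modify the rows that match a condition using .loc.
people.loc[people["age"] < 21, "age_group"] = "under 21"
people.loc[people["age"] >= 21, "age_group"] = "21 and over"

# Merge two tables on a shared column.
merged = people.merge(countries, on="country_code", how="left")

print(adults_in_us)
print(merged)
```

The parentheses around each condition matter because & binds more tightly than comparison operators in Python.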

Difference between Attributes and Methods

Attributes are used to represent properties or state of an object, while methods are used to represent behaviors or operations on its data. Attributes are accessed using the dot notation without parentheses, while methods are called using the dot notation with parentheses and optional arguments.
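
To make the distinction concrete, here is a small sketch using a DataFrame. The sample data is hypothetical; shape and head() are real Pandas members.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({"language": ["Python", "Rust", "Go"], "years": [5, 2, 3]})

# Attribute: describes the object's state, accessed without parentheses.
print(df.shape)    # (rows, columns)

# Method: performs an operation, called with parentheses and optional arguments.
print(df.head(2))  # first two rows

# A method is callable; an attribute such as shape is just a value.
print(callable(df.head))   # True
print(callable(df.shape))  # False
```

Writing df.head without parentheses does not raise an error. It simply returns the method object instead of calling it, which is a common source of confusion.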

Importing Pandas

To use the Pandas library in Python, we first need to import it into our code. There are different ways to import Pandas, but the most common one is the statement import pandas as pd.

This statement imports the entire Pandas library, and we can access its functions and classes using the pd namespace.
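
The following short sketch shows the import together with a few names reached through the pd alias. The variable names series and frame are only illustrative.

```python
import pandas as pd

# Everything in the library is reached through the pd alias.
print(pd.__version__)                # the installed Pandas version

series = pd.Series([1, 2, 3])        # pd.Series is a class
frame = pd.DataFrame({"a": [1, 2]})  # pd.DataFrame is a class
print(type(series).__name__, type(frame).__name__)
```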

Reading and Viewing the CSV File

To work with real-world data, I have selected the Stack Overflow Annual Developer Survey file, which is a widely used dataset for data analysis and machine learning. This dataset contains information about the demographics, education, employment, and technology preferences of software developers from different parts of the world. The survey is conducted annually by Stack Overflow, a popular Q&A website for programmers.

To read a CSV file using Pandas, we use the pd.read_csv() function.
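
Here is a minimal sketch of pd.read_csv(). So that it runs without downloading anything, it reads from a small in-memory string instead of the survey file. The column names, values, and the file name in the final comment are hypothetical and are not taken from the real survey.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for a CSV file (hypothetical columns and values).
csv_text = """respondent_id,country,years_coding
1,India,5
2,Germany,12
3,Brazil,
"""

# pd.read_csv() accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# With a file on disk, the call would look like this (hypothetical file name):
# df = pd.read_csv("survey.csv")
```

The empty years_coding value in the last row is read as a missing value (NaN), which is why that column ends up with a floating-point type.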

  • df.head(n): Displays the first n rows of the DataFrame (by default, n=5).
  • df.tail(n): Displays the last n rows of the DataFrame (by default, n=5).
  • df.shape: Returns a tuple containing the number of rows and columns in the DataFrame.
  • df.columns: Returns the column labels of the DataFrame as an Index object (use list(df.columns) for a plain list).
  • df.dtypes: Returns the data type of each column in the DataFrame.
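
The members listed above can be tried on a small DataFrame. This is a sketch with hypothetical data, not the survey itself.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "country": ["India", "Germany", "Brazil", "Japan", "Kenya", "Canada", "Spain"],
    "years_coding": [5, 12, 3, 8, 1, 20, 7],
})

print(df.head())         # first 5 rows (the default)
print(df.tail(2))        # last 2 rows
print(df.shape)          # (7, 2): 7 rows, 2 columns
print(df.columns)        # an Index holding the column labels
print(list(df.columns))  # the same labels as a plain Python list
print(df.dtypes)         # one data type per column
```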

To check for null values in the data, we use df.isnull().sum(). It counts the missing values in each column.
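
A minimal sketch of counting missing values, assuming the usual isnull() followed by sum() pattern. The data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data with some missing entries.
df = pd.DataFrame({
    "country": ["India", None, "Brazil"],
    "years_coding": [5.0, None, None],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
missing_per_column = df.isnull().sum()

# Summing again gives one total for the whole table.
total_missing = missing_per_column.sum()

print(missing_per_column)
print(total_missing)
```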

To get a statistical summary of the data, we use df.describe(). By default, it includes only the numerical columns, not the string columns.
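
Here is a short sketch of a numeric summary with describe(), using hypothetical data. The include="all" argument is a real describe() parameter, shown as an optional extra.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "country": ["India", "Germany", "Brazil", "Japan"],
    "years_coding": [2, 4, 6, 8],
})

# By default only numeric columns are summarized
# (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)

# Passing include="all" also summarizes non-numeric columns.
full_summary = df.describe(include="all")
print(full_summary)
```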

df.info() gives an overview of the DataFrame: the total number of rows, and for each column its count of non-null values and its data type.
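
A small sketch of info() on hypothetical data. Note that info() prints its report and returns None. The buf parameter used to capture the text is optional.

```python
import io

import pandas as pd

# Hypothetical sample data with some missing entries.
df = pd.DataFrame({
    "country": ["India", None, "Brazil"],
    "years_coding": [5.0, 12.0, None],
})

# Prints the index range, each column's non-null count and dtype, and memory usage.
df.info()

# To keep the report as a string, pass a buffer instead of printing.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
```

Comparing each column's non-null count with the total number of entries is a quick way to spot columns with missing values.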

By default, Pandas truncates the display of wide DataFrames, so not all columns are shown. To see every column, we change the display.max_columns option with pd.set_option().
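
One common way to show every column is to change the display.max_columns option. This is a sketch under the assumption that this is the setting meant here; the wide DataFrame is hypothetical.

```python
import pandas as pd

# A hypothetical wide DataFrame with 50 columns and one row.
wide = pd.DataFrame({f"col_{i}": [i] for i in range(50)})

# None removes the limit on how many columns are displayed.
pd.set_option("display.max_columns", None)
limit_while_unlimited = pd.get_option("display.max_columns")
print(wide)

# Restore the default once the wide output is no longer needed.
pd.reset_option("display.max_columns")
```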

DataFrame

In Pandas, a DataFrame is a two-dimensional table-like data structure that consists of rows and columns. Once created, a DataFrame can be manipulated, transformed, and analyzed using various Pandas functions and methods.
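
As a minimal sketch, a DataFrame can be built from a dictionary of columns and then transformed. The names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column, each list its values.
data = {
    "name": ["Asha", "Ben", "Chen"],
    "language": ["Python", "Rust", "Go"],
    "years_coding": [5, 2, 3],
}
df = pd.DataFrame(data)

# Transform: add a column derived from an existing one.
df["is_senior"] = df["years_coding"] >= 5

# Manipulate: sort the rows by a column.
by_experience = df.sort_values("years_coding", ascending=False)

# Analyze: compute an aggregate.
mean_years = df["years_coding"].mean()

print(by_experience)
print(mean_years)
```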

iloc and loc are two indexers in Pandas that allow us to select subsets of rows and columns from a DataFrame. iloc selects by integer position, while loc selects by label. Both are used with square brackets rather than parentheses.
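
The difference is easiest to see when the row labels are not plain integers. Here is a sketch with a hypothetical DataFrame indexed by name.

```python
import pandas as pd

# Hypothetical sample data; string row labels make the difference visible.
df = pd.DataFrame(
    {"language": ["Python", "Rust", "Go"], "years_coding": [5, 2, 3]},
    index=["asha", "ben", "chen"],
)

# iloc: select by integer position.
first_row = df.iloc[0]              # row at position 0
first_two = df.iloc[0:2]            # positions 0 and 1 (end is excluded)
cell_by_position = df.iloc[1, 0]    # row position 1, column position 0

# loc: select by label.
ben_row = df.loc["ben"]             # row labeled "ben"
asha_to_ben = df.loc["asha":"ben"]  # label slices include the end label
cell_by_label = df.loc["chen", "years_coding"]

print(first_two)
print(asha_to_ben)
```

One detail worth remembering: an iloc slice excludes its end position, while a loc slice includes its end label.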

Conclusion
I am interested in continuing my exploration of the Pandas library because there is a lot to learn from it that can be helpful for my future applications. I will continue listing my daily progress and try to remain consistent. Please do share your feedback on how I can make my 100daysofcode challenge more productive. I'll see you tomorrow for my daily update.
