Soujanya Satpute

Pandas - Brief

What is pandas?
pandas is a Python package built on top of NumPy; its built-in plotting relies on Matplotlib.
It has roughly 14 million users.

DataFrame: a two-dimensional, mutable, tabular data structure. It can be heterogeneous, meaning each column can hold a different data type.

  • .info() method: prints a summary of the DataFrame with column names, non-null counts, dtypes, and memory usage.
  • .head() method: returns the first few rows (the “head” of the DataFrame), five by default.
  • .describe() method: calculates summary statistics such as count, mean, standard deviation, min, max, and percentiles for the numeric columns.
  • .values attribute: returns the NumPy representation of the DataFrame. The newer to_numpy() method should be used instead of .values.
  • .columns attribute: lists the column labels of the DataFrame. The data type of each column is available through .dtypes.
  • .index attribute:
    Lists the row labels of the DataFrame. By default these are the row numbers, starting at 0.

  • .shape attribute:
    Returns a tuple of (number of rows, number of columns).

  • .size attribute:
    Returns the total number of elements in the DataFrame (rows × columns).

  • .ndim attribute:
    Returns the number of dimensions, which is 2 for a DataFrame.
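
To make the methods and attributes above concrete, here is a small sketch that calls each of them on a tiny DataFrame. The data and the column names (name, age, height_m) are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data, made up for this example
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 25, 40],
    "height_m": [1.62, 1.80, 1.75],
})

df.info()                  # prints column names, non-null counts, dtypes, memory usage
first_rows = df.head(2)    # first 2 rows (head() returns 5 rows by default)
stats = df.describe()      # summary statistics for the numeric columns

array = df.to_numpy()      # NumPy array, preferred over df.values
labels = list(df.columns)  # ['name', 'age', 'height_m']
row_index = df.index       # row labels, 0..2 here

print(df.shape)  # (3, 3): rows, columns
print(df.size)   # 9: rows * columns
print(df.ndim)   # 2
```

Note that .shape, .size, and .ndim are attributes, so they are written without parentheses.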

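  • DataFrame column selection
    Select a single column with square brackets or with attribute (dot) access. Select multiple columns with the double square bracket syntax: the outer brackets are the DataFrame selection syntax and the inner brackets are a list of column names.

# A single column, returned as a Series
column1 = dataFrame['columnName']
# The same column via attribute access
column1 = dataFrame.columnName
# Several columns, returned as a DataFrame
columns = dataFrame[['columnName', 'col2']]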
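A runnable version of the column selection syntax, as a minimal sketch. The DataFrame and its column names are made up; the point is only to show which form returns a Series and which returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy data, made up for this example
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 25, 40],
    "city": ["Pune", "Oslo", "Lima"],
})

ages = df["age"]              # single brackets: one column as a Series
ages_attr = df.age            # attribute access: same Series
subset = df[["name", "age"]]  # double brackets: a DataFrame with two columns

print(type(ages).__name__)    # Series
print(type(subset).__name__)  # DataFrame
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.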
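  • DataFrame row selection with logical tests: put a Boolean condition inside the square brackets to keep only the rows where it is True.
  • And/or operators in row selection: combine conditions with & (and) and | (or), wrapping each condition in parentheses.
  • Specific-value row selection: selects the rows where the given column equals the value. Other comparison operators (!=, <, >, <=, >=) work the same way.
row1 = dataFrame[dataFrame.column == 'Value']
row1 = dataFrame[dataFrame['column'] == 'Value']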

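Here is a small sketch of row selection with a single condition and with conditions combined using and/or. The column names follow the homelessness example used later in this post, but the values are invented and are not the real dataset.

```python
import pandas as pd

# Invented values for illustration, not the real homelessness data
df = pd.DataFrame({
    "state": ["California", "Nevada", "Texas", "Utah"],
    "region": ["Pacific", "Mountain", "South", "Mountain"],
    "family_members": [100, 20, 50, 5],
})

# Single condition: rows where region equals "Mountain"
mountain = df[df["region"] == "Mountain"]

# AND: both conditions must hold (each condition needs its own parentheses)
large_mountain = df[(df["region"] == "Mountain") & (df["family_members"] > 10)]

# OR: at least one condition must hold
pacific_or_small = df[(df["region"] == "Pacific") | (df["family_members"] < 10)]

print(mountain["state"].tolist())          # ['Nevada', 'Utah']
print(large_mountain["state"].tolist())    # ['Nevada']
print(pacific_or_small["state"].tolist())  # ['California', 'Utah']
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.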
  • Sorting a DataFrame:
sortedDataFrame = dataFrame.sort_values('column_to_sort')
sortedDataFrame = dataFrame.sort_values(by = ['column_to_sort1', 'column_to_sort2'])

Sorting can be performed on numbers and dates, as well as on text.
Extra parameters:
ascending = True / False (a list gives one setting per sort column),
na_position = 'first' / 'last', which controls where NaN values are placed.
Example:

homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])
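The sorting options can be tried on a small DataFrame. This is a sketch with invented values; the last_updated column is a hypothetical date column added only to show sorting on dates and the effect of na_position.

```python
import pandas as pd

# Invented toy data, not the real homelessness dataset
df = pd.DataFrame({
    "region": ["B", "A", "A", "B"],
    "family_members": [10, 40, 30, 20],
    # None becomes NaT, the missing value for dates
    "last_updated": pd.to_datetime(["2021-03-01", None, "2020-01-15", "2022-07-09"]),
})

# region ascending, then family_members descending within each region
by_region = df.sort_values(["region", "family_members"], ascending=[True, False])

# Oldest date first, with the missing date placed at the top
missing_first = df.sort_values("last_updated", na_position="first")

# Newest date first; missing values go last by default
newest_first = df.sort_values("last_updated", ascending=False)

print(by_region.index.tolist())      # [1, 2, 3, 0]
print(missing_first.index.tolist())  # [1, 2, 0, 3]
print(newest_first.index.tolist())   # [3, 0, 2, 1]
```

sort_values() returns a new DataFrame and keeps the original row labels, which is why the printed index shows the reordering.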
  • isin() method: isin() filters a DataFrame to the rows whose value in a particular column matches any value in a given list.
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness.state.isin(canu)]
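To see what isin() does under the hood, the sketch below prints the Boolean mask it produces before using it as a filter. The DataFrame is a made-up stand-in for homelessness with invented values.

```python
import pandas as pd

# Invented stand-in for the homelessness data
df = pd.DataFrame({
    "state": ["California", "Texas", "Utah", "Ohio"],
    "family_members": [100, 50, 5, 30],
})

canu = ["California", "Arizona", "Nevada", "Utah"]

# isin() returns one True/False per row
mask = df["state"].isin(canu)
print(mask.tolist())  # [True, False, True, False]

# Using the mask inside the brackets keeps only the True rows
mojave = df[mask]
print(mojave["state"].tolist())  # ['California', 'Utah']
```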
  • Adding a new column to a DataFrame: this is also called mutating or transforming the DataFrame, or feature engineering.
# Pattern only: some_transformation is a placeholder, not a real pandas attribute
dataframe['new_column'] = old_column.some_transformation
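As a concrete instance of the pattern above, here is a sketch that derives two new columns from existing ones. The column names individuals and family_members and all values are assumptions made up for the example.

```python
import pandas as pd

# Invented values for illustration
df = pd.DataFrame({
    "individuals": [80, 30],
    "family_members": [20, 10],
})

# New column computed from two existing columns
df["total"] = df["individuals"] + df["family_members"]

# New column that uses the column just created
df["share_family"] = df["family_members"] / df["total"]

print(df["total"].tolist())         # [100, 40]
print(df["share_family"].tolist())  # [0.2, 0.25]
```

The arithmetic is applied row by row across the whole column, so no loop is needed.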
  • Summary statistics: summary statistics let you summarise your data and learn more about it. Common methods are mean(), median(), mode(), min(), max(), var(), std(), sum(), and quantile(). The agg() method calculates custom summary statistics: pass it a function, or a list of functions to compute several statistics at once. An example of a custom 30th percentile is as follows.
def percentile30(column):
    # 30th percentile of the column
    return column.quantile(0.3)

dataFrame['columnName'].agg(percentile30)
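The sketch below shows agg() receiving a list of functions, which returns one result per function. The data is invented, and percentile40 is a second hypothetical helper added only so the list has more than one entry.

```python
import pandas as pd

# Invented values for illustration
df = pd.DataFrame({"family_members": [10, 20, 30, 40, 50]})

def percentile30(column):
    # 30th percentile of the column
    return column.quantile(0.3)

def percentile40(column):
    # 40th percentile of the column (hypothetical second helper)
    return column.quantile(0.4)

# Built-in statistics can be called directly...
mean_value = df["family_members"].mean()  # 30.0

# ...or by name through agg(), several at once
builtin = df["family_members"].agg(["mean", "median"])

# A list of custom functions gives one result per function,
# labelled with the function names
custom = df["family_members"].agg([percentile30, percentile40])
print(custom)  # about 22.0 and 26.0
```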

Functions like min() and max() also work on date columns.
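
A quick sketch of min() and max() on dates, assuming the column has been parsed into a datetime type with pd.to_datetime(). The column names and values are invented.

```python
import pandas as pd

# Invented values for illustration
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2020-01-15", "2022-07-09"]),
    "sales": [5, 3, 8],
})

earliest = df["date"].min()  # oldest date in the column
latest = df["date"].max()    # newest date in the column

print(earliest)  # 2020-01-15 00:00:00
print(latest)    # 2022-07-09 00:00:00
```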

Calculating cumulative statistics
cumsum(), cummax(), cummin(), and cumprod() return the running sum, maximum, minimum, and product of a column.
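
Each cumulative method returns a column of the same length, where every row holds the statistic of all rows up to and including that one. A minimal sketch with an invented sales column:

```python
import pandas as pd

# Invented values for illustration
df = pd.DataFrame({"sales": [5, 3, 8, 2]})

running_sum = df["sales"].cumsum()       # running total
running_max = df["sales"].cummax()       # largest value seen so far
running_min = df["sales"].cummin()       # smallest value seen so far
running_product = df["sales"].cumprod()  # running product

print(running_sum.tolist())      # [5, 8, 16, 18]
print(running_max.tolist())      # [5, 5, 8, 8]
print(running_min.tolist())      # [5, 3, 3, 2]
print(running_product.tolist())  # [5, 15, 120, 240]
```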

To be Continued...
