DEV Community

San

Exploratory Data Analysis Case Study - Companies Registered in India [1857 - 2020]

Exploratory Data Analysis (EDA) is a critical step in any data science project, and it involves analyzing datasets to summarize their main characteristics and uncover patterns, relationships, and anomalies. In this blog post, we'll take a closer look at one such dataset titled "Registered Companies" on Kaggle, and perform EDA on it to understand its main features.

The dataset contains information about registered companies in India, and includes 15 columns with various details such as the company name, registration date, industry classification, authorized capital, paid-up capital, and more. The dataset has over 15,000 records, making it a rich source of information for analyzing trends in the Indian corporate sector.

Let's begin by importing the necessary libraries and loading the dataset into a pandas DataFrame:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

companies_df = pd.read_csv('https://cdn.jovian.ml/sanuann/registered_companies.csv')
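Before plotting anything, it helps to confirm the size of the table and check for missing values. The sketch below is not part of the original notebook: `summarize` is a hypothetical helper, and the tiny `sample` DataFrame is made up so the snippet runs without downloading the CSV. In practice you would pass `companies_df` instead.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN values are not counted as distinct
    })


# Made-up stand-in for companies_df (values are illustrative only)
sample = pd.DataFrame({
    "COMPANY_NAME": ["A LTD", "B LTD", None],
    "PAIDUP_CAPITAL": [100000, 1500000, 1500000],
})

summary = summarize(sample)
print(sample.shape)  # (rows, columns)
print(summary)
```

Running `summarize(companies_df)` and `companies_df.shape` on the real data is a quick way to verify the row and column counts quoted above.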

Next, let's take a look at the first few rows of the DataFrame to get an idea of the data structure:

companies_df.head()

The output should resemble the following:

                     CIN                 COMPANY_NAME  ...  PAIDUP_CAPITAL  ACTIVITY_CODE
0  U93090TN2008PTC069316      BLUE LOTUS TECHNOLOGIES  ...         1500000          93090
1  U14292DL2005PTC136633  INDO ASIAN FUSEGEAR LIMITED  ...        68872420          14292
2  U51103MH2007PTC170327      JAIHIND PROJECTS PR LTD  ...        37830000          51103
3  U45203KA2007PTC042123      SILVERLINE INDIA PR LTD  ...         1600000          45203
4  U51109WB2005PTC106210       RUPASHI BANGLES PR LTD  ...          100000          51109

We can see that the dataset contains the company identification number (CIN), company name, registration date, state, industry classification, authorized capital, paid-up capital, and other details.
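The CIN itself carries several of these details. In the sample rows, the five digits after the first letter match `ACTIVITY_CODE`, followed by a two-letter state code and a four-digit year. Here is a minimal sketch for splitting a CIN into its parts, assuming the usual 21-character layout (listing status, industry code, state, year of incorporation, company type, registration number). The `parse_cin` helper and the field names are illustrative and not from the original notebook.

```python
import re

# Assumed layout: 1 letter + 5 digits + 2 letters + 4 digits + 3 letters + 6 digits
CIN_PATTERN = re.compile(
    r"^(?P<listing>[LU])"              # L = listed, U = unlisted
    r"(?P<industry_code>\d{5})"        # matches ACTIVITY_CODE in the sample rows
    r"(?P<state>[A-Z]{2})"             # state of registration, e.g. TN, DL, MH
    r"(?P<year>\d{4})"                 # year of incorporation
    r"(?P<company_type>[A-Z]{3})"      # e.g. PTC for a private limited company
    r"(?P<registration_number>\d{6})$"
)


def parse_cin(cin: str):
    """Split a CIN into named parts, or return None if it does not fit the layout."""
    match = CIN_PATTERN.match(cin.strip().upper())
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    return parts


parsed = parse_cin("U93090TN2008PTC069316")
print(parsed)
```

Applied to the whole column (for example with `companies_df['CIN'].map(parse_cin)`), this gives a registration year and state even for rows where the dedicated columns are missing. Rows returning `None` are worth inspecting by hand.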

Let's now explore the dataset by analyzing the distribution of various features using histograms, box plots, and scatter plots. We'll start with the distribution of authorized capital:

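# Capital values appear to be stored in rupees; divide by 1e7 to plot in crores
# (1 crore = 10,000,000 rupees), so the data matches the axis label
sns.histplot(companies_df['AUTHORIZED_CAPITAL'] / 1e7)
plt.title('Distribution of Authorized Capital')
plt.xlabel('Authorized Capital (in crores)')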
plt.show()

[Figure: histogram of authorized capital]

We can see that the majority of companies have an authorized capital of less than 10 crore, with a few companies having authorized capital greater than 100 crore.
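A histogram of heavily skewed data is hard to read precisely, so it can help to back the observation with a number. The sketch below computes the share of companies under a given threshold. It assumes the capital column is stored in rupees (as the `head()` output suggests). `share_below` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

RUPEES_PER_CRORE = 10_000_000  # 1 crore = 10 million rupees


def share_below(capital_rupees: pd.Series, threshold_crore: float) -> float:
    """Fraction of non-missing values that fall below the threshold (in crores)."""
    in_crore = capital_rupees.dropna() / RUPEES_PER_CRORE
    return float((in_crore < threshold_crore).mean())


# Made-up values: 0.01, 0.15, 5 and 200 crore
sample_capital = pd.Series([100_000, 1_500_000, 50_000_000, 2_000_000_000])

share = share_below(sample_capital, 10)
print(f"{share:.0%} of sample companies have authorized capital below 10 crore")
```

On the real data, `share_below(companies_df['AUTHORIZED_CAPITAL'], 10)` would quantify "the majority", and the same call with a threshold of 100 covers the upper tail.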

Let's now analyze the relationship between authorized and paid-up capital using a scatter plot:

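# Capital values appear to be stored in rupees; divide by 1e7 to plot in crores
sns.scatterplot(x=companies_df['AUTHORIZED_CAPITAL'] / 1e7, y=companies_df['PAIDUP_CAPITAL'] / 1e7)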
plt.title('Relationship between Authorized Capital and Paid-up Capital')
plt.xlabel('Authorized Capital (in crores)')
plt.ylabel('Paid-up Capital (in crores)')
plt.show()

[Figure: scatter plot of paid-up capital against authorized capital]

We can see that there is a strong positive correlation between authorized and paid-up capital, indicating that companies with higher authorized capital tend to have higher paid-up capital as well.
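To put a number on this relationship, we can compute correlation coefficients. The sketch below reports Pearson's coefficient alongside a rank-based (Spearman-style) coefficient, obtained here by ranking both columns and taking the Pearson correlation of the ranks. `capital_correlations` is a hypothetical helper and the sample values are made up. Comparing the two is useful because a handful of very large companies can dominate Pearson's coefficient.

```python
import pandas as pd


def capital_correlations(df: pd.DataFrame,
                         x: str = "AUTHORIZED_CAPITAL",
                         y: str = "PAIDUP_CAPITAL") -> dict:
    """Pearson and rank-based correlation between two columns, ignoring missing rows."""
    pair = df[[x, y]].dropna()
    pearson = pair[x].corr(pair[y])                 # linear association
    spearman = pair[x].rank().corr(pair[y].rank())  # monotonic association
    return {"pearson": float(pearson), "spearman": float(spearman)}


# Made-up sample with one very large company
sample = pd.DataFrame({
    "AUTHORIZED_CAPITAL": [1e6, 5e6, 1e7, 1e9],
    "PAIDUP_CAPITAL": [1e5, 4e6, 2e6, 9e8],
})

result = capital_correlations(sample)
print(result)
```

In this toy sample the single large company pushes Pearson's coefficient close to 1, while the rank-based value is lower. If the two differ noticeably on `companies_df`, the rank-based figure is the safer summary.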

Let's now analyze how paid-up capital is distributed within each industry classification using a box plot:

# Paid-up capital appears to be stored in rupees; divide by 1e7 to plot in crores
sns.boxplot(y=companies_df['PRINCIPAL_BUSINESS_ACTIVITY_AS_PER_CIN'], x=companies_df['PAIDUP_CAPITAL'] / 1e7)
plt.title('Distribution of Paid-up Capital by Industry Classification')
plt.xlabel('Paid-up Capital (in crores)')
plt.ylabel('Industry Classification')
plt.show()

[Figure: box plot of paid-up capital by industry classification]

We can see that the distribution of paid-up capital varies widely across industry classifications, with certain industries such as finance and real estate having higher paid-up capital on average.
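The box plot comparison can also be expressed as a table. The sketch below groups by industry and reports the count, median, and mean paid-up capital, sorted by median. `capital_by_industry` is a hypothetical helper, and the industry labels and values in `sample` are made up. The median is included because a few very large companies can pull the mean upward.

```python
import pandas as pd


def capital_by_industry(df: pd.DataFrame,
                        industry_col: str = "PRINCIPAL_BUSINESS_ACTIVITY_AS_PER_CIN",
                        capital_col: str = "PAIDUP_CAPITAL") -> pd.DataFrame:
    """Count, median and mean of capital per industry, highest median first."""
    return (
        df.groupby(industry_col)[capital_col]
        .agg(["count", "median", "mean"])
        .sort_values("median", ascending=False)
    )


# Made-up stand-in for companies_df
sample = pd.DataFrame({
    "PRINCIPAL_BUSINESS_ACTIVITY_AS_PER_CIN": [
        "Finance", "Finance", "Trading", "Trading", "Trading",
    ],
    "PAIDUP_CAPITAL": [5e7, 9e7, 1e5, 1e6, 4e7],
})

table = capital_by_industry(sample)
print(table)
```

Calling `capital_by_industry(companies_df)` on the real data would show whether finance and real estate still rank at the top once the comparison uses medians.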

In conclusion, we've performed EDA on the "Registered Companies" dataset and analyzed the distribution of several features using histograms, scatter plots, and box plots. We've identified interesting trends such as the strong positive relationship between authorized and paid-up capital, and the variation in paid-up capital across industry classifications. The dataset provides valuable insights into the Indian corporate sector, and can be used for further analysis and modeling.
For more information on this analysis or to run the project see here: https://jovian.com/sanuann/eda-registered-companies

I hope you found this blog post informative and helpful. If you have any questions or feedback, please feel free to leave a comment below.
