Bambo

Exploratory Data Analysis for Indian Startup Ecosystem- A Visualization Approach of Indian Startup Ecosystem Funding (2018–2021)

Introduction
In this article, I summarize my investigation into the funding patterns of Indian startups from 2018 to 2021. As a data scientist, I analyzed the data following the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology to gain insight into the Indian startup ecosystem.
The main goal of this research was to explore the dynamics of the Indian startup landscape, identify key trends, and understand which industries were most likely to secure significant funding during the specified period. With these valuable insights, I intend to offer well-founded recommendations to a fictional team eager to enter the Indian startup ecosystem.
Through data analysis and visualization, I aim to provide practical and actionable insights that can guide decision-making and help the team navigate the dynamic world of Indian startups. Let’s delve into the findings from this exploration.

Business Understanding
The Indian startup ecosystem has witnessed extraordinary growth, bolstered by India’s rapidly expanding economy and the emergence of numerous influential unicorn startups on the global stage. Despite their size, startups wield a profound impact on economic growth, creating jobs and fostering a robust employment market, leading to a healthier economy. Moreover, they serve as catalysts for economic vitality by driving innovation and fostering healthy competition.
The overarching goal of this project is to provide valuable insights to key stakeholders eager to venture into the Indian startup ecosystem. By meticulously analyzing crucial funding metrics, we aim to equip management teams with the knowledge necessary for making well-informed and impactful business decisions. Armed with data-driven insights, stakeholders can confidently steer their ventures toward success in the thriving and competitive Indian startup ecosystem.

Hypothesis

Null: Technological industries do not have a higher success rate of being funded.
Alternate: Technological industries have a higher success rate of being funded.

Research Questions:

  1. How has funding to startups changed over the period?
  2. How does location affect funding to startups?
  3. Which sectors are most favored by investors?
  4. What is the average amount of funding for start-ups in the sector with the most funding, and the location with the most funding?
  5. How does the breakdown by stages of funding look?
  6. Which start-ups were most favored by investors?

Data Preparation and Processing

At this stage, we organize the data to make it fit for analysis. Cleanliness and consistency of data are the objectives here.

Loading packages
To start with, the basic packages for the analysis were loaded. These were:

  • pandas: for data cleaning and manipulation
  • NumPy: for data cleaning and manipulation
  • Matplotlib: for visualisations
  • Seaborn: for visualisations
  • re: for regular expressions
  • warnings: to deal with the warnings

Notes from Previewing the DataFrames
The individual datasets were then loaded as Pandas DataFrames. The first observation was that the number of columns differs across the datasets: 2018 (6), 2019 (9), 2020 (10), and 2021 (9). Other observations from looking at the individual datasets are as follows:

The 2018 DataFrame
• The DataFrame has 526 rows and 6 columns.
• Dashes were used in the amounts column for deals whose values were not known.
• The amounts in the 2018 DataFrame are a mix of Indian Rupees (INR) and US Dollars (USD), meaning they have to be converted into the same currency.
• The industry and location columns hold multiple values per row. A decision is to be made between selecting the first value before the separator (,) as the main value or representing that column with a word cloud.

The 2019 DataFrame
• The DataFrame has 89 rows and 9 columns.
• The data type of the “Founded” column is set to float64. It should be set to a string for uniformity.
• The headquarter column holds multiple values per row. A decision is to be made between selecting the first value before the separator (,) as the main value or representing that column with a word cloud.

The 2020 DataFrame
• There is an extra column called “Unnamed:9”, giving it a total of 10 columns. It should be dropped to ensure complete alignment with the other DataFrames for ease of concatenation.

The 2021 DataFrame
• The data type of the “Founded” column is set to float64. It should be set to a string for uniformity.

General Notes
• The columns in 2018 are different from those of 2019–2021, meaning they have to be renamed before concatenation.
• The currency signs and commas have to be removed from each amount column for each DataFrame.
• All the columns with amounts have to be set to float.
• All the years of funding and the years founded should be converted to strings.
• The respective years of funding have to be attached to each DataFrame before combining.
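
The renaming, year-tagging and concatenation steps noted above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the tiny DataFrames and every column name in it ("Company Name", "Company/Brand", "Sector", "HeadQuarter", "Amount($)", "Funding Year") are assumptions made for the example.

```python
import pandas as pd

# Tiny stand-ins for two of the yearly datasets; all column names are assumed.
df_2018 = pd.DataFrame({
    "Company Name": ["Alpha"],
    "Industry": ["Fintech, Payments"],
    "Location": ["Mumbai, Maharashtra, India"],
    "Amount": ["1000000"],
})
df_2019 = pd.DataFrame({
    "Company/Brand": ["Beta"],
    "Sector": ["Edtech"],
    "HeadQuarter": ["Bangalore"],
    "Amount($)": ["$2,000,000"],
})

# Rename the 2018 columns so they line up with the later years.
df_2018 = df_2018.rename(columns={
    "Company Name": "Company/Brand",
    "Industry": "Sector",
    "Location": "HeadQuarter",
    "Amount": "Amount($)",
})

# Attach the funding year (as a string) to each DataFrame before combining.
frames = {"2018": df_2018, "2019": df_2019}
for year, frame in frames.items():
    frame["Funding Year"] = year

# Stack the yearly frames into one DataFrame with a fresh index.
combined = pd.concat(list(frames.values()), ignore_index=True)
```

Tagging each frame with its year before concatenating is what later makes a per-year comparison possible, since the combined rows no longer carry any other trace of their source file.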

Assumptions
1. The average Indian Rupee (INR) to US Dollar (USD) rate for the relevant year will be used for currency conversions.
2. The first values of industry and location in the 2018 data are the primary sector and headquarters respectively.
3. Amounts without currency symbols in the 2018 dataset are in USD.
4. Imputations will not be made for undisclosed and/or unavailable (missing) amounts due to the uncertainties, risks of misstatements and possible misleading effects on the analyses.
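
A rough sketch of how these assumptions could translate into a parsing helper is shown below. The function name `amount_to_usd` is hypothetical, and the conversion rate is an illustrative placeholder, not the yearly average actually used in the analysis.

```python
import re

# Illustrative placeholder (USD per INR); substitute the real yearly average rate.
USD_PER_INR_PLACEHOLDER = 0.015


def amount_to_usd(raw, usd_per_inr=USD_PER_INR_PLACEHOLDER):
    """Convert a raw amount string to USD, or return None if it is undisclosed."""
    text = str(raw).strip()
    # Only amounts carrying the rupee sign are treated as INR (assumption 3:
    # amounts without a currency symbol are taken to be USD).
    is_inr = text.startswith("₹")
    # Keep digits and the decimal point; this drops currency signs and commas.
    digits = re.sub(r"[^0-9.]", "", text)
    if not digits:
        # Dashes, "Undisclosed" and blanks stay missing (assumption 4: no imputation).
        return None
    value = float(digits)
    return value * usd_per_inr if is_inr else value
```

Applied to a column, for example with `df["Amount"].map(amount_to_usd)`, the missing values come back as nulls that pandas aggregations skip by default.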

Data Cleaning
In summary, the major activities performed on the DataFrames involved extensive data cleaning and preparation. String formatting was applied to all columns, excluding the amounts columns, which were converted to numeric format. The location and industry columns were split using commas as delimiters, with the first value selected as the primary headquarters or sector respectively.
For the 2018 Amount column, two new columns were created to aid in currency conversion, and after standardizing the amounts, the extra columns were dropped. Commas and currency signs were removed from the “Amounts” columns, and the “Undisclosed” text was replaced.
To ensure data integrity, “nan” values in the “Founders” column were replaced with nulls, and any notable misplaced or erroneous values were rectified in the respective rows.
Further refinements were made by dropping the extra unnamed column in the 2020 DataFrame. A new column indicating the year of funding was appended to each DataFrame.
To unify the DataFrames, the columns in the 2018 DataFrame were renamed to match the others before concatenating them. Additional steps included reformatting amounts as numerics, replacing nulls with zeros, and formatting funding years and years founded as strings.
To ensure data accuracy, all duplicates were removed, and the index was reset. Column-specific cleaning was also performed.
For more detailed functions and processes, please refer to the attached notebook.
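
As a hedged illustration of a few of these steps (dropping the stray unnamed column, keeping the first comma-separated value, and removing duplicates), a sketch on made-up rows might look like this. The column names are assumptions, and the unnamed column is matched by prefix rather than by its exact label.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed.
df = pd.DataFrame({
    "Company/Brand": ["Alpha", "Alpha", "Beta"],
    "Sector": ["Fintech, Payments", "Fintech, Payments", "Edtech"],
    "HeadQuarter": ["Mumbai, Maharashtra, India", "Mumbai, Maharashtra, India", "Bangalore"],
    "Unnamed: 9": [None, None, None],
})

# Drop any stray column whose label starts with "Unnamed".
df = df.loc[:, ~df.columns.str.startswith("Unnamed")].copy()

# Keep only the first comma-separated value as the primary sector / headquarters.
for col in ["Sector", "HeadQuarter"]:
    df[col] = df[col].str.split(",").str[0].str.strip()

# Remove exact duplicate rows and rebuild a clean 0..n-1 index.
df = df.drop_duplicates().reset_index(drop=True)
```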

Exploratory Data Analysis

[Figure: Power BI dashboard]

1 How funding to startups has changed over the period

[Figure: bar chart]

[Figure: line chart]

Over the analyzed period, funding to startups in the Indian ecosystem rose overall, despite a temporary dip in 2019. The number of deals surged from 525 in 2018 to 1,054 in 2020, reaching a peak of 1,190 in 2021.
Correspondingly, the amounts involved in these deals followed a similar growth pattern. Funding escalated from USD 6.6 billion in 2018 to USD 90 billion in 2020, and remarkably, it soared to USD 179.6 billion in 2021.
These compelling findings unequivocally demonstrate the thriving nature of the Indian startup landscape, as both the number of deals and funding received by startups experienced remarkable growth during the observed period.
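
A yearly summary of this kind can be computed with a single group-by. The sketch below uses made-up rows, not the article's data, and assumes the columns are named "Funding Year" and "Amount($)".

```python
import pandas as pd

# Made-up rows purely for illustration; missing amounts are undisclosed deals.
deals = pd.DataFrame({
    "Funding Year": ["2018", "2018", "2019", "2020", "2020", "2020"],
    "Amount($)": [1.0e6, None, 5.0e5, 2.0e6, 3.0e6, None],
})

yearly = deals.groupby("Funding Year").agg(
    deal_count=("Amount($)", "size"),  # counts every deal, disclosed or not
    total_usd=("Amount($)", "sum"),    # sum skips missing amounts
)
```

Counting with `size` rather than `count` keeps undisclosed deals in the deal tally while still leaving them out of the funding total.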

2 Which region received the most investment

In terms of the total number of deals for startups headquartered in the various locations, Mumbai, Bangalore, Gurgaon, and New Delhi were among the top 5. A visual representation can be seen below:

[Figure: line chart]

The majority of funding in the Indian startup ecosystem is concentrated in two key locations — Mumbai (82.5%) and Bangalore (8.6%). Together, they accounted for about 91% of the total funding received by Indian startups during the analyzed period, indicating significant centralization of funding in these cities.
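
Percentage shares like these come from dividing each location's total by the overall total. A minimal sketch on made-up figures, assuming columns named "HeadQuarter" and "Amount($)":

```python
import pandas as pd

# Made-up amounts for illustration; not the article's data.
funding = pd.DataFrame({
    "HeadQuarter": ["Mumbai", "Mumbai", "Bangalore", "Gurgaon"],
    "Amount($)": [6.0e6, 2.0e6, 1.5e6, 0.5e6],
})

# Total funding per location, then each location's share of the grand total.
totals = funding.groupby("HeadQuarter")["Amount($)"].sum()
share_pct = (totals / totals.sum() * 100).sort_values(ascending=False).round(1)
```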

3 Which sectors are favoured by investors

We note that funding to the Fintech and Retail sectors made up about 80% of total funding to startups over the period, implying centralization of funding around these sectors as well.

[Figure: line chart]

[Figure: pie chart]

The data analysis highlights a concentration of funding in specific sectors (Fintech and Retail) and locations (Mumbai and Bangalore) for startups in the Indian ecosystem. Startups positioned at the intersection of these sectors and locations received above-average funding in the deals they participated in.
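
One way to check the "above-average funding at the intersection" observation is to compare the mean deal size for the top sector in the top location against the overall mean. The sketch below runs on made-up rows with assumed column names and is not the notebook's own code.

```python
import pandas as pd

# Made-up deals for illustration only.
deals = pd.DataFrame({
    "Sector": ["Fintech", "Fintech", "Retail", "Edtech", "Fintech"],
    "HeadQuarter": ["Mumbai", "Mumbai", "Mumbai", "Bangalore", "Bangalore"],
    "Amount($)": [4.0e6, 6.0e6, 3.0e6, 1.0e6, 2.0e6],
})

# Identify the sector and the location with the largest total funding.
top_sector = deals.groupby("Sector")["Amount($)"].sum().idxmax()
top_location = deals.groupby("HeadQuarter")["Amount($)"].sum().idxmax()

# Average deal size at that intersection versus the overall average.
at_intersection = deals[
    (deals["Sector"] == top_sector) & (deals["HeadQuarter"] == top_location)
]
avg_intersection = at_intersection["Amount($)"].mean()
avg_overall = deals["Amount($)"].mean()
```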

Conclusion and Recommendations

  • The top sectors in the Indian startup ecosystem are Fintech, Retail, Edtech, Tech and E-commerce.
  • Bangalore has the most startups. It seems to be the emerging city, with the top sectors being Innovation Management, Food Delivery and Mechanical & Industrial Engineering.
  • Mumbai is the big city with the big money investments, with leading sectors being Fintech, Retail and Multinational conglomerates.
  • Different sectors attract the most investment in different regions. Overall, we reject our null hypothesis in favour of the alternate: technological industries, led by Fintech, have a higher success rate of being funded.
  • The team should start a business in Mumbai in the Fintech or Retail sector. There seems to be high demand for finance solutions and shopping experiences. Retail was popular during the pandemic as more people were probably shopping from home.
  • Alternatively, the team could start a business in Bangalore, since it seems to be the preferred region for startups. Innovation Management or Food Delivery is the preferred sector to go into.

Appreciation
I highly recommend Azubi Africa for their comprehensive and effective programs.