Basics of Statistics

#machinelearning #datascience #beginners #tutorial

Hey reader👋. Hope you are doing well.
In the last post we covered what statistics is, its types, and why it is important in data science. In this post we are going to talk about some of the basic terms used in statistics.
So let's get started🔥.

Population and Samples

Population-: In statistical terms, a population encompasses the entire set of items or individuals that share one or more characteristics of interest. The population is the complete pool from which a statistical sample can be drawn and to which results can be generalized.
Example-: Collecting data on all eligible voters in a state. Here, the eligible voters make up the population.
Populations can be finite or infinite:

Finite Population: A population with a limited number of members. For example, all the students in a particular school or all the residents of a city.

Infinite Population: A population with no fixed limit on its number of members, or one so large that it is treated as unlimited. For example, all the rolls of a die that could ever be made or all potential customers of an online store.

Collecting data from such a large set is quite difficult, which is why samples come into the picture.

Samples-: A sample is a subset of the population that is selected for analysis. The purpose of taking a sample is to make inferences about the population without examining every member. Samples are used because it is often impractical or impossible to study the entire population due to constraints of time, cost, or accessibility.
In the voter example above, collecting data from every eligible individual is difficult, so we select a few regions, collect data there, and base our further analysis on these samples.
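To make the idea of inferring from a sample concrete, here is a minimal sketch. The voter ages are made-up data, and for simplicity the sample is drawn at random from the whole population rather than region by region.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 eligible voters (made-up data).
population = [random.randint(18, 90) for _ in range(10_000)]

# Draw a sample of 500 voters without replacement.
sample = random.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean age: {population_mean:.2f}")
print(f"Sample mean age:     {sample_mean:.2f}")
```

The two means should land close to each other, which is the whole point: a well-chosen sample lets us estimate a population value without measuring everyone.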

Different Sampling Techniques

To select an appropriate sample from a population, we use different sampling techniques.

1. Simple Random Sampling-:
Every member of the population has an equal chance of being selected. Because selection is left purely to chance, this method reduces selection bias and tends to produce a sample that represents the population.

Example-: You want to select a simple random sample of 100 employees of a social media marketing company that has 1000 employees. You assign a number to every employee in the company database from 1 to 1000, and use a random number generator to select 100 numbers.
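The employee example can be sketched in a few lines of Python. This is only an illustration using the standard library; the variable names are my own.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Employee numbers 1..1000, as in the example above.
employee_ids = list(range(1, 1001))

# random.sample draws without replacement, so every employee has the
# same chance of being selected and nobody is picked twice.
srs = random.sample(employee_ids, k=100)

print(len(srs), sorted(srs)[:5])
```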

2. Stratified Sampling-:
The population is divided into strata (groups) based on a specific characteristic, and samples are drawn from each stratum. This ensures representation across key subgroups.

Example-:The company has 800 female employees and 200 male employees. You want to ensure that the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100 people.
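Here is a rough sketch of the same idea in code, assuming proportional allocation (each stratum contributes in proportion to its size). The records and the `stratified_sample` helper are hypothetical, written only for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical employee records: 800 women and 200 men.
employees = [{"id": i, "gender": "F"} for i in range(1, 801)]
employees += [{"id": i, "gender": "M"} for i in range(801, 1001)]

def stratified_sample(records, key, total):
    """Proportional stratified sample: each stratum contributes a
    share of `total` equal to its share of the population."""
    strata = {}
    for r in records:
        strata.setdefault(r[key], []).append(r)
    sample = []
    for members in strata.values():
        k = round(total * len(members) / len(records))
        # Simple random sampling inside each stratum.
        sample.extend(random.sample(members, k))
    return sample

strat = stratified_sample(employees, "gender", 100)
counts = {g: sum(1 for r in strat if r["gender"] == g) for g in ("F", "M")}
print(counts)  # {'F': 80, 'M': 20}
```

Note that with strata sizes that do not divide evenly, the rounding step may make the final sample slightly larger or smaller than `total`.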

3. Systematic Sampling-:
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.

Example-:All employees of the company are listed in alphabetical order. From the first 10 numbers, you randomly select a starting point: number 6. From number 6 onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.
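The same procedure can be sketched with list slicing. The employee names below are placeholders, and the starting point is 0-based because Python lists are indexed from 0.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical alphabetical list of 1000 employees.
employees = [f"employee_{i:04d}" for i in range(1, 1001)]

sample_size = 100
step = len(employees) // sample_size   # sampling interval: every 10th person

start = random.randrange(step)         # random start among the first 10 positions
systematic = employees[start::step]    # take every 10th person from there

print(step, len(systematic), systematic[:3])
```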

4. Convenience Sampling-:
Samples are selected based on ease of access. This method is less rigorous and more prone to bias but can be useful for preliminary research. A convenience sample simply includes the individuals who happen to be most accessible to the researcher.

Example-:You are researching opinions about student support services in your university, so after each of your classes, you ask your fellow students to complete a survey on the topic. This is a convenient way to gather data, but as you only surveyed students taking the same classes as you at the same level, the sample is not representative of all the students at your university.
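To see why a convenience sample can mislead, here is a small simulation. All numbers are invented: I assume satisfaction drops a little with each year of study, purely so that the bias becomes visible.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Made-up data: satisfaction scores (1 to 5) for four year groups of 250
# students each. The typical score per year is an assumption for the demo.
students = []
for year, typical in [(1, 4.0), (2, 3.5), (3, 3.0), (4, 2.5)]:
    for _ in range(250):
        score = min(5.0, max(1.0, random.gauss(typical, 0.5)))
        students.append({"year": year, "score": score})

true_mean = statistics.mean(s["score"] for s in students)

# Convenience sample: only your own classmates (first-year students).
classmates = [s for s in students if s["year"] == 1][:50]
convenience_mean = statistics.mean(s["score"] for s in classmates)

# Simple random sample of the same size, for comparison.
random_mean = statistics.mean(s["score"] for s in random.sample(students, 50))

print(f"All students:       {true_mean:.2f}")
print(f"Convenience sample: {convenience_mean:.2f}")
print(f"Random sample:      {random_mean:.2f}")
```

In this made-up setup the convenience sample overstates satisfaction, because it only ever sees one year group.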

Symbols used for Population and Sample Size-:

Population size -> N
Sample size -> n

Types of Variables

Generally we have two types of variables-:

Qualitative (Categorical) Variables -: These variables represent categories or groups and describe qualities or characteristics. They can be further divided into:

Nominal Variables: These have categories without any intrinsic order. Examples include gender (male, female), hair color (blonde, brown, black), and type of car (sedan, SUV, truck).

Ordinal Variables: These have categories with a meaningful order, but the intervals between the categories are not necessarily equal. Examples include educational level (high school, bachelor’s, master’s, PhD) and satisfaction rating (satisfied, neutral, dissatisfied).

Quantitative (Numerical) Variables -: These variables represent numerical values and can be measured. They can be further divided into:

Discrete Variables: These can take on a finite or countable number of values. Examples include the number of children in a family, the number of cars in a parking lot, and the number of pages in a book.

Continuous Variables: These can take on an infinite number of values within a given range. They are often measurements. Examples include height, weight, temperature, and time.
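The type of a variable decides which operations make sense on it. Below is a small sketch using invented survey records: counting for nominal data, ranking for ordinal data, and arithmetic for discrete and continuous data.

```python
from collections import Counter
import statistics

# Hypothetical survey records (made-up values).
records = [
    {"hair": "brown",  "education": "bachelor's",  "children": 2, "height_cm": 171.5},
    {"hair": "black",  "education": "PhD",         "children": 0, "height_cm": 180.0},
    {"hair": "brown",  "education": "high school", "children": 3, "height_cm": 165.5},
    {"hair": "blonde", "education": "master's",    "children": 1, "height_cm": 159.0},
]

# Nominal: no order, so we can only count categories.
hair_counts = Counter(r["hair"] for r in records)

# Ordinal: the order is meaningful, so we state it explicitly and compare ranks.
EDU_ORDER = ["high school", "bachelor's", "master's", "PhD"]
edu_rank = {level: i for i, level in enumerate(EDU_ORDER)}
highest = max(records, key=lambda r: edu_rank[r["education"]])["education"]

# Discrete: whole-number counts that can be summed.
total_children = sum(r["children"] for r in records)

# Continuous: measurements, so an average is meaningful.
mean_height = statistics.mean(r["height_cm"] for r in records)

print(hair_counts["brown"], highest, total_children, mean_height)
```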

You can find more about types of variables in my ML Day2 blog.

So this is it for this blog. I hope you have understood it well. If you have any questions, please leave a comment and I'll do my best to answer.
We will go deeper in the next blog. Till then stay connected and don't forget to follow me.🩵
