Bivariate Regression on MLB 2002 Dataset

Chandra Prakash ・2 min read

This is a mini-project I undertook in my pre-final year of college.

In this project, I analyzed data from the 2002 season of American Major League Baseball (MLB): a collection of batting statistics for 331 baseball players.

I aimed to determine whether there is a relationship between a player's batting average and the number of home runs that player hits.

First, I checked for outliers and then transformed the data so that it does not violate the assumptions of regression.
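The post does not say which outlier rule or transformation was used, so the following is only a minimal sketch of one common approach: Tukey's 1.5 × IQR rule for flagging outliers, followed by a square-root transformation of the home run counts. The data are simulated, and the names `players`, `batting_avg`, `home_runs`, and `is_outlier` are hypothetical stand-ins, not taken from the project.

```r
# Simulated stand-in for the batting data; column names are hypothetical.
set.seed(42)
n <- 331
batting_avg <- rnorm(n, mean = 0.265, sd = 0.03)
home_runs <- rpois(n, lambda = exp(0.5 + 7 * batting_avg))
players <- data.frame(batting_avg, home_runs)

# Flag values lying more than 1.5 * IQR outside the quartiles (Tukey's rule).
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Drop any player flagged on either variable.
flagged <- is_outlier(players$batting_avg) | is_outlier(players$home_runs)
cleaned <- players[!flagged, ]

# Assumed transformation: a square root, often used to stabilize
# the variance of count data before fitting a linear model.
cleaned$sqrt_home_runs <- sqrt(cleaned$home_runs)

summary(cleaned)
```

Whether a square root, a logarithm, or no transformation at all is appropriate depends on how the residuals of the fitted model behave on the real data.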

I used various types of plots to visualize the data, such as scatter plots and normal Q-Q plots.
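As a rough illustration of these two plots, here is a sketch in base R graphics. The project lists ggplot2, so the original plots were probably drawn differently; base R is used here only to keep the example dependency-free. The data are simulated and the variable names are hypothetical.

```r
# Simulated stand-in for the batting data; names are hypothetical.
set.seed(1)
n <- 331
batting_avg <- rnorm(n, mean = 0.265, sd = 0.03)
home_runs <- rpois(n, lambda = exp(0.5 + 7 * batting_avg))

# Simple linear regression of home runs on batting average.
fit <- lm(home_runs ~ batting_avg)

# Scatter plot of the two variables with the fitted line overlaid.
plot(batting_avg, home_runs,
     xlab = "Batting average", ylab = "Home runs",
     main = "Home runs vs. batting average")
abline(fit, col = "red")

# Normal Q-Q plot of the residuals: points close to the reference line
# suggest the normality assumption is reasonable.
qqnorm(residuals(fit))
qqline(residuals(fit))
```

The scatter plot gives a first look at the direction and strength of the relationship, while the Q-Q plot is a check on the regression assumptions rather than on the relationship itself.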

The code is available on GitHub:

https://github.com/Chandra0505/Project-1-mlb-dataset

How I built it

We divided the dataset into two sets: a training set (80%) and a test set (20%). We trained our model on the training set and used the test set to validate its accuracy on data the model had not seen.
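The split-and-fit step can be sketched as follows. This is not the project's code: it uses base R's `sample()` rather than caTools, runs on simulated data, and the names `players`, `train`, `test`, and `model` are hypothetical.

```r
# Simulated stand-in for the batting data; column names are hypothetical.
set.seed(123)
n <- 331
players <- data.frame(batting_avg = rnorm(n, mean = 0.265, sd = 0.03))
players$home_runs <- rpois(n, lambda = exp(0.5 + 7 * players$batting_avg))

# Randomly assign 80% of the rows to training and the rest to testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- players[train_idx, ]
test <- players[-train_idx, ]

# Fit the bivariate regression on the training set only.
model <- lm(home_runs ~ batting_avg, data = train)

# Predict on the held-out test set and measure the error there.
predicted <- predict(model, newdata = test)
test_rmse <- sqrt(mean((test$home_runs - predicted)^2))

print(coef(model))
print(test_rmse)
```

Fixing the seed makes the split reproducible, which matters when comparing models, because a different random split gives a somewhat different test error.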

Our final regression model achieves an accuracy of about 22%, which is quite good given that we were asked to perform a bivariate regression on only a player's batting average and home runs.
Of course, many other factors also affect a player's ability to hit home runs, such as size, strength, and number of at-bats.
Even so, batting average alone accounts for nearly one-fourth of the variability in the response.
Because the task was bivariate, we left out all other features, even though some of them could also play a crucial role in the relationship.
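The phrase "accounts for nearly one-fourth of the variability" reads like a coefficient of determination (R²), although the post does not name the metric. Assuming that interpretation, the sketch below shows three equivalent ways to obtain R² for a bivariate regression in R. The data are simulated, so the resulting value is not the project's result, and the variable names are hypothetical.

```r
# Simulated predictor and response; not the project's data.
set.seed(7)
x <- rnorm(331, mean = 0.265, sd = 0.03)
y <- 40 * x + rnorm(331, sd = 2.2)

fit <- lm(y ~ x)

# 1. R^2 as reported by summary().
r2_summary <- summary(fit)$r.squared

# 2. R^2 from its definition: 1 - (residual sum of squares / total sum of squares).
r2_manual <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)

# 3. With a single predictor and an intercept, R^2 equals the squared
#    Pearson correlation between predictor and response.
r2_cor <- cor(x, y)^2

print(c(summary = r2_summary, manual = r2_manual, correlation = r2_cor))
```

All three numbers agree up to floating-point error. The third form makes clear why a bivariate model can only ever capture the part of the response that moves linearly with the single predictor.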

What's the stack?

R as the programming language (version 3.5)
Libraries used: ggplot2, caTools, Publish
RStudio as the IDE (version 1.1.463)

My learnings / Feelings / Stories

Through this project, I learned how to apply various data science skills to a real-world dataset.

