If you're a confused beginner, like I was when I first started out with machine learning in Python, then stick around, because today I'll be doing my best to demystify and simplify machine learning for you!
In the last "Demystifying machine learning for beginners" blog post, I explained and demonstrated how to plot data, as well as how to classify new pieces of data. In this blog post, I'll be demonstrating how to predict one value based on another value using linear regression!
Let's get started!
First, import the required libraries (you can install them by using pip install or pip3 install):
from sklearn.linear_model import LinearRegression
import pandas
from sklearn import preprocessing
import numpy as np
For this tutorial, we're going to be using this dataset of medical expenses, which includes values such as age, sex, bmi and more:
df = pandas.read_csv('insurance.csv')
Next, we're going to make the "age" column represent the X axis, and let the "charges" column represent the Y axis, which holds the values we are going to try to predict:
X = np.array(df["age"]).reshape((-1, 1))
y = np.array(df["charges"])
- The "age" and "charges" columns are turned into arrays using the "np.array" function.
- The "reshape" method then turns the "age" array into a 2D array with a single column, which is the shape scikit-learn expects for "X".
Before predicting new values, add these lines of code:
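If the reshape step feels abstract, here is a small sketch of what it does to the shape of the data. The numbers below are made up by me and only stand in for insurance.csv, and the names toy_df, ages and X_toy are just for this example:

```python
import numpy as np
import pandas

# Made-up stand-in for insurance.csv (hypothetical values, for illustration only)
toy_df = pandas.DataFrame({
    "age": [19, 28, 33, 46],
    "charges": [1700.0, 4400.0, 5200.0, 8200.0],
})

# A single column becomes a 1D array: one number per row
ages = np.array(toy_df["age"])
print(ages.shape)   # (4,)

# reshape((-1, 1)) means "as many rows as needed, exactly 1 column"
X_toy = ages.reshape((-1, 1))
print(X_toy.shape)  # (4, 1)
```

The -1 simply tells NumPy to work out the number of rows by itself, so the same line works no matter how many rows the dataset has.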
model = LinearRegression()
model.fit(X, y)
- This defines our linear regression model and fits it to the "X" and "y" values defined earlier.
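To get a feel for what fitting actually learns, here is a minimal sketch on made-up data (not the insurance dataset). I built the toy charges so they lie exactly on the line charges = 100 * age + 500, which means the fitted model should recover a slope close to 100 and an intercept close to 500. The names ages_toy, charges_toy and toy_model are my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that lies exactly on a straight line (illustration only)
ages_toy = np.array([20, 30, 40, 50]).reshape((-1, 1))
charges_toy = 100 * ages_toy.ravel() + 500

toy_model = LinearRegression()
toy_model.fit(ages_toy, charges_toy)

# After fitting, the learned line is stored on the model itself
slope = toy_model.coef_[0]        # how much charges change per year of age
intercept = toy_model.intercept_  # the value of the line at age 0
print(slope, intercept)           # close to 100 and 500
```

Real data is never this tidy, so on insurance.csv the slope and intercept will just be the best straight-line compromise through all the points.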
Now we can finally predict new values!
Put these lines of code at the end of your script:
X_predict = [[35]]
y_predict = model.predict(X_predict)
print(y_predict)
- This will try to predict the insurance charges based on the age; in this case, the age is 35 years old.
If you run your code, your output should look like this:
>[12186.1766594]
- As you can see, our Python script predicted that someone aged 35 years old would have medical charges of about $12,186.
Now of course, this prediction isn't very accurate, because a person's medical expenses aren't going to be defined solely by their age. That's why we're going to add some more values to our model.
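Under the hood, a prediction from a one-value linear regression is just intercept + slope * age. Here is a small sketch that checks this on made-up numbers (again, not the real dataset), and also shows that predict can take several ages at once. The variable names are my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up ages and charges (NOT from insurance.csv)
toy_ages = np.array([20, 30, 40, 50, 60]).reshape((-1, 1))
toy_charges = np.array([2500.0, 4100.0, 5200.0, 7300.0, 8400.0])

toy_model = LinearRegression()
toy_model.fit(toy_ages, toy_charges)

# predict() accepts several rows at once: one inner list per person
ages_to_predict = [[25], [35], [45]]
predicted = toy_model.predict(ages_to_predict)

# The same numbers computed by hand from the fitted line
by_hand = toy_model.intercept_ + toy_model.coef_[0] * np.array([25, 35, 45])

print(predicted)
print(np.allclose(predicted, by_hand))  # True
```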
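One way to put a number on "not very accurate" is the score method of the model, which returns the R² value: 1.0 means the line explains the data perfectly, and values near 0 mean it explains very little. This is only a sketch on made-up, deliberately noisy numbers, so the score it prints says nothing about the real insurance data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, deliberately noisy data (illustration only)
toy_ages = np.array([20, 30, 40, 50, 60]).reshape((-1, 1))
toy_charges = np.array([2000.0, 9000.0, 4000.0, 15000.0, 7000.0])

toy_model = LinearRegression()
toy_model.fit(toy_ages, toy_charges)

# R^2 on the same data the model was fitted on
r2 = toy_model.score(toy_ages, toy_charges)
print(round(r2, 3))
```

In your own script you could call model.score(X, y) right after model.fit(X, y) to see how well age alone explains the charges.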
First of all, to add another value to our model, replace X = np.array(df["age"]).reshape((-1, 1))
with:
X = list(zip(df["age"], df["bmi"]))
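- This pairs each value of the "age" column with the matching value of the "bmi" column, giving one (age, bmi) row per person, which scikit-learn treats as a 2D array. Because this line sits above model.fit(X, y), the model is fitted again on both columns when you rerun the script.
Now we'll be able to predict a person's medical charges based not only on their age, but also on their bmi: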
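Here is a quick sketch of what the zipped list looks like, using three made-up rows instead of the real file. I also included selecting both columns with double brackets, which is another common way to get the same 2D data; that alternative is my suggestion, not something the rest of this post relies on:

```python
import numpy as np
import pandas

# Made-up stand-in for insurance.csv (hypothetical values)
toy_df = pandas.DataFrame({
    "age": [19, 28, 33],
    "bmi": [27.9, 33.0, 22.7],
})

# zip pairs the columns row by row: one (age, bmi) tuple per person
X_zip = list(zip(toy_df["age"], toy_df["bmi"]))
print(len(X_zip))              # 3
print(np.array(X_zip).shape)   # (3, 2)

# Alternative: select both columns at once and convert to a NumPy array
X_cols = toy_df[["age", "bmi"]].to_numpy()
print(np.allclose(np.array(X_zip), X_cols))  # True
```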
X_predict = [[35, 45]]
y_predict = model.predict(X_predict)
print(y_predict)
- Here our Python script is going to predict a person's medical charges based on their age and their bmi, where "35" is the age and "45" is the bmi.
If you run your script, the output should look something like this:
>[17026.20170095]
As you can see, this is different from our previous result because we've introduced new pieces of information into our model.
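With two input values, the model learns one coefficient per column, and the prediction becomes intercept + coef_age * age + coef_bmi * bmi. To make that visible, here is a sketch on made-up data that I generated from the rule charges = 200 * age + 50 * bmi + 1000, so we know what the model should find. None of these numbers come from insurance.csv:

```python
import numpy as np
import pandas
from sklearn.linear_model import LinearRegression

# Made-up data following charges = 200*age + 50*bmi + 1000 (illustration only)
toy_df = pandas.DataFrame({
    "age": [20, 30, 40, 50, 35],
    "bmi": [22.0, 31.0, 27.0, 35.0, 45.0],
})
toy_df["charges"] = 200 * toy_df["age"] + 50 * toy_df["bmi"] + 1000

X_toy = list(zip(toy_df["age"], toy_df["bmi"]))
y_toy = np.array(toy_df["charges"])

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# One coefficient per input column, in the same order as the columns in X
print(toy_model.coef_)       # close to [200, 50]
print(toy_model.intercept_)  # close to 1000

# Each row to predict needs the same two values: [age, bmi]
toy_prediction = toy_model.predict([[35, 45]])
print(toy_prediction)        # close to [10250]
```

Note that once the model has been fitted on two columns, every row you pass to predict must also have two values, in the same order.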
Nice! You now have your first working linear regression model that can predict values.
You can now experiment with this code, add more values, predict different sets of values and more!
Byeeeee 👋
Top comments (1)
Why didn't you use train_test_split before training your data?