DEV Community

Discussion on: Building our First Machine Learning Model (Pt. 4)

Cindy Trowbridge • Edited

I'm having problems following this and getting working code. Specifically, on the random_forest.fit(train_X, train_y) call, I get the following error: "ValueError: could not convert string to float: 'setosa'"

I think this may be because the train_X data still has the species in text format. How does the fit function know the relationship between the X species column and the Y setosa/versicolor/virginica columns? Do I need to do one-hot encoding on the X data?

Also, the steps seem to be out of order. Shouldn't you do the get_dummies(y) call before you do the train_test_split(x, y, ...)? Maybe this isn't intended to be a full working example?

Michael Learns

Right! My bad 😅 The order is actually correct, though. Calling get_dummies before splitting the data might cause data leakage; we want each split to be "pure". My mistake was the line y = pd.get_dummies(y). I've updated the code to this instead:

train_y = pd.get_dummies(train_y)
val_y = pd.get_dummies(val_y)
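
One thing to watch when encoding each split separately: if a class happens to be absent from one split, pd.get_dummies will produce fewer columns for it. Here is a minimal sketch of aligning the validation columns to the training columns with reindex. It assumes hypothetical toy labels rather than the tutorial's actual data, and the `_encoded` names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy labels; this validation split happens to lack 'virginica'.
train_y = pd.Series(["setosa", "versicolor", "virginica", "setosa"])
val_y = pd.Series(["setosa", "versicolor"])

train_y_encoded = pd.get_dummies(train_y)
val_y_encoded = pd.get_dummies(val_y)  # only two columns at this point

# Align the validation columns to the training columns.
# Classes missing from the validation split become all-zero columns.
val_y_encoded = val_y_encoded.reindex(
    columns=train_y_encoded.columns, fill_value=0
)

print(list(train_y_encoded.columns))
print(list(val_y_encoded.columns))
```

With a balanced dataset like iris and a reasonably sized validation split this is unlikely to come up, but the alignment step is cheap insurance.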

Sorry I took so long to reply 😅 You can easily reach me on Twitter, though: @heyimprax.

Cindy Trowbridge

I added the two lines above, but I still get the same error message. "ValueError: could not convert string to float: 'setosa'"

I think this may be because the train_X data still has the species in text format. How does the fit function know the relationship between the X species column and the Y setosa/versicolor/virginica columns? Do I need to do one-hot encoding on the X data?

Could you post a full, working Python script somewhere so I can see how this is supposed to work?

Michael Learns • Edited

Oh right! Take species out of the features array. It's the target we're predicting, so it shouldn't be among the inputs, and removing it should fix the "ValueError: could not convert string to float: 'setosa'".

Also, I've added the missing import for the mean_absolute_error function: from sklearn.metrics import mean_absolute_error
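
Putting the fixes from this thread together, the flow might look roughly like the sketch below. Several things here are assumptions, not taken from the original post: the DataFrame is rebuilt from sklearn's bundled iris dataset instead of the tutorial's file, the column names are made up, and RandomForestRegressor is a guess based on the use of mean_absolute_error.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assumption: rebuild an iris DataFrame with a text 'species' column,
# standing in for the data the tutorial loads.
iris = load_iris()
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
data = pd.DataFrame(iris.data, columns=feature_names)
data["species"] = [iris.target_names[i] for i in iris.target]

# Numeric columns only: 'species' is the target, so it stays out of X.
features = feature_names
X = data[features]
y = data["species"]

# Split first; stratify so every species appears in both splits.
train_X, val_X, train_y, val_y = train_test_split(
    X, y, random_state=0, stratify=y
)

# One-hot encode the targets after splitting (dtype=float keeps them numeric).
train_y = pd.get_dummies(train_y, dtype=float)
val_y = pd.get_dummies(val_y, dtype=float)

# Assumption: a regressor, since the thread scores with mean_absolute_error.
random_forest = RandomForestRegressor(random_state=0)
random_forest.fit(train_X, train_y)

predictions = random_forest.predict(val_X)
mae = mean_absolute_error(val_y, predictions)
print(mae)
```

The fit call learns the mapping from the four numeric measurements in train_X to the three one-hot columns in train_y, so X never needs to contain the species text.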

Here's a link to a working Kaggle notebook: kaggle.com/interestedmike/iris-dat...