I'm having problems following this and getting working code. Specifically, on the random_forest.fit(train_X, train_y) call, I get the following error: "ValueError: could not convert string to float: 'setosa'"
I think this may be because the train_X data still has the species in text format. How does the fit function know the relationship between the X species column and the Y setosa/versicolor/virginica columns? Do I need to do one-hot encoding on the X data?
Also, the steps seem to be out of order. Shouldn't you do the get_dummies(y) call before you do the train_test_split(x, y, ...)? Maybe this isn't intended to be a full working example?
Right! My bad 😅 The order is actually correct. Doing get_dummies before splitting the data might cause data leakage. We want to make sure that when we split our data, it is "pure". My mistake was the y = pd.get_dummies(y) line. I've updated it so that it looks like this instead:
train_y = pd.get_dummies(train_y)
val_y = pd.get_dummies(val_y)
Sorry I took so long to reply 😅 You can easily reach me tho through Twitter @heyimprax.
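For anyone following along, the split-then-encode order described in this reply can be sketched roughly as below. This is a minimal sketch under assumptions, not the article's actual code: it assumes the data comes from scikit-learn's load_iris, that the text label column is called species, and it adds a reindex step (not mentioned in the thread) as a guard in case one split happens to miss a class.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: rebuild an iris dataset with a text "species" label,
# similar to the one discussed in the thread.
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target.map(dict(enumerate(iris.target_names))).rename("species")

# Split first, while y is still a single column of text labels.
train_X, val_X, train_y, val_y = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Then one-hot encode each split separately, as in the reply.
train_y = pd.get_dummies(train_y)
val_y = pd.get_dummies(val_y)

# Guard (an addition, not from the thread): make sure the validation
# columns match the training columns even if a class were missing.
val_y = val_y.reindex(columns=train_y.columns, fill_value=0)
```

Encoding the two splits separately only lines up if both contain every class, which is why the sketch stratifies the split and reindexes the validation columns.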
I added the two lines above, but I still get the same error message. "ValueError: could not convert string to float: 'setosa'"
I think this may be because the train_X data still has the species in text format. How does the fit function know the relationship between the X species column and the Y setosa/versicolor/virginica columns? Do I need to do one-hot encoding on the X data?
Could you post a full, working Python script somewhere so I can see how this is supposed to work?
Oh right! Take out the species in the features array. That should fix the "ValueError: could not convert string to float: 'setosa'".
Also, I've added the missing from sklearn.metrics import mean_absolute_error for the mean_absolute_error function.
Here's a link to a working Kaggle notebook: kaggle.com/interestedmike/iris-dat...
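For readers who can't open the notebook, here is a rough end-to-end sketch of the fix described above: species is left out of the features, the labels are one-hot encoded after the split, and mean_absolute_error is imported. It is not the linked notebook's code. It assumes the data is loaded with scikit-learn's load_iris, and that random_forest is a RandomForestRegressor (the thread only shows the variable name, so the model type, its parameters, and the column names here are all assumptions).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assumption: build a DataFrame with four numeric columns plus a text
# "species" column, like the data discussed in the thread.
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Features are the numeric measurements only; "species" is NOT included,
# which is what avoids "could not convert string to float: 'setosa'".
features = list(iris.feature_names)
X = df[features]
y = df["species"]

# Split first, then one-hot encode the labels of each split.
train_X, val_X, train_y, val_y = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
train_y = pd.get_dummies(train_y).astype(float)
val_y = (
    pd.get_dummies(val_y)
    .reindex(columns=train_y.columns, fill_value=0)  # keep columns aligned
    .astype(float)
)

# Assumption: a regressor, since the thread scores with mean_absolute_error.
random_forest = RandomForestRegressor(n_estimators=100, random_state=0)
random_forest.fit(train_X, train_y)

# One predicted score per species column for each validation row.
predictions = random_forest.predict(val_X)
mae = mean_absolute_error(val_y, predictions)
print(f"Validation MAE: {mae:.3f}")
```

This also answers the earlier question about X: the features do not need one-hot encoding once species is removed from them, because species is the target and only appears in y.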