Intro and Ludwig
At the start of February 2019, Uber made their code-free machine learning toolbox, Ludwig, open-source.
Website - https://uber.github.io/ludwig/
User guide - https://uber.github.io/ludwig/user_guide/
Github repo - https://github.com/uber/ludwig/
Ludwig runs on top of the popular and powerful TensorFlow library and offers a CLI for experimenting with, training, and predicting using TensorFlow models.
As an engineer, I'm absolutely not a data scientist. I know enough about TensorFlow to build the most basic of models from tutorials, but I really couldn't create anything from scratch. Ludwig offered that opportunity.
Our first experiment
Let's dive in and run through a basic example. We're going to try to recreate the Keras tutorial at https://www.tensorflow.org/tutorials/keras/basic_regression with zero lines of code.
The Auto MPG dataset contains basic data about a set of cars. Our task is to predict each car's MPG from the other features provided. I've grabbed this data and converted it to a CSV file for use in this example.
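If you want to reproduce that step, the conversion is essentially the data-loading part of the Keras tutorial plus a write to CSV. A rough sketch with pandas (the URL and column handling follow the Keras tutorial; I've written the year column as ModelYear so the name has no space):

import pandas as pd

# Raw Auto MPG data, as used by the Keras tutorial
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
columns = ["MPG", "Cylinders", "Displacement", "Horsepower", "Weight",
           "Acceleration", "ModelYear", "Origin"]

# The car name sits after a tab at the end of each row, so treating '\t' as a
# comment character drops it; '?' marks the missing horsepower values
df = pd.read_csv(url, names=columns, na_values="?", comment="\t",
                 sep=" ", skipinitialspace=True)

# Drop the handful of rows with missing values
df = df.dropna()

df.to_csv("cars.csv", index=False)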
Ludwig uses a model definition file to set the parameters for building the model. Ludwig's internals deal with your data: it creates train, test and validation sets and transforms the data into the best format for training, depending on the data type you've specified.

The Keras example requires us to manipulate the data ourselves before we can train and test the model. Ludwig does all of this for us, which lets us train the model immediately once we've set up a model definition file at modeldef.yaml. Here we define the input features and their data types. There are a number of other parameters against each feature which can be set for more complex models. We also define the output feature and its parameters.
input_features:
    -
        name: Cylinders
        type: numerical
    -
        name: Displacement
        type: numerical
    -
        name: Horsepower
        type: numerical
    -
        name: Weight
        type: numerical
    -
        name: Acceleration
        type: numerical
    -
        name: ModelYear
        type: numerical
    -
        name: Origin
        type: category
output_features:
    -
        name: MPG
        type: numerical
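One thing worth knowing: by default Ludwig makes the train/validation/test split for you. If you want control over the split sizes, the user guide describes a global preprocessing section of the model definition; if I'm remembering the key name correctly, it looks something like the snippet below (do check the docs before relying on it):

preprocessing:
    split_probabilities: [0.7, 0.1, 0.2]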
First run
Our first experiment can now be run with the following command:
ludwig experiment --data_csv cars.csv --model_definition_file modeldef.yaml --output_directory results
This gives the following results:
===== MPG =====
loss: 52.9658573971519
mean_absolute_error: 6.3724554520619066
mean_squared_error: 52.9658573971519
r2: 9.58827477467211e-05
After the 200 epochs complete, I have a mean absolute error (MAE) of 6.4 (yours may vary slightly depending on the random train/test split). This means that, on average, a car's predicted MPG is 6.4MPG away from the actual value. Bearing in mind that values generally fall between 10MPG and 47MPG, 6.4MPG represents quite a large error.
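As an aside, everything the CLI does here is also exposed through Ludwig's programmatic API if you'd rather stay in Python. A rough sketch of the equivalent - check the user guide for the exact signatures; the model definition is passed as a dict rather than a YAML file, and new_cars.csv is just a hypothetical file of cars to score:

from ludwig.api import LudwigModel

# Same content as modeldef.yaml, expressed as a dict
model_definition = {
    "input_features": [
        {"name": "Cylinders", "type": "numerical"},
        {"name": "Displacement", "type": "numerical"},
        {"name": "Horsepower", "type": "numerical"},
        {"name": "Weight", "type": "numerical"},
        {"name": "Acceleration", "type": "numerical"},
        {"name": "ModelYear", "type": "numerical"},
        {"name": "Origin", "type": "category"},
    ],
    "output_features": [{"name": "MPG", "type": "numerical"}],
}

model = LudwigModel(model_definition)
train_stats = model.train(data_csv="cars.csv")        # Ludwig handles the train/validation/test split
predictions = model.predict(data_csv="new_cars.csv")  # hypothetical CSV of cars to score
model.close()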
Refinement
If you were watching the log scrolling as Ludwig was running, you'd have seen the MAE against the validation set reducing with each epoch.
The Keras example suggested a final MAE of ~2, so we need a bit of tweaking to get closer. There was a fair indication that the MAE was still decreasing as the run ended, so we can increase the number of epochs with a simple addition to the model definition:
training:
    epochs: 400
and continue training from the previous run with the command
ludwig experiment --data_csv cars.csv --model_definition_file modeldef.yaml --output_directory results -mrp ./results/experiment_run_0
Our MAE only comes down to 5.3MPG. Still not that close.
Further refinement
In a real-life example, we'd keep amending the hyperparameters, retraining, then amending and retraining again for as long as our target MAE keeps falling.
We'll skip this step by replicating the hyperparameters from the Keras tutorial:
training:
    batch_size: 32
    epochs: 400
    early_stop: 50
    learning_rate: 0.001
    optimizer:
        type: rmsprop
In addition, we set early stop at 50 epochs - this means that our model will stop training if our validation curve doesn't improve for 50 epochs. The experiment is fired off in the same way as before. It produces these results:
Last improvement of loss on combined happened 50 epochs ago
EARLY STOPPING due to lack of validation improvement, it has been 50 epochs since last validation accuracy improvement
Best validation model epoch: 67
loss: 10.848812248133406
mean_absolute_error: 2.3642308198952975
mean_squared_error: 10.848812248133406
r2: 0.026479910446118703
We get a message that our model has stopped training at 132 epochs because it's hit the early stop limit.
MAE is down to 2.36MPG without writing a line of code, and our example achieves results similar to the Keras tutorial's.
Visualising our training
Now we'd like to check that our training and validation loss curves track each other closely and aren't showing signs of overfitting. Ludwig continues to deliver on its promise of a no-code solution. We can view our learning curves with the following command:
ludwig visualize -v learning_curves -ts results/experiment_run_0/training_statistics.json
The curves continue to follow a similar trajectory. Should the validation curve start heading upwards while the training curve carries on downwards, it would suggest that overfitting is occurring.
Real life validation
Ok, this is all well and good but tutorials notoriously pick and choose data so the output "just works". Let's try our model out with some real data.
With a bit of investigation, I've dug out the required stats of the DeLorean DMC-12 (https://en.wikipedia.org/wiki/DMC_DeLorean):
Cylinders: 6
Displacement: 2849cc (174 cubic inches)
Horsepower: 130hp
Weight: 1230 kg (2712 lb)
Acceleration: 10.5s
Year: 1981
Origin: US
and converted it to the same CSV format as the training data (two-digit model year, and Origin coded as 1, the code the dataset uses for American cars):
Cylinders,Displacement,Horsepower,Weight,Acceleration,ModelYear,Origin
6,174,130,2712,10.5,81,1
Now, to predict the fuel economy of this car, we run the predict command through Ludwig:
ludwig predict --data_csv delorean.csv -m results/experiment_run_0/model -op
We specify the -op flag to tell Ludwig that we only want predictions. Inputting a CSV file that includes an MPG column and omitting this flag will run the predictions but also provide statistics comparing them against the actual values supplied in the file.
The result given by my model is 23.53405mpg. How good is this? Unfortunately our Wikipedia article doesn't show the published fuel economy but I did manage to find it in this fantastic article about the amazing car - 22.8mpg. A pretty decent real life test!
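As a side note, now that we have a published figure, we could exercise the other mode of predict mentioned above: put an MPG column in the CSV, drop the -op flag, and Ludwig will report error statistics against that value as well as the prediction. A sketch, using a hypothetical delorean_with_mpg.csv:

Cylinders,Displacement,Horsepower,Weight,Acceleration,ModelYear,Origin,MPG
6,174,130,2712,10.5,81,1,22.8

ludwig predict --data_csv delorean_with_mpg.csv -m results/experiment_run_0/model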
Summary
I appreciate that the data scientists out there are screaming that we didn't run through any analysis on the input features to create a meaningful feature set and that we didn't run specific analysis on the test data predictions. I also appreciate that MAE isn't necessarily the ultimate measure of accuracy as it may be skewed heavily by outliers which we could have validated through further analysis.
What we have shown is that using Ludwig, we can experiment and train a machine learning model and then predict using the model we've trained.
Machine learning is becoming more and more accessible. Ludwig seems to be a big step forward in that regard.
Top comments (12)
Can you please elaborate on setting up the YAML file for the model definition?
Hi - yeah, sure. The modeldef file is a YAML file which defines, at its most basic, the input and output features. In this most basic of examples, we're telling Ludwig that the input features are either numerical or category. Ludwig then preprocesses the data accordingly to train our model.
More complex models may use a level of NLP to break down sentences or process images. This would all be specified in the input feature.
Equally, we may want to define how the output features are generated. Again, that's specified here.
In my final refinement above, I've also specified the training parameters.
Documentation and plenty of examples can be found here - uber.github.io/ludwig/examples/
The final, full modeldef.yaml from the above example looks like gist.github.com/c-m-hunt/3271efb2a...
For a scenario-specific question: I'm running Ludwig in Colab and the format of the YAML file as described in the docs doesn't work the same way there - it throws errors. Have you had an opportunity to explore this?
Took a bit of fiddling to get running in Colab but here it is...
colab.research.google.com/drive/1Z...
I think the explanation at the top of the notebook may have been the issue you were having. I had been running from terminal so hadn't come across it. See Github issue for details.
The first bit is getting the data and just dropping NaNs. I also changed one column name, as I couldn't work out how to get it working with column names that have spaces in them.
The rest is the training. The key is that the training is code free.
In both instances however you use a pre-built model_definition.yaml file. Any chance you would now how to create a simple model_definition.yaml file from scratch in-line? I kind of wanted to understand how that would work as it is not described in the documents so well.
It's not "pre-built" - you have to write the model definition yourself.
The docs at uber.github.io/ludwig/user_guide/#... explain the basics of the model definition file. It's just a yaml file defining the input and output features along with any additional parameters you want to override the Ludwig defaults.
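To make that concrete, the smallest useful model definition is just one input feature and one output feature, e.g.:

input_features:
    -
        name: Horsepower
        type: numerical
output_features:
    -
        name: MPG
        type: numerical

Everything else (training parameters, preprocessing and so on) is optional and falls back to Ludwig's defaults.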
Thank you. I figured it out. I just didn't realize how to start it off.
Chris
I'm trying to redo your example but the CSV seems to have constraints which I cannot find.
Firstly I saw that the output feature should be the last field (makes some sense).
I reconfigured the data in the order of the YAML definition but keep running into errors.
=====
indexer = self.columns.get_loc(key)
File "c:\users\uan401\appdata\local\continuum\anaconda3\envs\ludwig\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Cylinders'
I removed some data with missing values but got the same results.
What is the structure you are using (a 4-line example)?
my example data
8\307.0\130.0\3504.\12.0\70\1\18.0
8\350.0\165.0\3693.\11.5\70\1\15.0
8\318.0\150.0\3436.\11.0\70\1\18.0
8\304.0\150.0\3433.\12.0\70\1\16.0
NB df = pandas.read_csv("autos3.csv",header=None, sep='\')
works fine with my data (\ is a double \ )
Hi. Have you had a look at the colab notebook in the previous comment? That runs end to end including the little bit of fiddling with the data
Hello!
Could you please elaborate on the visualization commands? I am facing some warnings related to TensorFlow.
Thanks in anticipation
Hi
I think it's the latest version of TensorFlow warning about features which are going to be deprecated in TensorFlow 2, which is currently sitting in alpha. It's nothing to be concerned about at the moment. I'm going to try to find out if Ludwig will be moved to support TF2 - I'm pretty confident it will be.
Thanks
Chris