ML Model Inference in Painless

#elasticsearch #java #machinelearning

Inference in Painless

I am an employee of Elastic at the time of writing.

Machine learning inference is just math. You have some parameters, pump them through some functions, and boom, you get a result. While this is simple on the surface, all the tooling can get complex. Could I script simple model inference in Elasticsearch?

What is [Elasticsearch | Painless]

Elasticsearch is a distributed, RESTful, and open data store. The underlying store is Lucene with a bunch of goodies built on top.

Painless is a secure, simple, and flexible scripting language purpose-built for Elasticsearch. You can use custom scripts at search time, in many different aggregations, and even at ingest time. It's crazy powerful and flexible. But with great power comes great responsibility.

Machine Learning inference in Painless

Painless 100.5 (not even 101)

Painless can be used in a couple of ways:

inline: the whole script is included in the API call.
stored: the script is stored in Elasticsearch's cluster state and referenced by id.

Painless scripts can reference fields in the given context (doc fields, _source fields). They also have access to a params object, which lets the same script be reused with different input parameters.

Simple models

Linear regression, being intuitive and simple, is a very nice place to start the experiments.

It is trivial to implement linear regression with a single target in Painless.

# Storing a simple linear regression function script
PUT _scripts/linear_regression_inference
{
  "script": {
    "lang": "painless",
    "source": """
    // This assumes the parameter definitions will be given when used
    // This also assumes a single target.
    double total = params.intercept;
    for (int i = 0; i < params.coefs.length; ++i) {
      total += params.coefs.get(i) * doc[params['x'+i]].value;
    }
    return total;
    """
  }
}
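To sanity-check the stored script outside Elasticsearch, the same arithmetic can be sketched in plain Python. This is only an illustration: the function name `linear_regression_score`, the dictionary-shaped `doc`, and the two-feature example values are assumptions made for the sketch, not part of any Elasticsearch API.

```python
def linear_regression_score(params, doc):
    """Mirror the Painless script: intercept + sum(coefs[i] * doc[params['x' + i]])."""
    total = params["intercept"]
    for i, coef in enumerate(params["coefs"]):
        field_name = params["x" + str(i)]  # params["x0"] names the first feature field
        total += coef * doc[field_name]
    return total


# Hypothetical two-feature model and document, purely for illustration.
example_params = {"intercept": 1.0, "coefs": [2.0, -3.0], "x0": "a", "x1": "b"}
example_doc = {"a": 0.5, "b": 0.25}
print(linear_regression_score(example_params, example_doc))
```

Comparing a few documents scored this way against the script field returned by Elasticsearch is a cheap way to catch a swapped coefficient or field name.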

I trained a simple model in scikit-learn on the diabetes data set. Here the model's resulting parameters are passed to the stored script, which returns the prediction as a script field.

GET diabetes_test/_search
{
  "script_fields": {
    "regression_score": {
      "script": {
        "id": "linear_regression_inference",
        # Here are the model parameters. The linear regression coefficients and intercept. 
        "params": {
          # coef_ attribute from sklearn 
          "coefs": [-35.55683674, -243.1692265, 562.75404632, 305.47203008, -662.78772128, 324.27527477, 24.78193291, 170.33056502, 731.67810787, 43.02846824],
          # intercept_ attribute from sklearn
          "intercept": 152.53813351954059,
          "x0": "age",
          "x1": "sex",
          "x2": "bmi",
          "x3": "bp",
          "x4": "s1",
          "x5": "s2",
          "x6": "s3",
          "x7": "s4",
          "x8": "s5",
          "x9": "s6"
        }
      }
    }
  }
}
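Typing ten coefficients and ten field names by hand is error-prone, so the params object could be built programmatically instead. A minimal sketch follows; `build_script_params` is a hypothetical helper, and its `coefs` and `intercept` arguments are assumed to come from a fitted scikit-learn model's `coef_` and `intercept_` attributes.

```python
import json


def build_script_params(coefs, intercept, feature_names):
    """Build the params object expected by the stored linear regression script."""
    if len(coefs) != len(feature_names):
        raise ValueError("need exactly one field name per coefficient")
    params = {"coefs": [float(c) for c in coefs], "intercept": float(intercept)}
    # The script looks up field names under the keys x0, x1, ...
    for i, name in enumerate(feature_names):
        params["x" + str(i)] = name
    return params


# Shortened, illustrative values; a real call would pass the fitted model's
# coef_ and intercept_ along with the data set's feature names.
params = build_script_params([-35.5, -243.1], 152.5, ["age", "sex"])
print(json.dumps(params, indent=2))
```

The printed JSON can be pasted into the "params" section of the search request.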

Writing custom inference code for every model type could get tiring. More complex models will demand an ever-growing library of functions. There are plenty of inference libraries out there to experiment with. Why reinvent the wheel? Can one be made to work with Painless?

m2cgen to Painless

m2cgen is a Python library that translates trained models into static code. While only specific models are supported, the generated code works great. Painless supports a subset of Java, and m2cgen has Java as a potential output. Generating Painless scripts from trained models is possible!

Well, it's not out of the box. The entry point for Painless is an object called params, so m2cgen's Java functions have to be adjusted for how Painless accepts outside parameters. Here is an example translating the Java output to a Painless-acceptable one:

import m2cgen as m2c
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the diabetes data set and hold out 20% for testing
diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=0
)
print(diabetes.feature_names)

# Train a small gradient-boosted regressor
model = xgb.XGBRegressor(max_depth=6, learning_rate=0.3, n_estimators=50)
model.fit(X_train, y_train)

# Export to Java, then rewrite array accesses such as input[8]
# into Painless params lookups such as params["s5"]
java_model = m2c.export_to_java(model)
java_model = java_model.replace("input", "params")
for idx, val in enumerate(diabetes.feature_names):
    java_model = java_model.replace("[" + str(idx) + "]", '["' + val + '"]')
print(java_model)

Here is the output (truncated):

double var0;
        if ((params["s5"]) >= (0.0216574483)) {
            if ((params["bmi"]) >= (0.0131946635)) {
                var0 = 72.2889786;
            } else {
            ...
            return ((((((((((((((((((((((((((((((((((((((((((((((((((0.5) + (var0)) + (var1)) + (var2)) + (var3)) + (var4)) + (var5)) + (var6)) + (var7)) + (var8)) + (var9)) + (var10)) + (var11)) + (var12)) + (var13)) + (var14)) + (var15)) + (var16)) + (var17)) + (var18)) + (var19)) + (var20)) + (var21)) + (var22)) + (var23)) + (var24)) + (var25)) + (var26)) + (var27)) + (var28)) + (var29)) + (var30)) + (var31)) + (var32)) + (var33)) + (var34)) + (var35)) + (var36)) + (var37)) + (var38)) + (var39)) + (var40)) + (var41)) + (var42)) + (var43)) + (var44)) + (var45)) + (var46)) + (var47)) + (var48)) + (var49);
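The chained string replacements in the translation step touch every occurrence of `input` and of `[0]` through `[9]` anywhere in the generated source. A narrower alternative, sketched here with Python's `re` module, rewrites only `input[N]` accesses in a single pass. The function name `java_to_painless_params` is hypothetical, and the sample string is a hand-written stand-in for m2cgen output, not real generated code.

```python
import re


def java_to_painless_params(java_source, feature_names):
    """Rewrite input[N] array accesses as params["<feature name>"] lookups."""

    def replace_access(match):
        index = int(match.group(1))
        return 'params["' + feature_names[index] + '"]'

    # \b keeps identifiers that merely end in "input" from matching.
    return re.sub(r"\binput\[(\d+)\]", replace_access, java_source)


# Hand-written stand-in for a fragment of generated Java.
sample = "if ((input[8]) >= (0.0216574483)) { var0 = input[2]; }"
names = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
print(java_to_painless_params(sample, names))
```

Either approach yields the same kind of params lookups; the regular expression simply leaves unrelated bracketed numbers alone.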

The generated script is humongous. Almost 7,000 lines. Anybody will tell you, that is too much.

But, does it work?

Ugh:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "exceeded max allowed stored script size in bytes [65535] with size [307597] for script [diabetes_xgboost_model]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "exceeded max allowed stored script size in bytes [65535] with size [307597] for script [diabetes_xgboost_model]"
  },
  "status" : 400
}

Script size limits are wise. Stored scripts are put in the cluster state object. The more stored scripts there are (and the larger they are), the more overall cluster performance starts to drag, which can lead to other problems.
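A size check before the PUT would have caught this early. The sketch below takes the default limit of 65535 bytes from the error message above and assumes that measuring the UTF-8 encoded script source is a good enough approximation; Elasticsearch may count the surrounding request body as well, so treat the result as a lower bound. `fits_stored_script_limit` is a hypothetical helper, not an Elasticsearch client function.

```python
DEFAULT_MAX_SCRIPT_BYTES = 65535  # default limit reported in the error message


def script_size_in_bytes(source):
    """Return the UTF-8 encoded size of a script source string."""
    return len(source.encode("utf-8"))


def fits_stored_script_limit(source, max_bytes=DEFAULT_MAX_SCRIPT_BYTES):
    """Approximate check: True when the source alone is within the limit."""
    return script_size_in_bytes(source) <= max_bytes


small_script = "return params.intercept;"
large_script = "x" * 307597  # same size as the rejected script in the error
print(fits_stored_script_limit(small_script))
print(fits_stored_script_limit(large_script))
```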

But what if I didn't care about my cluster health? I want my model and I want it now!

PUT _cluster/settings
{
  "transient": {
    "script.max_size_in_bytes": 10000000
  }
}

Sane limitations can't stop me!

Time to put the script:

PUT _scripts/diabetes_xgboost_model
{
  "script": {
    "lang": "painless",
    "source": """
    ...very large source...
    """
  }
}

Now I can use my stored script!

"regression": {
    "bucket_script": {
        "buckets_path": {
            "age": "age",
            "sex": "sex",
            "bmi": "bmi",
            "bp": "bp",
            "s1": "s1",
            "s2": "s2",
            "s3": "s3",
            "s4": "s4",
            "s5": "s5",
            "s6": "s6"
        },
        "script": {
          "id": "diabetes_xgboost_model"
        }
    }
}