DEV Community

DMetaSoul
DMetaSoul

Posted on

The design concept of an almighty Opensource project about machine learning platform

In my previous article, "Almighty Opensource project about machine learning you should try out", I introduced MetaSpore. In this article, I will introduce MetaSpore's design philosophy in detail. Next week, I will update the detailed steps of using MetaSpore to build industrial recommendation systems rapidly.
How did MetaSpore move its complex recommendation algorithms beyond the Internet giants to reach the vast majority of SMEs and developers? This article unveils MetaSpore's core design philosophy.

1.What is a one-stop machine learning platform
When it comes to machine learning, people tend to think of various machine learning frameworks, such as TensorFlow and PyTorch, etc. Many models implement Python code on GitHub. However, there are still a lot of difficulties in implementing the algorithm model in specific business scenarios._ Taking the recommendation system as an example, the following problems will be encountered in the implementation of the algorithm model in specific business scenarios:_

  • **How to generate training data (samples)? **In recommendation scenarios, it is often necessary to splice user feedback signals (exposure, click, etc.) with various feature sources, perform necessary feature cleaning and extraction, and divide data into verification sets, negative sampling, and other complex data processing. These processes are usually impossible in machine learning frameworks, so a big data system is needed to process bulk or streaming data.
  • **How to transfer training data generated by big data platforms to a deep learning framework? **Frameworks such as TensorFlow and PyTorch have their data input formats and corresponding DataLoader interfaces, requiring developers to parse the data. So how to deal with data fragmentation, variable-length characteristics, and other problems often puzzle algorithm engineers.
  • **How to run a distributed training that can freely schedule cluster resources, including GPUs? **This may require a dedicated operations team to manage machine learning-related hardware scheduling.
  • How to train the sparse feature model and use the NLP pre-training model? In the recommendation scenario, we need to be able to handle sparse features on a large scale to model the interesting relationship between users and goods. At the same time, multimodal model fusion has gradually become the frontier direction.
  • How to make online predictions efficiently after model training? In addition to cluster resources, elastic scheduling and load balancing are also involved. These problems require dynamic resource allocation in a heterogeneous environment with CPUs, GPUs, and NPUS. For Complex models, distillation, quantification, and other means are necessary to ensure their predictive performance.
  • After the model goes online, how does the online system extract splicing features, ensure the consistency of offline features, and evaluate the algorithm effect quantitatively? An online algorithm application framework that can integrate with an online system read all kinds of online data sources and provide a multi-layer ABTest traffic experiment function is necessary.
  • Finally, how to conduct efficient iteration in an algorithm experiment? Programmers want to run multiple, parallel experiments quickly to improve business performance rather than get bogged down in complex system environment configurations. Multiple teams and different systems often solve these problems in large Internet factories. For example, the following image is from the overall architecture of the recommendation system shared by Netflix's algorithm engineering team: Image description As can be seen from the above figure that building a complete, industrial-grade recommendation system is quite complex and tedious, requiring considerable knowledge in different fields and investment in engineering development. SMEs lack the staffing and a one-stop platform to solve these problems standardized.

_2.MetaSpore's one-stop machine learning platform_
The original intention of DMetaSoul to develop and open-source MetaSpore is to help enterprises and developers solve all kinds of problems encountered in the process of algorithm business development based on MetaSpore features and provide a one-stop development experience by using standardized components and development interfaces. Meet the needs of enterprises and developers to obtain algorithmic business development best practices. Specifically, MetaSpore has the following core functional design concepts:

  • Model training is seamlessly integrated with the big data system, which can directly read the structured and unstructured data of all kinds of data lakes and warehouses for training. Data feature preprocessing, and model training are seamlessly linked together, saving the tedious data import and export and format conversion.
  • Support for sparse features. Large-scale sparse Embedding layer training is necessary for search and generalization scenarios. Some processing of sparse features is involved, such as cross combination, variable-length feature pooling, etc., which requires special support from the training framework.
  • Provide high-performance online forecasting services. Online prediction services support neural networks (including sparse Embedding), decision trees, and a variety of traditional machine learning models. Supports heterogeneous hardware computing acceleration, reducing the engineering threshold for online deployment.
  • Unified offline feature calculation. Through the unified feature format and calculation logic, the unified offline feature calculation saves the repeated development of multiple systems and ensures the consistency of offline features.
  • Online algorithm application framework. The online algorithm application framework covers the common function points of online systems, such as automatic feature extraction from multi-data sources, feature calculation, predictive service interface, experimental dynamic configuration, ABTest dynamic cutting flow, etc.
  • Embrace open source. The MetaSpore platform provides several homegrown components to implement these core function points. At the same time, MetaSpore's development philosophy is to embrace the mature open source ecosystem as much as possible without fragmentation, lowering the barriers to learning and enabling developers to develop based on their existing experience quickly.

MetaSpore integrates data, algorithms, and online systems to provide a one-stop, full-process development experience for algorithms through these core functional points and design concepts.

2.1 **MetaSpore's integration with big data ecology**
A large class of machine learning algorithms deals with structured tabular data and conduct prediction, classification, regression, and modeling. For example, CTR estimation of recommended advertising and financial risk control are common. In the practice of this kind of algorithm, it is very important to preprocess data. The quality of the data is directly related to the final model effect. Common data processing includes feature splicing, null value filling, discrete bucket division, outlier filtering, feature engineering, and sample generation such as verification set division and random shuffle. These steps often rely on big data systems for processing in engineering practice, while deep learning frameworks lack comprehensive big data processing functions.
There will be many problems with data and training connection in this case. Big data systems have multiple storage formats and systems, making it difficult for deep learning frameworks to adopt one by one. Traditionally, algorithmic engineers are required to process data format conversion by themselves, converting the format of big data systems into formats that deep learning frameworks can recognize. Not only is this cumbersome, it creates data redundancy problems, but it is also challenging to implement incremental streaming learning.
MetaSpore provides an offline training framework using the Parameter Server architecture for such scenarios. The MetaSpore offline training framework integrates seamlessly with PySpark, allowing you to input the Spark DataFrame directly into model training with no format conversion required. Spark supports multiple data sources, including CSV, Parquet, and ORC. It also supports various data stores, data lakes, such as Hive, and streaming data such as Kafka. By docking with PySpark, MetaSpore can directly train all the data sources Spark can support, greatly simplifying the Pipeline for data processing.
In terms of implementation, MetaSpore takes advantage of the PandasUDF functionality provided by PySpark. PandasUDF is a Python vectorization UDF interface provided by PySpark that uses Arrow as an efficient data transfer protocol from JVM to Python. The Worker end of the MetaSpore training framework is a PandasUDF, which receives a batch of data and sends it to the model training module. The model training results, such as predicted values, are also sent back to Spark in Arrow format to form a new DataFrame. Take a look at a code example:

# Defining Neural Networks
module = metaspore.algos.MLP()
# Create a PyTorch Distributed Training Estimator
estimator = metaspore.PyTorchEstimator(module)
# Read training and validation sets via Spark
train_df = spark_session.read.csv('hdfs://movielens/train/')
test_df = spark_session.read.csv('hdfs://movielens/test/')
# Model training on the training set
model = estimator.fit(train_df)
# Use the trained model to make predictions on the test set
test_prediction_result = model.transform(test_df)
# The predicted result is Spark DataFrame,Operations that can be invoked on any DataFrame
evaluator = pyspark.ml.evaluation.BinaryClassificationEvaluator()
auc = evaluator.evaluate(test_prediction_result)
print(auc)
Enter fullscreen mode Exit fullscreen mode

You can see that MetaSpore integrates seamlessly with Spark. The read CSV in the middle can be replaced by any data source supported by Spark, and the model is trained after complex data processing in Spark.

2.2 **MetaSpore's support for sparse discrete feature training**
A large class of features is sparse and discrete in search, advertising, recommendation, and other personalized recommendation scenarios. This feature is usually converted to a fixed ID using one-hot Encoding and mapped to a Vector as a learning parameter. In this scenario, there will be two kinds of problems. One is that the space of feature ID is very large, usually over 100 million level; the other is that the ID space is not fixed, and the value types of some discrete features will change constantly.
MetaSpore provides a Parameter Server architecture to split the Embedding Table for the first problem. To solve the second problem, MetaSpore provides a complete set of dynamic sparse feature learning schemes, storing Embedding Table through hash Table and realizing the addition and removal of Embedding Vector. The original discrete eigenvalue hash, multi-value combination, and multi-value pooling are integrated and defined as a PyTorch Module for the recommendation scenario. Define a sparse MLP model using MetaSpore as follows:

class MLPLayer(torch.nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.sparse = metaspore.EmbeddingSumConcat(deep_embedding_dim,
                                           deep_column_name_path,
                                           deep_combine_schema_path)
        dense_layers=[]
        dnn_linear_num=len(hidden_lay_unit)
        dense_layers.append(metaspore.nn.Normalization(feature_dim*embedding_size))
        dense_layers.append(torch.nn.Linear(feature_dim*embedding_size, hidden_lay_unit[0]))
        dense_layers.append(torch.nn.ReLU())
        for i in range(dnn_linear_num - 2):
            dense_layers.append(torch.nn.Linear(hidden_lay_unit[i], hidden_lay_unit[i + 1]))
            dense_layers.append(torch.nn.ReLU())
        dense_layers.append(torch.nn.Linear(hidden_lay_unit[-2], hidden_lay_unit[-1]))
        self.dnn = torch.nn.Sequential(*dense_layers)

    def forward(self, input_raw_strings):
        sparse = self.sparse(input_raw_strings)
        return self.dnn(sparse)
Enter fullscreen mode Exit fullscreen mode

In the above code, by creating an EmbeddingSumConcat layer, you can accept sample input of raw string type, automatically hash, combine, pooling (default Sum), and input_raw_strings for the forward method. It is a batch of table data read by PySpark (a Pandas DataFrame object). Compared with TorchRec and other recommendation system deep learning frameworks, the API level is simplified, and the call logic is more direct and easy to understand.

2.3 **MetaSpore model Serving service**
The MetaSpore platform integrates online model Serving (prediction) services. Unlike TFServing, MetaSpore Serving has the following characteristics:

  • Support a wider range of models. In addition to sparse NN models generated by the offline training framework, MetaSpore Serving also supports NLP, CV, and other NN models, XGBoost, LightGBM, and other decision tree models. And Spark ML, SKLearn, and other machine learning library models.
  • Integrate feature processing logic. Feature calculation is an integral part of model Serving, such as sparse discrete feature processing logic, Tokenizer in NLP, etc. Traditionally, these logics are implemented in two offline sets, which are inefficient and prone to logical inconsistencies and other errors. MetaSpore Serving integrates the computational logic for sparse features and shares a code offline.
  • Supports heterogeneous hardware acceleration, such as CPU, GPU, and NPU. Model prediction usually requires selecting different hardware to perform calculations in different scenarios. MetaSpore Serving is capable of supporting common hardware for heterogeneous computing.

In MetaSpore Serving, OnnxRuntime is used as the computational library. And feature processing is used as prediction processing logic. The prediction calculation of each model, including several feature table calculations and the final Onnx model calculation, forms a DAG calculation graph. For example, in a typical Wide&Deep model, the computational logic at Serving can be roughly expressed as follows:

Image description

In this way, MetaSpore Serving can easily integrate various feature extraction and calculation modules, maximizing alignment with offline logic and simplifying the development experience.

2.4 **MetaSpore is uniformly calculated in offline feature**
For the Table class model (sparse or dense), MetaSpore takes Arrow Table as input offline and expresses the feature calculation logic as an execution plan of Arrow Compute. An expression map evaluates each feature. For discrete features, hashing and multi-value combinations are supported. For dense features, normalization, binarization, and bucket splitting are supported. Much of this calculation can be done directly using Arrow's built-in expression. For Arrow, which does not have built-in computations, you can register your own expressions using a custom Compute Kernel.

In terms of implementation, the expression of feature operation used offline is serialized and saved in the directory exported by the model. The same Arrow Compute expression is constructed after the online prediction service loads the expression so that the completely consistent feature calculation can be shared offline. In addition, because Arrow Compute supports filtering, joins, and other operations, LakeSoul also supports input to input batches. Thanks to the design of Arrow column expression, these calculations are vectorized and can have relatively high performance.

The offline feature calculation and offline input calculation are unified Arrow Table format. Arrow itself provides a multi-language interface to create Arrow tables, so MetaSpore Serving also easily supports multiple language calls. For example, with an XGBoost model with ten float type feature inputs, LakeSoul implements a program in Python that calls the MetaSpore prediction service:

import grpc
import metaspore_pb2
import metaspore_pb2_grpc
import pyarrow as pa
with grpc.insecure_channel('0.0.0.0:50051') as channel:
    stub = metaspore_pb2_grpc.PredictStub(channel)
    # Construct a one-line, 10-feature sample
    row = []
    values = [0.6558618,0.13005558,0.03510657,0.23048967,0.63329154,0.43201634,0.5795548,0.5384891,0.9612295,0.39274803]
    for i in range(10):
        row.append(pa.array([values[i]], type=pa.float32()))
    # Create Arrow RecordBatch
    rb = pa.RecordBatch.from_arrays(row, [f'field_{i}' for i in range(10)])
    # Serialization RecordBatch
    sink = pa.BufferOutputStream()
    with pa.ipc.new_file(sink, rb.schema) as writer:
        writer.write_batch(rb)

    # Construct GRPC request for MetaSpore Serving
    payload_map = {"input": sink.getvalue().to_pybytes()}
    request = metaspore_pb2.PredictRequest(model_name="xgboost_model", payload=payload_map)
    # Call Serving service prediction
    reply = stub.Predict(request)
    for name in reply.payload:
        with pa.BufferReader(reply.payload[name]) as reader:
            tensor = pa.ipc.read_tensor(reader)
            print(f'Predict Output Tensor: {tensor.to_numpy()}')
Enter fullscreen mode Exit fullscreen mode

It can be seen that the online prediction is also the Arrow Table of the input characteristics, and MetaSpore Serving can be directly invoked. The invocation method is also uniform offline.

2.5 [MetaSpore](https://github.com/meta-soul/MetaSpore**) online algorithm application framework**
Offline training frameworks and online Serving services are now available. Then, an algorithm in the business scene landing is still a final step: an online algorithm experiment. In a service scenario, to verify the validity of an algorithm model, a baseline needs to be established and compared with the new algorithm model. Therefore, an online experimental framework is needed which can easily define algorithm experiments, read online features, and call model prediction services. In addition, multiple experiments can be traffic segmented to achieve ABTest effect comparison. A configuration center is also needed to quickly carry out multiple experimental iterations, which can dynamically load refresh experiments and cut flow configurations, support hot loading of experimental parameters, and various debugging and trace functions. This link also directly determines whether the AI model can be finally implemented into practical business applications.

MetaSpore provides an online algorithm application framework that covers the entire flow of online experiments. This framework is based on SpringBoot + Spring Cloud and provides the following core functions:

1. Experiment Pipeline. Developers can add experimental classes with simple annotations:

@ExperimentAnnotation(name = "rank.widedeep")
@Component
public class WideDeepRankExperiment implements BaseExperiment<RecommendResult, RecommendResult>
{
    @Override
    public void initialize(Map<String, Object> map) {}

    @SneakyThrows
    @Override
    public RecommendResult run(Context context, RecommendResult recommendResult)
{}
}
Enter fullscreen mode Exit fullscreen mode

The Spring Cloud configuration center is then used to dynamically assemble multiple experimental flows, including flow segmentation for horizontal experiments and sequence of vertical experiments. Take the recommendation system as an example, there are two experiments in the recall layer and two experiments in the sorting layer, and the two layers are orthogonal so that the online experiment process can be assembled through the following configuration:

scene-config:
  scenes:
    - name: guess-you-like
      layers:
        - name: match
          normalLayerArgs:
            - experimentName: match.base
              ratio: 0.3
            - experimentName: match.multiple
              ratio: 0.7
        - name: rank
          normalLayerArgs:
            - experimentName: rank.wideDeep
              ratio: 0.5
            - experimentName: rank.lightGBM
              ratio: 0.5
  experiments:
    - layerName: match
      experimentName: match.base
      extraExperimentArgs:
        modelName: match.base
        matcherNames: [ItemCfMatcher]
    - layerName: match
      experimentName: match.multiple
      extraExperimentArgs:
        modelName: match.multiple
        matcherNames: [ItemCfMatcher, SwingMatcher, TwoTowersMatcher]
    - layerName: rank
      experimentName: rank.wideDeep
      extraExperimentArgs:
        modelName: movie_lens_wdl
        ranker: WideAndDeepRanker
        maxReservation: 100
    - layerName: rank
      experimentName: rank.lightGBM
      extraExperimentArgs:
        modelName: lightgbm_test_model
        ranker: LightGBMRanker
        maxReservation: 100
Enter fullscreen mode Exit fullscreen mode

After the configuration and experimental class implementation are added to the SpringBoot project, the framework will automatically create and initialize the objects of the experimental class and execute them by order of the experimental layer, and automatically select an experiment to execute in each layer according to the tangential flow configuration. The developer can freely define the input and output between experiments, providing a high degree of flexibility. At the same time, this configuration file is placed in the configuration center, supporting dynamic hot update, modification of experimental configuration, and cutting flow configuration, which can take effect in real-time.
**2.Online feature extraction framework. **In online prediction, we need to construct an Arrow Table of features. However, online features can come from multiple sources, often with various databases, caches, and upstream services. MetaSpore, based on SpringBoot JPA, encapsulates a feature extraction code generation framework that automatically generates database access code only after the developer defines the schema and data source of the feature.
**3.MetaSpore provides a Client implementation of online Serving, **supporting SpringBoot and providing annotation injection to access Serving.

2.6 MetaSpore embraces the open-source ecosystem
MetaSpore itself is an open-source project, and MetaSpore uses a mix of mature open source ecosystems. For example, in offline training, MetaSpore's offline training framework Bridges PySpark and PyTorch, two popular frameworks in big data and deep learning, seamlessly combine their ecology better to address the pain points of real business scenarios. The online service is also fully integrated with gRPC, SpringBoot, and Spring Cloud, allowing developers to obtain the best practices of landing big data intelligent applications in the existing standard technology stack.

MetaSpore's philosophy is to embrace the open-source ecosystem fully. For example, for offline NLP large model training, developers who have the need can still use the existing open-source framework, such as OneFlow, to train the model; MetaSpore can continue to provide the ability to incorporate pre-training of the large model into multimodal model learning, as well as online Serving of the NLP model, making the NLP large model more inclusive.

3.The conclusion
This article details the design philosophy and core features of MetaSpore's one-stop machine learning platform to help readers better understand MetaSpore. Want to learn more about MetaSpore business scenario implementation.

Next week, I will update the detailed steps of using MetaSpore to build industrial recommendation systems rapidly.

Top comments (0)