
Aman Gupta

What is LLMOps


  • MLOps - an ML engineering practice that unifies ML development (Dev) and ML operations (Ops). It covers the stages of building an ML system, including:
    • integration
    • testing
    • releasing the model
    • deployment, as an API
    • infrastructure management, using specific hardware

  • Orchestration - telling the machine what to do and in what order
  • MLOps for LLMs vs. LLM system design:

  • LLM-driven application:
    • It all starts with the user interface: the user gives an input, which is pre-processed and passed through grounding to generate the prompt that is fed into the model.
    • When the model responds, the response again passes through grounding and post-processing, for example to check for toxicity, bias, or any other output that is undesirable for our specific use case.
    • If we want to use a custom model instead, we have to go through an iterative process of model customization: data preparation, tuning, and evaluation, repeated until we get satisfying results.
    • After this, the fine-tuned model is used in the application, providing responses based on prompts.
    • In this article we’ll focus on how to customize the model, deploy it, and get output from it.
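
The request flow described above can be sketched roughly as follows. This is only a minimal illustration: `KNOWLEDGE`, `BLOCKED_TERMS`, and `fake_model` are hypothetical stand-ins (a real application would call an actual LLM, a real grounding source, and a real toxicity or bias check), and for brevity the response side only shows the post-processing check.

```python
# Hypothetical grounding facts; a real system might query a search index or database.
KNOWLEDGE = {"refund": "Refunds are processed within 5 business days."}

# Hypothetical blocklist standing in for a real toxicity/bias check.
BLOCKED_TERMS = {"idiot", "stupid"}


def preprocess(user_input: str) -> str:
    """Normalize whitespace in the raw user input."""
    return " ".join(user_input.split())


def ground(text: str) -> str:
    """Build a prompt by attaching stored facts whose keyword appears in the input."""
    facts = [fact for key, fact in KNOWLEDGE.items() if key in text.lower()]
    context = "\n".join(facts) if facts else "(no context found)"
    return f"Context:\n{context}\n\nQuestion: {text}\nAnswer:"


def fake_model(prompt: str) -> str:
    """Placeholder for the LLM call; it simply returns the first context line."""
    return prompt.splitlines()[1]


def postprocess(response: str) -> str:
    """Replace the response if it contains a blocked term."""
    if any(term in response.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't help with that."
    return response


def answer(user_input: str) -> str:
    """User input -> pre-processing -> grounding -> model -> post-processing."""
    return postprocess(fake_model(ground(preprocess(user_input))))
```

Calling `answer("How do I get a refund?")` walks the input through each stage in order, so any single stage can be swapped out without touching the others.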

  • LLMOps pipeline:

    • It starts with data preparation and designing a pipeline, which produces an artifact. That artifact drives the pipeline execution, which handles deploying the model.
    • Once the model is deployed, we can use it to make predictions and then apply responsible AI checks to the outputs as a safety measure.
    • There are 2 main parts of this pipeline:
      • Orchestration - where we tell the pipeline to perform certain steps automatically, from loading the data to feeding it in for fine-tuning
      • Automation - automating the training of the model (maybe with new data) to make our life easier
    • Then comes deployment - taking the trained model and deploying it in the production environment
  • For orchestration we can use tools like Kubeflow or Apache Airflow to design our workflows. I’ll explain Kubeflow for this article’s sake:

    • Kubeflow has 2 concepts -
      • Components - each component is a step in the workflow
      • Pipelines - the workflow as a whole
    • Kubeflow understands a DSL (domain-specific language) that is used to define the configuration of the workflow
    • When we define a component, its code runs in a containerized environment. A container packages the code together with its dependencies and the OS-level environment needed to run it, so we are not dependent on the machine underneath.
    • After defining the components and the pipeline workflow, we can export it as a YAML file.
    • Once we have that file we can use it to execute our workflow. To avoid managing the machines or VMs ourselves, we can execute the file in a managed serverless environment, like Vertex AI Pipelines.
    • Once we have a pipeline, we can reuse it for different use cases.
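
To make the component/pipeline split concrete, here is a toy sketch in plain Python. It is not the Kubeflow API: in the actual Kubeflow Pipelines SDK these roles are played by decorators from `kfp.dsl`, each component runs in its own container, and the export format is YAML. The `Pipeline` class and the three step functions below are hypothetical names made up for illustration.

```python
import json
from typing import Callable, Dict, List, Optional

# A "component" here is just a function that takes and returns a state dict.
Component = Callable[[Dict], Dict]


def load_data(state: Dict) -> Dict:
    """Component 1: pretend to load raw training examples."""
    return {**state, "rows": ["good product", "bad service", ""]}


def clean_data(state: Dict) -> Dict:
    """Component 2: drop empty rows."""
    return {**state, "rows": [r for r in state["rows"] if r]}


def tune_model(state: Dict) -> Dict:
    """Component 3: placeholder for fine-tuning; records how many rows were used."""
    return {**state, "model": f"tuned-on-{len(state['rows'])}-rows"}


class Pipeline:
    """The workflow as a whole: an ordered list of components."""

    def __init__(self, name: str, steps: List[Component]):
        self.name = name
        self.steps = steps

    def export(self) -> str:
        """Serialize the workflow definition (JSON here; Kubeflow exports YAML)."""
        return json.dumps({"name": self.name, "steps": [s.__name__ for s in self.steps]})

    def run(self, state: Optional[Dict] = None) -> Dict:
        """Execute every component in order, passing the state along."""
        state = dict(state or {})
        for step in self.steps:
            state = step(state)
        return state


pipeline = Pipeline("tuning-pipeline", [load_data, clean_data, tune_model])
result = pipeline.run()
```

Because the definition (`export`) is separate from the execution (`run`), the same pipeline can be re-run with different inputs, which is the reuse idea from the last bullet.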

  • Deployment:
    • There are two types:
      • Batch - for example, if we have a set of customer reviews and want the results for all of them in one go, there is no need to do it in real time
      • REST APIs - for real-time use cases like a chat application, where we need responses with low latency
    • When serving the model we use multiple endpoints, i.e. we deploy the model on multiple servers for load balancing. Every time a user calls the API, the request is assigned to one of the model copies at random, which splits the workload.
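
As a rough sketch of the random assignment described above, assuming three hypothetical endpoints that each host a copy of the same model (in a managed service the load balancer does this for you, and `predict` below is only a placeholder for the real model call):

```python
import random
from collections import Counter

# Hypothetical endpoint names; each one would host a copy of the same model.
ENDPOINTS = ["endpoint-a", "endpoint-b", "endpoint-c"]


def pick_endpoint(rng: random.Random) -> str:
    """Assign an incoming request to one endpoint at random."""
    return rng.choice(ENDPOINTS)


def predict(review: str, endpoint: str) -> str:
    """Placeholder for the model call made against the chosen endpoint."""
    label = "positive" if "good" in review.lower() else "negative"
    return f"{endpoint}: {label}"


def predict_batch(reviews, rng):
    """Batch mode: score every review in one go, with no latency requirement."""
    return [predict(review, pick_endpoint(rng)) for review in reviews]


rng = random.Random(0)  # seeded only so the sketch is repeatable

# Simulate many real-time calls and count how they spread over the endpoints.
counts = Counter(pick_endpoint(rng) for _ in range(3000))
```

With enough requests, `counts` ends up roughly even across the endpoints, which is the workload split the random assignment is meant to achieve.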

  • Beyond deployment:
    • Packaging, deployment, and versioning - we need to maintain a consistent versioning scheme for the models, so it’s easy to backtrack to an earlier version if needed
    • Model monitoring - we need to monitor the metrics and safety of the responses provided by the model
    • Inference scalability:
      • Load testing
      • Controlled rollout
    • Latency - the permissible latency should be discussed with the stakeholders, for example how long we want the user to wait for a response. To lower the latency we can:
      • Use a smaller model
      • Use faster processors (GPU, TPU)
      • Host the model in a region chosen based on where the application runs
    • Formatting the prompt in production - changing the input to make it similar to the training data (using the same instruction we used for training)
    • Safety attributes - responses can be filtered out depending on the requirements or guidelines
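
Three of these points (prompt formatting, safety attributes, and latency) can be sketched in a few lines. Everything named here is an assumption for illustration: `TRAINING_INSTRUCTION` stands for whatever instruction was actually used during tuning, `MAX_SAFETY_SCORE` is an arbitrary threshold, and the `safety_scores` dict stands for whatever per-attribute scores your model provider returns.

```python
import time

# Hypothetical instruction; in practice reuse the exact text used when tuning.
TRAINING_INSTRUCTION = "Please answer the following question:"

# Hypothetical threshold; the acceptable level depends on your guidelines.
MAX_SAFETY_SCORE = 0.5


def format_prompt(user_question: str) -> str:
    """Wrap the production input in the same instruction used for training."""
    return f"{TRAINING_INSTRUCTION}\nQuestion: {user_question.strip()}\nAnswer:"


def filter_response(text: str, safety_scores: dict) -> str:
    """Return the text only if every safety attribute score is under the threshold."""
    flagged = sorted(
        name for name, score in safety_scores.items() if score >= MAX_SAFETY_SCORE
    )
    if flagged:
        return "Response withheld (flagged: " + ", ".join(flagged) + ")"
    return text


def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds) to compare against a latency budget."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Wrapping the model call in `timed` gives a measured latency that can be compared against whatever budget was agreed with the stakeholders.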

These are my personal notes for the LLMOps short course by

Here is the code
Thank you for reading :)
