Mahmoud Sehsah

Deploying HuggingFace Chat UI with the Hugging Face Text Generation Inference Server

Introduction

Before we dive into deploying the Hugging Face Chat UI, let's first explore the capabilities of the Hugging Face Text Generation Inference Server. We'll start with a practical walkthrough, demonstrating how to access and use its API endpoints effectively. This initial exploration is key to understanding the various configurations available for text generation and how they can enhance your AI interactions.

Start The Hugging Face Inference Server

In this section, we launch the Hugging Face Text Generation Inference Server configured with 8-bit quantization. This setting is pivotal for optimizing GPU memory utilization, allowing the model to fit into less VRAM at a small cost in precision. For detailed setup instructions, please refer to this link.

export model=mistralai/Mistral-7B-v0.1
export volume=$PWD/data 
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --quantize=bitsandbytes --model-id $model
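On the first launch, the server downloads the model weights into the mounted volume ($PWD/data), which can take a while. Before sending requests, you can poll the health endpoint (covered in more detail below) until the server reports ready. A minimal sketch, assuming the default port mapping above:

# Poll until the server answers on /health (curl -f fails on error statuses)
until curl -sf http://127.0.0.1:8080/health > /dev/null; do
  echo "Waiting for the inference server to become ready..."
  sleep 5
done
echo "Server is ready."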


Discover the Hugging Face Inference Server Endpoints

Call the default generate endpoint

curl --location 'http://127.0.0.1:8080/generate' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'
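If the server is up, the response comes back as a single JSON object. The generated text will vary from run to run, but the shape should look roughly like this:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural"}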

Call the streaming endpoint

curl --location 'http://127.0.0.1:8080/generate_stream' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'
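Rather than one final JSON object, the streaming endpoint emits server-sent events, one data: line per generated token, with the full text attached to the last event. The output below is trimmed and approximate:

data:{"token":{"id":13,"text":"\n","logprob":-0.9,"special":false},"generated_text":null,"details":null}
data:{"token":{"id":21603,"text":" Deep","logprob":-0.4,"special":false},"generated_text":null,"details":null}
...
data:{"token":{"id":28723,"text":".","logprob":-0.2,"special":false},"generated_text":"\n\nDeep learning is ...","details":null}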

Call the generate endpoint with sampling activated

curl --location 'http://127.0.0.1:8080/generate' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":100, "do_sample":true, "top_k":50 }}'
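With do_sample enabled, the server samples from the model's output distribution instead of always picking the most likely token, and top_k restricts each sampling step to the 50 most probable candidates. As a result, repeated calls with the same prompt will produce different completions.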

Call the generate endpoint while changing the temperature

curl --location 'http://127.0.0.1:8080/generate' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":50, "do_sample":true, "top_k":50, "temperature":0.2 }}'
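Temperature rescales the token probabilities before sampling: a low value such as 0.2 sharpens the distribution toward the most likely tokens, yielding more focused and deterministic output, while values closer to 1.0 preserve more of the model's natural diversity.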

For more generation strategies, please refer to this link: https://huggingface.co/docs/transformers/generation_strategies

Monitoring with Health, Info, and Metrics API Endpoints

Ensuring System Health

curl --location 'http://127.0.0.1:8080/health'
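A healthy server answers with an empty 200 OK; while the model is still loading, or if the server cannot serve requests, you should see an error status instead. This makes the endpoint a convenient liveness probe for orchestrators such as Kubernetes.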

Retrieving Server Information

curl --location 'http://127.0.0.1:8080/info'
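This endpoint reports the serving configuration as JSON. The fields and values below are indicative of what a 1.3 deployment returns; your output may differ:

{
  "model_id": "mistralai/Mistral-7B-v0.1",
  "model_dtype": "torch.float16",
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "version": "1.3.0",
  ...
}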


Accessing Performance Metrics Endpoint

curl --location 'http://127.0.0.1:8080/metrics'
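The metrics are exposed in Prometheus text format, ready to be scraped. In my experience the metric names are prefixed with tgi_, so you can filter for a subset, for example:

# Show only request-related counters and histograms (assumes the tgi_ prefix)
curl -s http://127.0.0.1:8080/metrics | grep '^tgi_request'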


Install Hugging Face Chat UI

Clone the Repository

Initiate your project by cloning the Hugging Face Chat UI repository:

git clone https://github.com/huggingface/chat-ui.git

Configure the Environment

After cloning the repository, you'll need to set up your environment by editing the .env file. This involves specifying the correct IP addresses for your MongoDB instance and the Hugging Face Text Generation Inference Server.

Editing MongoDB Configuration:

Locate and edit the MONGODB_URL in the .env file to point to your MongoDB instance. Replace ${MONGO_DB_IP} with the actual IP address of your MongoDB server.

MONGODB_URL=mongodb://${MONGO_DB_IP}:27017

Setting Up Text Generation Inference Server Connection:

In the same .env file, ensure that the Hugging Face Text Generation Inference Server is correctly configured. Below is a JSON configuration snippet that you'll need to adjust based on your setup. It's important to recognize that the MODELS variable encapsulates your models' configurations:

{
      "name": "mistralai/Mistral-7B-Instruct-v0.1-local",
      "displayName": "mistralai/Mistral-7B-Instruct-v0.1-name",
      "description": "Mistral 7B is a new Apache 2.0 model, released by Mistral AI that outperforms Llama2 13B in benchmarks.",
      "websiteUrl": "https://mistral.ai/news/announcing-mistral-7b/",
      "preprompt": "",
      "chatPromptTemplate" : "<s>{{#each messages}}{{#ifUser}}[INST] {{#if @first}}{{#if @root.preprompt}}{{@root.preprompt}}\n{{/if}}{{/if}}{{content}} [/INST]{{/ifUser}}{{#ifAssistant}}{{content}}</s>{{/ifAssistant}}{{/each}}",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "max_new_tokens": 1024,
        "stop": ["</s>"]
      },
      "endpoints": [{
        "type" : "tgi",
        "url": "http://${TEXT_GENERATION_INFERENCE_SERVER}:80/",
        }],
      "promptExamples": [
      {
          "title": "Assist in a task",
          "prompt": "How do I make a delicious lemon cheesecake?"
        }
      ]
    }
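Note that the chatPromptTemplate above follows Mistral's instruction format: each user turn is wrapped in [INST] ... [/INST] tags and each assistant turn ends with </s>, which is also why </s> appears in the stop parameter.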

Build the Chat UI Docker image

DOCKER_BUILDKIT=1 docker build -t hugging-face-ui .

Run MongoDB

docker run -d -p 27017:27017 --name mongo-chatui mongo:latest

Run the Hugging Face Chat UI

docker run -p 3000:3000 hugging-face-ui
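One caveat: with the commands above, the Chat UI container cannot reach MongoDB through 127.0.0.1, because inside the container that address points to the container itself. A minimal sketch of one way to wire the two together, using a user-defined Docker network (the network name here is illustrative):

# Create a shared network and attach both containers to it
docker network create chat-net
# Replaces the earlier MongoDB run
docker run -d --network chat-net --name mongo-chatui -p 27017:27017 mongo:latest
# In .env, set MONGODB_URL=mongodb://mongo-chatui:27017 and rebuild the image, then:
docker run --network chat-net -p 3000:3000 hugging-face-ui

Once the container is up, the Chat UI should be reachable at http://localhost:3000.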
