Mahmoud Sehsah

Deploying Hugging Face Chat UI with the Hugging Face Text Generation Inference Server

Introduction

Before we dive into deploying the Hugging Face Chat UI, let's first explore the capabilities of the Hugging Face Text Generation Inference (TGI) Server. We'll start with a practical walkthrough demonstrating how to access and use its API endpoints effectively. This initial exploration is key to understanding the text generation configurations available and how they can enhance your AI interactions.

Start the Hugging Face Inference Server

In this section, we launch the Hugging Face Text Generation Inference Server configured with 8-bit quantization. This setting is pivotal for optimizing GPU memory utilization and ensuring efficient resource management. For detailed setup instructions, please refer to this link.



# Model to serve and a local directory to cache the downloaded weights
export model=mistralai/Mistral-7B-v0.1
export volume=$PWD/data

# Launch TGI 1.3 with 8-bit (bitsandbytes) quantization; the first run downloads the weights into $volume
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --quantize=bitsandbytes --model-id $model


Discover Hugging Face Inference Server endpoints

Call the default generate endpoint



curl --location 'http://127.0.0.1:8080/generate' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'


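The endpoint responds with a JSON object containing the completion. The exact text will vary from run to run; the response below is illustrative:


curl -s http://127.0.0.1:8080/generate ... # returns, for example:
{"generated_text":"\n\nDeep Learning is a subset of Machine Learning that uses artificial neural networks to learn from data."}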

Call the streaming endpoint



curl --location 'http://127.0.0.1:8080/generate_stream' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'


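Instead of a single final JSON object, generate_stream returns Server-Sent Events: one data: line per generated token, with the full text attached to the last event. The shape below is representative of TGI's streaming output; the token ids, logprobs, and text are illustrative:


data:{"token":{"id":13,"text":"\n","logprob":-1.02,"special":false},"generated_text":null,"details":null}
data:{"token":{"id":23562,"text":"Deep","logprob":-0.45,"special":false},"generated_text":null,"details":null}
...
data:{"token":{"id":28723,"text":".","logprob":-0.31,"special":false},"generated_text":"\nDeep Learning is ...","details":null}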

Call the generate endpoint while activating sampling

Setting "do_sample": true switches from greedy decoding to sampling, while "top_k": 50 restricts each step to the 50 most likely tokens.



curl --location 'http://127.0.0.1:8080/generate' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":100, "do_sample":true, "top_k":50 }}'



Call the generate endpoint while changing the temperature

A lower temperature (here 0.2) sharpens the token distribution and makes outputs more deterministic; higher values produce more varied text.



curl --location 'http://127.0.0.1:8080/generate' \
--header 'Content-Type: application/json' \
--data '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":50, "do_sample":true, "top_k":50, "temperature":0.2 }}'



For more generation strategies, please refer to this link: https://huggingface.co/docs/transformers/generation_strategies

Monitoring with Health, Info, and Metrics API Endpoints

Ensuring System Health



curl --location 'http://127.0.0.1:8080/health'


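The health endpoint returns HTTP 200 once the model is loaded and ready to serve, which makes it handy as a readiness probe. Below is a minimal polling sketch, assuming the server runs on 127.0.0.1:8080:


# Poll /health until the server reports ready (HTTP 200)
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/health)" = "200" ]; do
  echo "Waiting for the inference server..."
  sleep 5
done
echo "Server is healthy."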

Retrieving Server Information



curl --location 'http://127.0.0.1:8080/info'


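The info endpoint describes the running deployment, including the served model and the server's limits. An abridged example response (field values are illustrative and depend on your setup):


{"model_id":"mistralai/Mistral-7B-v0.1","model_dtype":"torch.float16","max_input_length":1024,"max_total_tokens":2048,"version":"1.3.0"}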


Accessing Performance Metrics Endpoint



curl --location 'http://127.0.0.1:8080/metrics'


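The metrics endpoint exposes Prometheus-format metrics, so it can be scraped directly by a Prometheus server. To eyeball the request-level counters from the shell (assuming the tgi_ metric prefix used by recent TGI releases):


# Show only TGI request metrics from the Prometheus-format output
curl -s http://127.0.0.1:8080/metrics | grep '^tgi_request'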


Install Hugging Face Chat UI

Clone the Repository

Start your project by cloning the Hugging Face Chat UI repository:



git clone https://github.com/huggingface/chat-ui.git



Configure the Environment

After cloning the repository, you'll need to set up your environment by editing the .env file. This involves specifying the correct IP addresses for your MongoDB instance and the Hugging Face Text Generation Inference Server.

Editing MongoDB Configuration:

Locate and edit the MONGODB_URL in the .env file to point to your MongoDB instance. Replace ${MONGO_DB_IP} with the actual IP address of your MongoDB server.



MONGODB_URL=mongodb://${MONGO_DB_IP}:27017


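For example, if MongoDB runs in a container on the same host (as set up later in this guide), the value can simply point at localhost:


MONGODB_URL=mongodb://127.0.0.1:27017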

Setting Up Text Generation Inference Server Connection:

In the same .env file, ensure that the Hugging Face Text Generation Inference Server is correctly configured. Below is a JSON configuration snippet that you'll need to adjust based on your setup. It's important to recognize that this object lives inside the MODELS array, which encapsulates your models' configurations:



{
  "name": "mistralai/Mistral-7B-Instruct-v0.1-local",
  "displayName": "mistralai/Mistral-7B-Instruct-v0.1-name",
  "description": "Mistral 7B is a new Apache 2.0 model, released by Mistral AI, that outperforms Llama2 13B in benchmarks.",
  "websiteUrl": "https://mistral.ai/news/announcing-mistral-7b/",
  "preprompt": "",
  "chatPromptTemplate": "<s>{{#each messages}}{{#ifUser}}[INST] {{#if @first}}{{#if @root.preprompt}}{{@root.preprompt}}\n{{/if}}{{/if}}{{content}} [/INST]{{/ifUser}}{{#ifAssistant}}{{content}}</s>{{/ifAssistant}}{{/each}}",
  "parameters": {
    "temperature": 0.1,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_k": 50,
    "max_new_tokens": 1024,
    "stop": ["</s>"]
  },
  "endpoints": [{
    "type": "tgi",
    "url": "http://${TEXT_GENERATION_INFERENCE_SERVER}:80/"
  }],
  "promptExamples": [
    {
      "title": "Assist in a task",
      "prompt": "How do I make a delicious lemon cheesecake?"
    }
  ]
}


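In the .env file itself, the object above sits inside the backtick-quoted MODELS array. Abbreviated, the entry looks like this:


MODELS=`[
  {
    "name": "mistralai/Mistral-7B-Instruct-v0.1-local",
    ...
  }
]`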

Build the Chat UI Docker image



DOCKER_BUILDKIT=1 docker build -t hugging-face-ui .



Run MongoDB



docker run -d -p 27017:27017 --name mongo-chatui mongo:latest



Run the Hugging Face Chat UI



docker run -p 3000:3000 hugging-face-ui


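Once the container is up, the chat interface should be reachable on port 3000. A quick check from the shell (then open http://localhost:3000 in a browser):


# Expect HTTP 200 once the UI is serving
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3000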
