Chetan Menge


101- Databricks Model Serving- Saving Cost

I started exploring the Databricks DBRX Instruct LLM. While browsing the Databricks Marketplace, I installed and served the model by following the steps in the sample notebook provided.

I was able to serve the model successfully and interacted with it through a few prompts. It was only a couple of days later, when a budget alert notification arrived, that I realised I had exceeded my planned budget by a wide margin.

Lessons learned:

  1. Downloading an LLM from the Marketplace is free.
  2. Serving an LLM is not: as with other cloud-hosted resources, the cost-saving move is to scale the served LLM endpoint down when it is not in use.
  3. A model obtained from the Marketplace can be served using "Databricks Model Serving", which exposes the model as a REST endpoint on serverless compute.
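As a sketch of that third point: once served, the model is just an HTTPS endpoint you can call. The workspace URL, endpoint name, and token below are hypothetical placeholders, not values from my setup.

```python
# Sketch of querying a Databricks Model Serving endpoint over REST.
# Workspace URL, endpoint name, and token are hypothetical placeholders.
import json

def build_invocation_request(workspace_url: str, endpoint_name: str, prompt: str):
    """Return the URL and JSON body for a chat-style invocation."""
    url = f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations"
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return url, json.dumps(body)

url, body = build_invocation_request(
    "https://my-workspace.cloud.databricks.com",  # hypothetical workspace
    "dbrx-instruct-endpoint",                     # hypothetical endpoint name
    "What is Databricks Model Serving?",
)

# The actual call needs a valid personal access token, e.g. with urllib:
# import urllib.request
# req = urllib.request.Request(
#     url, data=body.encode(),
#     headers={"Authorization": "Bearer <token>",
#              "Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```

Every request like this wakes the endpoint and keeps it billed, which is exactly why the scale-down settings later in this post matter.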

Please find below the details, with screenshots for reference, for downloading and serving the DBRX model.

Model Download

On the Databricks Workspace portal, go to Marketplace and search for an LLM, e.g. the DBRX models.

The model and its details will be shown as below:

[Screenshot: DBRX model listing in the Marketplace]

Click "Get instant access" to download the model into your environment.

Validation of Model in Unity Catalog

Once downloaded, the model will be available in Unity Catalog, as shown below:
[Screenshot: downloaded model in Unity Catalog]

If it is listed in Unity Catalog, the model has been downloaded and is available for use.
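You can also check this programmatically. A minimal sketch, assuming you run it inside a Databricks notebook; the catalog and schema names below are hypothetical placeholders:

```python
# Sketch: confirm the downloaded model is registered in Unity Catalog.
# Unity Catalog addresses models as <catalog>.<schema>.<model>; the catalog
# and schema names used below are hypothetical placeholders.

def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Build the three-level Unity Catalog model name."""
    return f"{catalog}.{schema}.{model}"

name = uc_model_name("databricks_dbrx_models", "models", "dbrx_instruct")
print(name)  # databricks_dbrx_models.models.dbrx_instruct

# In a Databricks notebook you could then list the registered versions
# with the MLflow client pointed at Unity Catalog:
# import mlflow
# mlflow.set_registry_uri("databricks-uc")
# client = mlflow.MlflowClient()
# for v in client.search_model_versions(f"name = '{name}'"):
#     print(v.name, v.version, v.status)
```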

Serving the Model through an Endpoint

Go to Unity Catalog and select a specific model, e.g. dbrx_instruct. You can create the endpoint and serve the model by clicking the "Serve this model" button in the model UI.

The page below prompts you to select the configuration before serving the model.

[Screenshot: endpoint configuration page]
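The same choices the UI offers can also be expressed as a payload for the Serving REST API (`POST /api/2.0/serving-endpoints`). A hedged sketch; the endpoint, catalog, and schema names are hypothetical placeholders:

```python
# Sketch: build a create-endpoint payload for the Databricks serving API.
# Endpoint name and the Unity Catalog model path are hypothetical placeholders.
import json

def endpoint_config(endpoint_name: str, uc_model: str, version: str) -> dict:
    """Payload for POST /api/2.0/serving-endpoints."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [{
                "entity_name": uc_model,       # <catalog>.<schema>.<model>
                "entity_version": version,
                "workload_size": "Small",      # maps to a concurrency range
                "scale_to_zero_enabled": True, # scale down when idle
            }]
        },
    }

payload = endpoint_config(
    "dbrx-instruct-endpoint",                       # hypothetical
    "databricks_dbrx_models.models.dbrx_instruct",  # hypothetical UC path
    "1",
)
print(json.dumps(payload, indent=2))
```

Note `scale_to_zero_enabled`: it is the API counterpart of the UI option discussed in the next section, and the one setting I wish I had checked on day one.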

Saving Cost of Serving Model Endpoint

While serving the model, make sure to expand the Advanced configuration section, which has the "Scale to zero" option. Please refer to the screenshot below for details.

[Screenshot: "Scale to zero" option under Advanced configuration]

If "Scale to zero" is not selected, the minimum charge depends on the minimum provisioned concurrency of the chosen concurrency range.

If "Scale to zero" is selected, the endpoint scales down automatically after 30 minutes without requests, entering a fully scaled-to-zero (idle) state. You are not charged during this period. When a new request arrives, the endpoint exits the idle state and begins scaling up, at which point charging resumes.
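To make the difference concrete, here is a deliberately simplified back-of-the-envelope sketch. The hourly rate and idle hours below are made-up illustrative numbers, not Databricks prices; see the pricing page linked below for real rates.

```python
# Illustrative only: compare monthly idle-time cost with and without
# scale to zero. HOURLY_RATE is a made-up placeholder, not a real price.
HOURLY_RATE = 10.0          # hypothetical $/hour for minimum provisioned concurrency
IDLE_HOURS_PER_MONTH = 600  # e.g. nights and weekends with no traffic

# Without scale to zero, minimum provisioned concurrency bills even when idle.
cost_without_scale_to_zero = HOURLY_RATE * IDLE_HOURS_PER_MONTH

# With scale to zero, a fully idle endpoint is not charged at all.
cost_with_scale_to_zero = 0.0

print(f"Idle cost without scale to zero: ${cost_without_scale_to_zero:,.0f}")
print(f"Idle cost with scale to zero:    ${cost_with_scale_to_zero:,.0f}")
```

Even with modest numbers, the idle-time charge alone is what blew my planned budget.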

Reference:

Model Serving Pricing | Databricks

Databricks Model Serving simplifies the deployment of machine learning models as APIs, enabling real-time predictions within seconds or milliseconds.

