Davide Bedin
Unlocking the Power of Azure OpenAI with Azure API Management

In today’s world, artificial intelligence (AI) plays a pivotal role in transforming businesses. Azure OpenAI Service (AOAI in the article) combines the power of OpenAI’s advanced models with the security and enterprise capabilities of Azure. Harnessing the full potential of Azure OpenAI requires effective management, security, and scalability: this is where Azure API Management (APIM) steps in to help.

TL;DR

As I engaged with customers on Azure API Management + Azure OpenAI scenarios, I had the chance to combine and extend a few excellent, ready-to-use scenarios and samples, specifically to track the tokens used by each application.

In a nutshell I investigated:

  1. How to correlate the diagnostics from APIM with the ones coming from AOAI.
  2. How to track the tokens used by each AOAI request, therefore measuring AOAI token usage per APIM subscription, that is, per application.

This post describes how to do it.

All code, queries, and the workbook are in the feature branch dabedin/apim-aoai-smart-loadbalancing/tree/feature/tracingAOAI, which is a fork of the great Smart load balancing for OpenAI endpoints and Azure API Management repository described below.

What Is Azure API Management?

Azure API Management is a robust solution that allows organizations to expose APIs securely, manage access, and monitor usage. It acts as a gateway, enabling controlled access to APIs while shielding sensitive keys. When combined with Azure OpenAI, it becomes a central capability that enhances the overall application and user experience.

Azure API Management for Azure OpenAI

Here's a list of my favorite Azure OpenAI with Azure API Management architectures and samples:

What is this post all about?

Together with the customer, we built a scenario combining many of the previously described approaches. Let's start with a diagram:

(Diagram: the architecture combining APIM with multiple AOAI resources)

In the architecture presented above, APIM's role is:

  1. Governing access of external applications (via subscriptions) to AOAI, leveraging APIM managed identity authentication (https://learn.microsoft.com/en-us/azure/api-management/api-management-authenticate-authorize-azure-openai#authenticate-with-managed-identity) and preventing the spread of AOAI access keys.
  2. Smart load balancing of requests from external applications to OpenAI LLMs across multiple AOAI resources, whether because you purchased provisioned throughput for predictable performance and cost savings, and/or because you want to increase overall resilience by distributing load among multiple regions.
  3. Measuring the usage of AOAI resources by external applications with the metrics that matter most: while a typical API usually favors requests per minute (RPM), consumed tokens are the most relevant usage metric in AOAI.
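As a minimal sketch of the managed identity authentication mentioned in point 1, the inbound policy fragment below acquires a token with the APIM managed identity and forwards it to the AOAI backend, following the pattern in the linked documentation (the variable name is my own choice):

```xml
<!-- Acquire an access token for Azure Cognitive Services using the APIM managed identity -->
<authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="msi-access-token" ignore-error="false" />
<!-- Forward the token to the AOAI backend; clients never see an AOAI access key -->
<set-header name="Authorization" exists-action="override">
    <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
</set-header>
```

With this in place, client applications only ever hold an APIM subscription key, never an AOAI key.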

An important constraint

As clearly described in the APIM documentation, monitoring can have a significant impact on performance. That is why we set a sampling rate of just 10% for the detailed traces destined for Application Insights. The following image shows the configuration:

(Screenshot: Application Insights sampling configuration in APIM)

Also, to avoid any further impact on performance, we chose not to log the payloads of the frontend or backend requests and responses. Instead, we decided to trace precisely the information we need; more about this in the next section.

While sampling applies to Application Insights, we leverage APIM diagnostics to track all requests in APIM as the main mechanism to measure application usage. APIM diagnostics are configured to be persisted in the Log Analytics workspace also shared by the multiple AOAI resources.

Furthermore, the customer heavily leverages Azure Log Analytics workspaces for storing logs and metrics and for building workbooks on top of them.

Bringing it all together

Distributed tracing is a cornerstone of any modern application. Azure API Management supports the W3C Trace Context, on top of which OpenTelemetry is built, so a client-initiated distributed trace can pass through APIM and include interactions with backends and other resources.
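For reference, a traceparent header carries version, trace-id, parent-id, and flags segments; the value below is the canonical example from the W3C Trace Context specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

The trace-id stays constant across the whole distributed trace, while each hop generates a new parent-id.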

APIM and AOAI support rich diagnostics, each on its own terms. Digging deeper into this part of the scenario, I found out that the W3C trace context passed by APIM in the request to the AOAI backend is not persisted in the AOAI diagnostic logs. I also noticed that the AOAI response includes an apim-request-id header and returns the x-ms-client-request-id header, either with the same value passed by the client (APIM in this case) or with the same value as apim-request-id.
As per Azure documentation (like this one), x-ms-client-request-id is intended to be used as a client tracing string of up to 40 characters.
For the time being, I decided not to pass a segment of the traceparent I can access in the APIM policy, also because, while the AOAI diagnostic logs do include the value of the apim-request-id header in the CorrelationId column, they do not persist the x-ms-client-request-id header.

(Diagram: correlating APIM traces with AOAI requests across retry attempts)

As depicted in the diagram above, I added a section to the powerful smart load balancing policy for Azure API Management to trace, for each retry attempt performed by the logic, a tuple made of the traceparent header from the APIM request and the apim-request-id from the AOAI response.

The following policy fragment shows how to trace the aforementioned correlation information AND the involved tokens.

<!-- Prepare the tokens correlation info -->
<choose>
   <when condition="@(context.Response != null && context.Response.StatusCode == 200)">
      <set-variable name="tokens" value="@{
         var responseBody = context.Response.Body.As<JObject>(preserveContent: true);

         return new JObject(
            new JProperty("apim-traceparent", context.Request.Headers.GetValueOrDefault("traceparent",string.Empty)),
            new JProperty("aoai-correlation", context.Response.Headers.GetValueOrDefault("apim-request-id",string.Empty)),
            new JProperty("prompt_tokens", responseBody["usage"]["prompt_tokens"]),
            new JProperty("completion_tokens", responseBody["usage"]["completion_tokens"]),
            new JProperty("total_tokens", responseBody["usage"]["total_tokens"]),
            new JProperty("aoai-statusCode", context.Response.StatusCode)
          ).ToString();
        }" />
   </when>
   <otherwise>
      <set-variable name="tokens" value="@{
         return new JObject(
            new JProperty("apim-traceparent", context.Request.Headers.GetValueOrDefault("traceparent",string.Empty)),
            new JProperty("aoai-correlation", context.Response != null ? context.Response.Headers.GetValueOrDefault("apim-request-id",string.Empty) : string.Empty),
            new JProperty("prompt_tokens", 0),
            new JProperty("completion_tokens", 0),
            new JProperty("total_tokens", 0),
            new JProperty("aoai-statusCode", context.Response != null ? context.Response.StatusCode : 0)
         ).ToString();
        }" />
   </otherwise>
</choose>
<!--Trace the tokens correlation-->
<trace source="Global APIM Policy" severity="information">
   <message>@(context.Variables.GetValueOrDefault<string>("tokens", "none"))</message>
</trace>
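With this fragment in place, each attempt emits a trace message shaped like the following (the values here are illustrative, not taken from a real response):

```json
{
  "apim-traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
  "aoai-correlation": "0b42f7a1-5f63-4c1e-9f2d-8c1a2b3c4d5e",
  "prompt_tokens": 12,
  "completion_tokens": 87,
  "total_tokens": 99,
  "aoai-statusCode": 200
}
```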

What is the outcome of this tracing in the APIM diagnostic logs? As an example, following the sequence described in the previous diagram, consider a client request that encountered an HTTP 429 failure from a higher priority AOAI resource and was therefore retried against the lower priority AOAI resource, receiving an HTTP 200 success. The record corresponding to the client request to APIM would have two elements in the TraceRecords column of the ApiManagementGatewayLogs Log Analytics table, as depicted below:

(Screenshot: two TraceRecords elements on a single APIM request record)

In the screenshot above you can notice that the two requests originated in APIM towards AOAI have the same trace-id but different parent-id, in accordance with the W3C Trace Context specification.

KQL rules!

So far I described how I extended the existing smart load balancing policy for Azure API Management to collect additional information to correlate the APIM request to the AOAI requests. This is just the starting point.

Another relevant customer request was to be able to measure the AOAI tokens consumed by each client application (or project), which translates to an APIM subscription. By default, none of the APIM concepts intersect with the AOAI diagnostics, but with this additional trace everything becomes possible once we unleash the power of the Kusto Query Language (KQL) powering Azure Monitor!

Let's consider the following screenshot: using a join between the APIM and AOAI diagnostic tables, I can summarize token usage by ApimSubscriptionId (representing each application) and by modelName.
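Such a join can be sketched as follows. The APIM side expands the TraceRecords emitted by the policy; the AOAI side assumes the standard AzureDiagnostics schema (for example, modelName parsed out of properties_s), so adjust the column names to your workspace:

```kql
// Expand the TraceRecords traced by the policy, then join AOAI diagnostics
// on the apim-request-id captured as "aoai-correlation".
ApiManagementGatewayLogs
| mv-expand record = TraceRecords
| where tostring(record.source) == "Global APIM Policy"
| extend tokens = parse_json(tostring(record.message))
| extend aoaiCorrelation = tostring(tokens["aoai-correlation"])
| join kind=inner (
    AzureDiagnostics
    | where Category == "RequestResponse"
    | extend modelName = tostring(parse_json(properties_s).modelName)
    | project CorrelationId, modelName
) on $left.aoaiCorrelation == $right.CorrelationId
| summarize
    PromptTokens = sum(tolong(tokens.prompt_tokens)),
    CompletionTokens = sum(tolong(tokens.completion_tokens)),
    TotalTokens = sum(tolong(tokens.total_tokens))
    by ApimSubscriptionId, modelName
```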

(Screenshot: KQL results summarized by ApimSubscriptionId and modelName)

Extending OpenAI Insights

The marvelous workbook provided by the Azure OpenAI Insights solution offers a rich representation of AOAI diagnostics.

As discussed in previous sections, the workbook is rightfully built on top of the AOAI diagnostic logs only. As an example, once you introduce APIM, the view by CallerIP would be similar to the following screenshot, as the APIM outbound IP would be the only client reaching the AOAI endpoint.

(Screenshot: CallerIP view showing only the APIM outbound IP)

My colleagues Vincenzo Paolo Bacco and Edoardo Zonca took on the challenge of giving a UX to the tracing added to the APIM + AOAI scenario. They accomplished the task by extending Azure OpenAI Insights with a set of visualizations. As an example:

  1. Presenting AOAI logs with the CallerIPAddress from AOAI logs replaced by the real client IP reaching APIM.
  2. Enabling filtering by APIM subscription and product on many visualizations.
  3. Analyzing used tokens by modelName and modelType per APIM subscription.

The following screenshot is just an example of a tab added by Vincenzo and Edoardo to the workbook.

(Screenshot: a tab added to the Azure OpenAI Insights workbook)

The displayed data originates from AOAI diagnostics, but it is enriched with APIM diagnostics. As you can see, one IP is prevalent (it is my home office IP, sorry about that), yet it clearly represents the value added to an already exceptional asset, thanks to the additional traces I defined.

Also note, in the screenshot above, that the ability to filter by APIM subscription and product has been added to all new visualizations.

Let's close with the view on token-based utilization filtered by APIM subscription.

(Screenshot: token-based utilization filtered by APIM subscription)

The data displayed in this last screenshot shows how some client requests to a high priority AOAI endpoint had to fall back to a lower priority endpoint to sustain the request load.

Wrap-up (and disclaimer)

It has been an interesting journey learning more about the Azure API Management (APIM) integration scenario with Azure OpenAI (AOAI) and identifying a feature to build on top of the exceptional assets provided by great colleagues and contributors.

I strongly believe that adding even a tiny portion of value is more beneficial than re-inventing the wheel.


That being said, this project is meant to be an experiment: there are surely other approaches to achieve the same goals, and the specific constraints and objectives described here guided this effort.

You can find the code, queries, and workbook in the feature branch dabedin/apim-aoai-smart-loadbalancing/tree/feature/tracingAOAI, which is a fork of the great Smart load balancing for OpenAI endpoints and Azure API Management solution repository described above.

Please enjoy!
