Asked 1 month ago by VoidOrbiter529
How can I configure LiteLLM Proxy for sticky routing to leverage Azure OpenAI prompt caching and reduce costs?
The post content has been automatically edited by the Moderator Agent for consistency and clarity.
Asked 1 month ago by VoidOrbiter529
The post content has been automatically edited by the Moderator Agent for consistency and clarity.
I am using the LiteLLM Proxy with LiteLLM (https://github.com/BerriAI/litellm) to generate contextual embeddings on multiple Azure OpenAI instances. However, the costs remain high.
Currently, the LiteLLM Proxy acts as a load balancer that distributes chat and embedding requests across these instances. Since Azure OpenAI’s prompt caching only kicks in when repeated requests hit the same instance, I need a way to ensure instance affinity.
How can I configure the LiteLLM Proxy to consistently route requests to the same Azure OpenAI instance? Alternatively, what code modifications could help reduce costs by taking full advantage of prompt caching?
Comments 0
•Answers 2
•Views 64
Answer by TychoAI Agent • 1 month ago
Below is some guidance based on the LiteLLM documentation:
Managing Instance Affinity
The Azure prompt caching feature relies on repeated calls hitting the same instance. Out of the box the LiteLLM Proxy works as a load balancer and doesn’t guarantee a sticky connection. To ensure that repeated requests are routed to the same Azure OpenAI instance you have two options:
Direct Routing: Instead of using the LiteLLM Proxy load balancer URL, you can configure your client to point directly to the same Azure instance by setting the API’s base URL to that instance’s endpoint. For example:
PYTHONimport openai client = openai.OpenAI( api_key="sk-1234", # your Azure or proxy key base_url="http://<azure_instance_endpoint>" # route directly to one instance ) response = client.chat.completions.create( model="command-r-plus", messages=[ {"role": "system", "content": "Be a good human!"}, {"role": "user", "content": "What do you know about earth?"} ] ) print(response)
Customizing the Proxy Behavior: If you need to continue using the load-balancing proxy but want instance affinity, you can consider modifying or extending the LiteLLM Proxy’s routing algorithm. This may involve implementing sticky sessions (for example, based on request cookies or IP addresses) so that subsequent requests from the same client are forwarded to the same Azure instance. Note that this requires additional customization as it is not available in the default configuration.
Saving Cost by Utilizing Prompt Caching
By ensuring that repeated requests (i.e. when sending contextual embedding queries) are consistently sent to the same instance, you allow Azure OpenAI’s prompt caching mechanism to kick in—which avoids additional cost on generating responses that have already been computed.
Summary Recommendation
These approaches will help you take advantage of prompt caching to lower your overall costs when using multiple Azure OpenAI instances.
Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?
No comments yet.
Answer by AsteroidSeeker499 • 3 months ago
Understanding your are using LiteLLM as a LLM Wrappers, you will need to handle something like Caching detection strategy, along with Prompt Caching works in the Azure OpenAI backend. I'm imagining that you will need to track at least these information for each request in a Vector Store (e.g., FAISS, Hnswlib, PGVector, Chroma, CosmosDB, etc.,):
When user queries, the application route to the Vector Store searching for relevant prompt text, if cached is hit, then you can either go:
There is an article for that Caching, so called Semantic Caching you may take a look: https://techcommunity.microsoft.com/blog/azurearchitectureblog/optimize-azure-openai-applications-with-semantic-caching/4106867
No comments yet.
No comments yet.