Large Language Models (LLMs) have transformed the way we build software applications. By combining LLMs with traditional software, developers can create powerful AI-driven applications called LLM-based applications or AI agents. However, serving these applications efficiently is challenging due to the complex workflows and diverse performance requirements of LLM requests. A new system called Parrot aims to address these challenges and significantly improve the end-to-end performance of LLM-based applications.
Link to the paper: https://arxiv.org/abs/2405.19888
The rise of LLM-based applications
LLM-based applications leverage the natural language understanding capabilities of LLMs to accomplish tasks collaboratively. These applications typically require multiple rounds of conversation, implemented through multiple API calls to the LLM. Some common conversation patterns include:
- Map-reduce summary: Splitting a long document into smaller chunks, summarizing each chunk independently (Map), and combining the partial summaries into a final one (Reduce); a sketch follows this list.
- Chain summary: Summarizing a document incrementally, with each step incorporating the summary of the previous chunk.
- LLM-powered search: Using LLMs to rewrite queries, search for relevant information, and generate answers based on the search results.
- Multi-agent coding: Multiple LLM-powered agents collaborating on a software development task, e.g., a product manager, architect, engineer, and QA tester working together to write code.
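To make the map-reduce pattern concrete, here is a minimal client-side sketch in Python. It assumes an OpenAI-compatible chat-completions endpoint; the `llm_complete` helper, the endpoint URL, and the prompt wording are illustrative, not taken from the paper.

```python
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # any OpenAI-compatible server

def llm_complete(prompt: str) -> str:
    # One blocking round trip to the LLM service.
    resp = requests.post(API_URL, json={
        "model": "my-model",
        "messages": [{"role": "user", "content": prompt}],
    })
    return resp.json()["choices"][0]["message"]["content"]

def map_reduce_summary(document: str, chunk_size: int = 4000) -> str:
    # Split the long document into fixed-size chunks.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    # Map: summarize each chunk (these calls are independent and could run in parallel).
    partials = [llm_complete(f"Summarize this text:\n{c}") for c in chunks]
    # Reduce: merge the partial summaries into one final summary.
    return llm_complete("Combine these summaries into one:\n" + "\n".join(partials))
```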
Challenges in serving LLM-based applications
Public LLM service providers face diverse tenants and applications, each with different workflows and performance preferences. However, existing LLM services treat requests individually, losing essential application-level information. This leads to several challenges:
- Excessive overhead of consecutive requests: Dependent requests must be executed interactively, with each response returned to the client before the next request can be submitted, incurring extra network latency and queuing delays (illustrated in the sketch after this list).
- Misaligned scheduling objectives: LLM services blindly optimize individual requests, leading to suboptimal end-to-end performance for applications with diverse requirements (e.g., latency vs. throughput).
- Redundant computations: Popular LLM applications often use a long system prompt that is repeated across requests, wasting storage, computation, and memory bandwidth. An analysis showed that over 94% of tokens in a production LLM-based search engine were repeated across users.
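The first challenge is easy to see in code. Below is a sketch of the chain-summary pattern executed the conventional way, reusing the hypothetical `llm_complete` helper from the earlier example: every dependent request is a separate client-to-service round trip, so network latency and queuing delay accumulate once per chunk.

```python
def chain_summary(chunks: list[str]) -> str:
    summary = ""
    for chunk in chunks:
        # Each iteration blocks until the previous response travels back to
        # the client, so N chunks cost N sequential network round trips and
        # N passes through the service's request queue.
        summary = llm_complete(
            f"Previous summary:\n{summary}\n\nExtend it with this text:\n{chunk}"
        )
    return summary
```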
Parrot: efficient LLM serving with semantic variables
Parrot is an LLM service system that treats LLM applications as first-class citizens. It introduces a simple abstraction called the Semantic Variable, which lets developers annotate the input and output variables in a request's prompt. When the output variable of one request feeds the input of another, Semantic Variables form a data pipeline across requests, providing a natural way to program LLM applications.
Exposing semantic variables to the LLM service enables conventional data flow analysis to uncover correlations across requests. This opens up new optimization opportunities for improving the end-to-end performance of LLM-based applications.
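Here is a sketch of what programming with Semantic Variables looks like, loosely modeled on the Python-style frontend described in the paper; the module name, decorator, and template syntax below are illustrative approximations rather than Parrot's exact API.

```python
import parrot as P  # hypothetical module name for the Parrot frontend

@P.semantic_function
def summarize_step(prev: P.Input, chunk: P.Input, out: P.Output):
    """Previous summary: {{prev}}
    New text: {{chunk}}
    Updated summary: {{out}}"""

chunks = ["first part of the document", "second part", "third part"]

# Each call returns immediately with a future-like Semantic Variable.
# Passing one call's output into the next call's input creates a dataflow
# edge the service can see, so the whole chain can be scheduled and
# executed on the service side without per-step client round trips.
summary = P.variable(content="")  # assumed constructor for a pre-filled variable
for chunk in chunks:
    summary = summarize_step(prev=summary, chunk=chunk)

print(summary.get())  # the client blocks only when the final value is needed
```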
Optimizations enabled by semantic variables
- Serving dependent requests: Parrot can colocate and execute dependent requests consecutively on the LLM service side, eliminating the network latency and queuing delays associated with client-side execution.
- Performance objective deduction: Parrot analyzes the application’s DAG and performance criteria to deduce request-level scheduling preferences, optimizing for the end-to-end performance rather than individual requests.
- Sharing prompt prefix: By understanding the prompt structure, Parrot efficiently detects commonality across requests and shares the common prefix, reducing redundant storage, computation, and memory bandwidth (see the sketch after this list).
- Application-centric scheduling: Parrot’s scheduler groups requests with similar performance requirements and maximizes opportunities for sharing across requests, improving both application performance and GPU cluster utilization.
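To show the prefix-sharing idea in isolation, here is a small self-contained sketch (not Parrot's implementation). Requests from the same application often repeat an identical system prompt; factoring that prefix out means its KV cache can be computed once and reused, leaving only the short per-user suffix to process. Real systems match at the token level; the character-level `commonprefix` here is just for illustration.

```python
from os.path import commonprefix

def split_shared_prefix(prompts: list[str]) -> tuple[str, list[str]]:
    """Factor a batch of prompts into one shared prefix plus per-request suffixes."""
    shared = commonprefix(prompts)
    return shared, [p[len(shared):] for p in prompts]

system_prompt = "You are a careful search assistant. Cite your sources."
prompts = [
    system_prompt + "\nUser: best hiking trails near Seattle",
    system_prompt + "\nUser: how do transformers work",
]

shared, suffixes = split_shared_prefix(prompts)
# The KV cache for `shared` is computed once and reused by every request;
# only the unique suffixes need fresh attention computation.
print(len(shared), [len(s) for s in suffixes])
```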
Evaluation results
Extensive evaluations demonstrate the effectiveness of Parrot in optimizing various LLM-based applications:
- Data analytics on long documents: Parrot achieved up to 2.38× speedup for chain-style summarization and 2.37× speedup for map-reduce summarization compared to baselines.
- Serving popular LLM applications: Parrot reduced latency by 1.8× to 2.4× for Bing Copilot-like applications with shared system prompts, and sustained 12× higher request rates for multiple GPT applications compared to baselines.
- Multi-agent applications: Parrot accelerated a multi-agent programming application by up to 11.7× compared to a latency-centric baseline and 2.45× compared to a throughput-centric baseline.
- Scheduling of mixed workloads: In a scenario with a mix of latency-sensitive chat applications and throughput-oriented data analytics tasks, Parrot achieved 5.5× improvement in normalized latency for chat applications and 3.7× speedup for map-reduce applications compared to baselines.
Conclusion
Parrot introduces a novel approach to optimizing LLM-based applications by treating them as first-class citizens in the LLM serving system. By exposing application-level information through Semantic Variables, Parrot enables a range of optimizations that significantly improve the end-to-end performance of LLM applications. The evaluation results demonstrate Parrot’s effectiveness in accelerating various real-world LLM-based applications, making it a promising solution for the efficient serving of the next generation of AI-driven software.
Lin, C., Han, Z., Zhang, C., Yang, Y., Yang, F., Chen, C. and Qiu, L., 2024, July. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), Santa Clara, CA.