The AsyncLLMEngine class provides an asynchronous interface for serving LLM requests. It’s designed for online serving scenarios and is used by the OpenAI-compatible API server.
In vLLM v1, AsyncLLMEngine is an alias for vllm.v1.engine.async_llm.AsyncLLM.

Overview

The AsyncLLMEngine is the recommended interface for production serving workloads. It provides:
  • Asynchronous request processing
  • Request queuing and batching
  • Streaming support
  • Efficient memory management
  • Support for distributed inference

Usage

from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import SamplingParams

# Create engine args
engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",
    tensor_parallel_size=1,
)

# Create engine
engine = AsyncLLMEngine.from_engine_args(engine_args)

# Generate text
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
request_id = "unique-request-id"

# Submit the request and stream results; generate() enqueues the
# request and yields RequestOutput objects as tokens are produced
async for request_output in engine.generate(
    "Hello, my name is",
    sampling_params,
    request_id,
):
    if request_output.finished:
        print(request_output.outputs[0].text)

Key methods

from_engine_args

AsyncLLMEngine.from_engine_args(
    engine_args: AsyncEngineArgs,
    usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
) -> AsyncLLMEngine
Creates an AsyncLLMEngine from engine arguments.

add_request

await engine.add_request(
    request_id: str,
    prompt: str | list[int],
    sampling_params: SamplingParams,
    arrival_time: float | None = None,
) -> None
Adds a new request to the engine’s queue.
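Request ids must be unique across in-flight requests; any unique string works. A minimal sketch of an id helper (make_request_id is a hypothetical name, not part of vLLM):

```python
import uuid

def make_request_id(prefix: str = "req") -> str:
    """Build a unique request id; the engine tracks each request by this string."""
    return f"{prefix}-{uuid.uuid4().hex}"

# Each call yields a distinct id, e.g. "req-3f2a..."
rid_a = make_request_id()
rid_b = make_request_id()
```

Timestamps (as in the FastAPI example below) also work, but uuid4 avoids collisions under concurrent requests.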

generate

async for output in engine.generate(prompt, sampling_params, request_id):
    # Process output
    ...
Submits the request to the engine and generates outputs for it. Yields RequestOutput objects as they become available.

abort

await engine.abort(request_id: str) -> None
Aborts an in-flight request and frees its resources.
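A common reason to abort is that the client disconnected or a deadline passed. A minimal sketch of the cancel-on-timeout pattern, using a stub async generator in place of engine.generate (the stub names and timeout value are illustrative, not part of vLLM):

```python
import asyncio

async def slow_stream():
    # Stand-in for engine.generate(...): yields partial outputs indefinitely.
    i = 0
    while True:
        await asyncio.sleep(0.05)
        i += 1
        yield f"token-{i}"

async def generate_with_deadline(timeout: float) -> list:
    """Consume the stream, stopping if it exceeds `timeout` seconds.

    With a real engine you would call `await engine.abort(request_id)`
    where the TimeoutError is handled, to free the request's resources.
    """
    outputs = []

    async def consume():
        async for out in slow_stream():
            outputs.append(out)

    try:
        await asyncio.wait_for(consume(), timeout=timeout)
    except asyncio.TimeoutError:
        pass  # here: await engine.abort(request_id)
    return outputs

# The deadline fires after roughly two tokens
partial = asyncio.run(generate_with_deadline(timeout=0.12))
```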

Engine configuration

The AsyncEngineArgs class is used to configure the engine. See EngineArgs for all available configuration options.
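For example, a few commonly tuned options (field names follow EngineArgs; verify them against your vLLM version):

```python
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",
    tensor_parallel_size=1,       # number of GPUs for tensor parallelism
    gpu_memory_utilization=0.9,   # fraction of GPU memory to reserve
    max_model_len=2048,           # maximum context length
    dtype="auto",                 # model weight dtype
)
```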

Streaming support

The AsyncLLMEngine supports streaming outputs by yielding partial results as they’re generated:
async for request_output in engine.generate(prompt, sampling_params, request_id):
    # request_output contains partial results
    if not request_output.finished:
        print(f"Partial: {request_output.outputs[0].text}")
    else:
        print(f"Final: {request_output.outputs[0].text}")
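Note that each RequestOutput carries the cumulative text generated so far, so a streaming client typically prints only the newly added suffix. A minimal sketch of that delta computation (text_delta is a hypothetical helper; confirm the cumulative-text behavior against your vLLM version):

```python
def text_delta(previous: str, current: str) -> str:
    """Return the newly generated suffix, given cumulative text snapshots."""
    return current[len(previous):] if current.startswith(previous) else current

# Simulated cumulative snapshots, as yielded across generate() iterations
snapshots = ["Hel", "Hello,", "Hello, world"]
deltas = []
prev = ""
for snap in snapshots:
    deltas.append(text_delta(prev, snap))
    prev = snap
# deltas == ["Hel", "lo,", " world"]
```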

Integration with FastAPI

The AsyncLLMEngine integrates seamlessly with FastAPI for building API servers:
import time

from fastapi import FastAPI
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams

app = FastAPI()
engine = None

@app.on_event("startup")
async def startup():
    global engine
    engine_args = AsyncEngineArgs(model="facebook/opt-125m")
    engine = AsyncLLMEngine.from_engine_args(engine_args)

@app.post("/generate")
async def generate(prompt: str):
    request_id = f"req-{time.time()}"
    sampling_params = SamplingParams()

    async for output in engine.generate(prompt, sampling_params, request_id):
        if output.finished:
            return {"text": output.outputs[0].text}

See also

  • LLM - Synchronous interface for offline inference
  • EngineArgs - Engine configuration options
  • SamplingParams - Text generation parameters
