- Backend-stored responses: The backend writes complete responses to a database and clients load those full responses from there, while Ably is used only to deliver live tokens for the current in-progress response.
- Live transcription, captioning, or translation: A viewer who joins a live stream only needs sufficient tokens for the current “frame” of subtitles, not the entire transcript so far.
- Code assistance in an editor: Streamed tokens become part of the file on disk as they are accepted, so past tokens do not need to be replayed from Ably.
- Autocomplete: A fresh response is streamed for each change a user makes to a document, with only the latest suggestion being relevant.
Publishing tokens
Publish tokens from a Realtime client, which maintains a persistent connection to the Ably service. This allows you to publish at very high message rates with the lowest possible latencies, while preserving guarantees around message delivery order. For more information, see Realtime and REST.

Channels separate message traffic into different topics. For token streaming, each conversation or session typically has its own channel. Use the get() method to create or retrieve a channel instance:
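The original code sample is not reproduced here; as a minimal sketch, assuming the Ably JavaScript SDK and a hypothetical per-conversation channel naming convention, retrieving a channel might look like:

```javascript
// Sketch: derive a per-conversation channel from a Realtime client.
// The "conversation:<id>" naming scheme is a hypothetical convention.
function getConversationChannel(realtime, conversationId) {
  // channels.get() creates the channel instance on first use,
  // and returns the existing instance on subsequent calls.
  return realtime.channels.get(`conversation:${conversationId}`);
}
```

With the JavaScript SDK this would be used as `getConversationChannel(new Ably.Realtime({ key }), '123')`.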
When publishing tokens, don’t await the channel.publish() call. Ably rolls up acknowledgments and debounces them for efficiency, which means awaiting each publish would unnecessarily slow down your token stream. Messages are still published in the order that publish() is called, so delivery order is not affected.
This approach maximizes throughput while maintaining ordering guarantees, allowing you to stream tokens as fast as your AI model generates them.
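A sketch of this fire-and-forget approach, assuming `channel` is an Ably channel instance and `tokens` is a hypothetical async iterable of model output:

```javascript
// Sketch: stream tokens without awaiting each publish.
async function streamTokens(channel, tokens) {
  for await (const token of tokens) {
    // No await: publishes are pipelined, so throughput is not capped
    // by the round-trip time of each acknowledgment. Delivery order
    // still follows the order in which publish() is called.
    channel.publish('token', token);
  }
}
```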
Handling publish failures
The examples above publish successive tokens by pipelining the publish operations: the agent publishes a token without waiting for prior operations to complete. This is necessary to avoid the publish rate being capped by the round-trip time from the agent to the Ably endpoint. However, because the agent does not await the outcome of each publish operation, it can continue to publish tokens after an earlier publish has failed. For example, if a rate limit is exceeded, a single token may be rejected while the following tokens continue to be accepted.

The agent needs to obtain the outcome of each publish operation and take corrective action if any operation fails. A simple but effective approach is to ensure that, if streaming fails for any reason, a recovery message containing the complete response text is published once it is available. The streaming experience is disrupted in the case of failure, but subscribers can replace their accumulated tokens with the complete response.

To detect publish failures, keep a reference to each publish operation and check for rejections after the stream completes. If any publish fails, publish a new message with a different event type (such as response-complete) containing the full response. Subscribers should handle this event by replacing any tokens they have accumulated for that response:
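The original code sample is not reproduced here; a sketch of the recovery pattern, assuming `channel` is an Ably channel and treating the `response-complete` event name as a convention rather than anything Ably reserves:

```javascript
// Sketch: detect pipelined publish failures and recover by publishing
// the complete response text.
async function streamWithRecovery(channel, tokens) {
  const pending = [];
  let fullText = '';
  for (const token of tokens) {
    fullText += token;
    // Keep a reference to each publish; don't await inline.
    pending.push(channel.publish('token', token));
  }
  // After the stream completes, check whether any publish was rejected.
  const outcomes = await Promise.allSettled(pending);
  if (outcomes.some((o) => o.status === 'rejected')) {
    // Publish the full response so subscribers can replace
    // their accumulated tokens.
    await channel.publish('response-complete', fullText);
  }
}
```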
When streaming multiple concurrent responses, include a responseId in message extras so subscribers can correctly associate the recovery message with the tokens it replaces. See token stream with multiple responses for details on correlating messages.
Streaming patterns
Ably is a pub/sub messaging platform, so you can structure your messages however works best for your application. Below are common patterns for streaming tokens, each showing both agent-side publishing and client-side subscription. Choose the approach that fits your use case, or create your own variation.

Continuous token stream
For simple streaming scenarios such as live transcription, where all tokens are part of a continuous stream, simply publish each token as a message.

Publish tokens
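The original code sample is not reproduced here; a sketch of the agent side, assuming `channel` is an Ably channel and `modelStream` is a hypothetical async iterable of tokens:

```javascript
// Sketch: continuous token stream. Each token is published as its own
// message; no correlation metadata is needed for a single stream.
async function publishContinuous(channel, modelStream) {
  for await (const token of modelStream) {
    // Fire-and-forget: don't await each publish (see "Publishing tokens").
    channel.publish('token', token);
  }
}
```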
Subscribe to tokens
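The original code sample is not reproduced here; a sketch of the client side, where `render` is a hypothetical callback that updates the UI with the accumulated text:

```javascript
// Sketch: subscriber appends each token to the displayed text.
function subscribeContinuous(channel, render) {
  let text = '';
  channel.subscribe('token', (message) => {
    text += message.data;
    render(text);
  });
}
```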
This pattern is simple and works well when you’re displaying a single, continuous stream of tokens.

Token stream with multiple responses
For applications with multiple responses, such as chat conversations, include a responseId in message extras to correlate tokens that belong to the same response.
Publish tokens
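The original code sample is not reproduced here; a sketch of publishing with a responseId header in message extras, where the responseId value and its generation are left to the application:

```javascript
// Sketch: tag each token with a responseId header so subscribers
// can group tokens by the response they belong to.
async function publishResponse(channel, responseId, modelStream) {
  for await (const token of modelStream) {
    channel.publish({
      name: 'token',
      data: token,
      extras: { headers: { responseId } },
    });
  }
}
```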
Subscribe to tokens
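The original code sample is not reproduced here; a sketch of the subscriber side, accumulating tokens per responseId so concurrent responses don't interleave:

```javascript
// Sketch: group incoming tokens by the responseId header.
function subscribeByResponse(channel) {
  const responses = new Map(); // responseId -> accumulated text
  channel.subscribe('token', (message) => {
    const id = message.extras?.headers?.responseId;
    responses.set(id, (responses.get(id) ?? '') + message.data);
  });
  return responses;
}
```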
Use the responseId header in message extras to correlate tokens. The responseId allows you to group tokens belonging to the same response and handle token delivery correctly for distinct responses, even when they are delivered concurrently.
Token stream with explicit start/stop events
In some cases, your AI model response stream may include explicit events to mark response boundaries. You can indicate the event type, such as a response start/stop event, using the Ably message name.
Publish tokens
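The original code sample is not reproduced here; a sketch of marking boundaries with message names, where the event names (response-start, token, response-stop) are a hypothetical convention rather than names Ably reserves:

```javascript
// Sketch: mark response boundaries with distinct message names.
async function publishWithBoundaries(channel, responseId, modelStream) {
  channel.publish({ name: 'response-start', extras: { headers: { responseId } } });
  for await (const token of modelStream) {
    channel.publish({ name: 'token', data: token, extras: { headers: { responseId } } });
  }
  channel.publish({ name: 'response-stop', extras: { headers: { responseId } } });
}
```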
Subscribe to tokens
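The original code sample is not reproduced here; a sketch of dispatching on the message name, where onStart, onToken, and onStop are hypothetical UI callbacks:

```javascript
// Sketch: handle each event type to track the response lifecycle.
function subscribeWithBoundaries(channel, { onStart, onToken, onStop }) {
  // Subscribing without a name receives all messages on the channel.
  channel.subscribe((message) => {
    switch (message.name) {
      case 'response-start': onStart(); break;
      case 'token': onToken(message.data); break;
      case 'response-stop': onStop(); break;
    }
  });
}
```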
Handle each event type to manage the response lifecycle.

Client hydration
When clients connect or reconnect, such as after a page refresh, they often need to catch up on tokens that were published while they were offline or before they joined. Ably provides several approaches to hydrate client state depending on your application’s requirements.

Using rewind for recent history
The simplest approach is to use Ably’s rewind channel option to attach to the channel at a point in the recent past and automatically receive all tokens since that point.

Rewind supports two formats:

- Time-based: a time interval such as '30s' or '2m' retrieves messages from that period
- Count-based: a number such as 50 or 100 retrieves the most recent N messages (maximum 100)
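The original code sample is not reproduced here; a sketch of attaching with the rewind channel option, assuming the JavaScript SDK and a hypothetical channel name:

```javascript
// Sketch: attach with the rewind channel option so recent tokens are
// replayed on connect, before live tokens are delivered.
function getChannelWithRewind(realtime, name, rewind = '2m') {
  // rewind can be time-based ('30s', '2m') or count-based ('100').
  return realtime.channels.get(name, { params: { rewind } });
}
```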
Using history for older messages
For applications that need to retrieve tokens beyond the 2-minute rewind window, enable persistence on your channel. Use channel history with the untilAttach option to paginate back through history to obtain historical tokens, while preserving continuity with the delivery of live tokens.
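The original code sample is not reproduced here; a sketch of paging back through persisted history on an attached channel:

```javascript
// Sketch: page back through persisted history after attaching.
// untilAttach ensures the retrieved history is contiguous with the
// live messages delivered on the attached channel.
async function loadHistory(channel) {
  const tokens = [];
  let page = await channel.history({ untilAttach: true });
  while (page) {
    for (const message of page.items) tokens.push(message.data);
    page = page.hasNext() ? await page.next() : null;
  }
  // History is returned newest-first; reverse into display order.
  return tokens.reverse();
}
```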
