Philosophy: Separation of Concerns
Flower Engine deliberately splits the application into two independent processes:
The Brain (Python) FastAPI Backend Handles AI, embeddings, and data persistence
The Face (Rust) Ratatui TUI Handles rendering, input, and animations
This design choice stems from a fundamental truth: AI inference and UI rendering have opposing performance characteristics.
The Problem : LLM streaming can take 5-30 seconds per response. During this time, the UI must remain buttery smooth at 60+ FPS for scrolling, animations, and input handling.
Why Python for the Brain?
Python dominates the AI/ML ecosystem for good reasons:
Ecosystem Maturity
# From engine/llm.py:29-48
client = AsyncOpenAI(
base_url = OPENAI_BASE_URL ,
api_key = OPENAI_API_KEY ,
default_headers = {
"HTTP-Referer" : "https://github.com/ritz541/flower-engine" ,
"X-Title" : "The Flower Roleplay Engine" ,
},
)
ds_client = AsyncOpenAI(
base_url = "https://api.deepseek.com" ,
api_key = DEEPSEEK_API_KEY
)
groq_client = AsyncOpenAI(
base_url = GROQ_BASE_URL ,
api_key = GROQ_API_KEY
)
if GEMINI_API_KEY :
from google import genai
gemini_client = genai.Client( api_key = GEMINI_API_KEY )
Four providers supported out-of-the-box with minimal code. Python’s AI library ecosystem is unmatched.
Async-Native Design
From engine/main.py:152-259, WebSocket handling leverages FastAPI’s async capabilities:
@app.websocket ( "/ws/rpc" )
async def websocket_endpoint ( websocket : WebSocket):
await websocket.accept()
await websocket.send_text(
build_ws_payload( "system_update" , "✓ Engine ready." , { "status" : "ok" })
)
await broadcast_sync_state(websocket)
try :
while True :
data = await websocket.receive_text()
# ... process commands or prompts
# Stream LLM response
task = asyncio.create_task(
stream_chat_response(
websocket, prompt, context,
world_id, char_id, session_id
)
)
# Allow cancellation during streaming
while not task.done():
try :
raw = await asyncio.wait_for(
websocket.receive_text(),
timeout = 0.05
)
if cmd_msg.get( "prompt" ) == "/cancel" :
task.cancel()
except asyncio.TimeoutError:
continue
The backend can:
Handle long-running LLM streams
Accept cancellation commands mid-stream
Manage multiple concurrent operations (RAG, DB writes, token streaming)
All without blocking
RAG with Minimal Boilerplate
Python makes vector search trivial:
# Chunked lore embedding at startup
for i, chunk in enumerate (chunks):
rag_manager.add_lore(w.id, f "base_lore_ { i } " , chunk)
# Query at runtime (from engine/main.py:197-201)
lore_list, _ = rag_manager.query_lore(
state. ACTIVE_WORLD_ID , prompt, n_results = 2
)
mem_key = f " { state. ACTIVE_CHARACTER_ID } _ { state. ACTIVE_SESSION_ID } "
mem_list, _ = rag_manager.query_memory(mem_key, prompt, n_results = 3 )
ChromaDB + SentenceTransformers provide production-grade RAG with ~10 lines of code. Embeddings run on CPU for maximum compatibility.
Why Rust for the Face?
Rust brings predictable performance to a domain (TUI rendering) where milliseconds matter.
Zero-Cost Abstractions
From tui/src/main.rs:19-52, the event loop is lean:
const TICK_RATE : Duration = Duration :: from_millis ( 150 );
#[tokio :: main]
async fn main () -> Result <(), Box < dyn Error >> {
enable_raw_mode () ? ;
let mut stdout = io :: stdout ();
execute! ( stdout , EnterAlternateScreen ) ? ;
let backend = CrosstermBackend :: new ( stdout );
let mut terminal = Terminal :: new ( backend ) ? ;
let mut app = App :: new ();
let ( tx_in , rx_in ) = mpsc :: unbounded_channel :: < WsMessage >();
let ( tx_out , rx_out ) = mpsc :: unbounded_channel :: < String >();
// Spawn WebSocket client in background task
tokio :: spawn ( async move {
ws :: start_ws_client ( tx_in , rx_out ) . await ;
});
let res = run_app ( & mut terminal , & mut app , rx_in , tx_out ) . await ;
// Cleanup
disable_raw_mode () ? ;
execute! ( terminal . backend_mut (), LeaveAlternateScreen ) ? ;
terminal . show_cursor () ? ;
Ok (())
}
No garbage collection pauses . Every frame renders in less than 1ms with predictable latency.
Concurrent Event Handling
The tokio::select! macro elegantly handles three event sources:
loop {
terminal . draw ( | f | ui :: draw ( f , app )) ? ;
tokio :: select! {
// WebSocket messages from Python backend
Some ( msg ) = rx_in . recv () => {
match msg . event . as_str () {
"chat_chunk" => {
app . append_chunk ( & msg . payload . content);
if let Some ( tps ) = msg . payload . metadata . tokens_per_second {
app . tps = tps ;
}
}
"sync_state" => { /* Update UI state */ }
"chat_end" => { app . finish_stream (); }
_ => {}
}
}
// User keyboard input
Some ( Ok ( event )) = reader . next () . fuse () => {
match event {
Event :: Key ( key ) => {
match key . code {
KeyCode :: Esc => { /* Cancel or quit */ }
KeyCode :: Enter => { /* Submit message */ }
KeyCode :: Char ( c ) => { app . handle_char ( c ); }
_ => {}
}
}
_ => {}
}
}
// Animation tick (spinner, cursor blink)
_ = tokio :: time :: sleep ( timeout ) . fuse () => {
if app . is_typing {
app . spinner_frame = ( app . spinner_frame + 1 ) % 10 ;
}
}
}
}
150ms tick rate balances smooth animations with CPU efficiency. The loop never blocks on any single operation.
Type-Safe Message Parsing
From tui/src/models.rs:1-51, WebSocket messages have compile-time guarantees:
#[derive( Debug , Deserialize , Serialize , Clone )]
pub struct WsMessage {
pub event : String ,
pub payload : Payload ,
}
#[derive( Debug , Serialize , Deserialize , Clone )]
pub struct Payload {
pub content : String ,
#[serde(default)]
pub metadata : Metadata ,
}
#[derive( Debug , Serialize , Deserialize , Clone , Default )]
pub struct Metadata {
pub model : Option < String >,
pub tokens_per_second : Option < f64 >,
pub world_id : Option < String >,
pub total_tokens : Option < u32 >,
pub available_worlds : Option < Vec < EntityInfo >>,
// ... 10+ optional fields
}
If the Python backend sends malformed JSON, serde_json::from_str fails gracefully: if let Ok ( ws_msg ) = serde_json :: from_str :: < WsMessage >( & text ) {
tx . send ( ws_msg ) . unwrap ();
} else {
eprintln! ( "Failed to parse WS message: {}" , text );
}
The TUI never crashes on bad data.
Communication: The WebSocket Bridge
The two halves communicate via a JSON-based WebSocket protocol at ws://localhost:8000/ws/rpc.
Connection Lifecycle
Startup
Python backend starts FastAPI server on port 8000
TUI Launch
Rust TUI attempts connection with 1-second retry loop: let ws_stream = loop {
match connect_async ( url . clone ()) . await {
Ok (( stream , _ )) => break stream ,
Err ( _ ) => {
tokio :: time :: sleep ( Duration :: from_secs ( 1 )) . await ;
}
}
};
Handshake
Backend sends system_update event: ”✓ Engine ready.”
State Sync
Backend sends sync_state with available worlds, characters, models, rules
Active Session
User sends prompts, receives chat_chunk streams
Shutdown
User presses Esc (quit), TUI closes connection
Why WebSocket over gRPC/REST?
Both client and server can send messages at any time. Essential for:
Token-by-token LLM streaming
Mid-stream cancellation commands
Real-time performance metrics (tokens/sec)
After initial handshake, messages are pure JSON with minimal framing. No HTTP headers on every message.
FastAPI has native WebSocket support: @app.websocket ( "/ws/rpc" )
async def websocket_endpoint ( websocket : WebSocket):
await websocket.accept()
Tokio-tungstenite is the de facto Rust WebSocket client: let ( ws_stream , _ ) = connect_async ( url ) . await ? ;
let ( write , read ) = ws_stream . split ();
Comparison: Monolithic vs Split
Metric Monolithic Python Split Architecture UI Frame Rate 10-30 FPS (during LLM calls) 60+ FPS (always) Input Latency 50-200ms (GC pauses) Less than 5ms (predictable) Memory Usage 800MB+ (Python overhead) 400MB Python + 10MB Rust Crash Isolation Backend crash kills UI UI survives backend restarts Dev Iteration Restart entire app Restart only changed process
Result : The TUI never stutters, even during 30-second LLM responses.
Real-World Latency
From tui/src/main.rs:124-139, token streaming is instant:
"chat_chunk" => {
app . append_chunk ( & msg . payload . content); // <1ms
if let Some ( tps ) = msg . payload . metadata . tokens_per_second {
app . tps = tps ; // Update metrics
}
}
Each token appears in the TUI within 5ms after Python sends it (WebSocket latency + render time).
Crash Resilience
The split design provides isolation:
Backend Crash
Frontend Crash
If Python crashes:
Rust TUI detects WebSocket closure
Displays “Connection lost” message
Enters retry loop (1-second intervals)
Reconnects when backend restarts
State synced via sync_state event
User doesn’t lose their session (SQLite persists data).If Rust TUI crashes:
Python backend detects WebSocket disconnect
Cleans up task resources
Waits for new connection
User relaunches TUI, reconnects instantly
No backend restart required.
Development Experience
Hot Reload
Both processes support live reloading:
# Terminal 1: Python backend with auto-reload
cd engine
uvicorn engine.main:app --host 0.0.0.0 --port 8000 --reload
# Terminal 2: Rust TUI (cargo watch for auto-rebuild)
cd tui
cargo watch -x run
Change a Python file → Backend reloads in ~2s → TUI reconnects automatically. Change a Rust file → TUI rebuilds in ~5s → No backend disruption.
Language-Appropriate Debugging
Python : Rich logging with context
from engine.logger import log
log.info( f " \n === RETRIEVED LORE ( { len (lore_list) } chunks) ===" )
for i, chunk in enumerate (lore_list):
log.info( f "[LORE { i + 1 } ] \n { chunk } \n " )
log.info( f "=== END LORE === \n " )
Rust : Compile-time error checking
// If metadata schema changes, compiler catches it
if let Some ( tps ) = msg . payload . metadata . tokens_per_second {
app . tps = tps ; // Type error if field removed
}
Trade-offs
Complexity
Downside : You must manage two processes, ensure protocol compatibility, and handle connection failures.
Flower Engine mitigates this with:
start.sh script launches both processes
Auto-reconnect logic in TUI
Strict JSON schema (errors are logged, not crashed)
Deployment
Benefit : Each half can be deployed independently. Future possibilities:
Remote backend (TUI connects via ws://remote-server:8000)
Multiple TUI clients sharing one backend
Backend horizontal scaling (session-based routing)
When to Use This Pattern
The split-brain architecture excels when:
Mismatched Performance One part is fast (UI), the other is slow (AI)
Different Ecosystems Best libraries exist in different languages
Isolation Requirements Crashes shouldn’t cascade between components
Independent Scaling UI and backend have different resource needs
Avoid this pattern for simple CRUD apps or when real-time sync is critical (e.g., collaborative editing with less than 10ms latency requirements).
Next Steps
WebSocket Protocol Learn the JSON message format
Architecture Overview See the full system design
Development Guide Set up your dev environment
System Rules Understand narrative constraints