
Overview

Streamlit enables rapid development of interactive web UIs for machine learning models without frontend expertise. This implementation supports both single predictions and batch processing.

Implementation

Application Structure

The Streamlit app (serving/ui_app.py) provides two interfaces:
serving/ui_app.py
import pandas as pd
import streamlit as st
from serving.predictor import Predictor

@st.cache_data
def get_model() -> Predictor:
    return Predictor.default_from_model_registry()

predictor = get_model()

def main():
    st.header("UI serving demo")
    tab1, tab2 = st.tabs(["Single prediction", "Batch prediction"])
    with tab1:
        single_pred()
    with tab2:
        batch_pred()

if __name__ == "__main__":
    main()
Key features:
  • Model caching with @st.cache_data for fast reloads
  • Tabbed interface for different use cases
  • Automatic model loading from W&B registry
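The UI only relies on a small `Predictor` surface: a registry constructor and a `predict()` that returns per-class probabilities. A minimal stub sketch of that assumed interface (the real class lives in `serving/predictor.py`; this stub is illustrative only):

```python
import numpy as np

class StubPredictor:
    """Minimal stand-in with the interface the UI relies on: a registry
    constructor and a predict() that returns per-class probabilities."""

    @classmethod
    def default_from_model_registry(cls) -> "StubPredictor":
        # The real Predictor downloads weights from the W&B model registry here.
        return cls()

    def predict(self, sentences: list[str]) -> np.ndarray:
        # One [p_incorrect, p_correct] row per input sentence.
        return np.tile([0.5, 0.5], (len(sentences), 1))

predictor = StubPredictor.default_from_model_registry()
print(predictor.predict(["hello"]).shape)  # (1, 2)
```

A stub like this is also handy for unit-testing the UI without network access to the registry.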

Single Prediction Interface

Implementation

serving/ui_app.py
def single_pred():
    input_sent = st.text_input(
        "Type english sentence",
        value="This is example input"
    )
    if st.button("Run inference"):
        st.write("Input:", input_sent)
        pred = predictor.predict([input_sent])
        st.write("Pred:", pred)

User Experience

  1. Enter text: The user types or pastes text into the input field
  2. Run inference: Clicking the button triggers a prediction
  3. View results: Probability distributions are displayed immediately
Example output:
Input: This is example input
Pred: [[0.23 0.77]]
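The returned array holds class probabilities, so the UI can map it to a human-readable label with an argmax. A sketch (the label names and class order are assumptions, not confirmed by the model):

```python
import numpy as np

LABELS = ["incorrect", "correct"]  # assumed class order

def to_label(pred_row: np.ndarray) -> tuple[str, float]:
    """Pick the highest-probability class and its confidence."""
    idx = int(np.argmax(pred_row))
    return LABELS[idx], float(pred_row[idx])

label, conf = to_label(np.array([0.23, 0.77]))
print(label, conf)  # correct 0.77
```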

Batch Prediction Interface

Implementation

serving/ui_app.py
def batch_pred():
    uploaded_file = st.file_uploader("Choose a CSV file", type=["csv"])
    if uploaded_file:
        dataframe = pd.read_csv(uploaded_file)
        st.write("Input dataframe")
        st.write(dataframe)
        
        dataframe_with_pred = predictor.run_inference_on_dataframe(dataframe)
        st.write("Result dataframe")
        st.write(dataframe_with_pred)

Batch Predictor Method

serving/predictor.py
from tqdm import tqdm  # imported at the top of predictor.py

def run_inference_on_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
    correct_sentence_conf = []
    for idx in tqdm(range(len(df))):
        sentence = df.iloc[idx]["sentence"]
        conf = self.predict([sentence]).flatten()[1]
        correct_sentence_conf.append(conf)
    df["correct_sentence_conf"] = correct_sentence_conf
    return df
Features:
  • Upload CSV files via drag-and-drop
  • Preview input dataframe
  • Progress tracking with tqdm
  • Results displayed in interactive table
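`run_inference_on_dataframe` calls `predict` once per row. Since `predict` already accepts a list of sentences, a single batched call would avoid the per-row overhead for large CSVs. A hedged sketch against a stub predictor (the real `Predictor` may have batch-size limits this ignores):

```python
import numpy as np
import pandas as pd

class StubPredictor:
    def predict(self, sentences: list[str]) -> np.ndarray:
        # Fake probabilities; the real model returns [p_incorrect, p_correct] rows.
        return np.tile([0.2, 0.8], (len(sentences), 1))

def run_inference_on_dataframe(predictor, df: pd.DataFrame) -> pd.DataFrame:
    # One model call for the whole column instead of len(df) calls.
    probs = predictor.predict(df["sentence"].tolist())
    df = df.copy()
    df["correct_sentence_conf"] = probs[:, 1]
    return df

df = pd.DataFrame({"sentence": ["Great work!", "This is bad example"]})
print(run_inference_on_dataframe(StubPredictor(), df)["correct_sentence_conf"].tolist())
```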

Example Usage

Input CSV:
sentence
This is a good example
This is bad example
Great work!
Output:
sentence,correct_sentence_conf
This is a good example,0.89
This is bad example,0.23
Great work!,0.95

Local Development

Using Make

make run_app_streamlit
This command:
  1. Builds Docker image with app-streamlit target
  2. Runs container on port 8081
  3. Forwards to internal port 8080
  4. Mounts W&B credentials

Using Docker

# Build
docker build -f Dockerfile \
  -t app-streamlit:latest \
  --target app-streamlit .

# Run
docker run -it -p 8081:8080 \
  -e WANDB_API_KEY=${WANDB_API_KEY} \
  app-streamlit:latest

Access the UI

Open browser to http://localhost:8081

Kubernetes Deployment

Manifest

k8s/app-streamlit.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-streamlit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-streamlit
  template:
    metadata:
      labels:
        app: app-streamlit
    spec:
      containers:
        - name: app-streamlit
          image: ghcr.io/kyryl-opens-ml/app-streamlit:latest
          env:
          - name: WANDB_API_KEY
            valueFrom:
              secretKeyRef:
                name: wandb
                key: WANDB_API_KEY
---
apiVersion: v1
kind: Service
metadata:
  name: app-streamlit
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: app-streamlit
Configuration notes:
  • Single replica (Streamlit maintains session state)
  • ClusterIP service for internal access
  • W&B API key injected from Kubernetes secret
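Because the pod pulls the model from W&B at startup, a missing `WANDB_API_KEY` otherwise fails late with an opaque error. A small sketch of an early check the app could run before loading the model (the helper name is illustrative, not part of the codebase):

```python
import os

def require_env(name: str) -> str:
    """Fail fast with a clear message if a required variable is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the Kubernetes secret mount")
    return value

# e.g. call require_env("WANDB_API_KEY") before Predictor.default_from_model_registry()
```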

Deployment Steps

  1. Create cluster:
     kind create cluster --name ml-in-production
  2. Configure secrets:
     export WANDB_API_KEY='your-key'
     kubectl create secret generic wandb \
       --from-literal=WANDB_API_KEY=$WANDB_API_KEY
  3. Deploy application:
     kubectl create -f k8s/app-streamlit.yaml
  4. Monitor deployment:
     kubectl get pods -l app=app-streamlit
     kubectl logs -l app=app-streamlit -f
  5. Access UI:
     kubectl port-forward --address 0.0.0.0 svc/app-streamlit 8080:8080
     Open http://localhost:8080 in a browser

Caching Strategy

Model Caching

@st.cache_data
def get_model() -> Predictor:
    return Predictor.default_from_model_registry()
Benefits:
  • Model loads once per session
  • Faster page reloads during development
  • Shared across all users in production
Use @st.cache_resource for models in production to share state across sessions.

Best Practice

@st.cache_resource
def get_model() -> Predictor:
    return Predictor.default_from_model_registry()
Differences:
  • cache_data: Serializes return value (slower, safer)
  • cache_resource: Shares object reference (faster, use for models)
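The practical difference can be mimicked in plain Python: `cache_data` hands each caller a copy of the cached value, while `cache_resource` hands every caller the same object. An analogy sketch (not Streamlit's actual implementation):

```python
import copy

_cache: dict = {}

def cache_data_style(key, factory):
    # cache_data semantics: compute once, return a fresh copy per call.
    if key not in _cache:
        _cache[key] = factory()
    return copy.deepcopy(_cache[key])

def cache_resource_style(key, factory):
    # cache_resource semantics: compute once, share the same reference.
    if key not in _cache:
        _cache[key] = factory()
    return _cache[key]

model = {"weights": [1, 2, 3]}
a = cache_data_style("m1", lambda: model)
b = cache_resource_style("m2", lambda: model)
print(a is model, b is model)  # False True
```

Copying a large model on every rerun is wasteful, which is why `cache_resource` is the right choice for model objects.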

Testing Streamlit Apps

Streamlit provides testing utilities:
tests/test_ui_app.py
from streamlit.testing.v1 import AppTest

def test_single_prediction():
    at = AppTest.from_file("serving/ui_app.py")
    at.run()
    
    # Simulate user input
    at.text_input[0].set_value("test sentence").run()
    at.button[0].click().run()
    
    # Assert output appears
    assert "Pred:" in at.text[0].value

def test_batch_prediction():
    at = AppTest.from_file("serving/ui_app.py")
    at.run()
    
    # Upload file
    at.file_uploader[0].upload_file("test.csv").run()
    
    # Check results displayed
    assert "correct_sentence_conf" in at.dataframe[1].value.columns
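`test_batch_prediction` assumes a `test.csv` fixture exists. One way to generate it, e.g. from a pytest fixture (the column name matches what `run_inference_on_dataframe` expects; the helper name is illustrative):

```python
import pandas as pd

def make_test_csv(path: str) -> str:
    """Write a tiny CSV with the 'sentence' column the batch endpoint expects."""
    pd.DataFrame(
        {"sentence": ["This is a good example", "Great work!"]}
    ).to_csv(path, index=False)
    return path

make_test_csv("test.csv")
print(pd.read_csv("test.csv").columns.tolist())  # ['sentence']
```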

Production Considerations

Session State

Streamlit keeps per-user session state in the server process, so horizontal scaling requires sticky sessions that pin each user to one replica.
Configuration for load balancers:
apiVersion: v1
kind: Service
metadata:
  name: app-streamlit
spec:
  sessionAffinity: ClientIP

Performance Optimization

Fragment caching for components:
@st.cache_data
def expensive_computation(data):
    return process(data)

def single_pred():
    input_sent = st.text_input("Type sentence")
    if st.button("Run"):
        result = expensive_computation(input_sent)  # Cached
        st.write(result)

Error Handling

Graceful error display:
def single_pred():
    input_sent = st.text_input("Type english sentence")
    if st.button("Run inference"):
        try:
            pred = predictor.predict([input_sent])
            st.success("Prediction complete!")
            st.write("Pred:", pred)
        except Exception as e:
            st.error(f"Prediction failed: {str(e)}")
            st.exception(e)

UI Enhancements

Visualization

Add charts for probability distributions:
import matplotlib.pyplot as plt

def single_pred():
    input_sent = st.text_input("Type english sentence")
    if st.button("Run inference"):
        pred = predictor.predict([input_sent])[0]
        
        # Display as bar chart
        fig, ax = plt.subplots()
        ax.bar(["Negative", "Positive"], pred)
        ax.set_ylabel("Probability")
        st.pyplot(fig)

Configuration Sidebar

def main():
    st.sidebar.header("Configuration")
    threshold = st.sidebar.slider(
        "Confidence threshold",
        min_value=0.0,
        max_value=1.0,
        value=0.5
    )
    
    st.header("UI serving demo")
    # Use threshold in predictions

Comparison: Streamlit vs Gradio

Feature                   | Streamlit   | Gradio
Learning curve            | Low         | Very low
Customization             | High        | Limited
Layout control            | Excellent   | Basic
HuggingFace integration   | Manual      | Built-in
Deployment                | Self-hosted | HF Spaces
Choose Streamlit when:
  • Building internal tools
  • Need custom layouts
  • Require data exploration features
  • Have existing Python codebase
Choose Gradio when:
  • Quick demos for HuggingFace
  • Simple input/output interfaces
  • Want hosted deployment

Best Practices

  • Use caching: Cache expensive operations with @st.cache_data
  • Progress indicators: Show st.spinner() for long-running tasks
  • Input validation: Validate user input before processing
  • Error messages: Display helpful errors with st.error()

Next Steps

Triton Inference Server: Deploy high-performance inference with Triton
