## Features
- Automatic vLLM Setup: Sets up vLLM on fresh Ubuntu pods
- Agentic Model Configuration: Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Smart GPU Allocation: Manages multiple models on the same pod with automatic GPU assignment
- OpenAI-Compatible API: Provides OpenAI-compatible endpoints for each model
- Interactive Agent: Includes an agent with file system tools for testing
- Predefined Model Configs: Built-in configurations for popular models
## Installation

### Prerequisites
- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models
## Quick Start
## Supported Providers

### Primary Support
- DataCrunch - Best for shared model storage:
  - NFS volumes sharable across multiple pods in same region
  - Models download once, use everywhere
  - Ideal for teams or multiple experiments
- RunPod - Good for single-pod workflows:
  - Network volumes persist independently
  - Cannot share between running pods simultaneously
### Also Works With
- Vast.ai (volumes locked to specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH
## Commands

### Pod Management
- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only
### Model Management

### Agent & Chat Interface
## Predefined Model Configurations

pi includes predefined configurations for popular agentic models. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.
### Qwen Models

### GPT-OSS Models

### GLM Models
### Custom Models with `--vllm`

For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:
## Provider Setup Examples

### DataCrunch Setup
DataCrunch offers the best experience, with shared NFS storage across pods:

1. Create Shared Filesystem (SFS)
   - Go to DataCrunch dashboard → Storage → Create SFS
   - Choose size and datacenter
   - Note the mount command
2. Create GPU Instance
   - Create instance in same datacenter as SFS
   - Share the SFS with the instance
   - Get SSH command from dashboard
3. Setup with pi

With shared storage:
- Models persist across instance restarts
- Share models between multiple instances in same datacenter
- Download once, use everywhere
### RunPod Setup

1. Create Network Volume (optional)
   - Go to RunPod dashboard → Storage → Create Network Volume
   - Choose size and region
2. Create GPU Pod
   - Select "Network Volume" during pod creation
   - Attach your volume to `/runpod-volume`
   - Get SSH command from pod details
3. Setup with pi
## Multi-GPU Support

### Automatic GPU Assignment
When running multiple models, pi automatically assigns them to different GPUs.

### Specify GPU Count for Predefined Models

### Tensor Parallelism for Large Models
## API Integration

All models expose OpenAI-compatible endpoints.

### Tool Calling Support
pi automatically configures appropriate tool calling parsers for known models:
- Qwen models: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- GLM models: `glm4_moe` parser with reasoning support
- GPT-OSS models: uses the `/v1/responses` endpoint
- Custom models: specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`
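Since the endpoints are OpenAI-compatible, a tool-calling request is just a standard chat-completion payload with a `tools` array. A minimal sketch of such a request body; the model name and the `list_files` tool are illustrative placeholders, not part of pi itself:

```python
import json

# Hypothetical OpenAI-style chat request with one tool definition.
# Model name and tool schema below are illustrative, not from pi.
request_body = {
    "model": "Qwen/Qwen3-Coder",
    "messages": [
        {"role": "user", "content": "List the files in the current directory."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "list_files",
                "description": "List files in a directory",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }
    ],
}

print(json.dumps(request_body, indent=2))
```

Any OpenAI-compatible client can send this body to the model's endpoint; the parser configured by pi determines how the model's tool-call output is decoded.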
## Memory and Context Management

### GPU Memory Allocation
Controls how much GPU memory vLLM pre-allocates:

- `--memory 30%`: High concurrency, limited context
- `--memory 50%`: Balanced (default)
- `--memory 90%`: Maximum context, low concurrency
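To make the trade-off concrete, here is the arithmetic behind the percentage; the 80 GB GPU size is an assumed example, not something this tool requires:

```python
def preallocated_gb(gpu_total_gb: float, memory_percent: float) -> float:
    """GPU memory pre-allocated at a given --memory percentage."""
    return gpu_total_gb * memory_percent / 100

# Assuming an 80 GB GPU (illustrative):
for pct in (30, 50, 90):
    print(f"--memory {pct}% -> {preallocated_gb(80, pct):.0f} GB pre-allocated")
```

Memory not pre-allocated stays free for other models on the same pod, which is why lower percentages allow more models (higher concurrency) at the cost of context room.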
### Context Window

Sets the maximum input + output tokens:

- `--context 4k`: 4,096 tokens total
- `--context 32k`: 32,768 tokens total
- `--context 128k`: 131,072 tokens total
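The shorthand sizes above expand with 1k = 1,024 tokens; a small helper shows the mapping explicitly (an illustrative sketch, not pi's actual parser):

```python
def context_tokens(spec: str) -> int:
    """Expand a context shorthand like '32k' into a token count (1k = 1024)."""
    if spec.lower().endswith("k"):
        return int(spec[:-1]) * 1024
    return int(spec)

assert context_tokens("4k") == 4_096
assert context_tokens("32k") == 32_768
assert context_tokens("128k") == 131_072
```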
## Session Persistence

The interactive agent mode (`-i`) saves sessions for each project directory under `~/.pi/sessions/`, organized by project path.
## Environment Variables

- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` is provided
## Troubleshooting

### OOM (Out of Memory) Errors
- Reduce the `--memory` percentage
- Use a smaller model or quantized version (FP8)
- Reduce the `--context` size
### Model Won't Start

### Tool Calling Issues
- Not all models support tool calling reliably
- Try a different parser: `--vllm --tool-call-parser mistral`
- Or disable: `--vllm --disable-tool-call-parser`