pi-pods

Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.

Features

  • Automatic vLLM Setup: Sets up vLLM on fresh Ubuntu pods
  • Agentic Model Configuration: Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
  • Smart GPU Allocation: Manages multiple models on the same pod with automatic GPU assignment
  • OpenAI-Compatible API: Provides OpenAI-compatible endpoints for each model
  • Interactive Agent: Includes an agent with file system tools for testing
  • Predefined Model Configs: Built-in configurations for popular models

Installation

npm install -g @mariozechner/pi-pods

Prerequisites

  • Node.js 18+
  • HuggingFace token (for model downloads)
  • GPU pod with:
    • Ubuntu 22.04 or 24.04
    • SSH root access
    • NVIDIA drivers installed
    • Persistent storage for models

Quick Start

# Set required environment variables
export HF_TOKEN=your_huggingface_token      # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key              # Any string you want for API authentication

# Setup a DataCrunch pod with NFS storage
pi-pods setup dc1 "ssh [email protected]" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Start a model (automatic configuration for known models)
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Send a single message to the model
pi agent qwen "What is the Fibonacci sequence?"

# Interactive chat mode with file system tools
pi agent qwen -i

# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$PI_API_KEY

Supported Providers

Primary Support

DataCrunch - Best for shared model storage
  • NFS volumes shareable across multiple pods in the same region
  • Models download once, use everywhere
  • Ideal for teams or multiple experiments
RunPod - Good persistent storage
  • Network volumes persist independently
  • Cannot be shared between running pods simultaneously
  • Good for single-pod workflows

Also Works With

  • Vast.ai (volumes locked to specific machine)
  • Prime Intellect (no persistent storage)
  • AWS EC2 (with EFS setup)
  • Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH

Commands

Pod Management

pi-pods setup <name> "<ssh>" [options]        # Setup new pod
  --mount "<mount_command>"                   # Run mount command during setup
  --models-path <path>                        # Override extracted path (optional)
  --vllm release|nightly|gpt-oss              # vLLM version (default: release)

pi-pods                                       # List all configured pods
pi-pods active <name>                         # Switch active pod
pi-pods remove <name>                         # Remove pod from local config
pi shell [<name>]                             # SSH into pod
pi ssh [<name>] "<command>"                   # Run command on pod

vLLM Version Options:
  • release (default): Stable vLLM release, recommended for most users
  • nightly: Latest vLLM features, needed for newest models like GLM-4.5
  • gpt-oss: Special build for OpenAI’s GPT-OSS models only

Model Management

pi start <model> --name <name> [options]  # Start a model
  --memory <percent>      # GPU memory: 30%, 50%, 90% (default: 90%)
  --context <size>        # Context window: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>          # Number of GPUs to use (predefined models only)
  --pod <name>            # Target specific pod (overrides active)
  --vllm <args...>        # Pass custom args directly to vLLM

pi stop [<name>]          # Stop model (or all if no name given)
pi list                   # List running models with status
pi logs <name>            # Stream model logs (tail -f)

Agent & Chat Interface

pi agent <name> "<message>"               # Single message to model
pi agent <name> "<msg1>" "<msg2>"         # Multiple messages in sequence
pi agent <name> -i                        # Interactive chat mode
pi agent <name> -i -c                     # Continue previous session

# Standalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?"  # Uses OpenAI by default
pi-agent --json "What is 2+2?"            # Output event stream as JSONL
pi-agent -i                                # Interactive mode
The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities.

Predefined Model Configurations

pi includes predefined configurations for popular agentic models. Run pi start without additional arguments to see a list of predefined models that can run on the active pod.

Qwen Models

# Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3

# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b

GPT-OSS Models

# Requires special vLLM build during setup
pi-pods setup gpt-pod "ssh [email protected]" --models-path /workspace --vllm gpt-oss

# GPT-OSS-20B - Fits on 16GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20

# GPT-OSS-120B - Needs 60GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120

GLM Models

# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm

# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air

Custom Models with --vllm

For models not in the predefined list, use --vllm to pass arguments directly to vLLM:
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
  --tensor-parallel-size 4 --trust-remote-code

# Mistral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
  --tensor-parallel-size 8 --pipeline-parallel-size 2

# Any model with specific tool parser
pi start some/model --name mymodel --vllm \
  --tool-call-parser hermes --enable-auto-tool-choice

Provider Setup Examples

DataCrunch Setup

DataCrunch offers the best experience with shared NFS storage across pods:
  1. Create Shared Filesystem (SFS)
    • Go to DataCrunch dashboard → Storage → Create SFS
    • Choose size and datacenter
    • Note the mount command
  2. Create GPU Instance
    • Create instance in same datacenter as SFS
    • Share the SFS with the instance
    • Get SSH command from dashboard
  3. Setup with pi
    pi-pods setup dc1 "ssh [email protected]" \
      --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
    
Benefits:
  • Models persist across instance restarts
  • Share models between multiple instances in same datacenter
  • Download once, use everywhere

RunPod Setup

  1. Create Network Volume (optional)
    • Go to RunPod dashboard → Storage → Create Network Volume
    • Choose size and region
  2. Create GPU Pod
    • Select “Network Volume” during pod creation
    • Attach your volume to /runpod-volume
    • Get SSH command from pod details
  3. Setup with pi
    # With network volume
    pi-pods setup runpod "ssh [email protected]" --models-path /runpod-volume
    
    # Or use workspace (persists with pod but not shareable)
    pi-pods setup runpod "ssh [email protected]" --models-path /workspace
    

Multi-GPU Support

Automatic GPU Assignment

When running multiple models, pi automatically assigns them to different GPUs:
pi start model1 --name m1  # Auto-assigns to GPU 0
pi start model2 --name m2  # Auto-assigns to GPU 1
pi start model3 --name m3  # Auto-assigns to GPU 2
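The assignment behaves like a lowest-free-index scheme. The following is an illustrative sketch only, not pi's actual implementation; it mimics the behavior shown above, where each new model claims the lowest-numbered GPU not already in use:

```python
# Illustrative sketch (not pi's real code): each new model gets the
# lowest-numbered GPU that no running model has claimed yet.
def assign_gpu(running_models: dict, total_gpus: int) -> int:
    """Return the first free GPU index, or raise if all are taken."""
    used = set(running_models.values())
    for gpu in range(total_gpus):
        if gpu not in used:
            return gpu
    raise RuntimeError("all GPUs are in use")

running = {}
for name in ("m1", "m2", "m3"):
    running[name] = assign_gpu(running, total_gpus=4)

print(running)  # {'m1': 0, 'm2': 1, 'm3': 2}
```

Stopping a model frees its GPU for the next pi start.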

Specify GPU Count for Predefined Models

# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1

# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8

Tensor Parallelism for Large Models

# Tensor parallelism across 4 GPUs
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
  --tensor-parallel-size 4

# Data parallelism with expert parallelism across 8 GPUs
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
  --data-parallel-size 8 --enable-expert-parallel

API Integration

All models expose OpenAI-compatible endpoints:
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key"
)

# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
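When the model decides to call a tool, the reply carries tool_calls instead of text content. Here is a sketch of handling that case, using a hand-written dict in the standard OpenAI response shape (the client library returns the same fields as object attributes, e.g. response.choices[0].message.tool_calls):

```python
import json

# Hand-written example payload in the OpenAI tool-calling response shape;
# an actual response from the endpoint follows the same structure.
raw = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "execute_code",
                    # Arguments always arrive as a JSON-encoded string.
                    "arguments": "{\"code\": \"print(1+1)\"}",
                },
            }],
        }
    }]
}

call = raw["choices"][0]["message"]["tool_calls"][0]
args = json.loads(call["function"]["arguments"])
print(call["function"]["name"], args["code"])  # execute_code print(1+1)
```

After executing the tool, append a message with role "tool" (echoing the call's id) and request another completion so the model can use the result.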

Tool Calling Support

pi automatically configures appropriate tool calling parsers for known models:
  • Qwen models: hermes parser (Qwen3-Coder uses qwen3_coder)
  • GLM models: glm4_moe parser with reasoning support
  • GPT-OSS models: Uses /v1/responses endpoint
  • Custom models: Specify with --vllm --tool-call-parser <parser> --enable-auto-tool-choice
To disable tool calling:
pi start model --name mymodel --vllm --disable-tool-call-parser

Memory and Context Management

GPU Memory Allocation

Controls how much GPU memory vLLM pre-allocates:
  • --memory 30%: High concurrency, limited context
  • --memory 50%: Balanced
  • --memory 90%: Maximum context, low concurrency

Context Window

Sets maximum input + output tokens:
  • --context 4k: 4,096 tokens total
  • --context 32k: 32,768 tokens total
  • --context 128k: 131,072 tokens total
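The size suffixes follow the usual binary convention, 1k = 1,024 tokens; a quick sketch of the conversion:

```python
# Convert a context-size flag like "32k" into a token count (1k = 1,024 tokens).
def context_tokens(size: str) -> int:
    if not size.endswith("k"):
        raise ValueError("expected a size like '4k' or '128k'")
    return int(size[:-1]) * 1024

print(context_tokens("128k"))  # 131072
```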
Example for coding workload:
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
  --context 64k --memory 70%

Session Persistence

The interactive agent mode (-i) saves sessions for each project directory:
# Start new session
pi agent qwen -i

# Continue previous session (maintains chat history)
pi agent qwen -i -c
Sessions are stored in ~/.pi/sessions/ organized by project path.

Environment Variables

  • HF_TOKEN - HuggingFace token for model downloads
  • PI_API_KEY - API key for vLLM endpoints
  • PI_CONFIG_DIR - Config directory (default: ~/.pi)
  • OPENAI_API_KEY - Used by pi-agent when no --api-key provided

Troubleshooting

OOM (Out of Memory) Errors

  • Reduce --memory percentage
  • Use smaller model or quantized version (FP8)
  • Reduce --context size

Model Won’t Start

# Check GPU usage
pi ssh "nvidia-smi"

# Check if port is in use
pi list

# Force stop all models
pi stop

Tool Calling Issues

  • Not all models support tool calling reliably
  • Try different parser: --vllm --tool-call-parser mistral
  • Or disable: --vllm --disable-tool-call-parser

Access Denied for Models

Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click “Request access”.

License

MIT
