Overview
Custom providers allow you to:
- Use alternative AI services
- Run models locally
- Connect to self-hosted endpoints
- Access specialized model providers
- Integrate with enterprise AI platforms
OpenAI-Compatible Providers
Many services implement the OpenAI Chat Completions API format. Forge can work with any such service.
Setup Steps
Get Provider Details
Collect from your provider:
- API base URL (e.g., https://api.example.com/v1)
- API key or authentication token
- Available model names
Configure in Forge
- OPENAI_URL: Your provider’s base URL
- API Key: Your authentication key
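As a minimal sketch, these two values are typically supplied as environment variables. OPENAI_URL is the name used in this guide, while OPENAI_API_KEY is an assumed name for the key variable; check your Forge version for the exact one:

```shell
# Point Forge at an OpenAI-compatible endpoint (example.com is a placeholder).
export OPENAI_URL="https://api.example.com/v1"
# Assumed name for the authentication variable.
export OPENAI_API_KEY="sk-your-key-here"
```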
Supported Services
Popular OpenAI-compatible services:
Cloud Services
- Groq - Ultra-fast inference
- Together AI - Open model hosting
- Fireworks AI - Production inference
- Anyscale Endpoints - Ray-powered serving
Local Inference
- Ollama - Easy local deployment
- LM Studio - Desktop GUI for local models
- llama.cpp - C++ inference engine
- vLLM - Fast LLM serving
- Jan AI - Privacy-focused desktop app
Example: Groq
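Groq's OpenAI-compatible base URL is https://api.groq.com/openai/v1. A configuration sketch; the key-variable name, forge.yaml field, and model name are assumptions to adapt to your setup:

```shell
# Groq exposes an OpenAI-compatible Chat Completions endpoint.
export OPENAI_URL="https://api.groq.com/openai/v1"
export OPENAI_API_KEY="gsk_your_groq_key"          # assumed variable name

# Select a Groq-hosted model (assumed forge.yaml field):
printf 'model: llama-3.1-8b-instant\n' > forge.yaml
```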
Groq provides ultra-fast inference over its OpenAI-compatible API.
Example: Ollama
Ollama runs models locally:
1. Install and start Ollama
2. Configure Forge to point at Ollama’s local endpoint
3. Set the model in forge.yaml
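The three steps above can be sketched as follows. Ollama's default endpoint is http://localhost:11434, with its OpenAI-compatible API under /v1; the Forge-side variable names and forge.yaml field are assumptions:

```shell
# 1. Install and start Ollama (see ollama.com), then pull a model:
#      ollama pull llama3
# 2. Point Forge at Ollama's OpenAI-compatible endpoint.
export OPENAI_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"   # Ollama ignores the key; some clients require a value
# 3. Select the model (assumed forge.yaml field):
printf 'model: llama3\n' > forge.yaml
```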
Example: LM Studio
LM Studio provides a desktop GUI:
- Download and start LM Studio
- Load a model in the GUI
- Start the local server (default port: 1234)
- Configure Forge:
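A sketch of that last step; port 1234 is LM Studio's default (noted above), and the variable names are assumptions:

```shell
# LM Studio's local server speaks the OpenAI API on port 1234 by default.
export OPENAI_URL="http://localhost:1234/v1"
export OPENAI_API_KEY="lm-studio"   # assumed; many local servers accept any value
```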
Anthropic-Compatible Providers
Some services implement Anthropic’s Messages API format.
Setup Steps
Configure in Forge
- ANTHROPIC_URL: Provider base URL
- API Key: Authentication key
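As with the OpenAI-compatible setup, a sketch; ANTHROPIC_URL is named above, while the key-variable name is an assumption:

```shell
# Point Forge at an Anthropic-compatible Messages API endpoint (placeholder URL).
export ANTHROPIC_URL="https://api.example.com"
export ANTHROPIC_API_KEY="your-key-here"   # assumed variable name
```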
Advanced Configuration
Multiple Custom Providers
You can configure multiple custom providers by creating a provider.json file:
- ~/.config/forge/provider.json (user-wide)
- ./provider.json (project-specific)
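The shape of provider.json below is purely illustrative, not Forge's documented schema; consult your Forge version's reference before copying it:

```json
{
  "providers": [
    {
      "id": "groq",
      "url": "https://api.groq.com/openai/v1",
      "api_key_env": "GROQ_API_KEY"
    },
    {
      "id": "ollama",
      "url": "http://localhost:11434/v1"
    }
  ]
}
```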
Custom Model Definitions
You can also define model metadata for custom providers in provider.json.
Local Model Configuration
Ollama (Detailed)
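A sketch of the full flow; the ollama commands are its documented CLI, the install URL is Ollama's published script, and the Forge-side variable name is an assumption:

```shell
# Install Ollama (Linux; see ollama.com for other platforms):
#   curl -fsSL https://ollama.com/install.sh | sh
# Pull and smoke-test a model:
#   ollama pull llama3
#   ollama run llama3 "Say hello"
# Point Forge at the local OpenAI-compatible endpoint:
export OPENAI_URL="http://localhost:11434/v1"
```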
A complete Ollama setup installs the server, pulls a model, and points Forge at the local endpoint.
llama.cpp Server
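llama.cpp ships an HTTP server binary (llama-server) that serves GGUF models over an OpenAI-compatible API. A launch sketch; the model path is a placeholder:

```shell
# Serve a local GGUF model on port 8080:
#   llama-server -m ./models/your-model.gguf --port 8080
# Then point Forge at it:
export OPENAI_URL="http://localhost:8080/v1"
```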
Run the llama.cpp server and point Forge at its OpenAI-compatible endpoint.
vLLM
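vLLM's OpenAI-compatible server is started via its documented entry point; the model name below is a placeholder:

```shell
# Serve a model with vLLM's OpenAI-compatible API on port 8000:
#   python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-8B-Instruct --port 8000
# Then point Forge at it:
export OPENAI_URL="http://localhost:8000/v1"
```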
Deploy a vLLM server and point Forge at it the same way.
Troubleshooting
Connection Refused
If Forge can’t connect:
- Verify the server is running
- Check the URL and port are correct
- Ensure no firewall blocking
- Test with curl:
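For example, against a local OpenAI-compatible server (port 11434 is Ollama's default; adjust for yours):

```shell
# Prints the HTTP status code, or 000 if the connection fails entirely.
status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:11434/v1/models || true)
echo "$status"
```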
Invalid Model Name
If the model is not found:
- List available models:
- Verify the spelling in forge.yaml
- Check the model is loaded/running
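For example, OpenAI-compatible servers expose their model list at /v1/models, and Ollama also has a CLI for it:

```shell
# List models on an OpenAI-compatible endpoint (adjust host/port):
curl -s http://localhost:11434/v1/models || echo "server not reachable"
# Ollama's CLI equivalent:
#   ollama list
```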
Unsupported Features
Some providers may not support:
- Tool calling / function calling
- Parallel tool execution
- Streaming responses
- Vision/multimodal input
Authentication Errors
If authentication fails:
- Verify the API key is correct
- Check if key is required (some local servers don’t need keys)
- Try with and without the key
- Check provider-specific auth format
Performance Issues
For slow local inference:
- Use GPU acceleration if available
- Reduce context length
- Use quantized models (e.g., GGUF Q4)
- Increase server worker threads
- Consider cloud providers for production
Deprecated: Environment Variable Setup
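A sketch of the older setup; OPENAI_URL and ANTHROPIC_URL appear earlier in this guide, while the key-variable names are assumptions:

```shell
# Deprecated: configure providers directly via environment variables.
export OPENAI_URL="https://api.example.com/v1"
export OPENAI_API_KEY="sk-your-key"        # assumed variable name
export ANTHROPIC_URL="https://api.example.com"
export ANTHROPIC_API_KEY="your-key"        # assumed variable name
```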
These variables are kept for backward compatibility; prefer provider.json for new setups.
Best Practices
Security
- Use authentication even for local servers
- Keep API keys in secure storage
- Use HTTPS in production
- Implement rate limiting
Performance
Local Inference:
- Use GPU when available (CUDA, Metal, ROCm)
- Choose appropriate quantization (Q4, Q5, Q8)
- Tune context length to your needs
- Monitor memory usage
Cloud Inference:
- Choose providers near your location
- Monitor response times
- Implement caching when possible
- Use streaming for long responses
Cost Management
Local Models:
- Free inference (after hardware cost)
- Pay only for electricity
- No rate limits
- Full privacy
Cloud Models:
- Compare per-token pricing
- Monitor usage carefully
- Set spending limits
- Use cheaper models when appropriate
Next Steps
- Explore Ollama models
- Try Groq for fast inference
- Set up monitoring for your custom provider
- Configure retry logic for reliability