Overview
NVIDIA Nsight Systems provides application-level profiling with metric sampling capabilities that bridge the gap between timing analysis and kernel-level deep dives. Key capabilities:
- Toggle the CUDA profiler on and off to focus on specific regions
- PyTorch profiler integration (PyTorch workflow only)
- NVTX markers for understanding execution phases
- Metric collection for GPU utilization analysis
Profiling Features
CUDA Profiler Control
Toggling the CUDA profiler runtime API on and off provides:
- Precise control over profiled regions
- Smaller profile files for faster post-processing
- Focused analysis on iterations of interest
PyTorch Profiler (PyTorch Workflow Only)
For PyTorch backend users:
- Detailed performance breakdown of model execution
- Chrome tracing visualization
- CPU/GPU timeline analysis
NVTX Markers
- Basic NVTX markers enabled by default (PyTorch workflow)
- Enhanced markers available for debugging
- Garbage collection tracking
- Python GIL (Global Interpreter Lock) visibility
Environment Variables
TLLM_PROFILE_START_STOP
Specify the iteration range to profile as A-B, where A is the start iteration and B is the end iteration.
Example: TLLM_PROFILE_START_STOP=100-150
Usage: Combine with nsys profile -c cudaProfilerApi to collect only the specified iterations.

TLLM_NVTX_DEBUG
Enable verbose NVTX markers for debugging.
Example: TLLM_NVTX_DEBUG=1
Effect: Adds detailed NVTX markers throughout execution for granular analysis.

TLLM_PROFILE_RECORD_GC
Enable garbage collection NVTX markers.
Example: TLLM_PROFILE_RECORD_GC=1
Use case: Identify whether Python GC is causing performance hiccups.

TLLM_TORCH_PROFILE_TRACE
Path to save the PyTorch profiler trace (PyTorch workflow only).
Example: TLLM_TORCH_PROFILE_TRACE=trace.json
Visualization: Open the JSON file in chrome://tracing/

TLLM_LLMAPI_ENABLE_NVTX
Enable NVTX markers in the LLM API layer.
Example: TLLM_LLMAPI_ENABLE_NVTX=1

Basic Profiling Workflow
Run with Profiling
Profile specific iterations (e.g., iterations 100-150):
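A minimal sketch of this step, assuming `run_benchmark.py` is a placeholder for your workload's entry point (substitute your own script and arguments):

```shell
# Collect only iterations 100-150; the CUDA profiler API gates collection.
nsys profile \
  -o trace \
  -t cuda,nvtx \
  -c cudaProfilerApi \
  -e TLLM_PROFILE_START_STOP=100-150 \
  python run_benchmark.py
```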
This creates trace.nsys-rep.

Advanced Profiling Example
Comprehensive profiling with all debugging features enabled:
- -t 'cuda,nvtx,python-gil' - Collect CUDA kernels, NVTX markers, and Python GIL events
- -c cudaProfilerApi - Only profile the iterations specified by TLLM_PROFILE_START_STOP
- --cuda-graph-trace node - Capture CUDA graph structure
- -e TLLM_PROFILE_RECORD_GC=1 - Record garbage collection events
- -e TLLM_LLMAPI_ENABLE_NVTX=1 - Enable LLM API markers
- -e TLLM_TORCH_PROFILE_TRACE=trace.json - Save the PyTorch trace
- --trace-fork-before-exec=true - Properly handle process forking
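Put together, these flags form a command along the following lines; `run_benchmark.py` is a placeholder for your workload, and the environment variables are passed through nsys's `--env-var`/`-e` option (comma-separated, since repeating the flag may override earlier values):

```shell
nsys profile \
  -o trace \
  -t 'cuda,nvtx,python-gil' \
  -c cudaProfilerApi \
  --cuda-graph-trace node \
  --trace-fork-before-exec=true \
  -e TLLM_PROFILE_START_STOP=100-150,TLLM_PROFILE_RECORD_GC=1,TLLM_LLMAPI_ENABLE_NVTX=1,TLLM_TORCH_PROFILE_TRACE=trace.json \
  python run_benchmark.py
```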
This creates two outputs:
- trace.nsys-rep - Nsight Systems report (open with nsys-ui)
- trace.json - PyTorch profiler trace (open in chrome://tracing/)
Profiling trtllm-serve
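A sketch of wrapping the server launch in the same nsys invocation; the model name here is a placeholder, and the trace sources can be trimmed or extended as needed:

```shell
nsys profile \
  -o serve_trace \
  -t cuda,nvtx \
  -c cudaProfilerApi \
  -e TLLM_PROFILE_START_STOP=100-150 \
  trtllm-serve meta-llama/Llama-3.1-8B-Instruct
```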
The same profiling approach works for the OpenAI-compatible server.

PyTorch Profiler Workflow
For detailed PyTorch-level analysis, run with TLLM_TORCH_PROFILE_TRACE set so a trace file is written.

Visualize in Chrome
- Open Chrome browser
- Navigate to chrome://tracing/
- Click “Load” and select trace.json
- Analyze the timeline:
  - CPU operations (Python, data loading)
  - GPU kernels
  - Memory allocations
  - Operator-level breakdown
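For a quick operator-level summary without opening Chrome, the trace can be processed directly: it follows the Chrome tracing JSON format, where complete events (`"ph": "X"`) carry a duration in microseconds. The helper below is illustrative, not part of any TensorRT-LLM API:

```python
import json
from collections import Counter

def top_ops(path, n=10):
    """Return the n operators with the largest total duration in a
    Chrome-trace JSON file (e.g., one saved via TLLM_TORCH_PROFILE_TRACE)."""
    with open(path) as f:
        trace = json.load(f)
    # The trace may be a bare event list or wrapped in {"traceEvents": [...]}.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = Counter()
    for ev in events:
        # Complete ("X") events carry a duration ("dur") in microseconds.
        if ev.get("ph") == "X":
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    return totals.most_common(n)
```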
MoE-Specific Profiling
For Mixture-of-Experts models, you can analyze expert load balancing.

Perfect Router Analysis
Isolate routing inefficiencies by comparing with idealized load balancing:
- Bypasses the learned router
- Uses pre-computed, perfectly balanced routing logits
- Shows theoretical maximum throughput with ideal load distribution
Interpreting MoE Results
| Scenario | Interpretation | Action |
|---|---|---|
| Similar performance with/without perfect router | Routing is not a bottleneck | Focus optimization elsewhere |
| >10% improvement with perfect router | Router causing load imbalance | Optimize routing or try alternative strategies |
Perfect router currently supports:
- GPT-OSS (uses RenormalizeMoeRoutingMethod)
- DeepSeek-V3 / DeepSeek-R1 (uses DeepSeekV3MoeRoutingMethod)
Key Metrics to Analyze
GPU Utilization
Look for:
- SM (Streaming Multiprocessor) utilization - Should be >80% for good throughput
- Memory bandwidth utilization - Should be >70% for memory-bound operations
- Tensor Core utilization - Critical for FP16/BF16/FP8 matmuls
Kernel Patterns
Identify:
- Dominant kernels - Which operations consume the most time?
- Launch overhead - Are small kernels causing inefficiency?
- Synchronization gaps - Are there idle periods between kernels?
Memory Access Patterns
- KV cache access - Should show efficient blocked/paged access
- Activation memory - Watch for unnecessary copies
- Memory allocations - Frequent allocations indicate inefficiency
Common Profiling Patterns
Pattern 1: Low GPU Utilization
Symptom: GPU utilization below 50% in Nsight Systems

Possible causes:
- Batch size too small
- CPU bottleneck in data loading
- Insufficient parallelism

Solutions:
- Increase max_batch_size
- Check Python GIL markers for CPU contention
- Enable CUDA graphs to reduce launch overhead
Pattern 2: Memory Bandwidth Bound
Symptom: Memory bandwidth utilization above 90%, SM utilization below 70%

Possible causes:
- Large model on limited memory bandwidth
- Inefficient memory access patterns

Solutions:
- Enable FP8 quantization (reduces data movement)
- Use an FP8 KV cache (2x memory savings)
- Consider tensor parallelism to aggregate bandwidth
Pattern 3: Compute Bound
Symptom: SM utilization above 90%, memory bandwidth below 60%

Interpretation: This is a good problem: the GPU is being fully utilized.

Options to go further:
- Use faster GPUs (e.g., H100 vs. A100)
- Enable speculative decoding (reduces compute per token)
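The three utilization patterns above can be encoded as a rough triage rule. This helper is purely illustrative (the thresholds come from the symptom descriptions, and the function is not part of any TensorRT-LLM API); inputs are percentages as read from Nsight Systems:

```python
def classify_bottleneck(sm_util, mem_bw_util):
    """Rough triage mirroring Patterns 1-3 above (illustrative only)."""
    if sm_util < 50:
        return "low-utilization"   # Pattern 1: batch size / CPU bottleneck
    if mem_bw_util > 90 and sm_util < 70:
        return "memory-bound"      # Pattern 2: data movement dominates
    if sm_util > 90 and mem_bw_util < 60:
        return "compute-bound"     # Pattern 3: GPU fully utilized
    return "mixed"                 # No single pattern dominates
```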
Pattern 4: Kernel Launch Overhead
Symptom: Many short-duration kernels with gaps between them

Possible causes:
- CUDA graphs not enabled
- Small batch sizes

Solution: Enable CUDA graphs.
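One way to enable CUDA graphs in the PyTorch workflow is via an extra-options YAML passed to the server or benchmark; the section and field names below are a sketch and may differ across TensorRT-LLM versions, so check the configuration reference for your release:

```yaml
# extra-llm-api-options.yaml (field names are assumptions; verify per version)
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256
```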
Profiling Best Practices
Troubleshooting
Profile Files Too Large
Problem: Nsight Systems creates multi-GB files

Solution:
- Use -c cudaProfilerApi with TLLM_PROFILE_START_STOP to limit profiled iterations
- Reduce traced iterations (e.g., 10-50 instead of 100-500)
- Disable unnecessary trace sources (e.g., remove python-gil if not needed)
Cannot Open Profile in Nsight Systems
Problem: Nsight Systems version mismatch

Solution:
- Ensure your Nsight Systems version matches or is newer than the profiler version used to collect the trace
- Download the latest Nsight Systems from NVIDIA Developer
Missing NVTX Markers
Problem: No NVTX markers visible in the trace

Solution:
- Ensure -t nvtx is specified in the nsys profile command
- For the C++/TensorRT workflow: rebuild with the --nvtx flag
- Set TLLM_LLMAPI_ENABLE_NVTX=1 for API-level markers
Performance Analysis Checklist
- Profiled representative workload
- Used TLLM_PROFILE_START_STOP to focus on steady-state iterations
- Collected CUDA and NVTX traces
- Analyzed GPU utilization (SM and memory bandwidth)
- Identified dominant kernels
- Checked for CPU bottlenecks (Python GIL markers)
- Compared baseline vs optimized configurations
- Documented findings and optimization opportunities
Related Resources
Benchmarking
Measure throughput and latency with trtllm-bench
Optimization Guide
Performance tuning best practices
Nsight Systems Documentation
Official NVIDIA Nsight Systems guide
PyTorch Profiler
PyTorch profiler documentation