
Inference Optimization

The final piece of the puzzle: running real-time object detection on Raspberry Pi and integrating it with your robotic arm. This lesson covers inference pipelines, performance optimization, and complete system integration.

Learning Objectives

By the end of this lesson, you will be able to:
  • Load and run optimized YOLO models efficiently
  • Build inference pipelines for real-time video processing
  • Optimize for maximum FPS on Raspberry Pi
  • Process detection results for robot control
  • Integrate vision with serial communication
  • Debug and monitor system performance
This lesson uses the complete implementation from course/vision_class/, including inference, processing, and video streaming.

Inference Architecture

System Components

Pipeline Stages:
  1. Capture: Get frame from camera
  2. Preprocess: Resize, normalize for model input
  3. Inference: Run YOLO detection
  4. Postprocess: Parse results, filter by confidence
  5. Selection: Choose best detection
  6. Communication: Send target to robot
  7. Visualization: Display results (optional)
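The stages above can be sketched as plain functions with stubbed capture and inference (the stub detector, sample class names, and confidence values below are illustrative, not the course implementation):

```python
def capture() -> dict:
    # Stage 1: stand-in for camera.read(); real code returns a BGR frame
    return {"height": 480, "width": 640}

def preprocess(frame: dict) -> dict:
    # Stage 2: placeholder; real code resizes/normalizes for the model
    return frame

def inference(frame: dict) -> list:
    # Stage 3: stub detector; real code calls model.predict(frame)
    return [{"class": "apple", "confidence": 0.9, "box": [100, 100, 200, 200]},
            {"class": "bottle", "confidence": 0.4, "box": [300, 50, 360, 200]}]

def postprocess(dets: list, conf: float = 0.45) -> list:
    # Stage 4: drop detections below the confidence threshold
    return [d for d in dets if d["confidence"] >= conf]

def select_best(dets: list):
    # Stage 5: highest-confidence detection wins (None if nothing left)
    return max(dets, key=lambda d: d["confidence"], default=None)

def run_once():
    # Stages 6-7 (serial communication, visualization) are omitted here
    frame = preprocess(capture())
    return select_best(postprocess(inference(frame)))
```

Running one pass keeps only the 0.9-confidence apple; the 0.4-confidence bottle falls below the threshold.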

Model Loading

Efficient Model Loader

From inference/model_loader.py:1-15:
import os
from typing import Dict
from ultralytics import YOLO

class ModelLoader:
    def __init__(self):
        current_path = os.path.dirname(os.path.abspath(__file__))
        object_model_path: str = current_path + '/models/mnn/yolo11s.mnn'
        self.model: YOLO = YOLO(object_model_path, task='detect')
        
    def get_model(self) -> YOLO:
        return self.model
Key Design Points:
  • Relative Path: Uses __file__ for portability
  • MNN Format: Optimized for Raspberry Pi ARM CPU
  • Single Load: Model loaded once at initialization
  • Task Specification: Explicit task='detect' for detection mode
Performance Tip: Load models during initialization, not per-frame. Model loading takes 1-2 seconds but only needs to happen once.
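One way to guarantee single loading even if the loader is called per-frame is to memoize it. This sketch uses a stub in place of the real `YOLO(...)` call so the effect is measurable:

```python
from functools import lru_cache

load_count = 0  # counts how many times the (stub) model is actually built

@lru_cache(maxsize=1)
def get_model(path: str) -> dict:
    # Stand-in for YOLO(path, task='detect'); the expensive load
    # runs only on the first call, later calls hit the cache
    global load_count
    load_count += 1
    return {"path": path}

# Simulate 100 per-frame calls: the model is constructed exactly once
for _ in range(100):
    model = get_model("models/mnn/yolo11s.mnn")
```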

Detection Model Wrapper

From inference/detector.py:1-22:
import numpy as np
from typing import Tuple, Dict
from abc import ABC, abstractmethod

from ultralytics.engine.results import Results
from .model_loader import ModelLoader

class DetectionModelInterface(ABC):
    @abstractmethod
    def inference(self, image: np.ndarray) -> Tuple[Results, Dict[int, str]]:
        pass

class DetectionModel(DetectionModelInterface):
    def __init__(self):
        self.object_model = ModelLoader().get_model()

    def inference(self, image: np.ndarray) -> tuple[list[Results], Dict[int, str]]:
        results = self.object_model.predict(
            image, 
            conf=0.55,      # Confidence threshold
            verbose=False,  # Suppress output
            imgsz=640,      # Input size
            stream=True,    # Memory-efficient streaming
            task='detect',  # Detection mode
            half=True       # FP16 precision
        )
        return results, self.object_model.names
Parameters Explained:
  • conf=0.55: Minimum confidence (55%) to accept detection
    • Lower = more detections, more false positives
    • Higher = fewer detections, miss some objects
    • 0.5-0.6 is typical for robotics
  • verbose=False: Suppress logging (cleaner output)
  • imgsz=640: Model input size (must match training)
  • stream=True: Generator-based results (memory efficient)
    • Returns iterator instead of list
    • Process results one at a time
    • Reduces memory usage
  • half=True: Use FP16 precision if supported
    • ~1.5x faster on supported hardware
    • Minimal accuracy loss
    • May not work on all platforms (falls back to FP32)
Confidence Threshold Tuning:
  • Too low (0.3): many false detections, robot confused
  • Just right (0.5-0.6): good balance
  • Too high (0.8): misses valid objects
Test with your specific objects and lighting conditions!
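The trade-off shows up directly when filtering one frame's scores at different thresholds (the confidence values below are made up for illustration):

```python
# Hypothetical confidence scores from a single frame
confidences = [0.95, 0.72, 0.58, 0.41, 0.33]

# How many detections survive each threshold
kept = {t: [c for c in confidences if c >= t] for t in (0.3, 0.55, 0.8)}
# 0.3 keeps everything (including likely false positives),
# 0.55 keeps the three strong detections, 0.8 keeps only one
```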

Image Processing Pipeline

Complete Processing Class

From process/image_processing.py:8-23:
import cv2
import numpy as np
import logging as log
log.basicConfig(level=log.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

from .detection.main import (DetectionModelInterface, DetectionModel)

class ImageProcessor:
    def __init__(self, confidence_threshold: float = 0.45):
        self.detection: DetectionModelInterface = DetectionModel()
        self.conf_threshold = confidence_threshold
        
    def read_image_path(self, path: str, draw_results: bool = True, 
                       save_drawn_img: bool = True):
        object_image = cv2.imread(path)
        processed_img, best_detection = self.process_image(
            object_image, self.conf_threshold
        )
        
        if draw_results and best_detection is not None and \
           best_detection.get('confidence', 0) > 0:
            self._draw_detection(processed_img, best_detection)
            if save_drawn_img:
                self._save_drawn_image(processed_img, path)

        return processed_img, best_detection

Core Processing Logic

From process/image_processing.py:24-72:
def process_image(self, image: np.ndarray, confidence_threshold: float = 0.45):
    try:
        # 1. Run inference
        copy_image = image.copy()
        object_results, object_classes = self.detection.inference(copy_image)
        
        # 2. Initialize best detection
        best_detection = {
            'class': '', 
            'confidence': 0.0, 
            'box': [], 
            'class_id': -1
        }
        
        # 3. Process all results
        for res in object_results:
            boxes = res.boxes
            
            # No detections in this result
            if boxes.shape[0] == 0:
                continue
                
            # Extract detection data
            confidence = boxes.conf.cpu().numpy()[0]
            class_id = int(boxes.cls[0])
            box_data = boxes.xyxy.cpu().numpy()[0]
                
            # Filter by confidence
            if confidence < confidence_threshold:
                continue
                
            # Get class name
            detected_class = object_classes[class_id]
            clss_object = 'default'
                
            # Filter by target classes
            if detected_class in ['apple', 'orange', 'bottle']:
                clss_object = detected_class
                
            log.info(f'class: {clss_object}')
                    
            # Keep highest confidence detection
            if confidence > best_detection['confidence']:
                best_detection.update({
                    'class': str(clss_object),
                    'confidence': float(confidence),
                    'box': box_data,
                    'class_id': class_id
                })
                    
        # 4. Return results
        if best_detection['confidence'] >= confidence_threshold:
            log.info(f"best detection: {best_detection}")
            return image, best_detection
        else:
            log.info("not found detections")
            return image, best_detection
            
    except Exception as e:
        log.info(f'error in image processing: {e}')
        return image, None
Algorithm Steps:
  1. Inference: Run YOLO model on image
  2. Iterate Results: Stream-based processing
  3. Extract Data: Get confidence, class, bounding box
  4. Filter Confidence: Skip low-confidence detections
  5. Filter Classes: Only keep apple, orange, bottle
  6. Select Best: Highest confidence wins
  7. Return: Best detection or None
Why “best” detection?
For pick-and-place, the robot can only grab one object at a time. We select the highest-confidence detection as the target. In production, you might:
  • Choose closest object (depth sensing)
  • Pick specific class (“get the apple”)
  • Use task priority (bottles before apples)
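Stripped of the YOLO plumbing, the selection logic above reduces to a single pass. This is a simplified sketch; the tuple layout and sample values are illustrative:

```python
TARGET_CLASSES = {"apple", "orange", "bottle"}

def select_best(detections, conf_threshold: float = 0.45) -> dict:
    # detections: iterable of (class_name, confidence, box) tuples
    best = {"class": "", "confidence": 0.0, "box": []}
    for name, conf, box in detections:
        if conf < conf_threshold:        # Step 4: confidence filter
            continue
        if name not in TARGET_CLASSES:   # Step 5: class filter
            continue
        if conf > best["confidence"]:    # Step 6: highest confidence wins
            best = {"class": name, "confidence": conf, "box": box}
    return best

sample = [
    ("apple", 0.80, [10, 10, 50, 50]),
    ("person", 0.95, [0, 0, 100, 100]),  # dropped: not a target class
    ("bottle", 0.60, [60, 20, 90, 80]),
    ("orange", 0.30, [5, 5, 20, 20]),    # dropped: below threshold
]
best = select_best(sample)
```

Note that the 0.95 person is rejected despite its higher confidence: class filtering runs before the best-confidence comparison.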

Visualization

From process/image_processing.py:74-95:
def _draw_detection(self, image: np.ndarray, detection: dict):
    """
    Draw bounding box and label on image
    """
    box = detection['box']
    class_name = detection['class']
    confidence = detection['confidence']

    x1, y1, x2, y2 = map(int, box)
    color = (0, 255, 0)  # BGR - green
    
    # Draw rectangle
    cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)

    # Draw label
    label = f"{class_name} {confidence:.2f}"
    cv2.putText(
        image, label, (x1, y1 - 10),
        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2
    )

def _save_drawn_image(self, image: np.ndarray, original_path: str):
    """
    Save annotated image
    """
    out_path = original_path.replace('.jpg', '_detected.jpg')
    cv2.imwrite(out_path, image)
    log.info(f"save image with draw detections: {out_path}")
Visualization Features:
  • Green bounding box around object
  • Label with class name and confidence
  • Optional: Save annotated image

Real-Time Video Stream

Video Processing System

From video_stream.py:9-51:
import os
import sys
import cv2
import time
import logging as log

from process.image_processing import ImageProcessor

class VideoStream:
    def __init__(self, source: str = 0, confidence_threshold: float = 0.45):
        self.source = source
        self.confidence_threshold = confidence_threshold
        self.image_processor = ImageProcessor(
            confidence_threshold=confidence_threshold
        )
        self.video_capture = None
        self.prev_time = 0

    def start_stream(self):
        self.video_capture = cv2.VideoCapture(self.source)
        if not self.video_capture.isOpened():
            log.error(f"Error opening video stream or file: {self.source}")
            return False

        while True:
            ret, frame = self.video_capture.read()
            if not ret:
                log.error("Failed to capture frame from video stream.")
                break
            
            # Calculate FPS
            current_time = time.time()
            fps = 1 / (current_time - self.prev_time) if self.prev_time > 0 else 0
            self.prev_time = current_time
            
            # Run inference
            processed_frame, best_detection = self.image_processor.process_image(
                frame, self.confidence_threshold
            )
            
            # Draw best detection
            if best_detection is not None and best_detection.get('confidence', 0) > 0:
                self.image_processor._draw_detection(processed_frame, best_detection)
                
            # Display FPS
            cv2.putText(
                processed_frame, f"FPS: {fps:.2f}", (10, 30), 
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2
            )
                
            cv2.imshow('Video Stream', processed_frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        self.video_capture.release()
        cv2.destroyAllWindows()
        return True
Key Features:
  1. Camera Source: 0 for default webcam, or video file path
  2. FPS Calculation: Real-time performance monitoring
  3. Frame-by-Frame: Process each frame independently
  4. Best Detection: Highlight highest-confidence object
  5. Display: Show annotated video with FPS
  6. Exit: Press ‘q’ to quit
Usage:
if __name__ == "__main__":
    real_time_inference = VideoStream(0, 0.45)
    real_time_inference.start_stream()
Frame Rate Factors:
  • Model size (yolo11n is fastest, then yolo11s, then yolo11m)
  • Input resolution (640x640 vs 1280x1280)
  • Confidence threshold (affects postprocessing)
  • Camera resolution (lower = faster capture)
  • Display rendering (imshow takes time)
On Raspberry Pi 4 with yolo11s.mnn: Expect 5-10 FPS
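The instantaneous `1 / dt` figure computed in the stream loop jumps around from frame to frame. An exponentially smoothed counter (a sketch, not part of the course code) gives a steadier on-screen number:

```python
class FPSCounter:
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha  # smoothing factor: higher reacts faster
        self.fps = 0.0
        self.prev = None

    def tick(self, now: float) -> float:
        # Call once per frame with the current timestamp (e.g. time.time());
        # blends the instantaneous rate into a running average
        if self.prev is not None:
            instant = 1.0 / (now - self.prev)
            self.fps = instant if self.fps == 0 else \
                (1 - self.alpha) * self.fps + self.alpha * instant
        self.prev = now
        return self.fps
```

In the stream loop, replace the raw `1 / (current_time - self.prev_time)` with `counter.tick(time.time())`.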

Performance Optimization Techniques

1. Reduce Input Resolution

# Before inference, resize frame
height, width = frame.shape[:2]
if width > 640:
    scale = 640 / width
    new_width = 640
    new_height = int(height * scale)
    frame = cv2.resize(frame, (new_width, new_height))
Impact: 2x speedup (1920x1080 → 640x360)
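A small helper (hypothetical name) makes the aspect-ratio arithmetic explicit and testable without OpenCV:

```python
def fit_width(width: int, height: int, max_width: int = 640) -> tuple:
    # Returns new (width, height), preserving aspect ratio;
    # frames already narrow enough are left untouched
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, int(height * scale)
```

Use the result as the `dsize` argument to `cv2.resize(frame, (new_w, new_h))`.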

2. Skip Frames

frame_count = 0
process_every_n = 2  # Process every 2nd frame
results = None       # Holds the last results for skipped frames

while True:
    ret, frame = camera.read()
    if not ret:
        break
    frame_count += 1
    
    if frame_count % process_every_n == 0:
        results = model.predict(frame)
    
    # On skipped frames, reuse the previous `results`
Impact: 2x speedup (process_every_n=2)

3. Disable Visualization During Operation

# Development: show video
cv2.imshow('Stream', frame)  # SLOW!

# Production: no display
# (just process and send to robot)
Impact: 20-30% speedup

4. Use Threading for I/O

import threading
import queue

class ThreadedCamera:
    def __init__(self, source=0):
        self.camera = cv2.VideoCapture(source)
        self.queue = queue.Queue(maxsize=2)
        self.stopped = False
        
    def start(self):
        threading.Thread(target=self._read_frames, daemon=True).start()
        return self
    
    def _read_frames(self):
        while not self.stopped:
            ret, frame = self.camera.read()
            if not ret:
                break
            
            if not self.queue.full():
                self.queue.put(frame)
    
    def read(self):
        return self.queue.get()
    
    def stop(self):
        self.stopped = True

# Usage
camera = ThreadedCamera(0).start()
while True:
    frame = camera.read()
    results = model.predict(frame)
Impact: 10-15% speedup (camera I/O overlaps with inference)

5. Adjust Confidence Threshold

# Lower threshold = more detections to process
results = model.predict(frame, conf=0.3)  # SLOW

# Higher threshold = fewer detections
results = model.predict(frame, conf=0.6)  # FASTER
Impact: 5-10% speedup (fewer postprocessing operations)
Optimization Priority:
  1. Use optimized model format (MNN/NCNN)
  2. Reduce input resolution
  3. Disable visualization in production
  4. Skip frames if tolerable
  5. Use threading for I/O

Integration with Robot Control

Complete Vision-Control System

import cv2
import time
from vision_class.process.image_processing import ImageProcessor
from comm_class.raspberry_comm.json_data import SerialCommunication

class RobotVisionController:
    def __init__(self):
        # Initialize components
        self.processor = ImageProcessor(confidence_threshold=0.5)
        self.serial = SerialCommunication()
        self.camera = cv2.VideoCapture(0)
        
        # State
        self.last_detection = None
        self.last_send_time = 0
        self.send_interval = 0.5  # Send updates every 500ms
        
    def start(self):
        # Connect to robot
        if not self.serial.connect():
            print("Failed to connect to robot")
            return
        
        print("Vision-control system started")
        print("Press 'q' to quit")
        
        try:
            while True:
                # Capture frame
                ret, frame = self.camera.read()
                if not ret:
                    break
                
                # Process frame
                processed, detection = self.processor.process_image(frame, 0.5)
                
                # Send detection to robot (rate-limited)
                if detection and detection['confidence'] > 0:
                    current_time = time.time()
                    if current_time - self.last_send_time > self.send_interval:
                        self._send_detection(detection)
                        self.last_send_time = current_time
                
                # Display (optional - remove for production)
                if detection and detection['confidence'] > 0:
                    self.processor._draw_detection(processed, detection)
                cv2.imshow('Robot Vision', processed)
                
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
        
        finally:
            self.cleanup()
    
    def _send_detection(self, detection: dict):
        """
        Send detection to robot via serial
        """
        box = detection['box']
        center_x = int((box[0] + box[2]) / 2)
        center_y = int((box[1] + box[3]) / 2)
        
        message_data = {
            'class': detection['class'],
            'confidence': float(detection['confidence']),
            'center_x': center_x,
            'center_y': center_y,
            'bbox': box.tolist() if hasattr(box, 'tolist') else list(box)
        }
        
        self.serial.writing_data('detection', message_data)
        print(f"Sent: {detection['class']} at ({center_x}, {center_y})")
    
    def cleanup(self):
        print("Shutting down...")
        self.camera.release()
        cv2.destroyAllWindows()
        self.serial.close()

if __name__ == "__main__":
    controller = RobotVisionController()
    controller.start()
System Flow:
  1. Initialize: Load model, connect to robot
  2. Capture: Get frame from camera
  3. Detect: Run YOLO inference
  4. Filter: Select best detection
  5. Rate Limit: Don’t spam robot with messages
  6. Send: Transmit detection via serial JSON
  7. Visualize: Display results (optional)
  8. Repeat: Process next frame
Rate Limiting
Sending detections every frame (10 FPS = 10 messages/sec) can overwhelm the robot. Rate limiting (1-2 messages/sec) provides stable control while the robot executes movements.
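The rate-limiting check in the controller can be factored into a small reusable class (a sketch; the course code keeps the check inline with `time.time()`):

```python
import time

class RateLimiter:
    def __init__(self, interval_s: float = 0.5):
        self.interval = interval_s
        self.last = 0.0

    def ready(self, now: float = None) -> bool:
        # Returns True (and re-arms the timer) if the interval has elapsed;
        # `now` can be injected for testing, otherwise a monotonic clock is used
        now = time.monotonic() if now is None else now
        if now - self.last >= self.interval:
            self.last = now
            return True
        return False
```

In the loop: `if detection and detection['confidence'] > 0 and limiter.ready(): self._send_detection(detection)`.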

Monitoring and Debugging

Performance Metrics

import time
from collections import deque

class PerformanceMonitor:
    def __init__(self, window_size=30):
        self.inference_times = deque(maxlen=window_size)
        self.frame_times = deque(maxlen=window_size)
        self.detection_counts = deque(maxlen=window_size)
        
    def record_inference(self, duration: float):
        self.inference_times.append(duration)
    
    def record_frame(self, duration: float):
        self.frame_times.append(duration)
    
    def record_detections(self, count: int):
        self.detection_counts.append(count)
    
    def get_stats(self) -> dict:
        # Guard against division by zero before any samples are recorded
        if not (self.inference_times and self.frame_times and self.detection_counts):
            return {}
        return {
            'avg_inference_ms': sum(self.inference_times) / len(self.inference_times) * 1000,
            'avg_fps': 1.0 / (sum(self.frame_times) / len(self.frame_times)),
            'avg_detections': sum(self.detection_counts) / len(self.detection_counts)
        }

# Usage
monitor = PerformanceMonitor()
frame_idx = 0

while True:
    frame_start = time.time()
    frame_idx += 1
    
    # Inference
    inf_start = time.time()
    results = model.predict(frame)
    inf_time = time.time() - inf_start
    monitor.record_inference(inf_time)
    
    # Count detections
    num_detections = len(results[0].boxes)
    monitor.record_detections(num_detections)
    
    # Frame time
    frame_time = time.time() - frame_start
    monitor.record_frame(frame_time)
    
    # Print stats every 30 frames
    if frame_idx % 30 == 0:
        stats = monitor.get_stats()
        print(f"Inference: {stats['avg_inference_ms']:.1f}ms, "
              f"FPS: {stats['avg_fps']:.1f}, "
              f"Detections: {stats['avg_detections']:.1f}")

Debug Logging

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('robot_vision.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger('RobotVision')

# In processing code
logger.info(f'Detected {class_name} with confidence {conf:.2f}')
logger.warning(f'Low FPS: {fps:.1f}')
logger.error(f'Serial communication failed: {e}')

Common Issues

Problem: Low FPS (< 5)
Checks:
# Check model format
print(f'Model path: {model.ckpt_path}')  # Should be .mnn or .ncnn

# Check input size
print(f'Input size: {imgsz}')  # 640 is standard, 1280 is slow

# Check camera resolution
print(f'Frame size: {frame.shape}')  # Lower is faster
Problem: No detections
Checks:
# Check confidence threshold
results = model.predict(frame, conf=0.3)  # Try lower

# Check classes
print(f'Model classes: {model.names}')  # Verify apple, orange, bottle present

# Check lighting
cv2.imwrite('debug_frame.jpg', frame)  # Inspect lighting quality
Problem: High latency
Measure stages:
t0 = time.time()
ret, frame = camera.read()
t1 = time.time()
results = model.predict(frame)
t2 = time.time()
processed = parse_results(results)
t3 = time.time()
serial.send(data)
t4 = time.time()

print(f'Capture: {(t1-t0)*1000:.1f}ms')
print(f'Inference: {(t2-t1)*1000:.1f}ms')
print(f'Processing: {(t3-t2)*1000:.1f}ms')
print(f'Serial: {(t4-t3)*1000:.1f}ms')
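The same measurement can be wrapped in a context manager (an illustrative helper, not from the course code) so every stage is timed the same way:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name: str, timings: dict):
    # Records the wall time of the enclosed block, in milliseconds
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

# Usage: wrap each pipeline stage and inspect `timings` afterward
timings = {}
with stage_timer("capture", timings):
    pass  # ret, frame = camera.read()
with stage_timer("inference", timings):
    pass  # results = model.predict(frame)
```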
Typical Raspberry Pi 4 Timings (yolo11s.mnn):
  • Capture: 10-20ms
  • Inference: 100-150ms (dominant)
  • Processing: 1-5ms
  • Serial: 1-2ms
  • Total: ~120-180ms (5-8 FPS)
If inference > 200ms, check model format and size.

Production Deployment

Systemd Service

Run the vision system as a service on Raspberry Pi. Create robot_vision.service:
[Unit]
Description=Robot Vision System
After=network.target

[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi/robot_arm
ExecStart=/usr/bin/python3 /home/pi/robot_arm/main.py
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
Install:
sudo cp robot_vision.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable robot_vision
sudo systemctl start robot_vision

# Check status
sudo systemctl status robot_vision

# View logs
journalctl -u robot_vision -f

Error Recovery

class RobotVisionController:
    def start(self):
        retry_count = 0
        max_retries = 5
        
        while retry_count < max_retries:
            try:
                self._run()
                break  # Successful
            except Exception as e:
                retry_count += 1
                logger.error(f'System error: {e}, retry {retry_count}/{max_retries}')
                time.sleep(5)
                
                # Reinitialize
                self.cleanup()
                self._initialize()
        
        if retry_count >= max_retries:
            logger.critical('System failed after max retries')
    
    def _run(self):
        # Main processing loop
        pass
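A variant with exponential backoff (an addition; the course loop uses a fixed 5-second delay) can be written as a standalone helper, with the sleep function injectable for testing:

```python
import time

def run_with_retries(task, max_retries: int = 5, base_delay: float = 1.0,
                     sleep=time.sleep):
    # Retries `task` up to max_retries times, doubling the delay each attempt;
    # re-raises the last error if every attempt fails
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In the controller, `task` would wrap `self._run()` with cleanup and reinitialization between attempts.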

Practice Exercise

Build Complete Vision System

Task: Create an integrated vision-robot controller
Requirements:
  1. Load MNN model
  2. Capture video from camera
  3. Detect apples, oranges, bottles
  4. Send detections to VEX Brain via serial
  5. Display FPS and detection info
  6. Log performance metrics
Success Criteria:
  • System runs at greater than 5 FPS on Raspberry Pi
  • Detections sent with less than 200ms latency
  • Confidence threshold configurable
  • Graceful error handling
  • Can run for 30+ minutes without crashes

Extension: Multi-Object Tracking

Instead of “best” detection, track multiple objects:
  • Assign IDs to detected objects
  • Track positions across frames
  • Send updates only when objects move significantly
  • Prioritize closest or newest objects
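The ID-assignment idea can be sketched as a minimal centroid tracker: each detection keeps the ID of the nearest known object, or gets a fresh one. This is a toy sketch under simplifying assumptions (no object removal, greedy matching); production systems typically use IoU- or Kalman-filter-based trackers such as SORT or ByteTrack:

```python
import math
from itertools import count

class CentroidTracker:
    def __init__(self, max_dist: float = 50.0):
        self.objects = {}        # object id -> last known (x, y)
        self.max_dist = max_dist
        self._ids = count()

    def update(self, centroids):
        # centroids: list of (x, y) tuples for the current frame;
        # returns {centroid: assigned id}
        assigned = {}
        for c in centroids:
            # Nearest existing object within max_dist keeps its ID
            best_id, best_d = None, self.max_dist
            for oid, pos in self.objects.items():
                d = math.dist(c, pos)
                if d < best_d and oid not in assigned.values():
                    best_id, best_d = oid, d
            if best_id is None:
                best_id = next(self._ids)  # unseen object: new ID
            self.objects[best_id] = c
            assigned[c] = best_id
        return assigned
```

With stable IDs you can send a serial update only when an object's centroid moves more than a few pixels since its last transmitted position.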

Summary

You’ve learned:
  • ✓ Loading and running optimized YOLO models
  • ✓ Building real-time inference pipelines
  • ✓ Processing and filtering detection results
  • ✓ Integrating vision with serial communication
  • ✓ Performance optimization techniques
  • ✓ Monitoring and debugging methods
  • ✓ Production deployment strategies

Course Completion

Congratulations! You’ve completed the Computer Vision module and the entire Robotic Arm with Computer Vision course. You can now:
  • Design serial communication protocols
  • Train custom YOLO models
  • Optimize models for edge deployment
  • Build real-time vision systems
  • Integrate vision with robot control
  • Deploy production-ready systems

Course Overview

Review course structure and learning outcomes

Theoretical Concepts

Revisit foundational concepts

Next Steps

Capstone Project

Build the complete integrated system:
  1. Hardware Setup:
    • Raspberry Pi with camera
    • VEX IQ2 Brain with robotic arm
    • Serial connection
  2. Software Integration:
    • Train model on your specific objects
    • Export to MNN format
    • Implement vision controller
    • Implement robot control logic on VEX
    • Test end-to-end: detection → serial → movement
  3. Testing:
    • Place object in workspace
    • Verify detection and classification
    • Confirm serial message transmission
    • Validate robot movement to target
    • Test error cases (no object, multiple objects)
  4. Optimization:
    • Tune confidence thresholds
    • Optimize for your lighting conditions
    • Calibrate camera-to-robot coordinates
    • Add safety checks (collision avoidance)

Further Learning

  • Advanced Vision: Depth estimation, 3D pose estimation
  • Robot Planning: Path planning, inverse kinematics
  • Control Theory: PID tuning, trajectory optimization
  • Multi-Robot: Coordinate multiple robots
  • ROS Integration: Use Robot Operating System
Reference Code: course/vision_class/
  • inference/detector.py:15-22: Inference parameters and setup
  • inference/model_loader.py:1-15: MNN model loading
  • process/image_processing.py:24-72: Complete detection pipeline
  • video_stream.py:9-55: Real-time video processing
Complete, production-ready implementations!
