
Inference Optimization

The final piece of the puzzle: running real-time object detection on Raspberry Pi and integrating it with your robotic arm. This lesson covers inference pipelines, performance optimization, and complete system integration.

Learning Objectives

By the end of this lesson, you will be able to:
  • Load and run optimized YOLO models efficiently
  • Build inference pipelines for real-time video processing
  • Optimize for maximum FPS on Raspberry Pi
  • Process detection results for robot control
  • Integrate vision with serial communication
  • Debug and monitor system performance
This lesson uses the complete implementation from course/vision_class/, including inference, processing, and video streaming.

Inference Architecture

System Components

Pipeline Stages:
  1. Capture: Get frame from camera
  2. Preprocess: Resize, normalize for model input
  3. Inference: Run YOLO detection
  4. Postprocess: Parse results, filter by confidence
  5. Selection: Choose best detection
  6. Communication: Send target to robot
  7. Visualization: Display results (optional)
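The stages above can be sketched as plain functions with stubbed capture and inference (the stub detector, sample class names, and confidence values below are illustrative, not the course implementation):

```python
def capture() -> dict:
    # Stage 1: stand-in for camera.read(); real code returns a BGR frame
    return {"height": 480, "width": 640}

def preprocess(frame: dict) -> dict:
    # Stage 2: placeholder; real code resizes/normalizes for the model
    return frame

def inference(frame: dict) -> list:
    # Stage 3: stub detector; real code calls model.predict(frame)
    return [{"class": "apple", "confidence": 0.9, "box": [100, 100, 200, 200]},
            {"class": "bottle", "confidence": 0.4, "box": [300, 50, 360, 200]}]

def postprocess(dets: list, conf: float = 0.45) -> list:
    # Stage 4: drop detections below the confidence threshold
    return [d for d in dets if d["confidence"] >= conf]

def select_best(dets: list):
    # Stage 5: highest-confidence detection wins (None if nothing left)
    return max(dets, key=lambda d: d["confidence"], default=None)

def run_once():
    # Stages 6-7 (serial communication, visualization) are omitted here
    frame = preprocess(capture())
    return select_best(postprocess(inference(frame)))
```

Running one pass keeps only the 0.9-confidence apple; the 0.4-confidence bottle falls below the threshold.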

Model Loading

Efficient Model Loader

From inference/model_loader.py:1-15:
import os
from typing import Dict
from ultralytics import YOLO

class ModelLoader:
    def __init__(self):
        current_path = os.path.dirname(os.path.abspath(__file__))
        object_model_path: str = current_path + '/models/mnn/yolo11s.mnn'
        self.model: YOLO = YOLO(object_model_path, task='detect')
        
    def get_model(self) -> YOLO:
        return self.model
Key Design Points:
  • Relative Path: Uses __file__ for portability
  • MNN Format: Optimized for Raspberry Pi ARM CPU
  • Single Load: Model loaded once at initialization
  • Task Specification: Explicit task='detect' for detection mode
Performance Tip: Load models during initialization, not per-frame. Model loading takes 1-2 seconds but only needs to happen once.
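One way to guarantee single loading even if the loader is called per-frame is to memoize it. This sketch uses a stub in place of the real `YOLO(...)` call so the effect is measurable:

```python
from functools import lru_cache

load_count = 0  # counts how many times the (stub) model is actually built

@lru_cache(maxsize=1)
def get_model(path: str) -> dict:
    # Stand-in for YOLO(path, task='detect'); the expensive load
    # runs only on the first call, later calls hit the cache
    global load_count
    load_count += 1
    return {"path": path}

# Simulate 100 per-frame calls: the model is constructed exactly once
for _ in range(100):
    model = get_model("models/mnn/yolo11s.mnn")
```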

Detection Model Wrapper

From inference/detector.py:1-22:
import numpy as np
from typing import Tuple, Dict
from abc import ABC, abstractmethod

from ultralytics.engine.results import Results
from .model_loader import ModelLoader

class DetectionModelInterface(ABC):
    @abstractmethod
    def inference(self, image: np.ndarray) -> Tuple[Results, Dict[int, str]]:
        pass

class DetectionModel(DetectionModelInterface):
    def __init__(self):
        self.object_model = ModelLoader().get_model()

    def inference(self, image: np.ndarray) -> tuple[list[Results], Dict[int, str]]:
        results = self.object_model.predict(
            image, 
            conf=0.55,      # Confidence threshold
            verbose=False,  # Suppress output
            imgsz=640,      # Input size
            stream=True,    # Memory-efficient streaming
            task='detect',  # Detection mode
            half=True       # FP16 precision
        )
        return results, self.object_model.names
Parameters Explained:
  • conf=0.55: Minimum confidence (55%) to accept detection
    • Lower = more detections, more false positives
    • Higher = fewer detections, miss some objects
    • 0.5-0.6 is typical for robotics
  • verbose=False: Suppress logging (cleaner output)
  • imgsz=640: Model input size (must match training)
  • stream=True: Generator-based results (memory efficient)
    • Returns iterator instead of list
    • Process results one at a time
    • Reduces memory usage
  • half=True: Use FP16 precision if supported
    • ~1.5x faster on supported hardware
    • Minimal accuracy loss
    • May not work on all platforms (falls back to FP32)
Confidence Threshold Tuning:
  • Too low (0.3): many false detections, robot confused
  • Just right (0.5-0.6): good balance
  • Too high (0.8): misses valid objects
Test with your specific objects and lighting conditions!
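The trade-off shows up directly when filtering one frame's scores at different thresholds (the confidence values below are made up for illustration):

```python
# Hypothetical confidence scores from a single frame
confidences = [0.95, 0.72, 0.58, 0.41, 0.33]

# How many detections survive each threshold
kept = {t: [c for c in confidences if c >= t] for t in (0.3, 0.55, 0.8)}
# 0.3 keeps everything (including likely false positives),
# 0.55 keeps the three strong detections, 0.8 keeps only one
```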

Image Processing Pipeline

Complete Processing Class

From process/image_processing.py:8-23:
import cv2
import numpy as np
import logging as log
log.basicConfig(level=log.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

from .detection.main import (DetectionModelInterface, DetectionModel)

class ImageProcessor:
    def __init__(self, confidence_threshold: float = 0.45):
        self.detection: DetectionModelInterface = DetectionModel()
        self.conf_threshold = confidence_threshold
        
    def read_image_path(self, path: str, draw_results: bool = True, 
                       save_drawn_img: bool = True):
        object_image = cv2.imread(path)
        processed_img, best_detection = self.process_image(
            object_image, self.conf_threshold
        )
        
        if draw_results and best_detection is not None and \
           best_detection.get('confidence', 0) > 0:
            self._draw_detection(processed_img, best_detection)
            if save_drawn_img:
                self._save_drawn_image(processed_img, path)

        return processed_img, best_detection

Core Processing Logic

From process/image_processing.py:24-72:
def process_image(self, image: np.ndarray, confidence_threshold: float = 0.45):
    try:
        # 1. Run inference
        copy_image = image.copy()
        object_results, object_classes = self.detection.inference(copy_image)
        
        # 2. Initialize best detection
        best_detection = {
            'class': '', 
            'confidence': 0.0, 
            'box': [], 
            'class_id': -1
        }
        
        # 3. Process all results
        for res in object_results:
            boxes = res.boxes
            
            # No detections in this result
            if boxes.shape[0] == 0:
                continue
                
            # Extract detection data
            confidence = boxes.conf.cpu().numpy()[0]
            class_id = int(boxes.cls[0])
            box_data = boxes.xyxy.cpu().numpy()[0]
                
            # Filter by confidence
            if confidence < confidence_threshold:
                continue
                
            # Get class name
            detected_class = object_classes[class_id]
            clss_object = 'default'
                
            # Filter by target classes
            if detected_class in ['apple', 'orange', 'bottle']:
                clss_object = detected_class
                
            log.info(f'class: {clss_object}')
                    
            # Keep highest confidence detection
            if confidence > best_detection['confidence']:
                best_detection.update({
                    'class': str(clss_object),
                    'confidence': float(confidence),
                    'box': box_data,
                    'class_id': class_id
                })
                    
        # 4. Return results
        if best_detection['confidence'] >= confidence_threshold:
            log.info(f"best detection: {best_detection}")
            return image, best_detection
        else:
            log.info("not found detections")
            return image, best_detection
            
    except Exception as e:
        log.info(f'error in image processing: {e}')
        return image, None
Algorithm Steps:
  1. Inference: Run YOLO model on image
  2. Iterate Results: Stream-based processing
  3. Extract Data: Get confidence, class, bounding box
  4. Filter Confidence: Skip low-confidence detections
  5. Filter Classes: Only keep apple, orange, bottle
  6. Select Best: Highest confidence wins
  7. Return: Best detection or None
Why “best” detection?
For pick-and-place, the robot can only grab one object at a time. We select the highest-confidence detection as the target. In production, you might:
  • Choose closest object (depth sensing)
  • Pick specific class (“get the apple”)
  • Use task priority (bottles before apples)
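Stripped of the YOLO plumbing, the selection logic above reduces to a single pass. This is a simplified sketch; the tuple layout and sample values are illustrative:

```python
TARGET_CLASSES = {"apple", "orange", "bottle"}

def select_best(detections, conf_threshold: float = 0.45) -> dict:
    # detections: iterable of (class_name, confidence, box) tuples
    best = {"class": "", "confidence": 0.0, "box": []}
    for name, conf, box in detections:
        if conf < conf_threshold:        # Step 4: confidence filter
            continue
        if name not in TARGET_CLASSES:   # Step 5: class filter
            continue
        if conf > best["confidence"]:    # Step 6: highest confidence wins
            best = {"class": name, "confidence": conf, "box": box}
    return best

sample = [
    ("apple", 0.80, [10, 10, 50, 50]),
    ("person", 0.95, [0, 0, 100, 100]),  # dropped: not a target class
    ("bottle", 0.60, [60, 20, 90, 80]),
    ("orange", 0.30, [5, 5, 20, 20]),    # dropped: below threshold
]
best = select_best(sample)
```

Note that the 0.95 person is rejected despite its higher confidence: class filtering runs before the best-confidence comparison.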

Visualization

From process/image_processing.py:74-95:
def _draw_detection(self, image: np.ndarray, detection: dict):
    """
    Draw bounding box and label on image
    """
    box = detection['box']
    class_name = detection['class']
    confidence = detection['confidence']

    x1, y1, x2, y2 = map(int, box)
    color = (0, 255, 0)  # BGR - green
    
    # Draw rectangle
    cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)

    # Draw label
    label = f"{class_name} {confidence:.2f}"
    cv2.putText(
        image, label, (x1, y1 - 10),
        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2
    )

def _save_drawn_image(self, image: np.ndarray, original_path: str):
    """
    Save annotated image
    """
    out_path = original_path.replace('.jpg', '_detected.jpg')
    cv2.imwrite(out_path, image)
    log.info(f"save image with draw detections: {out_path}")
Visualization Features:
  • Green bounding box around object
  • Label with class name and confidence
  • Optional: Save annotated image

Real-Time Video Stream

Video Processing System

From video_stream.py:9-51:
import os
import sys
import cv2
import time
import logging as log

from process.image_processing import ImageProcessor

class VideoStream:
    def __init__(self, source: str = 0, confidence_threshold: float = 0.45):
        self.source = source
        self.confidence_threshold = confidence_threshold
        self.image_processor = ImageProcessor(
            confidence_threshold=confidence_threshold
        )
        self.video_capture = None
        self.prev_time = 0

    def start_stream(self):
        self.video_capture = cv2.VideoCapture(self.source)
        if not self.video_capture.isOpened():
            log.error(f"Error opening video stream or file: {self.source}")
            return False

        while True:
            ret, frame = self.video_capture.read()
            if not ret:
                log.error("Failed to capture frame from video stream.")
                break
            
            # Calculate FPS
            current_time = time.time()
            fps = 1 / (current_time - self.prev_time) if self.prev_time > 0 else 0
            self.prev_time = current_time
            
            # Run inference
            processed_frame, best_detection = self.image_processor.process_image(
                frame, self.confidence_threshold
            )
            
            # Draw best detection
            if best_detection is not None and best_detection.get('confidence', 0) > 0:
                self.image_processor._draw_detection(processed_frame, best_detection)
                
            # Display FPS
            cv2.putText(
                processed_frame, f"FPS: {fps:.2f}", (10, 30), 
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2
            )
                
            cv2.imshow('Video Stream', processed_frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        self.video_capture.release()
        cv2.destroyAllWindows()
        return True
Key Features:
  1. Camera Source: 0 for default webcam, or video file path
  2. FPS Calculation: Real-time performance monitoring
  3. Frame-by-Frame: Process each frame independently
  4. Best Detection: Highlight highest-confidence object
  5. Display: Show annotated video with FPS
  6. Exit: Press ‘q’ to quit
Usage:
if __name__ == "__main__":
    real_time_inference = VideoStream(0, 0.45)
    real_time_inference.start_stream()
Frame Rate Factors:
  • Model size (yolo11n is fastest, then yolo11s, then yolo11m)
  • Input resolution (640x640 vs 1280x1280)
  • Confidence threshold (affects postprocessing)
  • Camera resolution (lower = faster capture)
  • Display rendering (imshow takes time)
On Raspberry Pi 4 with yolo11s.mnn: Expect 5-10 FPS
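The instantaneous `1 / dt` figure computed in the stream loop jumps around from frame to frame. An exponentially smoothed counter (a sketch, not part of the course code) gives a steadier on-screen number:

```python
class FPSCounter:
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha  # smoothing factor: higher reacts faster
        self.fps = 0.0
        self.prev = None

    def tick(self, now: float) -> float:
        # Call once per frame with the current timestamp (e.g. time.time());
        # blends the instantaneous rate into a running average
        if self.prev is not None:
            instant = 1.0 / (now - self.prev)
            self.fps = instant if self.fps == 0 else \
                (1 - self.alpha) * self.fps + self.alpha * instant
        self.prev = now
        return self.fps
```

In the stream loop, replace the raw `1 / (current_time - self.prev_time)` with `counter.tick(time.time())`.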

Performance Optimization Techniques

1. Reduce Input Resolution

# Before inference, resize frame
height, width = frame.shape[:2]
if width > 640:
    scale = 640 / width
    new_width = 640
    new_height = int(height * scale)
    frame = cv2.resize(frame, (new_width, new_height))
Impact: 2x speedup (1920x1080 → 640x360)
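A small helper (hypothetical name) makes the aspect-ratio arithmetic explicit and testable without OpenCV:

```python
def fit_width(width: int, height: int, max_width: int = 640) -> tuple:
    # Returns new (width, height), preserving aspect ratio;
    # frames already narrow enough are left untouched
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, int(height * scale)
```

Use the result as the `dsize` argument to `cv2.resize(frame, (new_w, new_h))`.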

2. Skip Frames

frame_count = 0
process_every_n = 2  # Process every 2nd frame
results = None       # Holds the last results for skipped frames

while True:
    ret, frame = camera.read()
    if not ret:
        break
    frame_count += 1
    
    if frame_count % process_every_n == 0:
        results = model.predict(frame)
    
    # On skipped frames, reuse the previous `results`
Impact: 2x speedup (process_every_n=2)

3. Disable Visualization During Operation

# Development: show video
cv2.imshow('Stream', frame)  # SLOW!

# Production: no display
# (just process and send to robot)
Impact: 20-30% speedup

4. Use Threading for I/O

import threading
import queue

class ThreadedCamera:
    def __init__(self, source=0):
        self.camera = cv2.VideoCapture(source)
        self.queue = queue.Queue(maxsize=2)
        self.stopped = False
        
    def start(self):
        threading.Thread(target=self._read_frames, daemon=True).start()
        return self
    
    def _read_frames(self):
        while not self.stopped:
            ret, frame = self.camera.read()
            if not ret:
                break
            
            if not self.queue.full():
                self.queue.put(frame)
    
    def read(self):
        return self.queue.get()
    
    def stop(self):
        self.stopped = True

# Usage
camera = ThreadedCamera(0).start()
while True:
    frame = camera.read()
    results = model.predict(frame)
Impact: 10-15% speedup (camera I/O overlaps with inference)

5. Adjust Confidence Threshold

# Lower threshold = more detections to process
results = model.predict(frame, conf=0.3)  # SLOW

# Higher threshold = fewer detections
results = model.predict(frame, conf=0.6)  # FASTER
Impact: 5-10% speedup (fewer postprocessing operations)
Optimization Priority:
  1. Use optimized model format (MNN/NCNN)
  2. Reduce input resolution
  3. Disable visualization in production
  4. Skip frames if tolerable
  5. Use threading for I/O

Integration with Robot Control

Complete Vision-Control System

import cv2
import time
from vision_class.process.image_processing import ImageProcessor
from comm_class.raspberry_comm.json_data import SerialCommunication

class RobotVisionController:
    def __init__(self):
        # Initialize components
        self.processor = ImageProcessor(confidence_threshold=0.5)
        self.serial = SerialCommunication()
        self.camera = cv2.VideoCapture(0)
        
        # State
        self.last_detection = None
        self.last_send_time = 0
        self.send_interval = 0.5  # Send updates every 500ms
        
    def start(self):
        # Connect to robot
        if not self.serial.connect():
            print("Failed to connect to robot")
            return
        
        print("Vision-control system started")
        print("Press 'q' to quit")
        
        try:
            while True:
                # Capture frame
                ret, frame = self.camera.read()
                if not ret:
                    break
                
                # Process frame
                processed, detection = self.processor.process_image(frame, 0.5)
                
                # Send detection to robot (rate-limited)
                if detection and detection['confidence'] > 0:
                    current_time = time.time()
                    if current_time - self.last_send_time > self.send_interval:
                        self._send_detection(detection)
                        self.last_send_time = current_time
                
                # Display (optional - remove for production)
                if detection and detection['confidence'] > 0:
                    self.processor._draw_detection(processed, detection)
                cv2.imshow('Robot Vision', processed)
                
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
        
        finally:
            self.cleanup()
    
    def _send_detection(self, detection: dict):
        """
        Send detection to robot via serial
        """
        box = detection['box']
        center_x = int((box[0] + box[2]) / 2)
        center_y = int((box[1] + box[3]) / 2)
        
        message_data = {
            'class': detection['class'],
            'confidence': float(detection['confidence']),
            'center_x': center_x,
            'center_y': center_y,
            'bbox': box.tolist() if hasattr(box, 'tolist') else list(box)
        }
        
        self.serial.writing_data('detection', message_data)
        print(f"Sent: {detection['class']} at ({center_x}, {center_y})")
    
    def cleanup(self):
        print("Shutting down...")
        self.camera.release()
        cv2.destroyAllWindows()
        self.serial.close()

if __name__ == "__main__":
    controller = RobotVisionController()
    controller.start()
System Flow:
  1. Initialize: Load model, connect to robot
  2. Capture: Get frame from camera
  3. Detect: Run YOLO inference
  4. Filter: Select best detection
  5. Rate Limit: Don’t spam robot with messages
  6. Send: Transmit detection via serial JSON
  7. Visualize: Display results (optional)
  8. Repeat: Process next frame
Rate Limiting
Sending detections every frame (10 FPS = 10 messages/sec) can overwhelm the robot. Rate limiting (1-2 messages/sec) provides stable control while the robot executes movements.
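The rate-limiting check in the controller can be factored into a small reusable class (a sketch; the course code keeps the check inline with `time.time()`):

```python
import time

class RateLimiter:
    def __init__(self, interval_s: float = 0.5):
        self.interval = interval_s
        self.last = 0.0

    def ready(self, now: float = None) -> bool:
        # Returns True (and re-arms the timer) if the interval has elapsed;
        # `now` can be injected for testing, otherwise a monotonic clock is used
        now = time.monotonic() if now is None else now
        if now - self.last >= self.interval:
            self.last = now
            return True
        return False
```

In the loop: `if detection and detection['confidence'] > 0 and limiter.ready(): self._send_detection(detection)`.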

Monitoring and Debugging

Performance Metrics

import time
from collections import deque

class PerformanceMonitor:
    def __init__(self, window_size=30):
        self.inference_times = deque(maxlen=window_size)
        self.frame_times = deque(maxlen=window_size)
        self.detection_counts = deque(maxlen=window_size)
        
    def record_inference(self, duration: float):
        self.inference_times.append(duration)
    
    def record_frame(self, duration: float):
        self.frame_times.append(duration)
    
    def record_detections(self, count: int):
        self.detection_counts.append(count)
    
    def get_stats(self) -> dict:
        # Guard against division by zero before any samples are recorded
        if not (self.inference_times and self.frame_times and self.detection_counts):
            return {}
        return {
            'avg_inference_ms': sum(self.inference_times) / len(self.inference_times) * 1000,
            'avg_fps': 1.0 / (sum(self.frame_times) / len(self.frame_times)),
            'avg_detections': sum(self.detection_counts) / len(self.detection_counts)
        }

# Usage
monitor = PerformanceMonitor()
frame_idx = 0

while True:
    frame_start = time.time()
    frame_idx += 1
    
    # Inference
    inf_start = time.time()
    results = model.predict(frame)
    inf_time = time.time() - inf_start
    monitor.record_inference(inf_time)
    
    # Count detections
    num_detections = len(results[0].boxes)
    monitor.record_detections(num_detections)
    
    # Frame time
    frame_time = time.time() - frame_start
    monitor.record_frame(frame_time)
    
    # Print stats every 30 frames
    if frame_idx % 30 == 0:
        stats = monitor.get_stats()
        print(f"Inference: {stats['avg_inference_ms']:.1f}ms, "
              f"FPS: {stats['avg_fps']:.1f}, "
              f"Detections: {stats['avg_detections']:.1f}")

Debug Logging

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('robot_vision.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger('RobotVision')

# In processing code
logger.info(f'Detected {class_name} with confidence {conf:.2f}')
logger.warning(f'Low FPS: {fps:.1f}')
logger.error(f'Serial communication failed: {e}')

Common Issues

Problem: Low FPS (< 5)
Checks:
# Check model format
print(f'Model path: {model.ckpt_path}')  # Should be .mnn or .ncnn

# Check input size
print(f'Input size: {imgsz}')  # 640 is standard, 1280 is slow

# Check camera resolution
print(f'Frame size: {frame.shape}')  # Lower is faster
Problem: No detections
Checks:
# Check confidence threshold
results = model.predict(frame, conf=0.3)  # Try lower

# Check classes
print(f'Model classes: {model.names}')  # Verify apple, orange, bottle present

# Check lighting
cv2.imwrite('debug_frame.jpg', frame)  # Inspect lighting quality
Problem: High latency
Measure stages:
t0 = time.time()
ret, frame = camera.read()
t1 = time.time()
results = model.predict(frame)
t2 = time.time()
processed = parse_results(results)
t3 = time.time()
serial.send(data)
t4 = time.time()

print(f'Capture: {(t1-t0)*1000:.1f}ms')
print(f'Inference: {(t2-t1)*1000:.1f}ms')
print(f'Processing: {(t3-t2)*1000:.1f}ms')
print(f'Serial: {(t4-t3)*1000:.1f}ms')
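The same measurement can be wrapped in a context manager (an illustrative helper, not from the course code) so every stage is timed the same way:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name: str, timings: dict):
    # Records the wall time of the enclosed block, in milliseconds
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

# Usage: wrap each pipeline stage and inspect `timings` afterward
timings = {}
with stage_timer("capture", timings):
    pass  # ret, frame = camera.read()
with stage_timer("inference", timings):
    pass  # results = model.predict(frame)
```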
Typical Raspberry Pi 4 Timings (yolo11s.mnn):
  • Capture: 10-20ms
  • Inference: 100-150ms (dominant)
  • Processing: 1-5ms
  • Serial: 1-2ms
  • Total: ~120-180ms (5-8 FPS)
If inference > 200ms, check model format and size.

Production Deployment

Systemd Service

Run the vision system as a service on Raspberry Pi. Create robot_vision.service:
[Unit]
Description=Robot Vision System
After=network.target

[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi/robot_arm
ExecStart=/usr/bin/python3 /home/pi/robot_arm/main.py
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
Install:
sudo cp robot_vision.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable robot_vision
sudo systemctl start robot_vision

# Check status
sudo systemctl status robot_vision

# View logs
journalctl -u robot_vision -f

Error Recovery

class RobotVisionController:
    def start(self):
        retry_count = 0
        max_retries = 5
        
        while retry_count < max_retries:
            try:
                self._run()
                break  # Successful
            except Exception as e:
                retry_count += 1
                logger.error(f'System error: {e}, retry {retry_count}/{max_retries}')
                time.sleep(5)
                
                # Reinitialize
                self.cleanup()
                self._initialize()
        
        if retry_count >= max_retries:
            logger.critical('System failed after max retries')
    
    def _run(self):
        # Main processing loop
        pass
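A variant with exponential backoff (an addition; the course loop uses a fixed 5-second delay) can be written as a standalone helper, with the sleep function injectable for testing:

```python
import time

def run_with_retries(task, max_retries: int = 5, base_delay: float = 1.0,
                     sleep=time.sleep):
    # Retries `task` up to max_retries times, doubling the delay each attempt;
    # re-raises the last error if every attempt fails
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In the controller, `task` would wrap `self._run()` with cleanup and reinitialization between attempts.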

Practice Exercise

Build Complete Vision System

Task: Create an integrated vision-robot controller
Requirements:
  1. Load MNN model
  2. Capture video from camera
  3. Detect apples, oranges, bottles
  4. Send detections to VEX Brain via serial
  5. Display FPS and detection info
  6. Log performance metrics
Success Criteria:
  • System runs at greater than 5 FPS on Raspberry Pi
  • Detections sent with less than 200ms latency
  • Confidence threshold configurable
  • Graceful error handling
  • Can run for 30+ minutes without crashes

Extension: Multi-Object Tracking

Instead of “best” detection, track multiple objects:
  • Assign IDs to detected objects
  • Track positions across frames
  • Send updates only when objects move significantly
  • Prioritize closest or newest objects
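The ID-assignment idea can be sketched as a minimal centroid tracker: each detection keeps the ID of the nearest known object, or gets a fresh one. This is a toy sketch under simplifying assumptions (no object removal, greedy matching); production systems typically use IoU- or Kalman-filter-based trackers such as SORT or ByteTrack:

```python
import math
from itertools import count

class CentroidTracker:
    def __init__(self, max_dist: float = 50.0):
        self.objects = {}        # object id -> last known (x, y)
        self.max_dist = max_dist
        self._ids = count()

    def update(self, centroids):
        # centroids: list of (x, y) tuples for the current frame;
        # returns {centroid: assigned id}
        assigned = {}
        for c in centroids:
            # Nearest existing object within max_dist keeps its ID
            best_id, best_d = None, self.max_dist
            for oid, pos in self.objects.items():
                d = math.dist(c, pos)
                if d < best_d and oid not in assigned.values():
                    best_id, best_d = oid, d
            if best_id is None:
                best_id = next(self._ids)  # unseen object: new ID
            self.objects[best_id] = c
            assigned[c] = best_id
        return assigned
```

With stable IDs you can send a serial update only when an object's centroid moves more than a few pixels since its last transmitted position.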

Summary

You’ve learned:
  • ✓ Loading and running optimized YOLO models
  • ✓ Building real-time inference pipelines
  • ✓ Processing and filtering detection results
  • ✓ Integrating vision with serial communication
  • ✓ Performance optimization techniques
  • ✓ Monitoring and debugging methods
  • ✓ Production deployment strategies

Course Completion

Congratulations! You’ve completed the Computer Vision module and the entire Robotic Arm with Computer Vision course. You can now:
  • Design serial communication protocols
  • Train custom YOLO models
  • Optimize models for edge deployment
  • Build real-time vision systems
  • Integrate vision with robot control
  • Deploy production-ready systems

Course Overview

Review course structure and learning outcomes

Theoretical Concepts

Revisit foundational concepts

Next Steps

Capstone Project

Build the complete integrated system:
  1. Hardware Setup:
    • Raspberry Pi with camera
    • VEX IQ2 Brain with robotic arm
    • Serial connection
  2. Software Integration:
    • Train model on your specific objects
    • Export to MNN format
    • Implement vision controller
    • Implement robot control logic on VEX
    • Test end-to-end: detection → serial → movement
  3. Testing:
    • Place object in workspace
    • Verify detection and classification
    • Confirm serial message transmission
    • Validate robot movement to target
    • Test error cases (no object, multiple objects)
  4. Optimization:
    • Tune confidence thresholds
    • Optimize for your lighting conditions
    • Calibrate camera-to-robot coordinates
    • Add safety checks (collision avoidance)

Further Learning

  • Advanced Vision: Depth estimation, 3D pose estimation
  • Robot Planning: Path planning, inverse kinematics
  • Control Theory: PID tuning, trajectory optimization
  • Multi-Robot: Coordinate multiple robots
  • ROS Integration: Use Robot Operating System
Reference Code: course/vision_class/
  • inference/detector.py:15-22: Inference parameters and setup
  • inference/model_loader.py:1-15: MNN model loading
  • process/image_processing.py:24-72: Complete detection pipeline
  • video_stream.py:9-55: Real-time video processing
Complete, production-ready implementations!
