Inference Optimization
The final piece of the puzzle: running real-time object detection on Raspberry Pi and integrating it with your robotic arm. This lesson covers inference pipelines, performance optimization, and complete system integration.
Learning Objectives
By the end of this lesson, you will be able to:
Load and run optimized YOLO models efficiently
Build inference pipelines for real-time video processing
Optimize for maximum FPS on Raspberry Pi
Process detection results for robot control
Integrate vision with serial communication
Debug and monitor system performance
This lesson uses the complete implementation from course/vision_class/, including inference, processing, and video streaming.
Inference Architecture
System Components
Pipeline Stages:
Capture: Get frame from camera
Preprocess: Resize and normalize for model input
Inference: Run YOLO detection
Postprocess: Parse results, filter by confidence
Selection: Choose best detection
Communication: Send target to robot
Visualization: Display results (optional)
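The seven stages above can be sketched as a single loop. This is a minimal skeleton, not the course implementation: each callable is a hypothetical placeholder for the real component (camera read, resize/normalize, YOLO predict, result parsing, serial send) covered later in this lesson.

```python
def run_pipeline(capture, preprocess, infer, postprocess, select, send, visualize=None):
    """One pass through the seven pipeline stages.

    Each argument is a callable standing in for a real component.
    """
    frame = capture()                  # 1. Capture
    model_input = preprocess(frame)    # 2. Preprocess
    raw = infer(model_input)           # 3. Inference
    detections = postprocess(raw)      # 4. Postprocess
    target = select(detections)        # 5. Selection
    if target is not None:
        send(target)                   # 6. Communication
    if visualize is not None:
        visualize(frame, target)       # 7. Visualization (optional)
    return target
```

Wiring real components into this shape keeps each stage independently testable.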
Model Loading
Efficient Model Loader
From inference/model_loader.py:1-15:
import os
from ultralytics import YOLO

class ModelLoader:
    def __init__(self):
        current_path = os.path.dirname(os.path.abspath(__file__))
        object_model_path: str = current_path + '/models/mnn/yolo11s.mnn'
        self.model: YOLO = YOLO(object_model_path, task='detect')

    def get_model(self) -> YOLO:
        return self.model
Key Design Points:
Relative Path: Uses __file__ for portability
MNN Format: Optimized for the Raspberry Pi ARM CPU
Single Load: Model loaded once at initialization
Task Specification: Explicit task='detect' for detection mode
Performance Tip: Load models during initialization, not per-frame. Model loading takes 1-2 seconds but only needs to happen once.
Detection Model Wrapper
From inference/detector.py:1-22:
import numpy as np
from typing import Dict, Iterator, Tuple
from abc import ABC, abstractmethod
from ultralytics.engine.results import Results
from .model_loader import ModelLoader

class DetectionModelInterface(ABC):
    @abstractmethod
    def inference(self, image: np.ndarray) -> Tuple[Iterator[Results], Dict[int, str]]:
        pass

class DetectionModel(DetectionModelInterface):
    def __init__(self):
        self.object_model = ModelLoader().get_model()

    def inference(self, image: np.ndarray) -> Tuple[Iterator[Results], Dict[int, str]]:
        results = self.object_model.predict(
            image,
            conf=0.55,      # Confidence threshold
            verbose=False,  # Suppress output
            imgsz=640,      # Input size
            stream=True,    # Memory-efficient streaming (returns a generator)
            task='detect',  # Detection mode
            half=True       # FP16 precision
        )
        return results, self.object_model.names
Parameters Explained:
conf=0.55: Minimum confidence (55%) to accept a detection
Lower = more detections, more false positives
Higher = fewer detections, some objects missed
0.5-0.6 is typical for robotics
verbose=False: Suppress logging (cleaner output)
imgsz=640: Model input size (must match training)
stream=True: Generator-based results (memory efficient)
Returns an iterator instead of a list
Processes results one at a time
Reduces memory usage
half=True: Use FP16 precision if supported
~1.5x faster on supported hardware
Minimal accuracy loss
May not work on all platforms (falls back to FP32)
Confidence Threshold Tuning:
Too low (0.3): many false detections, robot gets confused
Just right (0.5-0.6): good balance
Too high (0.8): misses valid objects
Test with your specific objects and lighting conditions!
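One way to pick a threshold empirically is to collect the confidence scores from a few representative frames and count how many detections survive at each candidate threshold. This helper is a sketch for that (the function name and defaults are our own, not part of the course code); the scores would come from something like `results[0].boxes.conf` on your own frames.

```python
def count_detections_per_threshold(confidences, thresholds=(0.3, 0.45, 0.55, 0.7)):
    """Count how many detections survive each confidence threshold.

    `confidences` is a list of confidence scores from representative
    frames of your own scene.
    """
    return {t: sum(1 for conf in confidences if conf >= t) for t in thresholds}
```

If the count drops sharply between two thresholds, your true objects and false positives are well separated there; pick a value in that gap.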
Image Processing Pipeline
Complete Processing Class
From process/image_processing.py:8-23:
import cv2
import numpy as np
import logging as log

log.basicConfig(level=log.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

from .detection.main import (DetectionModelInterface, DetectionModel)

class ImageProcessor:
    def __init__(self, confidence_threshold: float = 0.45):
        self.detection: DetectionModelInterface = DetectionModel()
        self.conf_threshold = confidence_threshold

    def read_image_path(self, path: str, draw_results: bool = True,
                        save_drawn_img: bool = True):
        object_image = cv2.imread(path)
        processed_img, best_detection = self.process_image(
            object_image, self.conf_threshold
        )
        if draw_results and best_detection is not None and \
                best_detection.get('confidence', 0) > 0:
            self._draw_detection(processed_img, best_detection)
            if save_drawn_img:
                self._save_drawn_image(processed_img, path)
        return processed_img, best_detection
Core Processing Logic
From process/image_processing.py:24-72:
    def process_image(self, image: np.ndarray, confidence_threshold: float = 0.45):
        try:
            # 1. Run inference
            copy_image = image.copy()
            object_results, object_classes = self.detection.inference(copy_image)

            # 2. Initialize best detection
            best_detection = {
                'class': '',
                'confidence': 0.0,
                'box': [],
                'class_id': -1
            }

            # 3. Process all results
            for res in object_results:
                boxes = res.boxes
                # No detections in this result
                if boxes.shape[0] == 0:
                    continue
                # Extract detection data
                confidence = boxes.conf.cpu().numpy()[0]
                class_id = int(boxes.cls[0])
                box_data = boxes.xyxy.cpu().numpy()[0]
                # Filter by confidence
                if confidence < confidence_threshold:
                    continue
                # Get class name
                detected_class = object_classes[class_id]
                clss_object = 'default'
                # Filter by target classes
                if detected_class in ['apple', 'orange', 'bottle']:
                    clss_object = detected_class
                log.info(f'class: {clss_object}')
                # Keep highest-confidence detection
                if confidence > best_detection['confidence']:
                    best_detection.update({
                        'class': str(clss_object),
                        'confidence': float(confidence),
                        'box': box_data,
                        'class_id': class_id
                    })

            # 4. Return results
            if best_detection['confidence'] >= confidence_threshold:
                log.info(f"best detection: {best_detection}")
                return image, best_detection
            else:
                log.info("no detections found")
                return image, best_detection
        except Exception as e:
            log.error(f'error in image processing: {e}')
            return image, None
Algorithm Steps:
Inference: Run the YOLO model on the image
Iterate Results: Stream-based processing
Extract Data: Get confidence, class, bounding box
Filter Confidence: Skip low-confidence detections
Filter Classes: Only keep apple, orange, bottle
Select Best: Highest confidence wins
Return: Best detection or None
Why “best” detection? For pick-and-place, the robot can only grab one object at a time. We select the highest-confidence detection as the target. In production, you might:
Choose closest object (depth sensing)
Pick specific class (“get the apple”)
Use task priority (bottles before apples)
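The alternative strategies above can be combined into one selector. This sketch (our own illustration, not course code) picks by class priority first, then by distance; without a depth sensor, distance to the image center is a rough stand-in for "closest".

```python
def select_target(detections, frame_center, priority=("bottle", "apple", "orange")):
    """Pick a target by class priority, breaking ties by distance to frame center.

    Each detection is a dict with 'class', 'confidence', and
    'box' (x1, y1, x2, y2), matching the best_detection format above.
    """
    def rank(det):
        x1, y1, x2, y2 = det["box"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        dist = ((cx - frame_center[0]) ** 2 + (cy - frame_center[1]) ** 2) ** 0.5
        # Lower priority index wins first, then smaller distance
        prio = priority.index(det["class"]) if det["class"] in priority else len(priority)
        return (prio, dist)

    return min(detections, key=rank) if detections else None
```

Swap the distance term for real depth readings if your setup has a depth sensor.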
Visualization
From process/image_processing.py:74-95:
    def _draw_detection(self, image: np.ndarray, detection: dict):
        """Draw bounding box and label on image"""
        box = detection['box']
        class_name = detection['class']
        confidence = detection['confidence']
        x1, y1, x2, y2 = map(int, box)
        color = (0, 255, 0)  # BGR - green

        # Draw rectangle
        cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)

        # Draw label
        label = f"{class_name} {confidence:.2f}"
        cv2.putText(
            image, label, (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2
        )

    def _save_drawn_image(self, image: np.ndarray, original_path: str):
        """Save annotated image"""
        out_path = original_path.replace('.jpg', '_detected.jpg')
        cv2.imwrite(out_path, image)
        log.info(f"saved image with drawn detections: {out_path}")
Visualization Features :
Green bounding box around object
Label with class name and confidence
Optional: Save annotated image
Real-Time Video Stream
Video Processing System
From video_stream.py:9-51:
import cv2
import time
import logging as log
from typing import Union
from process.image_processing import ImageProcessor

class VideoStream:
    def __init__(self, source: Union[int, str] = 0, confidence_threshold: float = 0.45):
        self.source = source
        self.confidence_threshold = confidence_threshold
        self.image_processor = ImageProcessor(
            confidence_threshold=confidence_threshold
        )
        self.video_capture = None
        self.prev_time = 0

    def start_stream(self):
        self.video_capture = cv2.VideoCapture(self.source)
        if not self.video_capture.isOpened():
            log.error(f"Error opening video stream or file: {self.source}")
            return False

        while True:
            ret, frame = self.video_capture.read()
            if not ret:
                log.error("Failed to capture frame from video stream.")
                break

            # Calculate FPS
            current_time = time.time()
            fps = 1 / (current_time - self.prev_time) if self.prev_time > 0 else 0
            self.prev_time = current_time

            # Run inference
            processed_frame, best_detection = self.image_processor.process_image(
                frame, self.confidence_threshold
            )

            # Draw best detection
            if best_detection is not None and best_detection.get('confidence', 0) > 0:
                self.image_processor._draw_detection(processed_frame, best_detection)

            # Display FPS
            cv2.putText(
                processed_frame, f"FPS: {fps:.2f}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2
            )
            cv2.imshow('Video Stream', processed_frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        self.video_capture.release()
        cv2.destroyAllWindows()
        return True
Key Features:
Camera Source: 0 for the default webcam, or a video file path
FPS Calculation: Real-time performance monitoring
Frame-by-Frame: Process each frame independently
Best Detection: Highlight the highest-confidence object
Display: Show annotated video with FPS
Exit: Press 'q' to quit
Usage:

if __name__ == "__main__":
    real_time_inference = VideoStream(0, 0.45)
    real_time_inference.start_stream()
Frame Rate Factors:
Model size (yolo11n is faster than yolo11s, which is faster than yolo11m)
Input resolution (640x640 vs 1280x1280)
Confidence threshold (affects postprocessing)
Camera resolution (lower = faster capture)
Display rendering (imshow takes time)
On Raspberry Pi 4 with yolo11s.mnn, expect 5-10 FPS.
Performance Optimization Techniques
1. Reduce Input Resolution

# Before inference, resize frame
height, width = frame.shape[:2]
if width > 640:
    scale = 640 / width
    new_width = 640
    new_height = int(height * scale)
    frame = cv2.resize(frame, (new_width, new_height))

Impact: ~2x speedup (e.g. 1920x1080 → 640x360)
2. Skip Frames

frame_count = 0
process_every_n = 2  # Process every 2nd frame
results = None

while True:
    ret, frame = camera.read()
    if not ret:
        break
    frame_count += 1
    if frame_count % process_every_n == 0:
        results = model.predict(frame)
    # Reuse previous results for skipped frames

Impact: 2x speedup (with process_every_n=2)
3. Disable Visualization During Operation

# Development: show video
cv2.imshow('Stream', frame)  # SLOW!

# Production: no display
# (just process and send to robot)

Impact: 20-30% speedup
4. Use Threading for I/O

import threading
import queue
import cv2

class ThreadedCamera:
    def __init__(self, source=0):
        self.camera = cv2.VideoCapture(source)
        self.queue = queue.Queue(maxsize=2)
        self.stopped = False

    def start(self):
        threading.Thread(target=self._read_frames, daemon=True).start()
        return self

    def _read_frames(self):
        while not self.stopped:
            ret, frame = self.camera.read()
            if not ret:
                break
            if not self.queue.full():
                self.queue.put(frame)

    def read(self):
        return self.queue.get()

    def stop(self):
        self.stopped = True

# Usage
camera = ThreadedCamera(0).start()
while True:
    frame = camera.read()
    results = model.predict(frame)

Impact: 10-15% speedup (camera I/O overlaps with inference)
5. Adjust Confidence Threshold

# Lower threshold = more detections to process
results = model.predict(frame, conf=0.3)  # slower

# Higher threshold = fewer detections
results = model.predict(frame, conf=0.6)  # faster

Impact: 5-10% speedup (fewer postprocessing operations)
Optimization Priority:
Use an optimized model format (MNN/NCNN)
Reduce input resolution
Disable visualization in production
Skip frames if tolerable
Use threading for I/O
Integration with Robot Control
Complete Vision-Control System
import cv2
import time
from vision_class.process.image_processing import ImageProcessor
from comm_class.raspberry_comm.json_data import SerialCommunication

class RobotVisionController:
    def __init__(self):
        # Initialize components
        self.processor = ImageProcessor(confidence_threshold=0.5)
        self.serial = SerialCommunication()
        self.camera = cv2.VideoCapture(0)

        # State
        self.last_detection = None
        self.last_send_time = 0
        self.send_interval = 0.5  # Send updates every 500ms

    def start(self):
        # Connect to robot
        if not self.serial.connect():
            print("Failed to connect to robot")
            return

        print("Vision-control system started")
        print("Press 'q' to quit")

        try:
            while True:
                # Capture frame
                ret, frame = self.camera.read()
                if not ret:
                    break

                # Process frame
                processed, detection = self.processor.process_image(frame, 0.5)

                # Send detection to robot (rate-limited)
                if detection and detection['confidence'] > 0:
                    current_time = time.time()
                    if current_time - self.last_send_time > self.send_interval:
                        self._send_detection(detection)
                        self.last_send_time = current_time

                # Display (optional - remove for production)
                if detection and detection['confidence'] > 0:
                    self.processor._draw_detection(processed, detection)
                cv2.imshow('Robot Vision', processed)
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
        finally:
            self.cleanup()

    def _send_detection(self, detection: dict):
        """Send detection to robot via serial"""
        box = detection['box']
        center_x = int((box[0] + box[2]) / 2)
        center_y = int((box[1] + box[3]) / 2)

        message_data = {
            'class': detection['class'],
            'confidence': float(detection['confidence']),
            'center_x': center_x,
            'center_y': center_y,
            'bbox': box.tolist() if hasattr(box, 'tolist') else list(box)
        }
        self.serial.writing_data('detection', message_data)
        print(f"Sent: {detection['class']} at ({center_x}, {center_y})")

    def cleanup(self):
        print("Shutting down...")
        self.camera.release()
        cv2.destroyAllWindows()
        self.serial.close()

if __name__ == "__main__":
    controller = RobotVisionController()
    controller.start()
System Flow:
Initialize: Load model, connect to robot
Capture: Get frame from camera
Detect: Run YOLO inference
Filter: Select best detection
Rate Limit: Don't spam the robot with messages
Send: Transmit detection via serial JSON
Visualize: Display results (optional)
Repeat: Process next frame
Rate Limiting: Sending detections every frame (10 FPS = 10 messages/sec) can overwhelm the robot. Rate limiting (1-2 messages/sec) provides stable control while the robot executes movements.
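The rate-limiting logic can also be factored into a small reusable helper instead of inline timestamp checks. This sketch is our own variant (class name and the injectable clock are assumptions for testability, not course code):

```python
import time

class RateLimiter:
    """Allow an action at most once per `interval` seconds."""

    def __init__(self, interval: float = 0.5, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable clock, useful for testing
        self._last = float("-inf")  # so the first call is always allowed

    def ready(self) -> bool:
        """Return True (and record the time) if the interval has elapsed."""
        now = self.clock()
        if now - self._last >= self.interval:
            self._last = now
            return True
        return False
```

In the controller loop, `if limiter.ready(): send_detection(...)` replaces the manual last_send_time bookkeeping.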
Monitoring and Debugging
import time
from collections import deque

class PerformanceMonitor:
    def __init__(self, window_size=30):
        self.inference_times = deque(maxlen=window_size)
        self.frame_times = deque(maxlen=window_size)
        self.detection_counts = deque(maxlen=window_size)

    def record_inference(self, duration: float):
        self.inference_times.append(duration)

    def record_frame(self, duration: float):
        self.frame_times.append(duration)

    def record_detections(self, count: int):
        self.detection_counts.append(count)

    def get_stats(self) -> dict:
        return {
            'avg_inference_ms': sum(self.inference_times) / len(self.inference_times) * 1000,
            'avg_fps': 1.0 / (sum(self.frame_times) / len(self.frame_times)),
            'avg_detections': sum(self.detection_counts) / len(self.detection_counts)
        }

# Usage
monitor = PerformanceMonitor()
frame_idx = 0

while True:
    frame_start = time.time()

    # Inference
    inf_start = time.time()
    results = model.predict(frame)
    inf_time = time.time() - inf_start
    monitor.record_inference(inf_time)

    # Count detections
    num_detections = len(results[0].boxes)
    monitor.record_detections(num_detections)

    # Frame time
    frame_time = time.time() - frame_start
    monitor.record_frame(frame_time)

    # Print stats every 30 frames
    frame_idx += 1
    if frame_idx % 30 == 0:
        stats = monitor.get_stats()
        print(f"Inference: {stats['avg_inference_ms']:.1f}ms, "
              f"FPS: {stats['avg_fps']:.1f}, "
              f"Detections: {stats['avg_detections']:.1f}")
Debug Logging
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('robot_vision.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger('RobotVision')

# In processing code
logger.info(f'Detected {class_name} with confidence {conf:.2f}')
logger.warning(f'Low FPS: {fps:.1f}')
logger.error(f'Serial communication failed: {e}')
Common Issues
Problem: Low FPS (< 5)
Checks:
# Check model format
print(f'Model path: {model.ckpt_path}')  # Should be .mnn or .ncnn

# Check input size
print(f'Input size: {imgsz}')  # 640 is standard, 1280 is slow

# Check camera resolution
print(f'Frame size: {frame.shape}')  # Lower is faster
Problem: No detections
Checks:
# Check confidence threshold
results = model.predict(frame, conf=0.3)  # Try lower

# Check classes
print(f'Model classes: {model.names}')  # Verify apple, orange, bottle are present

# Check lighting
cv2.imwrite('debug_frame.jpg', frame)  # Inspect lighting quality
Problem: High latency
Measure stages:
t0 = time.time()
ret, frame = camera.read()
t1 = time.time()
results = model.predict(frame)
t2 = time.time()
processed = parse_results(results)
t3 = time.time()
serial.send(data)
t4 = time.time()

print(f'Capture: {(t1 - t0) * 1000:.1f}ms')
print(f'Inference: {(t2 - t1) * 1000:.1f}ms')
print(f'Processing: {(t3 - t2) * 1000:.1f}ms')
print(f'Serial: {(t4 - t3) * 1000:.1f}ms')
Typical Raspberry Pi 4 Timings (yolo11s.mnn):
Capture: 10-20ms
Inference: 100-150ms (dominant)
Processing: 1-5ms
Serial: 1-2ms
Total: ~120-180ms (5-8 FPS)
If inference exceeds 200ms, check the model format and size.
Production Deployment
Systemd Service
Run vision system as a service on Raspberry Pi:
robot_vision.service:

[Unit]
Description=Robot Vision System
After=network.target

[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi/robot_arm
ExecStart=/usr/bin/python3 /home/pi/robot_arm/main.py
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
Install:
sudo cp robot_vision.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable robot_vision
sudo systemctl start robot_vision
# Check status
sudo systemctl status robot_vision
# View logs
journalctl -u robot_vision -f
Error Recovery
class RobotVisionController:
    def start(self):
        retry_count = 0
        max_retries = 5

        while retry_count < max_retries:
            try:
                self._run()
                break  # Successful run
            except Exception as e:
                retry_count += 1
                logger.error(f'System error: {e}, retry {retry_count}/{max_retries}')
                time.sleep(5)
                # Reinitialize
                self.cleanup()
                self._initialize()

        if retry_count >= max_retries:
            logger.critical('System failed after max retries')

    def _run(self):
        # Main processing loop
        pass
Practice Exercise
Build Complete Vision System
Task: Create an integrated vision-robot controller
Requirements:
Load MNN model
Capture video from camera
Detect apples, oranges, bottles
Send detections to VEX Brain via serial
Display FPS and detection info
Log performance metrics
Success Criteria:
System runs at greater than 5 FPS on Raspberry Pi
Detections sent with less than 200ms latency
Confidence threshold configurable
Graceful error handling
Can run for 30+ minutes without crashes
Extension: Multi-Object Tracking
Instead of “best” detection, track multiple objects:
Assign IDs to detected objects
Track positions across frames
Send updates only when objects move significantly
Prioritize closest or newest objects
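As a starting point for the tracking extension, the steps above can be sketched as a minimal centroid tracker: match each new detection to the nearest known object within a pixel radius, and assign a new ID otherwise. This is our own illustrative sketch, not a production tracker (no disappearance handling, no motion model):

```python
class CentroidTracker:
    """Assign stable IDs to detections by nearest-centroid matching."""

    def __init__(self, max_dist: float = 50.0):
        self.max_dist = max_dist
        self.next_id = 0
        self.objects = {}  # id -> (cx, cy)

    def update(self, centroids):
        """Match new centroids to tracked objects; return {id: (cx, cy)}."""
        assigned = {}
        unmatched = dict(self.objects)
        for cx, cy in centroids:
            # Nearest still-unmatched known object, if any
            best = min(
                unmatched.items(),
                key=lambda kv: (kv[1][0] - cx) ** 2 + (kv[1][1] - cy) ** 2,
                default=None,
            )
            if best is not None and (
                (best[1][0] - cx) ** 2 + (best[1][1] - cy) ** 2
            ) ** 0.5 <= self.max_dist:
                obj_id = best[0]
                del unmatched[obj_id]
            else:
                obj_id = self.next_id  # unseen object: new ID
                self.next_id += 1
            assigned[obj_id] = (cx, cy)
        self.objects = assigned
        return assigned
```

With stable IDs you can send updates only when an object's centroid moves more than a few pixels between frames.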
Summary
You’ve learned:
✓ Loading and running optimized YOLO models
✓ Building real-time inference pipelines
✓ Processing and filtering detection results
✓ Integrating vision with serial communication
✓ Performance optimization techniques
✓ Monitoring and debugging methods
✓ Production deployment strategies
Course Completion
Congratulations! You’ve completed the Computer Vision module and the entire Robotic Arm with Computer Vision course.
You can now:
Design serial communication protocols
Train custom YOLO models
Optimize models for edge deployment
Build real-time vision systems
Integrate vision with robot control
Deploy production-ready systems
Next Steps
Capstone Project
Build the complete integrated system:
Hardware Setup:
Raspberry Pi with camera
VEX IQ2 Brain with robotic arm
Serial connection
Software Integration:
Train model on your specific objects
Export to MNN format
Implement vision controller
Implement robot control logic on VEX
Test end-to-end: detection → serial → movement
Testing:
Place object in workspace
Verify detection and classification
Confirm serial message transmission
Validate robot movement to target
Test error cases (no object, multiple objects)
Optimization:
Tune confidence thresholds
Optimize for your lighting conditions
Calibrate camera-to-robot coordinates
Add safety checks (collision avoidance)
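For the camera-to-robot calibration step, a common approach on a flat workspace is to fit a 2D affine map from pixel coordinates to robot coordinates using a few corresponding points (place an object at known robot positions and record the detected pixel centers). This sketch uses a least-squares fit; the function names are our own:

```python
import numpy as np

def fit_pixel_to_robot(pixel_pts, robot_pts):
    """Fit a 2D affine map robot = [x, y, 1] @ params from >= 3 point pairs.

    Assumes a flat workspace; collect pairs by placing an object at
    known robot positions and recording its detected pixel center.
    """
    px = np.asarray(pixel_pts, dtype=float)
    rb = np.asarray(robot_pts, dtype=float)
    # Homogeneous pixel coordinates: [x, y, 1]
    X = np.hstack([px, np.ones((len(px), 1))])
    # Least-squares solve for the 3x2 affine parameters
    params, *_ = np.linalg.lstsq(X, rb, rcond=None)
    return params  # shape (3, 2)

def pixel_to_robot(params, cx, cy):
    """Map one detected pixel center to robot coordinates."""
    return tuple(np.array([cx, cy, 1.0]) @ params)
```

Three non-collinear calibration points determine the map exactly; more points average out detection noise.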
Further Learning
Advanced Vision : Depth estimation, 3D pose estimation
Robot Planning : Path planning, inverse kinematics
Control Theory : PID tuning, trajectory optimization
Multi-Robot : Coordinate multiple robots
ROS Integration : Use Robot Operating System
Reference Code: course/vision_class/
inference/detector.py:15-22: Inference parameters and setup
inference/model_loader.py:1-15: MNN model loading
process/image_processing.py:24-72: Complete detection pipeline
video_stream.py:9-55: Real-time video processing
Complete, production-ready implementations!