Overview
The vision system combines camera capture with YOLOv11-based object detection to identify objects in the robot’s workspace. This tutorial covers the complete pipeline from image capture to detection results.
The system uses the YOLOv11s (small) model with an NCNN backend for fast inference on embedded devices.
Architecture
The vision pipeline consists of three main components:
CameraManager: captures images from the camera
ImageProcessor: preprocesses images and coordinates detection
DetectionModel: runs YOLOv11 inference
Camera Setup
Initialize Camera Manager
```python
from arm_system.perception.vision.camera.main import CameraManager

# Initialize camera with custom settings
camera = CameraManager(
    camera_index=0,   # /dev/video0
    width=1280,       # Capture width
    height=720        # Capture height
)
```
The camera manager uses OpenCV’s VideoCapture with the configured resolution.
Capture an Image
```python
# Capture image
img_path = camera.capture_image()

if img_path:
    print(f"Image saved to: {img_path}")
else:
    print("Failed to capture image")
```
What happens internally:

```python
# From camera/main.py:11-26
def capture_image(self):
    # Flush camera buffer (grab 5 frames)
    for _ in range(5):
        self.cap.grab()

    # Capture frame
    ret, frame = self.cap.read()
    if not ret:
        return None

    # Generate filename with timestamp
    current_dir = os.path.dirname(os.path.abspath(__file__))
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"{current_dir}/objects_images/{timestamp}.png"

    # Save image
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    cv2.imwrite(filename, frame)
    return filename
```
The camera buffer is flushed with 5 grab() calls to ensure you get the most recent frame.
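The timestamp-based naming scheme can be checked in isolation. `build_image_filename` below is an illustrative helper mirroring the logic in `capture_image()`, not part of the package:

```python
import re
import time

def build_image_filename(base_dir: str) -> str:
    # Mirrors capture_image(): a timestamped PNG inside an
    # objects_images/ folder under the given directory
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    return f"{base_dir}/objects_images/{timestamp}.png"

print(build_image_filename("/tmp"))
# e.g. /tmp/objects_images/20250101-120000.png
```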
Object Detection
Initialize Detection Model
```python
from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize with confidence threshold
processor = ImageProcessor(confidence_threshold=0.45)
```
The processor loads the YOLOv11s NCNN model automatically:

```python
# From detection/model_loader.py:7-14
class ModelLoader:
    def __init__(self):
        current_path = os.path.dirname(os.path.abspath(__file__))
        object_model_path = current_path + '/models/yolo11s_ncnn_model'
        self.model = YOLO(object_model_path, task='detect')

    def get_model(self):
        return self.model
```
Run Inference on Image
```python
# Process image from file path
processed_img, detection = processor.read_image_path(
    path=img_path,
    draw_results=True,    # Draw bounding boxes
    save_drawn_img=True   # Save annotated image
)

if detection:
    print(f"Class: {detection['class']}")
    print(f"Confidence: {detection['confidence']:.2f}")
    print(f"Bounding Box: {detection['box']}")
else:
    print("No objects detected")
```
Process Detection Results
The detection result is a dictionary:

```python
detection = {
    'class': 'apple',         # Object class name
    'confidence': 0.89,       # Detection confidence (0-1)
    'box': [x1, y1, x2, y2],  # Bounding box coordinates
    'class_id': 47            # COCO class ID
}
```
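Because `'box'` holds pixel coordinates as `[x1, y1, x2, y2]`, derived quantities such as the box center (useful when targeting the arm) are a one-liner. `box_center` is an illustrative helper, not part of the package:

```python
def box_center(box):
    # box is [x1, y1, x2, y2] in pixel coordinates
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

detection = {'class': 'apple', 'confidence': 0.89,
             'box': [100, 50, 300, 250], 'class_id': 47}
print(box_center(detection['box']))  # (200.0, 150.0)
```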
Supported object classes:
apple
orange
bottle
default (for any other detected object)
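The mapping to supported classes follows the rule used inside `process_image()` (shown later): anything outside the supported set collapses to `'default'`. `map_class` is an illustrative stand-in for that logic:

```python
SUPPORTED = {'apple', 'orange', 'bottle'}

def map_class(detected_class: str) -> str:
    # Unsupported classes are reported as 'default'
    return detected_class if detected_class in SUPPORTED else 'default'

print(map_class('apple'))   # apple
print(map_class('person'))  # default
```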
Complete Pipeline Example
Capture and Detect
```python
import cv2
from arm_system.perception.vision.camera.main import CameraManager
from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize components
camera = CameraManager(camera_index=0)
processor = ImageProcessor(confidence_threshold=0.45)

# Capture image
print("Capturing image...")
img_path = camera.capture_image()
if not img_path:
    print("Failed to capture image")
    exit(1)
print(f"Image saved: {img_path}")

# Run detection
print("Running inference...")
processed_img, detection = processor.read_image_path(
    path=img_path,
    draw_results=True,
    save_drawn_img=True
)

# Display results
if detection and detection['confidence'] > 0:
    print("\nDetection Results:")
    print(f"  Class: {detection['class']}")
    print(f"  Confidence: {detection['confidence']:.2%}")
    print(f"  Bounding Box: {detection['box']}")

    # Show image with detection
    cv2.imshow('Detection', processed_img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No objects detected above threshold")
```
Process Raw Image Array
```python
import cv2
import numpy as np
from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize processor
processor = ImageProcessor(confidence_threshold=0.45)

# Load image as numpy array
image = cv2.imread('path/to/image.jpg')

# Run inference directly on array
processed_img, detection = processor.process_image(
    image=image,
    confidence_threshold=0.45
)

if detection:
    print(f"Detected: {detection['class']} ({detection['confidence']:.2%})")
```
Advanced Usage
Understanding the Inference Process
Here’s what happens inside process_image():
```python
# From image_processing.py:24-72
def process_image(self, image: np.ndarray, confidence_threshold: float = 0.45):
    try:
        # 1. Run inference
        copy_image = image.copy()
        object_results, object_classes = self.detection.inference(copy_image)

        # 2. Initialize best detection tracker
        best_detection = {
            'class': '',
            'confidence': 0.0,
            'box': [],
            'class_id': -1
        }

        # 3. Process all detections
        for res in object_results:
            boxes = res.boxes
            if boxes.shape[0] == 0:
                continue

            # Extract detection data
            confidence = boxes.conf.cpu().numpy()[0]
            class_id = int(boxes.cls[0])
            box_data = boxes.xyxy.cpu().numpy()[0]

            # Filter by confidence
            if confidence < confidence_threshold:
                continue

            # Map to supported classes
            detected_class = object_classes[class_id]
            clss_object = 'default'
            if detected_class in ['apple', 'orange', 'bottle']:
                clss_object = detected_class

            # Keep highest confidence detection
            if confidence > best_detection['confidence']:
                best_detection.update({
                    'class': str(clss_object),
                    'confidence': float(confidence),
                    'box': box_data,
                    'class_id': class_id
                })

        # 4. Return best detection
        if best_detection['confidence'] >= confidence_threshold:
            return image, best_detection
        else:
            return image, None

    except Exception as e:
        log.info(f'error in image processing: {e}')
        return image, None
```
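The selection rule (keep only the single highest-confidence detection at or above the threshold) can be exercised with mock data. `best_of` is an illustrative reimplementation of that rule, not package code:

```python
def best_of(detections, threshold=0.45):
    # detections: list of (class_name, confidence) pairs
    best = ('', 0.0)
    for cls, conf in detections:
        if conf >= threshold and conf > best[1]:
            best = (cls, conf)
    return best if best[1] >= threshold else None

mock = [('apple', 0.52), ('bottle', 0.91), ('orange', 0.30)]
print(best_of(mock))                 # ('bottle', 0.91)
print(best_of([('orange', 0.30)]))   # None: below threshold
```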
Custom Detection Model
The detection model can be customized:

```python
# From detection/main.py:15-21
class DetectionModel(DetectionModelInterface):
    def __init__(self):
        self.object_model = ModelLoader().get_model()

    def inference(self, image: np.ndarray) -> tuple[list[Results], Dict[int, str]]:
        results = self.object_model.predict(
            image,
            conf=0.55,       # Base confidence threshold
            verbose=False,
            imgsz=640,       # Input image size
            stream=True,     # Stream results
            task='detect',
            half=True        # Use FP16 inference
        )
        return results, self.object_model.names
```

Note that `conf=0.55` is applied inside `predict()` before `ImageProcessor` applies its own `confidence_threshold`, so detections below 0.55 never reach the processor regardless of the threshold passed to it.
Drawing Detections
The system automatically draws bounding boxes:
```python
# From image_processing.py:74-94
def _draw_detection(self, image: np.ndarray, detection: dict):
    """Draw bounding box and label on image"""
    box = detection['box']
    class_name = detection['class']
    confidence = detection['confidence']

    x1, y1, x2, y2 = map(int, box)
    color = (0, 255, 0)  # Green in BGR

    # Draw rectangle
    cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)

    # Draw label
    label = f"{class_name} {confidence:.2f}"
    cv2.putText(
        image, label, (x1, y1 - 10),
        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2
    )

def _save_drawn_image(self, image: np.ndarray, original_path: str):
    """Save annotated image"""
    out_path = original_path.replace('.jpg', '_detected.jpg')
    cv2.imwrite(out_path, image)
    log.info(f"Saved annotated image: {out_path}")
```
Integration with Serial Communication
The vision system integrates seamlessly with the serial communication manager:
```python
# From serial_manager.py:174-202
def _handle_object_detection(self, data: dict):
    """Triggered when VEX brain detects an object position"""
    try:
        # 1. Capture image at current position
        img_path = self.camera.capture_image()
        if not img_path:
            log.error("Camera could not capture image")
            return

        # 2. Run YOLO detection
        image, yolo_result = self.object_detect_model.read_image_path(
            img_path,
            draw_results=True,
            save_drawn_img=True
        )
        if yolo_result is None:
            log.info("No detections")
            return

        # 3. Augment data with detection results
        data.update({
            'class': yolo_result['class'],
            'confidence': yolo_result['confidence'],
            'timestamp': time.time(),
            'image_path': img_path
        })

        # 4. Notify via callback
        if self.callbacks.get('scan_service'):
            self.callbacks['scan_service'](data)

    except Exception as e:
        log.error(f"Error in object detection: {str(e)}")
```
This creates a complete workflow:
VEX brain sends position data
Camera captures image
YOLO runs inference
Results sent to callback
Robot decides next action
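The callback wiring can be sketched without hardware. `SerialManagerStub` is a hypothetical stand-in for the real serial manager; only the `callbacks` dict and the `scan_service` key match the snippet above:

```python
class SerialManagerStub:
    """Minimal stand-in: holds callbacks and notifies on detection."""
    def __init__(self):
        self.callbacks = {}

    def notify(self, data: dict):
        # Mirrors step 4 of _handle_object_detection()
        if self.callbacks.get('scan_service'):
            self.callbacks['scan_service'](data)

received = []
manager = SerialManagerStub()
manager.callbacks['scan_service'] = received.append

manager.notify({'class': 'apple', 'confidence': 0.89})
print(received[0]['class'])  # apple
```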
Adjust Confidence Threshold

```python
# Higher threshold = fewer false positives, but might miss objects
processor = ImageProcessor(confidence_threshold=0.65)

# Lower threshold = more detections, but more false positives
processor = ImageProcessor(confidence_threshold=0.35)
```

Adjust Image Size
The input resolution is set by imgsz=640 in DetectionModel.inference(); lowering it (e.g. to 320) speeds up inference at the cost of accuracy on small objects.

Use FP16 Inference
Half precision is enabled with half=True in the same predict() call, reducing memory use and typically speeding up inference on supported hardware.
Troubleshooting
Error: Camera capture returns None
Solutions:
Check camera is connected: ls /dev/video*
Verify permissions: sudo chmod 666 /dev/video0
Try different camera index (0, 1, 2…)
Test with: v4l2-ctl --list-devices
Ensure no other process is using camera
Error: Cannot load YOLO model
Solutions:
Verify model path exists: models/yolo11s_ncnn_model
Check ultralytics package installed: pip install ultralytics
Ensure NCNN model files present (.param and .bin)
Try exporting model again if corrupted
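If the NCNN model is missing or corrupted, it can be regenerated from the PyTorch checkpoint with the ultralytics CLI. This assumes the yolo11s.pt checkpoint (downloaded automatically by ultralytics if absent):

```shell
# One-time export of the YOLOv11s checkpoint to NCNN format.
# Creates a yolo11s_ncnn_model/ directory containing the
# .param and .bin files the ModelLoader expects.
yolo export model=yolo11s.pt format=ncnn imgsz=640
```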
Issue: Detection always returns None
Solutions:
Lower confidence threshold: confidence_threshold=0.3
Check image quality and lighting
Verify object is in supported classes
Test with known good image
Print raw inference results for debugging
Issue: Detection takes too long
Solutions:
Enable half precision: half=True
Reduce image size: imgsz=320
Use smaller model variant
Ensure NCNN backend is being used
Check CPU/GPU utilization
Next Steps
Pick and Place Use detection results to pick and place objects
Serial Communication Integrate vision with robot control
Vision API Complete API documentation
Model Training Train custom detection models