Overview

The vision system combines camera capture with YOLOv11-based object detection to identify objects in the robot’s workspace. This tutorial covers the complete pipeline from image capture to detection results.
The system uses the YOLOv11s (small) model with an NCNN backend for fast inference on embedded devices.

Architecture

The vision pipeline consists of three main components:
  1. CameraManager: Captures images from the camera
  2. ImageProcessor: Preprocesses and coordinates detection
  3. DetectionModel: Runs YOLOv11 inference

Camera Setup

Step 1: Initialize Camera Manager

from arm_system.perception.vision.camera.main import CameraManager

# Initialize camera with custom settings
camera = CameraManager(
    camera_index=0,      # /dev/video0
    width=1280,          # Capture width
    height=720           # Capture height
)
The camera manager wraps OpenCV’s VideoCapture, configured with the requested resolution.
Step 2: Capture an Image

# Capture image
img_path = camera.capture_image()

if img_path:
    print(f"Image saved to: {img_path}")
else:
    print("Failed to capture image")
What happens internally:
# From camera/main.py:11-26
def capture_image(self):
    # Flush camera buffer (grab 5 frames)
    for _ in range(5):
        self.cap.grab()
    
    # Capture frame
    ret, frame = self.cap.read()
    if not ret:
        return None
    
    # Generate filename with timestamp
    current_dir = os.path.dirname(os.path.abspath(__file__))
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"{current_dir}/objects_images/{timestamp}.png"
    
    # Save image
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    cv2.imwrite(filename, frame)
    return filename
The camera buffer is flushed with five grab() calls so that read() returns the most recent frame rather than a stale buffered one.

Object Detection

Step 1: Initialize Detection Model

from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize with confidence threshold
processor = ImageProcessor(confidence_threshold=0.45)
The processor loads the YOLOv11s NCNN model automatically:
# From detection/model_loader.py:7-14
class ModelLoader:
    def __init__(self):
        current_path = os.path.dirname(os.path.abspath(__file__))
        object_model_path = current_path + '/models/yolo11s_ncnn_model'
        self.model = YOLO(object_model_path, task='detect')
        
    def get_model(self):
        return self.model
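The loader resolves the model directory relative to its own source file. The same lookup can be sketched standalone (mirroring the code above, but using os.path.join instead of string concatenation; the resolve_model_path name is illustrative, not part of the library):

```python
import os

def resolve_model_path(module_file, model_name='yolo11s_ncnn_model'):
    """Build <dir-of-module>/models/<model_name>, as ModelLoader does."""
    current_path = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(current_path, 'models', model_name)

# Hypothetical install location, for illustration only
path = resolve_model_path('/opt/arm_system/detection/model_loader.py')
```

This is why the NCNN model directory must live next to the loader module: the path is computed at import time, not taken from configuration.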
Step 2: Run Inference on Image

# Process image from file path
processed_img, detection = processor.read_image_path(
    path=img_path,
    draw_results=True,      # Draw bounding boxes
    save_drawn_img=True     # Save annotated image
)

if detection:
    print(f"Class: {detection['class']}")
    print(f"Confidence: {detection['confidence']:.2f}")
    print(f"Bounding Box: {detection['box']}")
else:
    print("No objects detected")
Step 3: Process Detection Results

The detection result is a dictionary:
detection = {
    'class': 'apple',           # Object class name
    'confidence': 0.89,         # Detection confidence (0-1)
    'box': [x1, y1, x2, y2],   # Bounding box coordinates
    'class_id': 47              # COCO class ID
}
Supported object classes:
  • apple
  • orange
  • bottle
  • default (for any other detected object)
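Downstream steps such as pick-and-place usually need the object’s pixel center rather than the raw corners. A minimal helper for the dictionary layout above (the box_center function is illustrative, not part of the library):

```python
def box_center(detection):
    """Return the (x, y) pixel center of a detection's bounding box."""
    x1, y1, x2, y2 = detection['box']
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

detection = {
    'class': 'apple',
    'confidence': 0.89,
    'box': [100, 50, 300, 250],
    'class_id': 47,
}
cx, cy = box_center(detection)  # → (200.0, 150.0)
```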

Complete Pipeline Example

Capture and Detect

capture_and_detect.py
import cv2
from arm_system.perception.vision.camera.main import CameraManager
from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize components
camera = CameraManager(camera_index=0)
processor = ImageProcessor(confidence_threshold=0.45)

# Capture image
print("Capturing image...")
img_path = camera.capture_image()

if not img_path:
    print("Failed to capture image")
    exit(1)

print(f"Image saved: {img_path}")

# Run detection
print("Running inference...")
processed_img, detection = processor.read_image_path(
    path=img_path,
    draw_results=True,
    save_drawn_img=True
)

# Display results
if detection:
    print(f"\nDetection Results:")
    print(f"  Class: {detection['class']}")
    print(f"  Confidence: {detection['confidence']:.2%}")
    print(f"  Bounding Box: {detection['box']}")
    
    # Show image with detection
    cv2.imshow('Detection', processed_img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No objects detected above threshold")

Process Raw Image Array

process_image_array.py
import cv2
import numpy as np
from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize processor
processor = ImageProcessor(confidence_threshold=0.45)

# Load image as numpy array
image = cv2.imread('path/to/image.jpg')

# Run inference directly on array
processed_img, detection = processor.process_image(
    image=image,
    confidence_threshold=0.45
)

if detection:
    print(f"Detected: {detection['class']} ({detection['confidence']:.2%})")

Advanced Usage

Understanding the Inference Process

Here’s what happens inside process_image():
# From image_processing.py:24-72
def process_image(self, image: np.ndarray, confidence_threshold: float = 0.45):
    try:
        # 1. Run inference
        copy_image = image.copy()
        object_results, object_classes = self.detection.inference(copy_image)
        
        # 2. Initialize best detection tracker
        best_detection = {
            'class': '',
            'confidence': 0.0,
            'box': [],
            'class_id': -1
        }
        
        # 3. Process all detections
        for res in object_results:
            boxes = res.boxes
            
            if boxes.shape[0] == 0:
                continue
            
            # Extract detection data
            confidence = boxes.conf.cpu().numpy()[0]
            class_id = int(boxes.cls[0])
            box_data = boxes.xyxy.cpu().numpy()[0]
            
            # Filter by confidence
            if confidence < confidence_threshold:
                continue
            
            # Map to supported classes
            detected_class = object_classes[class_id]
            clss_object = 'default'
            
            if detected_class in ['apple', 'orange', 'bottle']:
                clss_object = detected_class
            
            # Keep highest confidence detection
            if confidence > best_detection['confidence']:
                best_detection.update({
                    'class': str(clss_object),
                    'confidence': float(confidence),
                    'box': box_data,
                    'class_id': class_id
                })
        
        # 4. Return best detection
        if best_detection['confidence'] >= confidence_threshold:
            return image, best_detection
        else:
            return image, None
            
    except Exception as e:
        log.info(f'error in image processing: {e}')
        return image, None
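Stripped of the tensor handling, steps 2–4 reduce to a filter-then-max over candidate detections. A standalone sketch of the same selection rule (the pick_best helper is hypothetical, not part of the library):

```python
SUPPORTED_CLASSES = {'apple', 'orange', 'bottle'}

def pick_best(candidates, threshold=0.45):
    """Map unsupported classes to 'default', drop low-confidence hits,
    and keep the single highest-confidence detection (or None)."""
    best = None
    for det in candidates:
        if det['confidence'] < threshold:
            continue  # below threshold: discard
        cls = det['class'] if det['class'] in SUPPORTED_CLASSES else 'default'
        if best is None or det['confidence'] > best['confidence']:
            best = {**det, 'class': cls}
    return best

candidates = [
    {'class': 'cat',   'confidence': 0.60, 'box': [0, 0, 5, 5],     'class_id': 15},
    {'class': 'apple', 'confidence': 0.90, 'box': [10, 10, 40, 40], 'class_id': 47},
]
best = pick_best(candidates)  # apple wins; the cat would be mapped to 'default'
```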

Custom Detection Model

The detection model can be customized. Note that inference() passes conf=0.55 to predict(), so the model discards anything below 0.55 before the processor applies its own threshold; a confidence_threshold below 0.55 therefore has no additional effect.
# From detection/main.py:15-21
class DetectionModel(DetectionModelInterface):
    def __init__(self):
        self.object_model = ModelLoader().get_model()

    def inference(self, image: np.ndarray) -> tuple[list[Results], Dict[int, str]]:
        results = self.object_model.predict(
            image,
            conf=0.55,        # Base confidence threshold
            verbose=False,
            imgsz=640,        # Input image size
            stream=True,      # Stream results
            task='detect',
            half=True         # Use FP16 inference
        )
        return results, self.object_model.names

Drawing Detections

The system automatically draws bounding boxes:
# From image_processing.py:74-94
def _draw_detection(self, image: np.ndarray, detection: dict):
    """Draw bounding box and label on image"""
    box = detection['box']
    class_name = detection['class']
    confidence = detection['confidence']

    x1, y1, x2, y2 = map(int, box)
    color = (0, 255, 0)  # Green in BGR
    
    # Draw rectangle
    cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)

    # Draw label
    label = f"{class_name} {confidence:.2f}"
    cv2.putText(
        image, label, (x1, y1 - 10),
        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2
    )

def _save_drawn_image(self, image: np.ndarray, original_path: str):
    """Save annotated image alongside the original"""
    # Derive the output name from the original extension; note that
    # capture_image() saves .png, so a hard-coded '.jpg' replace would
    # silently overwrite the original image.
    root, ext = os.path.splitext(original_path)
    out_path = f"{root}_detected{ext}"
    cv2.imwrite(out_path, image)
    log.info(f"Saved annotated image: {out_path}")

Integration with Serial Communication

The vision system integrates with the serial communication manager:
# From serial_manager.py:174-202
def _handle_object_detection(self, data: dict):
    """Triggered when VEX brain detects an object position"""
    try:
        # 1. Capture image at current position
        img_path = self.camera.capture_image()
        if not img_path:
            log.error("Camera could not capture image")
            return
        
        # 2. Run YOLO detection
        image, yolo_result = self.object_detect_model.read_image_path(
            img_path,
            draw_results=True,
            save_drawn_img=True
        )
        
        if yolo_result is None:
            log.info("No detections")
            return
        
        # 3. Augment data with detection results
        data.update({
            'class': yolo_result['class'],
            'confidence': yolo_result['confidence'],
            'timestamp': time.time(),
            'image_path': img_path
        })
        
        # 4. Notify via callback
        if self.callbacks.get('scan_service'):
            self.callbacks['scan_service'](data)
            
    except Exception as e:
        log.error(f"Error in object detection: {str(e)}")
This creates a complete workflow:
  1. VEX brain sends position data
  2. Camera captures image
  3. YOLO runs inference
  4. Results sent to callback
  5. Robot decides next action
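Step 3 of that workflow (augmenting the position payload with detection results) is plain dictionary merging. A side-effect-free sketch (the augment_payload name is hypothetical; the actual handler mutates data in place):

```python
import time

def augment_payload(data, yolo_result, img_path):
    """Merge detection results into the position payload from the VEX brain."""
    merged = dict(data)  # copy so the caller's dict is not mutated
    merged.update({
        'class': yolo_result['class'],
        'confidence': yolo_result['confidence'],
        'timestamp': time.time(),
        'image_path': img_path,
    })
    return merged

payload = augment_payload(
    {'x': 120, 'y': 80},                       # position data from the brain
    {'class': 'apple', 'confidence': 0.89},    # YOLO result
    '/tmp/20250101-120000.png',                # hypothetical capture path
)
```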

Performance Optimization

Tuning the confidence threshold trades precision against recall:

# Higher threshold = fewer false positives, but may miss objects
processor = ImageProcessor(confidence_threshold=0.65)

# Lower threshold = more detections, but more false positives
processor = ImageProcessor(confidence_threshold=0.35)
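The trade-off is easy to see on a set of mock confidence scores (illustrative numbers, not real model output):

```python
def surviving(scores, threshold):
    """Count detections that pass a given confidence threshold."""
    return sum(1 for s in scores if s >= threshold)

scores = [0.91, 0.62, 0.48, 0.37]  # hypothetical per-detection confidences
print(surviving(scores, 0.35))  # → 4: permissive, more false positives likely
print(surviving(scores, 0.65))  # → 1: strict, may miss real objects
```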

Troubleshooting

Error: Camera capture returns None
Solutions:
  • Check camera is connected: ls /dev/video*
  • Verify permissions: sudo chmod 666 /dev/video0
  • Try different camera index (0, 1, 2…)
  • Test with: v4l2-ctl --list-devices
  • Ensure no other process is using camera
Error: Cannot load YOLO model
Solutions:
  • Verify model path exists: models/yolo11s_ncnn_model
  • Check ultralytics package installed: pip install ultralytics
  • Ensure NCNN model files present (.param and .bin)
  • Try exporting model again if corrupted
Issue: Detection always returns None
Solutions:
  • Lower confidence threshold: confidence_threshold=0.3
  • Check image quality and lighting
  • Verify object is in supported classes
  • Test with known good image
  • Print raw inference results for debugging
Issue: Detection takes too long
Solutions:
  • Enable half precision: half=True
  • Reduce image size: imgsz=320
  • Use smaller model variant
  • Ensure NCNN backend is being used
  • Check CPU/GPU utilization

Next Steps

Pick and Place

Use detection results to pick and place objects

Serial Communication

Integrate vision with robot control

Vision API

Complete API documentation

Model Training

Train custom detection models
