Overview

The vision system combines camera capture with YOLOv11-based object detection to identify objects in the robot’s workspace. This tutorial covers the complete pipeline from image capture to detection results.
The system uses the YOLOv11s (small) model with an NCNN backend for fast inference on embedded devices.

Architecture

The vision pipeline consists of three main components:
  1. CameraManager: Captures images from the camera
  2. ImageProcessor: Preprocesses and coordinates detection
  3. DetectionModel: Runs YOLOv11 inference

Camera Setup

Step 1: Initialize Camera Manager

from arm_system.perception.vision.camera.main import CameraManager

# Initialize camera with custom settings
camera = CameraManager(
    camera_index=0,      # /dev/video0
    width=1280,          # Capture width
    height=720           # Capture height
)
The camera manager wraps OpenCV’s VideoCapture, configured with the requested resolution.
Step 2: Capture an Image

# Capture image
img_path = camera.capture_image()

if img_path:
    print(f"Image saved to: {img_path}")
else:
    print("Failed to capture image")
What happens internally:
# From camera/main.py:11-26
def capture_image(self):
    # Flush camera buffer (grab 5 frames)
    for _ in range(5):
        self.cap.grab()
    
    # Capture frame
    ret, frame = self.cap.read()
    if not ret:
        return None
    
    # Generate filename with timestamp
    current_dir = os.path.dirname(os.path.abspath(__file__))
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"{current_dir}/objects_images/{timestamp}.png"
    
    # Save image
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    cv2.imwrite(filename, frame)
    return filename
The camera buffer is flushed with five grab() calls so that read() returns the most recent frame rather than a stale buffered one.

Object Detection

Step 1: Initialize Detection Model

from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize with confidence threshold
processor = ImageProcessor(confidence_threshold=0.45)
The processor loads the YOLOv11s NCNN model automatically:
# From detection/model_loader.py:7-14
class ModelLoader:
    def __init__(self):
        current_path = os.path.dirname(os.path.abspath(__file__))
        object_model_path = current_path + '/models/yolo11s_ncnn_model'
        self.model = YOLO(object_model_path, task='detect')
        
    def get_model(self):
        return self.model
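The loader resolves the model directory relative to its own source file. The same lookup can be sketched standalone (mirroring the code above, but using os.path.join instead of string concatenation; the resolve_model_path name is illustrative, not part of the library):

```python
import os

def resolve_model_path(module_file, model_name='yolo11s_ncnn_model'):
    """Build <dir-of-module>/models/<model_name>, as ModelLoader does."""
    current_path = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(current_path, 'models', model_name)

# Hypothetical install location, for illustration only
path = resolve_model_path('/opt/arm_system/detection/model_loader.py')
```

This is why the NCNN model directory must live next to the loader module: the path is computed at import time, not taken from configuration.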
Step 2: Run Inference on Image

# Process image from file path
processed_img, detection = processor.read_image_path(
    path=img_path,
    draw_results=True,      # Draw bounding boxes
    save_drawn_img=True     # Save annotated image
)

if detection:
    print(f"Class: {detection['class']}")
    print(f"Confidence: {detection['confidence']:.2f}")
    print(f"Bounding Box: {detection['box']}")
else:
    print("No objects detected")
Step 3: Process Detection Results

The detection result is a dictionary:
detection = {
    'class': 'apple',           # Object class name
    'confidence': 0.89,         # Detection confidence (0-1)
    'box': [x1, y1, x2, y2],   # Bounding box coordinates
    'class_id': 47              # COCO class ID
}
Supported object classes:
  • apple
  • orange
  • bottle
  • default (for any other detected object)
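Downstream steps such as pick-and-place usually need the object’s pixel center rather than the raw corners. A minimal helper for the dictionary layout above (the box_center function is illustrative, not part of the library):

```python
def box_center(detection):
    """Return the (x, y) pixel center of a detection's bounding box."""
    x1, y1, x2, y2 = detection['box']
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

detection = {
    'class': 'apple',
    'confidence': 0.89,
    'box': [100, 50, 300, 250],
    'class_id': 47,
}
cx, cy = box_center(detection)  # → (200.0, 150.0)
```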

Complete Pipeline Example

Capture and Detect

capture_and_detect.py
import cv2
from arm_system.perception.vision.camera.main import CameraManager
from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize components
camera = CameraManager(camera_index=0)
processor = ImageProcessor(confidence_threshold=0.45)

# Capture image
print("Capturing image...")
img_path = camera.capture_image()

if not img_path:
    print("Failed to capture image")
    exit(1)

print(f"Image saved: {img_path}")

# Run detection
print("Running inference...")
processed_img, detection = processor.read_image_path(
    path=img_path,
    draw_results=True,
    save_drawn_img=True
)

# Display results
if detection:
    print(f"\nDetection Results:")
    print(f"  Class: {detection['class']}")
    print(f"  Confidence: {detection['confidence']:.2%}")
    print(f"  Bounding Box: {detection['box']}")
    
    # Show image with detection
    cv2.imshow('Detection', processed_img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No objects detected above threshold")

Process Raw Image Array

process_image_array.py
import cv2
import numpy as np
from arm_system.perception.vision.image_processing import ImageProcessor

# Initialize processor
processor = ImageProcessor(confidence_threshold=0.45)

# Load image as numpy array
image = cv2.imread('path/to/image.jpg')

# Run inference directly on array
processed_img, detection = processor.process_image(
    image=image,
    confidence_threshold=0.45
)

if detection:
    print(f"Detected: {detection['class']} ({detection['confidence']:.2%})")

Advanced Usage

Understanding the Inference Process

Here’s what happens inside process_image():
# From image_processing.py:24-72
def process_image(self, image: np.ndarray, confidence_threshold: float = 0.45):
    try:
        # 1. Run inference
        copy_image = image.copy()
        object_results, object_classes = self.detection.inference(copy_image)
        
        # 2. Initialize best detection tracker
        best_detection = {
            'class': '',
            'confidence': 0.0,
            'box': [],
            'class_id': -1
        }
        
        # 3. Process all detections
        for res in object_results:
            boxes = res.boxes
            
            if boxes.shape[0] == 0:
                continue
            
            # Extract detection data
            confidence = boxes.conf.cpu().numpy()[0]
            class_id = int(boxes.cls[0])
            box_data = boxes.xyxy.cpu().numpy()[0]
            
            # Filter by confidence
            if confidence < confidence_threshold:
                continue
            
            # Map to supported classes
            detected_class = object_classes[class_id]
            clss_object = 'default'
            
            if detected_class in ['apple', 'orange', 'bottle']:
                clss_object = detected_class
            
            # Keep highest confidence detection
            if confidence > best_detection['confidence']:
                best_detection.update({
                    'class': str(clss_object),
                    'confidence': float(confidence),
                    'box': box_data,
                    'class_id': class_id
                })
        
        # 4. Return best detection
        if best_detection['confidence'] >= confidence_threshold:
            return image, best_detection
        else:
            return image, None
            
    except Exception as e:
        log.info(f'error in image processing: {e}')
        return image, None
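Stripped of the tensor handling, steps 2–4 reduce to a filter-then-max over candidate detections. A standalone sketch of the same selection rule (the pick_best helper is hypothetical, not part of the library):

```python
SUPPORTED_CLASSES = {'apple', 'orange', 'bottle'}

def pick_best(candidates, threshold=0.45):
    """Map unsupported classes to 'default', drop low-confidence hits,
    and keep the single highest-confidence detection (or None)."""
    best = None
    for det in candidates:
        if det['confidence'] < threshold:
            continue  # below threshold: discard
        cls = det['class'] if det['class'] in SUPPORTED_CLASSES else 'default'
        if best is None or det['confidence'] > best['confidence']:
            best = {**det, 'class': cls}
    return best

candidates = [
    {'class': 'cat',   'confidence': 0.60, 'box': [0, 0, 5, 5],     'class_id': 15},
    {'class': 'apple', 'confidence': 0.90, 'box': [10, 10, 40, 40], 'class_id': 47},
]
best = pick_best(candidates)  # apple wins; the cat would be mapped to 'default'
```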

Custom Detection Model

The detection model can be customized. Note that inference() passes conf=0.55 to predict(), so the model discards anything below 0.55 before the processor applies its own threshold; a confidence_threshold below 0.55 therefore has no additional effect.
# From detection/main.py:15-21
class DetectionModel(DetectionModelInterface):
    def __init__(self):
        self.object_model = ModelLoader().get_model()

    def inference(self, image: np.ndarray) -> tuple[list[Results], Dict[int, str]]:
        results = self.object_model.predict(
            image,
            conf=0.55,        # Base confidence threshold
            verbose=False,
            imgsz=640,        # Input image size
            stream=True,      # Stream results
            task='detect',
            half=True         # Use FP16 inference
        )
        return results, self.object_model.names

Drawing Detections

The system automatically draws bounding boxes:
# From image_processing.py:74-94
def _draw_detection(self, image: np.ndarray, detection: dict):
    """Draw bounding box and label on image"""
    box = detection['box']
    class_name = detection['class']
    confidence = detection['confidence']

    x1, y1, x2, y2 = map(int, box)
    color = (0, 255, 0)  # Green in BGR
    
    # Draw rectangle
    cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)

    # Draw label
    label = f"{class_name} {confidence:.2f}"
    cv2.putText(
        image, label, (x1, y1 - 10),
        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2
    )

def _save_drawn_image(self, image: np.ndarray, original_path: str):
    """Save annotated image alongside the original"""
    # Derive the output name from the original extension; note that
    # capture_image() saves .png, so a hard-coded '.jpg' replace would
    # silently overwrite the original image.
    root, ext = os.path.splitext(original_path)
    out_path = f"{root}_detected{ext}"
    cv2.imwrite(out_path, image)
    log.info(f"Saved annotated image: {out_path}")

Integration with Serial Communication

The vision system integrates with the serial communication manager:
# From serial_manager.py:174-202
def _handle_object_detection(self, data: dict):
    """Triggered when VEX brain detects an object position"""
    try:
        # 1. Capture image at current position
        img_path = self.camera.capture_image()
        if not img_path:
            log.error("Camera could not capture image")
            return
        
        # 2. Run YOLO detection
        image, yolo_result = self.object_detect_model.read_image_path(
            img_path,
            draw_results=True,
            save_drawn_img=True
        )
        
        if yolo_result is None:
            log.info("No detections")
            return
        
        # 3. Augment data with detection results
        data.update({
            'class': yolo_result['class'],
            'confidence': yolo_result['confidence'],
            'timestamp': time.time(),
            'image_path': img_path
        })
        
        # 4. Notify via callback
        if self.callbacks.get('scan_service'):
            self.callbacks['scan_service'](data)
            
    except Exception as e:
        log.error(f"Error in object detection: {str(e)}")
This creates a complete workflow:
  1. VEX brain sends position data
  2. Camera captures image
  3. YOLO runs inference
  4. Results sent to callback
  5. Robot decides next action
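Step 3 of that workflow (augmenting the position payload with detection results) is plain dictionary merging. A side-effect-free sketch (the augment_payload name is hypothetical; the actual handler mutates data in place):

```python
import time

def augment_payload(data, yolo_result, img_path):
    """Merge detection results into the position payload from the VEX brain."""
    merged = dict(data)  # copy so the caller's dict is not mutated
    merged.update({
        'class': yolo_result['class'],
        'confidence': yolo_result['confidence'],
        'timestamp': time.time(),
        'image_path': img_path,
    })
    return merged

payload = augment_payload(
    {'x': 120, 'y': 80},                       # position data from the brain
    {'class': 'apple', 'confidence': 0.89},    # YOLO result
    '/tmp/20250101-120000.png',                # hypothetical capture path
)
```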

Performance Optimization

Tuning the confidence threshold trades precision against recall:

# Higher threshold = fewer false positives, but may miss objects
processor = ImageProcessor(confidence_threshold=0.65)

# Lower threshold = more detections, but more false positives
processor = ImageProcessor(confidence_threshold=0.35)
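The trade-off is easy to see on a set of mock confidence scores (illustrative numbers, not real model output):

```python
def surviving(scores, threshold):
    """Count detections that pass a given confidence threshold."""
    return sum(1 for s in scores if s >= threshold)

scores = [0.91, 0.62, 0.48, 0.37]  # hypothetical per-detection confidences
print(surviving(scores, 0.35))  # → 4: permissive, more false positives likely
print(surviving(scores, 0.65))  # → 1: strict, may miss real objects
```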

Troubleshooting

Error: Camera capture returns None
Solutions:
  • Check camera is connected: ls /dev/video*
  • Verify permissions: sudo chmod 666 /dev/video0
  • Try different camera index (0, 1, 2…)
  • Test with: v4l2-ctl --list-devices
  • Ensure no other process is using camera
Error: Cannot load YOLO model
Solutions:
  • Verify model path exists: models/yolo11s_ncnn_model
  • Check ultralytics package installed: pip install ultralytics
  • Ensure NCNN model files present (.param and .bin)
  • Try exporting model again if corrupted
Issue: Detection always returns None
Solutions:
  • Lower confidence threshold: confidence_threshold=0.3
  • Check image quality and lighting
  • Verify object is in supported classes
  • Test with known good image
  • Print raw inference results for debugging
Issue: Detection takes too long
Solutions:
  • Enable half precision: half=True
  • Reduce image size: imgsz=320
  • Use smaller model variant
  • Ensure NCNN backend is being used
  • Check CPU/GPU utilization

Next Steps

Pick and Place

Use detection results to pick and place objects

Serial Communication

Integrate vision with robot control

Vision API

Complete API documentation

Model Training

Train custom detection models
