MLX-VLM includes a dedicated video inference command, mlx_vlm.video_generate, that extracts frames from a video file and passes them as a sequence of images to a compatible vision-language model.
Video understanding is a beta feature. Behaviour may differ across models and video formats.

Supported models

The following model families support video input:

- Qwen2-VL: Qwen2 Vision-Language series
- Qwen2.5-VL: Qwen2.5 Vision-Language series
- Idefics3: Hugging Face IDEFICS 3
- LLaVA: LLaVA and LLaVA-Next series
Additional models are added regularly. If you use a model that does not natively support video, mlx_vlm.video_generate issues a warning and continues, but output quality may be reduced.

CLI usage

mlx_vlm.video_generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --video path/to/video.mp4 \
  --prompt "Describe this video." \
  --max-tokens 100 \
  --max-pixels 224 224 \
  --fps 1.0

Flag reference

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--model` | string | `mlx-community/Qwen2.5-VL-7B-Instruct-4bit` | Hugging Face repo ID or local model path |
| `--video` | string (one or more) | (required) | Path(s) to the video file(s) to process |
| `--prompt` | string | `"Describe this video."` | Text prompt sent alongside the video frames |
| `--max-tokens` | int | `100` | Maximum number of tokens to generate |
| `--temperature` | float | `0.7` | Sampling temperature |
| `--max-pixels` | int int | `224 224` | Maximum frame resolution as height width; frames are resized to fit |
| `--max-frames` | int | `None` | Hard cap on the number of frames extracted; `None` uses the FPS-derived count |
| `--fps` | float | `1.0` | Frames per second to sample from the video |
| `--verbose` | flag | `True` | Print generated tokens and timing statistics |
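The `--max-pixels` resize behaviour can be sketched with a small helper. This is a simplified illustration of aspect-ratio-preserving downscaling, not MLX-VLM's actual code; the function name `fit_within` is hypothetical:

```python
def fit_within(width: int, height: int, max_w: int, max_h: int) -> tuple[int, int]:
    """Scale (width, height) down to fit inside (max_w, max_h),
    preserving the aspect ratio. Frames already small enough pass through."""
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)

# A 1920x1080 frame constrained to 224x224 keeps its 16:9 shape:
print(fit_within(1920, 1080, 224, 224))  # (224, 126)
```

Because the limit is an upper bound rather than a fixed size, portrait and landscape clips keep their proportions instead of being squashed to a square.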

How frame sampling works

mlx_vlm.video_generate reads the video with OpenCV, then selects frames at the rate specified by --fps. For a 10-second clip at --fps 1.0, the command extracts approximately 10 frames. Each frame is resized to fit within --max-pixels while preserving the aspect ratio. The frame count is always rounded to an even number (FRAME_FACTOR = 2) and clamped between 4 and 768 frames. You can override the upper bound with --max-frames.
Start with --fps 1.0 and --max-pixels 224 224 for fast prototyping. Increase --fps or --max-pixels only when the task requires more temporal or spatial detail, as higher values significantly increase memory and processing time.

Example: video captioning

mlx_vlm.video_generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --video path/to/video.mp4 \
  --max-tokens 100 \
  --prompt "Describe this video" \
  --max-pixels 224 224 \
  --fps 1.0

Example: dense description at higher FPS

mlx_vlm.video_generate \
  --model mlx-community/Qwen2.5-VL-7B-Instruct-4bit \
  --video path/to/video.mp4 \
  --prompt "Provide a step-by-step description of what happens in this video." \
  --max-tokens 500 \
  --max-pixels 336 336 \
  --fps 2.0 \
  --temperature 0.3
