MLX-VLM includes a dedicated video inference command, mlx_vlm.video_generate, that extracts frames from a video file and passes them as a sequence of images to a compatible vision-language model.
Video understanding is a beta feature. Behaviour may differ across models and video formats.

Supported models

The following model families support video input:

- Qwen2-VL: Qwen2 Vision-Language series
- Qwen2.5-VL: Qwen2.5 Vision-Language series
- Idefics3: Hugging Face IDEFICS 3
- LLaVA: LLaVA and LLaVA-Next series
Additional models are added regularly. If you use a model that does not natively support video, mlx_vlm.video_generate issues a warning and continues, but output quality may be reduced.

CLI usage

mlx_vlm.video_generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --video path/to/video.mp4 \
  --prompt "Describe this video." \
  --max-tokens 100 \
  --max-pixels 224 224 \
  --fps 1.0

Flag reference

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--model` | string | `mlx-community/Qwen2.5-VL-7B-Instruct-4bit` | Hugging Face repo ID or local model path |
| `--video` | string (one or more) | (required) | Path(s) to the video file(s) to process |
| `--prompt` | string | `"Describe this video."` | Text prompt sent alongside the video frames |
| `--max-tokens` | int | `100` | Maximum number of tokens to generate |
| `--temperature` | float | `0.7` | Sampling temperature |
| `--max-pixels` | int int | `224 224` | Maximum frame resolution as height width; frames are resized to fit |
| `--max-frames` | int | `None` | Hard cap on the number of frames extracted; `None` uses the FPS-derived count |
| `--fps` | float | `1.0` | Frames per second to sample from the video |
| `--verbose` | flag | `True` | Print generated tokens and timing statistics |
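The `--max-pixels` resize behaviour can be sketched with a small helper. This is a simplified illustration of aspect-ratio-preserving downscaling, not MLX-VLM's actual code; the function name `fit_within` is hypothetical:

```python
def fit_within(width: int, height: int, max_w: int, max_h: int) -> tuple[int, int]:
    """Scale (width, height) down to fit inside (max_w, max_h),
    preserving the aspect ratio. Frames already small enough pass through."""
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)

# A 1920x1080 frame constrained to 224x224 keeps its 16:9 shape:
print(fit_within(1920, 1080, 224, 224))  # (224, 126)
```

Because the limit is an upper bound rather than a fixed size, portrait and landscape clips keep their proportions instead of being squashed to a square.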

How frame sampling works

mlx_vlm.video_generate reads the video with OpenCV, then selects frames at the rate specified by --fps. For a 10-second clip at --fps 1.0, the command extracts approximately 10 frames. Each frame is resized to fit within --max-pixels while preserving the aspect ratio. The frame count is always rounded to an even number (FRAME_FACTOR = 2) and clamped between 4 and 768 frames. You can override the upper bound with --max-frames.
Start with --fps 1.0 and --max-pixels 224 224 for fast prototyping. Increase --fps or --max-pixels only when the task requires more temporal or spatial detail, as higher values significantly increase memory and processing time.

Example: video captioning

mlx_vlm.video_generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --video path/to/video.mp4 \
  --max-tokens 100 \
  --prompt "Describe this video" \
  --max-pixels 224 224 \
  --fps 1.0

Example: dense description at higher FPS

mlx_vlm.video_generate \
  --model mlx-community/Qwen2.5-VL-7B-Instruct-4bit \
  --video path/to/video.mp4 \
  --prompt "Provide a step-by-step description of what happens in this video." \
  --max-tokens 500 \
  --max-pixels 336 336 \
  --fps 2.0 \
  --temperature 0.3
