The `mlx_vlm.video_generate` module extracts frames from a video file and passes them as a sequence of images to a compatible vision language model.
## Supported models
The following model families support video input:

- `Qwen2-VL`: Qwen2 Vision-Language series
- `Qwen2.5-VL`: Qwen2.5 Vision-Language series
- `Idefics3`: HuggingFace IDEFICS 3
- `LLaVA`: LLaVA and LLaVA-Next series
Additional models are added regularly. If you use a model that does not natively support video, `mlx_vlm.video_generate` issues a warning and continues; quality may be reduced.

## CLI usage
### Flag reference
| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | `mlx-community/Qwen2.5-VL-7B-Instruct-4bit` | Hugging Face repo ID or local model path |
| `--video` | string (one or more) | (required) | Path(s) to the video file(s) to process |
| `--prompt` | string | `"Describe this video."` | Text prompt sent alongside the video frames |
| `--max-tokens` | int | 100 | Maximum number of tokens to generate |
| `--temperature` | float | 0.7 | Sampling temperature |
| `--max-pixels` | int int | 224 224 | Maximum frame resolution as height width; frames are resized to fit |
| `--max-frames` | int | None | Hard cap on the number of frames extracted; `None` uses the FPS-derived count |
| `--fps` | float | 1.0 | Frames per second to sample from the video |
| `--verbose` | flag | True | Print generated tokens and timing statistics |
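Taken together, the flags compose into an invocation like the following. This is an illustrative example only: the input path `clip.mp4` is a placeholder, and the entry point is assumed to be runnable via `python -m`.

```shell
# Illustrative invocation; clip.mp4 is a placeholder path.
python -m mlx_vlm.video_generate \
  --model mlx-community/Qwen2.5-VL-7B-Instruct-4bit \
  --video clip.mp4 \
  --prompt "Summarize what happens in this video." \
  --max-tokens 200 \
  --fps 1.0 \
  --max-frames 32
```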
### How frame sampling works
`mlx_vlm.video_generate` reads the video with OpenCV, then selects frames at the rate specified by `--fps`. For a 10-second clip at `--fps 1.0`, the command extracts approximately 10 frames. Each frame is resized to fit within `--max-pixels` while preserving the aspect ratio.
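The resize-to-fit step can be sketched as plain arithmetic. This is a minimal illustration of "fit within a bound while preserving aspect ratio", not the library's actual implementation; the function name `fit_within` is hypothetical.

```python
def fit_within(width: int, height: int, max_w: int, max_h: int) -> tuple[int, int]:
    """Scale (width, height) down to fit inside (max_w, max_h),
    preserving aspect ratio. Frames already within bounds are untouched."""
    scale = min(max_w / width, max_h / height, 1.0)  # 1.0 guard: never upscale
    return round(width * scale), round(height * scale)
```

For example, a 1920x1080 frame with `--max-pixels 224 224` scales by 224/1920, giving roughly 224x126.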
The frame count is always rounded to an even number (`FRAME_FACTOR = 2`) and clamped between 4 and 768 frames. You can override the upper bound with `--max-frames`.