Overview
The PPTX backend (MsPowerpointDocumentBackend) parses Microsoft PowerPoint presentations (.pptx files) and converts them directly to DoclingDocument format. Each slide becomes a page with extracted content including text, tables, and images.
Features
- Slide-by-page conversion - Each slide becomes a document page
- Text extraction - Titles, subtitles, body text, and notes
- List detection - Bullet points and numbered lists with hierarchy
- Table extraction - Tables with cell spans and structure
- Image extraction - Embedded pictures and shapes
- Notes preservation - Speaker notes as furniture content
- Grouped shapes - Handles grouped shape content
- Placeholder detection - Identifies title, subtitle, and body placeholders
Usage
Basic Conversion
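A minimal sketch of a basic conversion, assuming the `docling` package is installed. The helper name `pptx_to_markdown` and the path `presentation.pptx` are placeholders chosen for illustration.

```python
def pptx_to_markdown(source: str) -> str:
    """Convert one .pptx file and return its content as Markdown."""
    # Imported lazily so this module loads even where docling is absent.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # input format is detected from the file
    return result.document.export_to_markdown()


# Example call (placeholder path):
# print(pptx_to_markdown("presentation.pptx"))
```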
With Format Options
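A sketch of restricting the converter to PowerPoint input and passing an explicit format option, assuming `docling` is installed. `build_pptx_converter` is an illustrative helper name, and the option is left at its defaults here.

```python
def build_pptx_converter():
    """Return a DocumentConverter that only accepts .pptx input."""
    # Imported lazily so this module loads even where docling is absent.
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PowerpointFormatOption

    return DocumentConverter(
        allowed_formats=[InputFormat.PPTX],
        format_options={InputFormat.PPTX: PowerpointFormatOption()},
    )


# Example call (placeholder path):
# doc = build_pptx_converter().convert("presentation.pptx").document
```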
Slide Structure
Each slide is organized as a chapter group.
Supported Elements
Text Content
Titles and Subtitles
Slide titles and subtitles are automatically detected based on placeholder types.
Body Text
Regular text content from slide bodies.
Speaker Notes
Speaker notes are extracted as furniture-layer content.
Lists
PowerPoint lists are detected and preserved:
- Bullet lists - Unordered list items with bullet markers
- Numbered lists - Ordered lists with automatic numbering
- Multi-level lists - Nested list hierarchies based on indentation
List Detection Algorithm
The backend uses PowerPoint’s paragraph properties to determine list items:
- Checks direct paragraph properties (<a:pPr>)
- Falls back to shape-level list styles (<a:lstStyle>)
- Checks layout placeholder styles
- Uses slide master text styles
Bullet elements:
- <a:buChar> - Character bullets (•, ○, ■, etc.)
- <a:buAutoNum> - Automatic numbering
- <a:buBlip> - Picture bullets
- <a:buNone> - Explicitly no bullet
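The inheritance walk described above can be sketched with the standard library alone. This is an illustration, not the backend's actual code: the helper names (`bullet_marker`, `resolve_bullet`, `is_list_item`) and the sample XML fragments are assumptions, and in a real package the layout and master levels live in separate parts.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
MARKER_TAGS = ("buNone", "buChar", "buAutoNum", "buBlip")


def bullet_marker(props: ET.Element) -> Optional[str]:
    """Return the local name of the bullet element under props, if any."""
    for tag in MARKER_TAGS:
        if props.find(f"{{{A_NS}}}{tag}") is not None:
            return tag
    return None


def resolve_bullet(levels: Iterable[Optional[ET.Element]]) -> Optional[str]:
    """Walk property elements from most to least specific.

    The first level that specifies a bullet element wins, so an explicit
    <a:buNone> on the paragraph overrides a bullet inherited from the master.
    """
    for props in levels:
        if props is None:
            continue
        marker = bullet_marker(props)
        if marker is not None:
            return marker
    return None


def is_list_item(levels: Iterable[Optional[ET.Element]]) -> bool:
    return resolve_bullet(levels) in ("buChar", "buAutoNum", "buBlip")


def parse(xml: str) -> ET.Element:
    return ET.fromstring(xml)


# Hypothetical fragments: paragraph, shape list style, layout, master.
paragraph = parse(f'<a:pPr xmlns:a="{A_NS}" lvl="1"/>')
shape_style = None  # the shape defines no <a:lstStyle> entry for this level
layout = parse(f'<a:lvl2pPr xmlns:a="{A_NS}"><a:buChar char="&#8226;"/></a:lvl2pPr>')
master = parse(f'<a:lvl2pPr xmlns:a="{A_NS}"><a:buNone/></a:lvl2pPr>')

chain = [paragraph, shape_style, layout, master]
print(resolve_bullet(chain))  # buChar
```

Stopping at the first level that mentions a bullet is what lets `<a:buNone>` act as an explicit override rather than as a missing value.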
Tables
Complete table extraction with structure:
- Cell content and formatting
- Merged cells (rowSpan, gridSpan)
- Header row/column detection
- Empty cell handling
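To make the span handling concrete, here is a standard-library sketch of reading `gridSpan` and `rowSpan` from DrawingML table XML. The sample table and the helper name `table_cells` are illustrative assumptions, not the backend's implementation.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}

# Hypothetical 2x2 table whose first row is one cell spanning two columns.
TABLE_XML = f"""
<a:tbl xmlns:a="{A_NS}">
  <a:tr>
    <a:tc gridSpan="2"><a:txBody><a:p><a:r><a:t>Header</a:t></a:r></a:p></a:txBody></a:tc>
    <a:tc hMerge="1"><a:txBody><a:p/></a:txBody></a:tc>
  </a:tr>
  <a:tr>
    <a:tc><a:txBody><a:p><a:r><a:t>A</a:t></a:r></a:p></a:txBody></a:tc>
    <a:tc><a:txBody><a:p/></a:txBody></a:tc>
  </a:tr>
</a:tbl>
""".strip()


def table_cells(tbl):
    """Yield (row, col, row_span, col_span, text) for each anchor cell."""
    for r, tr in enumerate(tbl.findall("a:tr", NS)):
        for c, tc in enumerate(tr.findall("a:tc", NS)):
            # Cells covered by a merge remain in the XML but are flagged.
            if tc.get("hMerge") in ("1", "true") or tc.get("vMerge") in ("1", "true"):
                continue
            text = "".join(t.text or "" for t in tc.iter(f"{{{A_NS}}}t"))
            row_span = int(tc.get("rowSpan", "1"))
            col_span = int(tc.get("gridSpan", "1"))
            yield r, c, row_span, col_span, text


cells = list(table_cells(ET.fromstring(TABLE_XML)))
for cell in cells:
    print(cell)
```

The empty cell in the second row is still emitted, with an empty string as its text, so the grid stays rectangular.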
Images
Extracts embedded pictures:
- Inline pictures
- Picture shapes
- Image formats: JPEG, PNG, BMP, etc.
- DPI information preserved
Slide Layout
Slide dimensions and layout information are preserved.
Provenance Information
All extracted items include provenance with their position on the slide.
Grouped Shapes
Handles PowerPoint shape groups:
- Shape groups
- Nested groups
- Individual shapes within groups
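Nested groups are naturally handled by recursion. The following standard-library sketch walks a hypothetical `<p:spTree>` fragment and yields each leaf shape with its nesting depth; `iter_leaf_shapes` and the shape names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
NS = {"p": P_NS}
GROUP = f"{{{P_NS}}}grpSp"
LEAVES = {f"{{{P_NS}}}sp", f"{{{P_NS}}}pic", f"{{{P_NS}}}graphicFrame"}

# Hypothetical shape tree: one top-level shape and a group containing a group.
TREE_XML = f"""
<p:spTree xmlns:p="{P_NS}">
  <p:sp><p:nvSpPr><p:cNvPr id="2" name="Title"/></p:nvSpPr></p:sp>
  <p:grpSp>
    <p:nvGrpSpPr><p:cNvPr id="3" name="Outer"/></p:nvGrpSpPr>
    <p:sp><p:nvSpPr><p:cNvPr id="4" name="Box"/></p:nvSpPr></p:sp>
    <p:grpSp>
      <p:nvGrpSpPr><p:cNvPr id="5" name="Inner"/></p:nvGrpSpPr>
      <p:pic><p:nvPicPr><p:cNvPr id="6" name="Logo"/></p:nvPicPr></p:pic>
    </p:grpSp>
  </p:grpSp>
</p:spTree>
""".strip()


def iter_leaf_shapes(container, depth=0):
    """Yield (name, depth) for every non-group shape, descending into groups."""
    for child in container:
        if child.tag == GROUP:
            yield from iter_leaf_shapes(child, depth + 1)
        elif child.tag in LEAVES:
            props = child.find(".//p:cNvPr", NS)
            yield (props.get("name") if props is not None else None), depth


shapes = list(iter_leaf_shapes(ET.fromstring(TREE_XML)))
print(shapes)  # [('Title', 0), ('Box', 1), ('Logo', 2)]
```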
Advanced Features
Placeholder Types
Automatic detection of PowerPoint placeholders:
- PP_PLACEHOLDER.TITLE - Slide title
- PP_PLACEHOLDER.CENTER_TITLE - Centered title
- PP_PLACEHOLDER.SUBTITLE - Subtitle
- PP_PLACEHOLDER.BODY - Content placeholder
- PP_PLACEHOLDER.OBJECT - Object placeholder
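In the underlying XML, these placeholder kinds appear as the `type` attribute of a `<p:ph>` element. The sketch below maps those attribute values to the enum member names listed above using only the standard library; `placeholder_kind` and the `"OTHER"` fallback label are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
NS = {"p": P_NS}

# <p:ph type="..."> values and the PP_PLACEHOLDER member each corresponds to.
PH_TYPE_TO_ENUM = {
    "title": "TITLE",
    "ctrTitle": "CENTER_TITLE",
    "subTitle": "SUBTITLE",
    "body": "BODY",
    "obj": "OBJECT",
}


def placeholder_kind(shape_xml: str) -> Optional[str]:
    """Return the placeholder kind of a shape, or None if it is not a placeholder."""
    shape = ET.fromstring(shape_xml)
    ph = shape.find(".//p:ph", NS)
    if ph is None:
        return None
    # A <p:ph> without a type attribute defaults to "obj" in the schema.
    return PH_TYPE_TO_ENUM.get(ph.get("type", "obj"), "OTHER")


def shape_with(ph_xml: str) -> str:
    """Wrap a <p:ph> snippet in a minimal, hypothetical <p:sp> element."""
    return f'<p:sp xmlns:p="{P_NS}"><p:nvSpPr><p:nvPr>{ph_xml}</p:nvPr></p:nvSpPr></p:sp>'


print(placeholder_kind(shape_with('<p:ph type="ctrTitle"/>')))  # CENTER_TITLE
print(placeholder_kind(shape_with('<p:ph idx="1"/>')))          # OBJECT
print(placeholder_kind(shape_with("")))                         # None
```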
Line Breaks
Line breaks in PowerPoint text are converted to spaces for better text flow.
Empty Slides
Slides without content still create page entries.
Performance
- Speed: Fast declarative conversion (no ML models)
- Memory: Low memory footprint
- Concurrency: Thread-safe per document instance
Limitations
Troubleshooting
Missing bullet points
Cause: Complex list style inheritance
Check: Verify list formatting in PowerPoint source
Note: Backend checks multiple levels of style inheritance
Incorrect list numbering
Cause: Custom start numbers or broken numbering
Solution: Backend respects the start attribute on ordered lists
Missing notes
Check: Verify slides have speaker notes in PowerPoint
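One way to run this check without opening PowerPoint is to inspect the file with the separate `python-pptx` package. This is a diagnostic sketch that assumes `python-pptx` is installed; `slides_with_notes` is an illustrative helper name and is not part of the backend.

```python
def slides_with_notes(path: str) -> list:
    """Return 1-based numbers of slides whose speaker notes contain text."""
    # Imported lazily so this module loads even where python-pptx is absent.
    from pptx import Presentation

    numbers = []
    for number, slide in enumerate(Presentation(path).slides, start=1):
        if not slide.has_notes_slide:
            continue
        frame = slide.notes_slide.notes_text_frame
        # The notes slide may lack a notes placeholder, or hold only whitespace.
        if frame is not None and frame.text.strip():
            numbers.append(number)
    return numbers


# Example call (placeholder path):
# print(slides_with_notes("presentation.pptx"))
```

An empty result suggests the source file has no notes text, in which case nothing is missing from the conversion.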
Image quality issues
Note: Images are extracted at original embedded resolution
Workaround: Use higher resolution images in source presentation
Export Formats
After conversion, the document can be exported to formats such as Markdown, HTML, and JSON.
Use Cases
Content Extraction
Extract text and tables from presentations for analysis or archival
Slide Summarization
Convert presentations to text format for LLM processing and summarization
Training Materials
Extract course content from educational presentations
Documentation
Convert technical presentations to structured documentation
See Also
- Backends Overview - Backend architecture
- DOCX Backend - Word document processing
- XLSX Backend - Excel processing
- DocumentConverter - Main conversion API