Format Categories
Office Documents
Word, PowerPoint, Excel, and Outlook files
PDF Documents
PDF files with table extraction and text processing
Images
JPEG and PNG with EXIF metadata and OCR
Audio Files
Audio files with metadata and speech transcription
Web Content
HTML, RSS, Wikipedia, YouTube, and Bing search result pages
Other Formats
CSV, JSON, XML, ZIP, EPUB, and Jupyter notebooks
All Supported Formats
Office Documents
| Format | Extension | Dependencies |
|---|---|---|
| Word | .docx | mammoth |
| PowerPoint | .pptx | python-pptx |
| Excel (modern) | .xlsx | pandas, openpyxl |
| Excel (legacy) | .xls | pandas, xlrd |
| Outlook | .msg | olefile |
Documents
| Format | Extension | Dependencies |
|---|---|---|
| PDF | .pdf | pdfminer.six, pdfplumber |
| EPUB | .epub | Built-in |
| Jupyter Notebook | .ipynb | Built-in |
Media
| Format | Extension | Dependencies |
|---|---|---|
| JPEG Images | .jpg, .jpeg | exiftool (optional) |
| PNG Images | .png | exiftool (optional) |
| Audio (WAV) | .wav | speech_recognition, pydub |
| Audio (MP3) | .mp3 | speech_recognition, pydub |
| Audio (M4A) | .m4a | speech_recognition, pydub |
| Video (MP4) | .mp4 | speech_recognition, pydub |
Web & Data
| Format | Extension | Dependencies |
|---|---|---|
| HTML | .html, .htm | beautifulsoup4 |
| RSS/Atom | .rss, .atom, .xml | beautifulsoup4, defusedxml |
| CSV | .csv | Built-in |
| JSON | .json, .jsonl | Built-in |
| Plain Text | .txt, .md | Built-in |
| ZIP Archives | .zip | Built-in |
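The file-format tables above can be read as a lookup from extension to required packages. The following is a minimal sketch of a pre-flight check built on that idea. The names `REQUIRED_MODULES` and `missing_modules` are hypothetical and not part of any documented API, and the mapping is transcribed by hand from the tables, using the importable module name where it differs from the package name.

```python
from importlib.util import find_spec
from pathlib import Path

# Extension -> importable module names, transcribed from the tables above.
# Import names differ from package names in a few cases:
# python-pptx -> pptx, pdfminer.six -> pdfminer, beautifulsoup4 -> bs4.
# exiftool is an external program, not a Python module, so it is not listed.
AUDIO = ["speech_recognition", "pydub"]
FEED = ["bs4", "defusedxml"]
REQUIRED_MODULES = {
    ".docx": ["mammoth"],
    ".pptx": ["pptx"],
    ".xlsx": ["pandas", "openpyxl"],
    ".xls": ["pandas", "xlrd"],
    ".msg": ["olefile"],
    ".pdf": ["pdfminer", "pdfplumber"],
    ".wav": AUDIO,
    ".mp3": AUDIO,
    ".m4a": AUDIO,
    ".mp4": AUDIO,
    ".html": ["bs4"],
    ".htm": ["bs4"],
    ".rss": FEED,
    ".atom": FEED,
    ".xml": FEED,
}


def missing_modules(filename):
    """Return the modules from the table that cannot be imported for this file.

    Extensions absent from the mapping (built-in or unknown formats) are
    treated as needing no extra modules, so they yield an empty list.
    """
    ext = Path(filename).suffix.lower()
    return [name for name in REQUIRED_MODULES.get(ext, []) if find_spec(name) is None]
```

Calling `missing_modules("report.pdf")` returns an empty list when both PDF modules are importable, and otherwise the names still to be installed. The check only looks at the file extension, so it says nothing about whether the file content is valid.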
Web Services
| Service | URL Pattern | Dependencies |
|---|---|---|
| Wikipedia | *.wikipedia.org | beautifulsoup4 |
| YouTube | youtube.com/watch?v=* | beautifulsoup4, youtube-transcript-api |
| Bing Search | bing.com/search?q=* | beautifulsoup4 |
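The URL patterns in the table above can be illustrated with a small classifier. This is a sketch under stated assumptions: the function name `classify_url` is hypothetical, and the exact matching rules the converter applies internally are not given here, so the checks below simply follow the patterns as written in the table.

```python
from urllib.parse import parse_qs, urlparse


def classify_url(url):
    """Map a URL to 'wikipedia', 'youtube', 'bing', or None.

    Assumption: the table's patterns are matched on host, path, and the
    presence of a non-empty query parameter.
    """
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    query = parse_qs(parts.query)  # blank values are dropped by default

    # *.wikipedia.org: any language subdomain
    if host.endswith(".wikipedia.org"):
        return "wikipedia"
    # youtube.com/watch?v=*
    if host in ("youtube.com", "www.youtube.com") and parts.path == "/watch" and "v" in query:
        return "youtube"
    # bing.com/search?q=*
    if host in ("bing.com", "www.bing.com") and parts.path == "/search" and "q" in query:
        return "bing"
    return None
```

A URL that matches none of the patterns returns `None`, which a caller could treat as a signal to fall back to generic HTML handling.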
Feature Matrix
| Format Category | Text Extraction | Table Support | Metadata | Images | Advanced Features |
|---|---|---|---|---|---|
| Office Documents | ✓ | ✓ | ✓ | ✓ | Charts, slide notes |
| PDF | ✓ | ✓ | ✗ | ✗ | Form detection |
| Images | ✗ | ✗ | ✓ | ✓ | EXIF, LLM captioning, OCR |
| Audio | ✗ | ✗ | ✓ | ✗ | Speech transcription |
| Web Content | ✓ | ✓ | ✓ | ✗ | Feed parsing |
| Data Formats | ✓ | ✓ | ✓ | ✗ | Structure preservation |
Installation by Format
Install dependencies for specific format categories:
Next Steps
Quick Start
Get started with basic conversion
Python API
Learn the programmatic interface