Markdown Functions

Markdown functions provide tools for parsing, chunking, and extracting content from Markdown documents. All functions are available via fc.markdown.*.

to_json

Converts a column of Markdown-formatted strings into a hierarchical JSON representation.

fc.markdown.to_json(column: ColumnOrName) -> Column

column

ColumnOrName

required

Input column containing Markdown strings.

return

Column

A column of JSON-formatted strings representing the structured document tree.

This function parses Markdown into a structured JSON format optimized for document chunking, semantic analysis, and jq queries. The output conforms to a custom schema that organizes content into nested sections based on heading levels. The full JSON schema is available at: docs.fenic.ai/topics/markdown-json

Supported Markdown Features

Headings with nested hierarchy (e.g., h2 → h3 → h4)
Paragraphs with inline formatting (bold, italics, links, code, etc.)
Lists (ordered, unordered, task lists)
Tables with header alignment and inline content
Code blocks with language info
Blockquotes, horizontal rules, and inline/flow HTML

Examples

df.select(fc.markdown.to_json(fc.col("markdown_text")))

get_code_blocks

Extracts all code blocks from a column of Markdown-formatted strings.

fc.markdown.get_code_blocks(
    column: ColumnOrName,
    language_filter: Optional[str] = None
) -> Column

column

ColumnOrName

required

Input column containing Markdown strings.

language_filter

Optional[str]

Optional language filter to extract only code blocks with a specific language. By default, all code blocks are extracted.

return

Column

A column of code blocks. The output column type is ArrayType(StructType([StructField("language", StringType), StructField("code", StringType)])).

Code blocks are parsed from fenced Markdown blocks (e.g., triple backticks).
Language identifiers are optional and may be null if not provided in the original Markdown.
Indented code blocks without fences are not currently supported.
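To make the output shape concrete, here is a minimal pure-Python sketch of the behavior described above. It is an illustration only, not fenic's implementation: the helper name sketch_get_code_blocks, the regular expression, and the exact-match handling of language_filter are all assumptions.

```python
import re
from typing import Dict, List, Optional

FENCE = "`" * 3  # built indirectly so this example can itself live inside a fenced block

# Opening fence with an optional language tag, a lazily matched body, then a closing fence.
BLOCK_RE = re.compile(
    rf"^{FENCE}[ \t]*([\w+#.-]*)[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def sketch_get_code_blocks(
    markdown: str, language_filter: Optional[str] = None
) -> List[Dict[str, Optional[str]]]:
    """Hypothetical stand-in that mirrors the documented {language, code} struct."""
    blocks = []
    for match in BLOCK_RE.finditer(markdown):
        language = match.group(1) or None  # null when the fence has no language tag
        code = match.group(2).rstrip("\n")
        # Assumption: the filter is an exact match on the language tag.
        if language_filter is None or language == language_filter:
            blocks.append({"language": language, "code": code})
    return blocks


doc = "\n".join([
    "Intro text",
    FENCE + "python",
    "print('hi')",
    FENCE,
    "    indented = 'not fenced, so ignored'",
    FENCE,
    "plain block",
    FENCE,
])

all_blocks = sketch_get_code_blocks(doc)
python_only = sketch_get_code_blocks(doc, language_filter="python")
```

In this sketch the fenced block without a language tag yields a null language, and the indented line is skipped because it is not fenced.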

Examples

df.select(fc.markdown.get_code_blocks(fc.col("markdown_text")))

generate_toc

Generates a table of contents from Markdown headings.

fc.markdown.generate_toc(
    column: ColumnOrName,
    max_level: Optional[int] = None
) -> Column

column

ColumnOrName

required

Input column containing Markdown strings.

max_level

Optional[int]

Maximum heading level to include in the TOC (1-6). Defaults to None, which includes all levels (equivalent to 6).

return

Column

A column of Markdown-formatted table of contents strings.

The TOC is generated using Markdown heading syntax (#, ##, ###, etc.).
Each heading in the source document becomes a line in the TOC.
The heading level is preserved in the output.
The result is a valid Markdown document that can be rendered or processed further.
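The notes above can be approximated with a short pure-Python sketch. It is not fenic's implementation: the helper name sketch_generate_toc and the heading regex are assumptions, and it only recognizes ATX-style (#) headings outside fenced code blocks.

```python
import re
from typing import Optional

# ATX heading: one to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$")


def sketch_generate_toc(markdown: str, max_level: Optional[int] = None) -> str:
    """Hypothetical stand-in: keep heading lines up to max_level, preserving their level."""
    limit = 6 if max_level is None else max_level
    toc_lines = []
    in_fence = False
    for line in markdown.splitlines():
        # Skip fenced code so a '#' comment inside code is not mistaken for a heading.
        if line.lstrip().startswith("`" * 3):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) <= limit:
            toc_lines.append(f"{match.group(1)} {match.group(2)}")
    return "\n".join(toc_lines)


sample = "\n".join([
    "# Introduction",
    "Overview text",
    "",
    "## Getting Started",
    "Setup instructions",
    "",
    "### Prerequisites",
    "Python 3.8+ required",
    "",
    "## API Reference",
    "Function documentation",
])

full_toc = sketch_generate_toc(sample)
shallow_toc = sketch_generate_toc(sample, max_level=2)
```

With max_level=2 the level-3 Prerequisites heading is dropped, while the remaining headings keep their original # or ## prefix.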

Examples

df.select(fc.markdown.generate_toc(fc.col("documentation")))

extract_header_chunks

Splits Markdown documents into logical chunks based on heading hierarchy.

fc.markdown.extract_header_chunks(
    column: ColumnOrName,
    header_level: int
) -> Column

column

ColumnOrName

required

Input column containing Markdown strings.

header_level

int

required

Heading level to split on (1-6). Creates a new chunk at every heading of this level, including all nested content and subsections.

return

Column

A column of arrays containing chunk objects with the following structure:

ArrayType(StructType([
    StructField("heading", StringType),        # Heading text (clean, no markdown)
    StructField("level", IntegerType),         # Heading level (1-6)
    StructField("content", StringType),        # All content under this heading (clean text)
    StructField("parent_heading", StringType), # Parent heading text (or null)
    StructField("full_path", StringType),      # Full breadcrumb path
]))

Features

Context-preserving: Each chunk contains all content and subsections under the heading
Hierarchical awareness: Includes parent heading context for better LLM understanding
Clean text output: Strips markdown formatting for direct LLM consumption

Chunking Behavior

With header_level=2, this markdown:

# Introduction
Overview text

## Getting Started
Setup instructions

### Prerequisites
Python 3.8+ required

## API Reference
Function documentation

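Produces 2 chunks (the level-1 Introduction heading and its overview text sit above the split level, so they do not form a chunk):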

Getting Started chunk (includes Prerequisites subsection)
API Reference chunk
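The chunking rule above can be sketched in plain Python. This is a rough illustration under stated assumptions, not fenic's implementation: the helper name sketch_header_chunks is hypothetical, the " > " separator in full_path is assumed, and "clean text" is approximated by stripping only the # markers from nested headings.

```python
import re
from typing import Any, Dict, List

HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$")


def sketch_header_chunks(markdown: str, header_level: int) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: one chunk per heading at header_level."""
    chunks: List[Dict[str, Any]] = []
    ancestors: Dict[int, str] = {}  # shallower level -> most recent heading text
    current = None
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        level = len(match.group(1)) if match else None
        if match and level <= header_level:
            # A heading at or above the split level closes the chunk in progress.
            if current is not None:
                chunks.append(current)
                current = None
            # Forget ancestors that are not strictly shallower than this heading.
            ancestors = {lvl: text for lvl, text in ancestors.items() if lvl < level}
            if level == header_level:
                path = [ancestors[lvl] for lvl in sorted(ancestors)] + [match.group(2)]
                current = {
                    "heading": match.group(2),
                    "level": level,
                    "content": [],
                    "parent_heading": path[-2] if len(path) > 1 else None,
                    "full_path": " > ".join(path),  # separator is an assumption
                }
            else:
                ancestors[level] = match.group(2)
        elif current is not None:
            # Nested headings stay inside the chunk, with their '#' markers removed.
            current["content"].append(match.group(2) if match else line)
    if current is not None:
        chunks.append(current)
    for chunk in chunks:
        chunk["content"] = "\n".join(chunk["content"]).strip()
    return chunks


sample = "\n".join([
    "# Introduction",
    "Overview text",
    "",
    "## Getting Started",
    "Setup instructions",
    "",
    "### Prerequisites",
    "Python 3.8+ required",
    "",
    "## API Reference",
    "Function documentation",
])

chunks = sketch_header_chunks(sample, header_level=2)
```

Running the sketch on the sample document yields the same two chunks listed above, each carrying Introduction as its parent_heading.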

Examples

df.select(fc.markdown.extract_header_chunks(fc.col("articles"), header_level=1))
