Text functions provide string manipulation, chunking, template rendering, and text extraction capabilities. All functions are available via fc.text.*.

Chunking Functions

recursive_character_chunk

Splits a string column into chunks of a specified size (in characters) with a configurable overlap, preferring split points that preserve text structure.
fc.text.recursive_character_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int,
    chunking_character_set_custom_characters: Optional[list[str]] = None,
) -> Column
column
ColumnOrName
required
The input string column or column name to chunk.
chunk_size
int
required
The size of each chunk in characters.
chunk_overlap_percentage
int
required
The overlap between each chunk as a percentage of the chunk size.
chunking_character_set_custom_characters
Optional[list[str]]
List of alternative characters to split on. Characters should be ordered from coarsest to finest desired granularity. Default is ['\n\n', '\n', '.', ';', ':', ' ', '-', ''].
return
Column
A column containing the chunks as an array of strings.

Example

df.select(
    fc.text.recursive_character_chunk(fc.col("text"), 100, 20).alias("chunks")
)
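
The recursive strategy described above can be pictured in plain Python: try the coarsest separator first, and re-split any piece that is still too large with the next finer one. The sketch below is only an illustration of that idea, not fenic's implementation; `recursive_split` is a hypothetical helper, and overlap handling is omitted for brevity.

```python
DEFAULT_SEPARATORS = ("\n\n", "\n", ".", ";", ":", " ", "-", "")

def recursive_split(text, chunk_size, separators=DEFAULT_SEPARATORS):
    """Illustrative sketch: split coarsest-first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Finest level (or no separators left): fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    # Keep each separator attached to the piece it terminates.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # merge small neighbours up to chunk_size
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph is longer. It has two sentences."
chunks = recursive_split(sample, 30)
```

Here the paragraph break is used first, and only the oversized second paragraph is split further at sentence boundaries.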

recursive_word_chunk

Splits a string column into chunks of a specified size (in words) with a configurable overlap.
fc.text.recursive_word_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int,
    chunking_character_set_custom_characters: Optional[list[str]] = None,
) -> Column
column
ColumnOrName
required
The input string column or column name to chunk.
chunk_size
int
required
The size of each chunk in words.
chunk_overlap_percentage
int
required
The overlap between each chunk as a percentage of the chunk size.
chunking_character_set_custom_characters
Optional[list[str]]
List of alternative characters to split on.
return
Column
A column containing the chunks as an array of strings.

Example

df.select(
    fc.text.recursive_word_chunk(fc.col("text"), 100, 20).alias("chunks")
)

recursive_token_chunk

Splits a string column into chunks of a specified size (in tokens) with a configurable overlap.
fc.text.recursive_token_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int,
    chunking_character_set_custom_characters: Optional[list[str]] = None,
) -> Column
column
ColumnOrName
required
The input string column or column name to chunk.
chunk_size
int
required
The size of each chunk in tokens.
chunk_overlap_percentage
int
required
The overlap between each chunk as a percentage of the chunk size.
chunking_character_set_custom_characters
Optional[list[str]]
List of alternative characters to split on.
return
Column
A column containing the chunks as an array of strings.

Example

df.select(
    fc.text.recursive_token_chunk(fc.col("text"), 100, 20).alias("chunks")
)

character_chunk

Splits a string column into chunks of a specified size (in characters) with an optional overlap, using a simple sliding window.
fc.text.character_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int = 0
) -> Column
column
ColumnOrName
required
The input string column or column name to chunk.
chunk_size
int
required
The size of each chunk in characters.
chunk_overlap_percentage
int
default:"0"
The overlap between chunks as a percentage of the chunk size.
return
Column
A column containing the chunks as an array of strings.
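
To make the relationship between chunk_size and chunk_overlap_percentage concrete, here is a minimal pure-Python sketch of a character sliding window. It is not fenic's implementation: `sliding_window_chunks` is a hypothetical helper, and the rounding of the overlap and the handling of the final window are assumptions.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap_percentage=0):
    """Illustrative sketch of fixed-size windows that share an overlap."""
    # Assumption: the overlap is rounded down to a whole number of characters.
    overlap = chunk_size * chunk_overlap_percentage // 100
    step = max(chunk_size - overlap, 1)  # always advance by at least one character
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

no_overlap = sliding_window_chunks("abcdefghij", 4)        # ['abcd', 'efgh', 'ij']
half_overlap = sliding_window_chunks("abcdefghij", 4, 50)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With a 50% overlap each window starts halfway through the previous one, so every chunk repeats the last two characters of its predecessor.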

word_chunk

Splits a string column into chunks of a specified size (in words) with an optional overlap, using a simple sliding window.
fc.text.word_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int = 0
) -> Column

token_chunk

Splits a string column into chunks of a specified size (in tokens) with an optional overlap, using a simple sliding window.
fc.text.token_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int = 0
) -> Column

count_tokens

Returns the number of tokens in a string using OpenAI’s cl100k_base encoding (tiktoken).
fc.text.count_tokens(column: ColumnOrName) -> Column
column
ColumnOrName
required
The input string column.
return
Column
A column with the token counts for each input string.

Example

df.select(fc.text.count_tokens(fc.col("text")))

String Manipulation

extract

Extracts structured data from text using template-based pattern matching.
fc.text.extract(column: ColumnOrName, template: str) -> Column
column
ColumnOrName
required
Input text column to extract from.
template
str
required
Template string with placeholders as ${field_name} or ${field_name:format}. Available formats: none, csv, json, quoted.
return
Column
Struct column with fields corresponding to template placeholders.

Examples

fc.text.extract(fc.col("log"), "${date} ${level} ${message}")
# Input: "2024-01-15 ERROR Connection failed"
# Output: {date: "2024-01-15", level: "ERROR", message: "Connection failed"}
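
One way to think about template-based extraction is as a regular expression built from the template: literal text must match exactly, and each placeholder captures whatever lies between the literals. The sketch below illustrates that idea with Python's `re` module. It is an assumption about the general technique, not fenic's implementation; `template_to_regex` and `extract_fields` are hypothetical helpers, and the `:format` suffix is parsed but ignored.

```python
import re

# Matches ${field_name} or ${field_name:format}.
PLACEHOLDER = re.compile(r"\$\{(\w+)(?::\w+)?\}")

def template_to_regex(template):
    """Turn literal text into escaped regex and placeholders into named groups."""
    pattern, pos = "", 0
    for match in PLACEHOLDER.finditer(template):
        pattern += re.escape(template[pos:match.start()])  # literal text before the placeholder
        pattern += f"(?P<{match.group(1)}>.+?)"             # lazy capture for the field
        pos = match.end()
    pattern += re.escape(template[pos:])
    return re.compile(pattern, re.DOTALL)

def extract_fields(text, template):
    """Return a dict of captured fields, or None if the text does not fit the template."""
    match = template_to_regex(template).fullmatch(text)
    return match.groupdict() if match else None

fields = extract_fields(
    "2024-01-15 ERROR Connection failed",
    "${date} ${level} ${message}",
)
```

Because the whole string must match, the last placeholder absorbs the remainder of the line, which is why `message` keeps its internal space.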

concat

Concatenates multiple columns or strings into a single string.
fc.text.concat(*cols: ColumnOrName) -> Column
*cols
ColumnOrName
required
Columns or strings to concatenate.
return
Column
A column containing the concatenated strings.

Example

df.select(fc.text.concat(fc.col("col1"), fc.lit(" "), fc.col("col2")))

concat_ws

Concatenates multiple columns or strings into a single string with a separator.
fc.text.concat_ws(separator: str, *cols: ColumnOrName) -> Column
separator
str
required
The separator to use.
*cols
ColumnOrName
required
Columns or strings to concatenate.
return
Column
A column containing the concatenated strings.

Example

df.select(fc.text.concat_ws(",", fc.col("col1"), fc.col("col2")))

replace

Replace all occurrences of a pattern with a new string, treating pattern as a literal string.
fc.text.replace(
    src: ColumnOrName,
    search: Union[Column, str],
    replace: Union[Column, str]
) -> Column
src
ColumnOrName
required
The input string column or column name to perform replacements on.
search
Union[Column, str]
required
The pattern to search for (can be a string or column expression).
replace
Union[Column, str]
required
The string to replace with (can be a string or column expression).
return
Column
A column containing the strings with replacements applied.

Example

df.select(fc.text.replace(fc.col("name"), "foo", "bar"))

regexp_replace

Replace all occurrences of a pattern with a new string, treating pattern as a regular expression.
fc.text.regexp_replace(
    src: ColumnOrName,
    pattern: Union[Column, str],
    replacement: Union[Column, str],
) -> Column
src
ColumnOrName
required
The input string column or column name to perform replacements on.
pattern
Union[Column, str]
required
The regular expression pattern to search for.
replacement
Union[Column, str]
required
The string to replace with.
return
Column
A column containing the strings with replacements applied.

Example

df.select(fc.text.regexp_replace(fc.col("text"), r"\d+", "--"))
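
The difference between literal and regular-expression replacement matters most when the search string contains regex metacharacters. The snippet below uses Python's built-in `str.replace` and `re.sub` purely as stand-ins for the two behaviours; the regex dialect fenic uses may differ from Python's in its details.

```python
import re

version = "v1.2.3"

# Literal replacement: "." means a dot and nothing else.
literal = version.replace(".", "-")            # 'v1-2-3'

# Regex replacement: an unescaped "." matches every character.
as_regex = re.sub(".", "-", version)           # '------'

# Escaping the pattern restores the literal meaning.
escaped = re.sub(re.escape("."), "-", version) # 'v1-2-3'

# The regex form is useful for character classes such as runs of digits.
digits = re.sub(r"\d+", "--", "abc123def456")  # 'abc--def--'
```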

Regular Expressions

regexp_count

Count the number of times a regex pattern is matched in a string.
fc.text.regexp_count(src: ColumnOrName, pattern: Union[Column, str]) -> Column

regexp_extract

Extract a specific regex group from a string.
fc.text.regexp_extract(
    src: ColumnOrName,
    pattern: Union[Column, str],
    idx: int
) -> Column
idx
int
required
The group index to extract (0 = entire match, 1+ = capture groups).

Example

fc.text.regexp_extract("email", r"([^@]+)@", 1)
# Input: "user@example.com"
# Output: "user"

regexp_extract_all

Extract all strings matching a regex pattern, optionally from a specific group.
fc.text.regexp_extract_all(
    src: ColumnOrName,
    pattern: Union[Column, str],
    idx: Union[Column, int] = 0
) -> Column
return
Column
An array column containing all matches.

Example

fc.text.regexp_extract_all("text", r"\d+")
# Input: "abc123def456"
# Output: ["123", "456"]
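
The idx argument selects which capture group is collected from each match. The sketch below mirrors that behaviour with Python's `re.finditer`; `extract_all` is a hypothetical stand-in, not the fenic function, and Python's regex syntax may differ from the engine fenic uses.

```python
import re

def extract_all(text, pattern, idx=0):
    """Collect group `idx` from every match (0 = the entire match)."""
    return [match.group(idx) for match in re.finditer(pattern, text)]

pairs = "a=1, b=22"
whole = extract_all(pairs, r"(\w+)=(\d+)")      # ['a=1', 'b=22']
keys = extract_all(pairs, r"(\w+)=(\d+)", 1)    # ['a', 'b']
values = extract_all(pairs, r"(\w+)=(\d+)", 2)  # ['1', '22']
```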

split

Split a string column into an array using a regular expression pattern.
fc.text.split(src: ColumnOrName, pattern: str, limit: int = -1) -> Column
pattern
str
required
The regular expression pattern to split on.
limit
int
default:"-1"
Maximum number of splits to perform. If > 0, returns at most limit+1 elements.

Example

df.select(fc.text.split(fc.col("text"), r"\s+"))
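
The effect of limit can be illustrated with Python's `re.split`, whose `maxsplit` argument likewise caps the number of splits so that at most `maxsplit + 1` elements come back. `split_with_limit` is a hypothetical helper written for this sketch; treating any non-positive limit as "no limit" is an assumption.

```python
import re

def split_with_limit(text, pattern, limit=-1):
    """Split on a regex; a positive limit caps the number of splits."""
    # re.split uses maxsplit=0 to mean "no limit".
    maxsplit = limit if limit > 0 else 0
    return re.split(pattern, text, maxsplit=maxsplit)

unlimited = split_with_limit("a  b c   d", r"\s+")         # ['a', 'b', 'c', 'd']
limited = split_with_limit("a  b c   d", r"\s+", limit=2)  # ['a', 'b', 'c   d']
```

With `limit=2` the remainder of the string is returned unsplit as the third element.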

split_part

Split a string and return a specific part using 1-based indexing.
fc.text.split_part(
    src: ColumnOrName,
    delimiter: Union[Column, str],
    part_number: Union[Column, int]
) -> Column
delimiter
Union[Column, str]
required
The delimiter to split on.
part_number
Union[Column, int]
required
Which part to return (1-based integer index). Negative values count from the end.

Example

df.select(fc.text.split_part(fc.col("text"), ",", 2))
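
The 1-based and negative indexing can be sketched in plain Python as follows. `split_part_sketch` is a hypothetical helper; how fenic treats a part number of zero or one that is out of range is not stated above, so both choices here are assumptions.

```python
def split_part_sketch(text, delimiter, part_number):
    """Return the requested part: 1 is the first part, -1 is the last."""
    if part_number == 0:
        # Assumption: zero is rejected because indexing is 1-based.
        raise ValueError("part_number is 1-based; 0 is not a valid index")
    parts = text.split(delimiter)
    index = part_number - 1 if part_number > 0 else part_number
    if -len(parts) <= index < len(parts):
        return parts[index]
    return ""  # Assumption: out-of-range parts yield an empty string.

second = split_part_sketch("a,b,c", ",", 2)   # 'b'
last = split_part_sketch("a,b,c", ",", -1)    # 'c'
```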

String Transformations

upper

Convert all characters in a string column to uppercase.
fc.text.upper(column: ColumnOrName) -> Column

lower

Convert all characters in a string column to lowercase.
fc.text.lower(column: ColumnOrName) -> Column

title_case

Convert the first character of each word in a string column to uppercase.
fc.text.title_case(column: ColumnOrName) -> Column

trim

Remove whitespace from both sides of strings in a column.
fc.text.trim(column: ColumnOrName) -> Column

ltrim

Remove whitespace from the start of strings in a column.
fc.text.ltrim(col: ColumnOrName) -> Column

rtrim

Remove whitespace from the end of strings in a column.
fc.text.rtrim(col: ColumnOrName) -> Column

length

Calculate the character length of each string in the column.
fc.text.length(column: ColumnOrName) -> Column

byte_length

Calculate the byte length of each string in the column.
fc.text.byte_length(column: ColumnOrName) -> Column
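
Character length and byte length differ as soon as a string contains non-ASCII characters. The plain-Python comparison below assumes UTF-8 encoding, which the text above does not state explicitly.

```python
text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes to avoid encoding ambiguity

char_count = len(text)                  # 10 characters
byte_count = len(text.encode("utf-8"))  # 12 bytes: "ï" and "é" take two bytes each
```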

Template Rendering

jinja

Render a Jinja template using values from the specified columns.
fc.text.jinja(
    jinja_template: str,
    /,
    strict: bool = True,
    **columns: Column
) -> Column
jinja_template
str
required
A Jinja2 template string to render for each row. Variables are referenced using double braces: {{ variable_name }}.
strict
bool
default:"True"
If True, when any of the provided columns has a None value for a row, the entire row’s output will be None.
**columns
Column
required
Keyword arguments mapping variable names to columns.
return
Column
A string column containing the rendered template for each row.

Example

LLM prompt formatting
prompt_template = '''
Answer the user's question.

{% if context %}
Context: {{ context }}
{% endif %}

Question: {{ query }}

Please provide a {{ style }} response.'''

result = df.select(
    fc.text.jinja(
        prompt_template,
        query=fc.col("user_question"),
        context=fc.col("retrieved_context"),
        style=fc.when(fc.col("query_type") == "technical", "detailed")
              .otherwise("concise")
    ).alias("llm_prompt")
)

Fuzzy Matching

compute_fuzzy_ratio

Compute the similarity between two strings using a fuzzy string matching algorithm.
fc.text.compute_fuzzy_ratio(
    column: ColumnOrName,
    other: Union[Column, str],
    method: FuzzySimilarityMethod = "indel"
) -> Column
column
ColumnOrName
required
A string column or column name.
other
Union[Column, str]
required
A second string column or literal string.
method
FuzzySimilarityMethod
default:"indel"
Similarity method: "indel", "levenshtein", "damerau_levenshtein", "jaro", "jaro_winkler", or "hamming".
return
Column
A double column with similarity scores in the range [0, 100].

Example

result = df.select(
    fc.text.compute_fuzzy_ratio(fc.col("a"), fc.col("b"), method="levenshtein").alias("sim")
)
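
To give a feel for how two of the methods score strings on a 0 to 100 scale, here is a pure-Python sketch of the "indel" and "levenshtein" similarities. The normalisations used (twice the longest common subsequence over the combined length, and one minus the edit distance over the longer length) are common conventions and an assumption here; fenic's exact formulas may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_ratio(a, b):
    """Similarity when only insertions and deletions are allowed."""
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100.0 * 2 * lcs_length(a, b) / total

def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ch_a != ch_b)))
        prev = curr
    return prev[-1]

def levenshtein_ratio(a, b):
    """Similarity normalised by the length of the longer string."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein_distance(a, b) / longest)

indel_score = indel_ratio("kitten", "sitting")              # about 61.5
levenshtein_score = levenshtein_ratio("kitten", "sitting")  # about 57.1
```

The two methods disagree because a substitution costs one edit under Levenshtein but counts as a deletion plus an insertion under indel.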

compute_fuzzy_token_sort_ratio

Compute fuzzy similarity after sorting tokens in each string.
fc.text.compute_fuzzy_token_sort_ratio(
    column: ColumnOrName,
    other: Union[Column, str],
    method: FuzzySimilarityMethod = "indel"
) -> Column
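
Sorting tokens before comparing makes the score insensitive to word order. The sketch below shows the preprocessing step and uses `difflib.SequenceMatcher` from the standard library only as a stand-in scorer. Splitting on whitespace and comparing case-sensitively are assumptions, and `sort_tokens` and `token_sort_ratio` are hypothetical helpers rather than fenic functions.

```python
from difflib import SequenceMatcher

def sort_tokens(text):
    """Split on whitespace, sort the tokens, and rejoin with single spaces."""
    return " ".join(sorted(text.split()))

def ratio(a, b):
    """Stand-in similarity score in [0, 100]."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a, b):
    return ratio(sort_tokens(a), sort_tokens(b))

raw_score = ratio("new york city", "city york new")                # below 100: word order differs
sorted_score = token_sort_ratio("new york city", "city york new")  # 100.0: same tokens
```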

Transcript Parsing

parse_transcript

Parses a transcript from text to a structured format with unified schema.
fc.text.parse_transcript(
    column: ColumnOrName,
    format: TranscriptFormatType
) -> Column
column
ColumnOrName
required
The input string column or column name containing transcript text.
format
TranscriptFormatType
required
The format of the transcript: "srt", "webvtt", or "generic".
return
Column
A column containing an array of structured transcript entries with fields: index, speaker, start_time, end_time, duration, content, format.

Example

df.select(fc.text.parse_transcript(fc.col("transcript"), "srt"))
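
For orientation, the sketch below parses a small SRT snippet into entries with the field names listed above. It is a simplified illustration rather than fenic's parser: `parse_srt` is a hypothetical helper, representing times as seconds is an assumption, and `speaker` is left as None because plain SRT cues carry no speaker field.

```python
import re

# One SRT cue: index line, "start --> end" line, then text up to a blank line.
SRT_CUE = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?=\n\s*\n|\Z)",
    re.DOTALL,
)

def to_seconds(timestamp):
    """Convert 'HH:MM:SS,mmm' to seconds."""
    hours, minutes, rest = timestamp.split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

def parse_srt(text):
    entries = []
    for cue in SRT_CUE.finditer(text.strip()):
        start, end = to_seconds(cue.group(2)), to_seconds(cue.group(3))
        entries.append({
            "index": int(cue.group(1)),
            "speaker": None,  # assumption: not available in plain SRT
            "start_time": start,
            "end_time": end,
            "duration": end - start,
            "content": cue.group(4).strip(),
            "format": "srt",
        })
    return entries

srt_text = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
General Kenobi.
"""
entries = parse_srt(srt_text)
```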
