fc.text.*.
Chunking Functions
recursive_character_chunk
Chunks a string column into chunks of a specified size (in characters) with an optional overlap, preserving text structure.
The input string column or column name to chunk.
The size of each chunk in characters.
The overlap between each chunk as a percentage of the chunk size.
List of alternative characters to split on. Characters should be ordered from coarsest to finest desired granularity. Default is ['\n\n', '\n', '.', ';', ':', ' ', '-', ''].
A column containing the chunks as an array of strings.
Example
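A rough plain-Python illustration of the recursive strategy described above: each separator is tried from coarsest to finest, and a finer one is used only when a piece is still longer than the chunk size. This is a sketch and not fenic's implementation. `recursive_split` is a hypothetical helper; it ignores overlap, drops the separators, and does not merge small neighbouring pieces back together.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into pieces of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left, or the empty-string separator: cut at fixed widths.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so move on to the next finer one.
        return recursive_split(text, chunk_size, finer)
    pieces = []
    for part in text.split(sep):
        # Parts that are still too long are split again with finer separators.
        pieces.extend(recursive_split(part, chunk_size, finer))
    return pieces


chunks = recursive_split("aaaa\n\nbbbb cccc", 5, ["\n\n", " ", ""])
print(chunks)
```

Ordering the separators from paragraph breaks down to single characters means a chunk boundary falls on the largest natural break that still satisfies the size limit.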
recursive_word_chunk
Chunks a string column into chunks of a specified size (in words) with an optional overlap.
The input string column or column name to chunk.
The size of each chunk in words.
The overlap between each chunk as a percentage of the chunk size.
List of alternative characters to split on.
A column containing the chunks as an array of strings.
Example
recursive_token_chunk
Chunks a string column into chunks of a specified size (in tokens) with an optional overlap.
The input string column or column name to chunk.
The size of each chunk in tokens.
The overlap between each chunk as a percentage of the chunk size.
List of alternative characters to split on.
A column containing the chunks as an array of strings.
Example
character_chunk
Chunks a string column into chunks of a specified size (in characters) with an optional overlap using a simple sliding window.
The input string column or column name to chunk.
The size of each chunk in characters.
The overlap between chunks as a percentage of the chunk size.
A column containing the chunks as an array of strings.
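The sliding-window behaviour with a percentage overlap can be sketched in plain Python as follows. This is an illustration only, not fenic's implementation: `sliding_window_chunks` is a hypothetical helper, and rounding the overlap down to a whole number of characters is an assumption.

```python
def sliding_window_chunks(items, chunk_size, overlap_percentage=0):
    """Cut a sequence into fixed-size windows that overlap by a percentage."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")
    # Assumption: the overlap is rounded down to a whole number of items.
    overlap = chunk_size * overlap_percentage // 100
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + chunk_size])
        if start + chunk_size >= len(items):
            break  # the window already reaches the end of the input
    return chunks


print(sliding_window_chunks("abcdefghij", 4, 50))
```

The same helper works on a list of words or tokens instead of a string, which is the only difference between the character, word, and token variants in this sketch.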
word_chunk
Chunks a string column into chunks of a specified size (in words) with an optional overlap using a simple sliding window.
token_chunk
Chunks a string column into chunks of a specified size (in tokens) with an optional overlap using a simple sliding window.
count_tokens
Returns the number of tokens in a string using OpenAI’s cl100k_base encoding (tiktoken).
The input string column.
A column with the token counts for each input string.
Example
String Manipulation
extract
Extracts structured data from text using template-based pattern matching.
Input text column to extract from.
Template string with placeholders as ${field_name} or ${field_name:format}. Available formats: none, csv, json, quoted.
Struct column with fields corresponding to template placeholders.
Examples
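To make the template idea concrete, the sketch below turns a template with plain ${field_name} placeholders into a regular expression and returns the captured fields as a dictionary. It is a plain-Python illustration, not fenic's implementation: `extract_fields` is a hypothetical helper, the ${field_name:format} variants are not handled, and returning None when the text does not match is an assumption.

```python
import re


def extract_fields(template, text):
    """Match text against a template with ${field_name} placeholders."""
    # re.split with one capture group alternates: literal, name, literal, ...
    parts = re.split(r"\$\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal text between placeholders
        else:
            pattern += f"(?P<{part}>.+?)"     # one named group per placeholder
    match = re.fullmatch(pattern, text)
    return match.groupdict() if match else None


row = extract_fields("${date} ${level}: ${message}", "2024-01-15 ERROR: disk full")
print(row)
```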
concat
Concatenates multiple columns or strings into a single string.
Columns or strings to concatenate.
A column containing the concatenated strings.
Example
concat_ws
Concatenates multiple columns or strings into a single string with a separator.
The separator to use.
Columns or strings to concatenate.
A column containing the concatenated strings.
Example
replace
Replace all occurrences of a pattern with a new string, treating pattern as a literal string.
The input string column or column name to perform replacements on.
The pattern to search for (can be a string or column expression).
The string to replace with (can be a string or column expression).
A column containing the strings with replacements applied.
Example
regexp_replace
Replace all occurrences of a pattern with a new string, treating pattern as a regular expression.
The input string column or column name to perform replacements on.
The regular expression pattern to search for.
The string to replace with.
A column containing the strings with replacements applied.
Example
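The difference between literal and regular-expression replacement is easiest to see with a pattern that contains a regex metacharacter. The snippet below uses Python's built-in `str.replace` and `re.sub` as stand-ins; it does not call fenic, and fenic's regex dialect may differ from Python's.

```python
import re

text = "v1.2.3"

# Literal replacement: "." means an actual dot.
literal = text.replace(".", "-")

# Regex replacement: an unescaped "." matches every character.
as_regex = re.sub(r".", "-", text)

# Escaping the pattern makes the regex behave like the literal version.
escaped = re.sub(re.escape("."), "-", text)

print(literal, as_regex, escaped)
```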
Regular Expressions
regexp_count
Count the number of times a regex pattern is matched in a string.
regexp_extract
Extract a specific regex group from a string.
The group index to extract (0 = entire match, 1+ = capture groups).
Example
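The group-index convention (0 for the entire match, 1 and up for capture groups) is the same one Python's `re` module uses, so it can be illustrated without fenic. `extract_group` is a hypothetical helper, and returning None when nothing matches is an assumption rather than documented behaviour.

```python
import re


def extract_group(text, pattern, idx):
    """Return group idx of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(idx) if match else None


whole = extract_group("order-1234", r"(\w+)-(\d+)", 0)   # entire match
prefix = extract_group("order-1234", r"(\w+)-(\d+)", 1)  # first capture group
number = extract_group("order-1234", r"(\w+)-(\d+)", 2)  # second capture group
print(whole, prefix, number)
```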
regexp_extract_all
Extract all strings matching a regex pattern, optionally from a specific group.
An array column containing all matches.
Example
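Collecting every match, optionally narrowed to one capture group, can be sketched with Python's `re.finditer`. This is a plain-Python illustration with a hypothetical helper name (`extract_all`); it does not call fenic.

```python
import re


def extract_all(text, pattern, idx=0):
    """Return group idx from every non-overlapping match of pattern."""
    return [match.group(idx) for match in re.finditer(pattern, text)]


full_matches = extract_all("a1 b22 c333", r"[a-z](\d+)")       # entire matches
digits_only = extract_all("a1 b22 c333", r"[a-z](\d+)", idx=1)  # group 1 only
print(full_matches, digits_only)
```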
split
Split a string column into an array using a regular expression pattern.
The regular expression pattern to split on.
Maximum number of splits to perform. If > 0, returns at most limit+1 elements.
Example
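The limit rule (a limit greater than 0 performs at most that many splits, so the result has at most limit+1 elements) matches the `maxsplit` argument of Python's `re.split`, which is used here as a stand-in. `split_with_limit` and its default of -1 are hypothetical; fenic's actual default is not stated in this section.

```python
import re


def split_with_limit(text, pattern, limit=-1):
    """Split on a regex; a positive limit caps the number of splits."""
    # In re.split, maxsplit=0 means "no cap", so non-positive limits map to 0.
    return re.split(pattern, text, maxsplit=limit if limit > 0 else 0)


unlimited = split_with_limit("a , b,c ,d", r"\s*,\s*")
capped = split_with_limit("a , b,c ,d", r"\s*,\s*", limit=2)
print(unlimited, capped)
```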
split_part
Split a string and return a specific part using 1-based indexing.
The delimiter to split on.
Which part to return (1-based integer index). Negative values count from the end.
Example
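The 1-based and negative indexing rules can be sketched in plain Python. This is an illustration, not fenic's implementation: `split_part_sketch` is a hypothetical helper, and both rejecting index 0 and returning an empty string for an out-of-range index are assumptions.

```python
def split_part_sketch(text, delimiter, part_number):
    """Return the part at a 1-based index; negative values count from the end."""
    if part_number == 0:
        raise ValueError("part_number is 1-based and cannot be 0")
    parts = text.split(delimiter)
    # Convert to a Python index: 1 -> 0, 2 -> 1, while -1 stays -1.
    index = part_number - 1 if part_number > 0 else part_number
    if -len(parts) <= index < len(parts):
        return parts[index]
    return ""  # assumption: out-of-range indexes yield an empty string


first = split_part_sketch("a/b/c", "/", 1)
last = split_part_sketch("a/b/c", "/", -1)
print(first, last)
```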
String Transformations
upper
Convert all characters in a string column to uppercase.
lower
Convert all characters in a string column to lowercase.
title_case
Convert the first character of each word in a string column to uppercase.
trim
Remove whitespace from both sides of strings in a column.
ltrim
Remove whitespace from the start of strings in a column.
rtrim
Remove whitespace from the end of strings in a column.
length
Calculate the character length of each string in the column.
byte_length
Calculate the byte length of each string in the column.
Template Rendering
jinja
Render a Jinja template using values from the specified columns.
A Jinja2 template string to render for each row. Variables are referenced using double braces: {{ variable_name }}.
If True, when any of the provided columns has a None value for a row, the entire row’s output will be None.
Keyword arguments mapping variable names to columns.
A string column containing the rendered template for each row.
Example
LLM prompt formatting
Fuzzy Matching
compute_fuzzy_ratio
Compute the similarity between two strings using a fuzzy string matching algorithm.
A string column or column name.
A second string column or literal string.
Similarity method: "indel", "levenshtein", "damerau_levenshtein", "jaro", "jaro_winkler", or "hamming".
A double column with similarity scores in the range [0, 100].
Example
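As an illustration of how an edit distance can be turned into a score in the range [0, 100], the sketch below computes the Levenshtein distance and normalizes it by the longer string's length. The normalization formula is an assumption made for this example; fenic's exact scoring for each method may differ, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Score in [0, 100]; assumed normalization by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"), round(similarity("kitten", "sitting"), 2))
```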
compute_fuzzy_token_sort_ratio
Compute fuzzy similarity after sorting tokens in each string.
Transcript Parsing
parse_transcript
Parses a transcript from text to a structured format with unified schema.
The input string column or column name containing transcript text.
The format of the transcript: "srt", "webvtt", or "generic".
A column containing an array of structured transcript entries with fields: index, speaker, start_time, end_time, duration, content, format.

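To show what parsing a transcript into structured entries involves, the sketch below reads SRT text into dictionaries using the field names listed above. It is a plain-Python illustration, not fenic's parser: `parse_srt` is a hypothetical helper, expressing times as seconds and leaving speaker as None are assumptions, and malformed timestamps are not handled.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds (assumed unit)."""
    h, m, s, ms = (int(g) for g in TIMESTAMP.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Parse SRT text into a list of entry dictionaries."""
    entries = []
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip blocks without index, timing line, and content
        start, end = (to_seconds(part) for part in lines[1].split("-->"))
        entries.append({
            "index": int(lines[0]),
            "speaker": None,  # assumption: plain SRT carries no speaker label
            "start_time": start,
            "end_time": end,
            "duration": end - start,
            "content": " ".join(lines[2:]),
            "format": "srt",
        })
    return entries


sample = (
    "1\n00:00:01,000 --> 00:00:03,500\nHello there.\n\n"
    "2\n00:00:04,000 --> 00:00:06,000\nGeneral Kenobi."
)
entries = parse_srt(sample)
print(entries[0])
```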