The parseContent() function transforms the raw text of a WhatsApp chat export into an array of structured news item objects. Each object carries the fields needed for categorization, curation, and dispatch.

Message splitting

The parser uses a lookahead regex to split the raw log at each message boundary without consuming the delimiter:
const messages = content.split(
  /\n(?=\[\d{1,2}\/\d{1,2}\/\d{2,4}, \d{1,2}:\d{2}:\d{2}.*?\])/
);
The pattern matches a newline that is immediately followed by a WhatsApp timestamp in the format [DD/MM/YY, HH:MM:SS]. Because the timestamp sits inside a lookahead, the split does not consume it, so every message segment still begins with its own timestamp.
WhatsApp timestamp formats vary across locales and versions. The regex accepts one- or two-digit day, month, and hour values and two- to four-digit years. The lazy .*? before the closing bracket also tolerates a trailing suffix such as an AM/PM marker.
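As a minimal sketch of the split in action, the snippet below runs the same pattern over a small, made-up export. The sample log and the sender name are assumptions for illustration only; the first message spans two lines to show that a newline inside a message body does not trigger a split.

```javascript
// Same boundary pattern as above: a newline followed by a bracketed timestamp.
const boundary = /\n(?=\[\d{1,2}\/\d{1,2}\/\d{2,4}, \d{1,2}:\d{2}:\d{2}.*?\])/;

// Hypothetical sample export; the first message spans two lines.
const sampleLog = [
  "[25/03/25, 14:30:22] Ana: LVI. El Show de la Mañana",
  "08.00-09.00 Entrevista al ministro",
  "[25/03/25, 14:31:05] Ana: https://example.com/nota",
].join("\n");

const segments = sampleLog.split(boundary);

console.log(segments.length); // 2
console.log(segments[1]);     // [25/03/25, 14:31:05] Ana: https://example.com/nota
```

The newline before 08.00-09.00 is not followed by a bracketed timestamp, so the continuation line stays attached to the first segment.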

ID generation

Each parsed message receives a string ID derived from its date and time components:
const id_ = splitDate[0] + splitDate[1] + splitDate[2]
           + splitTime[0] + splitTime[1] + splitTime[2];
The timestamp is first split into a date array ([day, month, year]) and a time array ([hours, minutes, seconds]). Concatenating all six parts produces an ID like 250325143022, which is stable across re-uploads of the same log as long as the original message timestamp does not change. Because the ID has one-second resolution, it is unique only when no two messages in the log share the same timestamp.
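The page shows only the final concatenation, so the following is a sketch of how the two arrays might be obtained from a message segment. The helper name buildId and the capture regex are assumptions, not the parser's actual code.

```javascript
// Hypothetical helper: the source shows only the concatenation step.
function buildId(segment) {
  // Capture the date and the time inside the leading [ ... ] timestamp.
  const match = segment.match(
    /^\[(\d{1,2}\/\d{1,2}\/\d{2,4}), (\d{1,2}:\d{2}:\d{2})/
  );
  if (!match) return null; // no leading timestamp, so no ID

  const splitDate = match[1].split("/"); // [day, month, year]
  const splitTime = match[2].split(":"); // [hours, minutes, seconds]

  return splitDate[0] + splitDate[1] + splitDate[2]
       + splitTime[0] + splitTime[1] + splitTime[2];
}

console.log(buildId("[25/03/25, 14:30:22] Ana: LVI. El Show de la Mañana"));
// 250325143022
```

The parts are concatenated exactly as they appear in the export, without zero-padding, so the ID length depends on how the export writes single-digit days, months, and hours.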

Extracted fields

For each message segment the parser builds an object with the following fields:
| Field | Type | Description |
| --- | --- | --- |
| id_ | string | Unique identifier derived from the message timestamp |
| sendDate | string | Date the message was sent (from the WhatsApp timestamp) |
| sendTime | string | Time the message was sent |
| media | string | Media outlet name extracted from the message body |
| program | string | Program or section name extracted from the message body |
| text | string | Full message body text |
| resume | string | Raw summary text (operator-provided or extracted) |
| iaResume | string | Editable summary used for dispatch (initially mirrors resume) |
| link | string | First URL found in the message body |
| startTime | string | Broadcast start time, if present in the message body |
| endTime | string | Broadcast end time, if present in the message body |
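To make the shape concrete, here is a sketch of a factory that returns an item carrying every documented field. The function name createNewsItem and the choice of empty strings as defaults are assumptions (the page only states an empty-string default for link); the sample values are illustrative.

```javascript
// Hypothetical factory: returns an item with all eleven documented fields.
// Defaulting every field to "" is an assumption made for this sketch.
function createNewsItem(overrides = {}) {
  return {
    id_: "",
    sendDate: "",
    sendTime: "",
    media: "",
    program: "",
    text: "",
    resume: "",
    iaResume: "",
    link: "",
    startTime: "",
    endTime: "",
    ...overrides, // values extracted from the message replace the defaults
  };
}

// Illustrative values only.
const item = createNewsItem({
  id_: "250325143022",
  sendDate: "25/03/25",
  sendTime: "14:30:22",
  media: "LVI",
  program: "El Show de la Mañana",
});

console.log(Object.keys(item).length); // 11
```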

Media and program extraction

The extractMediaAndProgram() helper attempts to determine the outlet and program name from the message body using three strategies, tried in order:
1. Dot-separated format

The most common format used by the monitoring team is "OUTLET. Program name". The helper splits on the first period and trims whitespace around both parts.
LVI. El Show de la Mañana → media: "LVI", program: "El Show de la Mañana"

2. URL-based fallback

If no period is found, the helper checks for a URL in the message body. If a link is present, the domain is used as the media name and the program field is left blank.

3. Acronym fallback

If neither a period nor a URL is found, the helper looks for an all-caps token at the start of the message and treats it as the media acronym. The remaining text becomes the program field.
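The three strategies can be sketched as a single function. This is not the project's implementation: the page does not show the helper's internals, so the regexes, the restriction to the first line, the removal of URLs before looking for a period (every domain name contains one), and the empty-string result when nothing matches are all assumptions.

```javascript
// Sketch only: the real helper's internals are not shown on this page.
const URL_PATTERN = /https?:\/\/([^\/\s:]+)[^\s]*/; // group 1 = domain

function extractMediaAndProgram(body) {
  // Assumption: only the first line is inspected for the outlet header, and
  // URLs are removed first because every domain name contains a period.
  const firstLine = body.trim().split("\n")[0];
  const header = firstLine.replace(/https?:\/\/[^\s]+/g, "").trim();

  // 1. Dot-separated format: "OUTLET. Program name"
  const dot = header.indexOf(".");
  if (dot > 0) {
    return {
      media: header.slice(0, dot).trim(),
      program: header.slice(dot + 1).trim(),
    };
  }

  // 2. URL-based fallback: use the domain, leave the program blank
  const url = body.match(URL_PATTERN);
  if (url) {
    return { media: url[1], program: "" };
  }

  // 3. Acronym fallback: leading all-caps token, rest becomes the program
  const [first, ...rest] = header.split(/\s+/);
  if (/^[A-Z0-9]{2,}$/.test(first)) {
    return { media: first, program: rest.join(" ") };
  }

  // Assumption: nothing matched, so both fields stay empty.
  return { media: "", program: "" };
}

console.log(extractMediaAndProgram("LVI. El Show de la Mañana"));
// { media: 'LVI', program: 'El Show de la Mañana' }
```

A header that has no outlet prefix but contains a dotted time such as 08.00 would be caught by the first strategy in this sketch, which is one reason the order and scope of the checks matter.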

Time range extraction

Broadcast time ranges (e.g., 08.00-09.00 or 14:30-15:00) are extracted using:
const timeRangeRegex = /(\d{2}[.:]\d{2})-(\d{2}[.:]\d{2})/;
The regex captures two groups separated by a hyphen. Both colon (:) and dot (.) separators are accepted because operators use both conventions. Capture group 1 becomes startTime and capture group 2 becomes endTime.

Link extraction

The first URL in the message body is extracted using:
const linkRegex = /(https?:\/\/[^\s]+)/;
The result is stored in the link field. If no URL is present the field is set to an empty string.
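The time-range and link extractions can be sketched together using the two regexes quoted in this section. The helper name extractTimesAndLink is hypothetical, and returning empty strings for a missing time range is an assumption (the page states that default only for link).

```javascript
const timeRangeRegex = /(\d{2}[.:]\d{2})-(\d{2}[.:]\d{2})/;
const linkRegex = /(https?:\/\/[^\s]+)/;

// Hypothetical helper combining both extractions on a message body.
function extractTimesAndLink(body) {
  const range = body.match(timeRangeRegex);
  const url = body.match(linkRegex);
  return {
    startTime: range ? range[1] : "", // assumption: "" when no range is found
    endTime: range ? range[2] : "",
    link: url ? url[1] : "",          // "" when no URL is present
  };
}

console.log(
  extractTimesAndLink("LVI. El Show de la Mañana 08.00-09.00 https://example.com/nota")
);
// { startTime: '08.00', endTime: '09.00', link: 'https://example.com/nota' }
```

The times are returned exactly as the operator typed them, so 08.00 and 08:00 are not normalized to a common form. The time pattern also requires two-digit hours, so a range written as 8.00-9.00 would not match.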

IA.TXT transcript splitting

Some messages contain an AI-generated transcript appended after an IA.TXT: marker. The parser splits these messages at the marker:
const transcriptionRegex = /IA\.TXT:\s*/;
The text before the marker is treated as the operator summary (resume). The text after it is stored separately and used to populate iaResume for curation.
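A minimal sketch of the split, assuming the marker appears at most once per message. The helper name splitTranscript and the behaviour when no marker is present (the whole body is treated as the summary) are assumptions.

```javascript
const transcriptionRegex = /IA\.TXT:\s*/;

// Hypothetical helper: separates the operator summary from the AI transcript.
function splitTranscript(body) {
  const match = transcriptionRegex.exec(body);
  if (!match) {
    // Assumption: without a marker the whole body is the operator summary.
    return { resume: body.trim(), transcript: "" };
  }
  return {
    resume: body.slice(0, match.index).trim(),
    // Skip the marker and any whitespace that follows it.
    transcript: body.slice(match.index + match[0].length).trim(),
  };
}

console.log(splitTranscript("Resumen del operador.\nIA.TXT: Texto transcrito"));
// { resume: 'Resumen del operador.', transcript: 'Texto transcrito' }
```

Slicing at the first match, rather than calling split(), keeps any later occurrence of the marker inside the transcript text.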

Automatic filtering

The parser silently discards messages that do not represent press clippings. Two kinds of message are filtered out.
System messages generated by WhatsApp itself are excluded. This includes:
  • The end-to-end encryption notice (“Messages and calls are end-to-end encrypted…”)
  • Group creation notices (“[Operator] created group [Group name]”)
  • Welcome messages added when someone joins the group
These are identified by matching against known system message patterns before any field extraction is attempted.
Messages whose body consists solely of a media attachment placeholder (e.g., image omitted) with no accompanying text are filtered out. These entries carry no useful press clipping data and would produce empty news items.
Filtering is applied before ID generation. Filtered messages do not appear in the ingestion view and cannot be recovered without re-uploading the original file.
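The filtering step can be sketched as a predicate applied to each message segment. The pattern lists below are assumptions built only from the examples quoted in this section; the parser's actual list of system message patterns and placeholder wordings is not shown here.

```javascript
// Assumed patterns, based only on the examples quoted in this section.
const systemPatterns = [
  /Messages and calls are end-to-end encrypted/i,
  /created group/i,
];

// Assumed placeholder wordings for attachments exported without media.
const mediaOnlyPattern = /^(image|video|audio|sticker|GIF|document) omitted$/i;

function isPressClipping(segment) {
  if (systemPatterns.some((pattern) => pattern.test(segment))) return false;

  // Drop the leading "[timestamp] " and, if present, the "Sender: " prefix.
  // Exports may also carry an invisible left-to-right mark (U+200E).
  const body = segment
    .replace(/^\[[^\]]*\]\s*/, "")
    .replace(/^[^:\n]*:\s*/, "")
    .replace(/\u200e/g, "")
    .trim();

  // A placeholder followed by a caption does not match, so it is kept.
  return !mediaOnlyPattern.test(body);
}

// Hypothetical segments, as produced by the message-splitting step.
const rawSegments = [
  "[25/03/25, 14:00:00] Prensa: Messages and calls are end-to-end encrypted.",
  '[25/03/25, 14:00:01] Ana created group "Prensa"',
  "[25/03/25, 14:29:10] Ana: image omitted",
  "[25/03/25, 14:30:22] Ana: LVI. El Show de la Mañana 08.00-09.00",
];

const clippings = rawSegments.filter(isPressClipping);
console.log(clippings.length); // 1
```

Unanchored patterns such as "created group" would also drop a genuine clipping that happens to contain that phrase, so a production list would likely need tighter matching.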

Ingestion

How the source file is uploaded and validated before parsing.

Categorization

How parsed items are matched to government areas.
