The parseContent() function transforms the raw text of a WhatsApp chat export into an array of structured news item objects. Each object carries the fields needed for categorization, curation, and dispatch.
Message splitting
The parser uses a lookahead regex to split the raw log at each message boundary without consuming the delimiter. Boundaries are timestamps of the form [DD/MM/YY, HH:MM:SS]. Because the lookahead leaves them in place, each message segment begins with its own timestamp rather than losing it.
WhatsApp uses different timestamp formats across locales and versions. The regex accommodates one- or two-digit day/month values and two- or four-digit year values.
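
The splitting step can be sketched as follows. This is a minimal illustration, not the parser's actual code: the names MESSAGE_BOUNDARY and splitMessages are hypothetical, and the pattern assumes slash-separated dates and a 24-hour time with seconds, which may not match every locale.

```javascript
// Hypothetical boundary pattern: "[D/M/YY, HH:MM:SS]" with a 1-2 digit day and
// month and a 2- or 4-digit year. The lookahead (?=...) matches the position
// just before each timestamp, so split() does not remove the timestamp itself.
const MESSAGE_BOUNDARY =
  /(?=\[\d{1,2}\/\d{1,2}\/(?:\d{4}|\d{2}), \d{1,2}:\d{2}:\d{2}\])/;

function splitMessages(rawLog) {
  return rawLog
    .split(MESSAGE_BOUNDARY)
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0);
}

const sample =
  "[25/03/25, 14:30:22] Ana: RADIO UNO. Morning news.\n" +
  "[5/4/2025, 08:01:09] Luis: second message\nwith a continuation line";

// Two segments, each still starting with its timestamp.
console.log(splitMessages(sample));
```

Continuation lines that do not start with a timestamp stay attached to the preceding message, which is why the split is keyed on the timestamp rather than on line breaks.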
ID generation
Each parsed message receives a unique string ID derived from its date and time components. The timestamp is split into a date array ([day, month, year]) and a time array ([hours, minutes, seconds]). Concatenating all six parts produces an ID like 250325143022, which is stable across re-uploads of the same log as long as the original message timestamp does not change.
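
A minimal sketch of this ID scheme is shown below. The function name buildMessageId is hypothetical, and the sketch assumes the segment starts with a bracketed timestamp. It concatenates the parts exactly as they appear, with no zero-padding.

```javascript
// Builds an ID by concatenating day, month, year, hours, minutes, seconds.
// Returns null when the segment does not start with "[date, time]".
function buildMessageId(segment) {
  const match = segment.match(/^\[([^,\]]+), ([^\]]+)\]/);
  if (!match) return null;
  const dateParts = match[1].split("/"); // [day, month, year]
  const timeParts = match[2].split(":"); // [hours, minutes, seconds]
  return [...dateParts, ...timeParts].join("");
}

console.log(buildMessageId("[25/03/25, 14:30:22] Ana: hello")); // "250325143022"
```

Because the ID depends only on the timestamp, two messages sent within the same second would receive the same ID under this sketch.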
Extracted fields
For each message segment the parser builds an object with the following fields:

| Field | Type | Description |
|---|---|---|
| id_ | string | Unique identifier derived from the message timestamp |
| sendDate | string | Date the message was sent (from the WhatsApp timestamp) |
| sendTime | string | Time the message was sent |
| media | string | Media outlet name extracted from the message body |
| program | string | Program or section name extracted from the message body |
| text | string | Full message body text |
| resume | string | Raw summary text (operator-provided or extracted) |
| iaResume | string | Editable summary used for dispatch (initially mirrors resume) |
| link | string | First URL found in the message body |
| startTime | string | Broadcast start time, if present in the message body |
| endTime | string | Broadcast end time, if present in the message body |
Media and program extraction
The extractMediaAndProgram() helper attempts to determine the outlet and program name from the message body using three strategies, tried in order:
Dot-separated format
The most common format used by the monitoring team is OUTLET. Program name. The helper splits on the first period and trims whitespace to separate the two parts.
URL-based fallback
If no period is found, the helper checks for a URL in the message body. If a link is present, the domain is used as the media name and the program field is left blank.
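
The two strategies described above can be sketched as follows. This is an illustration under stated assumptions rather than the helper's real code: the name extractMediaAndProgramSketch and the URL pattern are hypothetical, the URL is removed before looking for a period (every domain name contains one), and any strategy not described here is represented only by the empty fallback.

```javascript
const URL_PATTERN = /https?:\/\/\S+/;

function extractMediaAndProgramSketch(body) {
  const urlMatch = body.match(URL_PATTERN);
  // Assumption: ignore the URL when searching for the separating period.
  const textOnly = body.replace(URL_PATTERN, "").trim();

  // Strategy 1: "OUTLET. Program name." split on the first period.
  const dotIndex = textOnly.indexOf(".");
  if (dotIndex !== -1) {
    const media = textOnly.slice(0, dotIndex).trim();
    // Assumption: drop one trailing period from the program name.
    const program = textOnly.slice(dotIndex + 1).trim().replace(/\.$/, "");
    return { media, program };
  }

  // Strategy 2: fall back to the link's domain, with a blank program.
  if (urlMatch) {
    try {
      return { media: new URL(urlMatch[0]).hostname, program: "" };
    } catch {
      // Malformed URL: fall through to the empty result.
    }
  }

  return { media: "", program: "" };
}

console.log(extractMediaAndProgramSketch("RADIO UNO. Morning news."));
console.log(extractMediaAndProgramSketch("See https://www.example.com/story/123"));
```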
Time range extraction
Broadcast time ranges (e.g., 08.00-09.00 or 14:30-15:00) are extracted with a regular expression. Both colon (:) and dot (.) separators are accepted because operators use both conventions. Capture group 1 becomes startTime and capture group 2 becomes endTime.
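
A minimal sketch of such a pattern follows. The regex and the name extractTimeRange are assumptions for illustration; the parser's actual expression may differ. Returning empty strings when no range is found is also an assumption.

```javascript
// Two capture groups: start time and end time. [:.] accepts either separator.
const TIME_RANGE = /(\d{1,2}[:.]\d{2})\s*-\s*(\d{1,2}[:.]\d{2})/;

function extractTimeRange(body) {
  const match = body.match(TIME_RANGE);
  if (!match) return { startTime: "", endTime: "" };
  return { startTime: match[1], endTime: match[2] };
}

console.log(extractTimeRange("RADIO UNO. Morning news. 08.00-09.00"));
console.log(extractTimeRange("TV DOS. Afternoon report 14:30-15:00"));
```

The values are returned as written by the operator, so a caller that needs a single format would have to normalize the separator afterwards.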
Link extraction
The first URL in the message body is extracted with a regular expression and stored in the link field. If no URL is present, the field is set to an empty string.
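
As a rough sketch, the extraction could look like the following. The pattern and the name extractLink are assumptions; the real expression may be stricter about trailing punctuation or supported schemes.

```javascript
// Returns the first http(s) URL in the body, or "" when there is none.
function extractLink(body) {
  const match = body.match(/https?:\/\/\S+/);
  return match ? match[0] : "";
}

console.log(extractLink("Story at https://example.com/a and https://example.org/b"));
console.log(extractLink("No link in this message") === ""); // true
```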
IA.TXT transcript splitting
Some messages contain an AI-generated transcript appended after an IA.TXT: marker. The parser splits these messages at the marker. The text before the marker is kept as the raw summary (resume). The text after it is stored separately and used to populate iaResume for curation.
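
The marker split can be sketched as below. The name splitTranscript is hypothetical, and treating the text before the marker as resume is an assumption. When no marker is present, the sketch lets iaResume mirror resume, as the field table describes.

```javascript
const IA_MARKER = "IA.TXT:";

function splitTranscript(body) {
  const markerIndex = body.indexOf(IA_MARKER);
  if (markerIndex === -1) {
    // No transcript: iaResume starts as a copy of resume.
    const resume = body.trim();
    return { resume, iaResume: resume };
  }
  return {
    resume: body.slice(0, markerIndex).trim(),
    iaResume: body.slice(markerIndex + IA_MARKER.length).trim(),
  };
}

console.log(splitTranscript("Minister interview summary. IA.TXT: Full transcript text"));
```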
Automatic filtering
The parser discards messages that do not represent press clippings. The following are silently filtered out:
WhatsApp system messages
Messages generated by WhatsApp itself are excluded. This includes:
- The end-to-end encryption notice (“Messages and calls are end-to-end encrypted…”)
- Group creation notices (“[Operator] created group [Group name]”)
- Welcome messages added when someone joins the group
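
A filter for these notices might be sketched as follows. The patterns are illustrative English-locale strings and the name isSystemMessage is hypothetical; the exact wording WhatsApp uses varies by language and app version, so the list would need adapting to the actual exports.

```javascript
// Assumed notice fragments; adjust to the wording found in real exports.
const SYSTEM_PATTERNS = [
  /messages and calls are end-to-end encrypted/i,
  /\bcreated group\b/i,
  /\bjoined using this group's invite link\b/i,
];

function isSystemMessage(body) {
  return SYSTEM_PATTERNS.some((pattern) => pattern.test(body));
}

const bodies = [
  "Messages and calls are end-to-end encrypted. No one outside of this chat can read them.",
  "Ana created group \"Press monitoring\"",
  "RADIO UNO. Morning news. 08.00-09.00",
];

// Keeps only the bodies that are not system notices.
console.log(bodies.filter((body) => !isSystemMessage(body)));
```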
Image-only messages
Messages whose body consists solely of a media attachment placeholder (e.g., image omitted) with no accompanying text are filtered out. These entries carry no useful press clipping data and would produce empty news items.
Ingestion
How the source file is uploaded and validated before parsing.
Categorization
How parsed items are matched to government areas.