Overview
The /siaa/keywords/<nombre_doc> endpoint returns the TF-IDF (Term Frequency-Inverse Document Frequency) keywords automatically extracted from the specified document. This is useful for debugging routing issues and for seeing which terms the system considers most relevant for each document.
Endpoint
Parameters
Document filename (case-insensitive). Spaces can be encoded as + or %20.
Example: acuerdo_no._psaa16-10476.md
Response
The normalized document name
List of top TF-IDF keywords for this document, ordered by relevance score. The system extracts up to 20 keywords per document.
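To make the ordering and the 20-keyword cap concrete, here is a minimal sketch of TF-IDF ranking. It is not the system's actual implementation: the function name `top_keywords`, the smoothed IDF formula, and the alphabetical tie-break are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def top_keywords(doc_tokens, corpus_tokens, limit=20):
    """Rank the terms of one document by TF-IDF and keep at most `limit`.

    Hypothetical helper: the real scoring formula is not shown in the source.
    """
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))

    tf = Counter(doc_tokens)
    total = len(doc_tokens) or 1
    scores = {
        # Term frequency times a smoothed inverse document frequency (assumed).
        term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:limit]


corpus = [
    ["acuerdo", "psaa16", "psaa16", "juez"],
    ["acuerdo", "juez", "circular"],
]
print(top_keywords(corpus[0], corpus))
```

A term that is frequent in one document but rare across the corpus ("psaa16" here) ranks above terms shared by every document.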
Error Response
If the document is not found, the endpoint responds with status 404 and an error message indicating that the document was not found.
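A client can separate the not-found case from a successful lookup along these lines. This is a sketch under assumptions: the body is assumed to be JSON, success is assumed to be status 200, and the names `DocumentNotFound`, `parse_keywords_response`, and `fetch_keywords` are hypothetical.

```python
import json
import urllib.error
import urllib.request


class DocumentNotFound(Exception):
    """Raised when the endpoint answers 404 for the requested document."""


def parse_keywords_response(status: int, body: bytes) -> dict:
    # Assumption: the payload is JSON; the source does not show its exact shape.
    if status == 404:
        raise DocumentNotFound(body.decode("utf-8", errors="replace"))
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    return json.loads(body)


def fetch_keywords(url: str) -> dict:
    """Request the given keywords URL and return the decoded payload."""
    try:
        with urllib.request.urlopen(url) as resp:
            return parse_keywords_response(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, so a 404 lands here.
        return parse_keywords_response(err.code, err.read())
```

Keeping the status handling in a separate function makes it possible to test the 404 path without a running server.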
Example
Request
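As an illustration, the request URL can be built as follows. The host and port in `BASE_URL` are hypothetical placeholders, and `keywords_url` is a helper name introduced only for this sketch; the path and the sample filename come from the sections above.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:5000"  # hypothetical host and port


def keywords_url(nombre_doc: str, base_url: str = BASE_URL) -> str:
    """Build the keywords URL for a document filename."""
    # quote() encodes spaces as %20; safe="" also escapes "/" so the
    # filename always stays a single path segment.
    return f"{base_url}/siaa/keywords/{quote(nombre_doc, safe='')}"


print(keywords_url("acuerdo_no._psaa16-10476.md"))
```

Filenames containing spaces are percent-encoded by the same helper, which matches the %20 form the endpoint accepts.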
Response
Use Cases
Debugging Document Routing
When a user query isn’t finding the expected document, check the keywords to understand which terms the routing algorithm associates with that document.
Related setting: KEYWORDS_MANUALES configuration.
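The check described above can be sketched as a comparison between the terms of the failing query and the keyword list returned for the expected document. The helper name `missing_query_terms` and the whitespace tokenization are assumptions; the real router may split and normalize queries differently.

```python
def missing_query_terms(query: str, keywords: list[str]) -> list[str]:
    """Return the query terms that are absent from a document's keywords.

    Assumption: the query is split on whitespace and compared case-insensitively.
    """
    known = {k.lower() for k in keywords}
    return [term for term in query.lower().split() if term not in known]


# Keywords as they might be returned for the expected document (sample values).
doc_keywords = ["acuerdo", "psaa16", "10476"]
print(missing_query_terms("Acuerdo PSAA16 vacaciones", doc_keywords))
```

Terms reported as missing are the ones the router cannot match against that document's extracted keywords.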
Understanding Document Content
Quickly understand the main topics covered in a document without reading the full text.
Validating TF-IDF Extraction
After updating documents or changing the tokenization rules, verify that important terms (including codes such as “psaa16” or “10476”) are being extracted correctly.
Notes
- Keywords are automatically generated during document loading using TF-IDF scoring
- The system filters out stopwords and short tokens (fewer than 3 characters)
- Alphanumeric terms with letters are always included (e.g., “psaa16”, “art5”)
- Pure numeric terms are included only if they have 4+ digits (e.g., “10476”, “2016”)
- Keywords are cached until documents are reloaded
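The filtering rules listed above can be sketched as a single predicate. This is an illustration, not the system's code: the stopword set is a small hypothetical sample, the name `keep_token` is invented here, and applying the stopword and length checks before the alphanumeric rule is an assumption about rule order.

```python
# Hypothetical sample; the real stopword list is not shown in the source.
STOPWORDS = {"de", "la", "el", "los", "las", "que", "para", "por", "con", "del"}


def keep_token(token: str) -> bool:
    """Decide whether a token is eligible to become a keyword."""
    token = token.lower()
    # Drop stopwords and tokens shorter than 3 characters.
    if token in STOPWORDS or len(token) < 3:
        return False
    # Pure numeric tokens are kept only with 4 or more digits.
    if token.isdecimal():
        return len(token) >= 4
    # Remaining alphanumeric tokens contain at least one letter, so keep them.
    return token.isalnum()


for sample in ["psaa16", "art5", "10476", "2016", "476", "los"]:
    print(sample, keep_token(sample))
```

Under these assumptions "psaa16", "art5", "10476", and "2016" pass the filter, while "476" and the stopword "los" are dropped.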