
Overview

OWASP Nest uses advanced AI to provide intelligent search, contextual recommendations, and automated insights across projects, chapters, events, and community resources.

Architecture

The AI system is built on three core components:

Agentic RAG

Self-correcting retrieval with iterative refinement

Vector Embeddings

Semantic search with pgvector and OpenAI embeddings

Content Extraction

Structured extraction from OWASP entities

AI Models

Chunk Model

Text chunks with vector embeddings for semantic search:
# backend/apps/ai/models/chunk.py:12
class Chunk(TimestampedModel):
    """AI Chunk model for storing text chunks with embeddings."""
    
    context = models.ForeignKey(Context, on_delete=models.CASCADE, related_name="chunks")
    embedding = VectorField(verbose_name="Embedding", dimensions=1536)
    text = models.TextField(verbose_name="Text")
Features:
  • 1536-dimensional embeddings (OpenAI text-embedding-3-small)
  • PostgreSQL pgvector for similarity search
  • Linked to parent context entities
  • Unique constraint on (context, text)

Context Model

Generated context for OWASP entities:
# backend/apps/ai/models/context.py:15
class Context(TimestampedModel):
    """Context model for storing generated text related to OWASP entities."""
    
    content = models.TextField(verbose_name="Generated Text")
    entity_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    entity_id = models.PositiveIntegerField()
    entity = GenericForeignKey("entity_type", "entity_id")
    source = models.CharField(max_length=100, blank=True, default="")
Features:
  • Generic foreign key to any entity type
  • Multiple contexts per entity (different sources)
  • Automatic chunking into searchable pieces
  • Unique constraint on (entity_type, entity_id, source)
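Because of the unique constraint, writes behave like an upsert keyed on (entity_type, entity_id, source). A minimal pure-Python sketch of that rule (the real code goes through the Django ORM, e.g. an update_or_create on the same key):

```python
# Illustrative only: models the Context uniqueness rule with a dict keyed
# on (entity_type, entity_id, source), mimicking an ORM upsert.
contexts: dict[tuple[str, int, str], str] = {}

def upsert_context(entity_type: str, entity_id: int, source: str, content: str) -> None:
    """At most one Context per (entity_type, entity_id, source)."""
    contexts[(entity_type, entity_id, source)] = content

upsert_context("project", 42, "github", "Repository description.")
upsert_context("project", 42, "github", "Refreshed description.")    # replaces, no duplicate
upsert_context("project", 42, "owasp_site", "Project page content.")  # second source, new row
```

Writing the same (entity_type, entity_id, source) key twice updates the existing context; a different source creates an additional one, which is how one entity accumulates multiple contexts.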

Content Extraction

Each entity type has a specialized extractor:

Project Extractor

# backend/apps/ai/common/extractors/project.py:6
def extract_project_content(project) -> tuple[str, str]:
    """Extract structured content from project data.
    
    Returns:
        tuple[str, str]: (prose_content, metadata_content)
    """
Extracted Fields:

Prose Content:
  • Project description
  • AI-generated summary
  • Repository description
  • Repository topics
Metadata Content:
  • Project name and key
  • Project level and type
  • Programming languages
  • Topics, tags, custom tags
  • Licenses
  • Statistics (stars, forks, contributors, releases, issues)
  • Project leaders
  • Related URLs
  • Timestamps (created, updated, released)
  • Health score
  • Active status
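To make the two-part return value concrete, here is a hedged stand-in (the real extractor reads fields from a Project model instance; the dict input and field choices below are purely illustrative):

```python
# Hypothetical stand-in for extract_project_content; the real function
# reads a Django model instance, not a dict.
def extract_project_content(project: dict) -> tuple[str, str]:
    # Prose: free-text fields joined into a readable paragraph.
    prose = " ".join(
        part for part in (project.get("description"), project.get("summary")) if part
    )
    # Metadata: structured fields rendered as "key: value" lines.
    metadata = "\n".join(
        f"{key}: {value}" for key, value in sorted(project.get("meta", {}).items())
    )
    return prose, metadata

prose, metadata = extract_project_content({
    "description": "A web application security scanner.",
    "summary": "Automated vulnerability discovery.",
    "meta": {"name": "OWASP ZAP", "level": "flagship", "stars": 12000},
})
context_content = f"{prose}\n\n{metadata}"  # combined text stored as a Context
```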

Chapter Extractor

# backend/apps/ai/common/extractors/chapter.py
Extracts:
  • Chapter name and location
  • Country, region, postal code
  • Meetup group information
  • Leader information
  • Geographic coordinates
  • Activity statistics

Event Extractor

# backend/apps/ai/common/extractors/event.py
Extracts:
  • Event name and description
  • Event category
  • Start and end dates
  • Location information
  • Registration URL
  • AI-generated summary

Committee Extractor

# backend/apps/ai/common/extractors/committee.py
Extracts:
  • Committee name and purpose
  • Leadership information
  • Activity and deliverables

Repository Extractor

# backend/apps/ai/common/extractors/repository.py
Extracts:
  • Repository metadata
  • Issue and PR information
  • Contributor statistics
  • Topics and languages

Text Chunking

Content is split into searchable chunks:
# backend/apps/ai/models/chunk.py:37
@staticmethod
def split_text(text: str) -> list[str]:
    """Split text into chunks."""
    return RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=20,
        length_function=len,
        separators=["\n\n", "\n", " ", ""],
    ).split_text(text)
Configuration:
  • Chunk size: 200 characters
  • Chunk overlap: 20 characters
  • Separators: paragraph > line > space > character
The 20-character overlap reduces the risk that important information is split across chunk boundaries.
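The size/overlap mechanics can be sketched with a naive sliding window (the real splitter is LangChain's RecursiveCharacterTextSplitter, which additionally prefers paragraph, line, and space boundaries over mid-word cuts):

```python
def sliding_window_split(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Naive character window: each chunk starts 180 characters after the
    previous one, so consecutive chunks share a 20-character overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = sliding_window_split("a" * 450)
# 450 chars -> chunks of 200, 200, and 90 characters
```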

Agentic RAG

The Agentic RAG system provides self-correcting question answering:
# backend/apps/ai/agent/agent.py:19
class AgenticRAGAgent:
    """LangGraph-based controller for agentic RAG with self-correcting retrieval."""

Workflow

1. Retrieve: search the vector database for relevant chunks
2. Generate: generate an answer using the retrieved context
3. Evaluate: assess answer quality and relevance
4. Refine or Complete: either refine the retrieval with feedback or return the final answer
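The self-correcting loop can be sketched in plain Python (the actual agent wires these steps as LangGraph nodes; the function names and iteration cap here are illustrative):

```python
# Illustrative control flow only; the real agent is LangGraph-based.
MAX_ITERATIONS = 3  # assumed cap, not taken from the codebase

def run_agent(query: str, retrieve, generate, evaluate) -> dict:
    """Retrieve -> generate -> evaluate, refining with feedback until the
    evaluator is satisfied or the iteration cap is reached."""
    state = {"query": query, "iteration": 0, "feedback": None, "history": []}
    while state["iteration"] < MAX_ITERATIONS:
        state["iteration"] += 1
        chunks = retrieve(state["query"], state["feedback"])   # step 1
        answer = generate(state["query"], chunks)               # step 2
        verdict = evaluate(state["query"], answer)              # step 3
        state["history"].append(answer)
        if verdict["complete"]:                                  # step 4: complete
            return {"answer": answer, "iterations": state["iteration"]}
        state["feedback"] = verdict["feedback"]                  # step 4: refine
    return {"answer": state["history"][-1], "iterations": state["iteration"]}
```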

Agent Nodes

# backend/apps/ai/agent/nodes.py
class AgentNodes:
    def retrieve(self, state: dict) -> dict:
        """Retrieve relevant chunks from vector database."""
    
    def generate(self, state: dict) -> dict:
        """Generate answer using retrieved context."""
    
    def evaluate(self, state: dict) -> dict:
        """Evaluate answer quality."""
    
    def route_from_evaluation(self, state: dict) -> str:
        """Route to refine or complete based on evaluation."""

Retrieval Configuration

# backend/apps/ai/common/constants.py
DEFAULT_CHUNKS_RETRIEVAL_LIMIT = 10
DEFAULT_SIMILARITY_THRESHOLD = 0.7
Parameters:
  • limit - Number of chunks to retrieve (default: 10)
  • similarity_threshold - Minimum cosine similarity (default: 0.7)
  • content_types - Filter by entity types

State Management

Agent state tracks:
initial_state = {
    "query": query,
    "iteration": 0,
    "feedback": None,
    "history": [],
    "content_types": [],
    "limit": DEFAULT_CHUNKS_RETRIEVAL_LIMIT,
    "similarity_threshold": DEFAULT_SIMILARITY_THRESHOLD,
}

Response Format

result = agent.run(query="What is OWASP ZAP?")
# {
#     "answer": "...",
#     "iterations": 2,
#     "evaluation": {...},
#     "context_chunks": [...],
#     "history": [...],
#     "extracted_metadata": {...}
# }

Question Detection

The AI system includes question classification:
# backend/apps/slack/common/handlers/ai.py:42
question_detector = QuestionDetector()
if not question_detector.is_owasp_question(text=query):
    return get_default_response()
Filters:
  • Non-question statements
  • Off-topic queries
  • Spam or abuse
Similarity Search

Similarity search using pgvector:
# Cosine similarity search
from pgvector.django import CosineDistance

Chunk.objects.annotate(
    distance=CosineDistance('embedding', query_embedding)
).filter(
    distance__lt=1.0 - similarity_threshold
).order_by('distance')[:limit]
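pgvector's CosineDistance returns 1 - cosine similarity, which is why the filter converts a similarity threshold of 0.7 into a distance cutoff of 0.3. Checking the conversion by hand:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 for identical directions,
    1 for orthogonal vectors), matching pgvector's cosine operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

similarity_threshold = 0.7
max_distance = 1.0 - similarity_threshold  # the distance__lt cutoff, 0.3
```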

Management Commands

AI system maintenance commands:

Update Chunks

# Update chunks for all projects
python manage.py ai_update_project_chunks

# Update chunks for all chapters
python manage.py ai_update_chapter_chunks

# Update chunks for all events
python manage.py ai_update_event_chunks

# Update chunks for all committees
python manage.py ai_update_committee_chunks

# Update chunks for all repositories
python manage.py ai_update_repository_chunks

# Update chunks for Slack messages
python manage.py ai_update_slack_message_chunks

Update Context

# Generate context for all projects
python manage.py ai_update_project_context

# Generate context for all chapters
python manage.py ai_update_chapter_context

# Generate context for all events
python manage.py ai_update_event_context

Run Agent

# Test the agentic RAG system
python manage.py ai_run_agentic_rag "What are OWASP flagship projects?"

Embeddings Pipeline

1. Extract Content: use entity-specific extractors to generate text
2. Create Context: store the extracted content in the Context model
3. Chunk Text: split content into 200-character chunks with overlap
4. Generate Embeddings: create 1536-dimension vectors using the OpenAI API
5. Store Chunks: save chunks with embeddings to the database
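Stitched together, the steps look roughly like this; `embed` is a stand-in for the OpenAI embeddings call, and the dicts stand in for Context and Chunk rows:

```python
def embed(text: str) -> list[float]:
    """Stand-in for the OpenAI embeddings API; real vectors have 1536 dims."""
    return [float(len(text)), float(sum(map(ord, text[:8])))]

def split_text(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Simplified splitter; the real one is LangChain's recursive splitter."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_chunks(extracted_text: str) -> list[dict]:
    context = {"content": extracted_text}                    # step 2: create context
    pieces = split_text(context["content"])                  # step 3: chunk text
    return [                                                 # steps 4-5: embed and store
        {"text": piece, "embedding": embed(piece)} for piece in pieces
    ]

chunks = build_chunks("OWASP ZAP is a flagship project. " * 20)
```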

RAG Tools

The agent uses specialized tools:
backend/apps/ai/agent/tools/rag/

Retriever

# backend/apps/ai/agent/tools/rag/retriever.py
Retrieves relevant chunks using:
  • Vector similarity search
  • Metadata filtering
  • Content type filtering
  • Similarity threshold

Generator

# backend/apps/ai/agent/tools/rag/generator.py
Generates answers using:
  • Retrieved context chunks
  • Query understanding
  • Answer synthesis
  • Citation of sources

AI Configuration

Environment Variables

OPENAI_API_KEY=sk-...              # OpenAI API key
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4-turbo
AI_CHUNKS_RETRIEVAL_LIMIT=10
AI_SIMILARITY_THRESHOLD=0.7
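One common way to surface these in settings is to read them with typed defaults (a sketch; the actual settings module may load them differently):

```python
import os

# Defaults mirror the values shown above; real setting names may differ.
OPENAI_EMBEDDING_MODEL = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
OPENAI_CHAT_MODEL = os.getenv("OPENAI_CHAT_MODEL", "gpt-4-turbo")
AI_CHUNKS_RETRIEVAL_LIMIT = int(os.getenv("AI_CHUNKS_RETRIEVAL_LIMIT", "10"))
AI_SIMILARITY_THRESHOLD = float(os.getenv("AI_SIMILARITY_THRESHOLD", "0.7"))
```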

Database Settings

# PostgreSQL with pgvector extension
INSTALLED_APPS = [
    ...,
    'pgvector',
    'apps.ai',
]

-- Run once in PostgreSQL to create the extension
CREATE EXTENSION IF NOT EXISTS vector;

Performance Considerations

Indexing

Vector columns use HNSW (Hierarchical Navigable Small World) indexes for fast approximate nearest neighbor search.
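With pgvector's Django integration, such an index can be declared on the model; this is a sketch using pgvector's defaults, and the actual index definition in the codebase may differ:

```python
# Sketch only: HnswIndex comes from the pgvector Python package; the
# parameters shown are pgvector's documented defaults, not values taken
# from the OWASP Nest codebase.
from pgvector.django import HnswIndex

class Meta:
    indexes = [
        HnswIndex(
            name="chunk_embedding_hnsw_idx",
            fields=["embedding"],
            m=16,                             # max connections per graph node
            ef_construction=64,               # build-time candidate list size
            opclasses=["vector_cosine_ops"],  # match CosineDistance queries
        ),
    ]
```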

Caching

Frequently accessed embeddings are cached in Redis to reduce API calls.

Batch Processing

Chunk updates process entities in batches to avoid rate limits.
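A minimal sketch of that batching pattern (the batch size is illustrative; the real commands' sizing and rate-limit handling may differ):

```python
def batched(items: list, size: int) -> list[list]:
    """Split a worklist into fixed-size batches so each batch can be
    embedded in a single API call, keeping request volume under rate limits."""
    return [items[i:i + size] for i in range(0, len(items), size)]

batches = batched(list(range(250)), 100)
# 250 entities -> three batches of 100, 100, and 50
```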

Usage Examples

Programmatic Access

from apps.ai.agent.agent import AgenticRAGAgent

# Initialize agent
agent = AgenticRAGAgent()

# Run query
result = agent.run(query="What is the OWASP Top 10?")

print(result["answer"])
print(f"Iterations: {result['iterations']}")
print(f"Context chunks: {len(result['context_chunks'])}")

Slack Integration

/ai What is OWASP ZAP used for?
/ai How do I contribute to OWASP projects?
/ai What are the flagship projects?

Direct API

curl -X POST https://nest.owasp.org/api/ai/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is OWASP Juice Shop?"}'

Code Reference

Key implementation files:
  • Agent: backend/apps/ai/agent/agent.py:19
  • Nodes: backend/apps/ai/agent/nodes.py
  • Chunk Model: backend/apps/ai/models/chunk.py:12
  • Context Model: backend/apps/ai/models/context.py:15
  • Extractors: backend/apps/ai/common/extractors/
  • RAG Tools: backend/apps/ai/agent/tools/rag/

Monitoring

Enable debug logging to track agent iterations and retrieval performance:
import logging
logging.getLogger('apps.ai').setLevel(logging.DEBUG)

Future Enhancements

Coming Soon

  • Multi-modal embeddings (images, code)
  • Cross-project recommendations
  • Trend analysis and insights
  • Automated issue triage
  • Project health predictions

Related:
  • Slack Bot: access AI via the /ai command
  • Search: combines keyword and semantic search
  • Projects: projects indexed for AI
