
Overview

OWASP Nest uses advanced AI to provide intelligent search, contextual recommendations, and automated insights across projects, chapters, events, and community resources.

Architecture

The AI system is built on three core components:

Agentic RAG

Self-correcting retrieval with iterative refinement

Vector Embeddings

Semantic search with pgvector and OpenAI embeddings

Content Extraction

Structured extraction from OWASP entities

AI Models

Chunk Model

Text chunks with vector embeddings for semantic search:
# backend/apps/ai/models/chunk.py:12
class Chunk(TimestampedModel):
    """AI Chunk model for storing text chunks with embeddings."""
    
    context = models.ForeignKey(Context, on_delete=models.CASCADE, related_name="chunks")
    embedding = VectorField(verbose_name="Embedding", dimensions=1536)
    text = models.TextField(verbose_name="Text")
Features:
  • 1536-dimensional embeddings (OpenAI text-embedding-3-small)
  • PostgreSQL pgvector for similarity search
  • Linked to parent context entities
  • Unique constraint on (context, text)

Context Model

Generated context for OWASP entities:
# backend/apps/ai/models/context.py:15
class Context(TimestampedModel):
    """Context model for storing generated text related to OWASP entities."""
    
    content = models.TextField(verbose_name="Generated Text")
    entity_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    entity_id = models.PositiveIntegerField()
    entity = GenericForeignKey("entity_type", "entity_id")
    source = models.CharField(max_length=100, blank=True, default="")
Features:
  • Generic foreign key to any entity type
  • Multiple contexts per entity (different sources)
  • Automatic chunking into searchable pieces
  • Unique constraint on (entity_type, entity_id, source)
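Because of the unique constraint, writes behave like an upsert keyed on (entity_type, entity_id, source). A minimal pure-Python sketch of that rule (the real code goes through the Django ORM, e.g. an update_or_create on the same key):

```python
# Illustrative only: models the Context uniqueness rule with a dict keyed
# on (entity_type, entity_id, source), mimicking an ORM upsert.
contexts: dict[tuple[str, int, str], str] = {}

def upsert_context(entity_type: str, entity_id: int, source: str, content: str) -> None:
    """At most one Context per (entity_type, entity_id, source)."""
    contexts[(entity_type, entity_id, source)] = content

upsert_context("project", 42, "github", "Repository description.")
upsert_context("project", 42, "github", "Refreshed description.")    # replaces, no duplicate
upsert_context("project", 42, "owasp_site", "Project page content.")  # second source, new row
```

Writing the same (entity_type, entity_id, source) key twice updates the existing context; a different source creates an additional one, which is how one entity accumulates multiple contexts.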

Content Extraction

Each entity type has a specialized extractor:

Project Extractor

# backend/apps/ai/common/extractors/project.py:6
def extract_project_content(project) -> tuple[str, str]:
    """Extract structured content from project data.
    
    Returns:
        tuple[str, str]: (prose_content, metadata_content)
    """
Extracted Fields:

Prose Content:
  • Project description
  • AI-generated summary
  • Repository description
  • Repository topics
Metadata Content:
  • Project name and key
  • Project level and type
  • Programming languages
  • Topics, tags, custom tags
  • Licenses
  • Statistics (stars, forks, contributors, releases, issues)
  • Project leaders
  • Related URLs
  • Timestamps (created, updated, released)
  • Health score
  • Active status
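To make the two-part return value concrete, here is a hedged stand-in (the real extractor reads fields from a Project model instance; the dict input and field choices below are purely illustrative):

```python
# Hypothetical stand-in for extract_project_content; the real function
# reads a Django model instance, not a dict.
def extract_project_content(project: dict) -> tuple[str, str]:
    # Prose: free-text fields joined into a readable paragraph.
    prose = " ".join(
        part for part in (project.get("description"), project.get("summary")) if part
    )
    # Metadata: structured fields rendered as "key: value" lines.
    metadata = "\n".join(
        f"{key}: {value}" for key, value in sorted(project.get("meta", {}).items())
    )
    return prose, metadata

prose, metadata = extract_project_content({
    "description": "A web application security scanner.",
    "summary": "Automated vulnerability discovery.",
    "meta": {"name": "OWASP ZAP", "level": "flagship", "stars": 12000},
})
context_content = f"{prose}\n\n{metadata}"  # combined text stored as a Context
```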

Chapter Extractor

# backend/apps/ai/common/extractors/chapter.py
Extracts:
  • Chapter name and location
  • Country, region, postal code
  • Meetup group information
  • Leader information
  • Geographic coordinates
  • Activity statistics

Event Extractor

# backend/apps/ai/common/extractors/event.py
Extracts:
  • Event name and description
  • Event category
  • Start and end dates
  • Location information
  • Registration URL
  • AI-generated summary

Committee Extractor

# backend/apps/ai/common/extractors/committee.py
Extracts:
  • Committee name and purpose
  • Leadership information
  • Activity and deliverables

Repository Extractor

# backend/apps/ai/common/extractors/repository.py
Extracts:
  • Repository metadata
  • Issue and PR information
  • Contributor statistics
  • Topics and languages

Text Chunking

Content is split into searchable chunks:
# backend/apps/ai/models/chunk.py:37
@staticmethod
def split_text(text: str) -> list[str]:
    """Split text into chunks."""
    return RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=20,
        length_function=len,
        separators=["\n\n", "\n", " ", ""],
    ).split_text(text)
Configuration:
  • Chunk size: 200 characters
  • Chunk overlap: 20 characters
  • Separators: paragraph > line > space > character
The 20-character overlap reduces the risk that important information is split across chunk boundaries.
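The size/overlap mechanics can be sketched with a naive sliding window (the real splitter is LangChain's RecursiveCharacterTextSplitter, which additionally prefers paragraph, line, and space boundaries over mid-word cuts):

```python
def sliding_window_split(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Naive character window: each chunk starts 180 characters after the
    previous one, so consecutive chunks share a 20-character overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = sliding_window_split("a" * 450)
# 450 chars -> chunks of 200, 200, and 90 characters
```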

Agentic RAG

The Agentic RAG system provides self-correcting question answering:
# backend/apps/ai/agent/agent.py:19
class AgenticRAGAgent:
    """LangGraph-based controller for agentic RAG with self-correcting retrieval."""

Workflow

1. Retrieve: search the vector database for relevant chunks
2. Generate: generate an answer using the retrieved context
3. Evaluate: assess answer quality and relevance
4. Refine or Complete: either refine the retrieval with feedback or return the final answer
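The self-correcting loop can be sketched in plain Python (the actual agent wires these steps as LangGraph nodes; the function names and iteration cap here are illustrative):

```python
# Illustrative control flow only; the real agent is LangGraph-based.
MAX_ITERATIONS = 3  # assumed cap, not taken from the codebase

def run_agent(query: str, retrieve, generate, evaluate) -> dict:
    """Retrieve -> generate -> evaluate, refining with feedback until the
    evaluator is satisfied or the iteration cap is reached."""
    state = {"query": query, "iteration": 0, "feedback": None, "history": []}
    while state["iteration"] < MAX_ITERATIONS:
        state["iteration"] += 1
        chunks = retrieve(state["query"], state["feedback"])   # step 1
        answer = generate(state["query"], chunks)               # step 2
        verdict = evaluate(state["query"], answer)              # step 3
        state["history"].append(answer)
        if verdict["complete"]:                                  # step 4: complete
            return {"answer": answer, "iterations": state["iteration"]}
        state["feedback"] = verdict["feedback"]                  # step 4: refine
    return {"answer": state["history"][-1], "iterations": state["iteration"]}
```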

Agent Nodes

# backend/apps/ai/agent/nodes.py
class AgentNodes:
    def retrieve(self, state: dict) -> dict:
        """Retrieve relevant chunks from vector database."""
    
    def generate(self, state: dict) -> dict:
        """Generate answer using retrieved context."""
    
    def evaluate(self, state: dict) -> dict:
        """Evaluate answer quality."""
    
    def route_from_evaluation(self, state: dict) -> str:
        """Route to refine or complete based on evaluation."""

Retrieval Configuration

# backend/apps/ai/common/constants.py
DEFAULT_CHUNKS_RETRIEVAL_LIMIT = 10
DEFAULT_SIMILARITY_THRESHOLD = 0.7
Parameters:
  • limit - Number of chunks to retrieve (default: 10)
  • similarity_threshold - Minimum cosine similarity (default: 0.7)
  • content_types - Filter by entity types

State Management

Agent state tracks:
initial_state = {
    "query": query,
    "iteration": 0,
    "feedback": None,
    "history": [],
    "content_types": [],
    "limit": DEFAULT_CHUNKS_RETRIEVAL_LIMIT,
    "similarity_threshold": DEFAULT_SIMILARITY_THRESHOLD,
}

Response Format

result = agent.run(query="What is OWASP ZAP?")
# {
#     "answer": "...",
#     "iterations": 2,
#     "evaluation": {...},
#     "context_chunks": [...],
#     "history": [...],
#     "extracted_metadata": {...}
# }

Question Detection

The AI system includes question classification:
# backend/apps/slack/common/handlers/ai.py:42
question_detector = QuestionDetector()
if not question_detector.is_owasp_question(text=query):
    return get_default_response()
Filters:
  • Non-question statements
  • Off-topic queries
  • Spam or abuse
Similarity Search

Similarity search using pgvector:
# Cosine similarity search
from pgvector.django import CosineDistance

Chunk.objects.annotate(
    distance=CosineDistance('embedding', query_embedding)
).filter(
    distance__lt=1.0 - similarity_threshold
).order_by('distance')[:limit]
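pgvector's CosineDistance returns 1 - cosine similarity, which is why the filter converts a similarity threshold of 0.7 into a distance cutoff of 0.3. Checking the conversion by hand:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 for identical directions,
    1 for orthogonal vectors), matching pgvector's cosine operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

similarity_threshold = 0.7
max_distance = 1.0 - similarity_threshold  # the distance__lt cutoff, 0.3
```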

Management Commands

AI system maintenance commands:

Update Chunks

# Update chunks for all projects
python manage.py ai_update_project_chunks

# Update chunks for all chapters
python manage.py ai_update_chapter_chunks

# Update chunks for all events
python manage.py ai_update_event_chunks

# Update chunks for all committees
python manage.py ai_update_committee_chunks

# Update chunks for all repositories
python manage.py ai_update_repository_chunks

# Update chunks for Slack messages
python manage.py ai_update_slack_message_chunks

Update Context

# Generate context for all projects
python manage.py ai_update_project_context

# Generate context for all chapters
python manage.py ai_update_chapter_context

# Generate context for all events
python manage.py ai_update_event_context

Run Agent

# Test the agentic RAG system
python manage.py ai_run_agentic_rag "What are OWASP flagship projects?"

Embeddings Pipeline

1. Extract Content: use entity-specific extractors to generate text
2. Create Context: store the extracted content in the Context model
3. Chunk Text: split content into 200-character chunks with overlap
4. Generate Embeddings: create 1536-dimension vectors using the OpenAI API
5. Store Chunks: save chunks with embeddings to the database
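Stitched together, the steps look roughly like this; `embed` is a stand-in for the OpenAI embeddings call, and the dicts stand in for Context and Chunk rows:

```python
def embed(text: str) -> list[float]:
    """Stand-in for the OpenAI embeddings API; real vectors have 1536 dims."""
    return [float(len(text)), float(sum(map(ord, text[:8])))]

def split_text(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Simplified splitter; the real one is LangChain's recursive splitter."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_chunks(extracted_text: str) -> list[dict]:
    context = {"content": extracted_text}                    # step 2: create context
    pieces = split_text(context["content"])                  # step 3: chunk text
    return [                                                 # steps 4-5: embed and store
        {"text": piece, "embedding": embed(piece)} for piece in pieces
    ]

chunks = build_chunks("OWASP ZAP is a flagship project. " * 20)
```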

RAG Tools

The agent uses specialized tools:
backend/apps/ai/agent/tools/rag/

Retriever

# backend/apps/ai/agent/tools/rag/retriever.py
Retrieves relevant chunks using:
  • Vector similarity search
  • Metadata filtering
  • Content type filtering
  • Similarity threshold

Generator

# backend/apps/ai/agent/tools/rag/generator.py
Generates answers using:
  • Retrieved context chunks
  • Query understanding
  • Answer synthesis
  • Citation of sources

AI Configuration

Environment Variables

OPENAI_API_KEY=sk-...              # OpenAI API key
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4-turbo
AI_CHUNKS_RETRIEVAL_LIMIT=10
AI_SIMILARITY_THRESHOLD=0.7
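One common way to surface these in settings is to read them with typed defaults (a sketch; the actual settings module may load them differently):

```python
import os

# Defaults mirror the values shown above; real setting names may differ.
OPENAI_EMBEDDING_MODEL = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
OPENAI_CHAT_MODEL = os.getenv("OPENAI_CHAT_MODEL", "gpt-4-turbo")
AI_CHUNKS_RETRIEVAL_LIMIT = int(os.getenv("AI_CHUNKS_RETRIEVAL_LIMIT", "10"))
AI_SIMILARITY_THRESHOLD = float(os.getenv("AI_SIMILARITY_THRESHOLD", "0.7"))
```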

Database Settings

# PostgreSQL with pgvector extension
INSTALLED_APPS = [
    ...,
    'pgvector',
    'apps.ai',
]

-- Run once in PostgreSQL to create the extension
CREATE EXTENSION IF NOT EXISTS vector;

Performance Considerations

Indexing

Vector columns use HNSW (Hierarchical Navigable Small World) indexes for fast approximate nearest neighbor search.
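With pgvector's Django integration, such an index can be declared on the model; this is a sketch using pgvector's defaults, and the actual index definition in the codebase may differ:

```python
# Sketch only: HnswIndex comes from the pgvector Python package; the
# parameters shown are pgvector's documented defaults, not values taken
# from the OWASP Nest codebase.
from pgvector.django import HnswIndex

class Meta:
    indexes = [
        HnswIndex(
            name="chunk_embedding_hnsw_idx",
            fields=["embedding"],
            m=16,                             # max connections per graph node
            ef_construction=64,               # build-time candidate list size
            opclasses=["vector_cosine_ops"],  # match CosineDistance queries
        ),
    ]
```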

Caching

Frequently accessed embeddings are cached in Redis to reduce API calls.

Batch Processing

Chunk updates process entities in batches to avoid rate limits.
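A minimal sketch of that batching pattern (the batch size is illustrative; the real commands' sizing and rate-limit handling may differ):

```python
def batched(items: list, size: int) -> list[list]:
    """Split a worklist into fixed-size batches so each batch can be
    embedded in a single API call, keeping request volume under rate limits."""
    return [items[i:i + size] for i in range(0, len(items), size)]

batches = batched(list(range(250)), 100)
# 250 entities -> three batches of 100, 100, and 50
```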

Usage Examples

Programmatic Access

from apps.ai.agent.agent import AgenticRAGAgent

# Initialize agent
agent = AgenticRAGAgent()

# Run query
result = agent.run(query="What is the OWASP Top 10?")

print(result["answer"])
print(f"Iterations: {result['iterations']}")
print(f"Context chunks: {len(result['context_chunks'])}")

Slack Integration

/ai What is OWASP ZAP used for?
/ai How do I contribute to OWASP projects?
/ai What are the flagship projects?

Direct API

curl -X POST https://nest.owasp.org/api/ai/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is OWASP Juice Shop?"}'

Code Reference

Key implementation files:
  • Agent: backend/apps/ai/agent/agent.py:19
  • Nodes: backend/apps/ai/agent/nodes.py
  • Chunk Model: backend/apps/ai/models/chunk.py:12
  • Context Model: backend/apps/ai/models/context.py:15
  • Extractors: backend/apps/ai/common/extractors/
  • RAG Tools: backend/apps/ai/agent/tools/rag/

Monitoring

Enable debug logging to track agent iterations and retrieval performance:
import logging
logging.getLogger('apps.ai').setLevel(logging.DEBUG)

Future Enhancements

Coming Soon

  • Multi-modal embeddings (images, code)
  • Cross-project recommendations
  • Trend analysis and insights
  • Automated issue triage
  • Project health predictions

Related:
  • Slack Bot: access AI via the /ai command
  • Search: combines keyword and semantic search
  • Projects: projects indexed for AI
