
Overview

ForkBB includes a full-text search system that indexes posts and topics. It handles multiple languages, CJK characters, wildcards, and hashtags, and filters words by length and against a stopword list.

Search Architecture

The search system is implemented in app/Models/Search/Search.php:15 with separate action handlers for different search types:

Forums

Search within specific forums (ActionF)

Posts

Full-text search of post content (ActionP)

Topics

Search topic titles (ActionT)

Text Processing

Character Normalization

The search system normalizes text before indexing:
app/Models/Search/Search.php
const QUOTES = ['ʹ', 'ʻ', 'ʼ', 'ʽ', 'ʾ', 'ʿ', '΄', '᾿', 'Ꞌ', 'ꞌ', '\'', '\'', '‛', '′', '´', '`', '`', '\'', '`', '\''];

public function cleanText(string $text, bool $indexing = false): string
{
    // Extract hashtags separately
    $tags = [];
    $text = \preg_replace_callback(
        '%(?<=^|\s|\n|\r)#(?=[\p{L}\p{N}_]{3})[\p{L}\p{N}]+(?:_+[\p{L}\p{N}]+)*(?=$|\s|\n|\r|\.|,)%u',
        function ($matches) use (&$tags)  {
            $tags[] = $matches[0];
            return ' ';
        },
        $text
    );
    
    // Normalize quotes
    $text = \str_replace(self::QUOTES, '\'', $text);
    
    // Russian normalization
    $text = \str_replace('ё', 'е', $text);
    
    // Separate CJK characters with spaces
    $text = \preg_replace('%' . self::CJK_REGEX . '%u', ' \0 ', $text);
    
    // Reduce repeated characters (4+ to 1)
    $text = \preg_replace('%(\p{L})\1{3,}%u', '\1', $text);
    
    // Remove quotes and hyphens outside words
    $text = \preg_replace('%((?<![\p{L}\p{N}])[\'\'\-]|[\'\'\-](?![\p{L}\p{N}]))%u', ' ', $text);
    
    if (false !== \strpos($text, '-')) {
        // Remove words ending with -либо, -нибудь, -нить
        $text = \preg_replace('%\b[\p{L}\p{N}\-\']+\-(?:либо|нибу[дт]ь|нить)(?![\p{L}\p{N}\'\-])%u', '', $text);
        
        // Remove trailing suffixes like -таки, -чуть
        $text = \preg_replace('%(?<=[\p{L}\p{N}])(\-(?:таки|чуть|[а-я]{1,2}))+(?![\p{L}\p{N}\'\-])%u', '', $text);
    }
    
    // Remove non-alphanumeric characters (keep wildcards if not indexing)
    $text = \preg_replace('%(?![\'\'\-'.($indexing ? '' : '\?\*').'])[^\p{L}\p{N}]+%u', ' ', $text);
    
    // Compress multiple spaces
    $text = \preg_replace('% {2,}%', ' ', $text);
    
    return \trim($text . ' '. \implode(' ', $tags));
}
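The same cleaning runs when text is indexed and when a query is parsed (the $indexing flag only controls whether wildcards are kept), so variants such as different apostrophe characters or ё/е resolve to the same word.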
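To see what a few of these steps do in isolation, here is a minimal sketch. The function name normalizeSketch is hypothetical, and it reproduces only a subset of cleanText() (no hashtag, quote, CJK, or suffix handling), so treat it as an illustration rather than the actual implementation.

```php
<?php
// Hypothetical standalone helper; mirrors a subset of the cleanText() steps.
function normalizeSketch(string $text): string
{
    // Russian normalization: treat "ё" as "е"
    $text = \str_replace('ё', 'е', $text);
    // Collapse a letter repeated 4 or more times into a single letter
    $text = \preg_replace('%(\p{L})\1{3,}%u', '\1', $text);
    // Replace anything that is not a letter, digit, apostrophe or hyphen with a space
    $text = \preg_replace('%[^\p{L}\p{N}\'\-]+%u', ' ', $text);
    // Compress runs of spaces
    $text = \preg_replace('% {2,}%', ' ', $text);

    return \trim($text);
}

echo normalizeSketch('ёлка!!!  Coooool, well-known'), "\n";
// prints "елка Col well-known"
```

Note that the repeated-letter rule is aggressive: five consecutive "o" characters collapse to one, not two.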

CJK Support

The search system has extensive support for Chinese, Japanese, and Korean characters:
app/Models/Search/Search.php
const CJK_REGEX = '['.  
    '\x{1100}-\x{11FF}'.   // Hangul Jamo
    '\x{3130}-\x{318F}'.   // Hangul Compatibility Jamo
    '\x{AC00}-\x{D7AF}'.   // Hangul Syllables
    
    // Hiragana
    '\x{3040}-\x{309F}'.   // Hiragana
    
    // Katakana
    '\x{30A0}-\x{30FF}'.   // Katakana
    '\x{31F0}-\x{31FF}'.   // Katakana Phonetic Extensions
    
    // CJK Unified Ideographs
    '\x{2E80}-\x{2EFF}'.   // CJK Radicals Supplement
    '\x{2F00}-\x{2FDF}'.   // Kangxi Radicals
    '\x{2FF0}-\x{2FFF}'.   // Ideographic Description Characters
    '\x{3000}-\x{303F}'.   // CJK Symbols and Punctuation
    '\x{31C0}-\x{31EF}'.   // CJK Strokes
    '\x{3200}-\x{32FF}'.   // Enclosed CJK Letters and Months
    '\x{3400}-\x{4DBF}'.   // CJK Unified Ideographs Extension A
    '\x{4E00}-\x{9FFF}'.   // CJK Unified Ideographs
    '\x{20000}-\x{2A6DF}'. // CJK Unified Ideographs Extension B
    '\x{2A700}-\x{2B73F}'. // CJK Unified Ideographs Extension C
    '\x{2B740}-\x{2B81F}'. // CJK Unified Ideographs Extension D
    '\x{2B820}-\x{2CEAF}'. // CJK Unified Ideographs Extension E
    '\x{2CEB0}-\x{2EBEF}'. // CJK Unified Ideographs Extension F
    '\x{2F800}-\x{2FA1F}'. // CJK Compatibility Ideographs Supplement
    '\x{30000}-\x{3134F}'. // CJK Unified Ideographs Extension G
    '\x{31350}-\x{323AF}'. // CJK Unified Ideographs Extension H
    ']';

public function isCJKWord(string $word): bool
{
    return \preg_match('%^' . self::CJK_REGEX . '+$%u', $word) ? true : false;
}

CJK Word Handling

CJK characters are treated differently from alphabetic languages:
  • Each character is separated during text cleaning
  • CJK words bypass length restrictions
  • Individual characters can be searched
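The separation step can be tried on its own. The sketch below uses a deliberately reduced character class (Hiragana, Katakana, and the main CJK Unified Ideographs block); CJK_SKETCH and isCjkSketch are hypothetical names, and the real CJK_REGEX covers many more ranges.

```php
<?php
// Reduced character class for illustration only.
const CJK_SKETCH = '[\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{4E00}-\x{9FFF}]';

function isCjkSketch(string $word): bool
{
    return 1 === \preg_match('%^' . CJK_SKETCH . '+$%u', $word);
}

// Surround every CJK character with spaces, then collapse the extra spaces
$text  = \preg_replace('%' . CJK_SKETCH . '%u', ' \0 ', 'PHP論壇search');
$text  = \trim(\preg_replace('% {2,}%', ' ', $text));
$words = \explode(' ', $text);

\print_r($words);              // PHP, 論, 壇, search
\var_dump(isCjkSketch('論'));   // bool(true)
\var_dump(isCjkSketch('PHP')); // bool(false)
```

Because each ideograph becomes its own token, a single-character query can match even though it is shorter than the minimum length applied to other words.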

Word Processing

Word Validation

app/Models/Search/Search.php
public function word(string $word, bool $indexing = false): ?string
{
    // Check stopwords
    if (isset($this->c->stopwords->list[$word])) {
        return null;
    }
    
    // CJK words are always valid
    if ($this->isCJKWord($word)) {
        return $word;
    }
    
    // Check minimum length (3 characters)
    $len = \mb_strlen(\trim($word, '?*'), 'UTF-8');
    
    if ($len < 3) {
        return null;
    }
    
    // Truncate to maximum length (20 characters)
    if ($len > 20) {
        $word = \mb_substr($word, 0, 20, 'UTF-8');
    }
    
    return $word;
}
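Word Length Requirements: Regular words need at least 3 characters, not counting leading or trailing wildcards. Words longer than 20 characters are not rejected; they are truncated to their first 20 characters. CJK words have no length restrictions.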
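The length rules are easy to check with a standalone copy. The function below, wordLengthSketch, is a hypothetical helper that keeps only the length logic from word(); the stopword and CJK checks are left out.

```php
<?php
// Hypothetical standalone version of the length rules only.
function wordLengthSketch(string $word): ?string
{
    // Wildcards at either end do not count toward the minimum length
    $len = \mb_strlen(\trim($word, '?*'), 'UTF-8');

    if ($len < 3) {
        return null;
    }

    // Longer words are cut to 20 characters rather than rejected
    return $len > 20 ? \mb_substr($word, 0, 20, 'UTF-8') : $word;
}

\var_dump(wordLengthSketch('ab'));    // NULL: too short
\var_dump(wordLengthSketch('ab*'));   // NULL: only 2 characters besides the wildcard
\var_dump(wordLengthSketch('forum')); // string(5) "forum"
\var_dump(wordLengthSketch('pneumonoultramicroscopicsilico')); // first 20 characters only
```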

Extracting Words

app/Models/Search/Search.php
public function words(string $text, bool $indexing): array
{
    $text  = $this->cleanText($text, $indexing);
    $words = [];
    
    foreach (\explode(' ', $text) as $word) {
        $word = $this->word($word, $indexing);
        
        if (null !== $word) {
            $words[$word] = $word;
        }
    }
    
    return \array_values($words);
}

Stopwords

Common words are filtered out to improve search relevance:
if (isset($this->c->stopwords->list[$word])) {
    return null;
}
Stopwords typically include:
  • Articles (a, an, the)
  • Prepositions (in, on, at)
  • Common verbs (is, are, was)
  • Pronouns (I, you, they)
Stopwords prevent searching for very common terms. Configure your stopword list based on your forum’s language.
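The isset() check implies the list is keyed by word, which makes each lookup a hash lookup instead of a scan. A minimal sketch, assuming a hypothetical four-word list and reusing the keyed-array trick from words() to drop duplicates:

```php
<?php
// Hypothetical stopword list; array_flip() turns the words into keys.
$stopwords = \array_flip(['the', 'and', 'are', 'was']);

$words = [];
foreach (\explode(' ', 'the forum and the search') as $word) {
    if (isset($stopwords[$word])) {
        continue; // stopword: filtered out
    }
    $words[$word] = $word; // using the word as key removes duplicates
}

\print_r(\array_values($words)); // forum, search
```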

Search Execution

The search system uses separate handlers for different search types:

Search Actions

Searches post content using the full-text index:
// app/Models/Search/ActionP.php
// Searches through post messages
// Returns post IDs matching the query

Search Preparation

// app/Models/Search/Prepare.php
// Validates and prepares search queries
// Handles wildcards and operators

Search Execution

// app/Models/Search/Execute.php
// Runs the prepared query
// Returns sorted, paginated results

Indexing

The search index is maintained automatically:
// app/Models/Search/Index.php
// Indexes new posts as they're created
// Updates index when posts are edited

Index Management

// app/Models/Search/TruncateIndex.php
// Clears the search index
// Used for rebuilding or maintenance
The search index should be rebuilt periodically or after bulk imports.

Pagination and Results

app/Models/Search/Search.php
protected function getlink(): string
{
    return $this->c->Router->link($this->linkMarker, $this->linkArgs);
}

protected function getpagination(): array
{
    return $this->c->Func->paginate($this->numPages, $this->page, $this->linkMarker, $this->linkArgs);
}

public function hasPage(): bool
{
    return $this->page > 0 && $this->page <= $this->numPages;
}
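
The arithmetic behind these helpers can be sketched as follows. The variable names and the page size of 20 are assumptions for illustration; the excerpt above does not show how numPages is computed.

```php
<?php
// Hypothetical values; $perPage is not named in the excerpt above.
$total   = 45; // number of matching IDs
$perPage = 20;
$page    = 3;

$numPages = (int) \ceil($total / $perPage);  // 3
$hasPage  = $page > 0 && $page <= $numPages; // true
$offset   = ($page - 1) * $perPage;          // 40: first result shown on page 3

\var_dump($numPages, $hasPage, $offset);
```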

Result Slicing

Efficient result handling for pagination:
app/Models/Search/Search.php
public function slice(string|array $data, int $offset, int $length): array
{
    if (\is_array($data)) {
        return \array_slice($data, $offset, $length);
    }
    
    // For comma-separated string of IDs
    $p = 0;
    $i = 0;
    
    // Skip to offset
    while ($i < $offset) {
        if (false === ($p = \strpos($data, ',', $p))) {
            return [];
        }
        ++$p;
        ++$i;
    }
    
    $e       = $p;
    $offset += $length;
    
    // Extract slice
    while ($i < $offset) {
        if (false === ($e = \strpos($data, ',', $e))) {
            return \array_map('\\intval', \explode(',', \substr($data, $p)));
        }
        ++$e;
        ++$i;
    }
    
    return \array_map('\\intval', \explode(',', \substr($data, $p, $e - $p - 1)));
}

public function count(string|array $data): int
{
    return \is_array($data) ? \count($data) : \substr_count($data, ',') + 1;
}
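Results can be held either as an array or as a comma-separated string of IDs. For the string form, slice() walks the commas with strpos() and converts only the requested page to integers, so the full list is never expanded into an array.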
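For comparison, here is a simplified reference version that returns the same page for a well-formed ID string but explodes the whole list first. sliceSketch is a hypothetical name; it exists only to make the expected output of slice() easy to reason about.

```php
<?php
// Simplified reference version: explodes the entire string, which the
// strpos()-based slice() avoids.
function sliceSketch(string $ids, int $offset, int $length): array
{
    return \array_map('intval', \array_slice(\explode(',', $ids), $offset, $length));
}

$ids = '10,20,30,40,50';

\print_r(sliceSketch($ids, 2, 2));      // 30, 40
\print_r(sliceSketch($ids, 4, 2));      // 50 (the last page may be short)
echo \substr_count($ids, ',') + 1, "\n"; // 5, the same formula count() uses
```

The trade-off is memory: with hundreds of thousands of IDs, the reference version builds a full array just to return one page.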

Wildcard Support

Searches support wildcards for partial matching:
  • * matches zero or more characters
  • ? matches exactly one character
Examples:
  • test* matches “test”, “testing”, “tester”
  • t?st matches “test”, “tost”, “tast”
Wildcards are removed during indexing but preserved during searching.
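One common way to evaluate such wildcards is to translate them into an SQL LIKE pattern. The sketch below assumes the word index is queried with LIKE, which this page does not confirm; wildcardToLikeSketch is a hypothetical helper, not a ForkBB function.

```php
<?php
// Assumption: the word index is queried with SQL LIKE.
function wildcardToLikeSketch(string $word): string
{
    // Escape LIKE metacharacters that occur literally, then map the wildcards
    $word = \str_replace(['\\', '%', '_'], ['\\\\', '\\%', '\\_'], $word);

    return \str_replace(['*', '?'], ['%', '_'], $word);
}

echo wildcardToLikeSketch('test*'), "\n"; // test%
echo wildcardToLikeSketch('t?st'), "\n";  // t_st
```

The escaping step is defensive: after cleanText() a query word should no longer contain "%" or "_", but escaping keeps the helper safe if it is called on raw input.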

Hashtag Support

Hashtags are automatically detected and indexed:
$text = \preg_replace_callback(
    '%(?<=^|\s|\n|\r)#(?=[\p{L}\p{N}_]{3})[\p{L}\p{N}]+(?:_+[\p{L}\p{N}]+)*(?=$|\s|\n|\r|\.|,)%u',
    function ($matches) use (&$tags)  {
        $tags[] = $matches[0];
        return ' ';
    },
    $text
);
Hashtags must:
  • Start with #
  • Contain at least 3 characters (letters, digits, or underscores), beginning and ending with a letter or digit
  • Be preceded by whitespace or line start
  • Be followed by whitespace, line end, a period, or a comma
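These rules can be checked by running the pattern on a sample string. The input text is made up for illustration; the pattern is copied from the excerpt above.

```php
<?php
$pattern = '%(?<=^|\s|\n|\r)#(?=[\p{L}\p{N}_]{3})[\p{L}\p{N}]+(?:_+[\p{L}\p{N}]+)*(?=$|\s|\n|\r|\.|,)%u';

\preg_match_all($pattern, 'Try #php_8 now, not #ab or foo#bar.', $m);

\print_r($m[0]); // only "#php_8"
```

Here "#ab" fails the three-character minimum and "foo#bar" fails because the "#" is not preceded by whitespace.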

Search Deletion

Old searches are cleaned up periodically:
// app/Models/Search/Delete.php
// Removes expired search results
// Keeps database size manageable

Performance Optimization

Indexed Searches

Full-text indexes enable fast queries

Result Caching

Search results are cached temporarily

Word Filtering

Stopwords reduce index size

Efficient Slicing

slice() extracts only the current page of IDs from the stored result list

Best Practices

Rebuild the search index after importing posts or if search results seem stale. Run index maintenance during low-traffic periods.
Customize stopwords for your forum’s primary language. Too few stopwords bloat the index; too many prevent valid searches.
If your forum has CJK content, ensure proper character encoding (UTF-8) throughout your application.
For very large forums (millions of posts), consider external search solutions like Elasticsearch or Sphinx.

