Vespa’s grouping feature allows you to organize search results into groups and compute aggregations like counts, sums, and averages. This is essential for faceted search, analytics, and result organization.

Overview

Grouping is integrated into YQL queries and executed as part of the search pipeline. The implementation is in container-search/src/main/java/com/yahoo/search/grouping/.

Basic Grouping Syntax

Add grouping to YQL queries using the pipe operator:
select * from sources product 
where title contains "laptop" 
| all(group(category) each(output(count())))
This groups results by category and outputs the count in each group.
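To run such a query over HTTP, the whole YQL string, including the pipe and the grouping expression, has to be URL-encoded. The following is a minimal sketch of that step; the endpoint `http://localhost:8080/search/` and the helper name `GroupingQueryUrl` are assumptions for illustration, not something this page specifies.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class GroupingQueryUrl {
    /** Builds a search URI with the YQL string URL-encoded into the yql parameter. */
    static URI build(String baseUrl, String yql) {
        // URLEncoder turns spaces into '+' and '|', '(' and ')' into percent escapes.
        String encoded = URLEncoder.encode(yql, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "?yql=" + encoded);
    }

    public static void main(String[] args) {
        String yql = "select * from sources product where title contains \"laptop\" "
                   + "| all(group(category) each(output(count())))";
        // The base URL is an assumed local endpoint; replace it with your own.
        System.out.println(build("http://localhost:8080/search/", yql));
    }
}
```

The resulting URI can be passed to any HTTP client; the grouping results come back in the same response as the regular hits.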

Grouping Operations

all() Operation

all() applies the enclosed operations once to the whole input list, here every matching document:
select * from sources product 
where price > 0
| all(group(brand) each(output(count())))
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:36-43):
List<GroupingRequest> lst = executeQuery(
    "all(group(foo) each(output(max(bar))))", null, null);

each() Operation

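each() applies the enclosed operations separately to every item in the input list: every group produced by group(), or every document when no group() precedes it: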
select * from sources product 
where price > 0
| all(group(category) each(output(count())))
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:46-52):
List<GroupingRequest> lst = executeQuery("all();each()", null, null);
assertEquals(2, lst.size());
assertTrue(lst.get(0).getRootOperation() instanceof AllOperation);
assertTrue(lst.get(1).getRootOperation() instanceof EachOperation);

Grouping by Fields

Single Field Grouping

select * from sources product 
where userQuery()
| all(group(category) each(output(count())))

Multi-Level Grouping

Create hierarchical groups:
select * from sources product 
where userQuery()
| all(
    group(category) each(
      output(count()) 
      all(group(brand) each(output(count())))
    )
  )
This creates a two-level hierarchy: categories contain brands.
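As an in-memory analogy of what the nested expression computes (not how Vespa executes it), the sketch below counts hits per category and, inside each category, per brand. The `Product` record and the class name are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TwoLevelGroupingSketch {
    /** Hypothetical hit type carrying only the two grouping fields. */
    record Product(String category, String brand) {}

    /** Outer key: category. Inner key: brand. Value: number of hits in that brand. */
    static Map<String, Map<String, Long>> countByCategoryThenBrand(List<Product> hits) {
        Map<String, Map<String, Long>> groups = new TreeMap<>();
        for (Product hit : hits) {
            groups.computeIfAbsent(hit.category(), k -> new TreeMap<>())
                  .merge(hit.brand(), 1L, Long::sum);
        }
        return groups;
    }
}
```

The count for a category itself, which the outer output(count()) reports, is the sum of its brand counts in this sketch.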

Grouping by Time

Group by time intervals:
select * from sources article 
where userQuery()
| all(
    group(time.date(timestamp)) each(output(count()))
  )
Time grouping functions include: time.year(), time.monthofyear(), time.date(), time.dayofmonth(), time.hourofday(), time.minuteofhour(). They operate on a timestamp given in seconds since epoch.
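To make the bucketing concrete, here is a small sketch of what grouping by calendar date amounts to, assuming the timestamp field holds seconds since epoch. It also shows why the time zone matters: the same instant can fall on different dates. The class name is hypothetical and the output format of Vespa's own time.date() is not claimed here.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

final class DateBucketSketch {
    /** Maps a timestamp in epoch seconds to the calendar date in the given zone. */
    static LocalDate dateBucket(long epochSeconds, ZoneId zone) {
        return Instant.ofEpochSecond(epochSeconds).atZone(zone).toLocalDate();
    }
}
```

For example, 1700003600 is 23:13:20 UTC on 2023-11-14, so it lands in the 2023-11-14 bucket in UTC but in the 2023-11-15 bucket at UTC+1.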

Aggregation Functions

count()

Count documents in each group:
| all(group(category) each(output(count())))

sum()

Sum numeric field values:
| all(group(category) each(output(sum(price))))

avg()

Compute average:
| all(group(category) each(output(avg(rating))))

min() and max()

Find minimum and maximum values:
| all(group(category) each(
    output(min(price), max(price))
  ))

Multiple Aggregations

Combine multiple aggregations:
| all(group(category) each(
    output(
      count(),
      sum(price),
      avg(rating),
      min(price),
      max(price)
    )
  ))
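The combined output can be pictured as one statistics object per group. The sketch below is an in-memory analogy using `DoubleSummaryStatistics`, which tracks count, sum, average, min, and max in a single pass. It aggregates only a price field for brevity; the `Product` record and class name are hypothetical.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MultiAggregationSketch {
    /** Hypothetical hit type with a grouping field and one numeric field. */
    record Product(String category, double price) {}

    /** One statistics accumulator per category, fed with each hit's price. */
    static Map<String, DoubleSummaryStatistics> priceStatsByCategory(List<Product> hits) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (Product hit : hits) {
            stats.computeIfAbsent(hit.category(), k -> new DoubleSummaryStatistics())
                 .accept(hit.price());
        }
        return stats;
    }
}
```

Each accumulator then exposes getCount(), getSum(), getAverage(), getMin(), and getMax(), mirroring the five aggregators in the expression above.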

Output Control

Limiting Groups

Limit the number of groups returned:
| all(
    group(category) 
    max(10)  /* Return top 10 groups */
    each(output(count()))
  )

Ordering Groups

Order groups by aggregation result:
| all(
    group(category) 
    order(-count())  /* Order by count descending */
    max(10)
    each(output(count()))
  )
Order functions:
  • order(count()): Ascending by count
  • order(-count()): Descending by count
  • order(sum(price)): By sum of price
  • order(-avg(rating)): By average rating descending
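The combination of order(-count()) and max(N) behaves like a sort followed by a cut-off. A minimal in-memory sketch, assuming group counts are already available in a map; the tie-break by key is a choice made for this example, not documented Vespa behavior.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TopGroupsSketch {
    /** Returns the n group keys with the highest counts; ties are broken by key. */
    static List<String> topByCountDescending(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue()); // descending
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```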

Returning Documents

Return sample documents from each group:
| all(
    group(category) 
    each(
      output(count())
      max(3)  /* Return up to 3 docs per group */
      each(output(summary()))
    )
  )

Precision Control

Control grouping precision for distributed searches:
| all(
    group(category) 
    precision(100)  /* Each content node returns its 100 best groups */
    max(10)
    each(output(count()))
  )
From the source (container-search/src/main/java/com/yahoo/search/grouping/GroupingRequest.java:41):
private Double defaultPrecisionFactor;
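In distributed deployments, each content node groups its own documents and returns only its best groups, which are then merged. With precision(N), each node returns its N best groups before the merge, so setting N higher than max() makes the merged top groups and their counts more accurate.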
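Why a low precision can give wrong results is easiest to see in a simulation. The sketch below is a simplified model, not Vespa's implementation: each node is a map of local group counts, each node returns only its best `precision` groups, and the merged list is cut to `max`. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrecisionSketch {
    /** Keeps the k entries with the highest counts (ties broken by key). */
    static Map<String, Long> topK(Map<String, Long> counts, int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> top = new TreeMap<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(k, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    /** Each node contributes its local top `precision` groups; the merged top `max` is returned. */
    static Map<String, Long> mergedTop(List<Map<String, Long>> nodes, int precision, int max) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> node : nodes) {
            topK(node, precision).forEach((key, count) -> merged.merge(key, count, Long::sum));
        }
        return topK(merged, max);
    }
}
```

With two nodes holding {A=10, C=9} and {B=10, C=9}, a precision of 1 merges only A and B and reports A with count 10, although C has 18 documents in total. A precision of 2 lets both nodes report C, and the merged result is C with count 18.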

Group Labels

Access group identity:
| all(
    group(category) 
    each(
      output(count())
      max(5)
      each(output(summary()))
    )
  )
The group key is automatically included in results.

Expressions in Grouping

Mathematical Expressions

| all(
    group(price / 100)  /* Group by price ranges */
    each(output(count()))
  )
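Whether price / 100 actually collapses prices into ranges depends on the field type, which this page does not state: integer division truncates, while floating-point division would give almost every price its own group. The sketch below shows the fixed-width bucketing the example is aiming for; the class and method names are hypothetical.

```java
final class PriceBucketSketch {
    /** Lower bound of the fixed-width bucket that contains value. */
    static long bucketStart(long value, long width) {
        // floorDiv rounds toward negative infinity, so negative values
        // also land in the bucket below them rather than in bucket 0.
        return Math.floorDiv(value, width) * width;
    }
}
```

For a width of 100, the values 200, 250, and 299 all map to bucket start 200, and 99 maps to 0.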

Conditional Expressions

| all(
    group(if(rating > 4, "high", "low"))
    each(output(count()))
  )

String Operations

| all(
    group(category.lowercase())
    each(output(count()))
  )

Faceted Search Example

Complete faceted search implementation:
select * from sources product 
where userQuery() and price > 0
| all(group(category) max(20) each(output(count())))
| all(group(brand) max(20) each(output(count())))
| all(
    group(price / 100) 
    max(20) 
    order(price / 100)
    each(output(count()))
  )
This returns:
  • Search results
  • Category facets with counts
  • Brand facets with counts
  • Price range facets
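When the set of facets varies per request, the query string can be assembled programmatically. A minimal sketch that appends one count facet per field, following the pipe-per-facet form used above; `FacetQueryBuilder` is a hypothetical helper, and field names should come from a fixed allow-list rather than from user input.

```java
import java.util.List;

final class FacetQueryBuilder {
    /** Appends one count facet per field to the base query, each as its own pipe step. */
    static String withCountFacets(String baseQuery, List<String> fields, int maxGroups) {
        StringBuilder yql = new StringBuilder(baseQuery);
        for (String field : fields) {
            yql.append(" | all(group(").append(field)
               .append(") max(").append(maxGroups)
               .append(") each(output(count())))");
        }
        return yql.toString();
    }
}
```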

Advanced Grouping

Continuations

Paginate through group results:
select * from sources product 
where userQuery()
| all(
    group(category) 
    max(10)
    each(output(count()))
  )
Continuation tokens are returned in the result and can be used in subsequent queries. From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:69-78):
List<GroupingRequest> lst = executeQuery(
    "all(group(foo) each(output(max(bar))))",
    "BCBCBCBEBGBCBKCBACBKCCK BCBBBBBDBF",  // Two space-separated continuation tokens
    null
);
GroupingRequest req = lst.get(0);  // The single request parsed from the query
assertEquals(2, req.continuations().size());

Time Zones

Set the time zone used for time-based grouping.
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:81-90):
List<GroupingRequest> lst = executeQuery(
    "all(group(foo) each(output(max(bar))))", 
    null, 
    "cet"  // Time zone
);
GroupingRequest req = lst.get(0);  // The single request parsed from the query
TimeZone time = req.getTimeZone();
assertEquals(TimeZone.getTimeZone("cet"), time);

Grouping with Nested Each

From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:55-66):
select * from sources product 
where userQuery()
| all(
    each(
      output(summary(bar))
    )
  )

Best Practices

  1. Limit group counts: Use max() to limit the groups returned and improve performance
  2. Order strategically: Order by the most relevant aggregation (usually count or sum)
  3. Use precision for accuracy: Set an appropriate precision() for distributed grouping
  4. Combine with filters: Apply filters in the where clause before grouping to reduce data
  5. Cache facet results: Consider caching facet results for common queries

Performance Considerations

Memory Usage

Grouping requires memory proportional to:
  • Number of unique groups
  • Precision setting
  • Number of aggregations

Optimization Tips

  1. Limit max groups: Return only top N groups
  2. Use appropriate precision: Higher precision = more memory/CPU
  3. Use attributes for grouping fields: Ensure fields used in grouping expressions are defined as attributes
  4. Batch group queries: Request multiple facets in a single query

Common Patterns

Category Facets

| all(group(category) max(50) order(-count()) each(output(count())))

Date Histograms

| all(group(time.date(timestamp)) order(time.date(timestamp)) each(output(count())))

Price Ranges

| all(
    group(
      predefined(price,
        bucket(0, 100),
        bucket(100, 500),
        bucket(500, 1000),
        bucket(1000, inf)
      )
    )
    each(output(count()))
  )

Top Products per Category

| all(
    group(category) 
    max(10)
    each(
      max(3)  /* Top 3 products per category */
      order(-rating)
      each(output(summary()))
    )
  )
