Vespa’s grouping feature allows you to organize search results into groups and compute aggregations like counts, sums, and averages. This is essential for faceted search, analytics, and result organization.
Overview
Grouping is integrated into YQL queries and executed as part of the search pipeline. The implementation is in container-search/src/main/java/com/yahoo/search/grouping/.
Basic Grouping Syntax
Add grouping to YQL queries using the pipe operator:
select * from sources product
where title contains "laptop"
| all(group(category) each(output(count())))
This groups results by category and outputs the count in each group.
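As a minimal sketch of how such a query might be sent over HTTP, the helper below joins a select statement and a grouping expression with the pipe operator and URL-encodes the result. The `build_search_url` helper and the `http://localhost:8080/search/` endpoint are assumptions for illustration, not part of the original text.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(base_url, select, grouping):
    """Join a YQL select statement and a grouping expression with the pipe
    operator and URL-encode the result as the `yql` request parameter."""
    yql = f"{select} | {grouping}"
    return f"{base_url}?{urlencode({'yql': yql})}"

url = build_search_url(
    "http://localhost:8080/search/",  # assumed endpoint of a local container
    'select * from sources product where title contains "laptop"',
    "all(group(category) each(output(count())))",
)
print(url)
```

Keeping the select statement and the grouping expression as separate arguments makes it easy to reuse one filter with several different facets.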
Grouping Operations
all() Operation
all() runs its operations once over the whole input list (all matching documents at the top level):
select * from sources product
where price > 0
| all(group(brand) each(output(count())))
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:36-43):
List<GroupingRequest> lst = executeQuery(
"all(group(foo) each(output(max(bar))))", null, null);
each() Operation
each() runs its operations separately for every element of the input list (each group, or each document inside a group):
select * from sources product
where price > 0
| all(group(category) each(output(count())))
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:46-52):
List<GroupingRequest> lst = executeQuery("all();each()", null, null);
assertEquals(2, lst.size());
assertTrue(lst.get(0).getRootOperation() instanceof AllOperation);
assertTrue(lst.get(1).getRootOperation() instanceof EachOperation);
Grouping by Fields
Single Field Grouping
select * from sources product
where userQuery()
| all(group(category) each(output(count())))
Multi-Level Grouping
Create hierarchical groups:
select * from sources product
where userQuery()
| all(
    group(category) each(
      output(count())
      all(group(brand) each(output(count())))
    )
  )
This creates a two-level hierarchy: categories contain brands.
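To make the shape of a two-level result concrete, here is a small in-memory model of what the nested grouping computes. It is only a sketch of the semantics, not how Vespa executes grouping; the `two_level_counts` function and the sample documents are invented for illustration.

```python
from collections import Counter, defaultdict

def two_level_counts(docs):
    """Return {category: {"count": n, "brands": {brand: n}}} for the given docs."""
    result = defaultdict(lambda: {"count": 0, "brands": Counter()})
    for doc in docs:
        group = result[doc["category"]]
        group["count"] += 1                 # outer output(count())
        group["brands"][doc["brand"]] += 1  # inner group(brand) output(count())
    return {cat: {"count": g["count"], "brands": dict(g["brands"])}
            for cat, g in result.items()}

docs = [
    {"category": "laptop", "brand": "acme"},
    {"category": "laptop", "brand": "acme"},
    {"category": "laptop", "brand": "globex"},
    {"category": "phone", "brand": "acme"},
]
hierarchy = two_level_counts(docs)
print(hierarchy)
```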
Grouping by Time
Group by time intervals:
select * from sources article
where userQuery()
| all(
    group(time.date(timestamp)) each(output(count()))
  )
Time grouping functions include: time.year(), time.monthofyear(), time.date(), time.dayofmonth(), time.hourofday(), time.minuteofhour()
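The following sketch models what a date histogram computes: each timestamp is mapped to a calendar date and counted. It assumes the `timestamp` field holds epoch seconds and uses UTC. The `date_histogram` function and the `YYYY-MM-DD` label format are illustrative assumptions, so check the grouping reference for the exact group values Vespa returns.

```python
from collections import Counter
from datetime import datetime, timezone

def date_histogram(timestamps):
    """Count epoch-second timestamps per UTC calendar date (YYYY-MM-DD)."""
    counts = Counter()
    for ts in timestamps:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        counts[day] += 1
    return dict(counts)

# Two timestamps on 1970-01-01 and two on 1970-01-02 (UTC).
histogram = date_histogram([0, 3600, 86400, 90000])
print(histogram)
```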
Aggregation Functions
count()
Count documents in each group:
| all(group(category) each(output(count())))
sum()
Sum numeric field values:
| all(group(category) each(output(sum(price))))
avg()
Compute average:
| all(group(category) each(output(avg(rating))))
min() and max()
Find minimum and maximum values:
| all(group(category) each(
    output(min(price), max(price))
  ))
Multiple Aggregations
Combine multiple aggregations:
| all(group(category) each(
    output(
      count(),
      sum(price),
      avg(rating),
      min(price),
      max(price)
    )
  ))
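For readers who want to check expected numbers by hand, this sketch computes the same five aggregations per group over an in-memory list. It models the semantics only. The `aggregate` function and the sample documents are hypothetical, and the output keys merely mirror the aggregator expressions for readability.

```python
from collections import defaultdict

def aggregate(docs, key):
    """Group docs by `key` and compute count, sum, avg, min and max per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    out = {}
    for value, members in groups.items():
        prices = [d["price"] for d in members]
        ratings = [d["rating"] for d in members]
        out[value] = {
            "count()": len(members),
            "sum(price)": sum(prices),
            "avg(rating)": sum(ratings) / len(ratings),
            "min(price)": min(prices),
            "max(price)": max(prices),
        }
    return out

docs = [
    {"category": "laptop", "price": 900, "rating": 4.0},
    {"category": "laptop", "price": 1500, "rating": 5.0},
    {"category": "phone", "price": 600, "rating": 3.0},
]
stats = aggregate(docs, "category")
print(stats["laptop"])
```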
Output Control
Limiting Groups
Limit the number of groups returned:
| all(
    group(category)
    max(10) /* Return top 10 groups */
    each(output(count()))
  )
Ordering Groups
Order groups by aggregation result:
| all(
    group(category)
    order(-count()) /* Order by count descending */
    max(10)
    each(output(count()))
  )
Order functions:
- order(count()): Ascending by count
- order(-count()): Descending by count
- order(sum(price)): By sum of price
- order(-avg(rating)): By average rating descending
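The interaction between order() and max() can be modelled as "sort, then truncate". The sketch below shows that model over a plain dictionary of counts; `top_groups` and the sample counts are made up for illustration.

```python
def top_groups(counts, limit, descending=True):
    """Sort (group, count) pairs by count and keep the first `limit` pairs."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=descending)
    return ordered[:limit]

counts = {"laptop": 42, "phone": 17, "tablet": 8, "monitor": 25}
top = top_groups(counts, limit=2)                       # like order(-count()) max(2)
bottom = top_groups(counts, limit=2, descending=False)  # like order(count()) max(2)
print(top, bottom)
```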
Returning Documents
Return sample documents from each group:
| all(
    group(category)
    each(
      output(count())
      max(3) /* Return up to 3 docs per group */
      each(output(summary()))
    )
  )
Precision Control
Control grouping precision for distributed searches:
| all(
    group(category)
    precision(100) /* Ensure accuracy for top 100 groups */
    max(10)
    each(output(count()))
  )
From the source (container-search/src/main/java/com/yahoo/search/grouping/GroupingRequest.java:41):
private Double defaultPrecisionFactor;
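In distributed deployments, each content node groups its own documents independently and returns only its top groups to the container, which merges them. precision(N) sets how many groups each node returns, so set it higher than max() when the counts for the top groups must be accurate.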
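Why precision matters is easiest to see with a toy model. In the sketch below, each node reports only its top `precision` groups and a merger sums whatever it receives. This is a deliberately simplified assumption about the merge step, not Vespa's actual algorithm, and `merged_top_groups` and the node data are hypothetical.

```python
from collections import Counter

def merged_top_groups(node_counts, precision, max_groups):
    """Merge per-node group counts when each node reports only its
    `precision` largest groups, then keep the `max_groups` largest overall."""
    merged = Counter()
    for counts in node_counts:
        for group, n in Counter(counts).most_common(precision):
            merged[group] += n
    return merged.most_common(max_groups)

node_a = {"laptop": 10, "phone": 9, "tablet": 8}
node_b = {"tablet": 12, "phone": 9, "laptop": 1}
# True totals: tablet 20, phone 18, laptop 11.

low = merged_top_groups([node_a, node_b], precision=1, max_groups=2)
high = merged_top_groups([node_a, node_b], precision=3, max_groups=2)
print(low)
print(high)
```

With precision 1, "phone" never reaches the merger even though it is the second-largest group overall, and the reported counts are too low. With precision 3, the merged result matches the true totals.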
Group Labels
Access group identity:
| all(
    group(category)
    each(
      output(count())
      max(5)
      each(output(summary()))
    )
  )
Each group in the result carries its key, the value of the grouping expression for that group, so no separate output is needed to identify it.
Expressions in Grouping
Mathematical Expressions
| all(
    group(price / 100) /* Group by price ranges */
    each(output(count()))
  )
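Bucket Expressions
Use predefined() with explicit buckets to split a numeric field into ranges (each bucket includes its lower bound and excludes its upper bound):
| all(
    group(predefined(rating, bucket(-inf, 4), bucket(4, inf)))
    each(output(count()))
  )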
String Operations
| all(
    group(strcat(category, "/", brand)) /* Concatenate two string fields into one group key */
    each(output(count()))
  )
Faceted Search Example
Complete faceted search implementation:
select * from sources product
where userQuery() and price > 0
| all(group(category) max(20) each(output(count())))
| all(group(brand) max(20) each(output(count())))
| all(
    group(price / 100)
    max(20)
    order(price / 100)
    each(output(count()))
  )
This returns:
- Search results
- Category facets with counts
- Brand facets with counts
- Price range facets
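On the client side, the facets have to be pulled out of the result tree. The sketch below walks a result dictionary and collects counts per group list. The sample payload is hand-written and modeled on the default JSON result layout as I understand it (a `group:root` child holding labelled group lists), so treat the field names and the `extract_facets` helper as assumptions to verify against a real response.

```python
def extract_facets(result):
    """Map each group list label to {group value: count} from a result dict."""
    facets = {}
    for child in result.get("root", {}).get("children", []):
        if not str(child.get("id", "")).startswith("group:root"):
            continue  # skip regular hits
        for group_list in child.get("children", []):
            facets[group_list.get("label")] = {
                group["value"]: group["fields"]["count()"]
                for group in group_list.get("children", [])
            }
    return facets

# Hand-written sample payload (assumed structure, not captured from a server).
sample = {
    "root": {
        "id": "toplevel",
        "children": [
            {"id": "group:root:0", "children": [
                {"id": "grouplist:category", "label": "category", "children": [
                    {"id": "group:string:laptop", "value": "laptop",
                     "fields": {"count()": 12}},
                    {"id": "group:string:phone", "value": "phone",
                     "fields": {"count()": 7}},
                ]},
            ]},
            {"id": "id:shop:product::1", "relevance": 0.5,
             "fields": {"title": "Laptop X"}},
        ],
    }
}
facets = extract_facets(sample)
print(facets)
```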
Advanced Grouping
Continuations
Paginate through group results:
select * from sources product
where userQuery()
| all(
    group(category)
    max(10)
    each(output(count()))
  )
Continuation tokens are returned in the result and can be used in subsequent queries.
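One way to pass tokens back in a follow-up query is to prefix the grouping expression with a `continuations` annotation. The helper below builds that prefix. The annotation syntax is my recollection of the Vespa grouping documentation and should be verified there; `with_continuations` is a hypothetical helper, and the tokens are the ones used in the test case quoted in this section.

```python
def with_continuations(grouping, tokens):
    """Prefix a grouping expression with a continuations annotation.

    Assumed syntax: { 'continuations': ['TOKEN', ...] }all(...)
    """
    if not tokens:
        return grouping
    quoted = ", ".join(f"'{token}'" for token in tokens)
    return f"{{ 'continuations': [{quoted}] }}{grouping}"

next_page = with_continuations(
    "all(group(category) max(10) each(output(count())))",
    ["BCBCBCBEBGBCBKCBACBKCCK", "BCBBBBBDBF"],
)
print(next_page)
```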
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:69-78):
List<GroupingRequest> lst = executeQuery(
    "all(group(foo) each(output(max(bar))))",
    "BCBCBCBEBGBCBKCBACBKCCK BCBBBBBDBF", // Two space-separated continuation tokens
    null
);
GroupingRequest req = lst.get(0); // The single parsed grouping request
assertEquals(2, req.continuations().size());
Time Zones
Set time zone for time-based grouping:
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:81-90):
List<GroupingRequest> lst = executeQuery(
    "all(group(foo) each(output(max(bar))))",
    null,
    "cet" // Time zone
);
GroupingRequest req = lst.get(0); // The single parsed grouping request
TimeZone time = req.getTimeZone();
assertEquals(TimeZone.getTimeZone("cet"), time);
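The time zone matters because the same instant can fall on different calendar dates. The sketch below illustrates this with a fixed UTC offset standing in for a real zone such as CET. This is a simplification, since real zones also have daylight saving rules. `date_bucket` is a hypothetical helper and is not related to the Java API above.

```python
from datetime import datetime, timedelta, timezone

def date_bucket(epoch_seconds, utc_offset_hours=0):
    """Return the calendar date (YYYY-MM-DD) of a timestamp at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz=tz).strftime("%Y-%m-%d")

late_evening_utc = 23 * 3600 + 30 * 60  # 1970-01-01T23:30:00Z
utc_day = date_bucket(late_evening_utc)                      # bucket in UTC
cet_day = date_bucket(late_evening_utc, utc_offset_hours=1)  # bucket at UTC+1
print(utc_day, cet_day)
```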
Grouping with Nested Each
From test case (container-search/src/test/java/com/yahoo/search/grouping/GroupingQueryParserTestCase.java:55-66):
select * from sources product
where userQuery()
| all(
    each(
      output(summary(bar))
    )
  )
Best Practices
- Limit group counts: Use max() to limit groups returned and improve performance
- Order strategically: Order by the most relevant aggregation (usually count or sum)
- Use precision for accuracy: Set appropriate precision() for distributed grouping
- Combine with filters: Apply filters in WHERE clause before grouping to reduce data
- Cache facet results: Consider caching facet results for common queries
Memory Usage
Grouping requires memory proportional to:
- Number of unique groups
- Precision setting
- Number of aggregations
Optimization Tips
- Limit max groups: Return only top N groups
- Use appropriate precision: Higher precision = more memory/CPU
- Use attribute fields: Fields used in grouping expressions must be attributes (indexing: attribute in the schema)
- Batch group queries: Request multiple facets in a single query
Common Patterns
Category Facets
| all(group(category) max(50) order(-count()) each(output(count())))
Date Histograms
| all(group(time.date(timestamp)) order(time.date(timestamp)) each(output(count())))
Price Ranges
| all(
    group(predefined(price,
      bucket(-inf, 100),
      bucket(100, 500),
      bucket(500, 1000),
      bucket(1000, inf)))
    each(output(count()))
  )
Top Products per Category
| all(
    group(category)
    max(10)
    each(
      max(3) /* Top 3 products per category */
      order(-rating)
      each(output(summary()))
    )
  )