Aiven for ClickHouse is a fully managed distributed columnar database based on open-source ClickHouse. Purpose-built for online analytical processing (OLAP), it delivers blazing-fast SQL queries on large datasets for real-time analytical reporting and data warehousing.

Overview

ClickHouse is a highly scalable, fault-tolerant database designed specifically for analytical workloads. Its columnar storage format and advanced compression enable complex SQL queries on billions of rows with sub-second response times.

Why Choose Aiven for ClickHouse

Extreme Performance

Process billions of rows per second with columnar storage and vectorized query execution

Real-Time Analytics

Ingest and query data simultaneously for up-to-the-second insights

SQL Compatible

Use familiar SQL syntax with advanced analytical functions

Built-in Integrations

Native integration with Kafka and PostgreSQL for seamless data pipelines

Key Features

Columnar storage, optimized for analytical queries:
  • Store data by columns instead of rows
  • Read only columns needed for query
  • Achieve 10-100x compression ratios
  • Parallel processing across columns
  • Minimize disk I/O for aggregations
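To see why column-oriented storage helps, here is a toy Python sketch (a conceptual illustration, not how ClickHouse is implemented internally): the same rows stored row-wise and column-wise. Summing one field in the columnar layout touches only that field's array, which is why analytical scans read far less data.

```python
# Row-oriented: every row carries every field
rows = [
    {"user_id": 1, "event": "view", "revenue": 0.0},
    {"user_id": 2, "event": "purchase", "revenue": 99.99},
    {"user_id": 3, "event": "purchase", "revenue": 49.99},
]

# Column-oriented: one contiguous array per field
columns = {
    "user_id": [1, 2, 3],
    "event": ["view", "purchase", "purchase"],
    "revenue": [0.0, 99.99, 49.99],
}

# Row layout: must walk all rows (decoding every field) to sum one column
row_total = sum(r["revenue"] for r in rows)

# Columnar layout: read exactly one array
col_total = sum(columns["revenue"])

assert row_total == col_total  # same answer, far less data touched per column
```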
Scale horizontally with sharding:
  • Data distributed across multiple shards
  • Queries executed in parallel
  • Replicas for high availability
  • Automatic data rebalancing
  • ZooKeeper for coordination
Create a distributed table:
CREATE TABLE events_distributed ON CLUSTER 'default'
(
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    properties String
)
ENGINE = Distributed('default', 'default', 'events_local', rand());
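Once the distributed table exists, reads and writes go through it as if it were a single table. This sketch assumes an `events_local` MergeTree table has already been created on every node of the cluster:

```sql
-- Writes are routed to a shard by the sharding key (rand() above)
INSERT INTO events_distributed VALUES
    (now(), 123, 'view', '{}');

-- Reads fan out to all shards and are merged on the initiating node
SELECT event_type, count() AS events
FROM events_distributed
GROUP BY event_type;
```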
Kafka Integration: Stream data directly from Kafka topics:
CREATE TABLE kafka_events
(
    timestamp DateTime,
    user_id UInt64,
    event String
)
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka-service:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse_consumer',
    kafka_format = 'JSONEachRow';

-- Materialized view to store data
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT * FROM kafka_events;
PostgreSQL Integration: Query PostgreSQL directly:
SELECT * FROM postgresql(
    'postgres-host:5432',
    'database',
    'table',
    'user',
    'password'
);
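For repeated access, the same connection can be wrapped in a table using the PostgreSQL table engine, so the remote table behaves like a local one. The host, database, and credentials here are placeholders:

```sql
CREATE TABLE pg_users
(
    user_id UInt64,
    name String,
    country String
)
ENGINE = PostgreSQL('postgres-host:5432', 'database', 'users', 'user', 'password');

SELECT country, count() FROM pg_users GROUP BY country;
```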
Multiple compression algorithms:
  • LZ4 (default, fast)
  • ZSTD (high compression)
  • Delta encoding for timestamps
  • Dictionary compression for strings
  • Automatic codec selection
CREATE TABLE metrics
(
    timestamp DateTime CODEC(DoubleDelta, LZ4),
    value Float64 CODEC(Gorilla, LZ4),
    tag String CODEC(ZSTD)
)
ENGINE = MergeTree()
ORDER BY timestamp;
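To check how well the chosen codecs actually compress, per-column on-disk sizes are exposed in the system.columns table:

```sql
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'metrics';
```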

Getting Started

1. Create ClickHouse Service

Deploy a ClickHouse service:
avn service create my-clickhouse \
  --service-type clickhouse \
  --cloud aws-us-east-1 \
  --plan business-8
2. Connect with clickhouse-client

Install and connect with the native client:
clickhouse-client \
  --host clickhouse-service.aivencloud.com \
  --port 9440 \
  --user avnadmin \
  --password your-password \
  --secure
3. Create Your First Table

CREATE TABLE events
(
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    country String,
    revenue Nullable(Float64)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);
4. Insert and Query Data

-- Insert data
INSERT INTO events VALUES
    ('2024-03-04 10:00:00', 123, 'purchase', 'US', 99.99),
    ('2024-03-04 10:05:00', 456, 'view', 'UK', NULL),
    ('2024-03-04 10:10:00', 123, 'view', 'US', NULL);

-- Query data
SELECT 
    toStartOfHour(timestamp) AS hour,
    event_type,
    count() AS event_count,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= '2024-03-04'
GROUP BY hour, event_type
ORDER BY hour, event_count DESC;

Connection Examples

Python (clickhouse-driver):

from clickhouse_driver import Client

# Connect using native protocol
client = Client(
    host='clickhouse-service.aivencloud.com',
    port=9440,
    user='avnadmin',
    password='your-password',
    secure=True,
    database='default'
)

# Create table
client.execute('''
    CREATE TABLE IF NOT EXISTS page_views
    (
        date Date,
        user_id UInt64,
        page String,
        duration UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(date)
    ORDER BY (date, user_id)
''')

# Insert data
data = [
    ('2024-03-04', 123, '/home', 45),
    ('2024-03-04', 456, '/products', 120),
    ('2024-03-04', 789, '/checkout', 300)
]

client.execute(
    'INSERT INTO page_views (date, user_id, page, duration) VALUES',
    data
)

# Query data
result = client.execute('''
    SELECT 
        page,
        count() AS views,
        avg(duration) AS avg_duration
    FROM page_views
    WHERE date = today()
    GROUP BY page
    ORDER BY views DESC
''')

for row in result:
    print(f"Page: {row[0]}, Views: {row[1]}, Avg Duration: {row[2]:.2f}s")

Table Engines

MergeTree is the most common engine for analytical workloads:
CREATE TABLE analytics
(
    date Date,
    hour UInt8,
    user_id UInt64,
    metric Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, hour, user_id)
SETTINGS index_granularity = 8192;
Features:
  • Primary key indexing
  • Data partitioning
  • Data sampling
  • Data TTL
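As an example of the TTL feature, a table can expire rows automatically once they reach a given age (the table name here is illustrative):

```sql
-- Keep raw rows for 90 days, then drop them in background merges
CREATE TABLE analytics_ttl
(
    date Date,
    user_id UInt64,
    metric Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id)
TTL date + INTERVAL 90 DAY;
```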

Advanced Features

Materialized Views

-- Source table
CREATE TABLE events
(
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    revenue Nullable(Float64)
)
ENGINE = MergeTree()
ORDER BY timestamp;

-- Target table for aggregated data
CREATE TABLE hourly_stats
(
    hour DateTime,
    event_type String,
    event_count UInt64,
    total_revenue Float64,
    unique_users UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_type);

-- Materialized view
CREATE MATERIALIZED VIEW events_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) AS hour,
    event_type,
    count() AS event_count,
    sum(revenue) AS total_revenue,
    uniq(user_id) AS unique_users
FROM events
GROUP BY hour, event_type;
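Note that SummingMergeTree collapses rows with the same sorting key only during background merges, so queries against the target table should still aggregate explicitly to get correct totals:

```sql
SELECT
    hour,
    event_type,
    sum(event_count) AS event_count,
    sum(total_revenue) AS total_revenue
FROM hourly_stats
GROUP BY hour, event_type
ORDER BY hour;
```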

Dictionaries

-- Create dictionary from external source
CREATE DICTIONARY user_info
(
    user_id UInt64,
    name String,
    country String,
    created_at DateTime
)
PRIMARY KEY user_id
SOURCE(POSTGRESQL(
    host 'postgres-host'
    port 5432
    user 'readonly'
    password 'secret'
    db 'users'
    table 'user_profiles'
))
LAYOUT(HASHED())
LIFETIME(3600);

-- Use dictionary in queries
SELECT
    e.user_id,
    dictGet('user_info', 'name', e.user_id) AS user_name,
    dictGet('user_info', 'country', e.user_id) AS country,
    count() AS events
FROM events e
GROUP BY e.user_id;

Window Functions

SELECT
    date,
    revenue,
    -- Running total
    sum(revenue) OVER (ORDER BY date) AS cumulative_revenue,
    -- Moving average
    avg(revenue) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS ma_7day,
    -- Rank
    row_number() OVER (ORDER BY revenue DESC) AS rank,
    -- Percent change
    (revenue - lag(revenue) OVER (ORDER BY date)) / lag(revenue) OVER (ORDER BY date) * 100 AS pct_change
FROM daily_revenue
ORDER BY date;

Performance Optimization

Partition large tables for faster queries:
-- Partition by month
CREATE TABLE events
(
    timestamp DateTime,
    data String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY timestamp;

-- Query specific partition
SELECT * FROM events
WHERE timestamp >= '2024-03-01' AND timestamp < '2024-04-01';

-- Drop old partitions
ALTER TABLE events DROP PARTITION '202401';
Data skipping indexes: multiple index types are available:
CREATE TABLE search_logs
(
    timestamp DateTime,
    query String,
    user_id UInt64,
    INDEX query_idx query TYPE tokenbf_v1(10000, 3, 0) GRANULARITY 4,
    INDEX user_idx user_id TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree()
ORDER BY timestamp;
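A query that filters on the indexed column can then skip data granules that cannot match. For token bloom filter indexes, token-level functions such as hasToken are a good fit:

```sql
SELECT count()
FROM search_logs
WHERE hasToken(query, 'checkout');
```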
Best practices:
  • Use PREWHERE for heavy filtering
  • Minimize SELECT *
  • Use appropriate data types
  • Leverage primary key ordering
  • Use sampling for exploratory queries
-- Use PREWHERE for filtering
SELECT user_id, count()
FROM events
PREWHERE date = today()
WHERE event_type = 'purchase'
GROUP BY user_id;

-- Sample data for fast results
SELECT avg(revenue)
FROM events SAMPLE 0.1
WHERE date >= today() - 30;
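SAMPLE only works on tables that declare a sampling expression: the table definition needs a SAMPLE BY clause whose expression is part of the sorting key, for example:

```sql
CREATE TABLE events_sampled
(
    date Date,
    user_id UInt64,
    revenue Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, cityHash64(user_id))
SAMPLE BY cityHash64(user_id);
```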

Monitoring and Maintenance

System Tables

-- Query statistics
SELECT
    query,
    read_rows,
    read_bytes,
    memory_usage,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Table sizes
SELECT
    database,
    table,
    formatReadableSize(sum(bytes)) AS size,
    sum(rows) AS rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes) DESC;

-- Current queries
SELECT
    query_id,
    user,
    query,
    elapsed,
    read_rows,
    memory_usage
FROM system.processes
ORDER BY elapsed DESC;

Integration with Grafana

avn service integration-create \
  --integration-type clickhouse_grafana \
  --source-service my-clickhouse \
  --dest-service my-grafana

Use Cases

  • Page view tracking
  • User behavior analysis
  • Conversion funnels
  • A/B testing results
  • Real-time dashboards

Best Practices

  • Use appropriate data types (UInt over Int when possible)
  • Order columns by query frequency
  • Use LowCardinality for enum-like strings
  • Denormalize for performance
  • Partition large tables by time
  • Batch inserts (1000+ rows)
  • Use async inserts for high throughput
  • Leverage Kafka integration for streaming
  • Use appropriate format (Native, Arrow, Parquet)
  • Monitor insert performance
  • Filter early with WHERE/PREWHERE
  • Use materialized views for common aggregations
  • Leverage dictionaries for dimension data
  • Sample data for exploratory queries
  • Monitor slow queries
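The insert guidance above can be sketched as a small batching helper: chunking rows client-side before calling the driver keeps each INSERT large enough to be efficient. The batch size and row shape here are illustrative:

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int, str, int]

def batched(rows: Iterable[Row], batch_size: int = 1000) -> Iterator[List[Row]]:
    """Yield lists of up to batch_size rows, suitable for one INSERT each."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Example: 2500 synthetic rows become inserts of 1000, 1000, and 500 rows
rows = (("2024-03-04", i, "/home", 45) for i in range(2500))
sizes = [len(b) for b in batched(rows, 1000)]
assert sizes == [1000, 1000, 500]

# With clickhouse_driver, each batch would then be sent as:
# client.execute('INSERT INTO page_views (date, user_id, page, duration) VALUES', batch)
```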

Apache Kafka

Stream data from Kafka to ClickHouse

PostgreSQL

Query PostgreSQL data from ClickHouse

Grafana

Visualize ClickHouse data in Grafana

Apache Flink

Process streams and load to ClickHouse


Performance Tip: ClickHouse can process billions of rows per second. Design your schemas and queries to take advantage of columnar storage and parallel processing.
