Aiven for Apache Flink is a fully managed service for distributed, stateful stream processing. Process and analyze streaming data in real-time using standard SQL, with built-in integrations to Kafka and PostgreSQL.

Overview

Apache Flink is the leading open-source stream processing framework for building real-time data pipelines and streaming applications. Aiven for Apache Flink provides a managed platform with a built-in SQL editor, making it easy to develop, test, and deploy streaming applications without managing infrastructure.

SQL-Based Development

Write streaming applications using standard SQL with a built-in editor in Aiven Console

Stateful Processing

Maintain state across stream events for complex event processing and aggregations

Built-in Kafka Integration

Native integration with Aiven for Apache Kafka for seamless data flow

Exactly-Once Semantics

Guarantee data accuracy with exactly-once processing semantics

Key Features

Interactive queries: preview data without creating sink tables:
  • Test transformations quickly
  • Debug streaming logic
  • Explore data schemas
  • Validate joins and aggregations
-- Preview Kafka topic data
SELECT * FROM kafka_orders LIMIT 10;

-- Test transformation
SELECT 
    order_id,
    customer_id,
    order_total,
    order_total * 1.1 AS total_with_tax
FROM kafka_orders
WHERE order_status = 'completed'
LIMIT 100;
Apache Kafka Connector:
  • Auto-complete for Kafka topics
  • Multiple formats: JSON, Avro, Confluent Avro, Debezium CDC
  • Upsert Kafka for changelog streams
  • Schema Registry integration
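For changelog streams, the upsert Kafka connector keys records on a primary key. A minimal sketch, assuming an illustrative `customer-state` topic and placeholder bootstrap servers:

```sql
-- Hypothetical upsert table keyed on customer_id (topic and servers are placeholders)
CREATE TABLE customer_state (
    customer_id BIGINT,
    customer_tier STRING,
    updated_at TIMESTAMP(3),
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'customer-state',
    'properties.bootstrap.servers' = 'kafka:9092',
    'key.format' = 'json',
    'value.format' = 'json'
);
```

Each key's latest value wins, so the table behaves as a continuously updated view of the topic.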
PostgreSQL Connector:
  • Read from PostgreSQL tables
  • Write results back to PostgreSQL
  • Auto-complete for databases and tables
  • Support for JDBC connections
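A PostgreSQL-backed table can be sketched with the JDBC connector; the URL, credentials, and table name below are placeholders:

```sql
-- Hypothetical JDBC-backed lookup table (connection details are placeholders)
CREATE TABLE postgres_customers (
    customer_id BIGINT,
    customer_name STRING,
    customer_tier STRING,
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://pg-host:5432/defaultdb',
    'table-name' = 'customers',
    'username' = 'avnadmin',
    'password' = '<password>'
);
```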
OpenSearch Connector:
  • Sink streaming results to OpenSearch
  • Full-text search integration
  • Dynamic index creation
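An OpenSearch sink table might look like the following sketch; the host URL and index name are illustrative:

```sql
-- Hypothetical OpenSearch sink (host and index are placeholders)
CREATE TABLE search_results (
    order_id STRING,
    customer_name STRING,
    order_total DOUBLE
) WITH (
    'connector' = 'opensearch',
    'hosts' = 'https://opensearch-host:9200',
    'index' = 'orders'
);
```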
Exactly-once semantics guarantee data accuracy:
  • Checkpointing for fault tolerance
  • Automatic state recovery
  • Transactional sinks
  • No data loss or duplication

Getting Started

Step 1: Create Flink Service

Deploy an Apache Flink service:
avn service create my-flink \
  --service-type flink \
  --cloud aws-us-east-1 \
  --plan business-4
Service creation may be limited based on your subscription. Check with Aiven support for access.
Step 2: Create Integration with Kafka

Connect Flink to your Kafka service:
avn service integration-create \
  --integration-type flink \
  --source-service my-kafka \
  --dest-service my-flink
This enables Flink to read from and write to Kafka topics.
Step 3: Create a Flink Application

Use the Aiven Console wizard to:
  1. Create source tables from Kafka topics
  2. Write transformation SQL
  3. Create sink tables for results
  4. Deploy the application
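The wizard steps above correspond to Flink SQL like the following sketch; the topic and table names are illustrative:

```sql
-- 1. Source table over a Kafka topic (names are illustrative)
CREATE TABLE kafka_orders (
    order_id STRING,
    order_total DOUBLE,
    order_status STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'format' = 'json',
    'scan.startup.mode' = 'earliest-offset'
);

-- 3. Sink table for results
CREATE TABLE completed_orders (
    order_id STRING,
    order_total DOUBLE
) WITH (
    'connector' = 'kafka',
    'topic' = 'completed-orders',
    'format' = 'json'
);

-- 2. + 4. Transformation deployed as the application's statement
INSERT INTO completed_orders
SELECT order_id, order_total
FROM kafka_orders
WHERE order_status = 'completed';
```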
Step 4: Test with Interactive Queries

Run queries directly in the SQL editor to test before deploying.

Stream Processing Patterns

Enrich a stream with reference data using a temporal join:
-- Filter high-value orders and enrich with customer data
CREATE TABLE enriched_orders AS
SELECT 
    o.order_id,
    o.order_time,
    o.order_total,
    c.customer_name,
    c.customer_tier,
    c.customer_email
FROM kafka_orders o
JOIN postgres_customers FOR SYSTEM_TIME AS OF o.order_time AS c
    ON o.customer_id = c.customer_id
WHERE o.order_total > 100;

Window Types

Tumbling windows are fixed-size and non-overlapping:
-- Events grouped into 10-minute buckets
SELECT
    TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS window_start,
    COUNT(*) AS event_count
FROM events
GROUP BY TUMBLE(event_time, INTERVAL '10' MINUTE);
Hopping windows are fixed-size and overlapping:
-- 10-minute windows sliding every 5 minutes
SELECT
    HOP_START(event_time, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE) AS window_start,
    COUNT(*) AS event_count
FROM events
GROUP BY HOP(event_time, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE);
Session windows are dynamic, closing after a period of inactivity:
-- Group events with max 30-minute gap
SELECT
    SESSION_START(event_time, INTERVAL '30' MINUTE) AS session_start,
    SESSION_END(event_time, INTERVAL '30' MINUTE) AS session_end,
    user_id,
    COUNT(*) AS event_count
FROM events
GROUP BY 
    SESSION(event_time, INTERVAL '30' MINUTE),
    user_id;

Table Formats and Connectors

Kafka Table Formats

Define a Kafka-backed source table with a JSON format and an event-time watermark:
CREATE TABLE kafka_events (
    event_id STRING,
    event_time TIMESTAMP(3),
    user_id BIGINT,
    event_type STRING,
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);
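The same table could instead use Confluent Avro with Schema Registry integration; the registry URL below is a placeholder:

```sql
-- Variant using Confluent Avro and Schema Registry (URL is a placeholder)
CREATE TABLE kafka_events_avro (
    event_id STRING,
    event_time TIMESTAMP(3),
    user_id BIGINT,
    event_type STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'avro-confluent',
    'avro-confluent.url' = 'https://schema-registry:8082'
);
```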

Cluster Management

  • Scale up: Increase CPU and memory per TaskManager
  • Scale out: Add more nodes to the cluster
  • Configure task slots per TaskManager
  • Adjust parallelism for jobs
Adjusting task slots requires a cluster restart.
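Job parallelism can also be set per session in Flink SQL; this is a generic Flink configuration option, not an Aiven-specific one, and the value is only an example:

```sql
-- Set default parallelism for subsequent statements (generic Flink SQL)
SET 'parallelism.default' = '4';
```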
Automatic fault tolerance:
  • Periodic checkpoints to object storage
  • State recovery on failure
  • Exactly-once guarantees
  • Configurable checkpoint interval
Checkpoints are automatically configured for your cluster.
Multiple jobs on the same cluster:
  • Share cluster resources
  • Deploy multiple applications
  • Maximize resource utilization
  • Isolated job execution

Monitoring and Operations

Key Metrics

Job Metrics

  • Records processed per second
  • Job uptime and restarts
  • Checkpoint duration
  • Backpressure indicators

Resource Usage

  • TaskManager CPU/memory
  • JobManager status
  • Network I/O
  • State size

Integration with Observability

# Send logs to OpenSearch
avn service integration-create \
  --integration-type logs \
  --source-service my-flink \
  --dest-service my-opensearch

# Send metrics to Grafana
avn service integration-create \
  --integration-type metrics \
  --source-service my-flink \
  --dest-service my-grafana

Use Cases

  • Live dashboards
  • Streaming aggregations
  • Metric computation
  • KPI monitoring

Best Practices

State management:
  • Use proper key partitioning
  • Implement state TTL for growing state
  • Monitor state size
  • Use RocksDB for large state
Event-time processing:
  • Define watermarks for event-time processing
  • Account for late events
  • Balance latency against completeness
  • Use allowed lateness for critical data
Performance:
  • Tune checkpoint intervals
  • Adjust parallelism appropriately
  • Use proper join strategies
  • Monitor backpressure
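The state TTL recommended above can be enabled per job in Flink SQL; the one-hour value is only an example:

```sql
-- Expire idle state after one hour of inactivity (generic Flink SQL configuration)
SET 'table.exec.state.ttl' = '1 h';
```

Shorter TTLs keep state small at the cost of forgetting slow-arriving keys sooner.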

Apache Kafka

Stream processing on Kafka data

PostgreSQL

Enrich streams with PostgreSQL data

OpenSearch

Sink processed results to OpenSearch

ClickHouse

Load streaming results to ClickHouse


SQL-Based Development: No Java or Scala knowledge required. Build streaming applications entirely with SQL using the Aiven Console.
