The Linux block layer mediates all I/O between filesystems and storage devices. The modern multi-queue block layer (blk-mq) maps software submission queues (one per CPU) to hardware dispatch queues, eliminating the global queue lock that bottlenecked single-queue designs and enabling full parallelism on NVMe and other high-queue-depth devices.

struct bio

Basic unit of block I/O submitted by filesystems

struct request

Merged bio(s) dispatched to the driver

blk-mq

Tag sets, ops, and hardware queue management

struct bio

struct bio is the basic unit of block I/O. It represents a contiguous range of sectors on a block device and carries the data in a scatter-gather list of page vectors (bio_vec). Filesystems, the VM, and DM/MD layers submit bios directly; the block layer merges them into struct requests before dispatching to drivers.
struct bio {
    struct bio          *bi_next;      /* request queue link */
    struct block_device *bi_bdev;      /* target block device */
    blk_opf_t            bi_opf;       /* REQ_OP_* | req flags */
    unsigned short       bi_flags;     /* BIO_* status flags */
    unsigned short       bi_ioprio;    /* I/O priority */
    blk_status_t         bi_status;    /* completion status */
    struct bio_vec      *bi_io_vec;    /* scatter-gather list */
    struct bvec_iter     bi_iter;      /* current position iterator */
    bio_end_io_t        *bi_end_io;    /* completion callback */
    void                *bi_private;   /* caller private data */
    unsigned short       bi_vcnt;      /* number of bio_vecs */
    unsigned short       bi_max_vecs;  /* allocated bio_vecs */
};

Key bio Fields

bi_bdev
struct block_device *
The block device (partition or whole disk) this I/O targets. Set by the submitter; the block layer uses this to route to the correct request_queue.
bi_opf
blk_opf_t
Operation and flags. The low 8 bits encode enum req_op (REQ_OP_READ=0, REQ_OP_WRITE=1, REQ_OP_FLUSH=2, REQ_OP_DISCARD=3). The remaining 24 bits carry request flags such as REQ_SYNC, REQ_META, REQ_FUA, REQ_RAHEAD, and REQ_NOWAIT.
bi_iter
struct bvec_iter
Tracks the current position within the bio’s bio_vec list. Contains bi_sector (current sector), bi_size (remaining bytes), bi_idx (current bvec index), and bi_bvec_done (bytes done in current bvec).
bi_status
blk_status_t
Completion status set by the driver before calling bio->bi_end_io. BLK_STS_OK (0) on success. Other values include BLK_STS_IOERR, BLK_STS_TIMEOUT, BLK_STS_NOSPC.
bi_io_vec
struct bio_vec *
Array of scatter-gather segments. Each bio_vec has a bv_page, bv_len, and bv_offset. The bio carries data from pages described by these vectors.
bi_vcnt
unsigned short
Number of active bio_vec entries.
bi_end_io
bio_end_io_t *
Completion callback invoked by the block layer after the I/O finishes (success or failure). The callback must call bio_put() unless ownership is transferred.
bi_private
void *
Opaque pointer for the bio submitter’s private state. The block layer does not touch this field.

Iterating bio Segments

#include <linux/bio.h>

struct bio_vec bvec;
struct bvec_iter iter;

bio_for_each_segment(bvec, bio, iter) {
    /*
     * bvec.bv_page   - the page holding this segment
     * bvec.bv_offset - byte offset within the page
     * bvec.bv_len    - byte length of this segment
     */
    void *kaddr = kmap_local_page(bvec.bv_page);
    do_something(kaddr + bvec.bv_offset, bvec.bv_len);
    kunmap_local(kaddr);
}
bio_for_each_segment iterates the segments that remain according to bi_iter, so it honors any splitting or advancement the block layer has performed. bio_for_each_segment_all() instead walks every bio_vec in the bio, ignoring the iterator; it may only be used by the bio's owner and never on a split or cloned bio.

struct request

The block layer merges one or more adjacent bios into a struct request before dispatching to the driver. The driver sees only requests, not raw bios.
struct request {
    struct request_queue *q;            /* owning queue */
    struct blk_mq_ctx    *mq_ctx;       /* software queue context */
    struct blk_mq_hw_ctx *mq_hctx;      /* hardware queue context */
    blk_opf_t             cmd_flags;    /* REQ_OP_* | request flags */
    req_flags_t           rq_flags;     /* RQF_* internal flags */
    int                   tag;          /* hardware tag */
    int                   internal_tag; /* scheduler tag */
    unsigned int          timeout;      /* timeout in jiffies */
    unsigned int          __data_len;   /* total data length */
    sector_t              __sector;     /* sector cursor */
    struct bio           *bio;          /* first bio */
    struct bio           *biotail;      /* last bio */
    struct block_device  *part;         /* target partition */
    u64                   start_time_ns;
    u64                   io_start_time_ns;
    unsigned short        nr_phys_segments;
    enum mq_rq_state      state;        /* MQ_RQ_IDLE/IN_FLIGHT/COMPLETE */
    rq_end_io_fn         *end_io;       /* completion callback */
    void                 *end_io_data;
};

Key request Fields

cmd_flags
blk_opf_t
Operation code and flags, mirroring bio->bi_opf. Use req_op(rq) to extract the enum req_op value.
tag
int
Hardware tag allocated from blk_mq_tag_set. Unique within the hardware queue. Use blk_mq_rq_from_pdu() in completion paths to map driver-private data back to the request.
state
enum mq_rq_state
MQ_RQ_IDLE: not dispatched. MQ_RQ_IN_FLIGHT: dispatched to hardware. MQ_RQ_COMPLETE: completion is being processed.
timeout
unsigned int
Request timeout in jiffies. The block layer arms a watchdog; if the request does not complete within timeout jiffies, blk_mq_ops.timeout is called.
__data_len
unsigned int
Total byte count of data in this request. Access via blk_rq_bytes(rq); do not read __data_len directly, as it shrinks while portions of the request complete.
__sector
sector_t
Current sector position. Access via blk_rq_pos(rq). Each sector is 512 bytes.
nr_phys_segments
unsigned short
Number of DMA-mappable scatter-gather segments after physical address coalescing. Use this count when calling dma_map_sg().
bio / biotail
struct bio *
Linked list of merged bios. Iterate with __rq_for_each_bio(bio, rq) if per-bio processing is needed (rare in blk-mq drivers); rq_for_each_segment() walks the individual bio_vec segments instead.

Request Helper Macros

/* Get operation */
enum req_op op = req_op(rq);
bool is_write   = op_is_write(req_op(rq));

/* Get position and size */
sector_t start  = blk_rq_pos(rq);
unsigned bytes  = blk_rq_bytes(rq);
unsigned sectors = blk_rq_sectors(rq);

/* DMA direction */
enum dma_data_direction dir = rq_dma_dir(rq); /* DMA_TO_DEVICE or DMA_FROM_DEVICE */

/* Access driver-private data appended to the request (cmd_size bytes) */
void *pdu = blk_mq_rq_to_pdu(rq);
struct request *rq = blk_mq_rq_from_pdu(pdu);

Multi-Queue Block Layer (blk-mq)

blk-mq introduces a two-level queue hierarchy:
  • Software queues (blk_mq_ctx): one per CPU, lock-free submission.
  • Hardware queues (blk_mq_hw_ctx): one (or more) per hardware submission queue on the device.
The block layer maps software queues to hardware queues via blk_mq_queue_map. A device with a single hardware queue has every CPU mapped to it; NVMe controllers typically expose enough queues that each hardware queue serves one CPU (or a small group of CPUs sharing its completion interrupt).

blk_mq_tag_set

blk_mq_tag_set is the central configuration structure allocated once per driver instance and shared across all request_queues (e.g., all namespaces of an NVMe controller).
struct blk_mq_tag_set {
    const struct blk_mq_ops *ops;           /* driver callbacks */
    struct blk_mq_queue_map  map[HCTX_MAX_TYPES]; /* CPU→hctx mapping */
    unsigned int             nr_maps;       /* number of valid maps */
    unsigned int             nr_hw_queues;  /* hardware queue count */
    unsigned int             queue_depth;   /* tags per hw queue */
    unsigned int             reserved_tags; /* reserved for BLK_MQ_REQ_RESERVED */
    unsigned int             cmd_size;      /* extra bytes per request (driver PDU) */
    int                      numa_node;
    unsigned int             timeout;       /* request timeout (jiffies) */
    unsigned int             flags;         /* BLK_MQ_F_* */
    void                    *driver_data;
    struct blk_mq_tags     **tags;          /* per-hw-queue tag arrays */
};
ops
const struct blk_mq_ops *
required
Driver callback table. Must be set before calling blk_mq_alloc_tag_set().
nr_hw_queues
unsigned int
required
Number of hardware submission queues. Must match the hardware. For single-queue devices, set to 1.
queue_depth
unsigned int
required
Maximum number of in-flight requests per hardware queue (i.e., number of tags). Must not exceed the device’s hardware queue depth. Typical values: 128–1024 for NVMe, 32 for SATA.
cmd_size
unsigned int
Extra bytes appended to each struct request for driver-private data (the “PDU”). Access via blk_mq_rq_to_pdu(). Set to sizeof(struct my_driver_request).
flags
unsigned int
BLK_MQ_F_TAG_QUEUE_SHARED: share tags across all hardware queues. BLK_MQ_F_BLOCKING: the queue_rq callback may sleep (uses SRCU instead of RCU). BLK_MQ_F_NO_SCHED: disable I/O scheduler.

blk_mq_ops

The driver implements the following callbacks:
queue_rq
blk_status_t (*)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *)
Required. Submit a request to the hardware. The blk_mq_queue_data contains the request *rq and bool last (hint that this is the last request in a batch). Call blk_mq_start_request(rq) before handing the command to hardware, then return BLK_STS_OK. On transient resource exhaustion, return BLK_STS_DEV_RESOURCE (or BLK_STS_RESOURCE) and the block layer will retry the request later.
complete
void (*)(struct request *)
Optional, but required if the driver completes requests via blk_mq_complete_request(). Invoked on the CPU chosen to process the completion and typically calls blk_mq_end_request(rq, status).
init_hctx
int (*)(struct blk_mq_hw_ctx *, void *, unsigned int)
Called once per hardware queue after the queue is set up. The driver can allocate per-queue resources. Arguments: hctx, driver_data (from blk_mq_tag_set.driver_data), hctx_idx.
exit_hctx
void (*)(struct blk_mq_hw_ctx *, unsigned int)
Counterpart to init_hctx. Called during teardown.
init_request
int (*)(struct blk_mq_tag_set *, struct request *, unsigned int, unsigned int)
Called for every request allocated in the tag set. The driver can initialize the driver-private PDU (blk_mq_rq_to_pdu(rq)).
exit_request
void (*)(struct blk_mq_tag_set *, struct request *, unsigned int)
Counterpart to init_request.
commit_rqs
void (*)(struct blk_mq_hw_ctx *)
If the driver uses bd->last to decide when to ring the hardware doorbell, it must implement commit_rqs to flush any pending requests when an error occurs mid-batch.
timeout
enum blk_eh_timer_return (*)(struct request *)
Called when a request times out. Return BLK_EH_DONE if the driver handles it (will complete the request), or BLK_EH_RESET_TIMER to extend the deadline.
poll
int (*)(struct blk_mq_hw_ctx *, struct io_comp_batch *)
Enables polled I/O (REQ_POLLED). The kernel calls this in a tight loop instead of waiting for interrupts. Return the number of completions processed.
map_queues
void (*)(struct blk_mq_tag_set *)
Override the default CPU-to-hardware-queue mapping. Used by drivers with non-trivial queue topologies (e.g., NVMe's separate default, read, and poll queue maps, or mappings derived from PCI IRQ affinity).
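Wiring the callbacks together, a minimal ops table might look like the following kernel-side sketch (the my_* handlers are hypothetical driver functions in the style of the examples elsewhere on this page):

```c
static const struct blk_mq_ops my_blk_mq_ops = {
    .queue_rq     = my_queue_rq,      /* required: submit to hardware */
    .complete     = my_complete,      /* run completion on the right CPU */
    .init_request = my_init_request,  /* initialize the per-request PDU */
    .exit_request = my_exit_request,
    .timeout      = my_timeout,       /* watchdog expiry handler */
};
```

Only queue_rq is mandatory; the remaining members may be left NULL if the driver does not need them.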

Block Device Registration

1

Allocate and initialize the tag set

struct blk_mq_tag_set tag_set = {
    .ops         = &my_blk_mq_ops,
    .nr_hw_queues = num_online_cpus(),
    .queue_depth  = 128,
    .cmd_size     = sizeof(struct my_request_pdu),
    .flags        = BLK_MQ_F_TAG_QUEUE_SHARED,
    .numa_node    = NUMA_NO_NODE,
    .driver_data  = my_dev,
};

int ret = blk_mq_alloc_tag_set(&tag_set);
if (ret)
    return ret;
2

Allocate the gendisk

/*
 * blk_mq_alloc_disk() allocates a gendisk and creates the
 * associated request_queue with blk-mq support. Passing a NULL
 * queue_limits pointer applies the default limits.
 */
struct gendisk *disk = blk_mq_alloc_disk(&tag_set, NULL, my_dev);
if (IS_ERR(disk)) {
    ret = PTR_ERR(disk);
    goto free_tag_set;
}
3

Configure the gendisk

strscpy(disk->disk_name, "mydev0", DISK_NAME_LEN);
disk->major        = MY_MAJOR;
disk->first_minor  = 0;
disk->minors       = 1;
disk->fops         = &my_block_fops;
disk->private_data = my_dev;

set_capacity(disk, dev_size_in_sectors);

/* Set queue limits (on recent kernels, prefer passing a populated
 * struct queue_limits to blk_mq_alloc_disk() instead) */
blk_queue_max_hw_sectors(disk->queue, 1024);
blk_queue_logical_block_size(disk->queue, 512);
blk_queue_physical_block_size(disk->queue, 4096);
4

Add the disk

ret = add_disk(disk);
if (ret)
    goto cleanup_disk;
add_disk() makes the disk visible in /dev and triggers udev events. After this call, the kernel may immediately submit I/O to the device.
5

Teardown

del_gendisk(disk);             /* remove from sysfs and drain I/O */
put_disk(disk);                /* drop reference; frees the queue at 0 */
blk_mq_free_tag_set(&tag_set); /* free tags last */
Always call del_gendisk() first; it drains outstanding I/O before returning. Free the tag set only after put_disk(), because the request_queue released by the final reference drop still points into tag-set memory.

Request Lifecycle

/*
 * In queue_rq callback — mark request as in-flight:
 */
void blk_mq_start_request(struct request *rq);

/*
 * In the completion interrupt or poll callback — end the request:
 */
void blk_mq_end_request(struct request *rq, blk_status_t error);

/*
 * If completion runs on a different CPU and you want to run the
 * complete() callback on the issuing CPU:
 */
bool blk_mq_complete_request_remote(struct request *rq);
/* Typical queue_rq implementation: */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
    struct request *rq = bd->rq;
    struct my_request_pdu *pdu = blk_mq_rq_to_pdu(rq);

    blk_mq_start_request(rq);

    /*
     * Build and submit the command. struct request has no sg_table
     * field; the scatterlist lives in the driver PDU (pdu->sgl,
     * sized for queue_max_segments) and is filled by blk_rq_map_sg().
     */
    pdu->nents = blk_rq_map_sg(hctx->queue, rq, pdu->sgl);
    pdu->nents = dma_map_sg(&my_dev->pdev->dev, pdu->sgl, pdu->nents,
                            rq_dma_dir(rq));
    my_hw_submit_command(my_dev, rq);
    return BLK_STS_OK;
}

/* In the completion interrupt: */
static void my_irq_complete(struct my_dev *dev, u32 tag)
{
    struct request *rq = blk_mq_tag_to_rq(dev->tags, tag);
    blk_mq_end_request(rq, BLK_STS_OK);
}

I/O Schedulers

blk-mq supports pluggable I/O schedulers that reorder requests between the software queue and the hardware dispatch queue to optimize throughput or latency.

none

No reordering. Requests are dispatched in submission order. Best for NVMe and other devices with internal queuing and low latency. Removes all scheduler overhead.

mq-deadline

Deadline-based scheduler ported from the legacy single-queue layer. Maintains sector-sorted trees plus FIFO expiry lists for reads and writes to prevent starvation. The kernel's default for devices with a single hardware queue, such as SATA disks.

kyber

Token-bucket based scheduler targeting low-latency workloads. Maintains separate budgets for reads, writes, and other I/O. Designed for fast solid-state storage.

bfq

Budget Fair Queueing: provides per-process I/O fairness and prioritization. Well-suited for desktop and interactive workloads. Has higher CPU overhead than simpler schedulers.
Schedulers can be changed at runtime:
# Query available schedulers (current in brackets)
cat /sys/block/sda/queue/scheduler
# none mq-deadline kyber [bfq]

# Switch to mq-deadline
echo mq-deadline > /sys/block/sda/queue/scheduler
Drivers cannot assign a scheduler directly; the block layer chooses the default in elevator_init_mq(): mq-deadline for devices with a single hardware queue, none for multi-queue devices. A driver can opt out of scheduling entirely by setting BLK_MQ_F_NO_SCHED in its tag-set flags.
The none scheduler (i.e., no scheduler) is usually optimal for NVMe devices because the hardware manages its own internal queue ordering. Adding a software scheduler adds latency without benefit.
