The Linux block layer mediates all I/O between filesystems and storage devices. The modern multi-queue block layer (blk-mq) maps software submission queues (one per CPU) to hardware dispatch queues, eliminating the global queue lock that bottlenecked single-queue designs and enabling full parallelism on NVMe and other high-queue-depth devices.

struct bio

Basic unit of block I/O submitted by filesystems

struct request

Merged bio(s) dispatched to the driver

blk-mq

Tag sets, ops, and hardware queue management

struct bio

struct bio is the basic unit of block I/O. It represents a contiguous range of sectors on a block device and carries the data in a scatter-gather list of page vectors (bio_vec). Filesystems, the VM, and DM/MD layers submit bios directly; the block layer merges them into struct requests before dispatching to drivers.
struct bio {
    struct bio          *bi_next;      /* request queue link */
    struct block_device *bi_bdev;      /* target block device */
    blk_opf_t            bi_opf;       /* REQ_OP_* | req flags */
    unsigned short       bi_flags;     /* BIO_* status flags */
    unsigned short       bi_ioprio;    /* I/O priority */
    blk_status_t         bi_status;    /* completion status */
    struct bio_vec      *bi_io_vec;    /* scatter-gather list */
    struct bvec_iter     bi_iter;      /* current position iterator */
    bio_end_io_t        *bi_end_io;    /* completion callback */
    void                *bi_private;   /* caller private data */
    unsigned short       bi_vcnt;      /* number of bio_vecs */
    unsigned short       bi_max_vecs;  /* allocated bio_vecs */
};

Key bio Fields

bi_bdev
struct block_device *
The block device (partition or whole disk) this I/O targets. Set by the submitter; the block layer uses this to route to the correct request_queue.
bi_opf
blk_opf_t
Operation and flags. The low 8 bits encode enum req_op (REQ_OP_READ=0, REQ_OP_WRITE=1, REQ_OP_FLUSH=2, REQ_OP_DISCARD=3). The remaining 24 bits carry request flags such as REQ_SYNC, REQ_META, REQ_FUA, REQ_RAHEAD, and REQ_NOWAIT.
bi_iter
struct bvec_iter
Tracks the current position within the bio’s bio_vec list. Contains bi_sector (current sector), bi_size (remaining bytes), bi_idx (current bvec index), and bi_bvec_done (bytes done in current bvec).
bi_status
blk_status_t
Completion status set by the driver before calling bio->bi_end_io. BLK_STS_OK (0) on success. Other values include BLK_STS_IOERR, BLK_STS_TIMEOUT, BLK_STS_NOSPC.
bi_io_vec
struct bio_vec *
Array of scatter-gather segments. Each bio_vec has a bv_page, bv_len, and bv_offset. The bio carries data from pages described by these vectors.
bi_vcnt
unsigned short
Number of active bio_vec entries.
bi_end_io
bio_end_io_t *
Completion callback invoked by the block layer after the I/O finishes (success or failure). The callback must call bio_put() unless ownership is transferred.
bi_private
void *
Opaque pointer for the bio submitter’s private state. The block layer does not touch this field.

Iterating bio Segments

#include <linux/bio.h>

struct bio_vec bvec;
struct bvec_iter iter;

bio_for_each_segment(bvec, bio, iter) {
    /*
     * bvec.bv_page   - the page holding this segment
     * bvec.bv_offset - byte offset within the page
     * bvec.bv_len    - byte length of this segment
     */
    void *kaddr = kmap_local_page(bvec.bv_page);
    do_something(kaddr + bvec.bv_offset, bvec.bv_len);
    kunmap_local(kaddr);
}
bio_for_each_segment iterates the segments that remain according to bi_iter, so it honors any splitting or advancement the block layer has performed. bio_for_each_segment_all() instead walks every bio_vec in the bio, ignoring the iterator; it may only be used by the bio's owner and never on a split or cloned bio.

struct request

The block layer merges one or more adjacent bios into a struct request before dispatching to the driver. The driver sees only requests, not raw bios.
struct request {
    struct request_queue *q;            /* owning queue */
    struct blk_mq_ctx    *mq_ctx;       /* software queue context */
    struct blk_mq_hw_ctx *mq_hctx;      /* hardware queue context */
    blk_opf_t             cmd_flags;    /* REQ_OP_* | request flags */
    req_flags_t           rq_flags;     /* RQF_* internal flags */
    int                   tag;          /* hardware tag */
    int                   internal_tag; /* scheduler tag */
    unsigned int          timeout;      /* timeout in jiffies */
    unsigned int          __data_len;   /* total data length */
    sector_t              __sector;     /* sector cursor */
    struct bio           *bio;          /* first bio */
    struct bio           *biotail;      /* last bio */
    struct block_device  *part;         /* target partition */
    u64                   start_time_ns;
    u64                   io_start_time_ns;
    unsigned short        nr_phys_segments;
    enum mq_rq_state      state;        /* MQ_RQ_IDLE/IN_FLIGHT/COMPLETE */
    rq_end_io_fn         *end_io;       /* completion callback */
    void                 *end_io_data;
};

Key request Fields

cmd_flags
blk_opf_t
Operation code and flags, mirroring bio->bi_opf. Use req_op(rq) to extract the enum req_op value.
tag
int
Hardware tag allocated from blk_mq_tag_set. Unique within the hardware queue. Use blk_mq_rq_from_pdu() in completion paths to map driver-private data back to the request.
state
enum mq_rq_state
MQ_RQ_IDLE: not dispatched. MQ_RQ_IN_FLIGHT: dispatched to hardware. MQ_RQ_COMPLETE: completion is being processed.
timeout
unsigned int
Request timeout in jiffies. The block layer arms a watchdog; if the request does not complete within timeout jiffies, blk_mq_ops.timeout is called.
__data_len
unsigned int
Total byte count of data in this request. Access via blk_rq_bytes(rq); do not read __data_len directly, as it shrinks while portions of the request complete.
__sector
sector_t
Current sector position. Access via blk_rq_pos(rq). Each sector is 512 bytes.
nr_phys_segments
unsigned short
Number of DMA-mappable scatter-gather segments after physical address coalescing. Use this count when calling dma_map_sg().
bio / biotail
struct bio *
Linked list of merged bios. Iterate with __rq_for_each_bio(bio, rq) if per-bio processing is needed (rare in blk-mq drivers); rq_for_each_segment() walks the individual bio_vec segments instead.

Request Helper Macros

/* Get operation */
enum req_op op = req_op(rq);
bool is_write   = op_is_write(req_op(rq));

/* Get position and size */
sector_t start  = blk_rq_pos(rq);
unsigned bytes  = blk_rq_bytes(rq);
unsigned sectors = blk_rq_sectors(rq);

/* DMA direction */
enum dma_data_direction dir = rq_dma_dir(rq); /* DMA_TO_DEVICE or DMA_FROM_DEVICE */

/* Access driver-private data appended to the request (cmd_size bytes) */
void *pdu = blk_mq_rq_to_pdu(rq);
struct request *rq = blk_mq_rq_from_pdu(pdu);

Multi-Queue Block Layer (blk-mq)

blk-mq introduces a two-level queue hierarchy:
  • Software queues (blk_mq_ctx): one per CPU, lock-free submission.
  • Hardware queues (blk_mq_hw_ctx): one (or more) per hardware submission queue on the device.
The block layer maps software queues to hardware queues via blk_mq_queue_map. A device with a single hardware queue has every CPU mapped to it; NVMe controllers typically expose enough queues that each hardware queue serves one CPU (or a small group of CPUs sharing its completion interrupt).

blk_mq_tag_set

blk_mq_tag_set is the central configuration structure allocated once per driver instance and shared across all request_queues (e.g., all namespaces of an NVMe controller).
struct blk_mq_tag_set {
    const struct blk_mq_ops *ops;           /* driver callbacks */
    struct blk_mq_queue_map  map[HCTX_MAX_TYPES]; /* CPU→hctx mapping */
    unsigned int             nr_maps;       /* number of valid maps */
    unsigned int             nr_hw_queues;  /* hardware queue count */
    unsigned int             queue_depth;   /* tags per hw queue */
    unsigned int             reserved_tags; /* reserved for BLK_MQ_REQ_RESERVED */
    unsigned int             cmd_size;      /* extra bytes per request (driver PDU) */
    int                      numa_node;
    unsigned int             timeout;       /* request timeout (jiffies) */
    unsigned int             flags;         /* BLK_MQ_F_* */
    void                    *driver_data;
    struct blk_mq_tags     **tags;          /* per-hw-queue tag arrays */
};
ops
const struct blk_mq_ops *
required
Driver callback table. Must be set before calling blk_mq_alloc_tag_set().
nr_hw_queues
unsigned int
required
Number of hardware submission queues. Must match the hardware. For single-queue devices, set to 1.
queue_depth
unsigned int
required
Maximum number of in-flight requests per hardware queue (i.e., number of tags). Must not exceed the device’s hardware queue depth. Typical values: 128–1024 for NVMe, 32 for SATA.
cmd_size
unsigned int
Extra bytes appended to each struct request for driver-private data (the “PDU”). Access via blk_mq_rq_to_pdu(). Set to sizeof(struct my_driver_request).
flags
unsigned int
BLK_MQ_F_TAG_QUEUE_SHARED: share tags across all hardware queues. BLK_MQ_F_BLOCKING: the queue_rq callback may sleep (uses SRCU instead of RCU). BLK_MQ_F_NO_SCHED: disable I/O scheduler.

blk_mq_ops

The driver implements the following callbacks:
queue_rq
blk_status_t (*)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *)
Required. Submit a request to the hardware. The blk_mq_queue_data contains the request *rq and bool last (hint that this is the last request in a batch). Call blk_mq_start_request(rq) before handing the command to hardware, then return BLK_STS_OK. On transient resource exhaustion, return BLK_STS_DEV_RESOURCE (or BLK_STS_RESOURCE) and the block layer will retry the request later.
complete
void (*)(struct request *)
Optional, but required if the driver completes requests via blk_mq_complete_request(). Invoked on the CPU chosen to process the completion and typically calls blk_mq_end_request(rq, status).
init_hctx
int (*)(struct blk_mq_hw_ctx *, void *, unsigned int)
Called once per hardware queue after the queue is set up. The driver can allocate per-queue resources. Arguments: hctx, driver_data (from blk_mq_tag_set.driver_data), hctx_idx.
exit_hctx
void (*)(struct blk_mq_hw_ctx *, unsigned int)
Counterpart to init_hctx. Called during teardown.
init_request
int (*)(struct blk_mq_tag_set *, struct request *, unsigned int, unsigned int)
Called for every request allocated in the tag set. The driver can initialize the driver-private PDU (blk_mq_rq_to_pdu(rq)).
exit_request
void (*)(struct blk_mq_tag_set *, struct request *, unsigned int)
Counterpart to init_request.
commit_rqs
void (*)(struct blk_mq_hw_ctx *)
If the driver uses bd->last to decide when to ring the hardware doorbell, it must implement commit_rqs to flush any pending requests when an error occurs mid-batch.
timeout
enum blk_eh_timer_return (*)(struct request *)
Called when a request times out. Return BLK_EH_DONE if the driver handles it (will complete the request), or BLK_EH_RESET_TIMER to extend the deadline.
poll
int (*)(struct blk_mq_hw_ctx *, struct io_comp_batch *)
Enables polled I/O (REQ_POLLED). The kernel calls this in a tight loop instead of waiting for interrupts. Return the number of completions processed.
map_queues
void (*)(struct blk_mq_tag_set *)
Override the default CPU-to-hardware-queue mapping. Used by drivers with non-trivial queue topologies (e.g., NVMe's separate default, read, and poll queue maps, or mappings derived from PCI IRQ affinity).
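Wiring the callbacks together, a minimal ops table might look like the following kernel-side sketch (the my_* handlers are hypothetical driver functions in the style of the examples elsewhere on this page):

```c
static const struct blk_mq_ops my_blk_mq_ops = {
    .queue_rq     = my_queue_rq,      /* required: submit to hardware */
    .complete     = my_complete,      /* run completion on the right CPU */
    .init_request = my_init_request,  /* initialize the per-request PDU */
    .exit_request = my_exit_request,
    .timeout      = my_timeout,       /* watchdog expiry handler */
};
```

Only queue_rq is mandatory; the remaining members may be left NULL if the driver does not need them.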

Block Device Registration

1

Allocate and initialize the tag set

struct blk_mq_tag_set tag_set = {
    .ops         = &my_blk_mq_ops,
    .nr_hw_queues = num_online_cpus(),
    .queue_depth  = 128,
    .cmd_size     = sizeof(struct my_request_pdu),
    .flags        = BLK_MQ_F_TAG_QUEUE_SHARED,
    .numa_node    = NUMA_NO_NODE,
    .driver_data  = my_dev,
};

int ret = blk_mq_alloc_tag_set(&tag_set);
if (ret)
    return ret;
2

Allocate the gendisk

/*
 * blk_mq_alloc_disk() allocates a gendisk and creates the
 * associated request_queue with blk-mq support. Passing a NULL
 * queue_limits pointer applies the default limits.
 */
struct gendisk *disk = blk_mq_alloc_disk(&tag_set, NULL, my_dev);
if (IS_ERR(disk)) {
    ret = PTR_ERR(disk);
    goto free_tag_set;
}
3

Configure the gendisk

strscpy(disk->disk_name, "mydev0", DISK_NAME_LEN);
disk->major        = MY_MAJOR;
disk->first_minor  = 0;
disk->minors       = 1;
disk->fops         = &my_block_fops;
disk->private_data = my_dev;

set_capacity(disk, dev_size_in_sectors);

/* Set queue limits (on recent kernels, prefer passing a populated
 * struct queue_limits to blk_mq_alloc_disk() instead) */
blk_queue_max_hw_sectors(disk->queue, 1024);
blk_queue_logical_block_size(disk->queue, 512);
blk_queue_physical_block_size(disk->queue, 4096);
4

Add the disk

ret = add_disk(disk);
if (ret)
    goto cleanup_disk;
add_disk() makes the disk visible in /dev and triggers udev events. After this call, the kernel may immediately submit I/O to the device.
5

Teardown

del_gendisk(disk);             /* remove from sysfs and drain I/O */
put_disk(disk);                /* drop reference; frees the queue at 0 */
blk_mq_free_tag_set(&tag_set); /* free tags last */
Always call del_gendisk() first; it drains outstanding I/O before returning. Free the tag set only after put_disk(), because the request_queue released by the final reference drop still points into tag-set memory.

Request Lifecycle

/*
 * In queue_rq callback — mark request as in-flight:
 */
void blk_mq_start_request(struct request *rq);

/*
 * In the completion interrupt or poll callback — end the request:
 */
void blk_mq_end_request(struct request *rq, blk_status_t error);

/*
 * If completion runs on a different CPU and you want to run the
 * complete() callback on the issuing CPU:
 */
bool blk_mq_complete_request_remote(struct request *rq);
/* Typical queue_rq implementation: */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
    struct request *rq = bd->rq;
    struct my_request_pdu *pdu = blk_mq_rq_to_pdu(rq);

    blk_mq_start_request(rq);

    /*
     * Build and submit the command. struct request has no sg_table
     * field; the scatterlist lives in the driver PDU (pdu->sgl,
     * sized for queue_max_segments) and is filled by blk_rq_map_sg().
     */
    pdu->nents = blk_rq_map_sg(hctx->queue, rq, pdu->sgl);
    pdu->nents = dma_map_sg(&my_dev->pdev->dev, pdu->sgl, pdu->nents,
                            rq_dma_dir(rq));
    my_hw_submit_command(my_dev, rq);
    return BLK_STS_OK;
}

/* In the completion interrupt: */
static void my_irq_complete(struct my_dev *dev, u32 tag)
{
    struct request *rq = blk_mq_tag_to_rq(dev->tags, tag);
    blk_mq_end_request(rq, BLK_STS_OK);
}

I/O Schedulers

blk-mq supports pluggable I/O schedulers that reorder requests between the software queue and the hardware dispatch queue to optimize throughput or latency.

none

No reordering. Requests are dispatched in submission order. Best for NVMe and other devices with internal queuing and low latency. Removes all scheduler overhead.

mq-deadline

Deadline-based scheduler ported from the legacy single-queue layer. Maintains sector-sorted trees plus FIFO expiry lists for reads and writes to prevent starvation. The kernel's default for devices with a single hardware queue, such as SATA disks.

kyber

Token-bucket based scheduler targeting low-latency workloads. Maintains separate budgets for reads, writes, and other I/O. Designed for fast solid-state storage.

bfq

Budget Fair Queueing: provides per-process I/O fairness and prioritization. Well-suited for desktop and interactive workloads. Has higher CPU overhead than simpler schedulers.
Schedulers can be changed at runtime:
# Query available schedulers (current in brackets)
cat /sys/block/sda/queue/scheduler
# none mq-deadline kyber [bfq]

# Switch to mq-deadline
echo mq-deadline > /sys/block/sda/queue/scheduler
Drivers cannot assign a scheduler directly; the block layer chooses the default in elevator_init_mq(): mq-deadline for devices with a single hardware queue, none for multi-queue devices. A driver can opt out of scheduling entirely by setting BLK_MQ_F_NO_SCHED in its tag-set flags.
The none scheduler (i.e., no scheduler) is usually optimal for NVMe devices because the hardware manages its own internal queue ordering. Adding a software scheduler adds latency without benefit.
