struct bio
Basic unit of block I/O submitted by filesystems
struct request
Merged bio(s) dispatched to the driver
blk-mq
Tag sets, ops, and hardware queue management
struct bio
struct bio is the basic unit of block I/O. It represents a contiguous range of sectors on a block device and carries the data in a scatter-gather list of page vectors (bio_vec). Filesystems, the VM, and DM/MD layers submit bios directly; the block layer merges them into struct requests before dispatching to drivers.
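As a sketch of how a submitter builds and sends a bio (assuming a kernel around 5.18+ for the bio_alloc() signature; the bdev and page are set up by the caller, and error handling is trimmed):

```c
/* Sketch only; adjust bio_alloc() arguments for older kernels. */
static void my_end_io(struct bio *bio)
{
        /* bi_status holds the result; drop our reference. */
        if (bio->bi_status)
                pr_err("I/O failed: %d\n", bio->bi_status);
        bio_put(bio);
}

static void submit_one_page(struct block_device *bdev, struct page *page,
                            sector_t sector)
{
        struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_KERNEL);

        bio->bi_iter.bi_sector = sector;  /* where on the device */
        bio_add_page(bio, page, PAGE_SIZE, 0);
        bio->bi_end_io = my_end_io;       /* completion callback */
        submit_bio(bio);                  /* hand off to the block layer */
}
```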
Key bio Fields
I/O target and operation
bi_bdev: The block device (partition or whole disk) this I/O targets. Set by the submitter; the block layer uses it to route the bio to the correct request_queue.
bi_opf: Operation and flags. The low 8 bits encode enum req_op (REQ_OP_READ=0, REQ_OP_WRITE=1, REQ_OP_FLUSH=2, REQ_OP_DISCARD=3). The remaining 24 bits carry request flags such as REQ_SYNC, REQ_META, REQ_FUA, REQ_RAHEAD, and REQ_NOWAIT.
bi_iter: Tracks the current position within the bio's bio_vec list. Contains bi_sector (current sector), bi_size (remaining bytes), bi_idx (current bvec index), and bi_bvec_done (bytes done in the current bvec).
bi_status: Completion status set by the driver before bio->bi_end_io is called. BLK_STS_OK (0) on success; other values include BLK_STS_IOERR, BLK_STS_TIMEOUT, and BLK_STS_NOSPC.
Data vectors
bi_io_vec: Array of scatter-gather segments. Each bio_vec has a bv_page, bv_len, and bv_offset; the bio's data lives in the pages these vectors describe.
bi_vcnt: Number of active bio_vec entries.
bi_end_io: Completion callback invoked by the block layer after the I/O finishes (success or failure). The callback must call bio_put() unless ownership is transferred.
bi_private: Opaque pointer for the bio submitter's private state. The block layer does not touch this field.
Iterating bio Segments
bio_for_each_segment() iterates the segments still described by bi_iter, i.e., the logical segments as submitted; this is what drivers normally want. bio_for_each_segment_all() instead walks every bio_vec from index 0 regardless of bi_iter, and is only valid for the bio's owner (it must not be used on cloned bios).

struct request
The block layer merges one or more adjacent bios into a struct request before dispatching to the driver. The driver sees only requests, not raw bios.
Key request Fields
Identity and state
cmd_flags: Operation code and flags, mirroring bio->bi_opf. Use req_op(rq) to extract the enum req_op value.
tag: Hardware tag allocated from the blk_mq_tag_set; unique within the hardware queue. Use blk_mq_rq_from_pdu() in completion paths to map driver-private data back to the request.
state: MQ_RQ_IDLE: not dispatched. MQ_RQ_IN_FLIGHT: dispatched to hardware. MQ_RQ_COMPLETE: completion is being processed.
timeout: Request timeout in jiffies. The block layer arms a watchdog; if the request does not complete within timeout jiffies, blk_mq_ops.timeout is called.
Data geometry
__data_len: Total byte count of data in this request. Access via blk_rq_bytes(rq); do not read __data_len directly, as it is zeroed during completion.
__sector: Current sector position. Access via blk_rq_pos(rq). Each sector is 512 bytes.
nr_phys_segments: Number of DMA-mappable scatter-gather segments after physical address coalescing. Use this count when calling dma_map_sg().
bio: Linked list of merged bios. Iterate with __rq_for_each_bio(bio, rq) if per-bio processing is needed (rare in blk-mq drivers).

Request Helper Macros
Multi-Queue Block Layer (blk-mq)
blk-mq introduces a two-level queue hierarchy:

- Software queues (blk_mq_ctx): one per CPU, lock-free submission.
- Hardware queues (blk_mq_hw_ctx): one (or more) per hardware submission queue on the device.
The mapping from CPUs to hardware queues is described by blk_mq_queue_map. For a device with a single hardware queue, all CPUs map to it. For NVMe with 32 queues, each queue is typically pinned to a subset of CPUs, often within a NUMA node.
blk_mq_tag_set
blk_mq_tag_set is the central configuration structure allocated once per driver instance and shared across all request_queues (e.g., all namespaces of an NVMe controller).
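A driver typically fills the tag set once at probe time, along the lines of this hypothetical sketch (struct my_dev, my_mq_ops, struct my_request_pdu, and the specific values are placeholders):

```c
/* Hypothetical probe-time sketch; error handling abbreviated. */
static int my_setup_tagset(struct my_dev *dev)
{
        struct blk_mq_tag_set *set = &dev->tag_set;

        memset(set, 0, sizeof(*set));
        set->ops = &my_mq_ops;                    /* blk_mq_ops callbacks */
        set->nr_hw_queues = 1;                    /* single-queue device */
        set->queue_depth = 128;                   /* <= hardware queue depth */
        set->numa_node = NUMA_NO_NODE;
        set->cmd_size = sizeof(struct my_request_pdu); /* per-request PDU */
        set->flags = BLK_MQ_F_NO_SCHED;

        return blk_mq_alloc_tag_set(set);         /* validates and allocates tags */
}
```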
ops: Driver callback table. Must be set before calling blk_mq_alloc_tag_set().
nr_hw_queues: Number of hardware submission queues. Must match the hardware. For single-queue devices, set to 1.
queue_depth: Maximum number of in-flight requests per hardware queue (i.e., number of tags). Must not exceed the device's hardware queue depth. Typical values: 128–1024 for NVMe, 32 for SATA.
cmd_size: Extra bytes appended to each struct request for driver-private data (the "PDU"). Access via blk_mq_rq_to_pdu(). Set to sizeof(struct my_driver_request).
flags: BLK_MQ_F_TAG_QUEUE_SHARED: share tags across all hardware queues. BLK_MQ_F_BLOCKING: the queue_rq callback may sleep (uses SRCU instead of RCU). BLK_MQ_F_NO_SCHED: disable the I/O scheduler.

blk_mq_ops
The driver implements the following callbacks:
Required callbacks
queue_rq: Required. Submit a request to the hardware. The blk_mq_queue_data argument contains the request *rq and a bool last (a hint that this is the last request in a batch). On success, call blk_mq_start_request(rq), initiate DMA, and return BLK_STS_OK. On transient resource exhaustion, return BLK_STS_DEV_RESOURCE; the block layer will retry.
complete: Required. Called to finish a request; the block layer steers the completion back to the CPU that submitted it. Typically calls blk_mq_end_request(rq, status).
Initialization callbacks
init_hctx: Called once per hardware queue after the queue is set up; the driver can allocate per-queue resources. Arguments: hctx, driver_data (from blk_mq_tag_set.driver_data), and hctx_idx.
exit_hctx: Counterpart to init_hctx; called during teardown.
init_request: Called for every request allocated in the tag set; the driver can initialize the driver-private PDU (blk_mq_rq_to_pdu(rq)).
exit_request: Counterpart to init_request.
Optional callbacks
commit_rqs: If the driver uses bd->last to decide when to ring the hardware doorbell, it must implement commit_rqs to flush any pending requests when an error occurs mid-batch.
timeout: Called when a request times out. Return BLK_EH_DONE if the driver handles it (and will complete the request), or BLK_EH_RESET_TIMER to extend the deadline.
poll: Enables polled I/O (REQ_POLLED). The kernel calls this in a tight loop instead of waiting for interrupts. Return the number of completions processed.
map_queues: Override the default CPU-to-hardware-queue mapping. Used by drivers with non-trivial queue topologies (e.g., NVMe multipath).
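A minimal queue_rq might look like the following sketch (my_dev, my_hw_submit, and the PDU layout are hypothetical; the pattern of starting the request before posting to hardware follows common drivers such as virtio-blk):

```c
/* Hypothetical sketch; assumes my_hw_submit() posts the request to the
 * device and returns -EBUSY when the hardware ring is full. */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;
        struct my_dev *dev = hctx->queue->queuedata;
        struct my_request_pdu *pdu = blk_mq_rq_to_pdu(rq);

        blk_mq_start_request(rq);            /* arms the timeout watchdog */

        if (my_hw_submit(dev, rq, pdu) == -EBUSY)
                return BLK_STS_DEV_RESOURCE; /* block layer will retry */

        return BLK_STS_OK;
}
```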
Block Device Registration
Add the disk
add_disk() makes the disk visible in /dev and triggers udev events. After this call, the kernel may immediately submit I/O to the device.

Request Lifecycle
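The lifecycle from registration onward can be traced in a hypothetical sketch (blk_mq_alloc_disk() is the modern allocation helper, whose signature has changed across kernel versions; my_-prefixed names are placeholders):

```c
/* Hypothetical sketch (blk_mq_alloc_disk() as in ~5.15–6.8 kernels). */
static int my_register_disk(struct my_dev *dev)
{
        struct gendisk *disk;

        disk = blk_mq_alloc_disk(&dev->tag_set, dev); /* queue + gendisk */
        if (IS_ERR(disk))
                return PTR_ERR(disk);

        disk->fops = &my_block_ops;
        disk->private_data = dev;
        sprintf(disk->disk_name, "myblk0");
        set_capacity(disk, dev->nr_sectors);

        /* After add_disk(), I/O may arrive immediately:
         * submit_bio() -> bio merged into a request -> my_queue_rq()
         * -> hardware completes -> blk_mq_complete_request()
         * -> .complete -> blk_mq_end_request(). */
        return add_disk(disk);
}
```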
I/O Schedulers
blk-mq supports pluggable I/O schedulers that reorder requests between the software queue and the hardware dispatch queue to optimize throughput or latency.

none
No reordering. Requests are dispatched in submission order. Best for NVMe and other devices with internal queuing and low latency. Removes all scheduler overhead.
mq-deadline
Deadline-based scheduler ported from the legacy single-queue layer. Maintains sorted read and write trees and enforces deadline expiry to prevent starvation. Default for many rotating disk workloads.
kyber
Token-bucket based scheduler targeting low-latency workloads. Maintains separate budgets for reads, writes, and other I/O. Designed for fast solid-state storage.
bfq
Budget Fair Queueing: provides per-process I/O fairness and prioritization. Well-suited for desktop and interactive workloads. Has higher CPU overhead than simpler schedulers.
The none scheduler (i.e., no scheduler) is usually optimal for NVMe devices because the hardware manages its own internal queue ordering; adding a software scheduler adds latency without benefit.