Re: [PATCH v4 7/7] ext4: Add atomic block write documentation

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 15 May 2025 09:58:36 -0700

On Thu, May 15, 2025 at 08:15:39PM +0530, Ritesh Harjani (IBM) wrote:
> Add initial documentation for atomic write support in ext4.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> ---
>  .../filesystems/ext4/atomic_writes.rst        | 220 ++++++++++++++++++
>  Documentation/filesystems/ext4/overview.rst   |   1 +
>  2 files changed, 221 insertions(+)
>  create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
> 
> diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
> new file mode 100644
> index 000000000000..de54eeb6aaae
> --- /dev/null
> +++ b/Documentation/filesystems/ext4/atomic_writes.rst
> @@ -0,0 +1,220 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. _atomic_writes:
> +
> +Atomic Block Writes
> +-------------------------
> +
> +Introduction
> +~~~~~~~~~~~~
> +
> +Atomic (untorn) block writes ensure that either the entire write is committed
> +to disk or none of it is. This prevents "torn writes" during power loss or
> +system crashes. The ext4 filesystem supports atomic writes (only with Direct
> +I/O) on regular files with extents, provided the underlying storage device
> +supports hardware atomic writes. This is supported in the following two ways:
> +
> +1. **Single-fsblock Atomic Writes**:
> +   EXT4 has supported atomic write operations on a single filesystem block
> +   since v6.13. In this mode, the atomic write unit minimum and maximum sizes
> +   are both set to the filesystem block size.
> +   For example, a 16KB atomic write is possible with a 16KB filesystem block
> +   size on a system with a 64KB page size.
> +
> +2. **Multi-fsblock Atomic Writes with Bigalloc**:
> +   EXT4 now also supports atomic writes spanning multiple filesystem blocks
> +   using a feature known as bigalloc. The atomic write unit's minimum and
> +   maximum sizes are determined by the filesystem block size and cluster size,
> +   based on the underlying device’s supported atomic write unit limits.
> +
> +Requirements
> +~~~~~~~~~~~~
> +
> +Basic requirements for atomic writes in ext4:
> +
> + 1. The extents feature must be enabled (default for ext4)
> + 2. The underlying block device must support atomic writes
> + 3. For single-fsblock atomic writes:
> +
> +    1. A filesystem with an appropriate block size (up to the page size)
> + 4. For multi-fsblock atomic writes:
> +
> +    1. The bigalloc feature must be enabled
> +    2. The cluster size must be appropriately configured
> +
> +NOTE: EXT4 does not support software or COW based atomic writes, which means
> +atomic writes on ext4 are only supported if the underlying storage device
> +supports them.
> +
> +Multi-fsblock Implementation Details
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The bigalloc feature changes ext4 to allocate in units of multiple filesystem
> +blocks, also known as clusters. With bigalloc each bit within block bitmap
> +represents cluster (power of 2 number of blocks) rather than individual

Nit: "...represents one cluster"

With that fixed,
Acked-by: "Darrick J. Wong" <djwong@xxxxxxxxxx>

--D

> +filesystem blocks.
> +EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
> +following constraints. The minimum atomic write size is the larger of the fs
> +block size and the minimum hardware atomic write unit; and the maximum atomic
> +write size is the smaller of the bigalloc cluster size and the maximum hardware
> +atomic write unit.  Bigalloc ensures that all allocations are aligned to the
> +cluster size, which satisfies the LBA alignment requirements of the hardware
> +device if the start of the partition/logical volume is itself aligned correctly.
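
To make the sizing rule above concrete, here is a minimal userspace sketch of
the min/max computation. It is not the ext4 implementation: the struct and
function names are made up for illustration, and treating an empty range as
"unsupported" is an assumption on my part.

```c
#include <stdint.h>

/* Hypothetical type for illustration; not an ext4 structure. */
struct awu_limits {
    uint32_t min;   /* smallest atomic write in bytes, 0 if unsupported */
    uint32_t max;   /* largest atomic write in bytes, 0 if unsupported */
};

/*
 * min = larger of the fs block size and the hardware minimum unit
 * max = smaller of the bigalloc cluster size and the hardware maximum unit
 * All sizes are in bytes.
 */
struct awu_limits awu_limits_sketch(uint32_t fs_block, uint32_t cluster,
                                    uint32_t hw_min, uint32_t hw_max)
{
    struct awu_limits l = { 0, 0 };
    uint32_t lo, hi;

    if (hw_min == 0 || hw_max == 0)
        return l;   /* device reports no atomic write support */

    lo = fs_block > hw_min ? fs_block : hw_min;
    hi = cluster < hw_max ? cluster : hw_max;
    if (lo > hi)
        return l;   /* assumption: an empty range means unsupported */

    l.min = lo;
    l.max = hi;
    return l;
}
```

With a 4KB block size, a 64KB cluster size and a device advertising 512-byte
to 16KB atomic write units, this yields the range [4KB, 16KB].
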
> +
> +Here is the block allocation strategy in bigalloc for atomic writes:
> +
> + * For regions with fully mapped extents, no additional work is needed
> + * For append writes, a new mapped extent is allocated
> + * For regions that are entirely holes, an unwritten extent is created
> + * For large unwritten extents, the extent gets split into two unwritten
> +   extents of the appropriate requested size
> + * For mixed mapping regions (combinations of holes, unwritten extents, or
> +   mapped extents), ext4_map_blocks() is called in a loop with
> +   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
> +   mapped extent by writing zeroes to it and converting any unwritten extents to
> +   written, if found within the range.
> +
> +Note: Writing to a single contiguous underlying extent, whether mapped or
> +unwritten, is not inherently problematic. However, writing to a mixed mapping
> +region (i.e. one containing a combination of mapped and unwritten extents)
> +must be avoided when performing atomic writes.
> +
> +The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
> +flag, require that either all data is written or none at all. In the event of
> +a system crash or unexpected power loss during the write operation, the affected
> +region (when later read) must reflect either the complete old data or the
> +complete new data, but never a mix of both.
> +
> +To enforce this guarantee, we ensure that the write target is backed by
> +a single, contiguous extent before any data is written. This is critical because
> +ext4 defers the conversion of unwritten extents to written extents until the I/O
> +completion path (typically in ->end_io()). If a write is allowed to proceed over
> +a mixed mapping region (with mapped and unwritten extents) and a failure occurs
> +mid-write, the system could observe partially updated regions after reboot, i.e.
> +new data over mapped areas, and stale (old) data over unwritten extents that
> +were never marked written. This violates the atomicity and/or torn write
> +prevention guarantee.
> +
> +To prevent such torn writes, ext4 proactively allocates a single contiguous
> +extent for the entire requested region in ``ext4_iomap_alloc()`` via
> +``ext4_map_blocks_atomic()``. Only after this allocation is the write
> +operation performed by iomap.
> +
> +Handling Split Extents Across Leaf Blocks
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +There can be a special edge case where we have logically and physically
> +contiguous extents stored in separate leaf nodes of the on-disk extent tree.
> +This occurs because on-disk extent tree merges only happen within a leaf
> +block, except for the case where a 2-level tree can get merged and
> +collapsed entirely into the inode.
> +If such a layout exists and, in the worst case, the extent status cache entries
> +are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
> +a single contiguous extent for these split leaf extents.
> +
> +To address this edge case, a new get block flag,
> +``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
> +``ext4_map_query_blocks()`` lookup behavior.
> +
> +This new get block flag allows ``ext4_map_blocks()`` to first check if there is
> +an entry in the extent status cache for the full range.
> +If not present, it consults the on-disk extent tree using
> +``ext4_map_query_blocks()``.
> +If the located extent is at the end of a leaf node, it probes the next logical
> +block (lblk) to detect a contiguous extent in the adjacent leaf.
> +
> +For now only one additional leaf block is queried to maintain efficiency, as
> +atomic writes are typically constrained to small sizes
> +(e.g. [blocksize, clustersize]).
> +
> +
> +Handling Journal Transactions
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +To support multi-fsblock atomic writes, we ensure enough journal credits are
> +reserved during:
> +
> + 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
> +    could be a mixed mapping for the underlying requested range. If yes, then we
> +    reserve credits of up to ``m_len``, assuming every alternate block can be
> +    an unwritten extent followed by a hole.
> +
> + 2. During the ``->end_io()`` call, we make sure a single transaction is
> +    started for doing unwritten-to-written conversion. The loop for conversion
> +    is mainly required to handle a split extent across leaf blocks.
> +
> +How to
> +------
> +
> +Creating Filesystems with Atomic Write Support
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +First check the atomic write units supported by the block device.
> +See :ref:`atomic_write_bdev_support` for more details.
> +
> +For single-fsblock atomic writes with a larger block size
> +(on systems with block size < page size):
> +
> +.. code-block:: bash
> +
> +    # Create an ext4 filesystem with a 16KB block size
> +    # (requires page size >= 16KB)
> +    mkfs.ext4 -b 16384 /dev/device
> +
> +For multi-fsblock atomic writes with bigalloc:
> +
> +.. code-block:: bash
> +
> +    # Create an ext4 filesystem with bigalloc and 64KB cluster size
> +    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
> +
> +Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
> +and ``-O bigalloc`` enables the bigalloc feature.
> +
> +Application Interface
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
> +to perform atomic writes:
> +
> +.. code-block:: c
> +
> +    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
> +
> +The write must be aligned to the filesystem's block size and not exceed the
> +filesystem's maximum atomic write unit size.
> +See ``generic_atomic_write_valid()`` for more details.
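
It may help readers to see the argument constraints spelled out. As I
understand the pwritev2(2) man page, an RWF_ATOMIC write must have a
power-of-two total length within [unit_min, unit_max] and must start at an
offset that is a multiple of that length. The sketch below is a userspace
pre-check along those lines; the function name is hypothetical, and it is not
a copy of ``generic_atomic_write_valid()``.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace pre-check for an RWF_ATOMIC write.
 * unit_min and unit_max are the values reported by statx(), in bytes.
 */
bool atomic_write_args_ok(uint64_t offset, uint64_t len,
                          uint64_t unit_min, uint64_t unit_max)
{
    /* The total length must be a nonzero power of two. */
    if (len == 0 || (len & (len - 1)) != 0)
        return false;

    /* The length must fall inside the advertised unit range. */
    if (len < unit_min || len > unit_max)
        return false;

    /* The offset must be naturally aligned to the length. */
    return (offset & (len - 1)) == 0;
}
```

For instance, with limits of [4KB, 64KB], a 16KB write at offset 16KB passes,
while the same write at offset 4KB does not.
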
> +
> +The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
> +following details:
> +
> + * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
> + * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
> + * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
> +   separate memory buffers that can be gathered into a write operation
> +   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
> +
> +The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->stx_attributes`` is set if
> +atomic writes are supported.
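
A short sketch of how an application might query these fields could be useful
here. The struct and function names below are hypothetical. The code assumes
headers new enough to define ``STATX_WRITE_ATOMIC`` (kernel 6.11 or later
uapi headers); with older headers it compiles to a stub that reports "not
supported".

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical container for the values of interest. */
struct file_awu {
    uint32_t unit_min;
    uint32_t unit_max;
    uint32_t segments_max;
};

/* Returns 0 and fills *out if atomic writes are supported on path, else -1. */
int query_file_awu(const char *path, struct file_awu *out)
{
#ifdef STATX_WRITE_ATOMIC
    struct statx stx;

    memset(&stx, 0, sizeof(stx));
    if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) != 0)
        return -1;

    /* The kernel clears the mask bit if it did not fill the fields. */
    if (!(stx.stx_mask & STATX_WRITE_ATOMIC))
        return -1;
    if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC))
        return -1;

    out->unit_min = stx.stx_atomic_write_unit_min;
    out->unit_max = stx.stx_atomic_write_unit_max;
    out->segments_max = stx.stx_atomic_write_segments_max;
    return 0;
#else
    /* Headers predate atomic write support; report "not supported". */
    (void)path;
    (void)out;
    return -1;
#endif
}
```

The file should be opened with O_DIRECT before issuing the atomic write
itself, since ext4 only supports atomic writes with direct I/O.
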
> +
> +.. _atomic_write_bdev_support:
> +
> +Hardware Support
> +----------------
> +
> +The underlying storage device must support atomic write operations.
> +Modern NVMe and SCSI devices often provide this capability.
> +The Linux kernel exposes this information through sysfs:
> +
> +* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
> +* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
> +
> +Nonzero values for these attributes indicate that the device supports
> +atomic writes.
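
For completeness, the same check can be done programmatically. This is a
minimal sketch that reads one of the sysfs attributes listed above; the
function name is made up, and the device name passed in is whatever appears
under /sys/block on the system in question.

```c
#include <stdio.h>

/*
 * Reads /sys/block/<dev>/queue/<attr> as an unsigned integer.
 * Returns 0 and stores the value in *out, or -1 if it cannot be read.
 */
int read_queue_attr(const char *dev, const char *attr, unsigned long *out)
{
    char path[256];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;  /* path would not fit in the buffer */

    f = fopen(path, "r");
    if (!f)
        return -1;  /* no such device, or attribute not exposed */

    ok = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would read both ``atomic_write_unit_min`` and
``atomic_write_unit_max`` and treat a failure or a value of zero as "no atomic
write support".
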
> +
> +See Also
> +--------
> +
> +* :doc:`bigalloc` - Documentation on the bigalloc feature
> +* :doc:`allocators` - Documentation on block allocation in ext4
> +* Support for atomic block writes in 6.13:
> +  https://lwn.net/Articles/1009298/
> diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
> index 0fad6eda6e15..9d4054c17ecb 100644
> --- a/Documentation/filesystems/ext4/overview.rst
> +++ b/Documentation/filesystems/ext4/overview.rst
> @@ -25,3 +25,4 @@ order.
>  .. include:: inlinedata.rst
>  .. include:: eainode.rst
>  .. include:: verity.rst
> +.. include:: atomic_writes.rst
> -- 
> 2.49.0
> 
>