On Fri, May 09, 2025 at 02:20:35AM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports the bigalloc feature, which allows the FS to work in units
> of clusters (groups of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create an FS using -
> mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>
> With bigalloc, ext4 can support multi-fsblock atomic writes. We will have
> to adjust ext4's atomic write unit max value to the cluster size. This can
> then support atomic writes of any size in the range [blocksize,
> clustersize]. This patch adds the required changes so that multi-fsblock
> atomic write support using bigalloc can be enabled in the next patch.
>
> In this patch, for block allocation, we first query the underlying region
> of the requested range by calling ext4_map_blocks(). Here are the various
> cases we then handle depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped
>    extent, then we don't call ext4_map_blocks() to allocate anything. We
>    don't even need to start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an
>    unwritten extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
>    extent into 2 unwritten extents of the required size.
> 5. If the underlying region has any type of mixed mapping, then we call
>    ext4_map_blocks() in a loop to zero out the unwritten and hole regions
>    within the requested range. This then provides a single mapped-extent
>    mapping for the requested range.
>
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.
>
> Note that having a single contiguous underlying region of type mapped,
> unwritten, or hole is not a problem. The reason to avoid writing on top
> of a mixed mapping region is that atomic writes require that either all
> or nothing of the userspace pwritev2 request gets written. So if a crash
> or a sudden poweroff occurs at any point during the write, the region
> undergoing the atomic write should read either complete old data or
> complete new data, but never a mix of both.
> So, we first convert any mixed mapping region to a single contiguous
> mapped extent before any data gets written to it. This is because the FS
> normally only converts unwritten extents to written at the end of the
> write, in the ->end_io() call. If we allowed writes over a mixed mapping
> and a sudden poweroff happened in between, we would end up reading a mix
> of new data (over mapped extents) and old data (over unwritten extents),
> because the unwritten-to-written conversion never went through.
> So to avoid this, and to avoid writes getting torn due to a mixed
> mapping, we first allocate a single contiguous block mapping and then
> do the write.
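
For anyone reading along, the userspace side of this contract looks roughly
like the sketch below: query the advertised atomic write unit limits via
statx() and issue the write with pwritev2(RWF_ATOMIC). This is only an
illustrative sketch, not part of the patch, and it assumes kernel/libc
headers new enough to define STATX_WRITE_ATOMIC and RWF_ATOMIC, plus a
device/FS combination that actually advertises atomic write support:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <sys/uio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
  	/* 16k buffer, aligned for O_DIRECT; matches -C 16384 above. */
  	static char buf[16384] __attribute__((aligned(16384)));
  	struct statx stx;
  	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
  	int fd;

  	if (argc != 2)
  		return 1;
  	fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
  	if (fd < 0)
  		return 1;

  	/* Discover the advertised atomic write unit limits; with the
  	 * bigalloc setup above, max should be the 16k cluster size. */
  	if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) == 0 &&
  	    (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC))
  		printf("awu min %u max %u\n",
  		       stx.stx_atomic_write_unit_min,
  		       stx.stx_atomic_write_unit_max);

  	memset(buf, 0xab, sizeof(buf));
  	/* All-or-nothing 16k write at a 16k-aligned offset. */
  	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
  		perror("pwritev2");
  	close(fd);
  	return 0;
  }
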
>
> Co-developed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>

Looks fine (I don't like the pre-zeroing, but options are limited on ext4)
except for one thing...

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 8b86b1a29bdc..2642e1ef128f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3412,6 +3412,136 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  	}
>  }
>
> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
> +		struct inode *inode, struct ext4_map_blocks *map)
> +{
> +	ext4_lblk_t m_lblk = map->m_lblk;
> +	unsigned int m_len = map->m_len;
> +	unsigned int mapped_len = 0, m_flags = 0;
> +	ext4_fsblk_t next_pblk = 0;
> +	bool check_next_pblk = false;
> +	int ret = 0;
> +
> +	WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> +	/*
> +	 * This is the slow path in case of a mixed mapping. We use the
> +	 * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure any unwritten
> +	 * or hole regions within the requested range are zeroed out, so
> +	 * that we return a single contiguous mapped extent.
> +	 */
> +	m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> +	do {
> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
> +		if (ret < 0 && ret != -ENOSPC)
> +			goto out_err;
> +		/*
> +		 * This should never happen, but let's return an error code to
> +		 * avoid an infinite loop in here.
> +		 */
> +		if (ret == 0) {
> +			ret = -EFSCORRUPTED;
> +			ext4_warning_inode(inode,
> +				"ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
> +				m_flags, ret);
> +			goto out_err;
> +		}
> +		/*
> +		 * With bigalloc we should never get ENOSPC nor discontiguous
> +		 * physical extents.
> +		 */
> +		if ((check_next_pblk && next_pblk != map->m_pblk) ||
> +		    ret == -ENOSPC) {
> +			ext4_warning_inode(inode,
> +				"Non-contiguous allocation detected: expected %llu, got %llu, "
> +				"or ext4_map_blocks() returned out of space ret: %d",
> +				next_pblk, map->m_pblk, ret);
> +			ret = -ENOSPC;
> +			goto out_err;

If you get physically discontiguous mappings within a cluster, the extent
tree is corrupt.

--D

> +		}
> +		next_pblk = map->m_pblk + map->m_len;
> +		check_next_pblk = true;
> +
> +		mapped_len += map->m_len;
> +		map->m_lblk += map->m_len;
> +		map->m_len = m_len - mapped_len;
> +	} while (mapped_len < m_len);
> +
> +	/*
> +	 * We might have done some work in the above loop, so we need to query
> +	 * the start of the physical extent, based on the original m_lblk and
> +	 * m_len. Let's also ensure we were able to allocate the required
> +	 * range for the mixed mapping case.
> +	 */
> +	map->m_lblk = m_lblk;
> +	map->m_len = m_len;
> +	map->m_flags = 0;
> +
> +	ret = ext4_map_blocks(handle, inode, map,
> +			      EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
> +	if (ret != m_len) {
> +		ext4_warning_inode(inode,
> +			"allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
> +			m_lblk, m_len, ret);
> +		ret = -EINVAL;
> +	}
> +	return ret;
> +
> +out_err:
> +	/* reset map before returning an error */
> +	map->m_lblk = m_lblk;
> +	map->m_len = m_len;
> +	map->m_flags = 0;
> +	return ret;
> +}
> +
> +/*
> + * ext4_map_blocks_atomic_write: Helper routine to ensure the entire
> + * requested range in @map [lblk, lblk + len) is one single contiguous
> + * extent with no mixed mappings.
> + *
> + * We first use the m_flags passed to us by our caller (ext4_iomap_alloc()).
> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
> + * physical extent for the requested range does not have a single
> + * contiguous mapping type (i.e. Hole, Mapped, or Unwritten) throughout.
> + * In that case we loop over the requested range to allocate and zero out
> + * the unwritten extents / holes in between, to get a single mapped extent
> + * from [m_lblk, m_lblk + m_len). Note that this is only possible because
> + * we know this can be called only on a bigalloc enabled filesystem, where
> + * the underlying cluster is already allocated. This avoids allocating
> + * discontiguous extents in the slow path due to multiple calls to
> + * ext4_map_blocks().
> + * The slow path is mostly a non-performance-critical path, so it should be
> + * ok to loop using ext4_map_blocks() with appropriate flags to allocate &
> + * zero the underlying short holes/unwritten extents within the requested
> + * range.
> + */
> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
> +		struct ext4_map_blocks *map, int m_flags)
> +{
> +	ext4_lblk_t m_lblk = map->m_lblk;
> +	unsigned int m_len = map->m_len;
> +	int ret = 0;
> +
> +	WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
> +
> +	ret = ext4_map_blocks(handle, inode, map, m_flags);
> +	if (ret < 0 || ret == m_len)
> +		goto out;
> +	/*
> +	 * This is a mixed mapping case where we were not able to allocate
> +	 * a single contiguous extent. In that case let's reset the requested
> +	 * mapping and call the slow path.
> +	 */
> +	map->m_lblk = m_lblk;
> +	map->m_len = m_len;
> +	map->m_flags = 0;
> +
> +	return ext4_map_blocks_atomic_write_slow(handle, inode, map);
> +out:
> +	return ret;
> +}
> +
>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>  			    unsigned int flags)
>  {
> @@ -3425,7 +3555,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>  	 */
>  	if (map->m_len > DIO_MAX_BLOCKS)
>  		map->m_len = DIO_MAX_BLOCKS;
> -	dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +
> +	/*
> +	 * Journal credits estimation for atomic writes. We call
> +	 * ext4_map_blocks() to find out if there could be a mixed mapping.
> +	 * If yes, then let's assume the no. of pextents required can be
> +	 * m_len, i.e. every alternate block can be unwritten or a hole.
> +	 */
> +	if (flags & IOMAP_ATOMIC) {
> +		unsigned int orig_mlen = map->m_len;
> +
> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> +		if (ret < 0)
> +			return ret;
> +		if (map->m_len < orig_mlen) {
> +			map->m_len = orig_mlen;
> +			dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
> +							     map->m_len);
> +		} else {
> +			dio_credits = ext4_chunk_trans_blocks(inode,
> +							      map->m_len);
> +		}
> +	} else {
> +		dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +	}
>
>  retry:
>  	/*
> @@ -3456,7 +3609,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>
> -	ret = ext4_map_blocks(handle, inode, map, m_flags);
> +	if (flags & IOMAP_ATOMIC)
> +		ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
> +	else
> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
>
>  	/*
>  	 * We cannot fill holes in indirect tree based inodes as that could
> @@ -3480,6 +3636,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	int ret;
>  	struct ext4_map_blocks map;
>  	u8 blkbits = inode->i_blkbits;
> +	unsigned int orig_mlen;
>
>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>  		return -EINVAL;
> @@ -3493,6 +3650,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	map.m_lblk = offset >> blkbits;
>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> +	orig_mlen = map.m_len;
>
>  	if (flags & IOMAP_WRITE) {
>  		/*
> @@ -3503,8 +3661,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  		 */
>  		if (offset + length <= i_size_read(inode)) {
>  			ret = ext4_map_blocks(NULL, inode, &map, 0);
> -			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> -				goto out;
> +			/*
> +			 * For atomic writes the entire requested length should
> +			 * be mapped.
> +			 */
> +			if (map.m_flags & EXT4_MAP_MAPPED) {
> +				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> +				    (flags & IOMAP_ATOMIC && ret >= orig_mlen))
> +					goto out;
> +			}
> +			map.m_len = orig_mlen;
>  		}
>  		ret = ext4_iomap_alloc(inode, &map, flags);
>  	} else {
> @@ -3525,6 +3691,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	 */
>  	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>
> +	/*
> +	 * Before returning to iomap, let's ensure the allocated mapping
> +	 * covers the entire requested length for atomic writes.
> +	 */
> +	if (flags & IOMAP_ATOMIC) {
> +		if (map.m_len < (length >> blkbits)) {
> +			WARN_ON(1);
> +			return -EINVAL;
> +		}
> +	}
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>
>  	return 0;
> --
> 2.49.0
>
>
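
To make that final length check concrete, here is the arithmetic for the
bs=4k / C=16k setup from the commit message. Hypothetical numbers, purely
illustrative, not part of the patch:

  #include <stdio.h>

  int main(void)
  {
  	/* Hypothetical values: bs = 4096 (blkbits = 12), bigalloc
  	 * cluster = 16384, one 16k atomic write at offset 0. */
  	unsigned int blkbits = 12;
  	long long offset = 0, length = 16384;

  	long long m_lblk = offset >> blkbits;
  	long long orig_mlen = ((offset + length - 1) >> blkbits) - m_lblk + 1;

  	/* ext4_iomap_begin() must hand iomap one mapping of at least
  	 * length >> blkbits blocks; anything shorter could let the
  	 * atomic write be split into multiple bios and torn. */
  	printf("m_lblk=%lld orig_mlen=%lld need m_len>=%lld\n",
  	       m_lblk, orig_mlen, length >> blkbits);
  	return 0;
  }

This prints m_lblk=0 orig_mlen=4 need m_len>=4, matching the WARN_ON +
-EINVAL condition above.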