Re: What should we do about the nvme atomics mess?

Nilay Shroff <nilay@xxxxxxxxxxxxx> · Thu, 10 Jul 2025 10:37:19 +0530

On 7/10/25 2:58 AM, Keith Busch wrote:
> On Wed, Jul 09, 2025 at 01:21:17PM +0530, Nilay Shroff wrote:
>> I believe there are multi-controller NVMe disks in the field (including the 
>> one I have) that do not exhibit such inconsistencies, i.e., they report a
>> consistent AWUPF value across controllers and do not change it based on 
>> namespace format. The NVMe specification states this (quoting it from 
>> NVM-Command-Set-Specification-1.0e):
>>
>> "The values (referencing AWUPF / AWUN) reported in the Identify Controller
>> data structure are valid across all namespaces with any supported namespace
>> format, forming a baseline value that is guaranteed not to change."
> 
> I don't think that's a backward compatible requirement. Controllers
> often rescale these after a format command, and it was the only way for
> 1.0 and 1.1 controllers to report atomic sizes.
> 
> Let's say the controller can do 128k byte atomic writes. If all
> namespaces used 512b LBA format, then AWUPF would be 255. If you change
> one namespace format to 4k, AWUPF scales down to 31, yielding a
> sub-optimal result for all the other namespaces.
> 
On the multi-controller disk I’ve been testing, each controller consistently
reports an AWUPF value of 63. I created shared namespaces with mixed LBA formats
(some using 512-byte LBAs and others using 4KB LBAs) and observed that the
AWUPF value remained constant at 63 across all controllers and formats.

Since AWUPF is a 0's based value, 63 corresponds to 64 logical blocks.
This implies that:
- A namespace with 4KB LBA format can support up to 256KB of atomic
  writes (4KB × 64),
- A namespace with 512-byte LBA format can only support up to 32KB of
  atomic writes (512B × 64).

So in this case, it's actually the opposite of what one might assume:
Users of namespaces with 4KB LBA format would see the largest atomic write
size, while those using 512-byte LBA format would see a sub-optimal one,
since the maximum atomic write size scales down with smaller LBAs.
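
The arithmetic in both examples above can be sketched as follows. This is
only an illustration, assuming AWUPF is a 0's based count of logical blocks
(a value of N means N + 1 blocks); the function names are hypothetical and
are not taken from the driver.

```python
def atomic_write_bytes(awupf: int, lba_size: int) -> int:
    """Atomic write size in bytes implied by a 0's based AWUPF value."""
    return (awupf + 1) * lba_size


def awupf_for(atomic_bytes: int, lba_size: int) -> int:
    """0's based AWUPF a controller would report for a fixed byte limit."""
    return atomic_bytes // lba_size - 1


KIB = 1024

# Fixed AWUPF of 63 (the disk described above): the byte limit follows the LBA size.
print(atomic_write_bytes(63, 4096) // KIB)  # 256
print(atomic_write_bytes(63, 512) // KIB)   # 32

# Fixed 128k byte limit (Keith's example): AWUPF follows the LBA size.
print(awupf_for(128 * KIB, 512))   # 255
print(awupf_for(128 * KIB, 4096))  # 31
```

A fixed AWUPF gives a byte limit that depends on the LBA format, while a
fixed byte limit gives an AWUPF that depends on the LBA format.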

>> While the spec doesn't explicitly require that AWUPF be consistent across
>> controllers within the same subsystem, it seems to be implied. That said,
>> I agree this should have been stated explicitly in the specification.
> 
> Considering multi-controller subsystems, some controllers might have
> namespaces with only 512b formats attached, and other controllers might
> have some 4k mixed in, so then they can't all consistently report the
> desired AWUPF value. They'd have to just scale AWUPF based on the
> largest sector size supported. Which I guess is what the current wording
> is guiding toward, but that just suggests host drivers disregard the
> value and use NAWUPF instead. So still option III.

Yes, I agree, option III seems to be the best possible way forward.
However, does this mean we would disregard atomic write support for any
multi-controller NVMe vendor that consistently reports a valid AWUPF value
across all controllers and namespace formats, but sets NAWUPF to zero?
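
To make the question concrete, here is a minimal sketch of the policy as
described in this exchange (disregard AWUPF and use NAWUPF instead). The
function name is hypothetical, and treating NAWUPF == 0 as "no per-namespace
limit reported" is an assumption made for illustration, not driver code.

```python
from typing import Optional


def atomic_limit_nawupf_only(awupf: int, nawupf: int, lba_size: int) -> Optional[int]:
    """Atomic write limit in bytes when only NAWUPF is trusted.

    AWUPF is accepted but deliberately ignored, mirroring the policy above.
    Assumption: NAWUPF == 0 means the namespace reports no value.
    """
    if nawupf == 0:
        return None  # atomic write support is not advertised to the user
    return (nawupf + 1) * lba_size  # NAWUPF is 0's based, like AWUPF


# A device like the one described above: AWUPF = 63 everywhere, NAWUPF = 0.
print(atomic_limit_nawupf_only(awupf=63, nawupf=0, lba_size=4096))   # None
# A namespace that does report NAWUPF = 63 with 4KB LBAs.
print(atomic_limit_nawupf_only(awupf=63, nawupf=63, lba_size=4096))  # 262144
```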

Thanks,
--Nilay