Hi,

We would like to tell storage (mostly eMMC, but not exclusively) to
discard a block range, so that it reads back zeroes and is preferably
unmapped to give the storage device the most flexibility.

For eMMCs, this is possible right now with blkdiscard -z (BLKZEROOUT),
but digging through the code with Jan, I am starting to question whether
eMMC is correctly implementing REQ_OP_WRITE_ZEROES (granted, the expected
semantics seem to be spelt out nowhere).

Read along for what MMC does, what semantics BLKZEROOUT seems to be
expected to have, and what I think might need to be done to address this.

eMMC supports a number of different commands for erasing/discarding
data[1]. Relevant to my question are two commands:

  DISCARD: Host no longer needs the data and doesn't care about the value
           read from it. Card may remove or unmap portions or all of it.

  TRIM:    Implies DISCARD and additionally guarantees to return all-zero
           or all-ones on read.

These are made available to the block layer as follows:

  REQ_OP_DISCARD      -> DISCARD
  REQ_OP_WRITE_ZEROES -> TRIM (if erased_byte == 0; otherwise, if
                         all-ones, blk_queue_max_write_zeroes_sectors()
                         is not called)

Looking at it from the ioctl side:

  blkdev_common_ioctl(..., BLKZEROOUT, ...)
    blk_ioctl_zeroout
      blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
        __blkdev_issue_write_zeroes(..., BLKDEV_ZERO_NOUNMAP)
          submit_bio_wait(bio->bi_opf = REQ_NOUNMAP)

__REQ_NOUNMAP has a comment saying 'do not free blocks when zeroing',
but as shown above, TRIM allows the card to unmap the indicated region.

REQ_NOUNMAP has no other documentation, but virtio inverts it and
translates it to VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, which is documented
as follows[2]:

  "For write zeroes commands, if the unmap is set, the device MAY
   deallocate the specified range of sectors in the device backend
   storage, as if the discard command had been sent."

I.e. REQ_NOUNMAP -> device MAY NOT deallocate

This is at odds with the MMC implementation, which seems to ignore
REQ_NOUNMAP completely.
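For reference, here is a minimal sketch of how a userspace caller such
as blkdiscard -z enters the BLKZEROOUT path traced above. It assumes a
Linux system with <linux/fs.h>; the helper names zeroout_range() and
zeroout() are made up for illustration and are not taken from util-linux.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKZEROOUT */

/*
 * Fill the argument BLKZEROOUT expects: a two-element array holding the
 * start offset and the length, both in bytes.
 */
void zeroout_range(uint64_t range[2], uint64_t offset, uint64_t len)
{
	range[0] = offset;
	range[1] = len;
}

/*
 * Ask the kernel to zero [offset, offset + len) on an open block device.
 * Returns 0 on success or a negative errno value on failure.
 */
int zeroout(int fd, uint64_t offset, uint64_t len)
{
	uint64_t range[2];

	zeroout_range(range, offset, len);
	if (ioctl(fd, BLKZEROOUT, range) < 0)
		return -errno;
	return 0;
}
```

The caller has no way to pass flags here, so whether the range ends up
unmapped is decided entirely by the kernel path shown above.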
I don't believe there is an MMC command for write zeroes without discard,
short of actually writing zeroes, so it sounds like the correct
implementation for MMC would be:

  if (req->cmd_flags & REQ_NOUNMAP)
          // or whatever causes fallback to __blkdev_issue_zero_pages
          return -EOPNOTSUPP;

  // do TRIM as before

Of course, this will change user-visible behavior: blkdiscard -z will
start taking much longer for most users. These users will have to migrate
to using fallocate instead:

  blkdev_fallocate(mode = FALLOC_FL_ZERO_RANGE):
    blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)

  blkdev_fallocate(mode = FALLOC_FL_PUNCH_HOLE):
    blkdev_issue_zeroout(..., BLKDEV_ZERO_NOFALLBACK)

So, it's not a drop-in replacement. I guess user code can punch a hole
and fall back to BLKZEROOUT if that fails in order to get back the old
behavior.

I must admit I don't even know why one would write zeroes and care about
them remaining mapped on the storage device, but that seems to be what's
expected with BLKZEROOUT.

Thoughts? What did I miss?

Thanks,
Ahmad

[1]: For a short and incomplete summary, see:
     https://github.com/barebox/barebox/commit/91a11c7d50df91

[2]: https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html
     5.2.6.2 Device Requirements: Device Operation

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |