On Sun, Mar 30, 2025 at 10:33:23AM +0800, Herbert Xu wrote:
> On Wed, Mar 26, 2025 at 11:52:05AM +0800, Herbert Xu wrote:
> >
> > they don't need it.  Take ext4 as an example:
> >
> > 	ext4 calls verity
> > 	schedule_work(verity_work);
> > 	return asynchronously!
> >
> > verity_work:
> > 	do the crypto work
> > 	__read_end_io(bio);
>
> I went ahead and removed the work queue for fsverity and fscrypt
> (except for the reading of the Merkle tree, which is still done in
> a work queue because I'm too lazy to make that async), and it
> actually turned out to be slower than using a work queue.
>
> I was testing with an encrypted 8GB file on ext4 mounted over a
> loopback device in tmpfs.  The encryption is with xts-vaes.  It
> turns out that not using a work queue actually made reading the
> entire file go from 2.4s to 2.5s.
>
> I then tried passing the whole bio (256KB per crypto request in my
> test, as opposed to the data unit size of 4KB per crypto request)
> through to skcipher using chaining, with xts-vaes doing the requests
> one by one.  Against my expectations, this didn't speed things up at
> all (but at least it didn't slow things down either).  All the
> benefits of aggregating the data were offset by the extra setup cost
> of creating the chained requests.

Yes, your chaining API has poor performance and is difficult to test,
as I've been saying all along.

> So chaining is clearly not the way to go because it involves cutting
> up into data units at the start of the process, rather than the end.

Certainly agreed that chaining is not the way to go, but I think
you're overlooking that Linus's suggestion to use the libraries
directly would also solve this, while also not being restricted to
bios and folios (note that not all filesystems are block-based, for
example...).  That would avoid the per-request overhead of the generic
crypto infrastructure, which is the real source of the problem.
> Finally I hacked up a patch (this goes on top of the skcipher branch
> in cryptodev) to pass the whole bio through the Crypto API all the
> way to xts-vaes, which then unbundled it.  This turned out to be a
> winner, taking the read time for 8GB from 2.4s down to 2.1s.
>
> In view of this result, I'm going to throw away chaining, and
> instead work on an interface that can take a whole bio (or folio),
> then cut it up into the specified data unit size before processing.
>
> The bottom end of the interface should be able to feed two (or
> whatever number you fancy) data units to the actual algorithm.
>
> This should work just as well for compression, since its batching
> input is simply an order-N folio.  The compression output is a bit
> harder because the data unit size is not constant, but I think I
> have a way of making it work by adding a bit to the scatterlist data
> structure to indicate the end of each data unit.
>
> PS For fsverity, a 256KB bio size equates to 64 units of hash input.
> My strategy is to allocate the whole thing if we can (2KB or 4KB,
> depending on your digest size), and if that fails, fall back to a
> stack buffer of 512 bytes (or whatever number keeps the compiler
> quiet regarding stack usage).  Even if we're on the stack, it should
> still give more than enough data to satiate your multibuffer hash
> code.

Extending the generic crypto infrastructure to support bios and folios
is an interesting idea.  But TBH, I think it's worse than Linus's
suggestion of just extending lib/crypto/ to support the needed
functionality and using that directly.  Your proposal again solves a
problem created by the generic crypto infrastructure being too complex
by making the generic crypto infrastructure even more complex.  With
bio and folio support in the generic crypto infrastructure, there
would be a lot of work to add support in all the underlying algorithms
and to add tests for all the new APIs.
For hashing, users would need to allocate an array to hold the digest
of every block in the bio or folio.  That would add an additional
memory allocation to every I/O.  You said you'd like to fall back to a
smaller buffer if the memory allocation fails.  But that's silly; if
we have to support that case anyway, we might as well do it that way
only, in which case the bio interface is pointless.

Also note that the kernel *already* has an abstraction layer that
allows doing en/decryption on bios.  It's called blk-crypto, and it
makes it possible to do the en/decryption using either inline
encryption hardware (i.e., the newer style of crypto accelerator that
is actually commonly used and doesn't use the Crypto API at all) or
the Crypto API.  I have plans to remove the fs-layer bio en/decryption
code from fscrypt and always use blk-crypto instead.  Adding bio
support to the Crypto API would duplicate blk-crypto, and we'd end up
with too many abstraction layers.  I think my preferred approach is
for blk-crypto-fallback to call the library functions directly.  The
legacy Crypto API really has no useful role to play here anymore.

FWIW, there are also people thinking about developing inline hashing
hardware, in which case something similar would apply to
blk-integrity.

- Eric