On Tue, Jan 21, 2025 at 11:37:38PM -0800, Christoph Hellwig wrote:
> On Wed, Jan 22, 2025 at 08:29:09AM +0100, Gerhard Wiesinger wrote:
> > BTW: Why does it break the ACID properties?
>
> It doesn't if implemented properly, which of course means out of place
> writes.
>
> The only sane way to implement compression in XFS would be using out
> of place writes, which we support for reflinks and which is heavily
> used by the new zoned mode.  For the latter retrofitting compression
> would be relatively easy, but it first needs to get merged, then
> stabilize and mature, and then we'll need to see if we have enough
> use cases.  So don't plan for it.

... but out of place writes mean that every single fdatasync() called
by the database now requires a file system level transaction commit.
So now every single fdatasync(2) results in the data blocks getting
written out to a new location on disk (this is what out of place
writes mean), followed by a CACHE FLUSH, followed by the metadata
updates to point at the new location on the disk, first written to the
file system transaction log, followed by the fs commit block, followed
by a *second* CACHE FLUSH command.

So now let's look at a sample scenario where the database needs to
update 3 different 4k blocks (for example, where you are crediting
$100 to an income account, followed by a $100 debit to an expense
account, followed by the database commit).

Without transparent compression, the commit looks like this (assuming
the database is properly using fdatasync so it's not asking the file
system to update the ctime/mtime of the database file):

1) random write A (4k write)
2) random write B (4k write)
3) random write C (4k write)
4) CACHE FLUSH

With transparent compression:

1) random write A
2) random write B
3) random write C
4) CACHE FLUSH
5) update the location of compression cluster A written to the fs journal
6) update the location of compression cluster B written to the fs journal
7) update the location of compression cluster C written to the fs journal
8) write the commit block to the fs journal
9) CACHE FLUSH

This kills performance, and as I mentioned, in general IOPS are
expensive and write bandwidth is often far more expensive than bytes
of storage.  This is true for the raw storage used by the cloud
provider, for the extra network bandwidth between the host and the
cluster file system storing the emulated cloud block device, and for
the amount of money charged to the cloud customer, because it does
cost the cloud provider more money.

If you try to do transparent compression using update-in-place (for
example, via the technique in the Stac patent) then you don't need to
update the location on disk.  But given that you are replacing a 64k
compression cluster every time you update a 4k block, if you crash in
the middle of the 64k compression cluster update, that cluster could
get corrupted --- at which point you break the database's ACID
properties.

Finally, note that both Amazon and Google have first-party cloud
products (RDS and CloudSQL, respectively) that provide the customer
with the full MySQL and Postgres feature set.  So if you want to
enable database-level compression, I believe you *can* do it.
Compression is not free, and not magic, but if it works for you, you
*can* enable it if you are using MySQL or Postgres.

Now, if you are using a database that doesn't support database-level
compression, then why don't you try demanding that the vendor
providing the database add compression as a feature?  Of course, they
might ask you as the customer to pay $$$, but the development cost to
add new features, whether in the database or the file system, is also
not free.
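If it helps make the commit sequences above concrete, here is a
minimal sketch in C (mine, purely illustrative; "dbfile" is a
hypothetical stand-in for a preallocated database file) of what the
commit pattern looks like from the application's side: three random
4k overwrites followed by a single fdatasync(2).

	/*
	 * Illustrative only: three random 4k overwrites of a
	 * preallocated file, then one fdatasync(2) to commit.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define BLKSZ 4096

	static void write_block(int fd, off_t blkno, char fill)
	{
		char buf[BLKSZ];

		memset(buf, fill, sizeof(buf));
		if (pwrite(fd, buf, BLKSZ, blkno * BLKSZ) != BLKSZ) {
			perror("pwrite");
			exit(1);
		}
	}

	int main(void)
	{
		int fd = open("dbfile", O_RDWR);

		if (fd < 0) {
			perror("open dbfile");
			exit(1);
		}

		write_block(fd, 10, 'A');	/* 1) random write A */
		write_block(fd, 500, 'B');	/* 2) random write B */
		write_block(fd, 900, 'C');	/* 3) random write C */

		/*
		 * 4) one fdatasync(2) == one CACHE FLUSH.  The blocks
		 * are overwritten in place, and fdatasync() does not
		 * have to push timestamp updates to stable storage,
		 * so no file system transaction commit is required.
		 * With out of place writes (e.g. for transparent
		 * compression), steps 5-9 above happen behind this
		 * one call.
		 */
		if (fdatasync(fd) < 0) {
			perror("fdatasync");
			exit(1);
		}
		close(fd);
		return 0;
	}

Whether a given file system can satisfy that fdatasync() without a
journal commit depends on the implementation, but for in-place
overwrites of already-allocated blocks it is the common case; out of
place writes take that option away.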
Cheers,

					- Ted
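P.S.  If you want to see what this costs on your own storage, here is
a quick-and-dirty micro-benchmark sketch along the same lines (again
mine and purely illustrative; "testfile" is a hypothetical 64MB test
file, e.g. created with "dd if=/dev/zero of=testfile bs=1M count=64"):

	/* Measures "3 random 4k writes + fdatasync" commits per second. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <unistd.h>

	#define BLKSZ	4096
	#define NBLOCKS	16384		/* 64MB test file */
	#define COMMITS	100

	int main(void)
	{
		char buf[BLKSZ];
		struct timespec t0, t1;
		int fd = open("testfile", O_RDWR);

		if (fd < 0) {
			perror("open testfile");
			return 1;
		}
		memset(buf, 0xab, sizeof(buf));

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int i = 0; i < COMMITS; i++) {
			/* three random 4k writes, as in the scenario above */
			for (int j = 0; j < 3; j++) {
				off_t blk = random() % NBLOCKS;

				if (pwrite(fd, buf, BLKSZ,
					   blk * BLKSZ) != BLKSZ) {
					perror("pwrite");
					return 1;
				}
			}
			/* ... followed by the database's commit */
			if (fdatasync(fd) < 0) {
				perror("fdatasync");
				return 1;
			}
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double secs = (t1.tv_sec - t0.tv_sec) +
			      (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("%d commits in %.3f s (%.1f commits/sec)\n",
		       COMMITS, secs, COMMITS / secs);
		return 0;
	}

The fdatasync() is the expensive part (on a device with a volatile
write cache, each one typically implies at least one CACHE FLUSH), and
every extra flush a file system design adds per commit shows up
directly in that commits/sec number.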