Re: [nf-next 0/2] netfilter: nf_tables: make set flush more resistant to memory pressure

Florian Westphal <fw@xxxxxxxxx> · Fri, 25 Jul 2025 13:15:27 +0200

Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> I think the key is to be able to identify what elements have been
> flushed by what flush command, so abort path can just restore/undo the
> state for the given elements.
> 
> Because this also is possible:
> 
>        flush set x + [...] + flush set x
> 
> And [...] includes possible new/delete elements in x.
> 
> It should be possible to store an flush command id in the set element
> (this increases the memory consumption of the set element, which your
> series already does it) to identify what flush command has deleted it.
> This is needed because the transaction object won't be in place but I
> think it is a fair tradeoff. The flush command id can be incremental
> in the batch (the netlink sequence number cannot be used for this
> purpose).

OK, that might work.  So the idea is to do the set walk as-is, but
instead of allocating a NFT_MSG_DELSETELEM for each transaction
object, introduce NFT_MSG_FLUSHSET transaction.

Then, for a DELSETELEM request with no elements (== flush),
allocate *one* NFT_MSG_FLUSHSET transaction.

The NFT_MSG_FLUSHSET transaction holds the set being flushed
and an id, that increments sequentially once for each flush.

Then, do the walk as-is:

static int nft_setelem_flush(const struct nft_ctx *ctx,
                             struct nft_set *set,
                             const struct nft_set_iter *iter,
                             struct nft_elem_priv *elem_priv)
{
        const struct nft_set_ext *ext = nft_set_elem_ext(set, elem_priv);
        struct nft_trans *trans;

	/* previous delsetelem or erlier flush marked it inactive */
        if (!nft_set_elem_active(ext, iter->genmask))
                return 0;

/* No allocation per set elemenet anymore */
/* trans = nft_trans_alloc_gfp(ctx, NFT_MSG_DELSETELEM, */

	/* trans_flush could be obtained from the tail of
	 * the transaction list.  Or placed in *iter.
	 */
	elem_priv->flush_id = trans_flush->flush_id
        set->ops->flush(ctx->net, set, elem_priv);
        set->ndeact++;

        nft_setelem_data_deactivate(ctx->net, set, elem_priv);

        return 0;
}

On abort, NFT_MSG_FLUSHSET would do another walk, for all set elements,
and then calls nft_setelem_data_activate/nft_setelem_activate in case
elem_priv->flush_id == trans_flush->flush_id.

Did I get that right?  I don't see any major issues with this, except
the need to add u32 flush_id to struct nft_elem_priv.
Or perhaps struct nft_set_ext would be a better fit as nft_elem_priv is
just a proxy.

> Of course, this needs careful look, but if the set element can be used
> to annotate the information that allows us to restore to previous
> state before flush (given the transaction object is not used anymore),
> then it should work. Your series is extending the set element size for
> a different purpose, so I think the extra memory should not be an
> issue.

Agree, it would need 4 bytes per elem which isn't much compared to the
transaction log savings.

Will you have a look or should I have a go at this?