Re: [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams

Eduard Zingerman <eddyz87@xxxxxxxxx> · Wed, 16 Apr 2025 14:49:20 -0700

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> writes:

> Introduce a set of kfuncs to implement per BPF program stdout and stderr
> streams. This is implemented as a linked list of dynamically allocated
> strings whenever a message is printed. This can be done by the kernel or
> the program itself using the bpf_prog_stream_vprintk kfunc. In-kernel
> wrappers are provided over streams to ease the process.
>
> The idea is that everytime messages need to be dumped, the reader would
> pull out the whole batch of messages from the stream at once, and then
> pop each string one by one and start printing it out (how exactly is
> left up to the BPF program reading the log, but usually it will be
> streaming data back into a ring buffer that is consumed by user space).
>
> The use of a lockless list warrants that we be careful about the
> ordering of messages. When being added, the order maintained is new to
> old, therefore after deletion, we must reverse the list before iterating
> and popping out elements from it.
>
> Overall, this infrastructure provides NMI-safe any context printing
> built on top of the NMI-safe any context bpf_mem_alloc() interface.
>
> Later patches will add support for printing splats in case of BPF arena
> page faults, rqspinlock deadlocks, and cond_break timeouts.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> ---

[...]

> diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c
> new file mode 100644
> index 000000000000..2019ce134310
> --- /dev/null
> +++ b/kernel/bpf/stream.c

[...]

> +static int bpf_stream_page_check_room(struct bpf_stream_page *stream_page, int len)
                                                                              ^^^^^^^
                                                                     len is never used
> +{
> +	int min = offsetof(struct bpf_stream_elem, str[0]);
> +	int consumed = stream_page->consumed;
> +	int total = BPF_STREAM_PAGE_SZ;
> +	int rem = max(0, total - consumed - min);
> +
> +	/* Let's give room of at least 8 bytes. */
> +	WARN_ON_ONCE(rem % 8 != 0);
> +	return rem < 8 ? 0 : rem;
> +}

[...]

> +static struct bpf_stream_elem *bpf_stream_elem_alloc(int len)
> +{
> +	const int max_len = ARRAY_SIZE((struct bpf_bprintf_buffers){}.buf);
> +	struct bpf_stream_elem *elem;
> +
> +	/*
> +	 * We may overflow, but we should never need more than one page size
> +	 * worth of memory. This can be lifted, but we'd need to adjust the
> +	 * other code to keep allocating more pages to overflow messages.
> +	 */
> +	BUILD_BUG_ON(max_len > BPF_STREAM_PAGE_SZ);
> +	/*
> +	 * Length denotes the amount of data to be written as part of stream element,
> +	 * thus includes '\0' byte. We're capped by how much bpf_bprintf_buffers can
> +	 * accomodate, therefore deny allocations that won't fit into them.
> +	 */
> +	if (len < 0 || len > max_len)
> +		return NULL;
> +
> +	elem = bpf_stream_elem_alloc_from_bpf_ma(len);
> +	if (!elem)
> +		elem = bpf_stream_elem_alloc_from_stream_page(len);

So, the stream page is a backup mechanism, right?
I'm curious, did you compare how many messages are dropped if there is
no such backup? Also, how much memory is wasted if there is no
"spillover" mechanism (BPF_STREAM_ELEM_F_NEXT).
Are these complications absolutely necessary?

> +	return elem;
> +}
> +
> +__bpf_kfunc_start_defs();
> +
> +static int bpf_stream_push_str(struct bpf_stream *stream, const char *str, int len)
> +{

This function accumulates elements in &stream->log w/o a cap.
Why is this not a problem if e.g. user space never flushes streams for
the program?

> +	struct bpf_stream_elem *elem, *next = NULL;
> +	int room = 0, rem = 0;
> +
> +	/*
> +	 * Allocate a bpf_prog_stream_elem and push it to the bpf_prog_stream
> +	 * log, elements will be popped at once and reversed to print the log.
> +	 */
> +	elem = bpf_stream_elem_alloc(len);
> +	if (!elem)
> +		return -ENOMEM;
> +	room = elem->mem_slice.len;
> +	if (elem->flags & BPF_STREAM_ELEM_F_NEXT) {
> +		next = (struct bpf_stream_elem *)((unsigned long)elem->next & ~BPF_STREAM_ELEM_F_MASK);
> +		rem = next->mem_slice.len;
> +	}
> +
> +	memcpy(elem->str, str, room);
> +	if (next)
> +		memcpy(next->str, str + room, rem);
> +
> +	if (next) {
> +		elem->node.next = &next->node;
> +		next->node.next = NULL;
> +
> +		llist_add_batch(&elem->node, &next->node, &stream->log);
> +	} else {
> +		llist_add(&elem->node, &stream->log);
> +	}
> +
> +	return 0;
> +}

[...]

> +BTF_KFUNCS_START(stream_consumer_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
> +BTF_KFUNCS_END(stream_consumer_kfunc_set)

This is a complicated API.
If we anticipate that users intend to write this info to ring buffers
maybe just provide a function doing that and do not expose complete API?

[...]

I'll continue reading the patch-set tomorrow...