On Thu, 2025-05-15 at 08:30 +0200, Johannes Berg wrote:
> On Thu, 2025-05-15 at 00:27 +0200, Bert Karwatzki wrote:
> > On Wed, 2025-05-14 at 20:56 +0200, Johannes Berg wrote:
> > > >
> > > > I've split off the problematic piece of code into a noinline
> > > > function to simplify the disassembly:
> > >
> > > Oh and also, does it even still crash with that? :)
> >
> > Yes, it still crashes when compiled with clang.
>
> OK, just checking. :)

To be more precise, I need clang AND PREEMPT_RT=y to get a crash.

>
> FWIW, I'm not convinced at all that the code you were looking at is
> really the problem. The crash (see below) is happening on the status
> side. Of course it cannot crash on the status side if on the TX side we
> never enter anything into the IDR data structure, and never tag the SKB
> to look up in the IDR and therefore never try to create the status
> report on the status side.

After looking at the backtrace, I'm also no longer convinced that this
piece of code is the problem.

>
> Basically what happens is this:
>
>  - on TX, if we have a socket requesting status, create a copy of the
>    SKB, put it into the IDR, and put the IDR index into the original
>    skb->cb
>  - then transmit the original skb, of course
>  - on TX status report from the driver, see if the skb->cb is tagged with
>    the IDR value, if so, report the copy of the SKB back to the socket
>    with the status information
>
> (The reason we need to make a copy is that the SKB could be encrypted or
> otherwise modified in flight, and we don't want to undo that, rather
> keeping a copy for the report.)
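
The flow described above can be sketched in user space roughly as follows.
This is only an illustration under assumptions, not the mac80211 code:
struct fake_skb, the pending[] table (standing in for the IDR), MAX_PENDING,
tx_prepare() and tx_status() are all hypothetical names, and there is no
locking or socket handling here at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PENDING 16          /* hypothetical table size */

/* Hypothetical stand-in for struct sk_buff. */
struct fake_skb {
    unsigned char data[64];
    size_t len;
    int ack_id;                 /* stand-in for the tag in skb->cb; 0 = untagged */
    bool acked;                 /* set on the copy when status is reported */
};

/* Stand-in for the IDR: slot i holds the saved copy for id i + 1. */
static struct fake_skb *pending[MAX_PENDING];

/* TX side: if status was requested, keep a copy and tag the original. */
static int tx_prepare(struct fake_skb *skb, bool wants_status)
{
    assert(skb != NULL);
    skb->ack_id = 0;
    if (!wants_status)
        return 0;               /* nothing stored, nothing to report later */

    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i] == NULL) {
            struct fake_skb *copy = malloc(sizeof(*copy));
            if (copy == NULL)
                return 0;       /* no copy, so the frame goes out untagged */
            memcpy(copy, skb, sizeof(*copy));
            pending[i] = copy;
            skb->ack_id = i + 1;
            return skb->ack_id;
        }
    }
    return 0;                   /* table full: send untagged */
}

/*
 * Status side: if the transmitted skb is tagged, remove the saved copy
 * from the table and hand it back with the status. Returns NULL for an
 * untagged skb. The caller owns (and frees) the returned copy.
 */
static struct fake_skb *tx_status(const struct fake_skb *skb, bool acked)
{
    assert(skb != NULL);
    int id = skb->ack_id;

    if (id <= 0 || id > MAX_PENDING || pending[id - 1] == NULL)
        return NULL;

    struct fake_skb *copy = pending[id - 1];
    pending[id - 1] = NULL;
    copy->acked = acked;
    return copy;
}
```

In this sketch an skb that was never tagged on the TX side makes tx_status()
return NULL, which matches the point that nothing can happen on the status
side unless the TX side stored a copy first.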
>
> > [ 267.339591][ T575] BUG: unable to handle page fault for address: ffffffff51e080b0
> > [ 267.339598][ T575] #PF: supervisor write access in kernel mode
> > [ 267.339602][ T575] #PF: error_code(0x0002) - not-present page
> > [ 267.339606][ T575] PGD f1cc3c067 P4D f1cc3c067 PUD 0
> > [ 267.339613][ T575] Oops: Oops: 0002 [#1] SMP NOPTI
> > [ 267.339622][ T575] CPU: 0 UID: 0 PID: 575 Comm: napi/phy0-0 Not tainted 6.15.0-rc6-next-20250513-llvm-00009-gec34cd07a425 #968 PREEMPT_{RT,(full)}
> > [ 267.339629][ T575] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.10F 11/11/2024
> > [ 267.339632][ T575] RIP: 0010:queued_spin_lock_slowpath+0x120/0x1c0
> ...
> > [ 267.339692][ T575] Call Trace:
> > [ 267.339701][ T575] <TASK>
> > [ 267.339705][ T575] _raw_spin_lock_irqsave+0x57/0x60
> > [ 267.339714][ T575] rt_spin_lock+0x73/0xa0
> > [ 267.339720][ T575] sock_queue_err_skb+0xdc/0x140
> > [ 267.339727][ T575] skb_complete_wifi_ack+0xa9/0x120
> > [ 267.339737][ T575] ieee80211_report_used_skb+0x541/0x6e0 [mac80211]
> > [ 267.339799][ T575] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 267.339804][ T575] ? start_dl_timer+0xcf/0x110
> > [ 267.339814][ T575] ieee80211_tx_status_ext+0x3b3/0x870 [mac80211]
> > [ 267.339851][ T575] ? raw_spin_rq_lock_nested+0x15/0x80
> > [ 267.339862][ T575] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 267.339866][ T575] ? rt_spin_lock+0x3d/0xa0
> > [ 267.339873][ T575] ? mt76_tx_status_unlock+0x38/0x230 [mt76]
> > [ 267.339886][ T575] mt76_tx_status_unlock+0x1e0/0x230 [mt76]
>
> Yeah so that's the crash on the status report as explained above, it
> kind of looks almost like the skb->sk was freed and somehow invalid now?
> But I don't see a general issue here (will keep digging), and how come
> it only shows up with clang?
>
> Since it reproduces pretty reliably, maybe you could do with KASAN?
>

I'm currently doing a test run with KASAN enabled. The test has been
running for ~1h so far (without KASAN the maximum time to a crash was
about 10 min), so KASAN is probably hiding the bug (there are no messages
from KASAN in dmesg).

> Also could be interesting - what userspace are you running with wifi?
> What tool is even setting up the wifi status? If you don't really know
> maybe just put WARN_ON(1) into net/core/sock.c where SO_WIFI_STATUS is
> written (sk_setsockopt).
>
> johannes

To record these backtraces I disabled wifi just after booting (it usually
takes ~5s to connect here) with NetworkManager (nmcli) from Debian sid
(last updated on 20250511, before I encountered this bug):

$ nmcli radio wifi off

Then I set up netconsole, re-enabled wifi, and waited for the crash:

$ nmcli radio wifi on

Bert Karwatzki