On 06/08/25 11:17 AM, Harshvardhan Jha wrote: > On 04/08/25 1:15 PM, Harshvardhan Jha wrote: >> On 28/07/25 3:04 PM, NeilBrown wrote: >>> On Mon, 28 Jul 2025, Harshvardhan Jha wrote: >>>> On 27/07/25 10:20 AM, NeilBrown wrote: >>>>> On Fri, 25 Jul 2025, Harshvardhan Jha wrote: >>>>>> On 23/07/25 1:37 PM, NeilBrown wrote: >>>>>>> On Wed, 23 Jul 2025, Harshvardhan Jha wrote: >>>>>>>> On 08/04/25 4:01 PM, Mark Brown wrote: >>>>>>>>> On Fri, Mar 28, 2025 at 01:40:44PM -0400, trondmy@xxxxxxxxxx wrote: >>>>>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> >>>>>>>>>> >>>>>>>>>> Once a task calls exit_signals() it can no longer be signalled. So do >>>>>>>>>> not allow it to do killable waits. >>>>>>>>> We're seeing the LTP acct02 test failing in kernels with this patch >>>>>>>>> applied, testing on systems with NFS root filesystems: >>>>>>>>> >>>>>>>>> 10271 05:03:09.064993 tst_test.c:1900: TINFO: LTP version: 20250130-1-g60fe84aaf >>>>>>>>> 10272 05:03:09.076425 tst_test.c:1904: TINFO: Tested kernel: 6.15.0-rc1 #1 SMP PREEMPT Sun Apr 6 21:18:14 UTC 2025 aarch64 >>>>>>>>> 10273 05:03:09.076733 tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz' >>>>>>>>> 10274 05:03:09.087803 tst_test.c:1722: TINFO: Overall timeout per run is 0h 01m 30s >>>>>>>>> 10275 05:03:09.088107 tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz' >>>>>>>>> 10276 05:03:09.093097 acct02.c:63: TINFO: CONFIG_BSD_PROCESS_ACCT_V3=y >>>>>>>>> 10277 05:03:09.093400 acct02.c:240: TINFO: Verifying using 'struct acct_v3' >>>>>>>>> 10278 05:03:10.053504 <6>[ 98.043143] Process accounting resumed >>>>>>>>> 10279 05:03:10.053935 <6>[ 98.043143] Process accounting resumed >>>>>>>>> 10280 05:03:10.064653 acct02.c:193: TINFO: == entry 1 == >>>>>>>>> 10281 05:03:10.064953 acct02.c:84: TINFO: ac_comm != 'acct02_helper' ('acct02') >>>>>>>>> 10282 05:03:10.076029 acct02.c:133: TINFO: ac_exitcode != 32768 (0) >>>>>>>>> 10283 05:03:10.076331 acct02.c:141: TINFO: ac_ppid != 2466 (2461) >>>>>>> It seems that the acct02 process got logged.. >>>>>>> Maybe the vfork attempt (trying to run acct02_helper) got half way an >>>>>>> aborted. >>>>>>> It got far enough that accounting got interested. >>>>>>> It didn't get far enough to update the ppid. >>>>>>> I'd be surprised if that were even possible.... >>>>>>> >>>>>>> If you would like to help debug this, changing the >>>>>>> >>>>>>> + if (unlikely(current->flags & PF_EXITING)) >>>>>>> >>>>>>> to >>>>>>> >>>>>>> + if (unlikely(WARN_ON(current->flags & PF_EXITING))) >>>>>>> >>>>>>> would provide stack traces so we can wee where -EINTR is actually being >>>>>>> returned. That should provide some hints. >>>>>>> >>>>>>> NeilBrown >>>>>> Hi Neil, >>>>>> >>>>>> Upon this addition I got this in the logs >>>>> Thanks for testing. Was there anything new in the kernel logs? I was >>>>> expecting a WARNING message followed by a "Call Trace". >>>>> >>>>> If there wasn't, then this patch cannot have caused the problem. >>>>> If there was, then I need to see it. >>>>> >>>>> Thanks, >>>>> NeilBrown >>>> This is what the dmesg contains: >>>> >>>> [ 678.814887] LTP: starting acct02 >>>> [ 679.831232] ------------[ cut here ]------------ >>>> [ 679.833500] WARNING: CPU: 6 PID: 88930 at net/sunrpc/sched.c:279 >>>> rpc_wait_bit_killable+0x76/0x90 [sunrpc] >>>> [ 679.837308] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs >>>> netfs rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd >>>> grace loop nft_redir ipt_REJECT xt_comment xt_owner nft_compat >>>> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib rfkill nft_reject_inet >>>> nf_reject_ >>>> ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack >>>> nf_defrag_ipv6 nf_defrag_ipv4 ip_set cuse vfat fat intel_rapl_msr >>>> intel_rapl_common kvm_amd ccp kvm drm_shmem_helper irqbypass i2c_piix4 >>>> drm_kms_helper pcspkr pvpanic_mmio i2c_smbus pvpanic drm fuse xfs >>>> crc32c_generic >>>> nvme_tcp nvme_fabrics nvme_core nvme_keyring nvme_auth sd_mod >>>> virtio_net sg net_failover virtio_scsi failover ata_generic pata_acpi >>>> ata_piix ghash_clmulni_intel libata sha512_ssse3 virtio_pci sha256_ssse3 >>>> virtio_pci_legacy_dev sha1_ssse3 virtio_pci_modern_dev serio_raw >>>> dm_multipath btrfs >>>> blake2b_generic xor zstd_compress raid6_pq sunrpc dm_mirror >>>> dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls >>>> cxgb3i cxgb3 mdio libcxgbi libcxgb >>>> [ 679.837524] qla4xxx iscsi_tcp libiscsi_tcp libiscsi >>>> scsi_transport_iscsi iscsi_ibft iscsi_boot_sysfs qemu_fw_cfg aesni_intel >>>> crypto_simd cryptd [last unloaded: kheaders] >>>> [ 679.873316] CPU: 6 UID: 0 PID: 88930 Comm: acct02_helper Kdump: >>>> loaded Not tainted 6.15.8-1.el9.rc2.x86_64 #1 PREEMPT(voluntary) >>>> [ 679.877769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >>>> BIOS 1.6.4 02/27/2023 >>>> [ 679.880782] RIP: 0010:rpc_wait_bit_killable+0x76/0x90 [sunrpc] >>>> [ 679.883189] Code: 01 b8 00 fe ff ff 75 d5 48 8b 85 48 0d 00 00 5b 5d >>>> 48 c1 e8 08 83 e0 01 f7 d8 19 c0 25 00 fe ff ff 31 d2 31 f6 e9 8a e6 c4 >>>> d4 <0f> 0b b8 fc ff ff ff 5b 5d 31 d2 31 f6 e9 78 e6 c4 d4 0f 1f 84 00 >>>> [ 679.889976] RSP: 0018:ffffaf47811a7770 EFLAGS: 00010202 >>>> [ 679.892196] RAX: ffff97be48e00330 RBX: ffffaf47811a77c0 RCX: >>>> 0000000000000000 >>>> [ 679.894978] RDX: 0000000000000001 RSI: 0000000000002102 RDI: >>>> ffffaf47811a77c0 >>>> [ 679.897786] RBP: ffff97be61588000 R08: 0000000000000000 R09: >>>> 0000000000000000 >>>> [ 679.900600] R10: 0000000000000000 R11: 0000000000000000 R12: >>>> 0000000000002102 >>>> [ 679.903432] R13: ffffffff96408ea0 R14: ffffaf47811a77d8 R15: >>>> ffffffffc07568e0 >>>> [ 679.906233] FS: 00007fc2563f8600(0000) GS:ffff97c5c890f000(0000) >>>> knlGS:0000000000000000 >>>> [ 679.909289] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> [ 679.911736] CR2: 00007fc2561fba70 CR3: 00000003bce3a000 CR4: >>>> 00000000003506f0 >>>> [ 679.914555] Call Trace: >>>> [ 679.915918] <TASK> >>>> [ 679.917215] __wait_on_bit+0x31/0xa0 >>>> [ 679.918932] out_of_line_wait_on_bit+0x93/0xc0 >>>> [ 679.920914] ? __pfx_wake_bit_function+0x10/0x10 >>>> [ 679.922944] __rpc_execute+0x109/0x310 [sunrpc] >>>> [ 679.925024] rpc_execute+0x137/0x160 [sunrpc] >>>> [ 679.927020] rpc_run_task+0x107/0x170 [sunrpc] >>>> [ 679.929032] nfs4_call_sync_sequence+0x74/0xc0 [nfsv4] >>>> [ 679.931319] _nfs4_proc_statfs+0xc7/0x100 [nfsv4] >>>> [ 679.933520] ? srso_return_thunk+0x5/0x5f >>>> [ 679.935391] nfs4_proc_statfs+0x6b/0xb0 [nfsv4] >>>> [ 679.937367] nfs_statfs+0x7e/0x1e0 [nfs] >>>> [ 679.939138] statfs_by_dentry+0x67/0xa0 >>>> [ 679.940887] vfs_statfs+0x1c/0x40 >>>> [ 679.942596] check_free_space+0x71/0x110 >>> Thanks. I'm not sure why this causes a problem as if vfs_statfs() fail, >>> check_free_space() assumes there is still free space. >>> However it does strongly suggest that we still need to NFS to work in >>> processes where signals have been shutdown. >>> >>> Could you change rpc_wait_bit_killable() to be the following and retest? >>> I intention is that when the process is exiting, we wait up to 5 seconds >>> for each request and then fail. It's a bit ugly, but it is a rather >>> strange situation. It blocking forever that we really want to avoid >>> here, not blocking at all. >>> >>> Thanks, >>> NeilBrown >>> >>> >>> static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode) >>> { >>> if (unlikely(current->flags & PF_EXITING)) { >>> if (schedule_timeout(5*HZ) > 0) >>> /* timed out */ >>> return 0; >>> return -EINTR; >>> } >>> schedule(); >>> if (signal_pending_state(mode, current)) >>> return -ERESTARTSYS; >>> return 0; >>> } >> Adding this change makes the test pass: >> >> <<<test_start>>> >> tag=acct02 stime=1754066481 >> cmdline="acct02" >> contacts="" >> analysis=exit >> <<<test_output>>> >> tst_kconfig.c:88: TINFO: Parsing kernel config '/lib/modules/6.15.8-master.sunrpc.el9.rc3.x86_64/config' >> tst_tmpdir.c:316: TINFO: Using /tmpdir/ltp-lNzAk1qhuX/LTP_accZ75zl1 as tmpdir (nfs filesystem) >> tst_test.c:2004: TINFO: LTP version: 20250530-128-g6505f9e29 >> tst_test.c:2007: TINFO: Tested kernel: 6.15.8-master.sunrpc.el9.rc3.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jul 29 05:06:28 PDT 2025 x86_64 >> tst_kconfig.c:88: TINFO: Parsing kernel config '/lib/modules/6.15.8-master.sunrpc.el9.rc3.x86_64/config' >> tst_test.c:1825: TINFO: Overall timeout per run is 0h 00m 30s >> tst_kconfig.c:88: TINFO: Parsing kernel config '/lib/modules/6.15.8-master.sunrpc.el9.rc3.x86_64/config' >> acct02.c:61: TINFO: CONFIG_BSD_PROCESS_ACCT_V3=y >> acct02.c:238: TINFO: Verifying using 'struct acct_v3' >> acct02.c:191: TINFO: == entry 1 == >> acct02.c:82: TINFO: ac_comm != 'acct02_helper' ('iscsiadm') >> acct02.c:131: TINFO: ac_exitcode != 32768 (5376) >> acct02.c:139: TINFO: ac_ppid != 52326 (2475) >> acct02.c:191: TINFO: == entry 2 == >> acct02.c:82: TINFO: ac_comm != 'acct02_helper' ('systemd') >> acct02.c:125: TINFO: elap_time/clock_ticks >= 2 (1065/100: 10.00) >> acct02.c:131: TINFO: ac_exitcode != 32768 (0) >> acct02.c:139: TINFO: ac_ppid != 52326 (1) >> acct02.c:191: TINFO: == entry 3 == >> acct02.c:82: TINFO: ac_comm != 'acct02_helper' ('(sd-pam)') >> acct02.c:125: TINFO: elap_time/clock_ticks >= 2 (1062/100: 10.00) >> acct02.c:131: TINFO: ac_exitcode != 32768 (9) >> acct02.c:139: TINFO: ac_ppid != 52326 (1) >> acct02.c:191: TINFO: == entry 4 == >> acct02.c:82: TINFO: ac_comm != 'acct02_helper' ('systemd-user-ru') >> acct02.c:131: TINFO: ac_exitcode != 32768 (0) >> acct02.c:139: TINFO: ac_ppid != 52326 (1) >> acct02.c:191: TINFO: == entry 5 == >> acct02.c:202: TINFO: Number of accounting file entries tested: 5 >> acct02.c:208: TPASS: acct() wrote correct file contents! >> >> Summary: >> passed 1 >> failed 0 >> broken 0 >> skipped 0 >> warnings 0 >> incrementing stop >> <<<execution_status>>> >> initiation_status="ok" >> duration=1 termination_type=exited termination_id=0 corefile=no >> cutime=0 cstime=0 >> <<<test_end>>> >> >> Thanks & Regards, >> Harshvardhan > Hi there, > > I tested this around 50 iterations and it passes all 50 times with this > timeout. > > Thanks & Regards, > Harshvardhan > Hello there, Can we go ahead and revert this patch for the meantime until a fix is obtained? Thanks & Regards, Harshvardhan >> >>>> [ 679.944433] acct_write_process+0x45/0x180 >>>> [ 679.946313] acct_process+0xff/0x180 >>>> [ 679.948003] do_exit+0x216/0x480 >>>> [ 679.949799] ? srso_return_thunk+0x5/0x5f >>>> [ 679.951621] do_group_exit+0x30/0x80 >>>> [ 679.953329] __x64_sys_exit_group+0x18/0x20 >>>> [ 679.955217] x64_sys_call+0xfdb/0x14f0 >>>> [ 679.956971] do_syscall_64+0x82/0x7a0 >>>> [ 679.958717] ? srso_return_thunk+0x5/0x5f >>>> [ 679.960550] ? ___pte_offset_map+0x1b/0x1a0 >>>> [ 679.962434] ? srso_return_thunk+0x5/0x5f >>>> [ 679.964261] ? __alloc_frozen_pages_noprof+0x18d/0x340 >>>> [ 679.966389] ? srso_return_thunk+0x5/0x5f >>>> [ 679.968183] ? srso_return_thunk+0x5/0x5f >>>> [ 679.969945] ? __mod_memcg_lruvec_state+0xb6/0x1b0 >>>> [ 679.971977] ? srso_return_thunk+0x5/0x5f >>>> [ 679.973690] ? __lruvec_stat_mod_folio+0x83/0xd0 >>>> [ 679.975671] ? srso_return_thunk+0x5/0x5f >>>> [ 679.977392] ? srso_return_thunk+0x5/0x5f >>>> [ 679.979079] ? set_ptes.isra.0+0x36/0x90 >>>> [ 679.980771] ? srso_return_thunk+0x5/0x5f >>>> [ 679.982375] ? srso_return_thunk+0x5/0x5f >>>> [ 679.984052] ? wp_page_copy+0x333/0x730 >>>> [ 679.985648] ? srso_return_thunk+0x5/0x5f >>>> [ 679.987220] ? __handle_mm_fault+0x397/0x6f0 >>>> [ 679.988818] ? srso_return_thunk+0x5/0x5f >>>> [ 679.990411] ? __count_memcg_events+0xbb/0x150 >>>> [ 679.992111] ? srso_return_thunk+0x5/0x5f >>>> [ 679.993689] ? count_memcg_events.constprop.0+0x26/0x50 >>>> [ 679.995590] ? srso_return_thunk+0x5/0x5f >>>> [ 679.997177] ? handle_mm_fault+0x245/0x350 >>>> [ 679.998807] ? srso_return_thunk+0x5/0x5f >>>> [ 680.000339] ? do_user_addr_fault+0x221/0x690 >>>> [ 680.002042] ? srso_return_thunk+0x5/0x5f >>>> [ 680.003553] ? arch_exit_to_user_mode_prepare.isra.0+0x1e/0xd0 >>>> [ 680.005643] ? srso_return_thunk+0x5/0x5f >>>> [ 680.007202] entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>> [ 680.009025] RIP: 0033:0x7fc2560d985d >>>> [ 680.010510] Code: Unable to access opcode bytes at 0x7fc2560d9833. >>>> [ 680.012660] RSP: 002b:00007ffde591df68 EFLAGS: 00000246 ORIG_RAX: >>>> 00000000000000e7 >>>> [ 680.015355] RAX: ffffffffffffffda RBX: 00007fc2561f59e0 RCX: >>>> 00007fc2560d985d >>>> [ 680.017749] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: >>>> 0000000000000080 >>>> [ 680.020292] RBP: 0000000000000080 R08: 0000000000000000 R09: >>>> 0000000000000020 >>>> [ 680.022729] R10: 00007ffde591de10 R11: 0000000000000246 R12: >>>> 00007fc2561f59e0 >>>> [ 680.025174] R13: 00007fc2561faf20 R14: 0000000000000001 R15: >>>> 00007fc2561faf08 >>>> [ 680.027593] </TASK> >>>> [ 680.028661] ---[ end trace 0000000000000000 ]--- >>>> >>>> >>>> Thanks & Regards, >>>> Harshvardhan >>>> >>>>>> <<<test_start>>> >>>>>> tag=acct02 stime=1753444172 >>>>>> cmdline="acct02" >>>>>> contacts="" >>>>>> analysis=exit >>>>>> <<<test_output>>> >>>>>> tst_kconfig.c:88: TINFO: Parsing kernel config >>>>>> '/lib/modules/6.15.8-1.bug38227970.el9.rc2.x86_64/config' >>>>>> tst_tmpdir.c:316: TINFO: Using /tmpdir/ltp-w1ozKKlJ6n/LTP_acc4RRfLh as >>>>>> tmpdir (nfs filesystem) >>>>>> tst_test.c:2004: TINFO: LTP version: 20250530-105-gda73e1527 >>>>>> tst_test.c:2007: TINFO: Tested kernel: >>>>>> 6.15.8-1.bug38227970.el9.rc2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jul 25 >>>>>> 02:03:04 PDT 2025 x86_64 >>>>>> tst_kconfig.c:88: TINFO: Parsing kernel config >>>>>> '/lib/modules/6.15.8-1.bug38227970.el9.rc2.x86_64/config' >>>>>> tst_test.c:1825: TINFO: Overall timeout per run is 0h 00m 30s >>>>>> tst_kconfig.c:88: TINFO: Parsing kernel config >>>>>> '/lib/modules/6.15.8-1.bug38227970.el9.rc2.x86_64/config' >>>>>> acct02.c:61: TINFO: CONFIG_BSD_PROCESS_ACCT_V3=y >>>>>> acct02.c:238: TINFO: Verifying using 'struct acct_v3' >>>>>> acct02.c:191: TINFO: == entry 1 == >>>>>> acct02.c:82: TINFO: ac_comm != 'acct02_helper' ('acct02') >>>>>> acct02.c:131: TINFO: ac_exitcode != 32768 (0) >>>>>> acct02.c:139: TINFO: ac_ppid != 88929 (88928) >>>>>> acct02.c:181: TFAIL: end of file reached >>>>>> >>>>>> HINT: You _MAY_ be missing kernel fixes: >>>>>> >>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d9570158b626 >>>>>> >>>>>> Summary: >>>>>> passed 0 >>>>>> failed 1 >>>>>> broken 0 >>>>>> skipped 0 >>>>>> warnings 0 >>>>>> incrementing stop >>>>>> <<<execution_status>>> >>>>>> initiation_status="ok" >>>>>> duration=1 termination_type=exited termination_id=1 corefile=no >>>>>> cutime=0 cstime=20 >>>>>> >>>>>> <<<test_end>>> >>>>>> >>>>>> >>>>>> Thanks & Regards, >>>>>> >>>>>> Harshvardhan