On Thu, May 15, 2025 at 12:59:22PM +0100, shiju.jose@xxxxxxxxxx wrote: > From: Shiju Jose <shiju.jose@xxxxxxxxxx> > > Certain operations on memory, such as memory repair, are permitted > only when the address and other attributes for the operation are > from the current boot. This is determined by checking whether the > memory attributes for the operation match those in the CXL gen_media > or CXL DRAM memory event records reported during the current boot. > > The CXL event records must be backed up because they are cleared > in the hardware after being processed by the kernel. > > Support is added for storing CXL gen_media or CXL DRAM memory event > records in xarrays. Old records are deleted when they expire or when > there is an overflow and which depends on platform correctly report > Event Record Timestamp field of CXL spec Table 8-55 Common Event > Record Format. > > Additionally, helper functions are implemented to find a matching > record in the xarray storage based on the memory attributes and > repair type. > > Add validity check, when matching attributes for sparing, using > the validity flag in the DRAM event record, to ensure that all > required attributes for a requested repair operation are valid and > set. Using cxl-test I'm getting the below stack trace when trying to modprobe -r cxl-test. These cxl-test memdevs have num_ras_features == 0 so upon adding the memdevs edac_dev_register() returns -EINVAL, as does devm_cxl_memdev_edac_register() Later on when I try to modprobe -r cxl-test, it seems odd that the devm_cxl_memdev_edac_release() is being called for this device, but it is, and that where the kfree() oops happens. Are those register and release functions intended to be paired/reciprocal ? Here's the code segment (from this patch) where it is failing, and the stack trace follows. > + > +void devm_cxl_memdev_edac_release(struct cxl_memdev *cxlmd) > +{ > +#ifdef CONFIG_CXL_EDAC_MEM_REPAIR > + struct cxl_mem_err_rec *array_rec = cxlmd->array_err_rec; > + struct cxl_event_gen_media *rec_gen_media; > + struct cxl_event_dram *rec_dram; > + unsigned long index; > + > + if (!array_rec) > + return; > + > + xa_for_each(&array_rec->rec_dram, index, rec_dram) > + kfree(rec_dram); It fails above. That's as far as I've gotten. Hoping that you recognize something with this case where num_ras_features == 0 that can lead to this. Here's the stack trace, and I'll hop to the point in the diff below where it's failing: [ 108.114491] cxl mem9: ALISON try_free_rec_dram [ 108.115654] Oops: general protection fault, probably for non-canonical address 0x1ad998badadad88: 0000 [#1] SMP NOPTI [ 108.118125] CPU: 0 UID: 0 PID: 1187 Comm: modprobe Tainted: G O 6.15.0-rc4+ #1 PREEMPT(voluntary) [ 108.120507] Tainted: [O]=OOT_MODULE [ 108.121202] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 108.122466] RIP: 0010:kfree+0x6e/0x300 [ 108.123099] Code: 80 48 01 d8 0f 82 98 02 00 00 48 c7 c2 00 00 00 80 48 2b 15 94 70 61 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 72 70 61 01 <48> 8b 50 08 49 89 c5 f6 c2 01 0f 85 4c 01 00 00 0f 1f 44 00 00 41 [ 108.126050] RSP: 0018:ffffc90002c03be0 EFLAGS: 00010203 [ 108.126905] RAX: 01ad998badadad80 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000002 [ 108.128035] RDX: 0000777f80000000 RSI: ffffffff82a44991 RDI: ffffffff82a5ac86 [ 108.129194] RBP: ffffc90002c03c30 R08: 0000000000000000 R09: 0000000000000000 [ 108.130346] R10: 0000000000000001 R11: ffff888276ffe000 R12: ffff8880075a5000 [ 108.131414] R13: ffff88800680a578 R14: 0000000000000000 R15: 000000000000000b [ 108.132234] FS: 00007fce1dd38740(0000) GS:ffff8880fa52b000(0000) knlGS:0000000000000000 [ 108.133256] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 108.133952] CR2: 00007ffce474e6e8 CR3: 0000000143296006 CR4: 0000000000370ef0 [ 108.134787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 108.135560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 108.136375] Call Trace: [ 108.136694] <TASK> [ 108.136939] devm_cxl_memdev_edac_release+0x6d/0x130 [cxl_core] [ 108.137611] cxl_memdev_release+0x22/0x40 [cxl_core] [ 108.138268] device_release+0x30/0x90 [ 108.138720] kobject_put+0x85/0x1c0 [ 108.139111] put_device+0xe/0x20 [ 108.139478] cxl_memdev_unregister+0x42/0x50 [cxl_core] [ 108.140092] devm_action_release+0xd/0x20 [ 108.140531] devres_release_all+0xa5/0xe0 [ 108.141059] device_unbind_cleanup+0xd/0x70 [ 108.141416] device_release_driver_internal+0x1d3/0x220 [ 108.141890] device_release_driver+0xd/0x20 [ 108.142236] bus_remove_device+0xd7/0x140 [ 108.142578] device_del+0x15a/0x3a0 [ 108.142932] platform_device_del.part.0+0x13/0x80 [ 108.143330] platform_device_unregister+0x1b/0x40 [ 108.143786] cxl_test_exit+0x1a/0xd80 [cxl_test] [ 108.144171] __x64_sys_delete_module+0x17d/0x260 [ 108.144579] ? debug_smp_processor_id+0x17/0x20 [ 108.145004] x64_sys_call+0x19bd/0x1f90 [ 108.145321] do_syscall_64+0x64/0x140 [ 108.145628] entry_SYSCALL_64_after_hwframe+0x71/0x79 [ 108.146284] RIP: 0033:0x7fce1d5128cb [ 108.146775] Code: 73 01 c3 48 8b 0d 55 55 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 55 0e 00 f7 d8 64 89 01 48 [ 108.148672] RSP: 002b:00007ffce4752768 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 108.149490] RAX: ffffffffffffffda RBX: 0000556b12b5dd60 RCX: 00007fce1d5128cb [ 108.150284] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000556b12b5ddc8 [ 108.151169] RBP: 0000556b12b5ddc8 R08: 1999999999999999 R09: 0000000000000000 [ 108.151982] R10: 00007fce1d59dac0 R11: 0000000000000206 R12: 0000000000000001 [ 108.152756] R13: 0000000000000000 R14: 00007ffce4754aa8 R15: 0000556b12b5def0 [ 108.153523] </TASK> [ 108.153925] Modules linked in: dax_pmem nd_pmem kmem nd_btt device_dax dax_cxl nd_e820 nfit cxl_mock_mem(O) cxl_test(O-) cxl_mem(O) cxl_pmem(O) cxl_acpi(O) cxl_port(O) cxl_mock(O) cxl_core(O) libnvdimm [ 108.155784] ---[ end trace 0000000000000000 ]---