On 21.03.2025 17:48, Robin Murphy wrote:
> On 21/03/2025 12:15 pm, Marek Szyprowski wrote:
>> On 17.03.2025 19:22, Robin Murphy wrote:
>>> On 17/03/2025 7:37 am, Marek Szyprowski wrote:
>>>> On 13.03.2025 15:12, Robin Murphy wrote:
>>>>> On 2025-03-13 1:06 pm, Robin Murphy wrote:
>>>>>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
>>>>>>> On 13.03.2025 12:01, Robin Murphy wrote:
>>>>>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>>>>>>>> [...]
>>>>>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>>>>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my
>>>>>>>>> tests I found it breaks booting of ARM64 RK3568-based Odroid-M1
>>>>>>>>> board (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is
>>>>>>>>> the relevant kernel log:
>>>>>>>>
>>>>>>>> ...and the bug-flushing-out begins!
>>>>>>>>
>>>>>>>>> Unable to handle kernel NULL pointer dereference at virtual address 00000000000003e8
>>>>>>>>> Mem abort info:
>>>>>>>>> ESR = 0x0000000096000004
>>>>>>>>> EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>>>>> SET = 0, FnV = 0
>>>>>>>>> EA = 0, S1PTW = 0
>>>>>>>>> FSC = 0x04: level 0 translation fault
>>>>>>>>> Data abort info:
>>>>>>>>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>>>>>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>>>>>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>>>>>> [00000000000003e8] user address but active_mm is swapper
>>>>>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>>>>>> Modules linked in:
>>>>>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>>>>>>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>>>>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>>> pc : devm_kmalloc+0x2c/0x114
>>>>>>>>> lr : rk_iommu_of_xlate+0x30/0x90
>>>>>>>>> ...
>>>>>>>>> Call trace:
>>>>>>>>> devm_kmalloc+0x2c/0x114 (P)
>>>>>>>>> rk_iommu_of_xlate+0x30/0x90
>>>>>>>>
>>>>>>>> Yeah, looks like this is doing something a bit questionable which
>>>>>>>> can't work properly. TBH the whole dma_dev thing could probably be
>>>>>>>> cleaned up now that we have proper instances, but for now does this
>>>>>>>> work?
>>>>>>>
>>>>>>> Yes, this patch fixes the problem I've observed.
>>>>>>>
>>>>>>> Reported-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
>>>>>>> Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
>>>>>>>
>>>>>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver
>>>>>>> and I doubt it can be cleaned up.
>>>>>>
>>>>>> On the contrary I suspect they both can - it all dates back to when
>>>>>> we had the single global platform bus iommu_ops and the SoC drivers
>>>>>> were forced to bodge their own notion of multiple instances, but with
>>>>>> the modern core code, ops are always called via a valid IOMMU
>>>>>> instance or domain, so in principle it should always be possible to
>>>>>> get at an appropriate IOMMU device now. IIRC it was mostly about
>>>>>> allocating and DMA-mapping the pagetables in domain_alloc, where the
>>>>>> private notion of instances didn't have enough information, but
>>>>>> domain_alloc_paging solves that.
>>>>>
>>>>> Bah, in fact I think I am going to have to do that now, since although
>>>>> it doesn't crash, rk_domain_alloc_paging() will also be failing for
>>>>> the same reason. Time to find a PSU for the RK3399 board, I guess...
>>>>>
>>>>> (Or maybe just move the dma_dev assignment earlier to match Exynos?)
>>>>
>>>> Well I just found that Exynos IOMMU is also broken on some on my test
>>>> boards. It looks that the runtime pm links are somehow not correctly
>>>> established. I will try to analyze this later in the afternoon.
>>>
>>> Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable
>>> to boot my original 6.14-rc3-based branch - even with the IOMMU driver
>>> disabled, it's consistently dying somewhere near (or just after) init
>>> with what looks like some catastrophic memory corruption issue - very
>>> occasionally it's managed to print the first line of various different
>>> panics.
>>>
>>> Before that point though, with the IOMMU driver enabled it does appear
>>> to show signs of working OK:
>>>
>>> [ 0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3
>>> [ 0.654220] platform 14450000.mixer: Adding to iommu group 1
>>> ...
>>> [ 2.680920] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42924000
>>> ...
>>> [ 5.196674] exynos-mixer 14450000.mixer: exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable 0x42924000
>>> [ 5.207091] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42884000
>>>
>>>
>>> The multi-instance stuff in probe/release does look a bit suspect,
>>> however - seems like the second instance probe would overwrite the
>>> first instance's links, and then there would be a double-del() if the
>>> device were ever actually released again? I may have made that much
>>> more likely to happen, but I suspect it was already possible with
>>> async driver probe...
>>
>> That is really strange. My Odroid XU3 boots fine from commit
>> bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path"),
>> although the IOMMU seems not to be working correctly. I've tested this
>> with 14450000.mixer device (one need to attach HDMI cable to get it
>> activated) and it looks that the video data are not being read from
>> memory at all (the lack of VSYNC is reported, no IOMMU fault). However,
>> from time to time, everything initializes and works properly.
>
> Urgh, seems my mistake was assuming exynos_defconfig was the right
> thing to begin from - bcb81ac6ae3c with that still dies in the same
> way (this time I saw a hint of spin_bug() being hit...), however a
> multi_v7_defconfig build does get to userspace OK again with no
> obvious signs of distress:
>
> [root@alarm ~]# grep -Hr . /sys/kernel/iommu_groups/*/type
> /sys/kernel/iommu_groups/0/type:identity
> /sys/kernel/iommu_groups/1/type:identity
> /sys/kernel/iommu_groups/10/type:identity
> /sys/kernel/iommu_groups/2/type:identity
> /sys/kernel/iommu_groups/3/type:identity
> /sys/kernel/iommu_groups/4/type:identity
> /sys/kernel/iommu_groups/5/type:identity
> /sys/kernel/iommu_groups/6/type:identity
> /sys/kernel/iommu_groups/7/type:identity
> /sys/kernel/iommu_groups/8/type:identity
> /sys/kernel/iommu_groups/9/type:identity
>
> Annoyingly I do have an adapter for the fiddly micro-HDMI, but it's at
> home :(
>
>> It looks that this is somehow related to the different IOMMU/DMA-mapping
>> glue code, as the other boards (ARM64 based) with exactly the same
>> Exynos IOMMU driver always work fine. I've tried to figure out what
>> actually happens, but so far I didn't get anything for sure. Disabling
>> the call to dev->bus->dma_configure(dev) from iommu_init_device() seems
>> to be fixing this, but this is almost equal to the revert of the
>> $subject patch. I don't get why calling it in iommu_init_device() causes
>> problems. It also doesn't look that this is anyhow related to the
>> multi-instance stuff, as the same happens if I only leave a single
>> exynos-sysmmu instance and its client (only 14450000.mixer device in the
>> system).
>
> On a hunch I stuck a print in exynos_iommu_probe_device(), and it
> looks like in fact device_link_add() isn't getting called at all, and
> indeed your symptoms do sound like they could be explained by the
> IOMMU not being reliably resumed... lemme stare at
> exynos_iommu_of_xlate() a bit longer...

Just to let everyone know.
The $subject change is okay. This is a bug in the exynos-iommu driver,
fixed by the following patch:

https://lore.kernel.org/all/20250401202731.2810474-1-m.szyprowski@xxxxxxxxxxx/

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland