On 7/22/2025 6:03 PM, Robin Murphy wrote:
> On 2025-07-22 8:24 am, Greg KH wrote:
>> On Tue, Jul 22, 2025 at 11:05:11AM +0500, Muhammad Usama Anjum wrote:
>>> Adding ath/mhi and dma API developers to the discussion.
>>>
>>> On 7/22/25 10:32 AM, Greg KH wrote:
>>>> On Mon, Jul 21, 2025 at 06:13:10PM +0100, Matthew Wilcox wrote:
>>>>> On Mon, Jul 21, 2025 at 08:03:12PM +0500, Muhammad Usama Anjum wrote:
>>>>>> Hello,
>>>>>>
>>>>>> When 10-12GB out of total 16GB RAM is being used as page cache
>>>>>> (active_file + inactive_file) at suspend time, the drivers fail to allocate
>>>>>> dma memory at resume as dma memory is either occupied by the page cache or
>>>>>> fragmented. Example:
>>>>>>
>>>>>> kworker/u33:5: page allocation failure: order:7, mode:0xc04(GFP_NOIO|GFP_DMA32),
>>>>>> nodemask=(null),cpuset=/,mems_allowed=0
>>>>>
>>>>> Just to be clear, this is not a page cache problem. The driver is asking
>>>>> us to do a 512kB allocation without doing I/O! This is a ridiculous
>>>>> request that should be expected to fail.
>>>>>
>>>>> The solution, whatever it may be, is not related to the page cache.
>>>>> I reject your diagnosis. Almost all of the page cache is clean and
>>>>> could be dropped (as far as I can tell from the output below).
>>>>>
>>>>> Now, I'm not too familiar with how the page allocator chooses to fail
>>>>> this request. Maybe it should be trying harder to drop bits of the page
>>>>> cache. Maybe it should be doing some compaction.
>>> That's very thoughtful. I'll look at the page allocator why isn't it dropping
>>> cache or doing compaction.
>>>
>>>>> I am not inclined to
>>>>> go digging on your behalf, because frankly I'm offended by the suggestion
>>>>> that the page cache is at fault.
>>> I apologize—that wasn't my intention.
>>>
>>>>>
>>>>> Perhaps somebody else will help you, or you can dig into this yourself.
>>>>
>>>> I'm with Matthew, this really looks like a driver bug somehow. If there
>>>> is page cache memory that is "clean", the driver should be able to
>>>> access it just fine if really required.
>>>>
>>>> What exact driver(s) is having this problem? What is the exact error,
>>>> and on what lines of code?
>>> The issue occurs on both ath11k and mhi drivers during resume, when
>>> dma_alloc_coherent(GFP_KERNEL) fails and returns -ENOMEM. This failure has
>>> been observed at multiple points in these drivers.
>>>
>>> For example, in the mhi driver, the failure is triggered when the
>>> MHI's st_worker gets scheduled-in at resume.
>>>
>>> mhi_pm_st_worker()
>>> -> mhi_fw_load_handler()
>>> -> mhi_load_image_bhi()
>>> -> mhi_alloc_bhi_buffer()
>>> -> dma_alloc_coherent(GFP_KERNEL) returns -ENOMEM
>>
>> And what is the exact size you are asking for here?
>> What is the dma ops set to for your system? Are you sure that is
>> working properly for your platform? What platform is this exactly?
>>
>> The driver isn't asking for DMA32 here, so that shouldn't be the issue,
>> so why do you feel it is? Have you tried using the tracing stuff for
>> dma allocations to see exactly what is going on for this failure?
>
> I'm guessing the device has a 32-bit DMA mask, and the allocation ends up in
Yeah, the device is capable of 32 bit coherent DMA only.

> __dma_direct_alloc_pages() such that that adds GFP_DMA32 in order to try to satisfy the
> mask via regular page allocation. How GFP_KERNEL turns into GFP_NOIO, though, given that
> the DMA layer certainly isn't (knowingly) messing with __GFP_IO or __GFP_FS, is more of a
> mystery... I suppose "during resume" is the red flag there - is this worker perhaps trying
> to run too early in some restricted context before the rest of the system has fully woken up?
the worker is running at __resume_early stage.

>
> Thanks,
> Robin.
>
>>
>> I think you need to do a bit more debugging :)
>>
>> thanks,
>>
>> greg k-h
>
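
For reference, the numbers in the failure line can be checked with a small
user-space sketch. It is only an illustration: the DEMO_* constants are
hypothetical copies chosen to match the decoded log line
(0xc04 == GFP_NOIO|GFP_DMA32), not the kernel's own definitions, and the
4 KiB page size is an assumption about the platform. It shows that order:7
corresponds to the 512kB request mentioned above, and that stripping
__GFP_IO and __GFP_FS from GFP_KERNEL|GFP_DMA32 would produce the reported
mode. What actually strips those bits at resume is not established in this
thread.

```c
/* Assumed values for illustration only (user-space copies, not kernel headers). */
#define DEMO_PAGE_SIZE   4096UL  /* assumption: 4 KiB pages */
#define DEMO_GFP_DMA32   0x004u
#define DEMO_GFP_IO      0x040u
#define DEMO_GFP_FS      0x080u
#define DEMO_GFP_RECLAIM 0xc00u  /* direct reclaim + kswapd reclaim bits */
#define DEMO_GFP_NOIO    (DEMO_GFP_RECLAIM)
#define DEMO_GFP_KERNEL  (DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS)

/* Compile-time check that the assumed bits reproduce the logged mode. */
_Static_assert((DEMO_GFP_NOIO | DEMO_GFP_DMA32) == 0xc04u,
	       "assumed bit layout does not match mode:0xc04");

/* Bytes covered by one buddy allocation of the given order. */
unsigned long demo_order_to_bytes(unsigned int order)
{
	return DEMO_PAGE_SIZE << order;
}

/* Mask that results if the IO and FS bits are removed from a request. */
unsigned int demo_strip_io_fs(unsigned int gfp)
{
	return gfp & ~(DEMO_GFP_IO | DEMO_GFP_FS);
}
```

With these assumptions, demo_order_to_bytes(7) is 524288 bytes (512 KiB), and
demo_strip_io_fs(DEMO_GFP_KERNEL | DEMO_GFP_DMA32) is 0xc04.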