On 4/25/25 1:59 PM, Manivannan Sadhasivam wrote:
> On Fri, Apr 25, 2025 at 12:42:38PM +0500, Muhammad Usama Anjum wrote:
>> On 4/25/25 12:32 PM, Manivannan Sadhasivam wrote:
>>> On Fri, Apr 25, 2025 at 12:14:39PM +0500, Muhammad Usama Anjum wrote:
>>>> On 4/25/25 12:04 PM, Manivannan Sadhasivam wrote:
>>>>> On Thu, Apr 10, 2025 at 07:56:54PM +0500, Muhammad Usama Anjum wrote:
>>>>>> Fix dma_direct_alloc() failure at resume time during bhie_table
>>>>>> allocation. There is a crash report where at resume time, the memory
>>>>>> from the dma doesn't get allocated and MHI fails to re-initialize.
>>>>>> There may be fragmentation of some kind which fails the allocation
>>>>>> call.
>>>>>>
>>>>>
>>>>> If dma_direct_alloc() fails, then it is a platform limitation/issue. We cannot
>>>>> workaround that in the device drivers. What is the guarantee that other drivers
>>>>> will also continue to work? Will you go ahead and patch all of them which
>>>>> release memory during suspend?
>>>>>
>>>>> Please investigate why the allocation fails. Even this is not a device issue, so
>>>>> we cannot add quirks :/
>>>> This isn't a platform specific quirk. We are only hitting it because
>>>> there is high memory pressure during suspend/resume. This dma allocation
>>>> failure can happen with memory pressure on any device.
>>>>
>>>
>>> Yes.
>> Thanks for understanding.
>>
>>>
>>>> The purpose of this patch is just to make driver more robust to memory
>>>> pressure during resume.
>>>>
>>>> I'm not sure about MHI. But other drivers already have such patches as
>>>> dma_direct_alloc() is susceptible to failures when memory pressure is
>>>> high. This patch was motivated from ath12k [1] and ath11k [2].
>>>>
>>>
>>> Even if we patch the MHI driver, the issue is going to trip some other driver.
>>> How does the DMA memory goes low during resume? So some other driver is
>>> consuming more than it did during probe()?
>> Think it like this. The first probe happens just after boot. Most of the
>> RAM was empty. Then let's say user launches applications which not only
>> consume entire RAM but also the Swap. The DMA memory area is the first
>> ~4GB on x86_64 (if I'm not mistaken). Now at resume time when we want to
>> allocate memory from dma, it may not be available entirely or because of
>> fragmentation we cannot allocate that much contiguous memory.
>>
>
> Looks like you have a workload that consumes the limited DMA coherent memory.
> Most likely the GPU applications I think.
>
>> In our testing and real world cases, right now only wifi driver is
>> misbehaving. Wifi is also very important. So we are hoping to make wifi
>> driver robust.
>>
>
> Sounds fair. If you want to move forward, please modify the existing
> mhi_power_down_keep_dev() to include this partial unprepare as well:
>
> mhi_power_down_unprepare_keep_dev()
>
> Since both APIs are anyway going to be used together, I don't see a need to
> introduce yet another API.
I've looked at the usages of mhi_power_down_keep_dev(). It's used by both
ath12k and ath11k, so we would have to look at ath12k as well before we can
change mhi_power_down_keep_dev(). Unfortunately, I don't have a device
using ath12k at hand. Should we keep this new API, or what should we do?

>
> - Mani
>
>>>
>>>> [1]
>>>> https://lore.kernel.org/all/20240419034034.2842-1-quic_bqiang@xxxxxxxxxxx/
>>>> [2]
>>>> https://lore.kernel.org/all/20220506141448.10340-1-quic_akolli@xxxxxxxxxxx/
>>>>
>>>> What do you think can be the way forward for this patch?
>>>>
>>>
>>> Let's try first to analyze why the memory pressure happens during suspend. As I
>>> can see, even if we fix the MHI driver, you are likely to hit this issue
>>> somewhere else.
>>>
>>> - Mani
>>>
>>>>>
>>>
>>> [...]
>>>
>>>>> Did you intend to leak this information? If not, please remove it from
>>>>> stacktrace.
>>>> The device isn't private. It's fine.
>>>>
>>>
>>> Okay.
>>>
>>> - Mani
>>>
>>
>>
>> --
>> Regards,
>> Usama
>

--
Regards,
Usama