Hi, I've been staring at the crypto IAA driver and checking if "wq_table" relies on disabling bottom halves/ expecting that softirq is not preemtible. It does not look like it does. wq_table_next_wq() returns a "struct idxd_wq *" which could return twice the same but the struct itself has its own locking which looks fine. Please correct me if I am wrong ;) While looking at it, I noticed a few other things: - wq_table This could be initialized as DEFINE_PER_CPU() instead the alloc_percpu() it has now. I see no benefit from delaying it. The driver is loaded on its own once the dmaengine counterpart is loaded first and added the special bus plus the device. So the allocation could be avoided. - nr_cpus, nr_nodes, nr_cpus_per_node. This is derived from - nr_cpus = num_possible_cpus(); - for_each_node_with_cpus(node) nr_nodes++; - nr_cpus_per_node = nr_cpus / nr_nodes; Moving on. nr_cpus_per_node is only used as cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa; Given the above definition for nr_cpus_per_node we can substitute cpus_per_iaa = nr_cpus / nr_iaa and then remove nr_cpus_per_node as I see no other user. If we ignore the one pr_debug() then we could also remove nr_nodes because there are no users. There are several iterations which go from 0 to nr_nodes. I would recommend using for_each_possible_cpu(). Then nr_cpus could be removed. - wq_table->wqs This is allocated as (max_wqs * sizeof(struct wq *)). I don't know where "struct wq" is from but it is not "struct idxd_wq" and gcc probably doesn't care because it is a pointer. Also the memory for the per-CPU memory is allocated always from the _current_ NUMA node. Not bad but also not ideal. How huge can max_wqs can get? If it is, say 64, then you could make a fix array of 64 which is part of struct wq_table_entry and avoid the allocation. Another variant would be to do something like struct wq_table_entry { int max_wqs; int n_wqs; int cur_wq; struct idxd_wq wqs[]; }; and then wq_table = alloc_percpu(struct_size(wq_table, wqs, max_wqs)); voilà. I didn't figure out how the final struct idxd_wq is allocated but it would make sense to allocate it node local the CPU if possible. Sebastian