On 29/04/2025 10:45, Suzuki K Poulose wrote:
> On 16/04/2025 14:41, Steven Price wrote:
>> Add the KVM_CAP_ARM_RME_CREATE_RD ioctl to create a realm. This involves
>> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
>> the base level of the Realm Translation Tables (RTT). A VMID also needs
>> to be picked; since the RMM has a separate VMID address space, a
>> dedicated allocator is added for this purpose.
>>
>> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
>> before it is created. Configuration options can be classified as:
>>
>> 1. Parameters specific to the Realm stage2 (e.g. IPA Size, vmid, stage2
>>    entry level, entry level RTTs, number of RTTs in start level, LPA2).
>>    Most of these are not measured by RMM and come from KVM book
>>    keeping.
>>
>> 2. Parameters controlling "Arm Architecture features for the VM". (e.g.
>>    SVE VL, PMU counters, number of HW BRPs/WPs), configured by the VMM
>>    using the "user ID register write" mechanism. These will be
>>    supported in the later patches.
>>
>> 3. Parameters that are not part of the core Arm architecture but defined
>>    by the RMM spec (e.g. Hash algorithm for measurement,
>>    Personalisation value). These are programmed via
>>    KVM_CAP_ARM_RME_CONFIG_REALM.
>>
>> For the IPA size there is the possibility that the RMM supports a
>> different size to the IPA size supported by KVM for normal guests. At
>> the moment the 'normal limit' is exposed by KVM_CAP_ARM_VM_IPA_SIZE and
>> the IPA size is configured by the bottom bits of vm_type in
>> KVM_CREATE_VM. This means that it isn't easy for the VMM to discover
>> what IPA sizes are supported for Realm guests. Since the IPA is part of
>> the measurement of the realm guest the current expectation is that the
>> VMM will be required to pick the IPA size demanded by attestation and
>> therefore simply failing if this isn't available is fine. An option
>> would be to expose a new capability ioctl to obtain the RMM's maximum
>> IPA size if this is needed in the future.
>>
>> Co-developed-by: Suzuki K Poulose <suzuki.poulose@xxxxxxx>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@xxxxxxx>
>> Signed-off-by: Steven Price <steven.price@xxxxxxx>
>> Reviewed-by: Gavin Shan <gshan@xxxxxxxxxx>
>> ---
>> Changes since v7:
>>  * Minor code cleanup following Gavin's review.
>> Changes since v6:
>>  * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>>    host page size to be larger than 4k while still communicating with an
>>    RMM which uses 4k granules.
>> Changes since v5:
>>  * Introduce free_delegated_granule() to replace many
>>    undelegate/free_page() instances and centralise the comment on
>>    leaking when the undelegate fails.
>>  * Several other minor improvements suggested by reviews - thanks for
>>    the feedback!
>> Changes since v2:
>>  * Improved commit description.
>>  * Improved return failures for rmi_check_version().
>>  * Clear contents of PGD after it has been undelegated in case the RMM
>>    left stale data.
>>  * Minor changes to reflect changes in previous patches.
>> ---
>>  arch/arm64/include/asm/kvm_emulate.h |   5 +
>>  arch/arm64/include/asm/kvm_rme.h     |  19 ++
>>  arch/arm64/kvm/arm.c                 |  16 ++
>>  arch/arm64/kvm/mmu.c                 |  22 +-
>>  arch/arm64/kvm/rme.c                 | 319 +++++++++++++++++++++++++++
>>  5 files changed, 379 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 1c43a4fc25dd..4ee6c215da82 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -699,6 +699,11 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>>  	return READ_ONCE(kvm->arch.realm.state);
>>  }
>> +static inline bool kvm_realm_is_created(struct kvm *kvm)
>> +{
>> +	return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>> +}
>> +
>>  static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
>>  {
>>  	return false;
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index 9c8a0b23e0e4..5dc1915de891 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -6,6 +6,8 @@
>>  #ifndef __ASM_KVM_RME_H
>>  #define __ASM_KVM_RME_H
>> +#include <uapi/linux/kvm.h>
>> +
>>  /**
>>   * enum realm_state - State of a Realm
>>   */
>> @@ -46,11 +48,28 @@ enum realm_state {
>>   * struct realm - Additional per VM data for a Realm
>>   *
>>   * @state: The lifetime state machine for the realm
>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>> + * @params: Parameters for the RMI_REALM_CREATE command
>> + * @num_aux: The number of auxiliary pages required by the RMM
>> + * @vmid: VMID to be used by the RMM for the realm
>> + * @ia_bits: Number of valid Input Address bits in the IPA
>>   */
>>  struct realm {
>>  	enum realm_state state;
>> +
>> +	void *rd;
>> +	struct realm_params *params;
>> +
>> +	unsigned long num_aux;
>> +	unsigned int vmid;
>> +	unsigned int ia_bits;
>>  };
>>  void kvm_init_rme(void);
>> +u32 kvm_realm_ipa_limit(void);
>> +
>> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
>> +int kvm_init_realm_vm(struct kvm *kvm);
>> +void kvm_destroy_realm(struct kvm *kvm);
>>  #endif /* __ASM_KVM_RME_H */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 856a721d41ac..0e8482fdc4d3 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -136,6 +136,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>  		}
>>  		mutex_unlock(&kvm->lock);
>>  		break;
>> +	case KVM_CAP_ARM_RME:
>> +		mutex_lock(&kvm->lock);
>> +		r = kvm_realm_enable_cap(kvm, cap);
>> +		mutex_unlock(&kvm->lock);
>> +		break;
>>  	default:
>>  		break;
>>  	}
>> @@ -198,6 +203,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>  	bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>> +	/* Initialise the realm bits after the generic bits are enabled */
>> +	if (kvm_is_realm(kvm)) {
>> +		ret = kvm_init_realm_vm(kvm);
>> +		if (ret)
>> +			goto err_free_cpumask;
>> +	}
>> +
>>  	return 0;
>>  err_free_cpumask:
>> @@ -257,6 +269,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>  	kvm_unshare_hyp(kvm, kvm + 1);
>>  	kvm_arm_teardown_hypercalls(kvm);
>> +	kvm_destroy_realm(kvm);
>>  }
>>  static bool kvm_has_full_ptr_auth(void)
>> @@ -405,6 +418,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  	case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
>>  		r = BIT(0);
>>  		break;
>> +	case KVM_CAP_ARM_RME:
>> +		r = static_key_enabled(&kvm_rme_is_available);
>> +		break;
>>  	default:
>>  		r = 0;
>>  	}
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 2feb6c6b63af..5957a07de86d 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -876,12 +876,16 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>  	.icache_inval_pou = invalidate_icache_guest_page,
>>  };
>> -static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
>> +static int kvm_init_ipa_range(struct kvm *kvm,
>> +			      struct kvm_s2_mmu *mmu, unsigned long type)
>>  {
>>  	u32 kvm_ipa_limit = get_kvm_ipa_limit();
>>  	u64 mmfr0, mmfr1;
>>  	u32 phys_shift;
>> +	if (kvm_is_realm(kvm))
>> +		kvm_ipa_limit = kvm_realm_ipa_limit();
>> +
>>  	if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>>  		return -EINVAL;
>> @@ -946,7 +950,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>  		return -EINVAL;
>>  	}
>> -	err = kvm_init_ipa_range(mmu, type);
>> +	err = kvm_init_ipa_range(kvm, mmu, type);
>>  	if (err)
>>  		return err;
>> @@ -1072,6 +1076,20 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>  	struct kvm_pgtable *pgt = NULL;
>>  	write_lock(&kvm->mmu_lock);
>> +	if (kvm_is_realm(kvm) &&
>> +	    (kvm_realm_state(kvm) != REALM_STATE_DEAD &&
>> +	     kvm_realm_state(kvm) != REALM_STATE_NONE)) {
>> +		/* Tearing down RTTs will be added in a later patch */
>> +		write_unlock(&kvm->mmu_lock);
>> +
>> +		/*
>> +		 * The physical PGD pages are delegated to the RMM, so cannot
>> +		 * be freed at this point. This function will be called again
>> +		 * from kvm_destroy_realm() after the physical pages have been
>> +		 * returned at which point the memory can be freed.
>> +		 */
>> +		return;
>> +	}
>>  	pgt = mmu->pgt;
>>  	if (pgt) {
>>  		mmu->pgd_phys = 0;
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index 67cf2d94cb2d..dbb6521fe380 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -5,9 +5,23 @@
>>  #include <linux/kvm_host.h>
>> +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_mmu.h>
>>  #include <asm/rmi_cmds.h>
>>  #include <asm/virt.h>
>> +#include <asm/kvm_pgtable.h>
>> +
>> +static unsigned long rmm_feat_reg0;
>> +
>> +#define RMM_PAGE_SHIFT 12
>> +#define RMM_PAGE_SIZE BIT(RMM_PAGE_SHIFT)
>> +
>> +static bool rme_has_feature(unsigned long feature)
>> +{
>> +	return !!u64_get_bits(rmm_feat_reg0, feature);
>> +}
>> +
>>  static int rmi_check_version(void)
>>  {
>>  	struct arm_smccc_res res;
>> @@ -42,6 +56,305 @@ static int rmi_check_version(void)
>>  	return 0;
>>  }
>> +u32 kvm_realm_ipa_limit(void)
>> +{
>> +	return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>> +}
>> +
>> +static int get_start_level(struct realm *realm)
>> +{
>> +	return 4 - ((realm->ia_bits - 8) / (RMM_PAGE_SHIFT - 3));
>
> minor nit: It may be worth adding a comment here, how we got this magic
> number 8. We could say:
>
> Open coded version of ARM64_HW_PGTABLE_LEVELS(ia_bits - 4) (accounting
> for the concatenation of up to 16 tables in entry level) for RMM Stage2.
>

Agreed, a comment makes sense. This is actually an open coded version of
"4 - stage2_pgtable_levels()" but using the RMM's page size (indeed
previous versions of the series did exactly that). There's already a
comment for stage2_pgtable_levels() explaining the concatenation.

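For anyone wanting to check the arithmetic, here is a minimal standalone
sketch of the two calculations in the quoted patch. It assumes a 4k RMM
granule (RMM_PAGE_SHIFT of 12) and takes ia_bits as a plain integer
instead of a struct realm. The names rmm_start_level(),
rmm_num_root_rtts() and RMM_BITS_PER_LEVEL are made up for illustration
and are not part of the series, and the root-table count uses the
"4 - start_level" form suggested in review rather than the patch's
"3 - start_level" plus "levels + 1".

```c
#include <assert.h>

/* Assumption: the RMM uses 4k granules, matching RMM_PAGE_SHIFT in the patch. */
#define RMM_PAGE_SHIFT		12
/* Hypothetical helper macro: a 4k table of 8-byte entries resolves 9 bits. */
#define RMM_BITS_PER_LEVEL	(RMM_PAGE_SHIFT - 3)

/*
 * Illustrative copy of get_start_level().
 *
 * (ia_bits - 8) / 9 is the open coded level count: the usual
 * "(va_bits - 4) / (PAGE_SHIFT - 3)" applied to (ia_bits - 4), because
 * up to 16 concatenated entry-level tables absorb 4 bits of IPA.
 */
static int rmm_start_level(unsigned int ia_bits)
{
	/* This sketch only covers sizes that give a start level of 0..3. */
	assert(ia_bits >= 17 && ia_bits <= 52);

	return 4 - (int)((ia_bits - 8) / RMM_BITS_PER_LEVEL);
}

/*
 * Illustrative copy of realm_num_root_rtts(): number of concatenated
 * tables needed at the start level.
 */
static int rmm_num_root_rtts(unsigned int ia_bits)
{
	unsigned int levels = 4 - rmm_start_level(ia_bits);
	/* IPA bits that a single table at the start level can resolve. */
	unsigned int sl_ipa_bits = levels * RMM_BITS_PER_LEVEL + RMM_PAGE_SHIFT;

	if (sl_ipa_bits >= ia_bits)
		return 1;

	return 1 << (ia_bits - sl_ipa_bits);
}
```

Under these assumptions a 48-bit IPA gives start level 0 with a single
root table, a 40-bit IPA gives start level 1 with two concatenated
tables, and a 52-bit IPA gives start level 0 with sixteen.
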
>> +}
>> +
>> +static void free_delegated_granule(phys_addr_t phys)
>> +{
>> +	if (WARN_ON(rmi_granule_undelegate(phys))) {
>> +		/* Undelegate failed: leak the page */
>> +		return;
>> +	}
>> +
>> +	kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>> +
>> +	free_page((unsigned long)phys_to_virt(phys));
>> +}
>> +
>> +/* Calculate the number of s2 root rtts needed */
>> +static int realm_num_root_rtts(struct realm *realm)
>> +{
>> +	unsigned int ipa_bits = realm->ia_bits;
>> +	unsigned int levels = 3 - get_start_level(realm);
>
> nit: Why is this 3 - start_level and not 4? Though that is compensated
> by the "levels + 1" below, hence the calculation is correct.

I honestly can't remember - will change ;)

>> +	unsigned int sl_ipa_bits = (levels + 1) * (RMM_PAGE_SHIFT - 3) +
>> +				   RMM_PAGE_SHIFT;
>> +
>> +	if (sl_ipa_bits >= ipa_bits)
>> +		return 1;
>> +
>> +	return 1 << (ipa_bits - sl_ipa_bits);
>> +}
>> +
>> +static int realm_create_rd(struct kvm *kvm)
>> +{
>> +	struct realm *realm = &kvm->arch.realm;
>> +	struct realm_params *params = realm->params;
>> +	void *rd = NULL;
>> +	phys_addr_t rd_phys, params_phys;
>> +	size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +	int i, r;
>> +	int rtt_num_start;
>> +
>> +	realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +	rtt_num_start = realm_num_root_rtts(realm);
>> +
>> +	if (WARN_ON(realm->rd || !realm->params))
>> +		return -EEXIST;
>> +
>> +	if (pgd_size / RMM_PAGE_SIZE < rtt_num_start)
>> +		return -EINVAL;
>> +
>> +	rd = (void *)__get_free_page(GFP_KERNEL);
>> +	if (!rd)
>> +		return -ENOMEM;
>> +
>> +	rd_phys = virt_to_phys(rd);
>> +	if (rmi_granule_delegate(rd_phys)) {
>> +		r = -ENXIO;
>> +		goto free_rd;
>> +	}
>> +
>> +	for (i = 0; i < pgd_size; i += RMM_PAGE_SIZE) {
>> +		phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i;
>> +
>> +		if (rmi_granule_delegate(pgd_phys)) {
>> +			r = -ENXIO;
>> +			goto out_undelegate_tables;
>> +		}
>> +	}
>> +
>> +	params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +	params->rtt_level_start = get_start_level(realm);
>> +	params->rtt_num_start = rtt_num_start;
>> +	params->rtt_base = kvm->arch.mmu.pgd_phys;
>> +	params->vmid = realm->vmid;
>> +
>> +	params_phys = virt_to_phys(params);
>> +
>> +	if (rmi_realm_create(rd_phys, params_phys)) {
>> +		r = -ENXIO;
>> +		goto out_undelegate_tables;
>> +	}
>> +
>> +	if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
>> +		WARN_ON(rmi_realm_destroy(rd_phys));
>> +		goto out_undelegate_tables;
>> +	}
>> +
>> +	realm->rd = rd;
>> +
>> +	return 0;
>> +
>> +out_undelegate_tables:
>> +	while (i > 0) {
>> +		i -= RMM_PAGE_SIZE;
>> +
>> +		phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i;
>> +
>> +		if (WARN_ON(rmi_granule_undelegate(pgd_phys))) {
>> +			/* Leak the pages if they cannot be returned */
>> +			kvm->arch.mmu.pgt = NULL;
>> +			break;
>> +		}
>
> minor nit: Do we need to try undelegating the other pages? We could
> make that WARN_ON_ONCE() too.

We could do, but there's no real point unless we also deal with leaking
only the pages which cannot be undelegated. Considering this is a
"should never happen" case I felt it was better to keep the code simple
and leak more than necessary.

To me this feels like a "save everything and reboot as soon as possible"
sort of error condition - the kernel (or RMM) has lost track of which
pages are delegated and so a GPT fault could be coming at any point. So
it's better to play safe and be pessimistic.

Thanks,
Steve