Re: [PATCH v8 07/43] arm64: RME: ioctls to create and configure realms

Steven Price <steven.price@xxxxxxx> · Thu, 1 May 2025 16:09:16 +0100

On 29/04/2025 10:45, Suzuki K Poulose wrote:
> On 16/04/2025 14:41, Steven Price wrote:
>> Add the KVM_CAP_ARM_RME_CREATE_RD ioctl to create a realm. This involves
>> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
>> the base level of the Realm Translation Tables (RTT). A VMID also needs
>> to be picked; since the RMM has a separate VMID address space, a
>> dedicated allocator is added for this purpose.
>>
>> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
>> before it is created. Configuration options can be classified as:
>>
>>   1. Parameters specific to the Realm stage2 (e.g. IPA Size, vmid, stage2
>>      entry level, entry level RTTs, number of RTTs in start level, LPA2)
>>      Most of these are not measured by the RMM and come from KVM
>>      bookkeeping.
>>
>>   2. Parameters controlling "Arm Architecture features for the VM". (e.g.
>>      SVE VL, PMU counters, number of HW BRPs/WPs), configured by the VMM
>>      using the "user ID register write" mechanism. These will be
>>      supported in the later patches.
>>
>>   3. Parameters that are not part of the core Arm architecture but defined
>>      by the RMM spec (e.g. Hash algorithm for measurement,
>>      Personalisation value). These are programmed via
>>      KVM_CAP_ARM_RME_CONFIG_REALM.
>>
>> For the IPA size there is the possibility that the RMM supports a
>> different size to the IPA size supported by KVM for normal guests. At
>> the moment the 'normal limit' is exposed by KVM_CAP_ARM_VM_IPA_SIZE and
>> the IPA size is configured by the bottom bits of vm_type in
>> KVM_CREATE_VM. This means that it isn't easy for the VMM to discover
>> what IPA sizes are supported for Realm guests. Since the IPA is part of
>> the measurement of the realm guest the current expectation is that the
>> VMM will be required to pick the IPA size demanded by attestation and
>> therefore simply failing if this isn't available is fine. An option
>> would be to expose a new capability ioctl to obtain the RMM's maximum
>> IPA size if this is needed in the future.
>>
>> Co-developed-by: Suzuki K Poulose <suzuki.poulose@xxxxxxx>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@xxxxxxx>
>> Signed-off-by: Steven Price <steven.price@xxxxxxx>
>> Reviewed-by: Gavin Shan <gshan@xxxxxxxxxx>
>> ---
>> Changes since v7:
>>   * Minor code cleanup following Gavin's review.
>> Changes since v6:
>>   * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>>     host page size to be larger than 4k while still communicating with an
>>     RMM which uses 4k granules.
>> Changes since v5:
>>   * Introduce free_delegated_granule() to replace many
>>     undelegate/free_page() instances and centralise the comment on
>>     leaking when the undelegate fails.
>>   * Several other minor improvements suggested by reviews - thanks for
>>     the feedback!
>> Changes since v2:
>>   * Improved commit description.
>>   * Improved return failures for rmi_check_version().
>>   * Clear contents of PGD after it has been undelegated in case the RMM
>>     left stale data.
>>   * Minor changes to reflect changes in previous patches.
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h |   5 +
>>   arch/arm64/include/asm/kvm_rme.h     |  19 ++
>>   arch/arm64/kvm/arm.c                 |  16 ++
>>   arch/arm64/kvm/mmu.c                 |  22 +-
>>   arch/arm64/kvm/rme.c                 | 319 +++++++++++++++++++++++++++
>>   5 files changed, 379 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 1c43a4fc25dd..4ee6c215da82 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -699,6 +699,11 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>>       return READ_ONCE(kvm->arch.realm.state);
>>   }
>>   +static inline bool kvm_realm_is_created(struct kvm *kvm)
>> +{
>> +    return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>> +}
>> +
>>   static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
>>   {
>>       return false;
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index 9c8a0b23e0e4..5dc1915de891 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -6,6 +6,8 @@
>>   #ifndef __ASM_KVM_RME_H
>>   #define __ASM_KVM_RME_H
>>   +#include <uapi/linux/kvm.h>
>> +
>>   /**
>>    * enum realm_state - State of a Realm
>>    */
>> @@ -46,11 +48,28 @@ enum realm_state {
>>    * struct realm - Additional per VM data for a Realm
>>    *
>>    * @state: The lifetime state machine for the realm
>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>> + * @params: Parameters for the RMI_REALM_CREATE command
>> + * @num_aux: The number of auxiliary pages required by the RMM
>> + * @vmid: VMID to be used by the RMM for the realm
>> + * @ia_bits: Number of valid Input Address bits in the IPA
>>    */
>>   struct realm {
>>       enum realm_state state;
>> +
>> +    void *rd;
>> +    struct realm_params *params;
>> +
>> +    unsigned long num_aux;
>> +    unsigned int vmid;
>> +    unsigned int ia_bits;
>>   };
>>     void kvm_init_rme(void);
>> +u32 kvm_realm_ipa_limit(void);
>> +
>> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
>> +int kvm_init_realm_vm(struct kvm *kvm);
>> +void kvm_destroy_realm(struct kvm *kvm);
>>     #endif /* __ASM_KVM_RME_H */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 856a721d41ac..0e8482fdc4d3 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -136,6 +136,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>           }
>>           mutex_unlock(&kvm->lock);
>>           break;
>> +    case KVM_CAP_ARM_RME:
>> +        mutex_lock(&kvm->lock);
>> +        r = kvm_realm_enable_cap(kvm, cap);
>> +        mutex_unlock(&kvm->lock);
>> +        break;
>>       default:
>>           break;
>>       }
>> @@ -198,6 +203,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>         bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>>   +    /* Initialise the realm bits after the generic bits are enabled */
>> +    if (kvm_is_realm(kvm)) {
>> +        ret = kvm_init_realm_vm(kvm);
>> +        if (ret)
>> +            goto err_free_cpumask;
>> +    }
>> +
>>       return 0;
>>     err_free_cpumask:
>> @@ -257,6 +269,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>       kvm_unshare_hyp(kvm, kvm + 1);
>>         kvm_arm_teardown_hypercalls(kvm);
>> +    kvm_destroy_realm(kvm);
>>   }
>>     static bool kvm_has_full_ptr_auth(void)
>> @@ -405,6 +418,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>       case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
>>           r = BIT(0);
>>           break;
>> +    case KVM_CAP_ARM_RME:
>> +        r = static_key_enabled(&kvm_rme_is_available);
>> +        break;
>>       default:
>>           r = 0;
>>       }
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 2feb6c6b63af..5957a07de86d 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -876,12 +876,16 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>       .icache_inval_pou    = invalidate_icache_guest_page,
>>   };
>>   -static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
>> +static int kvm_init_ipa_range(struct kvm *kvm,
>> +                  struct kvm_s2_mmu *mmu, unsigned long type)
>>   {
>>       u32 kvm_ipa_limit = get_kvm_ipa_limit();
>>       u64 mmfr0, mmfr1;
>>       u32 phys_shift;
>>   +    if (kvm_is_realm(kvm))
>> +        kvm_ipa_limit = kvm_realm_ipa_limit();
>> +
>>       if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>>           return -EINVAL;
>>   @@ -946,7 +950,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>           return -EINVAL;
>>       }
>>   -    err = kvm_init_ipa_range(mmu, type);
>> +    err = kvm_init_ipa_range(kvm, mmu, type);
>>       if (err)
>>           return err;
>>   @@ -1072,6 +1076,20 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>       struct kvm_pgtable *pgt = NULL;
>>         write_lock(&kvm->mmu_lock);
>> +    if (kvm_is_realm(kvm) &&
>> +        (kvm_realm_state(kvm) != REALM_STATE_DEAD &&
>> +         kvm_realm_state(kvm) != REALM_STATE_NONE)) {
>> +        /* Tearing down RTTs will be added in a later patch */
>> +        write_unlock(&kvm->mmu_lock);
>> +
>> +        /*
>> +         * The physical PGD pages are delegated to the RMM, so cannot
>> +         * be freed at this point. This function will be called again
>> +         * from kvm_destroy_realm() after the physical pages have been
>> +         * returned at which point the memory can be freed.
>> +         */
>> +        return;
>> +    }
>>       pgt = mmu->pgt;
>>       if (pgt) {
>>           mmu->pgd_phys = 0;
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index 67cf2d94cb2d..dbb6521fe380 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -5,9 +5,23 @@
>>     #include <linux/kvm_host.h>
>>   +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_mmu.h>
>>   #include <asm/rmi_cmds.h>
>>   #include <asm/virt.h>
>>   +#include <asm/kvm_pgtable.h>
>> +
>> +static unsigned long rmm_feat_reg0;
>> +
>> +#define RMM_PAGE_SHIFT        12
>> +#define RMM_PAGE_SIZE        BIT(RMM_PAGE_SHIFT)
>> +
>> +static bool rme_has_feature(unsigned long feature)
>> +{
>> +    return !!u64_get_bits(rmm_feat_reg0, feature);
>> +}
>> +
>>   static int rmi_check_version(void)
>>   {
>>       struct arm_smccc_res res;
>> @@ -42,6 +56,305 @@ static int rmi_check_version(void)
>>       return 0;
>>   }
>>   +u32 kvm_realm_ipa_limit(void)
>> +{
>> +    return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>> +}
>> +
>> +static int get_start_level(struct realm *realm)
>> +{
>> +    return 4 - ((realm->ia_bits - 8) / (RMM_PAGE_SHIFT - 3));
> 
> minor nit: It may be worth adding a comment here, how we got this magic
> number 8. We could say:
> 
> Open coded version of ARM64_HW_PGTABLE_LEVELS(ia_bits - 4) (accounting
> for the concatenation of up to 16 tables in entry level) for RMM Stage2.
> 

Agreed, a comment makes sense. This is actually an open-coded version of
"4 - stage2_pgtable_levels()" but using the RMM's page size (indeed
previous versions of the series did exactly that). There's already a
comment for stage2_pgtable_levels() explaining the concatenation.
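
To make the arithmetic concrete, here is a minimal user-space sketch (not
kernel code) that puts the open-coded expression next to the long-hand
rounding-up form it is derived from. The helper names
start_level_open_coded() and start_level_round_up() are made up for this
illustration; RMM_PAGE_SHIFT is 12 as in the patch.

```c
#define RMM_PAGE_SHIFT	12	/* the RMM uses 4k granules, as in the patch */

/* Same expression as get_start_level() in the patch. */
static int start_level_open_coded(unsigned int ia_bits)
{
	return 4 - ((ia_bits - 8) / (RMM_PAGE_SHIFT - 3));
}

/*
 * Hypothetical long-hand form, for comparison only. Concatenating up to
 * 16 tables at the entry level absorbs 4 IPA bits, the page offset
 * absorbs RMM_PAGE_SHIFT bits, and every level resolves
 * RMM_PAGE_SHIFT - 3 bits. Rounding up gives the number of levels.
 * Assumes ia_bits >= 16 so the unsigned subtraction cannot wrap.
 */
static int start_level_round_up(unsigned int ia_bits)
{
	unsigned int bits_per_level = RMM_PAGE_SHIFT - 3;
	unsigned int bits = ia_bits - 4 - RMM_PAGE_SHIFT;
	unsigned int levels = (bits + bits_per_level - 1) / bits_per_level;

	return 4 - levels;
}
```

With 4k granules the rounding term (+ 8) and the subtracted 16 bits
collapse into the "- 8" seen in the patch, which is why the two forms
agree.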

>> +}
>> +
>> +static void free_delegated_granule(phys_addr_t phys)
>> +{
>> +    if (WARN_ON(rmi_granule_undelegate(phys))) {
>> +        /* Undelegate failed: leak the page */
>> +        return;
>> +    }
>> +
>> +    kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>> +
>> +    free_page((unsigned long)phys_to_virt(phys));
>> +}
>> +
>> +/* Calculate the number of s2 root rtts needed */
>> +static int realm_num_root_rtts(struct realm *realm)
>> +{
>> +    unsigned int ipa_bits = realm->ia_bits;
>> +    unsigned int levels = 3 - get_start_level(realm);
> 
> nit: Why is this 3 - start_level and not 4 ? Though that is compensated
> by the "levels + 1" below, hence the calculation is correct.

I honestly can't remember - will change ;)
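
For illustration only, a user-space sketch of how the calculation might
read with "4 - start_level", which makes the "+ 1" disappear. This is an
assumption about the shape of the change, not the code that will be
posted; start_level() and num_root_rtts() are stand-in names.

```c
#define RMM_PAGE_SHIFT	12	/* the RMM uses 4k granules, as in the patch */

/* Same expression as get_start_level() in the patch. */
static int start_level(unsigned int ia_bits)
{
	return 4 - ((ia_bits - 8) / (RMM_PAGE_SHIFT - 3));
}

/* Number of concatenated root RTTs needed to cover ia_bits of IPA space. */
static int num_root_rtts(unsigned int ia_bits)
{
	/* Number of levels walked, counted from the start level. */
	unsigned int levels = 4 - start_level(ia_bits);
	/* IPA bits covered by a single table at the start level. */
	unsigned int sl_ipa_bits = levels * (RMM_PAGE_SHIFT - 3) +
				   RMM_PAGE_SHIFT;

	if (sl_ipa_bits >= ia_bits)
		return 1;

	/* Each extra IPA bit doubles the number of concatenated tables. */
	return 1 << (ia_bits - sl_ipa_bits);
}
```

For example, a 40-bit IPA starts at level 1, where one table covers 39
bits, so two root RTTs are needed; a 48-bit IPA starts at level 0 and
needs only one.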

>> +    unsigned int sl_ipa_bits = (levels + 1) * (RMM_PAGE_SHIFT - 3) +
>> +                   RMM_PAGE_SHIFT;
>> +
>> +    if (sl_ipa_bits >= ipa_bits)
>> +        return 1;
>> +
>> +    return 1 << (ipa_bits - sl_ipa_bits);
>> +}
>> +
>> +static int realm_create_rd(struct kvm *kvm)
>> +{
>> +    struct realm *realm = &kvm->arch.realm;
>> +    struct realm_params *params = realm->params;
>> +    void *rd = NULL;
>> +    phys_addr_t rd_phys, params_phys;
>> +    size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +    int i, r;
>> +    int rtt_num_start;
>> +
>> +    realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +    rtt_num_start = realm_num_root_rtts(realm);
>> +
>> +    if (WARN_ON(realm->rd || !realm->params))
>> +        return -EEXIST;
>> +
>> +    if (pgd_size / RMM_PAGE_SIZE < rtt_num_start)
>> +        return -EINVAL;
>> +
>> +    rd = (void *)__get_free_page(GFP_KERNEL);
>> +    if (!rd)
>> +        return -ENOMEM;
>> +
>> +    rd_phys = virt_to_phys(rd);
>> +    if (rmi_granule_delegate(rd_phys)) {
>> +        r = -ENXIO;
>> +        goto free_rd;
>> +    }
>> +
>> +    for (i = 0; i < pgd_size; i += RMM_PAGE_SIZE) {
>> +        phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i;
>> +
>> +        if (rmi_granule_delegate(pgd_phys)) {
>> +            r = -ENXIO;
>> +            goto out_undelegate_tables;
>> +        }
>> +    }
>> +
>> +    params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +    params->rtt_level_start = get_start_level(realm);
>> +    params->rtt_num_start = rtt_num_start;
>> +    params->rtt_base = kvm->arch.mmu.pgd_phys;
>> +    params->vmid = realm->vmid;
>> +
>> +    params_phys = virt_to_phys(params);
>> +
>> +    if (rmi_realm_create(rd_phys, params_phys)) {
>> +        r = -ENXIO;
>> +        goto out_undelegate_tables;
>> +    }
>> +
>> +    if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
>> +        WARN_ON(rmi_realm_destroy(rd_phys));
>> +        goto out_undelegate_tables;
>> +    }
>> +
>> +    realm->rd = rd;
>> +
>> +    return 0;
>> +
>> +out_undelegate_tables:
>> +    while (i > 0) {
>> +        i -= RMM_PAGE_SIZE;
>> +
>> +        phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i;
>> +
>> +        if (WARN_ON(rmi_granule_undelegate(pgd_phys))) {
>> +            /* Leak the pages if they cannot be returned */
>> +            kvm->arch.mmu.pgt = NULL;
>> +            break;
>> +        }
> 
> minor nit: Do we need to try undelegating the other pages ? We could
> make that WARN_ON_ONCE() too.

We could do, but there's no real point unless we also deal with leaking
only the pages which cannot be undelegated. Considering this is a
"should never happen" case I felt it was better to keep the code simple
and leak more than necessary. To me this feels like a "save everything
and reboot as soon as possible" sort of error condition - the kernel (or
RMM) has lost track of which pages are delegated and so a GPT fault
could be coming at any point. So it's better to play safe and be
pessimistic.
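
As a rough user-space model of that policy (a sketch only, with
mock_undelegate() standing in for rmi_granule_undelegate() and
failing_phys being a made-up knob to simulate a failure), the error path
stops at the first failure and reports that the whole table must be
leaked:

```c
#include <stdbool.h>
#include <stddef.h>

#define RMM_PAGE_SIZE	4096UL

/* Hypothetical test knob: the one address for which undelegate fails. */
static unsigned long failing_phys = ~0UL;

/* Stand-in for rmi_granule_undelegate(); returns non-zero on failure. */
static int mock_undelegate(unsigned long phys)
{
	return phys == failing_phys ? -1 : 0;
}

/*
 * Walk backwards over the 'delegated' bytes of PGD that were handed to
 * the RMM. Returns true if every granule came back, false if the caller
 * must leak the whole table (no attempt is made on the remainder).
 */
static bool undelegate_pgd(unsigned long pgd_phys, size_t delegated)
{
	size_t i = delegated;

	while (i > 0) {
		i -= RMM_PAGE_SIZE;
		if (mock_undelegate(pgd_phys + i))
			return false;	/* be pessimistic: leak everything */
	}

	return true;
}
```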

Thanks,
Steve