Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
From: Chao Peng
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
Date: Mon, 7 Mar 2022 21:26:02 +0800
User-agent: Mutt/1.9.4 (2018-02-28)
On Fri, Mar 04, 2022 at 11:24:30AM -0800, Andy Lutomirski wrote:
> On 2/23/22 04:05, Steven Price wrote:
> > On 23/02/2022 11:49, Chao Peng wrote:
> > > On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
> > > > On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> > > > > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> > > > > > On 1/18/22 05:21, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > > >
> > > > > > > Introduce a new seal, F_SEAL_INACCESSIBLE, indicating that the
> > > > > > > content of the file is inaccessible from userspace through
> > > > > > > ordinary MMU access (e.g., read/write/mmap). However, the file
> > > > > > > content can be accessed indirectly via a different mechanism
> > > > > > > (e.g. the KVM MMU).
> > > > > > >
> > > > > > > It provides the semantics required for KVM guest private memory
> > > > > > > support: a file descriptor with this seal set is going to be
> > > > > > > used as the source of guest memory in confidential computing
> > > > > > > environments such as Intel TDX/AMD SEV, but may not be
> > > > > > > accessible from host userspace.
> > > > > > >
> > > > > > > At this time only shmem implements this seal.
> > > > > > >
> > > > > >
> > > > > > I don't dislike this *that* much, but I do dislike this.
> > > > > > F_SEAL_INACCESSIBLE essentially transmutes a memfd into a
> > > > > > different type of object. While this can apparently be done
> > > > > > successfully and without races (as in this code), it's at least
> > > > > > awkward. I think that either creating a special inaccessible
> > > > > > memfd should be a single operation that creates the correct type
> > > > > > of object, or there should be a clear justification for why it's
> > > > > > a two-step process.
> > > > >
> > > > > Now one justification may come from Steven's comment on patch-00:
> > > > > for ARM usage one could create a normal memfd, (partially)
> > > > > populate it with initial guest memory content (e.g. firmware),
> > > > > and then apply F_SEAL_INACCESSIBLE just before the first launch
> > > > > of the guest in KVM (the current code would definitely need
> > > > > changes to support that).
> > > >
> > > > Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?
> > > > So this won't work.
> > >
> > > Hmm, right. If we set F_SEAL_INACCESSIBLE on a non-empty file, we
> > > would need to make sure access through existing mmap-ed areas is
> > > prevented, but that is hard.
> > >
> > > >
> > > > In any case, the whole confidential VM initialization story is a
> > > > bit muddy. From the earlier emails, it sounds like ARM expects the
> > > > host to fill in guest memory and measure it. From my recollection
> > > > of Intel's scheme (which may well be wrong, and I could easily be
> > > > confusing it with SGX), TDX instead measures what is essentially a
> > > > transcript of the series of operations that initializes the VM.
> > > > These are fundamentally not the same thing even if they accomplish
> > > > the same end goal. For TDX, we unavoidably need an operation (ioctl
> > > > or similar) that initializes things according to the VM's
> > > > instructions, and ARM ought to be able to use roughly the same
> > > > mechanism.
> > >
> > > Yes, TDX requires an ioctl. Steven may comment on the ARM part.
> >
> > The Arm story is evolving so I can't give a definite answer yet. Our
> > current prototyping works by creating the initial VM content in a
> > memslot as with a normal VM and then calling an ioctl which throws the
> > big switch and converts all the (populated) pages to be protected. At
> > this point the RMM performs a measurement of the data that the VM is
> > being populated with.
> >
> > The above (in our prototype) suffers from all the expected problems with
> > a malicious VMM being able to trick the host kernel into accessing those
> > pages after they have been protected (causing a fault detected by the
> > hardware).
> >
> > The ideal (from our perspective) approach would be to follow the same
> > flow but where the VMM populates a memfd rather than normal anonymous
> > pages. The memfd could then be sealed and the pages converted to
> > protected ones (with the RMM measuring them in the process).
> >
> > The question becomes how is that memfd populated? It would be nice if
> > that could be done using normal operations on a memfd (i.e. using
> > mmap()) and therefore this code could be (relatively) portable. This
> > would mean that any pages mapped from the memfd would either need to
> > block the sealing or be revoked at the time of sealing.
> >
> > The other approach would of course be to implement a special ioctl
> > which effectively does a memcpy into the (created empty and sealed)
> > memfd and does the necessary dance with the RMM to measure the
> > contents. This would match the "transcript of the series of
> > operations" described above - but it seems much less ideal from the
> > viewpoint of the VMM.
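The first option Steven describes (populate via mmap(), then seal) implies that outstanding mappings must either block the sealing or be revoked at seal time. A toy Rust model of the "block the sealing" variant, with all names hypothetical and no relation to the actual shmem implementation:

```rust
// Toy model: a live userspace mapping blocks the inaccessible seal,
// so the VMM must populate via mmap(), drop the mapping, then seal.
#[derive(Debug, PartialEq)]
enum SealError {
    Busy, // mirrors the kernel's EBUSY convention for sealing with mappings
}

struct Memfd {
    content: Vec<u8>,
    mappings: usize,
    inaccessible: bool,
}

impl Memfd {
    fn new(size: usize) -> Self {
        Memfd { content: vec![0; size], mappings: 0, inaccessible: false }
    }
    fn mmap(&mut self) { self.mappings += 1; }
    fn munmap(&mut self) { self.mappings -= 1; }
    fn seal_inaccessible(&mut self) -> Result<(), SealError> {
        if self.mappings > 0 {
            return Err(SealError::Busy); // refuse to seal while mapped
        }
        self.inaccessible = true;
        Ok(())
    }
}

fn main() {
    let mut fd = Memfd::new(4096);
    fd.mmap();               // VMM maps and writes the firmware image
    fd.content[0] = 0xEB;
    assert_eq!(fd.seal_inaccessible(), Err(SealError::Busy));
    fd.munmap();             // mapping dropped before sealing
    assert!(fd.seal_inaccessible().is_ok());
    println!("sealed after populate: {}", fd.inaccessible);
}
```

The "revoke at sealing time" variant would instead tear down the mappings inside seal_inaccessible() rather than failing, which is the harder part Chao alludes to above.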
>
> A VMM that supports Other Vendors will need to understand this sort of model
> regardless.
>
> I don't particularly mind the idea of having the kernel consume a normal
> memfd and spit out a new object, but I find the concept of changing the type
> of the object in place, even if it has other references, and trying to
> control all the resulting races to be somewhat alarming.
>
> In pseudo-Rust, this is the difference between:
>
> fn convert_to_private(in: &mut Memfd)
>
> and
>
> fn convert_to_private(in: Memfd) -> PrivateMemoryFd
>
> This doesn't map particularly nicely to the kernel, though.
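The pseudo-Rust above can be made concrete; all type and function names here are hypothetical illustrations, not kernel code. The by-value form consumes the Memfd, so after conversion no code can still hold the old handle and there is nothing left to race against:

```rust
// Sketch of the second signature above: conversion by ownership transfer.
struct Memfd { pages: Vec<u8> }
struct PrivateMemoryFd { pages: Vec<u8> }

// Consumes the Memfd; the old object ceases to exist at the type level.
fn convert_to_private(fd: Memfd) -> PrivateMemoryFd {
    PrivateMemoryFd { pages: fd.pages }
}

fn main() {
    let fd = Memfd { pages: vec![0u8; 4096] };
    let private = convert_to_private(fd);
    // Any further use of `fd` is a compile error, not a runtime race:
    // let _ = fd.pages; // error[E0382]: use of moved value: `fd`
    assert_eq!(private.pages.len(), 4096);
    println!("converted to private, {} bytes", private.pages.len());
}
```

In the kernel this guarantee cannot be enforced statically, since other struct file references to the memfd may be held elsewhere, which is exactly why the in-place transmutation is awkward.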
I understand the Rust semantics and the difficulty of handling the races.
Probably we should not expose F_SEAL_INACCESSIBLE to userspace; instead
we can use a new in-kernel flag to indicate the same thing. That flag
should be set only when the memfd is created with MFD_INACCESSIBLE.
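A toy model of that creation-time approach (the flag value and all names are hypothetical): the inaccessible property is decided once in memfd_create() and there is no later transition for userspace to race with.

```rust
// Illustrative flag value only; not the real ABI.
const MFD_INACCESSIBLE: u32 = 1 << 3;

struct Memfd {
    inaccessible: bool, // in-kernel flag, never exposed as a seal
}

// The property is fixed at creation and cannot be toggled afterwards.
fn memfd_create(flags: u32) -> Memfd {
    Memfd { inaccessible: flags & MFD_INACCESSIBLE != 0 }
}

// Ordinary userspace access is refused on an inaccessible memfd.
fn read(fd: &Memfd) -> Result<u8, &'static str> {
    if fd.inaccessible {
        return Err("EPERM: no ordinary userspace access");
    }
    Ok(0)
}

fn main() {
    let normal = memfd_create(0);
    let private = memfd_create(MFD_INACCESSIBLE);
    assert!(read(&normal).is_ok());
    assert!(read(&private).is_err());
    println!("private memfd blocks ordinary access");
}
```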
Chao
>
> --Andy