Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs


From: Greg Kurz
Subject: Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs
Date: Wed, 24 Jan 2018 16:09:47 +0100

On Mon, 22 Jan 2018 13:40:00 +0100
Eduard Shishkin <address@hidden> wrote:

> On 1/19/2018 5:37 PM, Veaceslav Falico wrote:
> > On 1/19/2018 4:52 PM, Eduard Shishkin wrote:  
> >>
> >>
> >> On 1/19/2018 11:27 AM, Greg Kurz wrote:  
> >>> On Mon, 15 Jan 2018 11:49:31 +0800
> >>> Antonios Motakis <address@hidden> wrote:
> >>>  
> >>>> On 13-Jan-18 00:14, Greg Kurz wrote:  
> >>>>> On Fri, 12 Jan 2018 19:32:10 +0800
> >>>>> Antonios Motakis <address@hidden> wrote:
> >>>>>  
> >>>>>> Hello all,
> >>>>>>  
> >>>>>
> >>>>> Hi Antonios,
> >>>>>
> >>>>> I see you have attached a patch to this email... this really isn't the
> >>>>> preferred way to do things, since it prevents commenting on the patch
> >>>>> (at least with my mail client). The appropriate way would have been to
> >>>>> send the patch with a cover letter, using git-send-email for example.  
> >>>>
> >>>> I apologize for attaching the patch, I should have known better!
> >>>>  
> >>>
> >>> np :)
> >>>  
> >>>>>  
> >>>>>> We have found an issue in the 9p implementation of QEMU with how qid 
> >>>>>> paths are generated, which can cause qid path collisions and several 
> >>>>>> problems resulting from them. In our use case (running containers under 
> >>>>>> VMs) these have proven to be critical.
> >>>>>>  
> >>>>>
> >>>>> Ouch...
> >>>>>  
> >>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using 
> >>>>>> the inode number of the file as input. According to the 9p spec the 
> >>>>>> path should be able to uniquely identify a file, distinct files should 
> >>>>>> not share a path value.
> >>>>>>
> >>>>>> The current implementation, which defines qid.path = inode nr, works fine 
> >>>>>> as long as there are no files from multiple partitions visible under 
> >>>>>> the 9p share. Once there are, distinct files from different devices are 
> >>>>>> allowed to have the same inode number, so with multiple partitions we 
> >>>>>> have a very high probability of qid path collisions.
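> >>>>>>
> >>>>>> In other words, the mapping is conceptually just this (a simplified 
> >>>>>> sketch of the idea, not the exact code in hw/9pfs/9p.c):
> >>>>>>
> >>>>>> /* assumes <stdint.h> and <sys/stat.h> */
> >>>>>> static uint64_t qid_path_from_stat(const struct stat *stbuf)
> >>>>>> {
> >>>>>>     /* qid.path is derived from st_ino alone; st_dev is ignored, so
> >>>>>>      * files on different devices with the same inode number end up
> >>>>>>      * with the same qid path. */
> >>>>>>     return (uint64_t)stbuf->st_ino;
> >>>>>> }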
> >>>>>>
> >>>>>> How to demonstrate the issue:
> >>>>>> 1) Prepare a problematic share:
> >>>>>>  - mount one partition under share/p1/ with some files inside
> >>>>>>  - mount another one *with identical contents* under share/p2/
> >>>>>>  - confirm that both partitions have files with the same inode nr, 
> >>>>>> size, etc.
> >>>>>> 2) Demonstrate breakage:
> >>>>>>  - start a VM with a virtio-9p pointing to the share
> >>>>>>  - mount 9p share with FSCACHE on
> >>>>>>  - keep open share/p1/file
> >>>>>>  - open and write to share/p2/file
> >>>>>>
> >>>>>> What happens is that the guest considers share/p1/file and 
> >>>>>> share/p2/file to be the same file, and since we are using the cache it 
> >>>>>> will not reopen it. We intended to write to partition 2, but we just 
> >>>>>> wrote to partition 1. This is just one example of how the guest may 
> >>>>>> rely on qid paths being unique.
> >>>>>>
> >>>>>> In the use case of containers, where we commonly have a few containers 
> >>>>>> per VM, all based on similar images, these kinds of qid path collisions 
> >>>>>> are very common and they seem to cause all kinds of funny behavior 
> >>>>>> (sometimes very subtle).
> >>>>>>
> >>>>>> To avoid this situation, the device id of a file also needs to be 
> >>>>>> taken as input for generating a qid path. Unfortunately, inode nr + 
> >>>>>> device id together add up to 96 bits, while we have 
> >>>>>> only 64 bits for the qid path, so we can't just append them and call 
> >>>>>> it a day :(
> >>>>>>
> >>>>>> We have thought of a few approaches, but we would definitely like to 
> >>>>>> hear what the upstream maintainers and community think:
> >>>>>>
> >>>>>> * Full fix: Change the 9p protocol
> >>>>>>
> >>>>>> We would need to support a longer qid path, based on a virtio feature 
> >>>>>> flag. This would require reworking the host and guest parts of virtio-9p, 
> >>>>>> i.e. both QEMU and Linux for most users.
> >>>>>>  
> >>>>>
> >>>>> I agree on a longer qid path, but we shouldn't tie it to a virtio flag 
> >>>>> since
> >>>>> 9p is transport agnostic and happens to be used with a variety of 
> >>>>> transports.
> >>>>> QEMU has both virtio-9p and a Xen backend, for example.
> >>>>>  
> >>>>>> * Fallback and/or interim solutions
> >>>>>>
> >>>>>> A virtio feature flag may be refused by the guest, so we think we 
> >>>>>> still need to make collisions less likely even with 64 bit paths. E.g. 
> >>>>>>  
> >>>>>
> >>>>> In all cases, we would need a fallback solution to support current
> >>>>> guest setups. Also there are several 9p server implementations out
> >>>>> there (ganesha, diod, kvmtool) that are currently used with linux
> >>>>> clients... it will take some time to get everyone in sync :-\
> >>>>>  
> >>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach 
> >>>>>> a proof of concept patch)  
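> >>>>>>
> >>>>>> Roughly something like the following minimal sketch (the exact folding 
> >>>>>> in the attached patch may differ):
> >>>>>>
> >>>>>> /* assumes <stdint.h> and <sys/stat.h> */
> >>>>>> static uint64_t qid_path_xor(const struct stat *stbuf)
> >>>>>> {
> >>>>>>     /* Fold the device id into the inode number; collisions become
> >>>>>>      * less likely but are still possible. */
> >>>>>>     return (uint64_t)stbuf->st_ino ^ ((uint64_t)stbuf->st_dev << 32);
> >>>>>> }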
> >>>>>
> >>>>> Hmm... this would still allow collisions. Not good for fallback.
> >>>>>  
> >>>>>> 2. 64 bit hash of device id and inode nr  
> >>>>>
> >>>>> Same here.
> >>>>>  
> >>>>>> 3. other ideas, such as allocating new qid paths and keeping a lookup 
> >>>>>> table of some sort (possibly too expensive)
> >>>>>>  
> >>>>>
> >>>>> This would be acceptable for fallback.  
> >>>>
> >>>> Maybe we can use the QEMU hash table 
> >>>> (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it 
> >>>> scales to millions of entries. Do you think it is appropriate in this 
> >>>> case?
> >>>>  
> >>>
> >>> I don't know QHT, hence Cc'ing Emilio for insights.
> >>>  
> >>>> I was thinking on how to implement something like this, without having 
> >>>> to maintain millions of entries... One option we could consider is to 
> >>>> split the bits into a directly-mapped part, and a lookup part. For 
> >>>> example:
> >>>>
> >>>> Inode =
> >>>> [ first 10 bits ] + [ lowest 54 bits ]
> >>>>
> >>>> A hash table maps [ first 10 bits of inode ] + [ device id ] => [ 10-bit 
> >>>> prefix ]
> >>>> The prefix is uniquely allocated for each input.
> >>>>
> >>>> Qid path =
> >>>> [ 10-bit prefix ] + [ lowest 54 bits of inode ]
> >>>>
> >>>> Since inodes are not completely random, and we usually have a handful of 
> >>>> device IDs, we get a much smaller number of entries to track in the hash 
> >>>> table.
> >>>>
> >>>> So what this would give:
> >>>> (1)    Would be faster and take less memory than mapping the full 
> >>>> inode_nr, dev_id tuple to unique QID paths
> >>>> (2)    Guaranteed not to run out of bits when inode numbers stay within 
> >>>> the lowest 54 bits and we have fewer than 1024 devices.
> >>>> (3)    When we get beyond this limit, there is a chance we run out 
> >>>> of bits to allocate new QID paths, but we can detect this and refuse to 
> >>>> serve the offending files instead of allowing a collision.
> >>>>
> >>>> We could tweak the prefix size to match the scenarios that we consider 
> >>>> more likely, but I think close to 10-16 bits sounds reasonable enough. 
> >>>> What do you think?
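> >>>>
> >>>> In code the idea would look roughly like this (an untested sketch for 
> >>>> illustration using a glib hash table; qid_path_prefixmap, QppKey and 
> >>>> qpp_table are made-up names, and the table would need hash/equal 
> >>>> callbacks that cover the whole key, including struct padding):
> >>>>
> >>>> #include <errno.h>
> >>>> #include <stdint.h>
> >>>> #include <sys/stat.h>
> >>>> #include <glib.h>
> >>>>
> >>>> typedef struct QppKey {
> >>>>     dev_t dev;
> >>>>     uint16_t ino_prefix;        /* top 10 bits of the inode number */
> >>>> } QppKey;
> >>>>
> >>>> static GHashTable *qpp_table;   /* QppKey -> allocated 10-bit prefix */
> >>>> static uint32_t qpp_next_prefix;
> >>>>
> >>>> static int qid_path_prefixmap(const struct stat *stbuf, uint64_t *path)
> >>>> {
> >>>>     QppKey key = {
> >>>>         .dev = stbuf->st_dev,
> >>>>         .ino_prefix = (uint16_t)(stbuf->st_ino >> 54),
> >>>>     };
> >>>>     gpointer val;
> >>>>
> >>>>     if (!g_hash_table_lookup_extended(qpp_table, &key, NULL, &val)) {
> >>>>         if (qpp_next_prefix >= (1 << 10)) {
> >>>>             /* Out of prefixes: refuse the file instead of colliding. */
> >>>>             return -ENFILE;
> >>>>         }
> >>>>         val = GUINT_TO_POINTER(qpp_next_prefix++);
> >>>>         g_hash_table_insert(qpp_table, g_memdup(&key, sizeof(key)), val);
> >>>>     }
> >>>>     *path = ((uint64_t)GPOINTER_TO_UINT(val) << 54) |
> >>>>             (stbuf->st_ino & ((1ULL << 54) - 1));
> >>>>     return 0;
> >>>> }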
> >>>>  
> >>>
> >>> I think that should work. Please send a patchset :)  
> >>
> >> Hi Greg,
> >>
> >> This is based on the assumption that the high bits of inode numbers are
> >> always zero, which is unacceptable from my standpoint. The inode number
> >> allocator is a per-volume file system feature, so there may be
> >> allocators which don't use a simple increment and which assign inode
> >> numbers with non-zero high bits even for the first file of a volume.
> >>
> >> As I understand it, the problem is that the guest doesn't have enough
> >> information to properly identify files (in our case share/p1/file
> >> and share/p2/file are wrongly identified as the same file). It seems
> >> that we need to transmit the device ID to the guest, and not in the high 
> >> bits of the inode number.
> >>
> >> AFAIK Slava has patches for such a transmission; I hope that he will
> >> send them eventually.  
> >
> > They're pretty trivial; however, it'll require great effort and coordination
> > from all parties involved, as they break backwards compatibility (the QID
> > becomes bigger and thus breaks "Q" transmission).
> >
> > I'd like to raise another issue: it seems that st_dev and st_ino aren't
> > enough:
> >
> > mkdir -p /tmp/mounted
> > touch /tmp/mounted/file
> > mount -o bind /tmp/mounted t1
> > mount -o bind /tmp/mounted t2
> > stat t{1,2}/file  | grep Inode
> > Device: 805h/2053d      Inode: 42729487    Links: 3
> > Device: 805h/2053d      Inode: 42729487    Links: 3
> >
> > In that case, even with the patch applied, we'll still have the issue of
> > colliding QIDs guest-side - so, for example, after umounting t1, t2/file
> > becomes inaccessible, as the inode points to t1...
> >
> > I'm not sure how to fix this one. Ideas?  
> 
> t1/file and t2/file have different mount IDs,
> which can be retrieved by name_to_handle_at(2).
> So the idea is to also transmit that mount ID along with st_dev.
> 

As for transmitting st_dev, this calls for a spec change and might take
some time... :-\

And anyway, we really want to have a fallback in QEMU. Maybe you can try
something like Emilio suggested in another mail?

The first step would be to allow fsdev backends to return the mount id
along with the struct stat:
- add a uint32_t *mnt_id arg to the .fstat and .stat ops in file-op-9p.h
- adapt the 9p-local backend to fill out *mnt_id with name_to_handle_at()

If this works, then the same thing should be done in 9p-proxy/virtfs-proxy-helper.
Other backends can be left behind.
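
For the path-based case, that could look like this (an untested sketch;
get_mnt_id is a made-up helper name, and the fd-based .fstat op would need a
different approach, e.g. going through /proc/self/fd):

#define _GNU_SOURCE
#include <fcntl.h>          /* name_to_handle_at(), struct file_handle */
#include <stdint.h>
#include <stdlib.h>

#define HANDLE_BUF_SZ 128   /* matches the kernel's MAX_HANDLE_SZ */

static int get_mnt_id(const char *path, uint32_t *mnt_id)
{
    struct file_handle *fh = malloc(sizeof(*fh) + HANDLE_BUF_SZ);
    int mount_id, ret;

    if (!fh) {
        return -1;
    }
    fh->handle_bytes = HANDLE_BUF_SZ;
    /* We only care about the mount id, not the handle itself. */
    ret = name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0);
    if (ret == 0) {
        *mnt_id = (uint32_t)mount_id;
    }
    free(fh);
    return ret;
}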

Cheers,

--
Greg

> Thanks,
> Eduard.
> 
> >  
> >>
> >> Thanks,
> >> Eduard.
> >>  
> >>>  
> >>>>>  
> >>>>>> With our proof of concept patch, the issues caused by qid path 
> >>>>>> collisions go away, so it can be seen as an interim solution of sorts. 
> >>>>>> However, the chance of collisions is not eliminated, we are just 
> >>>>>> replacing the current strategy, which is almost guaranteed to cause 
> >>>>>> collisions in certain use cases, with one that makes them more rare. 
> >>>>>> We think that a virtio feature flag for longer qid paths is the only 
> >>>>>> way to eliminate these issues completely.
> >>>>>>
> >>>>>> This is the extent that we were able to analyze the issue from our 
> >>>>>> side, we would like to hear if there are some better ideas, or which 
> >>>>>> approach would be more appropriate for upstream.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>  
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> --
> >>>>> Greg
> >>>>>  
> >>>>  
> >>>  
> >>
> >> .
> >>  
> >
> >  
> 



