qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v4 5/5] 9p: Use variable length suffixes for ino


From: Christian Schoenebeck
Subject: Re: [Qemu-devel] [PATCH v4 5/5] 9p: Use variable length suffixes for inode remapping
Date: Fri, 28 Jun 2019 16:56:15 +0200

On Freitag, 28. Juni 2019 13:50:15 CEST Greg Kurz wrote:
> > And with k=5 they would look like:
> > 
> > Index Dec/Bin -> Generated Suffix Bin
> > 1 [1] -> [000001] (6 bits)
> > 2 [10] -> [100001] (6 bits)
> > 3 [11] -> [010001] (6 bits)
> > 4 [100] -> [110001] (6 bits)
> > 5 [101] -> [001001] (6 bits)
> > 6 [110] -> [101001] (6 bits)
> > 7 [111] -> [011001] (6 bits)
> > 8 [1000] -> [111001] (6 bits)
> > 9 [1001] -> [000101] (6 bits)
> > 10 [1010] -> [100101] (6 bits)
> > 11 [1011] -> [010101] (6 bits)
> > 12 [1100] -> [110101] (6 bits)
> > ...
> > 65533 [1111111111111101] -> [0011100000000000100000000000] (28 bits)
> > 65534 [1111111111111110] -> [1011100000000000100000000000] (28 bits)
> > 65535 [1111111111111111] -> [0111100000000000100000000000] (28 bits)
> > Hence minBits=6 maxBits=28
> 
> IIUC, this k control parameter should be constant for the
> lifetime of QIDs. So it all boils down to choose a _good_
> value that would cover most scenarios, right ?

That's correct. In the end it's just a matter of how many devices do you 
expect to be exposed with the same 9p export for choosing an appropriate value 
for k. That parameter k must be constant at least until guest is rebooted, 
otherwise you would end up with completely different inode numbers if you 
would change that parameter k while guest is still running.

Like I mentioned before, I can send a short C file if you want to play around 
with that parameter k to see how the generated suffixes would look like. The 
console output is actually like shown above. Additionally the C demo also 
checks and proofs that all generated suffixes really generate unique numbers 
for 
the entire possible 64 bit range, so that should take away the scare about 
what this algorithm does.

Since you said before that you find it already exotic to have more than 1 
device being exported by 9p, I hence did not even bother to make that 
parameter "k" a runtime parameter. I *think* k=0 should be fine in practice.

> For now, I just have some _cosmetic_ remarks.
> 
> > Signed-off-by: Christian Schoenebeck <address@hidden>
> > ---
> > 
> >  hw/9pfs/9p.c | 267
> >  ++++++++++++++++++++++++++++++++++++++++++++++++++++++----- hw/9pfs/9p.h
> >  |  67 ++++++++++++++-
> >  2 files changed, 312 insertions(+), 22 deletions(-)
> > 
> > diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> > index e6e410972f..46c9f11384 100644
> > --- a/hw/9pfs/9p.c
> > +++ b/hw/9pfs/9p.c
> > @@ -26,6 +26,7 @@
> > 
> >  #include "migration/blocker.h"
> >  #include "sysemu/qtest.h"
> >  #include "qemu/xxhash.h"
> > 
> > +#include <math.h>
> > 
> >  int open_fd_hw;
> >  int total_open_fd;
> > 
> > @@ -572,6 +573,123 @@ static void coroutine_fn virtfs_reset(V9fsPDU *pdu)
> > 
> >                                  P9_STAT_MODE_NAMED_PIPE |   \
> >                                  P9_STAT_MODE_SOCKET)
> > 
> > +#if P9_VARI_LENGTH_INODE_SUFFIXES
> 
> The numerous locations guarded by P9_VARI_LENGTH_INODE_SUFFIXES
> really obfuscate the code, and don't ease review (for me at least).
> And anyway, if we go for variable length suffixes, we won't keep
> the fixed length prefix code.

I just did that to provide a direct comparison between the fixed size prefix 
vs. 
variable size suffix code, since the fixed size prefix code is easier to 
understand.

If you want I can add a 6th "cleanup" patch that would remove the fixed size 
prefix code and all the #ifs ?

> > +
> > +/* Mirrors all bits of a byte. So e.g. binary 10100000 would become
> > 00000101. */ +static inline uint8_t mirror8bit(uint8_t byte) {
> 
> From CODING_STYLE:
> 
> 4. Block structure
> 
> [...]
> 
> for reasons of tradition and clarity it comes on a line by itself:
> 
>     void a_function(void)
>     {
>         do_something();
>     }

Got it.

> > +/* Parameter k for the Exponential Golomb algorihm to be used.
> > + *
> > + * The smaller this value, the smaller the minimum bit count for the Exp.
> > + * Golomb generated affixes will be (at lowest index) however for the
> > + * price of having higher maximum bit count of generated affixes (at
> > highest + * index). Likewise increasing this parameter yields in smaller
> > maximum bit + * count for the price of having higher minimum bit count.
> 
> Forgive my laziness but what are the benefits of a smaller or larger
> value, in term of user experience ?

Well, the goal in general is too keep the generated suffix numbers (or their 
bit 
count actually) as small as possible, because you are cutting away the same 
amount of bits of the orginal inode number from host. So the smaller the 
generated suffix number is, the more bits you can simply directly use from the 
original inode number from host, and hence the cheaper this remap code becomes 
in practice. Because if you have e.g. a suffix number of just 3 bits, then in 
practice you will very likely only have exactly 1 entry in that hash table 
*ever*. So hash lookup will be very cheap, etc.

If you had a suffix number of 32 bit instead that would mean you would need to 
cut away 32 bits from the original inode numbers, and hence you would likely 
end up with multiple entries in the hash table and the remap code becomes more 
expensive due to the hash table lookups, and even worse, you might even end up 
getting into that "full map" code which generates a hash entry for every next 
inode.

In practice this issue becomes probably more relevant when nested 
virtualization becomes more popular OR if you are using a large number of 
devices on the same export.

> > + */
> > +#define EXP_GOLOMB_K    0
> > +
> > +# if !EXP_GOLOMB_K
> > +
> > +/** @brief Exponential Golomb algorithm limited to the case k=0.
> > + *
> 
> This doesn't really help to have a special implementation for k=0. The
> resulting function is nearly identical to the general case. It is likely
> that the compiler can optimize it and generate the same code.

I don't think the compiler's optimizer would really drop all the instructions 
automatically of the generalized variant of that particular function. Does it 
matter in practice? No, I actually just provided that optimized variant for 
the special case k=0 because I think k=0 will be the default value for that 
parameter and because you were already picky about a simple

        if (pdu->s->dev_id == 0)

to be optimized. In practice users won't feel the runtime difference either 
one of the two optimization scenarios.

> > +/** @brief Exponential Golomb algorithm for arbitrary k (including k=0).
> > + *
> > + * The Exponential Golomb algorithm generates @b prefixes (@b not
> > suffixes!) + * with growing length and with the mathematical property of
> > being + * "prefix-free". The latter means the generated prefixes can be
> > prepended + * in front of arbitrary numbers and the resulting
> > concatenated numbers are + * guaranteed to be always unique.
> > + *
> > + * This is a minor adjustment to the original Exp. Golomb algorithm in
> > the
> > + * sense that lowest allowed index (@param n) starts with 1, not with
> > zero. + *
> > + * @param n - natural number (or index) of the prefix to be generated
> > + *            (1, 2, 3, ...)
> > + * @param k - parameter k of Exp. Golomb algorithm to be used
> > + *            (see comment on EXP_GOLOMB_K macro for details about k)
> > + */
> > +static VariLenAffix expGolombEncode(uint64_t n, int k) {
> 
> Function.

ack_once();

> > +    const uint64_t value = n + (1 << k) - 1;
> > +    const int bits = (int) log2(value) + 1;
> > +    return (VariLenAffix) {
> > +        .type = AffixType_Prefix,
> > +        .value = value,
> > +        .bits = bits + MAX((bits - 1 - k), 0)
> > +    };
> > +}
> > +
> > +# endif /* !EXP_GOLOMB_K */
> > +
> > +/** @brief Converts a suffix into a prefix, or a prefix into a suffix.
> > + *
> > + * What this function does is actually mirroring all bits of the affix
> > value,
> Drop the "What this function does..." wording and use an imperative style
> instead, ie. "Mirror all bits of..."

Ok.

> >  static int qid_path_fullmap(V9fsPDU *pdu, const struct stat *stbuf,
> >  
> >                              uint64_t *path)
> >  
> >  {
> > 
> > @@ -615,11 +795,9 @@ static int qid_path_fullmap(V9fsPDU *pdu, const
> > struct stat *stbuf,> 
> >          .ino = stbuf->st_ino
> >      
> >      }, *val;
> >      uint32_t hash = qpf_hash(lookup);
> > 
> > -
> > -    /* most users won't need the fullmap, so init the table lazily */
> > -    if (!pdu->s->qpf_table.map) {
> > -        qht_init(&pdu->s->qpf_table, qpf_lookup_func, 1 << 16,
> > QHT_MODE_AUTO_RESIZE); -    }
> > +#if P9_VARI_LENGTH_INODE_SUFFIXES
> > +    VariLenAffix affix;
> > +#endif
> 
> Move this declaration to the beginning of the block.

Ok.

> > -/* stat_to_qid needs to map inode number (64 bits) and device id (32
> > bits)
> > +/** @brief Quick mapping host inode nr -> guest inode nr.
> > + *
> > + * This function performs quick remapping of an original file inode
> > number
> > + * on host to an appropriate different inode number on guest. This
> > remapping + * of inodes is required to avoid inode nr collisions on guest
> > which would + * happen if the 9p export contains more than 1 exported
> > file system (or + * more than 1 file system data set), because unlike on
> > host level where the + * files would have different device nrs, all files
> > exported by 9p would + * share the same device nr on guest (the device nr
> > of the virtual 9p device + * that is).
> > + *
> > + * if P9_VARI_LENGTH_INODE_SUFFIXES == 0 :
> > + * stat_to_qid needs to map inode number (64 bits) and device id (32
> > bits)
> > 
> >   * to a unique QID path (64 bits). To avoid having to map and keep track
> >   * of up to 2^64 objects, we map only the 16 highest bits of the inode
> >   plus
> >   * the device id to the 16 highest bits of the QID path. The 48 lowest
> >   bits
> >   * of the QID path equal to the lowest bits of the inode number.
> >   *
> > 
> > - * This takes advantage of the fact that inode number are usually not
> > - * random but allocated sequentially, so we have fewer items to keep
> > - * track of.
> 
> Hmm... why dropping this comment ?

Because I found the entire function comment block became already too verbose 
and that particular sentence to be not relevant. At least IMO.

Best regards,
Christian Schoenebeck



reply via email to

[Prev in Thread] Current Thread [Next in Thread]