From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
Date: Fri, 20 May 2016 22:48:11 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

Hi Sergey,

Any 'Ack' below means the change has made it to my tree.

On Sat, May 21, 2016 at 01:13:20 +0300, Sergey Fedorov wrote:
> > +#include "qemu/osdep.h"
> > +#include "qemu-common.h"
> 
> There's no need in qemu-common.h

Ack

> > +#include "qemu/seqlock.h"
> > +#include "qemu/qdist.h"
> > +#include "qemu/rcu.h"
> 
> qemu/rcu.h is really required in qht.c, not here.

Ack

> > +struct qht {
> > +    struct qht_map *map;
> > +    unsigned int mode;
> > +};
> > +
> > +struct qht_stats {
> > +    size_t head_buckets;
> > +    size_t used_head_buckets;
> > +    size_t entries;
> > +    struct qdist chain;
> > +    struct qdist occupancy;
> > +};
> > +
> > +typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
> > +typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
> 
> We could move qht_stats, qht_lookup_func_t and qht_iter_func_t closer to
> the relevant function declarations. Anyway, I think it's also fine to
> keep them here. :)

I like these at the top of files, so that functions can be easily moved around.

> Although the API is mostly intuitive some kernel-doc-style comments
> wouldn’t hurt, I think. ;-)

The nit that bothered me is the "external lock needed" bit, but it's
removed by the subsequent patch (which, once it gets reviewed, should be
merged onto this patch); I think the interface is simple enough that
comments would just add noise and maintenance burden. Plus, there are
tests under tests/.

(snip)
> > +/* if @func is NULL, then pointer comparison is used */
> > +void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> > +                 uint32_t hash);

BTW the comment above this signature isn't true anymore; I've removed it.
[ This is an example of why I'd rather have a simple interface
  than a complex one that needs documentation :P ]

> (snip)
> > diff --git a/util/qht.c b/util/qht.c
(snip)
> > + * The key structure is the bucket, which is cacheline-sized. Buckets
> > + * contain a few hash values and pointers; the u32 hash values are stored in
> > + * full so that resizing is fast. Having this structure instead of directly
> > + * chaining items has three advantages:
> 
> s/three/two/?

Ack

(snip)
> > +/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
> > +#if HOST_LONG_BITS == 32
> > +#define QHT_BUCKET_ENTRIES 6
> > +#else /* 64-bit */
> > +#define QHT_BUCKET_ENTRIES 4
> > +#endif
> > +
> > +struct qht_bucket {
> > +    QemuSpin lock;
> > +    QemuSeqLock sequence;
> > +    uint32_t hashes[QHT_BUCKET_ENTRIES];
> > +    void *pointers[QHT_BUCKET_ENTRIES];
> > +    struct qht_bucket *next;
> > +} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
> > +
> > +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
> 
> Have you considered using separate structures for head buckets and
> non-head buckets, e.g. "struct qht_head_bucket" and "struct
> qht_added_bucket"? This would give us a little more entries per cache-line.

I considered it. Note however that the gain would only apply to
32-bit hosts, since on 64-bit we would only save 8 bytes, yet we would
need 12 to store an extra hash+pointer pair (lock+sequence=8,
hashes=4*4=16, pointers=4*8=32, next=8, that is 8+16+32+8=64).
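For reference, here is a standalone sketch of that 64-bit layout (mock
types standing in for QemuSpin/QemuSeqLock, not QEMU's headers):

    /* Standalone sketch: lock+sequence = 8, hashes = 4*4 = 16,
     * pointers = 4*8 = 32, next = 8, for a total of 64 bytes,
     * i.e. exactly one cache-line-aligned bucket.
     */
    #include <assert.h>
    #include <stdint.h>

    #define MOCK_BUCKET_ALIGN   64
    #define MOCK_BUCKET_ENTRIES 4           /* 64-bit host */

    struct mock_bucket {
        uint32_t lock;                      /* stands in for QemuSpin    */
        uint32_t sequence;                  /* stands in for QemuSeqLock */
        uint32_t hashes[MOCK_BUCKET_ENTRIES];
        void *pointers[MOCK_BUCKET_ENTRIES];
        struct mock_bucket *next;
    } __attribute__((aligned(MOCK_BUCKET_ALIGN)));

    static_assert(sizeof(struct mock_bucket) == MOCK_BUCKET_ALIGN,
                  "bucket should fill exactly one cache line");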

On 32-bit hosts with 6 entries we have 4 bytes of waste, so we could
squeeze in an extra entry. I'm reluctant to do this because (1) it would
complicate the code and (2) I don't think we should care too much about
performance on 32-bit hosts.

> > +/**
> > + * struct qht_map - structure to track an array of buckets
> > + * @rcu: used by RCU. Keep it as the top field in the struct to help valgrind
> > + *       find the whole struct.
> > + * @buckets: array of head buckets. It is constant once the map is created.
> > + * @n: number of head buckets. It is constant once the map is created.
> > + * @n_added_buckets: number of added (i.e. "non-head") buckets
> > + * @n_added_buckets_threshold: threshold to trigger an upward resize once the
> > + *                             number of added buckets surpasses it.
> > + *
> > + * Buckets are tracked in what we call a "map", i.e. this structure.
> > + */
> > +struct qht_map {
> > +    struct rcu_head rcu;
> > +    struct qht_bucket *buckets;
> > +    size_t n;
> 
> s/n/n_buckets/? (Actually, 'n_buckets' is already mentioned in a comment
> below.)

Ack. I also applied this change to other functions, e.g. qht_map_create and
qht_resize.

> > +    size_t n_added_buckets;
> > +    size_t n_added_buckets_threshold;
> > +};
> > +
> > +/* trigger a resize when n_added_buckets > n_buckets / div */
> > +#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
> > +
> > +static void qht_do_resize(struct qht *ht, size_t n);
> > +
> > +static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
> > +{
> > +    struct qht_map *map;
> > +
> > +    map = atomic_read(&ht->map);
> > +    /* paired with smp_wmb() before setting ht->map */
> > +    smp_rmb();
> > +    return map;
> > +}
> 
> Why don't just use atomic_rcu_read/set()? Looks like they were meant for
> that exact purpose.

Ack. I also applied the same change to the bucket->next accessors.

In fact the right barrier to emit is smp_read_barrier_depends(); I just
noticed that under atomic_rcu_read we emit a stronger CONSUME for
recent gcc's. I'll address this in a separate patch if nobody
beats me to it.
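
With that, the map accessor reduces to roughly the following (a sketch;
the helper keeps the name from the patch, which might still change):

    static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
    {
        /* atomic_rcu_read() pairs with atomic_rcu_set() on the updater
         * side and provides the needed dependency ordering, so the
         * explicit smp_rmb() goes away.
         */
        return atomic_rcu_read(&ht->map);
    }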

(snip)
> > +static inline
> > +void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
> > +                    const void *userp, uint32_t hash)
> > +{
> > +    struct qht_bucket *b = head;
> > +    int i;
> > +
> > +    do {
> > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> > +            if (atomic_read(&b->hashes[i]) == hash) {
> > +                void *p = atomic_read(&b->pointers[i]);
> 
> Why do we need this atomic_read() and other (looking a bit inconsistent)
> atomic operations on 'b->pointers' and 'b->hash'? if we always have to
> access them protected properly by a seqlock together with a spinlock?

[ There is consistency: read accesses use the atomic ops, while write
  accesses have acquired the bucket lock and so don't need them.
  They do need care when writing, though, since there may be concurrent
  readers. ]

I'm using atomic_read but what I really want is ACCESS_ONCE. That is:
(1) Make sure that the accesses are done in a single instruction (even
    though gcc doesn't explicitly guarantee this even for aligned
    addresses anymore [1]).
(2) Make sure the pointer value is only read once, and never refetched.
    This is what comes right after the pointer is read:
> +                if (likely(p) && likely(func(p, userp))) {
> +                    return p;
> +                }
    Refetching the pointer value might result in us passing a NULL p
    value to the comparison function (since there may be concurrent
    updaters!), with an immediate segfault. See [2] for a discussion of
    this (essentially the compiler assumes that there's only a single
    thread).
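
Here is a minimal, standalone illustration of that hazard (made-up
names, plain C, not qht.c code):

    #include <stdbool.h>

    /* Without an atomic_read()/ACCESS_ONCE-style accessor the compiler
     * may load *slot twice: once for the NULL test and once for the
     * call. A concurrent remover that NULLs the slot between the two
     * loads then hands func() a NULL pointer even though the test
     * succeeded.
     */
    static bool lookup_refetch_hazard(void *const *slot,
                                      bool (*func)(const void *, const void *),
                                      const void *userp)
    {
        if (*slot && func(*slot, userp)) {  /* two plain loads of *slot */
            return true;
        }
        return false;
    }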

Given that even reading a garbled hash is OK (we don't really need (1),
since the seqlock will make us retry anyway), I've changed the code to:

         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
-            if (atomic_read(&b->hashes[i]) == hash) {
+            if (b->hashes[i] == hash) {
+                /* make sure the pointer is read only once */
                 void *p = atomic_read(&b->pointers[i]);

                 if (likely(p) && likely(func(p, userp))) {

Performance-wise this is the impact, over 10 runs, of:
        $ taskset -c 0 tests/qht-bench \
          -d 5 -n 1 -u 0 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096
on my Haswell machine I get, in Mops/s:
        atomic_read() for all           40.389 +- 0.20888327415622
        atomic_read(p) only             40.759 +- 0.212835356294224
        no atomic_read(p) (unsafe)      40.559 +- 0.121422128680622

Note that the unsafe version is slightly slower; I guess the CPU is trying
to speculate too much and is gaining little from it.

[1] "Linux-Kernel Memory Model" by Paul McKenney
    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4374.html
[2] https://lwn.net/Articles/508991/

(snip)
> > +static __attribute__((noinline))
> > +void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
> > +                           const void *userp, uint32_t hash)
> > +{
> > +    uint32_t version;
> > +    void *ret;
> > +
> > +    do {
> > +        version = seqlock_read_begin(&b->sequence);
> 
> seqlock_read_begin() returns "unsigned".

Ack. Changed all callers to unsigned int.

(snip)
> > +/* call with an external lock held */
> > +bool qht_insert(struct qht *ht, void *p, uint32_t hash)
> > +{
> > +    struct qht_map *map = ht->map;
> > +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> > +    bool needs_resize = false;
> > +    bool ret;
> > +
> > +    /* NULL pointers are not supported */
> > +    assert(p);
> 
> Maybe wrap such assertions in a macro, something like tcg_debug_assert()?

Ack. Defined qht_debug_assert, which only generates the assert if
QHT_DEBUG is defined.
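
For reference, the macro is along these lines (a sketch; the exact
definition in my tree may differ slightly):

    #include <assert.h>

    #ifdef QHT_DEBUG
    #define qht_debug_assert(X) do { assert(X); } while (0)
    #else
    #define qht_debug_assert(X) do { (void)(X); } while (0)
    #endif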

(snip)
> > +/*
> > + * Find the last valid entry in @head, and swap it with @orig[pos], which has
> > + * just been invalidated.
> > + */
> > +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
> > +{
> > +    struct qht_bucket *b = orig;
> > +    struct qht_bucket *prev = NULL;
> > +    int i;
> > +
> > +    if (qht_entry_is_last(orig, pos)) {
> > +        return;
> > +    }
> > +    do {
> > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> 
> We could iterate in the opposite direction: from the last entry in a
> qht_bucket to the first. It would allow us to fast-forward to the next
> qht_bucket in a chain in case of non-NULL last entry and speed-up the
> search.

But it would slow us down if, say, only the first entry is set. It would
also complicate the code a bit.

Note that with the resizing threshold that we have, we're guaranteed to
have at most 1/8 of the head buckets full. We should therefore optimize
for the case where the head bucket isn't full.
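
For illustration, that trigger boils down to a check like this (a sketch
based on the qht_map fields quoted above; the actual helper in the patch
may be named differently):

    /* Upward resize once the added (non-head) buckets exceed 1/8 of the
     * head buckets, i.e. n_buckets / QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV.
     */
    static inline bool qht_map_needs_resize(const struct qht_map *map)
    {
        return map->n_added_buckets > map->n_added_buckets_threshold;
    }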

> > +            if (q == p) {
> > +                assert(b->hashes[i] == hash);
> > +                seqlock_write_begin(&head->sequence);
> > +                atomic_set(&b->hashes[i], 0);
> > +                atomic_set(&b->pointers[i], NULL);
> 
> Could we better avoid zeroing these here? (At least when debugging is
> disabled.)

Ack. We get a small speedup. Before and after, for the same test as
above with a 100% update rate, in Mops/s:
  before: 33.779 +- 0.19518936674135
  after:  33.937 +- 0.132249763704892

Changed the code to:

@@ -557,7 +557,7 @@ static void
 qht_entry_move(struct qht_bucket *to, int i, struct qht_bucket *from, int j)
 {
     qht_debug_assert(!(to == from && i == j));
-    qht_debug_assert(to->pointers[i] == NULL);
+    qht_debug_assert(to->pointers[i]);
     qht_debug_assert(from->pointers[j]);

     atomic_set(&to->hashes[i], from->hashes[j]);
@@ -578,11 +578,13 @@ static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
     int i;

     if (qht_entry_is_last(orig, pos)) {
+        atomic_set(&orig->hashes[pos], 0);
+        atomic_set(&orig->pointers[pos], NULL);
         return;
     }
     do {
         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
-            if (b->pointers[i] || (b == orig && i == pos)) {
+            if (b->pointers[i]) {
                 continue;
             }
             if (i > 0) {
@@ -616,8 +618,6 @@ bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
             if (q == p) {
                 qht_debug_assert(b->hashes[i] == hash);
                 seqlock_write_begin(&head->sequence);
-                atomic_set(&b->hashes[i], 0);
-                atomic_set(&b->pointers[i], NULL);
                 qht_bucket_fill_hole(b, i);
                 seqlock_write_end(&head->sequence);
                 return true;
@@ -634,6 +634,9 @@ bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
     struct qht_map *map;
     bool ret;

+    /* NULL pointers are not supported */
+    qht_debug_assert(p);
+

(snip)
> > +/* call with an external lock held */
> > +static void qht_do_resize(struct qht *ht, size_t n)
> > +{
> > +    struct qht_map *old = ht->map;
> > +    struct qht_map *new;
> > +
> > +    g_assert_cmpuint(n, !=, old->n);
> 
> Maybe use g_assert*() consistently? (I mead using either assert() or
> g_assert*() variants.)

This is the only g_assert I'm using. This is a slow path and I quite like
the fact that the numbers get printed on failure; I left it as is.

All other asserts are of the type assert(bool), so there's little
point in using g_assert for them. BTW all other asserts are now
under qht_debug_assert.

Thanks for taking a look! If you have time, please check patch 13 out.
That patch should eventually be merged onto this one.

                Emilio


