RE: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB


From: Liu, Yuan1
Subject: RE: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
Date: Tue, 9 Jul 2024 08:42:59 +0000

> -----Original Message-----
> From: Yichen Wang <yichen.wang@bytedance.com>
> Sent: Saturday, July 6, 2024 2:29 AM
> To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. Berrangé
> <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc-André
> Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>; qemu-
> devel@nongnu.org
> Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
> 
> v4:
> - Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768
> - Move the IOV initialization to qatzip implementation
> - Only use qatzip to compress normal pages
> 
> v3:
> - Rebase changes on top of master
> - Merge two patches per Fabiano Rosas's comment
> - Add versions into comments and documentations
> 
> v2:
> - Rebase changes on top of recent multifd code changes.
> - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> - Remove parameter tuning and use QATzip's defaults for better
>   performance.
> - Add parameter to enable QAT software fallback.
> 
> v1:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> 
> * Performance
> 
> We present updated performance results. For circumstantial reasons, v1
> presented performance on a low-bandwidth (1Gbps) network.
> 
> Here, we present updated results with a similar setup as before but with
> two main differences:
> 
> 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> 2. We had a bug in our memory allocation causing us to only use ~1/2 of
> the VM's RAM. Now we properly allocate and fill nearly all of the VM's
> RAM.
> 
> Thus, the test setup is as follows:
> 
> We perform multifd live migration over TCP using a VM with 64GB memory.
> We prepare the machine's memory by powering it on, allocating a large
> amount of memory (60GB) as a single buffer, and filling the buffer with
> the repeated contents of the Silesia corpus[0]. This is in lieu of a more
> realistic memory snapshot, which proved troublesome to acquire.
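
(For reference, a minimal C sketch of this kind of memory fill follows; the corpus path and exact sizes are illustrative assumptions, not taken from the actual test harness.)

    /* Sketch: allocate one large buffer and tile it with the Silesia corpus
     * so that every page holds realistic, compressible data.  The 60GB size
     * and ./silesia.tar path are illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t buf_size = 60ULL << 30;      /* 60GB fill buffer */
        const char *path = "./silesia.tar";       /* hypothetical corpus file */
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0) {
            perror("open/fstat corpus");
            return 1;
        }

        char *corpus = malloc(st.st_size);
        char *buf = malloc(buf_size);
        /* A single read is assumed to return the whole file in this sketch. */
        if (!corpus || !buf || read(fd, corpus, st.st_size) != st.st_size) {
            fprintf(stderr, "allocation or read failed\n");
            return 1;
        }

        /* Tile the corpus end to end across the buffer. */
        for (size_t off = 0; off < buf_size; off += st.st_size) {
            size_t n = buf_size - off;
            if (n > (size_t)st.st_size) {
                n = st.st_size;
            }
            memcpy(buf + off, corpus, n);
        }

        pause();   /* keep the memory resident while the migration runs */
        return 0;
    }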
> 
> We analyze CPU usage by averaging the output of 'top' every second
> during migration. This is admittedly imprecise, but we feel that it
> accurately portrays the different degrees of CPU usage of varying
> compression methods.
> 
> We present the latency, throughput, and CPU usage results for all of the
> compression methods, with varying numbers of multifd threads (4, 8, and
> 16).
> 
> [0] The Silesia corpus can be accessed here:
> https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> 
> ** Results
> 
> 4 multifd threads:
> 
>     |---------------|---------------|----------------|---------|---------|
>     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
>     |---------------|---------------|----------------|---------|---------|
>     |qatzip         | 23.13         | 8749.94        |117.50   |186.49   |
>     |---------------|---------------|----------------|---------|---------|
>     |zlib           |254.35         |  771.87        |388.20   |144.40   |
>     |---------------|---------------|----------------|---------|---------|
>     |zstd           | 54.52         | 3442.59        |414.59   |149.77   |
>     |---------------|---------------|----------------|---------|---------|
>     |none           | 12.45         |43739.60        |159.71   |204.96   |
>     |---------------|---------------|----------------|---------|---------|
> 
> 8 multifd threads:
> 
>     |---------------|---------------|----------------|---------|---------|
>     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
>     |---------------|---------------|----------------|---------|---------|
>     |qatzip         | 16.91         |12306.52        |186.37   |391.84   |
>     |---------------|---------------|----------------|---------|---------|
>     |zlib           |130.11         | 1508.89        |753.86   |289.35   |
>     |---------------|---------------|----------------|---------|---------|
>     |zstd           | 27.57         | 6823.23        |786.83   |303.80   |
>     |---------------|---------------|----------------|---------|---------|
>     |none           | 11.82         |46072.63        |163.74   |238.56   |
>     |---------------|---------------|----------------|---------|---------|
> 
> 16 multifd threads:
> 
>     |---------------|---------------|----------------|---------|---------|
>     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
>     |---------------|---------------|----------------|---------|---------|
>     |qatzip         |18.64          |11044.52        | 573.61  |437.65   |
>     |---------------|---------------|----------------|---------|---------|
>     |zlib           |66.43          | 2955.79        |1469.68  |567.47   |
>     |---------------|---------------|----------------|---------|---------|
>     |zstd           |14.17          |13290.66        |1504.08  |615.33   |
>     |---------------|---------------|----------------|---------|---------|
>     |none           |16.82          |32363.26        | 180.74  |217.17   |
>     |---------------|---------------|----------------|---------|---------|
> 
> ** Observations
> 
> - In general, not using compression outperforms using compression in a
>   non-network-bound environment.
> - 'qatzip' outperforms other compression workers with 4 and 8 workers,
>   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
>   ~58% latency reduction over 'zstd' with 4 workers.
> - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
>   showing a ~32% increase in latency. This performance difference
>   becomes more noticeable with more workers, as CPU compression is
>   highly parallelizable.
> - 'qatzip' compression uses considerably less CPU than other compression
>   methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
>   compression CPU usage compared to 'zstd' and 'zlib'.
> - 'qatzip' decompression CPU usage is less impressive, and is even
>   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.

Hi Peter & Yichen,

I have run a test based on the v4 patch set.
VM configuration: 16 vCPUs, 64GB memory
VM workload: all vCPUs are idle and 54GB of memory is filled with Silesia data.
QAT devices: 4

Sender migration parameters:
migrate_set_capability multifd on
migrate_set_parameter multifd-channels 2/4/8
migrate_set_parameter max-bandwidth 1G/10G
migrate_set_parameter multifd-compression qatzip/zstd

Receiver migration parameters:
migrate_set_capability multifd on
migrate_set_parameter multifd-channels 2
migrate_set_parameter multifd-compression qatzip/zstd

max-bandwidth: 1GBps
     |-----------|--------|---------|----------|------|------|
     |2 Channels |Total   |down     |throughput| send | recv |
     |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
     |-----------|--------|---------|----------|------|------|
     |qatzip     |   21607|       77|      8051|    88|   125|
     |-----------|--------|---------|----------|------|------|
     |zstd       |   78351|       96|      2199|   204|    80|
     |-----------|--------|---------|----------|------|------|

     |-----------|--------|---------|----------|------|------|
     |4 Channels |Total   |down     |throughput| send | recv |
     |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
     |-----------|--------|---------|----------|------|------|
     |qatzip     |   20336|       25|      8557|   110|   190|
     |-----------|--------|---------|----------|------|------|
     |zstd       |   39324|       31|      4389|   406|   160|
     |-----------|--------|---------|----------|------|------|

     |-----------|--------|---------|----------|------|------|
     |8 Channels |Total   |down     |throughput| send | recv |
     |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
     |-----------|--------|---------|----------|------|------|
     |qatzip     |   20208|       22|      8613|   125|   300|
     |-----------|--------|---------|----------|------|------|
     |zstd       |   20515|       22|      8438|   800|   340|
     |-----------|--------|---------|----------|------|------|

max-bandwidth: 10GBps
     |-----------|--------|---------|----------|------|------|
     |2 Channels |Total   |down     |throughput| send | recv |
     |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
     |-----------|--------|---------|----------|------|------|
     |qatzip     |   22450|       77|      7748|    80|   125|
     |-----------|--------|---------|----------|------|------|
     |zstd       |   78339|       76|      2199|   204|    80|
     |-----------|--------|---------|----------|------|------|

     |-----------|--------|---------|----------|------|------|
     |4 Channels |Total   |down     |throughput| send | recv |
     |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
     |-----------|--------|---------|----------|------|------|
     |qatzip     |   13017|       24|     13401|   180|   285|
     |-----------|--------|---------|----------|------|------|
     |zstd       |   39466|       21|      4373|   406|   160|
     |-----------|--------|---------|----------|------|------|

     |-----------|--------|---------|----------|------|------|
     |8 Channels |Total   |down     |throughput| send | recv |
     |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
     |-----------|--------|---------|----------|------|------|
     |qatzip     |   10255|       22|     17037|   280|   590|
     |-----------|--------|---------|----------|------|------|
     |zstd       |   20126|       77|      8595|   810|   340|
     |-----------|--------|---------|----------|------|------|

If the user has enabled compression for live migration, using QAT
can save host CPU resources.

When compression is enabled, the migration bottleneck is usually the
compression throughput on the sender side: CPU decompression throughput
is generally higher than compression throughput (see the reference data
at https://github.com/inikep/lzbench), so more CPU resources need to be
allocated to the sender side.
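
As a rough worked comparison from the 8-channel, 10 GBps rows above: zstd
sustains ~8.6 Gbps using ~810% sender CPU (about 1 Gbps per 100% CPU),
while qatzip sustains ~17 Gbps using ~280% (about 6 Gbps per 100% CPU),
i.e. roughly a 6x difference in sender-side compression cost per unit of
throughput.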

Summary:
1. In the 1 GBps case, QAT uses only 88% CPU utilization to reach the
   ~1 GBps (8 Gbps) bandwidth limit, while ZSTD needs 800%.
2. In the 10 GBps case, QAT uses 180% CPU utilization to exceed 10 Gbps
   (~13.4 Gbps with 4 channels), but ZSTD still cannot reach 10 Gbps even
   when using 810%.
3. The QAT decompression CPU utilization is higher than that of QAT
   compression and of ZSTD. From my analysis:
   3.1 When using QAT, the data needs to be copied into QAT memory (for DMA
       operations) on both the compression and the decompression side. On
       the receiver, do_user_addr_fault is additionally triggered because
       the QAT-decompressed data is copied into the VM address space for
       the first time. Since the compression and decompression themselves
       are handled by QAT and consume almost no CPU, this extra copy and
       page-fault work makes the receiver's CPU utilization slightly higher
       than the sender's.

   3.2 zstd decompresses directly into the VM address space, so it has one
       less memory copy than QAT, and its receiver CPU utilization is
       better than QAT's. For the 1 GBps case, the QAT receiver CPU
       utilization is 125%, and the memory copy accounts for ~80% of that
       (a minimal code sketch of the two receive paths is included below,
       after this summary).

   I think this is acceptable. Considering the overall CPU usage of the
   sender and receiver, the QAT benefit is good.
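
To make 3.1/3.2 concrete, below is a minimal, hypothetical C sketch of the
QAT receive path (this is not the patch code; the exact qzMalloc flags and
the helper name are assumptions for illustration):

    /* zstd-style path: decompress straight into guest RAM, no extra copy.
     * QAT path: the hardware DMA needs a driver-visible (pinned) buffer,
     * so the data is decompressed there first and then memcpy'd into the
     * guest address space; that second copy is what first-touches the
     * guest pages and shows up as do_user_addr_fault on the receiver. */
    #include <qatzip.h>
    #include <string.h>

    static int qat_decompress_to_guest(QzSession_T *sess,
                                       const unsigned char *wire_buf,
                                       unsigned int wire_len,
                                       unsigned char *guest_page,
                                       unsigned int page_len)
    {
        unsigned int src_len = wire_len;
        unsigned int dst_len = page_len;
        /* Pinned, DMA-able buffer from the QATzip allocator; the NUMA node
         * argument (0) and PINNED_MEM flag are assumptions in this sketch. */
        unsigned char *dma_buf = qzMalloc(page_len, 0, PINNED_MEM);

        if (!dma_buf) {
            return -1;
        }
        if (qzDecompress(sess, wire_buf, &src_len, dma_buf, &dst_len) != QZ_OK) {
            qzFree(dma_buf);
            return -1;
        }
        /* The extra copy: pinned buffer -> guest RAM (faults guest pages in). */
        memcpy(guest_page, dma_buf, dst_len);
        qzFree(dma_buf);
        return 0;
    }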

> Bryan Zhang (4):
>   meson: Introduce 'qatzip' feature to the build system
>   migration: Add migration parameters for QATzip
>   migration: Introduce 'qatzip' compression method
>   tests/migration: Add integration test for 'qatzip' compression method
> 
>  hw/core/qdev-properties-system.c |   6 +-
>  meson.build                      |  10 +
>  meson_options.txt                |   2 +
>  migration/meson.build            |   1 +
>  migration/migration-hmp-cmds.c   |   8 +
>  migration/multifd-qatzip.c       | 391 +++++++++++++++++++++++++++++++
>  migration/multifd.h              |   5 +-
>  migration/options.c              |  57 +++++
>  migration/options.h              |   2 +
>  qapi/migration.json              |  38 +++
>  scripts/meson-buildoptions.sh    |   3 +
>  tests/qtest/meson.build          |   4 +
>  tests/qtest/migration-test.c     |  35 +++
>  13 files changed, 559 insertions(+), 3 deletions(-)
>  create mode 100644 migration/multifd-qatzip.c
> 
> --
> Yichen Wang



