qemu-devel

RE: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB


From: Liu, Yuan1
Subject: RE: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
Date: Wed, 10 Jul 2024 13:55:23 +0000

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, July 10, 2024 2:43 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>; Eduardo
> Habkost <eduardo@habkost.net>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>; Philippe
> Mathieu-Daudé <philmd@linaro.org>; Fabiano Rosas <farosas@suse.de>; Eric
> Blake <eblake@redhat.com>; Markus Armbruster <armbru@redhat.com>; Laurent
> Vivier <lvivier@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> <hao.xiang@linux.dev>; Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack)
> Chuang <horenchuang@bytedance.com>
> Subject: Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
> 
> On Tue, Jul 09, 2024 at 08:42:59AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Yichen Wang <yichen.wang@bytedance.com>
> > > Sent: Saturday, July 6, 2024 2:29 AM
> > > To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. Berrangé
> > > <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc-André
> > > Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>;
> > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > > Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>;
> > > qemu-devel@nongnu.org
> > > Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> > > Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang
> > > <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> > > Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
> > >
> > > v4:
> > > - Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768
> > > - Move the IOV initialization to qatzip implementation
> > > - Only use qatzip to compress normal pages
> > >
> > > v3:
> > > - Rebase changes on top of master
> > > - Merge two patches per Fabiano Rosas's comment
> > > - Add versions into comments and documentations
> > >
> > > v2:
> > > - Rebase changes on top of recent multifd code changes.
> > > - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> > > - Remove parameter tuning and use QATzip's defaults for better
> > >   performance.
> > > - Add parameter to enable QAT software fallback.
> > >
> > > v1:
> > > https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> > >
> > > * Performance
> > >
> > > We present updated performance results. For circumstantial reasons, v1
> > > presented performance on a low-bandwidth (1Gbps) network.
> > >
> > > Here, we present updated results with a similar setup as before but with
> > > two main differences:
> > >
> > > 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> > > 2. We had a bug in our memory allocation causing us to only use ~1/2 of
> > > the VM's RAM. Now we properly allocate and fill nearly all of the VM's
> > > RAM.
> > >
> > > Thus, the test setup is as follows:
> > >
> > > We perform multifd live migration over TCP using a VM with 64GB memory.
> > > We prepare the machine's memory by powering it on, allocating a large
> > > amount of memory (60GB) as a single buffer, and filling the buffer with
> > > the repeated contents of the Silesia corpus[0]. This is in lieu of a more
> > > realistic memory snapshot, which proved troublesome to acquire.
> > >
> > > We analyze CPU usage by averaging the output of 'top' every second
> > > during migration. This is admittedly imprecise, but we feel that it
> > > accurately portrays the different degrees of CPU usage of varying
> > > compression methods.
> > >
> > > We present the latency, throughput, and CPU usage results for all of the
> > > compression methods, with varying numbers of multifd threads (4, 8, and
> > > 16).
> > >
> > > [0] The Silesia corpus can be accessed here:
> > > https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > >
> > > ** Results
> > >
> > > 4 multifd threads:
> > >
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |qatzip         | 23.13         | 8749.94        |117.50   |186.49   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |zlib           |254.35         |  771.87        |388.20   |144.40   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |zstd           | 54.52         | 3442.59        |414.59   |149.77   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |none           | 12.45         |43739.60        |159.71   |204.96   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >
> > > 8 multifd threads:
> > >
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |qatzip         | 16.91         |12306.52        |186.37   |391.84   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |zlib           |130.11         | 1508.89        |753.86   |289.35   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |zstd           | 27.57         | 6823.23        |786.83   |303.80   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |none           | 11.82         |46072.63        |163.74   |238.56   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >
> > > 16 multifd threads:
> > >
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |qatzip         |18.64          |11044.52        | 573.61  |437.65   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |zlib           |66.43          | 2955.79        |1469.68  |567.47   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |zstd           |14.17          |13290.66        |1504.08  |615.33   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >     |none           |16.82          |32363.26        | 180.74  |217.17   |
> > >     |---------------|---------------|----------------|---------|---------|
> > >
> > > ** Observations
> > >
> > > - In general, not using compression outperforms using compression in a
> > >   non-network-bound environment.
> > > - 'qatzip' outperforms other compression workers with 4 and 8 workers,
> > >   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
> > > ~58% latency reduction over 'zstd' with 4 workers.
> > > - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
> > >   showing a ~32% increase in latency. This performance difference
> > >   becomes more noticeable with more workers, as CPU compression is
> > >   highly parallelizable.
> > > - 'qatzip' compression uses considerably less CPU than other compression
> > >   methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
> > >   compression CPU usage compared to 'zstd' and 'zlib'.
> > > - 'qatzip' decompression CPU usage is less impressive, and is even
> > >   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
> >
> > Hi Peter & Yichen
> >
> > I have a test based on the V4 patch set
> > VM configuration:16 vCPU, 64G memory,
> > VM Workload: all vCPUs are idle and 54G memory is filled with Silesia data.
> > QAT Devices: 4
> >
> > Sender migration parameters
> > migrate_set_capability multifd on
> > migrate_set_parameter multifd-channels 2/4/8
> > migrate_set_parameter max-bandwidth 1G/10G
> 
> Ah, I think this means GBps... not Gbps, then.
> 
> > migrate_set_parameter multifd-compression qatzip/zstd
> >
> > Receiver migration parameters
> > migrate_set_capability multifd on
> > migrate_set_parameter multifd-channels 2
> > migrate_set_parameter multifd-compression qatzip/zstd
> >
> > max-bandwidth: 1GBps
> >      |-----------|--------|---------|----------|------|------|
> >      |2 Channels |Total   |down     |throughput| send | recv |
> >      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> >      |-----------|--------|---------|----------|------|------|
> >      |qatzip     |   21607|       77|      8051|    88|   125|
> >      |-----------|--------|---------|----------|------|------|
> >      |zstd       |   78351|       96|      2199|   204|    80|
> >      |-----------|--------|---------|----------|------|------|
> >
> >      |-----------|--------|---------|----------|------|------|
> >      |4 Channels |Total   |down     |throughput| send | recv |
> >      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> >      |-----------|--------|---------|----------|------|------|
> >      |qatzip     |   20336|       25|      8557|   110|   190|
> >      |-----------|--------|---------|----------|------|------|
> >      |zstd       |   39324|       31|      4389|   406|   160|
> >      |-----------|--------|---------|----------|------|------|
> >
> >      |-----------|--------|---------|----------|------|------|
> >      |8 Channels |Total   |down     |throughput| send | recv |
> >      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> >      |-----------|--------|---------|----------|------|------|
> >      |qatzip     |   20208|       22|      8613|   125|   300|
> >      |-----------|--------|---------|----------|------|------|
> >      |zstd       |   20515|       22|      8438|   800|   340|
> >      |-----------|--------|---------|----------|------|------|
> >
> > max-bandwidth: 10GBps
> >      |-----------|--------|---------|----------|------|------|
> >      |2 Channels |Total   |down     |throughput| send | recv |
> >      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> >      |-----------|--------|---------|----------|------|------|
> >      |qatzip     |   22450|       77|      7748|    80|   125|
> >      |-----------|--------|---------|----------|------|------|
> >      |zstd       |   78339|       76|      2199|   204|    80|
> >      |-----------|--------|---------|----------|------|------|
> >
> >      |-----------|--------|---------|----------|------|------|
> >      |4 Channels |Total   |down     |throughput| send | recv |
> >      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> >      |-----------|--------|---------|----------|------|------|
> >      |qatzip     |   13017|       24|     13401|   180|   285|
> >      |-----------|--------|---------|----------|------|------|
> >      |zstd       |   39466|       21|      4373|   406|   160|
> >      |-----------|--------|---------|----------|------|------|
> >
> >      |-----------|--------|---------|----------|------|------|
> >      |8 Channels |Total   |down     |throughput| send | recv |
> >      |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> >      |-----------|--------|---------|----------|------|------|
> >      |qatzip     |   10255|       22|     17037|   280|   590|
> >      |-----------|--------|---------|----------|------|------|
> >      |zstd       |   20126|       77|      8595|   810|   340|
> >      |-----------|--------|---------|----------|------|------|
> 
> PS: this 77ms downtime smells like it hits some spikes during save/load.
> Doesn't look reproducible compared to the rest of the data.

I agree with this.

> >
> > If the user has enabled compression in live migration, using QAT
> > can save the host CPU resources.
> >
> > When compression is enabled, the bottleneck of migration is usually
> > the compression throughput on the sender side, since CPU decompression
> > throughput is higher than compression, some reference data
> > https://github.com/inikep/lzbench, so more CPU resources need to be
> > allocated to the sender side.
> 
> Thank you, Yuan.
> 
> >
> > Summary:
> > 1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps,
> >    but ZSTD needs 800%.
> > 2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps
> >    But ZSTD still cannot reach 10Gbps even if it uses 810%.
> 
> So I assumed you always meant GBps across all the test results, as only
> that matches with max-bandwidth parameter.
> 
> Then in this case 10GBps is actually 80Gbps, which was not a low bandwidth
> test.
> 
> And I think the most interesting one that I would be curious is nocomp in
> low network tests.  Would you mind run one more test with the same
> workload, but with: no-comp, 8 channels, 10Gbps (or 1GBps)?
> 
> I think in this case multifd shouldn't matter a huge deal, but let's still
> enable that just assume that's the baseline / default setup.  I would
> expect this result should obviously show a win on using compressors, but
> just to check.

migrate_set_parameter max-bandwidth 1250M
|-----------|--------|---------|----------|----------|------|------|
|8 Channels |Total   |down     |throughput|pages per | send | recv |
|           |time(ms)|time(ms) |(mbps)    |second    | cpu %| cpu% |
|-----------|--------|---------|----------|----------|------|------|
|qatzip     |   16630|       28|     10467|   2940235|   160|   360|
|-----------|--------|---------|----------|----------|------|------|
|zstd       |   20165|       24|      8579|   2391465|   810|   340|
|-----------|--------|---------|----------|----------|------|------|
|none       |   46063|       40|     10848|    330240|    45|    85|
|-----------|--------|---------|----------|----------|------|------|

QATzip's dirty page processing throughput is much higher than that of no
compression. In this test the vCPUs are idle, so the migration can succeed
even without compression.
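To put the pages-per-second column in perspective, here is a small
illustrative script; it is plain arithmetic on the table values above,
nothing QEMU-specific:

```python
# Pages-per-second figures from the 8-channel, 1250M max-bandwidth run above.
pages_per_second = {
    "qatzip": 2940235,
    "zstd":   2391465,
    "none":    330240,
}

# How many times faster each compressor drains dirty pages vs. no compression.
for method in ("qatzip", "zstd"):
    ratio = pages_per_second[method] / pages_per_second["none"]
    print(f"{method}: ~{ratio:.1f}x the page rate of 'none'")
```

This gives roughly 8.9x for qatzip and 7.2x for zstd, which is why
compression helps convergence even when raw bandwidth is not the bottleneck.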

> > 3. The QAT decompression CPU utilization is higher than compression and
> >    ZSTD, from my analysis
> >    3.1 when using QAT compression, the data needs to be copied to the QAT
> >        memory (for DMA operations), and the same for decompression.
> >        However, do_user_addr_fault will be triggered during decompression
> >        because the QAT decompressed data is copied to the VM address space
> >        for the first time, in addition, both compression and decompression
> >        are processed by QAT and do not consume CPU resources, so the CPU
> >        utilization of the receiver is slightly higher than the sender.
> 
> I thought you hit this same issue when working on QPL and I remember you
> used -mem-prealloc.  Why not use it here?
> 
> >
> >    3.2 Since zstd decompression decompresses data directly into the VM
> >        address space, there is one less memory copy than QAT, so the CPU
> >        utilization on the receiver is better than QAT. For the 1GBps case,
> >        the receiver CPU utilization is 125%, and the memory copy occupies
> >        ~80% of CPU utilization.
> 
> Hmm, yes I read that part in code and I thought it was a design decision to
> do the copy, the comment said "it is faster".  So it's not?
> 
> I think we can definitely submit compression tasks per-page rather than
> buffering, if that would be better.

I think "faster" here probably refers to QAT throughput; QAT is more friendly
to large-block compression (e.g. 32KB). Also, QATzip does not support batching
compression tasks, so copying multiple small pages into one buffer before
compressing is a common practice.
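As a rough sketch of that coalescing step (zlib stands in for qzCompress
here, since the real call would target the QAT hardware; the names and sizes
are illustrative, not taken from the patch):

```python
import zlib

PAGE_SIZE = 4096      # typical guest page size
BATCH_PAGES = 8       # coalesce 8 pages into one 32KB input block

def compress_batch(pages):
    # QATzip has no batched-submit API, so senders typically copy many
    # small pages into one contiguous buffer and compress it in one call.
    buf = b"".join(pages)            # the extra copy discussed above
    return zlib.compress(buf, 1)     # stand-in for the QAT compress call

pages = [bytes([i]) * PAGE_SIZE for i in range(BATCH_PAGES)]
blob = compress_batch(pages)
assert zlib.decompress(blob) == b"".join(pages)   # round-trips intact
```

The extra copy into the staging buffer is exactly the cost weighed against
the better large-block throughput of the hardware.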

> >    I think this is acceptable. Considering the overall CPU usage of the
> >    sender and receiver, the QAT benefit is good.
> 
> Yes, I don't think there's any major issue to block this from supported,
> it's more about when we are at it we'd better figure all things out.
> 
> For example, I think we used to discuss the use case where there's 100G*2
> network deployed, but the admin may still want to have some control plane
> VMs moving around using very limited network for QoS.  In that case, I
> wonder any of you thought about using postcopy?  I assume the control plane
> workload isn't super critical in this case or it won't get provisioned with
> low network for migrations, in that case maybe it'll also be fine to
> post-copy after one round of precopy on the slow-bandwidth network.
> 
> Again, I don't think the answer blocks such feature in any form whoever
> simply wants to use a compressor, just to ask.

I don't have much experience with postcopy; here are some of my thoughts:
1. For write-intensive VMs, this solution can improve the migration success
   rate, because in a limited-bandwidth network the dirty page processing
   throughput drops significantly without compression. The previous data
   (pages_per_second) shows this: in a no-compression precopy, the dirty
   pages generated by the workload outpace what migration can process,
   causing the migration to fail.

2. If the VM is read-intensive or has low vCPU utilization (for example, in
   my current test scenario all vCPUs are idle), I think no compression +
   precopy + postcopy also cannot improve migration performance, and may
   cause a timeout failure due to the long migration time, the same as a
   no-compression precopy.

3. In my opinion, postcopy is a good solution in this scenario (low network
   bandwidth, non-critical VM), because even with compression enabled the
   migration may still fail (pages_per_second may still be less than the new
   dirty page rate), and it is hard to predict whether the VM's memory is
   compression-friendly.
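The convergence argument in point 1 reduces to a one-line condition. A toy
model (the 500k dirty rate below is a made-up workload figure; only the
330k/2.9M page rates come from the earlier table):

```python
def precopy_converges(pages_per_second, dirty_pages_per_second):
    # Precopy only finishes if migration drains pages faster than the
    # workload re-dirties them; otherwise the remaining set never shrinks.
    return pages_per_second > dirty_pages_per_second

# Hypothetical workload dirtying 500k pages/s: a no-compression precopy
# (~330k pages/s measured above) never converges, while qatzip (~2.9M) does.
assert not precopy_converges(330_240, 500_000)
assert precopy_converges(2_940_235, 500_000)
```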

> Thanks,
> 
> --
> Peter Xu

