On 12/06/2011 07:25 PM, Paolo Bonzini wrote:
> is_dup_page is already proceeding in 32-bit chunks. Changing it to 16
> bytes using Altivec or SSE is easy, and provides a noticeable improvement.
> Pierre Riteau measured 30->25 seconds on a 16GB guest, I measured 4.6->3.9
> seconds on a 6GB guest (best of three times for me; dunno for Pierre).
> Both of them are approximately a 15% improvement.
>
> I tried playing with non-temporal prefetches, but I did not get any
> improvement (though I did get fewer cache misses, so the patch was doing
> its job).
It's worthwhile anyway IMO.
> +static int is_dup_page(uint8_t *page)
>  {
> -    uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
> -    uint32_t *array = (uint32_t *)page;
> +    VECTYPE *p = (VECTYPE *)page;
> +    VECTYPE val = SPLAT(p);
I think you can drop the SPLAT and just compare against zero. Full page
repeats of anything but zero are unlikely, so we can simplify the code a
bit here. If we do go with non-temporal loads, it saves an additional miss.
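
To make the suggestion concrete, here is a rough sketch of a zero-only check, not a proposed replacement for the patch. The names is_zero_page and PAGE_SIZE_BYTES are made up for illustration, the 4096-byte page size and the 16-byte alignment of the page are assumptions, and the non-SSE2 fallback is just a plain 64-bit loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumed page size; hypothetical name, not taken from the patch. */
#define PAGE_SIZE_BYTES 4096

/* Hypothetical helper: returns 1 if every byte of the page is zero. */
static int is_zero_page(const uint8_t *page)
{
    size_t i;

    /* Assumption: the caller hands us a 16-byte aligned page. */
    assert(((uintptr_t)page % 16) == 0);

#if defined(__SSE2__)
    const __m128i *p = (const __m128i *)page;
    const __m128i zero = _mm_setzero_si128();

    for (i = 0; i < PAGE_SIZE_BYTES / sizeof(__m128i); i++) {
        /* Each byte of eq is 0xFF where the page byte equals zero. */
        __m128i eq = _mm_cmpeq_epi8(_mm_load_si128(p + i), zero);

        /* All 16 bytes matched only if all 16 mask bits are set. */
        if (_mm_movemask_epi8(eq) != 0xFFFF) {
            return 0;
        }
    }
    return 1;
#else
    /* Portable fallback: compare 8 bytes at a time. */
    for (i = 0; i < PAGE_SIZE_BYTES; i += sizeof(uint64_t)) {
        uint64_t w;

        memcpy(&w, page + i, sizeof(w));
        if (w != 0) {
            return 0;
        }
    }
    return 1;
#endif
}
```

Since the comparison value is a constant, nothing has to be read from the page before the loop starts, which is where the saved miss would come from.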