From: David Gibson
Subject: Re: [Qemu-devel] [PATCH v4 0/6] spapr/xics: fix migration of older machine types
Date: Fri, 16 Jun 2017 22:28:59 +0800
User-agent: Mutt/1.8.0 (2017-02-23)

On Fri, Jun 16, 2017 at 04:23:38PM +0530, Nikunj A Dadhania wrote:
> Nikunj A Dadhania <address@hidden> writes:
> 
> > Greg Kurz <address@hidden> writes:
> >
> >> On Sun, 11 Jun 2017 17:38:42 +0800
> >> David Gibson <address@hidden> wrote:
> >>
> >>> On Fri, Jun 09, 2017 at 05:09:13PM +0200, Greg Kurz wrote:
> >>> > On Fri, 9 Jun 2017 20:28:32 +1000
> >>> > David Gibson <address@hidden> wrote:
> >>> >   
> >>> > > On Fri, Jun 09, 2017 at 11:36:31AM +0200, Greg Kurz wrote:  
> >>> > > > On Fri, 9 Jun 2017 12:28:13 +1000
> >>> > > > David Gibson <address@hidden> wrote:
> >>> > > >     
> >>> > 1) start guest
> >>> > 
> >>> > qemu-system-ppc64 \
> >>> >  -nodefaults -nographic -snapshot -no-shutdown -serial mon:stdio \
> >>> >  -device virtio-net,netdev=netdev0,id=net0 \
> >>> >  -netdev bridge,id=netdev0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
> >>> >  -device virtio-blk,drive=drive0,id=blk0 \
> >>> >  -drive file=/home/greg/images/sle12-sp1-ppc64le.qcow2,id=drive0,if=none \
> >>> >  -machine type=pseries,accel=tcg -cpu POWER8
> >
> > Strangely, your command line does not have multiple threads. Need to see
> > what is the side effect of enabling MTTCG by default here.
> >
> >>> > 
> >>> > 2) migrate
> >>> > 
> >>> > 3) destination crashes (immediately or after very short delay) or
> >>> > hangs  
> >>> 
> >>> Ok.  I'll bisect it when I can, but you might well get to it first.
> >>> 
> >>> 
> >>
> >> Heh, maybe you didn't see in my mail but I did bisect:
> >>
> >> f0b0685d6694a28c66018f438e822596243b1250 is the first bad commit
> >> commit f0b0685d6694a28c66018f438e822596243b1250
> >> Author: Nikunj A Dadhania <address@hidden>
> >> Date:   Thu Apr 27 10:48:23 2017 +0530
> >>
> >>     tcg: enable MTTCG by default for PPC64 on x86
> >
> > Let me have a look at it.
> 
> Interesting problem here. Once the migration completes on the source, the
> guest crashes on the destination:
> 
> [   56.185314] Unable to handle kernel paging request for data at address 0x5deadbeef0000108
> [   56.185401] Faulting instruction address: 0xc000000000277bc8
> 
>    0xc000000000277bb8 <+168>: ld      r7,8(r4)
>    0xc000000000277bbc <+172>: ld      r6,0(r4)                  <========
>    0xc000000000277bc0 <+176>: ori     r8,r8,56302
>    0xc000000000277bc4 <+180>: rldicr  r8,r8,32,31
>    0xc000000000277bc8 <+184>: std     r7,8(r6)
> 
> r4 = 0xf0000000000107a0
> r6 = 0x5deadbeef0000100
> 
> The load at 0xc000000000277bbc <+172> put a junk value in r6, which leads
> to the guest crash. When I inspect the memory on the source and the
> destination in the qemu monitor, I get the following differences:
> 
> diff -u s.txt d.txt 
> --- s.txt     2017-06-16 10:34:39.657221125 +0530
> +++ d.txt     2017-06-16 10:34:18.452238305 +0530
> @@ -8,8 +8,8 @@
>  f000000000010760: 0x20de0b00 0x000000f0 0x60040100 0x000000f0
>  f000000000010770: 0x00000000 0x00000000 0x0004036d 0x000000c0
>  f000000000010780: 0x6c000100 0xf8ff3f00 0x7817f977 0x000000c0
> -f000000000010790: 0x15000000 0x00000000 0xffffffff 0x01000000
> -f0000000000107a0: 0x3090a96d 0x000000c0 0x3090a96d 0x000000c0
> +f000000000010790: 0x01000000 0x00000000 0xffffffff 0x01000000
> +f0000000000107a0: 0x000100f0 0xeedbea5d 0x000200f0 0xeedbea5d
>  f0000000000107b0: 0x00000000 0x00000000 0x00d0a96d 0x000000c0
>  f0000000000107c0: 0x28000000 0xf8ff3f00 0x8852cc77 0x000000c0
>  f0000000000107d0: 0x00000000 0x00000000 0xffffffff 0x01000000
> 
> The source had a valid address at 0xf0000000000107a0, while the
> destination had garbage.
> 
> Some observations:
> 
> * The source updates the memory location (probably atomic_cmpxchg), but
>   the updated page didn't get transferred to the destination
>   
> * Getting rid of the atomic_cmpxchg tcg ops in ldarx/stdcx makes migration
>   work fine. MTTCG running with 1 cpu.
> 
> While I continue debugging, any hints would help.
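
As a cross-check of the dump above, here is a small sketch that turns the
printed words back into 64-bit values. It assumes the monitor prints each
32-bit word with its four bytes in memory order, and that the guest is
little-endian (the image name suggests ppc64le); decode_le64 is a helper
name made up for this sketch, not anything from QEMU.

```python
def decode_le64(*words):
    """Rebuild a little-endian 64-bit value from two dumped 32-bit words."""
    # Assumption: each printed word lists its four memory bytes in address order.
    raw = b"".join(int(word, 16).to_bytes(4, "big") for word in words)
    return int.from_bytes(raw, "little")

# First doubleword at 0xf0000000000107a0, taken from the diff above.
src_first = decode_le64("0x3090a96d", "0x000000c0")
dest_first = decode_le64("0x000100f0", "0xeedbea5d")

print(hex(src_first))   # 0xc00000006da99030
print(hex(dest_first))  # 0x5deadbeef0000100
```

Under those assumptions the destination value is exactly what ended up in
r6, and r6 + 8 is the faulting address 0x5deadbeef0000108 from the oops.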

My first guess would be that some or all of the new TCG atomic
primitives aren't updating the dirty page bitmap.
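
To make that first guess concrete, here is a toy model, not QEMU code:
ToyGuestRam, store, cmpxchg_without_dirty_tracking and migration_pass are
all names invented for this sketch. It only shows how a write path that
skips dirty tracking leaves a stale page on the destination.

```python
PAGE_SIZE = 4096

class ToyGuestRam:
    """Hypothetical stand-in for guest RAM plus a per-page dirty set."""

    def __init__(self, num_pages):
        self.mem = bytearray(num_pages * PAGE_SIZE)
        self.dirty = set(range(num_pages))  # the first pass sends every page

    def store(self, addr, value):
        # Ordinary store path: writes the byte and marks the page dirty.
        self.mem[addr] = value
        self.dirty.add(addr // PAGE_SIZE)

    def cmpxchg_without_dirty_tracking(self, addr, expected, new):
        # Hypothesised bug: the atomic path writes memory but never marks the page.
        old = self.mem[addr]
        if old == expected:
            self.mem[addr] = new
        return old

def migration_pass(src, dst_mem):
    """Copy every dirty page to the destination, then clear the dirty set."""
    for page in sorted(src.dirty):
        start = page * PAGE_SIZE
        dst_mem[start:start + PAGE_SIZE] = src.mem[start:start + PAGE_SIZE]
    src.dirty.clear()

src = ToyGuestRam(num_pages=2)
dst_mem = bytearray(len(src.mem))
migration_pass(src, dst_mem)                       # bulk pass
src.store(0x10, 0xAA)                              # tracked write to page 0
src.cmpxchg_without_dirty_tracking(PAGE_SIZE + 0x10, 0x00, 0xBB)  # untracked write to page 1
migration_pass(src, dst_mem)                       # final pass

page0_matches = dst_mem[:PAGE_SIZE] == src.mem[:PAGE_SIZE]
page1_matches = dst_mem[PAGE_SIZE:] == src.mem[PAGE_SIZE:]
print(page0_matches, page1_matches)
```

In this model page 0 arrives intact and page 1 keeps its old contents,
which is the same shape as the s.txt / d.txt difference.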

My second guess would be a race between the atomic TCG ops and the
migration / dirty map handling, which means we can lose a memory update
and not transfer it to the destination.
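
A deterministic sketch of the kind of interleaving I mean follows. It is
purely illustrative: the step names and the mark-dirty-before-write
ordering are assumptions for the sake of the example, not a description of
what the TCG or migration code actually does.

```python
def run_schedule(schedule):
    """Run the named steps in order; return (source value, destination value, dirty flag)."""
    state = {"src": 0, "dst": 0, "dirty": False, "pending": False}

    def vcpu_mark_dirty():
        state["dirty"] = True

    def vcpu_write():
        state["src"] = 42

    def mig_fetch_and_clear():
        # Migration side: take the dirty bit and clear it.
        state["pending"] = state["dirty"]
        state["dirty"] = False

    def mig_copy():
        # Migration side: copy the page only if it was seen dirty.
        if state["pending"]:
            state["dst"] = state["src"]
            state["pending"] = False

    steps = {f.__name__: f for f in
             (vcpu_mark_dirty, vcpu_write, mig_fetch_and_clear, mig_copy)}
    for name in schedule:
        steps[name]()
    return state["src"], state["dst"], state["dirty"]

# The write lands before the migration side looks at the page.
safe = run_schedule(["vcpu_mark_dirty", "vcpu_write",
                     "mig_fetch_and_clear", "mig_copy"])
# The migration side clears the bit and copies between the mark and the write.
racy = run_schedule(["vcpu_mark_dirty", "mig_fetch_and_clear",
                     "mig_copy", "vcpu_write"])
print(safe, racy)
```

In the second schedule the source holds the new value, the destination
holds the old one, and the dirty bit is clear, so no later pass would
resend the page.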

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
