qemu-devel

Re: [Qemu-devel] Live migration hangs after migration to remote host


From: Eduardo Otubo
Subject: Re: [Qemu-devel] Live migration hangs after migration to remote host
Date: Wed, 29 Jul 2015 14:47:36 +0200
User-agent: Mutt/1.5.23 (2014-03-12)

On Wed, Jul 29, 2015 at 11:38:44AM +0100, Dr. David Alan Gilbert wrote:
> * Eduardo Otubo (address@hidden) wrote:
> > On Wed, Jul 29, 2015 at 10:32:59AM +0100, Dr. David Alan Gilbert wrote:
> > > * Eduardo Otubo (address@hidden) wrote:
> > > > On Wed, Jul 29, 2015 at 09:11:21AM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Eduardo Otubo (address@hidden) wrote:
> > > > > > On Tue, Jul 28, 2015 at 04:19:46PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > > * Eduardo Otubo (address@hidden) wrote:
> > > > > > > > Hello all,
> > > > > > > > 
> > > > > > > > > I'm facing a weird behavior on my tests: I am able to live
> > > > > > > > > migrate between two virtual machines on my localhost, but
> > > > > > > > > not to another machine, both using tcp.
> > > > > > > > 
> > > > > > > > * I am using the same arguments on the command line;
> > > > > > > > > * Both virtual machines use the same qcow2 file visible through NFS;
> > > > > > > > * Both machines are in the same subnet;
> > > > > > > > * Migration is being done from intel to intel;
> > > > > > > > * Same version of Qemu (github master - f8787f8723);
> > > > > > > > 
> > > > > > > > > Using all of the above I am able to live migrate on the same
> > > > > > > > > host: between two vms on local host or between two vms in
> > > > > > > > > the remote host; but when migrating from local to remote,
> > > > > > > > > the guest hangs. I still can access its console via
> > > > > > > > > ctrl+alt+2, though, and everything seems to be normal. If I
> > > > > > > > > issue a reboot via console on the remote, the guest gets
> > > > > > > > > back to normal.
> > > > > > > > 
> > > > > > > > Am I missing something here?
> > > > > > > 
> > > > > > > > Just checking, but are you saying that as far as qemu is
> > > > > > > > concerned, the migration is happy, it's just the guest
> > > > > > > > that's hung?
> > > > > > 
> > > > > > That's exactly the case. The console (via ctrl+alt+2) is active and
> > > > > > responding to all commands normally, but the screen (ctrl+alt+1) is
> > > > > > frozen and I can't interact with it at all.
> > > > > 
> > > > > Are you driving this via libvirt or using qemu monitor directly?
> > > > > If the latter, can you please get an 'info migrate' from the source
> > > > > and an 'info status' from the destination at the end of migrate.
> > > > 
> > > > I'm using qemu command line directly. And I got the problem :) See
> > > > below.
> > > > 
> > > > > 
> > > > > > > Are the host clocks on the two hosts very close (there are lots of
> > > > > > > weird corner cases with mismatched clocks) - same time zone?
> > > > > > 
> > > > > > Yep. Both machines are in the same room and have the clock sync'ed.
> > > > > 
> > > > > OK, good.
> > > > > 
> > > > > > > 
> > > > > > > Are you using cache=none (given that it's NFS shared)
> > > > > > 
> > > > > > > I wasn't. But I tried again with cache=none and I got exactly
> > > > > > > the same thing.
> > > > > 
> > > > > OK, and this pair of machines, have you tried both directions - i.e.
> > > > > going a->b and b->a - do both directions fail?
> > > > > > Is the NFS server one of the two machines?  If it is, and you're
> > > > > > using libvirt, make sure that the directory the disks are on is
> > > > > > an NFS mount on both machines; e.g. don't migrate directly from
> > > > > > the NFS export.
> > > > > 
> > > > > > > Also, I tried with the stable-2.2 branch and got the same
> > > > > > > behavior. I really think it's very unlikely that such an
> > > > > > > important feature would be broken upstream or on a stable-
> > > > > > > branch. Most probably I have something wrong in my
> > > > > > > environment.
> > > > > 
> > > > > Yes, the challenge is to find what; and if it's something common
> > > > > we should try and find a way of spotting it.
> > > > > 
> > > > > > > Anyway, I'll keep testing different stable- branches until I find
> > > > > > something that works for me. I'll keep the mailing list posted.
> > > > > 
> > > > > Could you share the qemu command line so we can see if we can
> > > > > spot anything?
> > > > 
> > > > Got the problem! I tried to simplify my qemu command line to the
> > > > smallest possible, excluding things I thought could cause the issue.
> > > > Without further ado, this is the argument:
> > > > 
> > > >     -cpu 'Opteron_G4'
> > > > 
> > > > Without this argument everything works as it should, console responsive
> > > > and guest active :)
> > > 
> > > Can you show cat /proc/cpuinfo off the two hosts?
> > > (Only one CPU, but please include the whole entry)
> > 
> > Intel host:
> >     processor   : 7
> >     vendor_id   : GenuineIntel
> >     cpu family  : 6
> >     model       : 60
> >     model name  : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> >     stepping    : 3
> >     microcode   : 0x1c
> >     cpu MHz     : 883.468
> >     cache size  : 8192 KB
> >     physical id : 0
> >     siblings    : 8
> >     core id     : 3
> >     cpu cores   : 4
> >     apicid      : 7
> >     initial apicid  : 7
> >     fpu     : yes
> >     fpu_exception   : yes
> >     cpuid level : 13
> >     wp      : yes
> >     flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
> >     bugs        :
> >     bogomips    : 6784.87
> >     clflush size    : 64
> >     cache_alignment : 64
> >     address sizes   : 39 bits physical, 48 bits virtual
> >     power management:
> > 
> > AMD host:
> >     processor   : 5
> >     vendor_id   : AuthenticAMD
> >     cpu family  : 16
> >     model       : 10
> >     model name  : AMD Phenom(tm) II X6 1075T Processor
> >     stepping    : 0
> >     microcode   : 0x10000bf
> >     cpu MHz     : 800.000
> >     cache size  : 512 KB
> >     physical id : 0
> >     siblings    : 6
> >     core id     : 5
> >     cpu cores   : 6
> >     apicid      : 5
> >     initial apicid  : 5
> >     fpu     : yes
> >     fpu_exception   : yes
> >     cpuid level : 6
> >     wp      : yes
> >     flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
> >     bogomips    : 6027.25
> >     TLB size    : 1024 4K pages
> >     clflush size    : 64
> >     cache_alignment : 64
> >     address sizes   : 48 bits physical, 48 bits virtual
> >     power management: ts ttp tm stc 100mhzsteps hwpstate cpb
> 
> OK, very different CPUs.  My guess is that one or both of them don't support
> some feature of the Opteron_G4.  When specifying -cpu it's often best
> to use the enforce option.
> 
> What happens if you try:
> 
> qemu-system-x86_64 -machine pc,accel=kvm -cpu Opteron_G4,enforce=on -nographic

This is the script I'm using right now on both hosts:

    address@hidden ~ # cat startvm.sh 
    #!/bin/bash
    # Start the test guest; $1 is the path to the disk image.
    
    /home/otubo/develop/qemu/github/x86_64-softmmu/qemu-system-x86_64 \
        -machine pc,accel=kvm -cpu Opteron_G4,enforce=on \
        -name 'virt-tests-vm1' \
        -sandbox off \
        -display sdl \
        -drive id=drive_image1,cache=none,if=none,file="$1" \
        -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=04 \
        -device virtio-net-pci,mac=9a:22:23:24:25:26,id=idqE7Ggl,vectors=4,netdev=idjYAneH,bus=pci.0,addr=05 \
        -netdev user,id=idjYAneH,hostfwd=tcp::5001-:22 \
        -m 2G,slots=32,maxmem=10G \
        -smp 2,maxcpus=10,cores=1,threads=1,sockets=2 \
        -boot order=cdn,once=c,menu=off \
        -enable-kvm

> 
> on both hosts?

The output follows,
Intel host:

    address@hidden ~ # ./startvm.sh /media/virt_images/pb-debian-7-server-latest.qcow2
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.sse4a [bit 6]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.misalignsse [bit 7]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.3dnowprefetch [bit 8]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.xop [bit 11]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.fma4 [bit 16]
    qemu-system-x86_64: Host doesn't support requested features


AMD host:

    address@hidden [2015-07-29 14:41:40] ~ # ./startvm-incoming.sh /media/virt_images/pb-debian-7-server-latest.qcow2
    warning: host doesn't support requested feature: CPUID.01H:ECX.pclmulqdq|pclmuldq [bit 1]
    warning: host doesn't support requested feature: CPUID.01H:ECX.ssse3 [bit 9]
    warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.1|sse4_1 [bit 19]
    warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.2|sse4_2 [bit 20]
    warning: host doesn't support requested feature: CPUID.01H:ECX.aes [bit 25]
    warning: host doesn't support requested feature: CPUID.01H:ECX.xsave [bit 26]
    warning: host doesn't support requested feature: CPUID.01H:ECX.avx [bit 28]
    warning: host doesn't support requested feature: CPUID.80000001H:EDX.rdtscp [bit 27]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.xop [bit 11]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.fma4 [bit 16]
    qemu-system-x86_64: Host doesn't support requested features
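
To repeat this check for other CPU models without reading the warnings by
hand, something like the following might work. It is an untested sketch: it
assumes coreutils `timeout` is available, that QEMU exits right away when
enforce=on rejects a model (as in the output above), and that it keeps
running until `timeout` kills it when the model is accepted. The
`cpu_model_ok` name and the `QEMU` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: path to the QEMU binary, overridable by the caller.
QEMU=${QEMU:-qemu-system-x86_64}

# Succeeds if this host accepts the given CPU model with enforce=on.
# Assumption: a rejected model makes QEMU exit immediately with an error,
# while an accepted model leaves a paused guest (-S) running until
# timeout kills it, in which case timeout itself exits with status 124.
cpu_model_ok() {
    timeout 2 "$QEMU" -machine pc,accel=kvm -cpu "$1,enforce=on" \
        -nodefaults -display none -S >/dev/null 2>&1
    [ $? -eq 124 ]
}
```

Running `cpu_model_ok Opteron_G4 && echo ok` on each host would then show
whether a model is usable on both ends before attempting a migration.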

> You need to pick a CPU option that works with that on both of the hosts.
> 

So you think it's just a matter of fine-tuning which CPU option is best for
live migration on each platform? Or should it be handled inside Qemu itself?
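
For reference, the feature flags the two hosts have in common can be listed
from saved copies of /proc/cpuinfo. This is only a rough sketch: the
`cpu_flags` and `common_flags` names are made up for illustration, and the
kernel's flag names do not always match QEMU's feature names, so the result
is a starting point for choosing a model rather than an answer.

```shell
#!/bin/sh
# Print the flags of the first CPU entry in a cpuinfo dump, one per line,
# sorted. Hypothetical helper; $1 is a saved copy of /proc/cpuinfo.
cpu_flags() {
    awk -F: '/^[[:space:]]*flags[[:space:]]*:/ { print $2; exit }' "$1" |
        tr -s ' \t' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}

# Print the flags present in both cpuinfo dumps given as $1 and $2.
common_flags() {
    a=$(mktemp) b=$(mktemp)
    cpu_flags "$1" > "$a"
    cpu_flags "$2" > "$b"
    LC_ALL=C comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

With the two dumps saved as, say, intel.txt and amd.txt, calling
`common_flags intel.txt amd.txt` prints the shared flags one per line.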

Regards,

-- 
Eduardo Otubo
ProfitBricks GmbH
