* Wen Congyang (address@hidden) wrote:
On 04/22/2015 07:18 PM, Dr. David Alan Gilbert wrote:
* zhanghailiang (address@hidden) wrote:
Hi,
ping ...
I will get to look at this again; but not until after next week.
The main blocking bugs for COLO have been solved.
I've got the v3 set running, but the biggest problems I hit are with the
packet comparison module. I've seen a panic that I think is in
colo_send_checkpoint_req; I believe it is caused by using GFP_KERNEL to
allocate the netlink message, which can schedule there. I tried making
that GFP_ATOMIC, but I'm hitting other problems with:
Thanks for testing.
I guess the backtrace should look like:
1. colo_send_checkpoint_req()
2. colo_setup_checkpoint_by_id()
Because we hold the RCU read lock, we cannot use GFP_KERNEL to allocate memory.
See the backtrace below.
kcolo_thread, no conn, schedule out
Hmm, how can I reproduce it? In my tests I only focus on block replication,
and I don't use the network.
that I've not had time to look into yet.
So I only get about a 50% success rate of starting COLO.
I see there is stuff in the TODO of the colo-proxy that
seems to say the netlink stuff should change; maybe you're already fixing
that?
Do you mean you get about a 50% success rate if you use the network?
I always run with the network configured; but the 'kcolo_thread, no conn' bug
will hit very early; so I don't see any output on the primary or secondary
after the migrate -d is issued on the primary. On the primary in the dmesg
I see:
[ 736.607043] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 736.615268] kcolo_thread, no conn, schedule out, chk 0
[ 736.619442] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 736.718273] arp_tables: (C) 2002 David S. Miller
I've not had a chance to look further at that yet.
Here is the backtrace from the 1st bug.
Dave
(I'm on holiday next week; I probably won't respond to many mails)
[ 9087.833228] BUG: scheduling while atomic: swapper/1/0/0x10000100
[ 9087.833271] Modules linked in: ip6table_mangle ip6_tables xt_physdev
iptable_mangle xt_PMYCOLO(OF) nf_conntrack_ipv4 nf_defrag_ipv4 xt_mark
nf_conntrack_colo(OF) nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack
iptable_filter ip_tables arptable_filter arp_tables act_mirred cls_u32
sch_prio tun bridge stp llc sg kvm_intel kvm snd_hda_codec_generic cirrus
snd_hda_intel crct10dif_pclmul snd_hda_codec crct10dif_common snd_hwdep
syscopyarea snd_seq crc32_pclmul crc32c_intel sysfillrect ghash_clmulni_intel
snd_seq_device aesni_intel lrw sysimgblt gf128mul ttm drm_kms_helper snd_pcm
snd_page_alloc snd_timer snd soundcore glue_helper i2c_piix4 ablk_helper drm
cryptd virtio_console i2c_core virtio_balloon serio_raw mperf pcspkr nfsd
auth_rpcgss nfs_acl lockd uinput sunrpc xfs libcrc32c sr_mod cdrom ata_generic
[ 9087.833572] pata_acpi virtio_net virtio_blk ata_piix e1000 virtio_pci
libata virtio_ring floppy virtio dm_mirror dm_region_hash dm_log dm_mod
[last unloaded: ip_tables]
[ 9087.833616] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GF O-------------- 3.10.0-123.20.1.el7.dgilbertcolo.x86_64 #1
[ 9087.833623] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 9087.833630] ffff880813de8000 7b4d45d276068aee ffff88083fc23980 ffffffff815e2b0c
[ 9087.833640] ffff88083fc23990 ffffffff815dca9f ffff88083fc239f0 ffffffff815e827b
[ 9087.833648] ffff880813de9fd8 00000000000135c0 ffff880813de9fd8 00000000000135c0
[ 9087.833657] Call Trace:
[ 9087.833664] <IRQ> [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
[ 9087.833680] [<ffffffff815dca9f>] __schedule_bug+0x4d/0x5b
[ 9087.833688] [<ffffffff815e827b>] __schedule+0x78b/0x790
[ 9087.833699] [<ffffffff81094fb6>] __cond_resched+0x26/0x30
[ 9087.833707] [<ffffffff815e86aa>] _cond_resched+0x3a/0x50
[ 9087.833716] [<ffffffff81193908>] kmem_cache_alloc_node+0x38/0x200
[ 9087.833752] [<ffffffffa046b770>] ? nf_conntrack_find_get+0x30/0x40 [nf_conntrack]
[ 9087.833761] [<ffffffff814c115d>] ? __alloc_skb+0x5d/0x2d0
[ 9087.833768] [<ffffffff814c115d>] __alloc_skb+0x5d/0x2d0
[ 9087.833777] [<ffffffff814fb972>] ? netlink_lookup+0x32/0xf0
[ 9087.833786] [<ffffffff8153b7d0>] ? arp_req_set+0x270/0x270
[ 9087.833794] [<ffffffff814fbc3b>] netlink_alloc_skb+0x6b/0x1e0
[ 9087.833801] [<ffffffff8153b7d0>] ? arp_req_set+0x270/0x270
[ 9087.833816] [<ffffffffa04a462b>] colo_send_checkpoint_req+0x2b/0x80 [xt_PMYCOLO]
[ 9087.833823] [<ffffffff8153b7d0>] ? arp_req_set+0x270/0x270
[ 9087.833832] [<ffffffffa04a4dd9>] colo_slaver_arp_hook+0x79/0xa0 [xt_PMYCOLO]
[ 9087.833850] [<ffffffffa05fc02f>] ? arptable_filter_hook+0x2f/0x40 [arptable_filter]
[ 9087.833858] [<ffffffff81500c5a>] nf_iterate+0xaa/0xc0
[ 9087.833866] [<ffffffff8153b7d0>] ? arp_req_set+0x270/0x270
[ 9087.833874] [<ffffffff81500cf4>] nf_hook_slow+0x84/0x140
[ 9087.833882] [<ffffffff8153b7d0>] ? arp_req_set+0x270/0x270
[ 9087.833890] [<ffffffff8153bf60>] arp_rcv+0x120/0x160
[ 9087.833906] [<ffffffff814d0596>] __netif_receive_skb_core+0x676/0x870
[ 9087.833914] [<ffffffff814d07a8>] __netif_receive_skb+0x18/0x60
[ 9087.833922] [<ffffffff814d0830>] netif_receive_skb+0x40/0xd0
[ 9087.833930] [<ffffffff814d1290>] napi_gro_receive+0x80/0xb0
[ 9087.833959] [<ffffffffa00e34a0>] e1000_clean_rx_irq+0x2b0/0x580 [e1000]
[ 9087.833970] [<ffffffffa00e5985>] e1000_clean+0x265/0x8e0 [e1000]
[ 9087.833979] [<ffffffff8109506d>] ? ttwu_do_activate.constprop.85+0x5d/0x70
[ 9087.833988] [<ffffffff814d0bfa>] net_rx_action+0x15a/0x250
[ 9087.833997] [<ffffffff81067047>] __do_softirq+0xf7/0x290
[ 9087.834006] [<ffffffff815f4b5c>] call_softirq+0x1c/0x30
[ 9087.834011] [<ffffffff81014cf5>] do_softirq+0x55/0x90
[ 9087.834011] [<ffffffff810673e5>] irq_exit+0x115/0x120
[ 9087.834011] [<ffffffff815f5458>] do_IRQ+0x58/0xf0
[ 9087.834011] [<ffffffff815ea5ad>] common_interrupt+0x6d/0x6d
[ 9087.834011] <EOI> [<ffffffff81046346>] ? native_safe_halt+0x6/0x10
[ 9087.834011] [<ffffffff8101b39f>] default_idle+0x1f/0xc0
[ 9087.834011] [<ffffffff8101bc96>] arch_cpu_idle+0x26/0x30
[ 9087.834011] [<ffffffff810b47e5>] cpu_startup_entry+0xf5/0x290
[ 9087.834011] [<ffffffff815d0a6e>] start_secondary+0x1c4/0x1da
[ 9087.837189] ------------[ cut here ]------------
[ 9087.837189] kernel BUG at net/core/dev.c:4130!