Re: [Qemu-devel] COLO: how to flip a secondary to a primary?
From: Wen Congyang
Subject: Re: [Qemu-devel] COLO: how to flip a secondary to a primary?
Date: Tue, 26 Jan 2016 09:06:53 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

On 01/26/2016 02:59 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (address@hidden) wrote:
>> On 01/23/2016 03:35 AM, Dr. David Alan Gilbert wrote:
>>> Hi,
>>> I've been looking at what's needed to add a new secondary after
>>> a primary failed; from the block side it doesn't look as hard
>>> as I'd expected, perhaps you can tell me if I'm missing something!
>>>
>>> The normal primary setup is:
>>>
>>> quorum
>>> Real disk
>>> nbd client
>>
>> quorum
>> real disk
>> replication
>> nbd client
>>
>>>
>>> The normal secondary setup is:
>>> replication
>>> active-disk
>>> hidden-disk
>>> Real-disk
>>
>> IIRC, we can do it like this:
>> quorum
>> replication
>> active-disk
>> hidden-disk
>> real-disk
>
> Yes.
>
>>> With a couple of minor code hacks; I changed the secondary to be:
>>>
>>> quorum
>>> replication
>>> active-disk
>>> hidden-disk
>>> Real-disk
>>> dummy-disk
>>
>> after failover,
>> quorum
>> replication(old, mode is secondary)
>> active-disk
>> hidden-disk*
>> real-disk*
>> replication(new, mode is primary)
>> nbd-client
>
> Do you need to keep the old secondary-replication?
> Does that just pass straight through?
Yes, the old secondary-replication still works in the newest version.
For example, if we don't start COLO again after failover, we do nothing.
>
>> In the newest version, we active commit active-disk to real-disk.
>> So it will be:
>> quorum
>> replication(old, mode is secondary)
>> active-disk(it is real disk now)
>> replication(new, mode is primary)
>> nbd-client
>
> How does that active-commit work? I didn't think you
> could change the real disk until you had the full checkpoint,
> since you don't know whether the primary or secondaries
> changes need to be written?
I start the active commit when doing failover. After failover, the
primary's changes made after the last checkpoint should be dropped (how do
we cancel the in-progress write ops?).
>
>>> and then after the primary fails, I start a new secondary
>>> on another host and then on the old secondary do:
>>>
>>> nbd_server_stop
>>> stop
>>> x_block_change top-quorum -d children.0 # deletes use of real disk, leaves dummy
>>> drive_del active-disk0
>>> x_block_change top-quorum -a node-real-disk
>>> x_block_change top-quorum -d children.1 # Seems to have deleted the dummy?!, the disk is now child 0
>>> drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
>>> x_block_change top-quorum -a nbd-client
>>> c
>>> migrate_set_capability x-colo on
>>> migrate -d -b tcp:ibpair:8888
>>>
>>> and I think that means what was the secondary, has the same disk
>>> structure as a normal primary.
>>> That's not quite happy yet, and I've not figured out why - but the
>>> order/structure of the block devices looks right?
>>>
>>> Notes:
>>> a) The dummy serves two purposes, 1) it works around the segfault
>>> I reported in the other mail, 2) when I delete the real disk in the
>>> first x_block_change it means the quorum still has 1 disk so doesn't
>>> get upset.
>>
>> I don't understand the purpose 2.
>
> quorum won't allow you to delete all its members ('The number of children
> cannot be lower than the vote threshold 1')
> and it's very tricky getting the order correct with add/delete; for example
> I tried:
>
> drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
> # gets children.1
> x_block_change top-quorum -a nbd-client
> # deletes the secondary replication
> x_block_change top-quorum -d children.0
> drive_del active-disk0
The active-disk0 contains some data, and you should not delete it.
If we do active-commit after failover, the active-disk0 is the real disk.
> # ends up as children.0 but in the 2nd slot
> x_block_change top-quorum -a node-real-disk
>
> info block shows me:
> top-quorum (#block615): json:{"children": [
> {"driver": "replication", "mode": "primary", "file": {"port": "8889",
> "host": "ibpair", "driver": "nbd", "export": "colo-disk0"}},
> {"driver": "raw", "file": {"driver": "file", "filename":
> "/home/localvms/bugzilla.raw"}}
> ],
> "driver": "quorum", "blkverify": false, "rewrite-corrupted": false,
> "vote-threshold": 1} (quorum)
> Cache mode: writeback
>
> that has the replication first and the file second; that's the opposite
> from the normal primary startup - does it matter?
It is OK, but a read from children.0 will always fail, so the data will be
read from children.1.
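
One way to picture the fall-through described here is a first-success read over the children in list order. The sketch below is a toy model in Python, not QEMU code: the JSON is copied from the `info block` output quoted above, while `fifo_read` and `readable` are hypothetical helpers, and the rule that the primary-mode replication child cannot serve reads is an assumption taken from this reply.

```python
import json

# The quorum description as printed by 'info block' earlier in this mail.
QUORUM_JSON = """
{"children": [
  {"driver": "replication", "mode": "primary",
   "file": {"port": "8889", "host": "ibpair", "driver": "nbd",
            "export": "colo-disk0"}},
  {"driver": "raw",
   "file": {"driver": "file", "filename": "/home/localvms/bugzilla.raw"}}
 ],
 "driver": "quorum", "blkverify": false, "rewrite-corrupted": false,
 "vote-threshold": 1}
"""


def readable(child):
    # Assumption (from the reply above): reads through the primary-mode
    # replication child fail, every other child can serve the read.
    return child["driver"] != "replication"


def fifo_read(children, can_read):
    """Return (index, driver) of the first child, in list order, that can read."""
    for index, child in enumerate(children):
        if can_read(child):
            return index, child["driver"]
    raise IOError("no child could serve the read")


children = json.loads(QUORUM_JSON)["children"]
served_by = fifo_read(children, readable)
print(served_by)  # (1, 'raw')
```

With the replication child listed first, every read pays for one failed attempt before reaching the raw file, which is the practical cost of this ordering.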
>
> I can't add node-real-disk until I drive_del active-disk0 (which
> previously used it); and I can't drive_del until I remove
> it from the quorum; but I can't remove that from the quorum first,
> because that leaves an empty quorum.
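
The ordering problem can be reproduced with a small toy model. This is only a sketch in Python under stated assumptions: `ToyQuorum` is a hypothetical class that enforces nothing but the vote-threshold rule from the error message quoted above, and the child names are labels taken from this thread, not real block nodes.

```python
class ToyQuorum:
    """Toy model of the vote-threshold rule only; not QEMU code."""

    def __init__(self, children, vote_threshold=1):
        self.children = list(children)
        self.vote_threshold = vote_threshold

    def delete(self, child):
        # Refuse any removal that would drop below the vote threshold.
        if len(self.children) - 1 < self.vote_threshold:
            raise ValueError(
                "The number of children cannot be lower than "
                f"the vote threshold {self.vote_threshold}"
            )
        self.children.remove(child)

    def add(self, child):
        self.children.append(child)


def try_direct_swap():
    """Removing the only child first is rejected."""
    quorum = ToyQuorum(["secondary-replication"])
    try:
        quorum.delete("secondary-replication")
    except ValueError as err:
        return str(err)
    return None


def swap_with_placeholder():
    """A placeholder child keeps the quorum non-empty during the swap."""
    quorum = ToyQuorum(["secondary-replication", "dummy-disk"])
    quorum.delete("secondary-replication")
    quorum.add("node-real-disk")
    quorum.delete("dummy-disk")
    quorum.add("nbd-client")
    return quorum.children


print(try_direct_swap())
print(swap_with_placeholder())  # ['node-real-disk', 'nbd-client']
```

The second function mirrors the role of the dummy disk in the original command sequence: it exists only so that the real disk can be taken out and put back without the quorum ever being empty.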
>
>>> b) I had to remove the restriction in quorum_start_replication
>>> on which mode it would run in.
>>
>> IIRC, this check will be removed.
>>
>>> c) I'm not really sure everything knows it's in secondary mode yet, and
>>> I'm not convinced whether the replication is doing the right thing.
>>> d) The migrate -d -b eventually fails on the destination, not worked out
>>> why yet.
>>
>> Can you give me the error message?
>
> I need to repeat it to check; it was something like a bad flag from the block
> migration code; it happened after the block migration hit 100%.
IIRC, we found and fixed some block migration bugs. This may be a new one.
>
>>> e) Adding/deleting children on quorum is hard having to use the children.0/1
>>> notation when you've added children using node names - it's worrying
>>> which number is which; is there a way to give them a name?
>>
>> No. I think we can improve 'info block' output.
>
> Yes, that would be good; I thought it was the order in the list; but after
> debugging it today I'm not convinced it is; I think it always keeps the same
> name - so for example if you start off with [children.0, children.1]; then
> delete children.0 you now have [children.1]; if you then add a new
> child I *think* that becomes children.0 but you end up with
> [children.1,children.0]
Note that quorum's fifo mode cares about this order. I think it is better to
read the older child first.
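
The naming behaviour guessed at above can be written down as a toy model. This is a sketch in Python of that guess only, not of what quorum is confirmed to do: `NamedChildren` is a hypothetical class in which a child keeps its `children.N` name for life and a newly added child takes the lowest free number but goes to the end of the list.

```python
class NamedChildren:
    """Toy model of the guessed children.N naming; not QEMU code."""

    def __init__(self, nodes):
        # Each slot is (name, node); list position is the read order.
        self.slots = [(f"children.{i}", node) for i, node in enumerate(nodes)]

    def names(self):
        return [name for name, _ in self.slots]

    def nodes(self):
        return [node for _, node in self.slots]

    def delete(self, name):
        self.slots = [(n, node) for n, node in self.slots if n != name]

    def add(self, node):
        # Assumption: reuse the lowest unused number, append at the end.
        used = set(self.names())
        index = 0
        while f"children.{index}" in used:
            index += 1
        name = f"children.{index}"
        self.slots.append((name, node))
        return name


quorum = NamedChildren(["old-child", "other-child"])
quorum.delete("children.0")
quorum.add("new-child")
print(quorum.names())  # ['children.1', 'children.0']
```

Under the same assumptions the model also matches the first command sequence in this thread: starting from the replication child and the dummy, deleting children.0, adding node-real-disk and then deleting children.1 removes the dummy and leaves the real disk as children.0.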
Thanks
Wen Congyang
>
>>> f) I've not thought about the colo-proxy that much yet - I guess that
>>> existing connections need to keep their sequence number offset but
>>> new connections made by what is now the primary don't need to do
>>> anything special.
>>
>> Hailiang or Zhijian can answer this question.
>
> Thanks,
>
>> Thanks
>> Wen Congyang
>>
>>>
>>> Dave
>>> --
>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK