qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH] blkdebug: fix "once" rule


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH] blkdebug: fix "once" rule
Date: Fri, 06 Feb 2015 08:22:08 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 2015-02-05 at 19:28, John Snow wrote:


On 02/05/2015 06:43 PM, Max Reitz wrote:
On 2015-02-05 at 18:15, John Snow wrote:
The blkdebug scripts are currently engineered so that when a debug event
occurs, a prefilter browses a master list of parsed rules for a certain
event and adds them to an "active list" of rules to be used for the
forthcoming action, provided the events and state numbers match.

Then, once the request is received, the last active rule is used to
inject
an error if certain parameters match.

This active list is cleared every time the prefilter injects a new rule
for the first time during a debug event.

The "once" rule currently causes the error injection, if it is triggered, to only clear the active list. This is insufficient for preventing future
injections of the same rule.

This patch /deletes/ the rule from the list that the prefilter browses,
so it is gone for good.

Lastly, this impacts iotests 026. Several ENOSPC tests that used "once"
can be seen to have output that shows multiple failure messages. After
this patch, the error messages tend to be smaller and less severe, but
the injection can still be seen to be working.

Patch the expected output to expect the smaller error messages.

Signed-off-by: John Snow <address@hidden>
---
  block/blkdebug.c           |  9 +++++----
  tests/qemu-iotests/026.out | 24 ++++--------------------
  2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 9ce35cd..d30c44c 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -473,14 +473,15 @@ static BlockAIOCB *inject_error(BlockDriverState
*bs,
      struct BlkdebugAIOCB *acb;
      QEMUBH *bh;
-    if (rule->options.inject.once) {
-        QSIMPLEQ_INIT(&s->active_rules);
-    }
-
      if (rule->options.inject.immediately) {
          return NULL;
      }
+    if (rule->options.inject.once) {
+        QSIMPLEQ_INIT(&s->active_rules);

This will still delete all other currently active rules, which I don't
think is what is intended.

Case in point why this is not really right:

$ ./qemu-img create -f qcow2 test.qcow2 64M
Formatting 'test.qcow2', fmt=qcow2 size=67108864 encryption=off
cluster_size=65536 lazy_refcounts=off
$ ./qemu-io -c 'aio_write 0 64k'
"json:{'driver':'qcow2','file':{'driver':'blkdebug','image':{'driver':'file','filename':'test.qcow2'},'inject-error':[{'event':'write_aio'},{'event':'write_aio','state':42,'once':true}],'set-state':[{'event':'cluster_alloc','new_state':42}]}}"

aio_write failed: Input/output error
Failed to flush the L2 table cache: Input/output error
Failed to flush the refcount block cache: Input/output error

So, here we get three errors; the non-once rule stayed active.

Now let's just turn the rules around:

$ ./qemu-img create -f qcow2 test.qcow2 64M
Formatting 'test.qcow2', fmt=qcow2 size=67108864 encryption=off
cluster_size=65536 lazy_refcounts=off
$ ./qemu-io -c 'aio_write 0 64k'
"json:{'driver':'qcow2','file':{'driver':'blkdebug','image':{'driver':'file','filename':'test.qcow2'},'inject-error':[{'event':'write_aio','state':42,'once':true},{'event':'write_aio'}],'set-state':[{'event':'cluster_alloc','new_state':42}]}}"

aio_write failed: Input/output error

Hmmm...

So what happens is that in the first case, the non-"once" rule had
precedence, so the list of active rules was not changed. In the second
case, however, the "once" rule had precedence, therefore all active
rules (including the non-"once" rule) were deleted.

I think it's fine that the "once" rule does not get deleted in the first
case, but whether the non-"once" rule stays active should not be
influenced by the "once" rule.

Note that this issue was not introduced by this patch, but I think this
patch should fix it. :-)

I was really trying to avoid touching how weird the blkdebug rules look and just make my own test happier, but you've forced my hand, villain!

Max

+        remove_rule(rule);
+    }
+


OK, this gets sort of weird, because we need to think about the lifetimes of certain errors, and what we actually want to have happen.

Here's what your first script does:

blkdebug_debug_event -- [reftable_load]
blkdebug_debug_event -- [l2_load]
blkdebug_debug_event -- [cluster_alloc]
Setting new_state: 42
blkdebug_debug_event -- [refblock_alloc]
blkdebug_debug_event -- [refblock_load]
blkdebug_debug_event -- [write_aio]
Clearing active_rules ...
Adding a rule for [write_aio]
Adding a rule for [write_aio]
blkdebug_debug_event -- [pwritev]
An error is being injected for aio_writev.
blkdebug_debug_event -- [pwritev_done]
aio_write failed: Input/output error
blkdebug_debug_event -- [flush_to_os]
An error is being injected for aio_flush.
An error is being injected for aio_flush.
blkdebug_debug_event -- [refblock_update_part]
blkdebug_debug_event -- [pwritev]
An error is being injected for aio_writev.
blkdebug_debug_event -- [pwritev_done]
Failed to flush the L2 table cache: Input/output error
Failed to flush the refcount block cache: Input/output error
An error is being injected for aio_flush.


The prefilter clears active rules when the write_aio debug event comes in, and adds the two write_aio rules.

It then injects an error after the pwritev event.
We then try to flush twice, but we still have that rule in our active filters, so maddeningly this and all future calls to any of readv, writev or flush will fail.

That's bonkers.


The second script does this:

blkdebug_debug_event -- [reftable_load]
blkdebug_debug_event -- [l2_load]
blkdebug_debug_event -- [cluster_alloc]
Setting new_state: 42
blkdebug_debug_event -- [refblock_alloc]
blkdebug_debug_event -- [refblock_load]
blkdebug_debug_event -- [write_aio]
Clearing active_rules ...
Adding a rule for [write_aio]
Adding a rule for [write_aio]
blkdebug_debug_event -- [pwritev]
An error is being injected for aio_writev.
run_once rule hit; clearing active list and run_once rule.
blkdebug_debug_event -- [pwritev_done]
aio_write failed: Input/output error
blkdebug_debug_event -- [flush_to_os]
blkdebug_debug_event -- [refblock_update_part]
blkdebug_debug_event -- [pwritev]
blkdebug_debug_event -- [pwritev_done]
blkdebug_debug_event -- [flush_to_disk]

Again, we clear the active rules only when a new injection is found after a fresh debug event. This time, though, on the first pwritev, we use the "once" error and delete both it and its perennial cousin from the active list.

Then, no errors are encountered for the rest of the run because we actually don't hit another write_aio event for the rest of execution.

So both outputs are behaving kind of poorly.




I think what we need is:

(1) To only delete the error actually injected from the active_rules list if the once flag was set, and

Right.

(2) To decide when a sane point to clear the active list is, because otherwise, the active_rules which gets populated for e.g. aio_write is going to spill over into aio_flush if nothing stops it.

I don't think we need to clear it. If "once" isn't set, the rule just stays active after the event. If you want a rule to be deactivated at a certain point in time, you set its "state" to 1 and then add a set-state entry for transitioning to a different state when you want it to be deactivated.

The idea was, as far as I know, that once a non-"once" rule is becoming active, all subsequent requests fail (like if your drive just crashed).

The only current "clear active" list point is when a new event comes in for which we have an injection rule for. Otherwise they just persist.

Hm, you're right. That opposes my previous statement. I have no idea why the rules get deleted if a new event comes in (and only if there are rules for that event!).

I can only imagine that the assumption is that you only want to deal with events that you defined errors for. If there are other events (imagine a new event is added to the qcow2 code), you just don't want to deal with them at all which is why the rules remain active in that case; but since you obviously do know about the events you have errors for, it's fine to scratch the list of active rules if they occur.

So, since I have no idea how this is meant to be, I'd just leave it as it is now, that is, let the active rules spill over into "unknown" events (events without any error specified for them); and just do (1).

Max

I could clear all active requests on every event, but a lot of tests that use "write_aio" to perhaps catch things that happen under "pwritev" would begin to fail; but perhaps that's more correct/accurate anyway.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]