From: | George Kennedy |
Subject: | Re: [Qemu-devel] [PATCH v2] lsi: Reselection needed to remove pending commands from queue |
Date: | Tue, 23 Oct 2018 17:36:43 -0400 |
User-agent: | Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 |
On 10/23/2018 10:33 AM, Paolo Bonzini wrote:
On 22/10/2018 23:28, George Kennedy wrote:
As you suggested, I moved the loading of "s->resel_dsp" down to the "Wait Reselect" case. The address of the Reselection Scripts, though, is contained in "s->dsp - 8" and not in s->dnad.
Are you sure? s->dsp - 8 should be the address of the Wait Reselect instruction itself. But you're right that s->dnad is the address at which to jump "if the LSI53C895A is selected before being reselected" (as the spec puts it), so the reselection DSP should be just s->dsp.
See the first 25 or so lines of lsi_execute_script(), where dsp is incremented by 8 ("s->dsp += 8"); it therefore needs to be adjusted back to its previous value.
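The DSP arithmetic being discussed can be illustrated with a toy fetch step. This is a minimal sketch, not QEMU's code: the struct and function names are hypothetical, memory is a plain array of 32-bit words, and it assumes that each instruction is two 32-bit words (8 bytes), as for SCRIPTS I/O instructions such as Wait Reselect. Once the fetch has advanced dsp, the instruction currently executing sits at dsp - 8, while dsp itself already points at the following instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of a SCRIPTS fetch; names are illustrative only.
 * Assumes each instruction is two 32-bit words (8 bytes), as for I/O
 * instructions such as Wait Reselect. */
struct toy_scripts {
    uint32_t dsp;  /* byte address of the next instruction to fetch */
    uint32_t dcmd; /* first word of the fetched instruction */
    uint32_t dnad; /* second word: the address operand */
};

/* Fetch the instruction at dsp from mem (indexed by byte address / 4) and
 * advance dsp past it, mirroring the "s->dsp += 8" step. Returns the byte
 * address of the instruction that was just fetched. */
static uint32_t toy_fetch(struct toy_scripts *s, const uint32_t *mem)
{
    assert(s->dsp % 4 == 0); /* instructions are word-aligned */
    uint32_t insn_addr = s->dsp;
    s->dcmd = mem[insn_addr / 4];
    s->dnad = mem[insn_addr / 4 + 1];
    s->dsp += 8;
    return insn_addr;
}

/* While the fetched instruction executes, its own address is dsp - 8;
 * dsp already points at the instruction that follows it. */
static uint32_t toy_current_insn_addr(const struct toy_scripts *s)
{
    return s->dsp - 8;
}
```

For example, starting from dsp = 0, one call to toy_fetch returns 0 and leaves dsp at 8, so toy_current_insn_addr reports 0, the address of the instruction being executed.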
Agree that it should happen as you describe, but under heavy IO (fio), it does not.
The reason the timeout is needed is that under heavy IO some pending commands stay on the pending queue longer than the 30-second command timeout set by the Linux upper-layer SCSI driver (sym53c8xx). When command timeouts occur, the upper-layer SCSI driver sends SCSI Abort messages to remove the timed-out commands. The timeouts happen because, under heavy IO, lsi_reselect() in QEMU's "hw/scsi/lsi53c895a.c" is not called before the upper-layer driver's 30-second command timeout expires. If lsi_reselect() were called more frequently, the command timeout problem would probably not occur. There are a number of places where lsi_reselect() is supposed to get called (e.g. at the end of lsi_update_irq()), but the only place I have observed lsi_reselect() being called is from lsi_execute_script(), when lsi_wait_reselect() is called because of a SCRIPTS "Wait Reselect" IO instruction.
Reselection should only happen when the target needs access to the bus, which is when I/O has finished. There should be no need for such a deadline; reselection should already be happening at the right time when lsi_transfer_data calls lsi_queue_req, which in turn calls lsi_reselect.
When it works as expected, the check for "s->waiting == 1" (a Wait Reselect instruction has been issued) in lsi_transfer_data() is true. Under heavy IO, s->waiting is not 1 for an extended period of time, so lsi_queue_req() does not get called; because lsi_reselect() is then never reached, any pending commands are left "stuck" on the queue.
Executing a SCRIPTS Wait Reselect instruction is the only path that calls lsi_wait_reselect() and the only place where "s->waiting = 1" is set. So the delay before the SCRIPTS issue a Wait Reselect instruction is the root cause of the problem.
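The queue-or-reselect decision described above can be modelled in a few lines. The sketch below is a hypothetical simplification, not the lsi53c895a.c logic: the names are invented, it keeps only the s->waiting condition and ignores any other conditions the real code may check, and it assumes a small fixed-size FIFO of request tags. It shows why a request whose data becomes ready while waiting is not 1 stays queued until the script next executes Wait Reselect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_REQ 8

/* Hypothetical toy model of the HBA state; names are illustrative only. */
struct toy_hba {
    int waiting;                   /* 1 only while a Wait Reselect is outstanding */
    unsigned pending[TOY_MAX_REQ]; /* FIFO of tags whose data is ready */
    size_t npending;
    unsigned reselected;           /* tag of the last reselected request, 0 = none */
};

static void toy_reselect(struct toy_hba *h, unsigned tag)
{
    h->waiting = 0; /* in this model, reselection ends the wait */
    h->reselected = tag;
}

/* Data for request `tag` is ready: reselect at once if the script is
 * waiting, otherwise leave the request on the pending queue.
 * Returns true if the request was reselected immediately. */
static bool toy_transfer_data(struct toy_hba *h, unsigned tag)
{
    if (h->waiting == 1) {
        toy_reselect(h, tag);
        return true;
    }
    assert(h->npending < TOY_MAX_REQ);
    h->pending[h->npending++] = tag;
    return false;
}

/* Script executes Wait Reselect: the only place waiting becomes 1.
 * If a request is already pending, the oldest one is reselected instead. */
static void toy_wait_reselect(struct toy_hba *h)
{
    if (h->npending > 0) {
        unsigned tag = h->pending[0];
        for (size_t i = 1; i < h->npending; i++) /* pop the FIFO head */
            h->pending[i - 1] = h->pending[i];
        h->npending--;
        toy_reselect(h, tag);
        return;
    }
    h->waiting = 1;
}
```

In this model, requests that complete while the script is busy only leave the queue one at a time, each on a later toy_wait_reselect call; if the script rarely reaches Wait Reselect, they sit there, which is the behaviour described in the thread.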
The check in lsi_transfer_data() that decides whether to call lsi_queue_req() is probably the preferred place for a fix, but I have not been able to come up with a fix there that does not run into problems with the SCRIPTS state.
Maybe many of the places that call lsi_irq_on_rsl(s) also need to check s->want_resel?
I've added debug output to all the places where lsi_reselect() should be called, but under heavy IO lsi_reselect() is not called for a period exceeding the upper layer's 30-second command timeout, hence the need for the patch, which injects a SCRIPTS Wait Reselect IO instruction.
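For illustration only, the deadline idea (which Paolo argues should be unnecessary) amounts to noticing that a pending request has waited too long for reselection. The sketch below is not the patch: the function name, the millisecond timestamps and the 10-second threshold are all hypothetical. The only figure taken from the thread is the 30-second sym53c8xx command timeout, which any such threshold must stay well below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_CMD_TIMEOUT_MS  30000u /* sym53c8xx command timeout from the thread */
#define TOY_RESEL_DEADLINE_MS 10000u /* hypothetical threshold */

static_assert(TOY_RESEL_DEADLINE_MS < GUEST_CMD_TIMEOUT_MS,
              "force reselection before the guest driver times out");

/* queued_ms[i] is when pending request i was queued (hypothetical field).
 * Returns true if any request has waited at least TOY_RESEL_DEADLINE_MS,
 * i.e. the emulation should stop relying on the script's own Wait
 * Reselect and arrange for reselection itself. */
static bool toy_pending_overdue(const uint64_t *queued_ms, size_t n,
                                uint64_t now_ms)
{
    for (size_t i = 0; i < n; i++) {
        if (now_ms >= queued_ms[i] &&
            now_ms - queued_ms[i] >= TOY_RESEL_DEADLINE_MS)
            return true;
    }
    return false;
}
```

With requests queued at 1000, 4000 and 9000 ms, the check first reports an overdue request at 11000 ms, when the oldest one has waited exactly 10 seconds.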
My test setup consists of five remote iSCSI disks. Here are the fio write job arguments that reproduce the problem:
[global]
bs=256k
iodepth=2
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
numjobs=40
name=test
rw=write
[job1]
filename=/dev/sda
filename=/dev/sdb
filename=/dev/sdc
filename=/dev/sdd
filename=/dev/sde
I am not strongly attached to my proposed fix. If an alternative fix can be suggested, I'd be more than willing to try it.
Thank you, George
Paolo