From: wang Tiger
Subject: Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
Date: Fri, 23 Jul 2010 11:29:45 +0800

On Thu, Jul 22, 2010 at 11:47 PM, Stefan Hajnoczi <address@hidden> wrote:
> 2010/7/22 wang Tiger <address@hidden>:
>> On Thu, Jul 22, 2010 at 9:00 PM, Jan Kiszka <address@hidden> wrote:
>>> Stefan Hajnoczi wrote:
>>>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <address@hidden> wrote:
>>>>> On 2010-7-22, at 1:04 AM, Stefan Weil wrote:
>>>>>
>>>>>> On 21.07.2010 09:03, Chen Yufei wrote:
>>>>>>> On 2010-7-21, at 5:43 AM, Blue Swirl wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<address@hidden>  wrote:
>>>>>>>>
>>>>>>>>> We are pleased to announce COREMU, a "multicore-on-multicore" 
>>>>>>>>> full-system emulator built on QEMU. (Simply put, we made QEMU 
>>>>>>>>> parallel.)
>>>>>>>>>
>>>>>>>>> The project web page is located at:
>>>>>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>>>>>
>>>>>>>>> You can also download the source code and disk images to try 
>>>>>>>>> out on SourceForge:
>>>>>>>>> http://sf.net/p/coremu
>>>>>>>>>
>>>>>>>>> COREMU is composed of
>>>>>>>>> 1. a parallel emulation library
>>>>>>>>> 2. a set of patches to QEMU 
>>>>>>>>> (We worked on the master branch, commit 
>>>>>>>>> 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>>>>>
>>>>>>>>> It currently supports full-system emulation of x64 and ARM MPcore 
>>>>>>>>> platforms.
>>>>>>>>>
>>>>>>>>> By leveraging the underlying multicore resources, it can emulate up 
>>>>>>>>> to 255 cores running commodity operating systems (even on a 4-core 
>>>>>>>>> machine).
>>>>>>>>>
>>>>>>>>> Enjoy,
>>>>>>>>>
>>>>>>>> Nice work. Do you plan to submit the improvements back to upstream 
>>>>>>>> QEMU?
>>>>>>>>
>>>>>>> It would be great if we could submit our code to QEMU, but we do 
>>>>>>> not know the process.
>>>>>>> Could you please give us some instructions?
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Chen Yufei
>>>>>>>
>>>>>> Some hints can be found here:
>>>>>> http://wiki.qemu.org/Contribute/StartHere
>>>>>>
>>>>>> Kind regards,
>>>>>> Stefan Weil
>>>>> The patch is attached; it was produced with the command
>>>>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>>>
>>>>> In order to separate out what needs to be done to make QEMU 
>>>>> parallel, we created a separate library, and the patched QEMU needs 
>>>>> to be compiled and linked against that library. To submit our 
>>>>> enhancement to QEMU, we may need to incorporate this library into 
>>>>> QEMU. I don't know what the best solution would be.
>>>>>
>>>>> Our approach to making QEMU parallel is described at 
>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>
>>>>> I will give a short summary here:
>>>>>
>>>>> 1. Each emulated core thread runs a separate binary translation 
>>>>> engine and has a private code cache. We marked some variables in 
>>>>> TCG as thread-local. We also modified the TB invalidation mechanism.
>>>>>
>>>>> 2. Each core has a queue holding pending interrupts. The COREMU 
>>>>> library provides this queue, and interrupt notification is done by 
>>>>> sending real-time signals to the emulated core thread.
>>>>>
>>>>> 3. Atomic instruction emulation has to be modified for parallel 
>>>>> emulation. We use a lightweight memory transaction, which requires 
>>>>> only a compare-and-swap instruction, to emulate atomic instructions 
>>>>> (see the sketch below).
>>>>>
>>>>> 4. Some code in the original QEMU may cause data race bugs once 
>>>>> made parallel. We fixed these problems.
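>>>>>
>>>>> For illustration, here is roughly how point 3 emulates a guest 
>>>>> atomic increment using only compare-and-swap (a simplified sketch; 
>>>>> the function name and the direct host-pointer access are 
>>>>> illustrative, not our exact code):
>>>>>
>>>>> #include <stdint.h>
>>>>>
>>>>> static void emulate_atomic_incl(uint32_t *host_addr)
>>>>> {
>>>>>     uint32_t old, new;
>>>>>     do {
>>>>>         old = *(volatile uint32_t *)host_addr; /* current value  */
>>>>>         new = old + 1;                         /* desired update */
>>>>>         /* retry if another core changed the word in the meantime */
>>>>>     } while (__sync_val_compare_and_swap(host_addr, old, new) != old);
>>>>> }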
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Chen Yufei
>>>>
>>>> Looking at the patch, it seems there is a global lock for hardware
>>>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>>>> tried running, and do you have lock contention data for cm_hw_lock?
>>
>> The global lock for hardware access is only used for the ARM target
>> in our implementation, mainly because we are not very familiar with
>> ARM. Up to 4 ARM cores (the Cortex-A9 limit) can be emulated this way.
>> For the x86_64 target, we have already made hardware emulation support
>> concurrent access. We can emulate 255 cores on a quad-core machine.
>>
>>>
>>> BTW, this kind of lock is called qemu_global_mutex in QEMU; there it
>>> is a sleeping lock, which is likely better for the code paths it
>>> protects in upstream. Are those paths shorter in COREMU?
>>>
>>>> Have you thought about making hardware emulation concurrent?
>>>>
>>>> These are issues that qemu-kvm faces today since it executes vcpu
>>>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>>>> benefit from a solution for concurrent hardware access.
>>
>> In our implementation for the x86_64 target, all devices except the
>> LAPIC are emulated in a separate thread; VCPUs are emulated in other
>> threads (one thread per VCPU).
>> From observing some device drivers in Linux, we formed the hypothesis
>> that OS drivers already ensure correct synchronization of concurrent
>> hardware accesses.
>
> This hypothesis is too optimistic.  If hardware emulation code assumes
> it is only executed in a single-threaded fashion, but guests can
> execute it in parallel, then this opens up the possibility of race
> conditions that malicious guests can exploit.  There needs to be
> isolation: a guest should not be able to cause QEMU to crash.

In our prototype, we assume the guest behaves correctly. If the
hardware emulation code ensures atomic access (i.e. behaves like real
hardware), VCPUs can access devices freely. We actually refined some
hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
hardware accesses.
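
As a rough illustration of what "ensuring atomicity" means here (the
register layout is simplified and the helper is illustrative, not the
exact COREMU patch), flag updates on shared device state become single
atomic read-modify-write operations:

#include <stdint.h>

#define BM_STATUS_DMAING 0x01
#define BM_STATUS_INT    0x04

/* Each builtin below is one atomic read-modify-write, so a VCPU thread
 * reading the status register concurrently never sees a torn update. */
static void bmdma_set_done(uint32_t *status)
{
    __sync_and_and_fetch(status, ~(uint32_t)BM_STATUS_DMAING);
    __sync_or_and_fetch(status, BM_STATUS_INT);
}

(In real code one would fold both flag changes into a single
compare-and-swap so that no intermediate state is ever visible.)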

>
> If you have one hardware thread that handles all device emulation and
> vcpu threads do no hardware emulation, then all hardware emulation is
> serialized anyway.  Does this describe COREMU's model?

In our previous implementation, VCPU threads did no hardware emulation.
When a VCPU read or wrote an I/O port, it put the port address and
value into a lock-free queue, and the hardware thread polled the queue
to serve the request. When the hardware issued an interrupt, the
hardware thread likewise put the IRQ information into a per-VCPU
lock-free queue; the VCPU then set its LAPIC and served the interrupt
request.
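
To be concrete, the queue is a single-producer/single-consumer
lock-free ring, roughly like the following sketch (the names and layout
are illustrative, not our actual API):

#include <stdint.h>
#include <stdbool.h>

#define IOQ_SIZE 256                  /* must be a power of two */

struct ioport_req {
    uint16_t port;                    /* I/O port address */
    uint32_t value;                   /* value written, or read result */
    bool     is_write;
};

struct ioq {
    volatile unsigned head;           /* advanced only by the consumer */
    volatile unsigned tail;           /* advanced only by the producer */
    struct ioport_req slot[IOQ_SIZE];
};

/* VCPU thread (single producer): false means the ring is full. */
static bool ioq_put(struct ioq *q, struct ioport_req r)
{
    if (q->tail - q->head == IOQ_SIZE)
        return false;
    q->slot[q->tail & (IOQ_SIZE - 1)] = r;
    __sync_synchronize();             /* publish slot before the index */
    q->tail++;
    return true;
}

/* Hardware thread (single consumer): false means the ring is empty. */
static bool ioq_get(struct ioq *q, struct ioport_req *out)
{
    if (q->head == q->tail)
        return false;
    *out = q->slot[q->head & (IOQ_SIZE - 1)];
    __sync_synchronize();             /* finish the read before reuse */
    q->head++;
    return true;
}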

For performance reasons, we abandoned this approach: a VCPU is now
allowed to modify hardware state directly. The hardware code only needs
slight modification, and misbehavior of the guest OS can still be
easily detected.
>
>> For example, when emulating IDE with bus-master DMA:
>> 1. Two VCPUs will not send disk read/write requests at the same time.
>> 2. A new DMA request will not be sent until the previous one has
>> completed.
>> These two points guarantee that the emulated IDE with DMA can be
>> accessed concurrently by the VCPU threads and the hardware thread
>> with no additional locks.
>>
>> The only work we need to do is to fix some misbehaving emulated
>> devices in current QEMU.
>> For example, in QEMU's function ide_write_dma_cb:
>>
>> if (s->nsector == 0) {
>>     s->status = READY_STAT | SEEK_STAT;
>>     ide_set_irq(s->bus);
>>     /* In parallel emulation, the OS may receive the interrupt here,
>>        before the DMA state is updated */
>> eot:
>>     bm->status &= ~BM_STATUS_DMAING;
>>     bm->status |= BM_STATUS_INT;
>>     bm->dma_cb = NULL;
>>     bm->unit = -1;
>>     bm->aiocb = NULL;
>>     return;
>> }
>>
>> The DMA state is changed after the IRQ has been sent. This is correct
>> in sequential emulation, but in parallel emulation the OS may find the
>> DMA still busy even after an end-of-request interrupt has been
>> received. The correct solution is:
>>
>> if (s->nsector == 0) {
>>     s->status = READY_STAT | SEEK_STAT;
>>     /* For COREMU, the DMA state must be updated before the IRQ is sent */
>>     bm->status &= ~BM_STATUS_DMAING;
>>     bm->status |= BM_STATUS_INT;
>>     bm->dma_cb = NULL;
>>     bm->unit = -1;
>>     bm->aiocb = NULL;
>>     ide_set_irq(s->bus);
>>     return;
>> eot:
>>     ...
>> }
>>
>> The DMA state needs to be changed before the IRQ is sent, just as
>> real hardware does.
>>
>> Our evaluation shows that the implementation based on this hypothesis
>> correctly handles concurrent device accesses.
>> We also use a per-VCPU lock-free queue to hold pending interrupt
>> information.
>>
>> For your convenience, here is the URL for our project:
>> http://sourceforge.net/p/coremu/
>> We will do our best to merge our code upstream. :-)
>>
>>>
>>> While we are all looking forward to seeing more scalable hardware
>>> models :), I think it is a topic that can be addressed largely
>>> independently of parallelizing TCG VCPUs. The latter can benefit from
>>> the former, for sure, but it first of all has to solve its own issues.
>>>
>>> Note that --enable-io-thread provides truly parallel KVM VCPUs in
>>> upstream these days as well. Just for TCG, we need that slightly
>>> suboptimal CPU scheduling inside the single-threaded tcg_cpu_exec
>>> (renamed to cpu_exec_all today).
>>>
>>> Jan
>>>
>>> --
>>> Siemens AG, Corporate Technology, CT T DE IT 1
>>> Corporate Competence Center Embedded Linux
>>>
>>>
>>
>>
>>
>> --
>> Zhaoguo Wang, Parallel Processing Institute, Fudan University
>>
>> Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
>>
>> address@hidden
>> http://ppi.fudan.edu.cn/zhaoguo_wang
>>
>



-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

address@hidden
http://ppi.fudan.edu.cn/zhaoguo_wang


