bug-hurd

Re: 64bit startup


From: Luca
Subject: Re: 64bit startup
Date: Tue, 06 Jun 2023 20:56:53 +0000

On 6 June 2023 20:22:53 UTC, Sergey Bugaev <bugaevc@gmail.com> wrote:
>On Mon, Jun 5, 2023 at 3:03 PM Sergey Bugaev <bugaevc@gmail.com> wrote:
>> That is going to be much easier to debug than debootstrap, thank you!
>
>Unfortunately I'm facing some troubles :|
>
>For one thing you seem to have rebuilt/updated the packages, but not
>the rootfs image, so now the debuginfo I download doesn't match the
>binaries in the image. Please update the image too!
>
>Also, gdb seems to suddenly have some sort of trouble with
>interpreting the debuginfo files, specifically I'm trying ones from
>the 'hurd-dbsym' package (to be even more specific:
>/usr/lib/debug/.build-id/bf/dd0c0525d0ca383bd842796063345a2dd0c001.debug
>from that package, which corresponds to ext2fs.static -- but I've done
>a quick check and other files seem to behave the same way too). GDB
>loads regular symbols from them, but not the debuginfo, i.e. I can see
>what function is at which address, but not map addresses to source
>lines or access local variables or use types (tcbhead_t is the one I
>currently need most). I don't know enough about GDB and DWARF to
>diagnose exactly what's going on; readelf --debug-dump=info seems to
>dump the debuginfo just fine.
>
>Please try to reproduce this with your GDB (no Hurd system required),
>and if you have changed something recently about how debug files are
>generated, maybe that's what has broken it.
>
>So that all being said, here's one crash I am (and have been) seeing a
>lot: the crash at any sort of TCB access when fs_base suddenly turns
>out to be equal to the address of _kret_popl_ds. This makes no sense
>-- surely userspace would never set that, so it must be a gnumach bug.
>
>I've got a little theory of how something like that could happen:
>
>It is my understanding that "the PCB stack" (whatever that is) where
>locore.S pushes user's registers and thread->pcb->iss is really the
>exact same place, pushing registers onto that stack is exactly writing
>to the thread's i386_saved_state structure. The first four members of
>struct i386_saved_state are unsigned long fsbase, gsbase, gs, fs --
>and being the first members of the struct means they have the lowest
>addresses, i.e. are located at the top of the PCB stack.
>
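To make the layout argument above concrete, here is a minimal sketch in C. It is not the real gnumach definition: `struct sketch_saved_state` and `sketch_fsgs_bytes` are hypothetical names, only the first four members are taken from the description above, `rest` is a placeholder for everything that follows, and `uint64_t` stands in for the 8-byte `unsigned long` of x86_64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct i386_saved_state.
   Only the order of the first four members comes from the mail above. */
struct sketch_saved_state {
    uint64_t fsbase;   /* lowest address: last slot reached by a full push */
    uint64_t gsbase;
    uint64_t gs;
    uint64_t fs;
    uint64_t rest[4];  /* placeholder for the remaining saved registers */
};

/* The first member sits at offset 0, i.e. at the lowest address. */
static_assert(offsetof(struct sketch_saved_state, fsbase) == 0,
              "fsbase is expected to be the first slot");

/* Number of bytes covered by the four leading members. */
static size_t sketch_fsgs_bytes(void)
{
    return offsetof(struct sketch_saved_state, rest);
}
```

Since the stack grows towards lower addresses, the slot at offset 0 is the one a fully pushed state ends on, and `sketch_fsgs_bytes()` gives the size of the region that holds the four leading members.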
>locore.S actually skips pushing or popping these four members:
>
>#define PUSH_FSGS               \
>        subq    $32,%rsp
>
>#define POP_FSGS                \
>        addq    $32,%rsp
>
>This is because we don't care about fs and gs, and the fsbase/gsbase of a
>thread can only be changed by explicit thread_set_state calls, not by the
>thread itself. So there is no need to rdmsr and push them, since the value
>is already saved in the PCB slot.
>
>However, *something* goes wrong and the fsbase slot gets overwritten
>with an unrelated value (_kret_popl_ds). The real %fs_base MSR keeps
>the proper value -- until we context-switch away from the thread and
>then back to it, at which point the bogus value gets loaded into
>%fs_base and then the userland tries to use it and faults.
>
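The "latent until the next context switch" behaviour described above can be sketched as a toy model in C. Nothing here is gnumach code: `sketch_cpu`, `sketch_thread`, `sketch_switch_to` and `sketch_corruption_is_latent` are hypothetical names, and the only assumption taken from the mail is that switching to a thread reloads the fs base register from the thread's saved slot.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_cpu    { uint64_t fs_base_msr;  };  /* models the live MSR */
struct sketch_thread { uint64_t saved_fsbase; };  /* models the PCB slot */

/* Assumption: switching to a thread reloads the MSR from its saved slot. */
static void sketch_switch_to(struct sketch_cpu *cpu,
                             const struct sketch_thread *t)
{
    cpu->fs_base_msr = t->saved_fsbase;
}

/* Returns 1 if clobbering the slot is invisible while the thread keeps
   running and only shows up after switching away and back. */
static int sketch_corruption_is_latent(uint64_t good, uint64_t bogus)
{
    struct sketch_cpu cpu = { 0 };
    struct sketch_thread a = { good };
    struct sketch_thread b = { 0 };

    sketch_switch_to(&cpu, &a);
    a.saved_fsbase = bogus;                  /* the slot gets overwritten */
    int fine_before = (cpu.fs_base_msr == good);

    sketch_switch_to(&cpu, &b);              /* switch away ...           */
    sketch_switch_to(&cpu, &a);              /* ... and back              */
    int broken_after = (cpu.fs_base_msr == bogus);

    return fine_before && broken_after;
}
```

In this model the thread keeps working with the correct value for as long as it stays on the CPU, which would explain why the fault only appears some time after the slot was overwritten.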
>I don't know nearly enough about x86 interrupts/traps to say, but
>could it be that we get another interrupt/trap, while presumably
>executing _kret_popl_ds, and that causes the faulting %rip to be
>pushed onto the stack, but since we're at the PCB stack at that point
>it clobbers the stored fsbase? That doesn't cause issues for all the
>other registers because we have already popped their values off and
>won't be accessing them anymore; we'll push the new values the next
>time the thread enters the kernel -- though I guess it could show up
>in thread_get_state if you do that without stopping the thread on an
>SMP kernel.
>
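The clobbering theory above can be tried out on paper with a small C model. This is only a sketch under stated assumptions: in 64-bit mode the CPU pushes SS, RSP, RFLAGS, CS and RIP (five 8-byte values, plus an error code for some exceptions, not modelled here) when it delivers an interrupt; the model assumes the interrupt is taken on the current stack, ignores the 16-byte stack alignment the CPU applies first, and uses a hypothetical slot index for where the stack pointer is when the interrupt arrives. All names (`sketch_push_frame`, `sketch_fsbase_after_interrupt`) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { SKETCH_SLOTS = 8, SKETCH_FRAME_QWORDS = 5 };

/* Push SS, RSP, RFLAGS, CS, RIP below slot index `sp`.
   Returns the new index, or -1 if the frame does not fit. */
static int sketch_push_frame(uint64_t *slots, int sp, uint64_t rip)
{
    if (sp < SKETCH_FRAME_QWORDS || sp > SKETCH_SLOTS)
        return -1;
    slots[--sp] = 0;    /* SS     (placeholder value) */
    slots[--sp] = 0;    /* RSP    (placeholder value) */
    slots[--sp] = 0;    /* RFLAGS (placeholder value) */
    slots[--sp] = 0;    /* CS     (placeholder value) */
    slots[--sp] = rip;  /* RIP of the interrupted instruction */
    return sp;
}

/* Slot 0 models the saved fsbase. Returns its content after an interrupt
   arrives while the stack pointer is at slot index `sp`. */
static uint64_t sketch_fsbase_after_interrupt(uint64_t saved_fsbase,
                                              int sp, uint64_t rip)
{
    uint64_t slots[SKETCH_SLOTS] = { saved_fsbase };
    (void)sketch_push_frame(slots, sp, rip);
    return slots[0];
}
```

With the stack pointer five slots above the fsbase slot, the pushed RIP lands exactly on slot 0; with the stack pointer further up, slot 0 is left alone. Whether the real return path is ever interrupted at such a position is exactly what would need to be checked in locore.S.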
>cc'ing Luca -- does what I'm saying make sense? could this happen? can
>you reproduce %fs_base getting set to _kret_popl_ds?
 
Yes, this makes perfect sense. I think I'm currently chasing the same bug, or
some variation of it. With some tracing I saw that this corruption seems to
happen when an IRQ (usually the timer) fires while returning from a trap,
although not necessarily at one specific point.

I still have to find the exact cause. In my tests fsbase gets overwritten
either with a kernel address or with 0x17, which might be a segment selector.
By the way, in locore.S the 64-bit-only segment handling code is doing way too
much, e.g. es/ds and such could be ignored. I guess simplifying this part may
also solve this issue.
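
As a quick check on the "might be a segment value" guess, the following sketch decodes a value using the standard x86 selector layout (bits 0-1: requested privilege level, bit 2: table indicator, bits 3-15: descriptor index). The names `sketch_selector` and `sketch_decode_selector` are hypothetical, and the sketch says nothing about which descriptor gnumach actually places at that index.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_selector {
    unsigned index;     /* descriptor index within the table   */
    unsigned uses_ldt;  /* table indicator: 0 = GDT, 1 = LDT   */
    unsigned rpl;       /* requested privilege level (0 to 3)  */
};

/* Split a 16-bit x86 segment selector into its three fields. */
static struct sketch_selector sketch_decode_selector(uint16_t sel)
{
    struct sketch_selector s;
    s.rpl      = sel & 0x3u;
    s.uses_ldt = (sel >> 2) & 0x1u;
    s.index    = sel >> 3;
    return s;
}
```

Read this way, 0x17 is index 2 in the LDT with RPL 3, which at least has the shape of a user-mode selector.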


Luca


