OpenMosix Dojo (version 1.0. Copyright of Mulyadi Santosa)

Qemu and OpenMosix: The Internal Power of Virtualization
        
	Taking your first adventure into the clustering arena? Then maybe 
you began gathering some old PCs from your garage, borrowing PCs from 
your friends, or even sneaking into your neighbour's home trying to 
"pick" their PC? :-) All just because "oh boy, I just have one 
PC... and I want to play with openMosix for a while, but I have 
no more PCs...".

Or maybe you are a brave spirit trying to "conquer" openMosix, so you 
installed openMosix on 4 PCs and then "booommmm", you got nasty 
segfaults. Then someone suggests that you download and try a new 
version of the openMosix patch... now it's time for another leg and 
hand sport, moving around between PCs to update the kernel. Well, LTSP 
might help, but maybe it's not a good idea.

So, it's time to gather your strength. If you know the Chi practice in 
kung fu, now we do the same for your lonely PC... :-) If you have ever 
heard of tools like VMWare, Bochs, Xen, Plex86, User Mode Linux or the 
rest of the gang, then it is time to meet 
Qemu (http://fabrice.bellard.free.fr/qemu/). Grab the source 
tarball at http://fabrice.bellard.free.fr/qemu/qemu-0.5.5.tar.gz; this 
is the latest version (0.5.5) as of this writing.

Unpack the tarball (using tar -xzvf). Now, before you do the actual 
"make", apply the following patch to sdl.c:

---------------------CUT Start of the Patch-----------------------
--- ./before-diff/sdl.c	2004-05-18 10:33:05.000000000 +0700
+++ ./sdl.c	2004-05-18 10:40:55.000000000 +0700
@@ -130,6 +130,7 @@
 static void sdl_process_key(SDL_KeyboardEvent *ev)
 {
     int keycode, v;
+    static int modif;
     
     /* XXX: not portable, but avoids complicated mappings */
     keycode = ev->keysym.scancode;
@@ -150,6 +151,78 @@
     } else {
         keycode = 0;
     }
+    /* Adjust shift-key states when leaving window */
+
+    if (ev->keysym.scancode == 0) {
+        if ((modif ^ ev->keysym.mod) & KMOD_LSHIFT)
+            kbd_put_keycode(0x2a | (modif & KMOD_LSHIFT ? 0x80 : 0));
+        if ((modif ^ ev->keysym.mod) & KMOD_RSHIFT)
+            kbd_put_keycode(0x36 | (modif & KMOD_RSHIFT ? 0x80 : 0));
+        if ((modif ^ ev->keysym.mod) & KMOD_LCTRL)
+            kbd_put_keycode(0x1d | (modif & KMOD_LCTRL ? 0x80 : 0));
+        if ((modif ^ ev->keysym.mod) & KMOD_RCTRL) {
+            kbd_put_keycode(0xe0 );
+            kbd_put_keycode(0x1d | (modif & KMOD_RCTRL ? 0x80 : 0));
+        }
+        if ((modif ^ ev->keysym.mod) & KMOD_LALT)
+            kbd_put_keycode(0x38 | (modif & KMOD_LALT ? 0x80 : 0));
+        if ((modif ^ ev->keysym.mod) & KMOD_RALT) {
+            kbd_put_keycode(0xe0 );
+            kbd_put_keycode(0x38 | (modif & KMOD_RALT ? 0x80 : 0));
+        }
+        modif = ev->keysym.mod;
+    }
+
+    /* remember shift-key state */
+
+    switch (keycode) {
+    case 0x2a:                          /* Left Shift */
+        if (ev->type == SDL_KEYUP)
+            modif &= ~KMOD_LSHIFT;
+        else
+            modif |= KMOD_LSHIFT;
+        break;
+    case 0x36:                          /* Right Shift */
+        if (ev->type == SDL_KEYUP)
+            modif &= ~KMOD_RSHIFT;
+        else
+            modif |= KMOD_RSHIFT;
+        break;
+    case 0x1d:                          /* Left CTRL */
+        if (ev->type == SDL_KEYUP)
+            modif &= ~KMOD_LCTRL;
+        else
+            modif |= KMOD_LCTRL;
+        break;
+    case 0x1de0:                        /* Right CTRL */
+        if (ev->type == SDL_KEYUP)
+            modif &= ~KMOD_RCTRL;
+        else
+            modif |= KMOD_RCTRL;
+        break;
+    case 0x38:                          /* Left ALT */
+        if (ev->type == SDL_KEYUP)
+            modif &= ~KMOD_LALT;
+        else
+            modif |= KMOD_LALT;
+        break;
+    case 0x38e0:                        /* Right ALT */
+        if (ev->type == SDL_KEYUP)
+            modif &= ~KMOD_RALT;
+        else
+            modif |= KMOD_RALT;
+        break;
+    case 0x45:                          /* Num Lock */
+        kbd_put_keycode(0x45);
+        kbd_put_keycode(0xc5);
+        return;
+    case 0x3a:                          /* Caps Lock */
+        kbd_put_keycode(0x3a);
+        kbd_put_keycode(0xba);
+        return;
+
+    }
+
     
     /* now send the key code */
     while (keycode != 0) {

---------------CUT End of Patch------------------------------

Basically this is a patch that fixes a keyboard problem in the SDL 
graphic output. This patch is adjusted for SDL-1.2.5-3 on Redhat 9, so 
feel free to adjust the patch for your distro/setting. Did I mention 
SDL? Yes, you need to install the SDL and SDL devel packages if you 
want graphical output (it is highly recommended... at least from my 
point of view).

Now, do the usual mantra. I assume that you will 
install into /usr/local/qemu:
# ./configure --prefix=/usr/local/qemu/
# make && make install

Now we are ready to build the disk image. You can think of the disk 
image as a virtual hard drive for Qemu. I assume you want to create 
the disk image inside /mnt/qemu:
dd of=/mnt/qemu/myimage bs=1M seek=700 count=0

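The above command is an example of creating a 700 MB empty image: 
nothing is copied (count=0), dd just seeks 700 blocks of 1 MB into the 
output file. You can set another size by changing the "seek" and "bs" 
parameters. See "man dd" for the complete reference.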
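
If you create images often, the same dd trick can be wrapped in a 
small shell function so the size is given once, in megabytes. This is 
only a sketch: the function names make_sparse_image and image_size_mb 
are made up for this example, and "bs=1M" assumes GNU dd as found on 
Linux distributions.

```shell
#!/bin/sh
# make_sparse_image PATH SIZE_MB  (hypothetical helper, not part of Qemu)
# Creates an empty image whose apparent size is SIZE_MB megabytes.
# count=0 copies no data; seek=SIZE_MB only moves the end of the file,
# so the image is sparse and uses almost no real disk space at first.
make_sparse_image() {
    dd of="$1" bs=1M seek="$2" count=0 2>/dev/null
}

# image_size_mb PATH: print the apparent size of PATH in megabytes.
image_size_mb() {
    echo $(( $(wc -c < "$1") / 1048576 ))
}

# Example (same image as in the text):
#   make_sparse_image /mnt/qemu/myimage 700
#   image_size_mb /mnt/qemu/myimage
```

The image only grows on the host as the guest writes into it, so make 
sure the host filesystem can hold the full size in the end.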


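Export the directory Qemu should use for its temporary files in the 
QEMU_TMPDIR environment variable (here /mytmpfs, the tmpfs-mounted 
directory whose creation is shown further down, in the kernel 
compilation step):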

        export QEMU_TMPDIR=/mytmpfs

After that, pick your Linux CD or ISO image and run the following 
command (from now on, please adjust the actual path to the qemu and 
qemu-fast binaries yourself):

# qemu -hda /mnt/qemu/myimage -cdrom /dev/cdrom -boot d -m 64

This is relatively easy to understand: it tells qemu to boot from the 
CD-ROM and also attach the disk image so you can start the 
installation. A couple of weeks ago, I installed Debian 3.0 woody 
inside the disk image because I think it is relatively stable and 
compact. You can pick another distro of your liking... just remember 
to give it enough room, because so far I don't know how to resize the 
disk image :-)

Just install Linux as usual and don't forget to set up a swap 
partition. So, when you finish installing Linux, the disk image 
should contain the root partition and the swap.

The things you need to include are gcc/glibc, a shell (of course, who 
can live without it ;-) ), automake/autoconf, tar and gzip/gunzip. 

After finishing the Linux installation, quit Qemu first; now we move 
to the openMosix kernel compilation. Apply the patch below to your 
openMosix-patched kernel to make it compatible with qemu-fast:

----------------CUT Start of patch------------------------
diff -Naur ./linux/arch/i386/vmlinux.lds ./linux-qemu/arch/i386/vmlinux.lds
--- ./linux/arch/i386/vmlinux.lds	2002-02-26 02:37:53.000000000 +0700
+++ ./linux-qemu/arch/i386/vmlinux.lds	2004-05-17 17:15:37.000000000 +0700
@@ -6,7 +6,7 @@
 ENTRY(_start)
 SECTIONS
 {
-  . = 0xC0000000 + 0x100000;
+  . = 0x90000000 + 0x100000;
   _text = .;			/* Text and read-only data */
   .text : {
 	*(.text)
diff -Naur ./linux/include/asm-i386/page.h ./linux-qemu/include/asm-i386/page.h
--- ./linux/include/asm-i386/page.h	2004-05-14 12:26:48.000000000 +0700
+++ ./linux-qemu/include/asm-i386/page.h	2004-05-17 17:14:50.000000000 +0700
@@ -78,7 +78,7 @@
  * and CONFIG_HIGHMEM64G options in the kernel configuration.
  */
 
-#define __PAGE_OFFSET		(0xC0000000)
+#define __PAGE_OFFSET		(0x90000000)
 
 /*
  * This much address space is reserved for vmalloc() and iomap()
--------------------------CUT end of patch---------------------------

This patch changes the kernel page offset from 0xC0000000 to 
0x90000000 in two places (the linker script and page.h), so the kernel 
becomes compatible with qemu-fast...
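
To see what the two numbers in the patch mean, here is a small 
arithmetic sketch. On i386 Linux, __PAGE_OFFSET is the virtual address 
where kernel space starts, so moving it changes how the 4 GB virtual 
address space is split between user space and kernel. The helper names 
split_mb and link_addr are made up for this example.

```shell
#!/bin/sh
# Everything below __PAGE_OFFSET is user space, everything from it up
# to 4 GB is kernel space.
MB=$((1024 * 1024))
FOUR_GB=$((4096 * MB))

# split_mb OFFSET: print "<user MB> <kernel MB>" for a given page offset.
split_mb() {
    echo "$(( $1 / MB )) $(( (FOUR_GB - $1) / MB ))"
}

# link_addr OFFSET: address the kernel image is linked at (offset + 1 MB),
# matching the ". = OFFSET + 0x100000" line in vmlinux.lds.
link_addr() {
    printf '0x%X\n' $(( $1 + 0x100000 ))
}

echo "original: $(split_mb 0xC0000000) at $(link_addr 0xC0000000)"
echo "patched:  $(split_mb 0x90000000) at $(link_addr 0x90000000)"
```

So the patched guest kernel leaves 2304 MB of virtual address space to 
user processes instead of 3072 MB, and both files must carry the same 
offset or the kernel will not boot.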

Why do we need qemu-fast? Why not use plain Qemu? The answer is 
(copied from the Qemu documentation):
"qemu-fast uses the host Memory Management Unit (MMU) to simulate the 
x86 MMU. It is fast but has limitations because the whole 4 GB address 
space cannot be used and some memory mapped peripherals cannot be 
emulated accurately yet"

In other words, qemu-fast doesn't simulate the MMU; instead it uses 
the host's MMU... should be faster, right? Yes, the guest cannot use 
the whole 4 GB address space, but who wants 4 GB just for 
simulation? :-) It should be fine for the general case AFAIK.

In the kernel configuration, remember to build the NE2000 and NE2K PCI 
drivers into the kernel (not as modules). You can find them under 
"Network device support" --> "Ethernet (10 or 100Mbit)":
CONFIG_NE2000=y
CONFIG_NE2K_PCI=y

I am not sure which one is actually needed for Qemu, but adding both 
won't hurt :-) Enable other options if you think you will need them.
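
Before starting a long compilation you can check the .config for both 
options. A minimal sketch, assuming the usual "OPTION=y" lines of a 
kernel .config file; the helper names is_builtin and check_ne2000 are 
made up for this example.

```shell
#!/bin/sh
# is_builtin CONFIG_FILE OPTION  (hypothetical helper)
# Succeeds only if OPTION is set to "y" (built in), not "m" or unset.
is_builtin() {
    grep -q "^$2=y\$" "$1"
}

# check_ne2000 CONFIG_FILE: report both options named in the text.
check_ne2000() {
    for opt in CONFIG_NE2000 CONFIG_NE2K_PCI; do
        if is_builtin "$1" "$opt"; then
            echo "$opt: built in"
        else
            echo "$opt: NOT built in"
        fi
    done
}

# Example, run on the host against the openMosix kernel tree:
#   check_ne2000 /path/to/openmosix-kernel/.config
```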

Do the usual kernel compilation, and move the finished bzImage (I 
prefer bzImage; it is up to you to pick the final type of kernel), 
vmlinux and System.map to a directory. If you built modules, we will 
move them into the disk image later. Let's assume you move them to 
/boot/qemu. Oh, BTW, it is also a good idea to set up a tmpfs-mounted 
directory for Qemu's needs. Here I create a 1 gigabyte tmpfs:

mount -t tmpfs -o size=1G tmpfs /mytmpfs/

Now we need to test-drive the kernel. Put the following command in a 
shell script:
/usr/local/qemu/bin/qemu-fast -hda /mnt/qemu/myimage -hdb /dev/hda \
    -kernel /boot/qemu/bzImage \
    -append "root=/dev/hda1 ide3=noprobe ide4=noprobe ide5=noprobe"

The above script assumes that you created the root filesystem on the 
first partition of the disk image; that's why the root parameter is 
"/dev/hda1". And what is "-hdb /dev/hda"? Well, we need to copy 
several files from the host system, so we hand the host's disk to Qemu 
as the guest's second drive and mount it inside the guest :-) If your 
layout is different, again feel free to modify the parameters.

Got the login prompt? Congratulations! Now log in and make sure 
you see the following lines in "dmesg":
NE*000 ethercard probe at 0x300: 52 54 00 12 34 56
eth0: NE2000 found at 0x300, using IRQ 9.

These lines indicate that the kernel successfully detected the 
emulated NE2000 card. So far I have had no problem with the "fake" 
NE2000, so the only trick is... just make sure you include NE2000 
support. 
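
If you restart guests often, the check can be scripted. A minimal 
sketch that looks for the "NE2000 found" line quoted above; the 
function name ne2000_found is made up, and the pattern assumes the 
message format shown in this document.

```shell
#!/bin/sh
# ne2000_found: read kernel messages on stdin and succeed if a line
# like "eth0: NE2000 found at 0x300, using IRQ 9." is present.
ne2000_found() {
    grep -q 'eth[0-9]*: NE2000 found at '
}

# Inside the guest:
#   if dmesg | ne2000_found; then echo "NE2000 detected"; fi
```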

We are half-way there. Now we move the kernel modules... how? 
By mounting the "/dev/hdb" we handed to Qemu inside the guest, e.g.:
mount -t ext2 /dev/hdb1 /mnt/host

There... you can access the host filesystem. Now copy the modules of 
the new kernel from the host's /lib/modules (seen under /mnt/host 
inside the guest) straight into the guest's own /lib/modules. After 
that, "halt" the guest system and restart qemu using the above script. 
You should find that "the missing modules" are now loaded successfully.

openMosix needs userland tools, right? Same as above: transfer the 
userland tools tarball (use version 0.3.5) into the guest and compile 
it there. This way, we make sure that it is compiled against the 
correct gcc/glibc. Oh wait, you need the oM kernel headers? Mount the 
host filesystem and create a soft link from the openMosix kernel 
source to /usr/src/linux-openmosix, and then the compilation will go 
smoothly.

I will skip the oM-specific settings; just refer to the HOWTO for how 
to set up /etc/inittab, the node map etc. Also remember to set up the 
IP address for eth0 (on Debian, you can have it brought up at boot 
using /etc/network/interfaces).

Again, shut down the guest system. Now we move to setting up the 
second node. "What, doing the above steps again? You gotta be 
kidding, right? I need a faster way!" OK, relax :-) That's why we will 
create a COW (Copy On Write) image. What is it? You can think of it as 
a way of sharing the original disk image between Qemu instances: each 
instance keeps its own modifications in its own COW file, and the 
original image stays untouched...

Let's create two COWs (yes COW, but not the cows which produce milk, 
OK? :-))) ):

# qemu-mkcow -f /mnt/qemu/myimage /mnt/qemu/mycow1.cow
# qemu-mkcow -f /mnt/qemu/myimage /mnt/qemu/mycow2.cow

Not so difficult, right? After that, create a script for enabling the 
TUN/TAP device. "Wait wait... TUN/TAP, why do I need it?" Well, 
TUN/TAP is a virtual network device that acts as the link between a 
guest system and its host. So, if you don't turn it on, there is 
no "network connection" between the guest and its host.

Here is an example of the script:
#!/bin/sh
sudo /sbin/ifconfig $1 192.168.1.11 netmask 255.255.255.0

Modify the above script for each guest's TUN/TAP device, and remember 
not to assign an IP that is already used by another TUN/TAP device or 
by a guest.

I suggest putting the TUN/TAP devices and the interfaces inside the 
guests on a subnet separate from the host's own network. I use this 
trick so I won't mess a lot with the host's routing table. For this 
"Dojo" I use the following topology:

                        host (10.1.1.1)
                        /        \
                       /          \
                      /            \
1st TUN/TAP (192.168.1.11)      2nd TUN/TAP (192.168.1.12)
        |                               |
        |                               |
1st guest (192.168.1.21)        2nd guest (192.168.1.22)

Got a clearer picture from the above diagram? I hope so... :-) So, 
back to the TUN script: you should write two scripts:

For 1st TUN/TAP: (name it /mnt/qemu/qemu-ifup)
#!/bin/sh
sudo /sbin/ifconfig $1 192.168.1.11 netmask 255.255.255.0

For 2nd TUN/TAP: (name it /mnt/qemu/qemu-ifup2)
#!/bin/sh
sudo /sbin/ifconfig $1 192.168.1.12 netmask 255.255.255.0

eth0 inside 1st guest: 192.168.1.21
eth0 inside 2nd guest: 192.168.1.22

I use the above IP numbering so I can quickly remind myself of the 
topology (an address ending in 1 belongs to the 1st group, one ending 
in 2 to the 2nd group). You don't have to follow my idea :-)
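
The numbering convention is regular enough to generate. A minimal 
sketch, assuming the 192.168.1.x addresses used in this Dojo and at 
most 9 guests; the function names tap_ip, guest_ip and ifup_script are 
made up for this example.

```shell
#!/bin/sh
# Addressing convention from the text: 192.168.1.1N for the TUN/TAP
# device of guest N, 192.168.1.2N for eth0 inside guest N (N = 1..9).
tap_ip()   { echo "192.168.1.1$1"; }
guest_ip() { echo "192.168.1.2$1"; }

# ifup_script N: print the body of the qemu-ifup script for guest N.
# The literal $1 in the output is the device name the script receives.
ifup_script() {
    printf '#!/bin/sh\n'
    printf 'sudo /sbin/ifconfig $1 %s netmask 255.255.255.0\n' "$(tap_ip "$1")"
}

for n in 1 2; do
    echo "guest $n: tap $(tap_ip "$n"), eth0 $(guest_ip "$n")"
done
# To write the second script:
#   ifup_script 2 > /mnt/qemu/qemu-ifup2 && chmod +x /mnt/qemu/qemu-ifup2
```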

Because you already have two COWs, modify the qemu start script so it 
becomes:
(for 1st guest)
/usr/local/qemu/bin/qemu-fast -hda /mnt/qemu/mycow1.cow \
    -macaddr 52:54:00:12:34:56 -kernel /boot/qemu/bzImage -n ./qemu-ifup \
    -append "root=/dev/hda1 ide3=noprobe ide4=noprobe ide5=noprobe"

(for 2nd guest)
/usr/local/qemu/bin/qemu-fast -hda /mnt/qemu/mycow2.cow \
    -macaddr 52:54:00:12:34:60 -kernel /boot/qemu/bzImage -n ./qemu-ifup2 \
    -append "root=/dev/hda1 ide3=noprobe ide4=noprobe ide5=noprobe"

"I notice the -macaddr switch... why do we need it?" The answer: if 
you don't set it explicitly, both guest systems get the same MAC 
address... and that would confuse the ARP resolution mechanism. So we 
need to set a distinct MAC address for each guest. Still confused 
about what I am talking about? Go to the RFC on ARP (RFC 826) and read 
about the IP-to-MAC address resolution mechanism.
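
With more than two guests it is easy to reuse a MAC address by 
accident. A minimal sketch that derives one from the guest number; the 
function name guest_mac and the counting scheme are my assumptions 
(the scripts above simply use ...:56 and ...:60), any set of distinct 
addresses will do.

```shell
#!/bin/sh
# guest_mac N: print a distinct MAC address for guest N (N = 1..170).
# Assumption: keep the 52:54:00:12:34 prefix used in the text and count
# the last byte up from 0x56, so guest 1 gets ...:56, guest 2 ...:57.
guest_mac() {
    printf '52:54:00:12:34:%02x\n' $(( 0x56 + $1 - 1 ))
}

for n in 1 2 3; do
    echo "guest $n: -macaddr $(guest_mac "$n")"
done
```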

Now fire up both qemu instances and watch them load the openMosix 
kernel until you get the login prompt. Back on the host system, we now 
need to set up a bridge connecting these 2 guests. "Oh boy, another 
pain is coming... when will it stop?" :))) Remember, a dojo is a place 
to practice, not a place for instant skill like Neo getting his kung 
fu skills inside the Matrix :-) The motto "No pain, no gain" applies 
here :-)))

OK, back to the bridge. You can think of a bridge as "a hub connecting 
any network interfaces you attach to it". Copy the following script to 
set up a bridge between tun0 and tun1 (let's name it start-bridge.sh):
#!/bin/bash
/sbin/modprobe bridge
/sbin/route del -net 192.168.1.0 netmask 255.255.255.0
/sbin/route del -net 192.168.1.0 netmask 255.255.255.0
/usr/sbin/brctl addbr br0
/usr/sbin/brctl addif br0 tun0
/usr/sbin/brctl addif br0 tun1
/sbin/ifconfig br0 192.168.1.13 netmask 255.255.255.0
/sbin/ifconfig tun0 0.0.0.0
/sbin/ifconfig tun1 0.0.0.0

The default kernel on Redhat 9 (the one I use for this experiment) 
comes with bridging capability as a module (bridge.o). If you don't 
find it, recompile the kernel and make sure you include one of these:
CONFIG_BRIDGE=m (as module)--> preferred
or
CONFIG_BRIDGE=y (as native kernel part)

You can find them under "Networking options". It is named "802.1d 
Ethernet Bridging".

Why do we need to do "route del"? Well, remember that we previously 
brought up the TUN/TAP devices? On (recent) Linux, "ifconfig" 
automatically sets up a route for each new IP address assigned to a 
device. Each of the two TUN/TAP devices got its own route to 
192.168.1.0, which is why the same "route del" line appears twice. So 
basically we clean them up because we don't need them! 

The next lines are about setting up the bridge itself. You need to 
install the bridge-utils RPM (RH 9 includes these tools). If your 
distro doesn't include it, go to 
http://www.math.leidenuniv.nl/~buytenh/bridge and grab the tarball 
there. Actually, what I am going to explain is a short version of the 
Bridge Mini HOWTO; you can find more about bridging at www.tldp.org by 
searching for "bridge". Many distributions also include these docs.

Let's analyze the commands:
/usr/sbin/brctl addbr br0
--> here we create a new bridge interface named "br0"

/usr/sbin/brctl addif br0 tun0
/usr/sbin/brctl addif br0 tun1

--> here we "bond" tun0 and tun1 so they are attached "inside" 
the bridge 

/sbin/ifconfig br0 192.168.1.13 netmask 255.255.255.0

--> as you know, this assigns an IP address and netmask to the bridge. 
You need to make sure that the bridge is on the same subnet as the 
guests...

/sbin/ifconfig tun0 0.0.0.0
/sbin/ifconfig tun1 0.0.0.0

--> easy, just assign the IP 0.0.0.0 to the TUNs (but do not bring 
them down, I repeat, DO NOT turn the TUNs down) :-)
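
If you later add a third or fourth guest, the same steps repeat once 
per TUN/TAP device. Here is a minimal sketch that only prints the 
commands of start-bridge.sh for any number of devices, so you can read 
them before running anything. The function name bridge_commands is 
made up for this example, and the subnet is the one used in this Dojo.

```shell
#!/bin/sh
# bridge_commands BRIDGE BRIDGE_IP TUN...  (hypothetical helper)
# Prints the commands from start-bridge.sh for the given TUN/TAP
# devices. Nothing is executed here.
bridge_commands() {
    br=$1; ip=$2; shift 2
    echo "/sbin/modprobe bridge"
    # one leftover route per TUN/TAP device that qemu-ifup configured
    for t in "$@"; do
        echo "/sbin/route del -net 192.168.1.0 netmask 255.255.255.0"
    done
    echo "/usr/sbin/brctl addbr $br"
    for t in "$@"; do
        echo "/usr/sbin/brctl addif $br $t"
    done
    echo "/sbin/ifconfig $br $ip netmask 255.255.255.0"
    for t in "$@"; do
        echo "/sbin/ifconfig $t 0.0.0.0"
    done
}

bridge_commands br0 192.168.1.13 tun0 tun1
```

For tun0 and tun1 this prints the same nine commands as the script 
above; once you are happy with the output you can pipe it to "sh" as 
root.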

The topology becomes

                        host (10.1.1.1/24)
                        /        \
                       /          \
                      /            \
                    T H E   B R I D G E (192.168.1.13/24)
                        |                       |
                        |                       |
        1st TUN/TAP (0.0.0.0)              2nd TUN/TAP (0.0.0.0)
                |                               |
                |                               |
        1st guest (192.168.1.21/24)        2nd guest (192.168.1.22/24)

Now try to ping from guest 1 to guest 2 and vice versa... success? Now 
start openMosix (just copy the openMosix start/stop script from the 
openMosix userland tarball) and confirm that "mosmon" sees all the 
nodes!

Let me state something before we go further. Something inside Qemu 
screws up openMosix's auto-detection of the system's speed, so make 
sure you include this line in your openMosix startup script:

mosctl setspeed 15000

Feel free to adjust the number, but make sure you set the same number 
across all guest systems; if not, you will get weird load balancing 
behaviour... believe me... :-) It took me 2 days just to find out 
about this "speed" thing when I saw that openMosix didn't load 
balance my program :-)))

After that, try migration between guests... this wouldn't be an 
openMosix cluster if it couldn't migrate processes, right? :-) Just 
compile a simple C program like the one below:

int main(void)
{
    /* volatile keeps the compiler from optimizing the busy loops away */
    volatile int a, b;

    for (a = 0; a <= 1000000; a++)
        for (b = 0; b <= 1000000; b++)
        {
        }
    return 0;
}

Suppose you name it "silly.c"; compile it as "silly" and run silly 
in the background (add "&"). 2 instances of "silly" are sufficient for 
a start.
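
If you want more than a couple of instances, a small loop saves 
typing. A minimal sketch; the function name launch_copies is made up 
for this example.

```shell
#!/bin/sh
# launch_copies N COMMAND [ARGS...]  (hypothetical helper)
# Starts N background copies of COMMAND and prints one PID per line.
launch_copies() {
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" > /dev/null 2>&1 &
        echo $!
        i=$((i + 1))
    done
}

# On a guest, after "gcc -o silly silly.c":
#   launch_copies 2 ./silly
# then keep an eye on mosmon on both guests.
```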

Success? Then congratulations... you have exercised your Chi to the 
highest level :-) Have fun with your new virtual cluster!

reference:
- Qemu user documentation and technical documentation
- openMosix HOWTO
- Ethernet Bridge mini HOWTO
- Documentation/networking/tmpfs.txt inside the kernel source 
directory