Exploring a new paging interface


From: Neal H. Walfield
Subject: Exploring a new paging interface
Date: 20 Jun 2002 17:10:53 +0200
User-agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/21.2

Introduction
--- ---- - -

Virtual memory has been the canonical way to deal with the ubiquitous
lack of physical memory for more than thirty years.  In this model,
virtual pages are multiplexed onto physical frames.  As physical
memory becomes scarce, frames are reclaimed by invalidating virtual
memory mappings and, if need be, saving the corresponding data to
secondary storage.  Later, when the invalidated virtual pages are
referenced, the thread which is attempting to access the memory is
suspended, a frame is allocated and filled with the saved data, the
mapping is reestablished and, finally, the thread is resumed.  All of
this is completely transparent to applications--except, that is, with
respect to performance.

Most mainstream operating systems used today place the virtual memory
manager (VMM) within a monolithic kernel.  By completely integrating
the VMM into the trusted computing base, several optimizations become
apparent.  First, much of the data needed to make global memory
decisions is trivially within the VMM's reach: it can twiddle the
kernel's internal data structures, thereby reducing the need for
strict formal protocols.  Second, the VMM can blindly trust the rest
of the kernel to do the Right Thing; it does not have to concern
itself with untrusted external components which may or may not choose
to cooperate.  This also makes the algorithms superficially simpler.

The speed difference between accessing physical memory and accessing
disk is several orders of magnitude: the former is typically measured
in nanoseconds while the latter is measured in milliseconds.  Thus,
consistently eliminating pages from memory only to bring them back in
a few seconds later can significantly degrade performance.  It is
therefore essential that systems take some time to make the best
possible choice.  As difficult as it is to predict how one will act
in the future, it is even harder to predict how others will act.
Yet, this is what typical VMMs have been doing: they rely on global
information and averages to make local decisions.  As long as
programs conform to this preconceived idea of what a typical access
pattern looks like, they run reasonably well.  Programs which do not
fall into this category are, however, severely penalized.  Some of
the most obvious examples from this latter category are garbage
collectors, databases, programs which maintain in-memory caches and
programs which use memory mapped files.

Designers of these systems have recognized the limitations and have
made certain inroads into fixing these performance issues.  Most of
the solutions are far from adequate.  Unix-like systems have a
function called madvise which gives hints to the kernel about how the
memory will be accessed.  The set of hints is by definition limited
and far from a complete solution.  Other ideas have involved
detecting access patterns.  Although certain types can be detected
with a fair amount of success, each one must be preprogrammed and each
adds a certain amount of overhead.
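
To make the limitation concrete, here is a minimal sketch of the
madvise mechanism on a system which provides it; the helper function
and its name are merely illustrative:

    /* Map FD and hint that its pages will be read sequentially.  */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int
    hint_sequential (int fd)
    {
      struct stat st;
      if (fstat (fd, &st) < 0)
        return -1;

      void *buf = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (buf == MAP_FAILED)
        return -1;

      /* The kernel may now read ahead aggressively and reclaim pages
         behind the read pointer--or it may ignore the hint entirely.  */
      if (madvise (buf, st.st_size, MADV_SEQUENTIAL) < 0)
        {
          munmap (buf, st.st_size);
          return -1;
        }

      /* ... read buf from start to finish ...  */

      munmap (buf, st.st_size);
      return 0;
    }

The application has said all it can say; the kernel remains free to
ignore the hint and, in any case, the hint vocabulary cannot describe
a garbage collector's or a database's real needs.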

Allowing applications to provide more input into the management of
their working set introduces a significant amount of complexity as
well as trust issues.  Before we can explore these, we first need to
break down the paging process and understand where processes might be
interested in participating and could benefit from doing so.  Virtual
memory management can roughly be separated into three parts:

    * Page table layout and management
    * Where and how to store pages
    * Which pages to evict when

Allowing processes to control how their virtual address space is
constructed has been explored in the exokernel developed at MIT [1].
For instance, an application may want to use inverted page tables.
One of the problems with this is that it is difficult to make a
general framework that is portable across hardware platforms.
Additionally, to take advantage of this, lookups must be done in
software, which not all processors support efficiently.  Although the
goal of L4 is to introduce as little policy into the microkernel as
possible, it was decided early on that, in order to provide a hardware
independent interface, this particular detail would have to be made
transparent to users.

External pagers were popularized by the Mach microkernel [2].  In a
typical Mach based operating system, the file systems live in user
space.  Applications, rather than making system calls, send RPCs to
the file system servers to open, read from and write to files.  Mach
also allowed applications to map memory into their virtual address
spaces that was backed, not by the kernel, but by memory managers such
as the file systems.  As pages are referenced by clients, Mach
detects the page faults and, if the pages are not already paged in,
i.e. cached, requests them from the manager.
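
As a purely illustrative sketch (these are not the real Mach memory
object calls; every name below is hypothetical), an external pager
is, at its core, a server that answers the kernel's fault requests
with data from its backing store:

    /* Hypothetical external pager loop; the request structure and the
       reply functions stand in for the real Mach memory object RPCs.  */
    #include <stddef.h>

    #define PAGE_SIZE 4096

    struct fault_request
    {
      unsigned long offset;     /* File offset of the faulting page.  */
      int write;                /* Non-zero for a write fault.  */
    };

    /* Assumed to be provided by the file system and its RPC glue.  */
    extern int  backing_store_read (unsigned long offset, void *buf);
    extern void reply_with_data  (const struct fault_request *req,
                                  const void *buf, size_t len);
    extern void reply_with_error (const struct fault_request *req);

    /* Invoked for every fault the kernel cannot satisfy from its page
       cache; once the pager replies, the kernel installs the mapping
       and resumes the faulting thread.  */
    void
    handle_fault (const struct fault_request *req)
    {
      char buf[PAGE_SIZE];

      if (backing_store_read (req->offset, buf) < 0)
        reply_with_error (req);
      else
        reply_with_data (req, buf, sizeof buf);
    }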

Although Mach was a major step forward with respect to flexibility, it
did nothing to export the page eviction policy; managers are unable to
indicate to the kernel which memory should be paged out, nor are they
ever consulted.  This problem was examined in [3].  The result was PREMO
pagers.  They fail, however, to provide either an elegant or an
efficient solution; the PREMO extensions just load the kernel with
even more baggage.

The VMS operating system was designed in 1978 with a user level
eviction policy which has remained central to the current system [4].
In this system, each task is allowed to allocate a fixed maximum
number of physical pages.  When a process approaches its allocation
limit, even if there is free physical memory in the system, the kernel
asks it to return pages before it will allocate it additional
resources.  The kernel does not necessarily free pages that are
returned.  Instead, it places them on a free list which the
application may fault back in.  Only when there is memory pressure is
the data actually sent to disk.  Thus, even if the working set is much
larger than the system imposed resource limit, the application's
performance need not be severely degraded.  Although VMS does have
excellent support in
this regard, it has no support for external pagers: pagers live in the
kernel.

The V++ cache kernel developed at Stanford has support for both
external pagers and external paging policies while providing an
architecturally independent interface [5].  The following proposal is
based quite closely on this model.  [5] should be considered required
reading before continuing.


A Memory Management Scheme for L4/Hurd
--- ------------------------------ - -

This scheme is closely based on the current features used by the Hurd
running on Mach and also the aforementioned references.  One of the
main objectives is to break apart the monolithic virtual memory
manager, i.e. Mach, by exporting the paging policy and management
while not sacrificing the current flexibility of external pagers.

                                  ______
                                 /      \
                                | Sigma0 |
                                 \______/
                                    ^|
                                    ||
                                    |v
  \                               ________
   \          _______            /        \
  S |        /       \          | Physical |
  W | <===> |  Swapd  |  <===>  |  Memory  |
  A |        \_______/          |  Server  |
  P |                       //|  \________/  |\\
   /                      //     //|    |\\    \\
 /                      //      //        \\     \\
                      //      |//          \\|     \\
                    //   _________          ________ \\
                  //    /         \        /        \  \\
                //     | Anonymous |<====>|   File   |   \\
              //       |  Memory   |      | Server A |    ||
             ||         \_________/        \________/     ||
             ||     //| ^|   ^|   |\\  //|   ^|    |\\    ||
             ||    //   ||   |v      XX      |v      \\   ||
             ||   //    ||  ______\//  \\|______       \\ ||v
             v| |//     || /      \      /      \      ________
            ______      |||  Bash  |    | Editor |    /        \
           /      \      \\\______/      \______/    |   File   |
          | Scheme |      \========================> | Server B |
           \______/  <=============================>  \________/

        Figure 1: Interactions between servers and clients


The physical memory server obtains all of the available physical
memory from sigma0 at boot time.  All other programs in the system
contact it directly for their physical memory needs.  The physical
memory server hands out the memory based on contracts or leases.  A
contract might say that the receiving task will receive a certain
amount of physical memory; when asked, however, it has to give it
back within a certain amount of time.  Another contract may guarantee
a loan of consecutive pages for a fixed amount of time.  This would be
useful for memory intended to be used in DMA operations.  Physical
memory could also be provided on a market as suggested in my last
email.  This idea is thoroughly discussed in section 2.4 of [5] and
looks quite promising.
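
To make the idea more concrete, here is a rough, purely illustrative
sketch of what such a contract could look like to a client.  None of
these names or types are part of a proposed interface; they only show
the shape of the idea: memory arrives with terms attached, and the
lender can call it back.

    /* Hypothetical client-side view of a physical memory contract.  */
    #include <stddef.h>

    typedef unsigned long frame_t;

    struct contract
    {
      size_t frames;           /* Number of physical frames requested.  */
      unsigned int grace_ms;   /* Time the client has to clean, flush
                                  and return a frame when asked.  */
      int contiguous;          /* Non-zero: consecutive frames, e.g. for
                                  DMA buffers.  */
    };

    /* Callback registered by the client: return up to COUNT frames.
       Taking longer than GRACE_MS violates the contract and marks the
       task as badly behaved.  */
    typedef size_t (*reclaim_fn) (frame_t *frames, size_t count);

    /* Hypothetical stubs for talking to the physical memory server.  */
    extern int phys_mem_acquire (const struct contract *terms,
                                 reclaim_fn reclaim, frame_t *frames);
    extern int phys_mem_release (const frame_t *frames, size_t count);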

When the physical memory server starts to detect memory pressure, it
can request pages back.  Depending on the details of the loan, a task
might have a certain amount of time to clean and flush the memory
before the contract is violated.  If a task violates its contract,
then the task is considered to be badly behaved.  The physical memory
server may choose to double page tasks which are badly behaved,
terminate them or take some other, as yet undecided, action.
This decision will be based on some system policy.

Only the physical memory server can page to the swapd daemon.  There
are two motivating reasons for this restriction.  First, as pages are
sent to the physical memory server, they may not need to ever go to
swap: the physical memory server chooses when the actual eviction
occurs and can act as a small cache.  This would allow another
important optimization: large clustered writes as in VMS.  Second, by
only having the memory server communicate with the swapd, there is no
need to have a policy to access the swap space.  If processes were
allowed to send pages to the swapd without also giving up the
corresponding physical memory, rogue processes could fill the swap
space, inducing a denial of service.

More traditional servers can be built on top of the physical memory
server.  For instance, an anonymous memory server using an LRU policy
could be used as a source for anonymous pages that most Unix-like
applications require.  This server could also deal with copy on write,
sharing, etc.  Alternatively, there could be a separate copy on write
server that could be stacked between the anonymous memory server and
clients.  This latter strategy is being explored in Sawmill.

One important question that needs to be answered is whether the
physical memory server should treat the anonymous memory server
specially.  As it will potentially be servicing many clients, it would
be unfair for it to have to compete with a regular program which only
services itself.  An
initial idea that I had is to give processes memory priorities in much
the same way they have scheduling priorities.  Then, as the system
comes up, root can start the anonymous memory server with a certain
nice value giving it priority over standard tasks.

The file servers would use physical memory to hold the data from the
backing store.  Or rather, the backing store would transfer physical
memory to the file server on a read.  By using physical memory, file
servers are free to impose their paging policy on files as they best
see fit and to maintain large internal cache buffers without the fear
of double paging.  The file servers would have a default paging
policy for file data; they could, however, also expose an interface
similar to madvise for additional control.  For even more flexibility,
file servers may choose to reveal an interface similar to the physical
memory server's.  This is especially important when a file is acting as
a backing store, e.g. file server B in figure 1 is using a file from
file server A.  Additionally, file server A may not want to transfer
the memory to file server B as it then loses control of it which may
introduce cache consistency problems.

Applications such as bash and editors might not require any special
memory policy (Emacs could, of course, take full advantage of the
physical memory server and even provide hooks to implement the policy
in elisp).  In this case, the applications can obtain their memory
from the file servers and the anonymous memory servers and avoid
dealing with any of the details directly.  This is not what all
applications need, however.  Take, for instance, a Scheme
interpreter.  This application uses a garbage collector.  It would
clearly behoove it to implement its own policy.  This is completely
possible and safe.  Remember, the contract model permits untrusted
applications to use physical memory sans risk to the system.
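
As a hypothetical illustration of what that could buy (reusing the
sketched reclaim callback from above, which again is not a proposed
interface), a garbage collected runtime might answer a reclaim
request like this:

    /* Hypothetical: a garbage-collected runtime satisfying a reclaim
       request from the physical memory server.  The helpers are assumed
       to be provided by the runtime's heap.  */
    #include <stddef.h>

    typedef unsigned long frame_t;

    /* Frames containing nothing but garbage after the last collection;
       surrendering them costs nothing.  */
    extern size_t gc_dead_frames (frame_t *out, size_t max);
    /* Flush the coldest live frames to the runtime's backing store so
       that they, too, can be surrendered.  */
    extern size_t gc_evict_cold_frames (frame_t *out, size_t max);

    size_t
    gc_reclaim (frame_t *frames, size_t count)
    {
      /* Prefer frames whose contents are garbage anyway.  */
      size_t n = gc_dead_frames (frames, count);

      /* Only then pay for evicting cold, live data.  A kernel-chosen
         global policy could never make this distinction.  */
      if (n < count)
        n += gc_evict_cold_frames (frames + n, count - n);

      return n;
    }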

When a server asks a client to free memory, it is not asking for just
any memory but for memory that *it* lent the client.  This may complicate
memory management a bit.  One problem might arise when a client has
leases from many different servers.  It could happen that the physical
memory server contacts several top level servers, asks them to free
memory and then they all contact this client asking for memory back.
In this case, the client would be overloaded with requests from
different servers which do not necessarily communicate nor trust each
other.  I am not sure if this is just a pathological case or something
that we need to worry about.

                       ________
                      /        \
                     |  Client  |
                      \________/
           Lease      //|     |\\    Device write
                   |//          \\|
                ________          ______
               /        \        /      \
              |  Memory  |      | Device | DMA operation 
              |  Server  |      | Driver | ======>
               \________/        \______/
                   ^|
         Lease     || 
                   |v
                ________
               /        \
              | Physical |
              |  Memory  |
              |  Server  |
               \________/

              Figure 2: Writing to a device

Since physical memory is now external to the servers, there is a
significant problem with DMA--not so much with reading, but with
writing.  Consider figure 2.  Here, a client wants to write a page to
the device.  It calls an RPC and maps the page into the device driver.
The device driver wants to do a DMA operation, however, this
(generally) means providing the physical address of the page and also,
making sure that the memory is pinned during the operation.  The
device driver, however, has no easy way to do either of these--it must
contact the client, then the memory server and then the physical
memory server.  We could copy the memory to avoid this, however, this
can become quite expensive.  The L4 group has run into this problem
with Sawmill and Prime (a new multiserver OS).  They are currently
investigating the use of translation/pinning tables that are shared
among the parties and have a trust relationship that traces back to
the physical memory server.
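
Purely as a speculative sketch of the direction (nothing here
reflects the actual tables the L4 group is investigating), an entry
in such a shared translation/pinning table might carry the few facts
the driver and the physical memory server both need:

    /* Speculative sketch of one entry in a shared translation/pinning
       table; all fields and names are hypothetical.  */
    #include <stdint.h>

    struct pin_entry
    {
      uintptr_t virt;     /* Client virtual address of the page.  */
      uintptr_t phys;     /* Physical address the driver can hand to
                             the DMA engine.  */
      uint32_t  pinned;   /* Non-zero while a transfer is in flight;
                             the physical memory server must not
                             reclaim or relocate the frame.  */
      uint32_t  owner;    /* Task accountable for the pin under its
                             contract with the physical memory
                             server.  */
    };

The driver fills in phys and sets pinned before starting the DMA; the
physical memory server honors the pin because the table's chain of
trust leads back to it.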


Conclusions
--- --- - -

I have presented much of this proposal in front of a large part of the
L4 group.  They were quite warm to the model; they warned, however,
that integrating the driver framework would be the most difficult
part, especially with respect to DMA, as has already been mentioned.

In this discussion, I have not included a proposal for the actual
interfaces for a simple reason: I only have a few rough sketches
which are not yet mature enough to share.  More importantly, however,
I would like some feedback from the rest of the community to be sure
that we at least agree on the general direction.  In the coming days,
I will continue to investigate other systems, especially V++, as I
try to build a coherent interface.

Comments and criticisms are more than welcome.

Thanks.


Bibliography
--- ---- - -

[1] Engler, Dawson R., M. Frans Kaashoek and James W. O'Toole.
    Exokernel: An Operating System Architecture for Application-Level
    Resource Management. 1995.
    http://www.pdos.lcs.mit.edu/~engler/sosp-95.ps

[2] Loepere, Keith. Server Writer's Guide. 1992.
    http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/mach/public/www/doc/osf.html

[3] McNamee, Dylan. Extending the Mach External Pager Interface to
    Accommodate User-Level Page Replacement Policies. 1990.
    http://citeseer.nj.nec.com/mcnamee90extending.html

[4] Levy, Henry M. and Peter H. Lipman.  Virtual Memory Management in
    the VAX/VMS Operating System, IEEE Computer, 15(3):35--41, March
    1982.

[5] Harty, Kieran and David R. Cheriton.  Application-Controlled
    Physical Memory using External Page-Cache Management. 1992.
    http://citeseer.nj.nec.com/harty92applicationcontrolled.html


