gnu-arch-users
[Gnu-arch-users] user space file systems


From: Thomas Lord
Subject: [Gnu-arch-users] user space file systems
Date: Tue, 10 Jan 2006 10:18:52 -0800

  Building a user space file system is a good idea for lots of
  reasons.  But how?

  Don't use an RDBMS, persistent hash table, or other database
  middleware as a back end.  Sure, their support for transactions helps
  a little, and they do offer portable APIs to native storage.
  Unfortunately, they also come with a lot of code and baggage for
  functionality that isn't really needed, and the APIs they provide
  aren't a particularly natural target language for implementing a
  POSIX-style file system.

  Why not make a portable library whose API resembles an idealized,
  simplified raw disk?  That should be easy to simply rewrite for
  every platform one wants to port to;  it should give good
  performance;  it is a very natural target for writing a file system
  implementation.  (And if you *must* use a database -- build this
  API on top of that.)

  Here's such an API that can be implemented in about 610 lines of
  code on POSIX.  I can't imagine it would be any harder on Windows
  using native calls there.


* The API

  /* Pages are 1KB (2^10 bytes)
   */
  #define c_vudev_page_size_bits        ((unsigned int)10)
  #define c_vudev_page_size     ((size_t)1 << c_vudev_page_size_bits)

  
  /* There are 2^32 pages -- theoretical 4TB capacity.
   */
  #define c_vudev_page_addr_bits        ((unsigned int)32)
  #define c_vudev_max_page_addr ((t_vudev_page_addr)0xffffffff)
  typedef t_uint32 t_vudev_page_addr;
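
  To make the arithmetic behind those constants concrete, here is a
  small sketch.  It assumes `t_uint32' is an exact 32-bit unsigned
  type (I use `uint32_t').  The helpers `page_byte_offset',
  `page_of_byte' and `device_capacity_bytes' are names made up for
  illustration; they are not part of the API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: t_uint32 is an exact 32-bit unsigned type. */
typedef uint32_t t_uint32;
typedef t_uint32 t_vudev_page_addr;

#define c_vudev_page_size_bits  ((unsigned int)10)
#define c_vudev_page_size       ((size_t)1 << c_vudev_page_size_bits)
#define c_vudev_page_addr_bits  ((unsigned int)32)
#define c_vudev_max_page_addr   ((t_vudev_page_addr)0xffffffff)

/* Hypothetical helper: total addressable bytes (2^32 pages of 2^10 bytes). */
uint64_t device_capacity_bytes (void)
{
  return ((uint64_t)c_vudev_max_page_addr + 1) << c_vudev_page_size_bits;
}

/* Hypothetical helper: byte offset of the first byte of page `addr'.
 * Widen before shifting, since the result needs up to 42 bits.
 */
uint64_t page_byte_offset (t_vudev_page_addr addr)
{
  return (uint64_t)addr << c_vudev_page_size_bits;
}

/* Hypothetical helper: the page that contains byte `offset'. */
t_vudev_page_addr page_of_byte (uint64_t offset)
{
  assert (offset < device_capacity_bytes ());
  return (t_vudev_page_addr)(offset >> c_vudev_page_size_bits);
}
```

  With these definitions, page 1 starts at byte 1024 and the last
  page starts 1024 bytes short of 2^42.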

    
  /* Client programs "connect" to virtual disks.
   */
  typedef <unspecified-pointer-type> t_vudev_connection;


  /* A "chunk" is a (virtual) DMA area for transfers between
   * the client and the raw virtual disk.
   */
  typedef <unspecified-pointer-type> t_vudev_chunk;


  The location of a virtual disk image (e.g., a file on a native 
  system where that file contains the complete virtual file system)
  is specified by a URI:


    * int vudev_create_device 
           (const t_uchar ** const err,
            const t_uchar * const uri,
            t_vudev_page_addr const n_control_pages);

        Create and initialize a new virtual disk.

        `n_control_pages' is a performance hint.  The implementation
        should try to make access to pages `0..(n_control_pages - 1)'
        as fast as possible.




    * t_vudev_connection vudev_connect (const t_uchar ** const err,
                                         const t_uchar * const uri);

        Connect to an existing virtual disk.


    * t_vudev_connection vudev_dup (const t_uchar ** const err,
                                     t_vudev_connection conn);

        Copy a connection.  (May return the argument connection in 
        which case connections are reference counted.)


    * int vudev_disconnect (const t_uchar ** const err,
                             t_vudev_connection conn);

        Terminate a connection.


    * int vudev_write_lock (const t_uchar ** const err,
                            t_vudev_connection const conn);
    * int vudev_write_unlock (const t_uchar ** const err,
                              t_vudev_connection const conn,
                              t_uchar * const control_pages);
    * t_uchar * vudev_read_lock (const t_uchar ** const err,
                                  t_vudev_connection const conn);
    * int vudev_read_unlock (const t_uchar ** const err,
                              t_vudev_connection const conn,
                              t_uchar * const control_pages);

        Begin/end a write/read transaction.

        Transactions may not be nested, promoted, or demoted.
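
        One plausible way to get these semantics on POSIX is a
        whole-file `fcntl' lock on the image plus a little state in
        the connection.  This is only a sketch under that assumption:
        `struct toy_conn', `toy_begin', `toy_end' and `image_lock'
        are hypothetical names, and handling of the control pages is
        left out entirely.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

enum txn_state { TXN_NONE, TXN_READ, TXN_WRITE };

/* Hypothetical connection: the image's descriptor plus transaction state. */
struct toy_conn {
  int fd;
  enum txn_state state;
};

/* Apply F_RDLCK, F_WRLCK or F_UNLCK to the whole image file, blocking
 * until granted.  Note that fcntl locks are advisory and belong to the
 * process, so they do not exclude other threads of the same process.
 */
int image_lock (int fd, int type)
{
  struct flock fl = { 0 };

  fl.l_type = (short)type;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;                 /* 0 means "through end of file" */
  return fcntl (fd, F_SETLKW, &fl);
}

/* Begin a read or write transaction.  Returns 0, or -1 on error. */
int toy_begin (struct toy_conn * c, enum txn_state want)
{
  assert (want == TXN_READ || want == TXN_WRITE);
  if (c->state != TXN_NONE)
    return -1;                  /* no nesting, promotion or demotion */
  if (image_lock (c->fd, want == TXN_READ ? F_RDLCK : F_WRLCK) < 0)
    return -1;
  c->state = want;
  return 0;
}

/* End the transaction of kind `held'.  Returns 0, or -1 on error. */
int toy_end (struct toy_conn * c, enum txn_state held)
{
  if (held == TXN_NONE || c->state != held)
    return -1;                  /* unlock must match the lock */
  if (image_lock (c->fd, F_UNLCK) < 0)
    return -1;
  c->state = TXN_NONE;
  return 0;
}
```

        The state check matters because `fcntl' itself would
        silently convert a read lock into a write lock (or back) if
        asked, which is exactly the promotion the API forbids.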



    * t_vudev_chunk vudev_chunk (const t_uchar ** const err,
                                  t_vudev_connection const conn,
                                  t_vudev_page_addr addr,
                                  t_vudev_page_addr n_pages);
    * t_uchar * vudev_chunk_data (const t_uchar ** const err,
                                  t_vudev_connection const conn,
                                  t_vudev_chunk const chunk);
    * t_vudev_page_addr vudev_chunk_addr (const t_uchar ** const err,
                                          t_vudev_connection const conn,
                                          t_vudev_chunk const chunk);
    * t_vudev_page_addr vudev_chunk_n_pages (const t_uchar ** const err,
                                             t_vudev_connection const conn,
                                             t_vudev_chunk const chunk);
    * int vudev_chunk_dirty (const t_uchar ** const err,
                              t_vudev_connection const conn,
                              t_vudev_chunk const chunk);
    * int vudev_chunk_stale (const t_uchar ** const err,
                             t_vudev_connection const conn,
                             t_vudev_chunk const chunk);
    * int vudev_free_chunk (const t_uchar ** const err,
                             t_vudev_connection const conn,
                             t_vudev_chunk const chunk);


        Actual I/O is performed by modifying buffers which may or may
        not be active DMA areas.   A chunk is a handle for a buffer
        for an arbitrary choice of contiguous pages.   Multiple 
        overlapping chunks may concurrently exist.

        If a chunk is allocated during a read transaction its
        initial data is consistent with the state of the disk
        during that transaction.   If a chunk is left over from a 
        previous transaction, its data may be invalid unless
        the chunk is passed to `vudev_chunk_stale'.

        Writing is accomplished by modifying chunk data and, after
        the modifications, calling `vudev_chunk_dirty'.

        When no longer needed, `vudev_free_chunk' releases a chunk.

        Clients should assume that there is a performance penalty
        for concurrently overlapping chunks and for very large
        chunks.


    * int vudev_sync (const t_uchar ** const err,
                      t_vudev_connection const conn);

        Wait until all chunks marked dirty have reached stable 
        storage.
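
        To illustrate the chunk and sync operations above, here is a
        deliberately naive sketch in which a chunk is a private
        malloc'd copy of its pages and the image is an ordinary
        file.  Everything named `toy_...' is hypothetical and only
        stands in for the corresponding `vudev_...' call; the real
        signatures (error strings, connections) are not modeled, and
        a 64-bit `off_t' is assumed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_PAGE_BITS 10        /* same value as c_vudev_page_size_bits */
#define TOY_LEN(c) ((size_t)(c)->n_pages << TOY_PAGE_BITS)
#define TOY_OFF(c) ((off_t)(c)->addr << TOY_PAGE_BITS)

/* Hypothetical chunk: a private in-memory copy of contiguous pages. */
struct toy_chunk {
  uint32_t addr;                /* first page of the chunk */
  uint32_t n_pages;             /* number of contiguous pages */
  unsigned char * data;         /* TOY_LEN bytes */
};

/* Open (creating if needed) the native file that holds the image. */
int toy_open_image (const char * path)
{
  return open (path, O_RDWR | O_CREAT, 0600);
}

/* Stand-in for `vudev_chunk_stale': reload the buffer from the image.
 * Pages past end-of-file read as zeros.  Returns 0, or -1 on error. */
int toy_chunk_stale (int fd, struct toy_chunk * c)
{
  size_t done = 0;

  assert (c != NULL && c->data != NULL);
  memset (c->data, 0, TOY_LEN (c));
  while (done < TOY_LEN (c)) {
    ssize_t got = pread (fd, c->data + done, TOY_LEN (c) - done,
                         TOY_OFF (c) + (off_t)done);
    if (got < 0)
      return -1;
    if (got == 0)
      break;                    /* past end-of-file: the rest stays zero */
    done += (size_t)got;
  }
  return 0;
}

/* Stand-in for `vudev_chunk': allocate a buffer and fill it. */
struct toy_chunk * toy_chunk (int fd, uint32_t addr, uint32_t n_pages)
{
  struct toy_chunk * c = n_pages ? malloc (sizeof *c) : NULL;

  if (!c)
    return NULL;
  c->addr = addr;
  c->n_pages = n_pages;
  c->data = malloc (TOY_LEN (c));
  if (!c->data || toy_chunk_stale (fd, c) < 0) {
    free (c->data);
    free (c);
    return NULL;
  }
  return c;
}

/* Stand-in for `vudev_chunk_dirty'.  This toy writes through at once;
 * a real implementation could just queue the pages until sync. */
int toy_chunk_dirty (int fd, const struct toy_chunk * c)
{
  size_t done = 0;

  while (done < TOY_LEN (c)) {
    ssize_t put = pwrite (fd, c->data + done, TOY_LEN (c) - done,
                          TOY_OFF (c) + (off_t)done);
    if (put <= 0)
      return -1;
    done += (size_t)put;
  }
  return 0;
}

/* Stand-in for `vudev_free_chunk'. */
void toy_free_chunk (struct toy_chunk * c)
{
  if (c)
    free (c->data);
  free (c);
}

/* Stand-in for `vudev_sync': wait for written data to reach the disk. */
int toy_sync (int fd)
{
  return fsync (fd);
}
```

        Because each toy chunk is a private copy, two overlapping
        chunks do not see each other's writes until the older one is
        refreshed with `toy_chunk_stale'.  That is the situation the
        staleness rule describes.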


  The transactional semantics of this API are relatively weak:

  The contents of chunk data are undefined except during a read or
  write transaction.

  If a chunk was not allocated during the current transaction its
  contents are invalid for reading until the chunk is passed to 
  `vudev_chunk_stale'.

  If a chunk is modified during a write transaction then the 
  chunk *must* be passed to `vudev_chunk_dirty' after the
  modifications but before the end of the transaction.

  Concurrent writes to a page produce undefined results.

  Data written into a chunk may reach stable storage at any time after
  it is first written but before the next call to `vudev_sync'
  completes.  Data may reach stable storage in any order consistent
  with the `vudev_sync' constraint.

  A crash *during* `vudev_sync' leaves the contents of all
  pages written since the previous `vudev_sync' undefined.








