[Gzz-commits] manuscripts/storm article.rst


From: Benja Fallenstein
Subject: [Gzz-commits] manuscripts/storm article.rst
Date: Fri, 07 Feb 2003 22:56:49 -0500

CVSROOT:        /cvsroot/gzz
Module name:    manuscripts
Changes by:     Benja Fallenstein <address@hidden>      03/02/07 22:56:49

Modified files:
        storm          : article.rst 

Log message:
        More; section 3 looks ok now.

CVSWeb URLs:
http://savannah.gnu.org/cgi-bin/viewcvs/gzz/manuscripts/storm/article.rst.diff?tr1=1.109&tr2=1.110&r1=text&r2=text

Patches:
Index: manuscripts/storm/article.rst
diff -u manuscripts/storm/article.rst:1.109 manuscripts/storm/article.rst:1.110
--- manuscripts/storm/article.rst:1.109 Fri Feb  7 18:27:22 2003
+++ manuscripts/storm/article.rst       Fri Feb  7 22:56:49 2003
@@ -288,20 +288,67 @@
 
 In Storm, all data is stored
 as *blocks*, byte sequences identified by a SHA-1 cryptographic content hash 
-[ref SHA-1 and our ht'02 paper]. Blocks have a similar granularity
+[ref SHA-1 and our ht'02 paper]. 
+Being purely a function of a block's content, block ids
+are completely independent of network location.
+Blocks have a similar granularity
 as regular files, but they are immutable, since any change to the
 byte sequence would change the hash (and thus create a different block).
 Mutable data structures are built on top of the immutable blocks
-(see Section 6). 
+(see Section 6).
+
+When used in a network environment, such ids do not provide
+a hint as to where a specific block is stored.
+However, many existing peer-to-peer systems could be used to
+find arbitrary blocks in a location-independent fashion; 
+for example, Freenet [ref], recent Gnutella-based clients 
+(e.g. Shareaza [ref]), and Overnet/eDonkey2000 [ref] 
+also use SHA-1-based identifiers [e.g. ref: magnet uri]. 
+(However, we have not put a network 
+implementation into regular use yet and thus can only describe our 
+design, not report on implementation experience.
+We discuss peer-to-peer implementations in Section 7, below.)
+
+Storm blocks are MIME messages [ref MIME], i.e., objects with
+a header and body as used in Internet mail or HTTP.
+This allows them to carry any metadata that can be carried
+in a MIME header, most importantly a content type.
 
 Storing data in immutable blocks may seem strange at first, but
-has a number of advantages. Firstly, it makes it easy
+has a number of advantages. First of all, it makes identifiers
+self-certifying: no matter where we have downloaded a block from,
+we are able to check we have the correct data by checking
+the cryptographic hash in the identifier. Therefore, we can 
+safely download blocks from an untrusted peer.
+
+While digital signatures also allow for self-certifying identifiers,
+they raise the need for a public-key infrastructure (PKI)
+and for a timestamping mechanism in order to be reliable
+for more than a short time.
+
+When we make a reference to a block, we can be sure
+that even the original author of the target will not be able 
+to change it. For example, if a newspaper refers to a letter
+to the editor this way, the letter's sender won't be able to change 
+the reference into an advertisement for a pornographic web page.
+
+Secondly, caching becomes trivial, since it is
+never necessary to check for new versions of blocks. It is easy
 to replicate data between systems: A replica of a block never
 needs to be updated; cached copies can be kept as long as desired.
 When a document is replicated, different versions of it can
 coexist on the same system without naming conflicts, since
 each version will be stored in its own block with its own id.
 
+If peers make the blocks in their caches available on the network,
+the flash crowd problem could be alleviated: The more users
+request a block, the more locations there are to download it from.
+This resembles e.g. the Squirrel
+web cache [ref] [more refs? -Hermanni]; however, downloads can be
+from *any* peer since the source does not need to be trusted.
+On the other hand, there are privacy 
+concerns with exposing one's browser cache to the outside world.
+
 To replicate all data from computer A
 on computer B, it suffices to copy all blocks from A to B that B
 does not already store. This can be done through a simple 'copy'
@@ -311,28 +358,55 @@
 when a user wants to incorporate a set of changes, but not
 required at replication time.)
 
-Secondly, immutable blocks increase *reliability*. 
+.. On the other hand, for instance, several popular 
+   database management systems (e.g. Lotus Notes [ref]) have complex 
+   replication schemes, which may lead to awkward replication conflicts, 
+   because their data lacks the immutability property. 
+   [Or does this belong to diff section ? -Hermanni]
+
+The same namespace is used for local data and data
+retrieved from the network. When an online document has been
+permanently downloaded to the local hard disk, it can be found
+by a browser just like data from the network. This is convenient 
+for offline browsing, for example in mobile environments:
+After a document has been downloaded, links to it will *never*
+cease to work, online or offline.
+
+Thirdly, immutable blocks increase *reliability*. 
 When saving a document, an application will only *add* blocks,
 never overwrite existing data. When a bug causes an application
 to write malformed data, only the changes from one session
 will be lost; the previous version of the data will still
-be accessible. This makes Storm well suited as a basis
-for implementing experimental projects (such as ours).
+be accessible [#]_. This makes Storm well suited as a basis
+for implementing experimental projects (such as ours, Gzz).
 Even production systems occasionally corrupt existing data
 when an overwriting save operation goes awry; for example,
 one of the authors has had this problem with
 Microsoft Word many times.
 
-.. On the other hand for instance, several popular 
-   database management systems (e.g. Lotus Notes [ref]) have complex 
-   replication schemes, which may led awkward replication conflicts, 
-   because of they lack the immutable properties of data. 
-   [Or does this belong to diff section ? -Hermanni]
-
-Storm blocks are MIME messages [ref MIME], i.e., objects with
-a header and body as used in Internet mail or HTTP.
-This allows them to carry any metadata that can be carried
-in a MIME header, most importantly a content type.
+.. [#] Unfortunately, efficient versioned storage (Section 6)
+   makes matters a little more complicated; still,
+   the basic assertion holds.
+
+Even when a publisher's server fails to serve a block,
+links to it will not cease to work until *no* other peer
+holds a copy. This means that providing mirrors is trivial.
+Even after failure of all of the publisher's mirrors,
+a document may still be available from peers that have
+downloaded it. An archive of published blocks, in the spirit
+of the Web archive [ref], would only be yet another backup;
+normal links to a block would work as long as the archive
+holds a copy. It would also be hard to purposefully remove
+a published document from the network; whether this is
+a good or a bad property we leave for the reader to judge.
+
+These advantages are bought by an utter incompatibility with
+the dominant paradigms of file names and URLs. We hope that
+it would be possible to port existing applications to use Storm
+without too much effort, but we have not investigated
+the issue closely. This is because Storm was developed
+for the experimental Gzz system, a platform explicitly developed
+to overcome the limitations of traditional file-based applications.
 
 Collections of Storm blocks are called *pools*. Pools provide
 the following interface::
@@ -343,53 +417,38 @@
     delete(block)
     
 Implementations may store blocks in RAM, in individual files,
-in a Zip archive, in a database or through other means.
+in a Zip archive, in a database, in a p2p network, 
+or through other means.
 We have implemented the first three (using hexadecimal
 representations of the block ids for file names).
 
-When used in a network environment, Storm IDs do not provide
-a hint as to where a specific block is stored in the network.
-However, many existing peer-to-peer systems could be used to
-find arbitrary blocks in a location independent fashion; 
-for example, Freenet [ref], recent Gnutella-based clients 
-(e.g. Shareaza [ref]), Overnet/eDonkey2000 [ref] also use SHA-1-based identifiers 
-[e.g. ref: magnet uri]. Footnote:However, we have not put a network 
-implementation into regular use yet and thus can only describe our 
-design, not report on implementation experience.
-We discuss peer-to-peer implementations in Section 7, below.
-
-The immutability of blocks should make caching trivial, since it is
-never necessary to check for new versions of blocks.
-Since the same namespace is used for local data and data
-retrieved from the network, online documents that have been
-permanently downloaded to the local harddisk can also be found
-by the caching mechanism. This is convenient for offline browsing,
-for example in mobile environments: Users can download documents
-while they are online, store them locally, and be sure that
-their software will be able to access them as if downloaded
-from the net, without broken links.
-[Previous sentence doesn't parse to me: more simple :( -Hermanni]
-
-Given a peer-to-peer distribution mechanism, it would be possible
-to retrieve blocks from any participating peer online that has a copy
-in its local cache or permanent storage. This is similar to the Squirrel
-web cache [ref] [more refs? -Hermanni], but does not require trust 
-between the peers, since it is possible to check the blocks' integrity by using 
-cryptographic hashes, as used in many peer-to-peer applications 
-(e.g. [ref: ed2k/overnet, shareaza]). Since much-requested blocks would be 
-cached on many systems, a network could deal with hotspots 
-much more easily. On the other hand, there are privacy 
-concerns with exposing one's browser cache to the outside world.
-
-
-[Merge this paragraph with 5) ? -Hermanni]
-That all data is stored in blocks means that links to it
-are completely independent of location; when data is moved
-between servers, references to it do not break. (Footnote: Of course, 
-this requires that the blocks can be found no matter what server
-they are on. Again, see Section 7.)
-
-[Is there disadvantages/issus which we are aware of ? -Hermanni]
+Sometimes it is useful to think about *zones* blocks are in,
+related to distribution policy: for example, a *public*
+zone for blocks served to others in the network, a *private*
+zone for a user's own data, and a *local* zone with blocks
+served to anyone on the local intranet. It is essential
+that blocks from a private zone are not leaked into a public 
+zone without the user's consent. On the other hand, making
+a document available on the network should be as easy as
+hitting a 'publish' button moving it to a public zone,
+possibly also uploading it to a server permanently connected
+to the Internet, if one is available.
+
+Unfortunately, we have not found a satisfactory representation
+of zones yet. In particular, how do we decide which zone
+a new block should be in? Probably in the private zone
+in many cases, but if we have been editing a document
+collaboratively in a workgroup, we would want
+our changes to be available to the same group.
+
+An important open issue with block storage is the choice of
+UI conventions for listing, moving and deleting blocks.
+Currently, the only interface is a file system directory
+containing a set of blocks as files with hexadecimal, 
+random-looking names. In Gzz, we currently trick our way around
+the problem; at startup time, we simply load the most current
+version of a document whose identifier is hard-wired into
+the software (mutable documents are described in section 6.1).
 
 
 4. Xanalogical storage
@@ -474,12 +533,17 @@
 that this will not be a major scalability problem. Otherwise,
 systems that allow range queries, such as skip graphs [ref] 
 and skipnet [ref], may prove useful.
-[how does video work here, i.e. are they huge blocks or collections of many
-small, frames? or a sequence between two keyframes?]
 
-[This might be relevant also:
-http://www.hpl.hp.com/techreports/2002/HPL-2002-209.pdf
--Hermanni]
+.. [how does video work here, i.e. are they huge blocks or collections of many
+   small, frames? or a sequence between two keyframes?]
+
+   Benja says: Probably large, but the number of links to them 
+   still won't be that big, I guess. Not sure how to discuss this
+   in the text; feel free to propose something :-)
+
+.. [This might be relevant also:
+   http://www.hpl.hp.com/techreports/2002/HPL-2002-209.pdf
+   -Hermanni]
 
 One question raised by xanalogical storage is which links to show
 for a popular document that has been linked to by many users.
@@ -502,9 +566,11 @@
 implementation on top of networking overlay (e.g. distributed hashtable)
 will be trivial.
 
-[Benja, this might be useful for defining Storm APIs for DHTs etc: 
-http://sahara.cs.berkeley.edu/jan2003-retreat/ravenben_api_talk.pdf
-Full paper will appear in IPTPS 2003 -Hermanni]
+.. [Benja, this might be useful for defining Storm APIs for DHTs etc: 
+   http://sahara.cs.berkeley.edu/jan2003-retreat/ravenben_api_talk.pdf
+   Full paper will appear in IPTPS 2003 -Hermanni]
+
+   Benja says: Hm, does that belong in the p2p section?
 
 In Storm, applications are not allowed to put arbitrary
 mappings into the index. Instead, applications that want 
@@ -566,7 +632,12 @@
 are able to scale to the load to store a mapping for each word
 occuring in a document.
 
-[There are two refs about keywords in DHTs-- should we ref these ? -Hermanni]
+.. [There are two refs about keywords in DHTs-- should we ref these ? -Hermanni]
+
+   Not sure how applicable they are: our system is *not*
+   as general or performant as a DHT (as explained above).
+   Should read & find out whether they could be implemented
+   through our index system at all... -b
 
 
 6. Versioning
@@ -582,7 +653,8 @@
 mapping pointer identifiers to blocks providing targets for that pointer.
 Through this mechanism, we can keep old versions of documents
 along with the current versions.
-[Figure ? -Hermanni]
+
+.. [Figure ? -Hermanni]
 
 Secondly, in the spirit of version control systems like CVS,
 we do not store *each version*, but only the differences between versions.
@@ -591,7 +663,8 @@
 When we want to access a particular version, we reconstruct it
 using the differences, and then check the result using
 the cryptographic hash in the full version's block id.
-[Figure ? -Hermanni]
+
+.. [Figure ? -Hermanni]
 
 6.1. Pointers
 -------------
@@ -932,4 +1005,16 @@
 9. Conclusions
 ==============
 
-XXX
\ No newline at end of file
+XXX
+
+
+10. Acknowledgements
+====================
+
+We would like to thank Sarah Stehlig for discussions.
+
+.. (Add others here. Sarah pointed out the privacy issues with
+   exposing one's cache-- duh, but I for one hadn't thought
+   about it! And the Squirrel paper doesn't mention it either.
+   Probably it's pointed out somewhere in the literature--
+   too basic a problem not to have been noticed-- but still... -b)
\ No newline at end of file
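
A minimal Python sketch of the block scheme described in the patched
section 3 may help readers of this diff. It is not taken from the Gzz
sources: the names block_id, store, verify and replicate are hypothetical,
and a plain dict stands in for a pool. It assumes only what the text
states: a block id is the SHA-1 hash of the block's bytes, a downloaded
block can be checked against its id, and replication copies only the
blocks the target does not already hold. The MIME header that real Storm
blocks carry is ignored here; the sketch hashes raw bytes.

```python
import hashlib


def block_id(data: bytes) -> str:
    """Return the hexadecimal SHA-1 hash of a block's bytes (hypothetical helper)."""
    return hashlib.sha1(data).hexdigest()


def store(pool: dict, data: bytes) -> str:
    """Add a block to a pool (here just a dict) and return its id.

    Blocks are only ever added; storing identical bytes twice
    yields the same id and leaves the pool unchanged.
    """
    bid = block_id(data)
    pool[bid] = data
    return bid


def verify(expected_id: str, data: bytes) -> bool:
    """Check that bytes received from any peer match the requested id."""
    return block_id(data) == expected_id


def replicate(source: dict, target: dict) -> int:
    """Copy every block from source that target does not already store.

    Each block is verified before it is accepted, so the source
    does not need to be trusted. Returns the number of blocks copied.
    """
    copied = 0
    for bid, data in source.items():
        if bid in target:
            continue  # immutable blocks never need updating
        if not verify(bid, data):
            raise ValueError("block %s failed its hash check" % bid)
        target[bid] = data
        copied += 1
    return copied
```

Running replicate twice in a row copies nothing the second time, which is
the sense in which a replica of a block never needs to be updated.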



