gzz-dev

[Gzz] Storm blocks and metadata (Re: P2P and RDF)


From: Benja Fallenstein
Subject: [Gzz] Storm blocks and metadata (Re: P2P and RDF)
Date: Tue, 25 Mar 2003 11:22:34 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030319 Debian/1.3-3


Hi Reto,

[drifting towards off-topic, but leaving on www-rdf-interest for now because it still concerns the use of RDF]

Reto Bachmann-Gmuer wrote:
I think this is a very good approach; you could use freenet content-hash URIs to identify the blocks.

We'll probably register our own URN namespace, among other reasons because we want to use 'real,' registered URIs. (We're also considering putting a MIME content type in the URI, so that a block served up through our system would be basically as useful as a file retrieved through HTTP, and so that we could easily serve blocks through an HTTP proxy, too. Not yet decided, though-- some people I've contacted commented that MIME types do not belong in URIs.)

hmm, I don't see the analogy with http, since http-urls should not contain a content-type indicator, but should leave it to browser and server to negotiate the best deliverable content type. Of course your case is different, since your uri immutably references a sequence of bytes.

Yes, that would have been my argument. However, you make a good point below: If we refer to an RDF 'metadata' block containing the URI of the actual block, we can include references to alternative versions-- even allowing some degree of content negotiation. This is something I have to mull about :-)

I strongly disagree with putting the mime-type into the url, because the mime type is meta information for which I see no reason to be treated differently from other meta-information.

It is necessary for the interpretation of the data we get, and it's usually easy to agree on (people won't too often assign different mime types to the same bytes). One thing about content hashes is that when two people put the same file into a hash-based system, they will use the same identifier for it. With MIME types, that's still pretty much true; with more elaborate metadata, it isn't.

Using the same identifier is important for queries like, "Which documents include this image?" If the three documents that use the image use three different kinds of IDs for it (because they refer to three different kinds of metadata), you're out of luck.
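To make the convergence point concrete, here is a minimal sketch in Python. The "urn:foo:content-hash:" namespace and the choice of SHA-1 are placeholders mirroring the hypothetical names used later in this thread, not the actual Storm scheme:

```python
import hashlib

def content_hash_urn(data: bytes) -> str:
    """Derive a content-addressed URN from raw bytes.

    The 'urn:foo:content-hash:' prefix and SHA-1 are illustrative
    assumptions; a real deployment would use its own registered
    namespace and hash choice.
    """
    digest = hashlib.sha1(data).hexdigest()
    return "urn:foo:content-hash:" + digest

# Two people independently storing the same bytes derive the same URN,
# so references from different documents converge on one identifier.
alice = content_hash_urn(b"the same image bytes")
bob = content_hash_urn(b"the same image bytes")
assert alice == bob
```

This is exactly the property that richer per-file metadata would destroy: the identifier depends only on the bytes, so queries like "which documents include this image?" can match across independent publishers.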

Rather theoretically: is it possible that the same sequence of bytes (a block) represents different contents when interpreted with a different mime-type/encoding? Should the host then store the block twice?

Up to the host. Since it *is* rather unlikely, I don't think there would be big penalties to storing the block twice in this case. I wouldn't do it anyway, but for a different reason: Other systems do not include the MIME type in their hash-based identifiers, and we should be able to find blocks and serve them to those systems even when we do not know the MIME type.

Higher level applications should not use block-uris anyway but deal with an abstraction representing the content (like http urls should).

You mean as in, with content negotiation applied? You use a single URI which maps to different representations of the same resource?

An example to be more explicit:
<urn:urn-5:G7Fj> <DC:title> "Ulisses"
<urn:urn-5:G7Fj> <DC:description> "bla bli"

This, for example, I would not include here. :-) Firstly, it is something I would want to be versioned independently: if I change the description of an image, that should not create a new version of the image. Secondly, I don't see a reason why the URI of the image would need to refer to this. Thirdly, I don't think that the time when a file is put into the system-- and thus given its identifier-- is necessarily the time to create this kind of metadata. It would seem to hold up the task at hand. Rather, I'd like to be able to add it later on, and maybe someone else can do that even better than me-- like a librarian who has a scientific background in creating metadata about stuff.

It seems like you could easily put this data in another block without losing much (assuming that the second block could be easily found through an appropriate query).

<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash: jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash: Dj&/fjkZRT68>
<urn:urn-5:lG5d> <ex:englishVersion> <urn:urn-5:G7Fj>
<urn:urn-5:lG5d> <ex:spanishVersion> <urn:urn-5:kA2L>

These, on the other hand, are very good cases, because they can be used by the computer in ways that require a certain level of trust: We want to retrieve only the data that the referrer intended to be retrieved, and we want to be able to check this cryptographically-- so this actually needs to be part of what we protect cryptographically.
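As a sketch of the kind of cryptographic check meant here (the URN layout and the use of SHA-1 are illustrative assumptions, not the actual Storm design):

```python
import hashlib

def verify_block(urn: str, data: bytes) -> bool:
    """Check that bytes fetched from an untrusted peer really match
    the content-hash URN they were requested under.

    Assumes an illustrative 'urn:foo:content-hash:<sha1-hex>' layout.
    """
    expected = urn.rsplit(":", 1)[-1]
    return hashlib.sha1(data).hexdigest() == expected

good = b"the referenced bytes"
urn = "urn:foo:content-hash:" + hashlib.sha1(good).hexdigest()
assert verify_block(urn, good)            # data matches the reference
assert not verify_block(urn, b"spoofed")  # tampered data is rejected
```

Because the reference itself carries the hash, a peer cannot substitute different content without the substitution being detectable; that is why the encoding links above need to live inside the cryptographically protected part of the reference.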

One technical side note, though. We'd have two types of URIs, something like,

    urn:foo:content-hash:jv24kt5
    urn:foo:ref:rs53h85p

The first would be just a plain byte stream identified by a content hash. The second would be a content hash, too, but we'd know that the target should be interpreted as an RDF file with data like you give above. Now, when we retrieve this block, we need to know at which node to start looking for the block we're interested in, so I think we'd need to write this as something like,

<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash: jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash: Dj&/fjkZRT68>
<> <ex:englishVersion> <urn:urn-5:G7Fj>
<> <ex:spanishVersion> <urn:urn-5:kA2L>

I.e., "this resource" is <> (the empty URI reference) and we start traversing the graph from there.
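A small sketch of that traversal, with the triples held as plain tuples (whitespace in the example hashes dropped; all URIs and predicate names are illustrative):

```python
# A ref block as (subject, predicate, object) triples;
# "" stands for <>, the empty URI reference meaning "this resource".
REF_BLOCK = [
    ("urn:urn-5:G7Fj", "ex:type", "ex:text"),
    ("urn:urn-5:G7Fj", "ex:utf8-encoding", "urn:content-hash:jhKHUL7HK"),
    ("urn:urn-5:G7Fj", "ex:latin1-encoding", "urn:content-hash:Dj&/fjkZRT68"),
    ("", "ex:englishVersion", "urn:urn-5:G7Fj"),
    ("", "ex:spanishVersion", "urn:urn-5:kA2L"),
]

def follow(triples, subject, predicate):
    """Return the first object for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None

# Start traversal at <> and pick the English version in UTF-8.
version = follow(REF_BLOCK, "", "ex:englishVersion")
block = follow(REF_BLOCK, version, "ex:utf8-encoding")
# block now names the content-hash block to actually fetch
```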

I found another use case for RDF metadata: Creative Commons licenses. It would make sense to me if this would be part of the reference, allowing the computer to automatically conclude how data may be copied and used.

In this example, applications should reference "urn:urn-5:G7Fj" (which does not have a mime type) rather than "urn:content-hash: Dj&/fjkZRT68" (which has a mime type in a specific context) wherever possible; in many cases an even higher abstraction, "urn:urn-5:lG5d", can be used.

Um, using a urn-5 doesn't work since it's just a random number-- if we use just a random number, we cannot check whether the data we may retrieve from a p2p network is really what the person making the reference wanted us to see. We would need to use "urn:foo:ref:[blah]", which would be the above RDF data, from which we could then get the specific representation.

While you can only deficiently use http to serve a block,

Why?

you could serve the URIs of both abstractions (urn:urn-5:G7Fj and urn:urn-5:lG5d) directly using HTTP 1.1 features.

(Again, you'd have to use hashes, or you could be arbitrarily spoofed.)

But am I right that this makes rdf-literals obsolete for everything but small decimals?

Hm, why? :-)

well, why use literal if you can make a block out of it, shortening queries and unifying handling?

Ah, that depends on many factors. Speed is one; you may need to load a lot of blocks to get the data for all the literals in a graph. Also, if we store each block as a file on a file system, there are some file systems that perform badly when faced with a large number of really small files.

And how do you split the metadata in blocks

Well, depends very much on the application. How do you split metadata into files? :-)

Not at all ;-). Splitting into files is rudimentarily represented metadata; if you use RDF, the filesystem is a legacy application.

Um, but if you put metadata on an http server, you split it too?

The rule of thumb is: Split it into units you would want to transfer independently. E.g., in Annotea, you would make one block = one annotation. When putting email into RDF, you might make one block = one email. You might want to put your FOAF data in one block. If you have metadata about many documents, you might make a metadata block for each document you process. If you publish your personal TV recommendations each week, you'd make one block each week.

Of course, if the granularity doesn't fit the task at hand-- you want to send a friend all love story recommendations of the last year-- the computer can split up those blocks automatically and reassemble them in a different way. It's just that for many applications a certain granularity fits usage patterns pretty well-- for example, you'd most of the time transmit an annotation as a whole. Then, if you've downloaded an annotation once, you never need to download it again (that's one of the benefits of putting them in blocks: you can cache them indefinitely).

So anyway, there are a number of reasons why we need to do powerful queries over a set of Storm blocks. For example, since we use hashes as the identifiers for blocks, we don't have file names as hints to humans about their content; instead, we'll use RDF metadata, stored in *other* blocks. As a second example, on top of the unchangeable blocks, we need to create a notion of updateable, versioned resources. We do this by creating metadata blocks saying e.g., "Block X is the newest version of resource Y as of 2003-03-20T22:29:25Z" and searching for the newest such statement.
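A minimal sketch of that "newest such statement" query, with the assertions as plain tuples (the resource names and version URNs are made up for illustration):

```python
# Hypothetical "current version" assertions, each coming from one
# metadata block: (resource, version_block, timestamp).
ASSERTIONS = [
    ("urn:res:Y", "urn:foo:content-hash:v1", "2003-03-18T10:00:00Z"),
    ("urn:res:Y", "urn:foo:content-hash:v2", "2003-03-20T22:29:25Z"),
    ("urn:res:Z", "urn:foo:content-hash:w1", "2003-03-19T09:30:00Z"),
]

def current_version(assertions, resource):
    """Pick the version named by the newest assertion about `resource`."""
    matching = [(ts, block) for res, block, ts in assertions
                if res == resource]
    if not matching:
        return None
    # ISO 8601 timestamps with a fixed 'Z' suffix sort lexicographically,
    # so the tuple max is the newest assertion.
    return max(matching)[1]
```

The immutable blocks never change; "updating" a resource just means publishing one more assertion block and letting the query pick the newest.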

I don't quite understand: isn't there a regression problem if the metadata is itself contained in blocks? Or is at least the timestamp of a block something external to the blocks?

A metadata block does not usually have a 'second-level' metadata block with information about the first metadata block, if you mean that;

Say you want to change the description of an entity, not just add a new one; I think you should then state in another metadata block that the old one is wrong (in the era starting now ;-)).

"Not usually" just meant that *most* metadata blocks do not have a second-level metadata block, in case you were worried that we'd need an infinite number of metametametameta blocks otherwise :)

no, timestamps are not external to the blocks.

When the user synchronizes his laptop with the home PC, I guess the metadata may be contradictory; I thought that with an external timestamp, contradictions could be handled (the newer one is the right one). If the timestamp is part of the metadata, the application should probably enforce it (while generally giving the user the maximum power to make all sorts of metadata constructs).

The timestamp is on the assertion, "Block X is the newest version of resource Y," and it gives the time when the user said X is the current version (i.e., when the user saved the document). If the user saves the document on the desktop, and then on the laptop, that would be different saves, made at different times, so the timestamps wouldn't be contradictory: they would simply be the timestamps of two different things.

(There's another problem in this scenario, though: If the user edited a document independently on desktop and laptop, it wouldn't be nice if the version saved later would supersede the other one; rather, the changes from both should be merged. We actually use a slightly different system for synchronization of independent systems; instead of storing a timestamp, we store a list of obsoleted versions... but that's leading us astray here :-) )
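The obsoleted-versions idea can be sketched like this (version names are illustrative): versions that no other version claims to supersede are the current "heads," and more than one head signals independent edits that should be merged rather than letting the later save silently win:

```python
# Each saved version records which earlier versions it supersedes.
VERSIONS = {
    "v1": [],          # original
    "v2": ["v1"],      # saved on the desktop
    "v3": ["v1"],      # saved on the laptop, unaware of v2
}

def heads(versions):
    """Return the versions that nothing else obsoletes."""
    obsoleted = {old for supersedes in versions.values()
                 for old in supersedes}
    return sorted(v for v in versions if v not in obsoleted)

# Two heads: the desktop and laptop edits were concurrent, so a
# merge is called for instead of a timestamp race.
assert heads(VERSIONS) == ["v2", "v3"]
```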

- Benja




