gzz-dev

[Gzz] Storm blocks and metadata (Re: P2P and RDF)


From: Benja Fallenstein
Subject: [Gzz] Storm blocks and metadata (Re: P2P and RDF)
Date: Tue, 25 Mar 2003 11:22:34 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030319 Debian/1.3-3


Hi Reto,

[drifting towards off-topic, but leaving on www-rdf-interest for now because it still concerns the use of RDF]

Reto Bachmann-Gmuer wrote:
I think this is a very good approach; you could use freenet content-hash URIs to identify the blocks.

We'll probably register our own URN namespace, among other reasons because we want to use 'real,' registered URIs. (We're also considering putting a MIME content type in the URI, so that a block served up through our system would be basically as useful as a file retrieved through HTTP, and so that we could easily serve blocks through an HTTP proxy, too. Not yet decided, though-- some people I've contacted commented that MIME types do not belong in URIs.)

hmm, I don't see the analogy with http, since http-urls should not contain a content-type indicator, but should leave it to browser and server to negotiate the best deliverable content type. Of course your case is different, since your uri immutably references a sequence of bytes.

Yes, that would have been my argument. However, you make a good point below: If we refer to an RDF 'metadata' block containing the URI of the actual block, we can include references to alternative versions-- even allowing some degree of content negotiation. This is something I have to mull about :-)

I strongly disagree with putting the mime-type into the url, because the mime type is meta information for which I see no reason to be treated differently from other meta-information.

It is necessary for the interpretation of the data we get, and it's usually easy to agree on (people won't too often assign different mime types to the same bytes). One thing about content hashes is that when two people put the same file into a hash-based system, they will use the same identifier for it. With MIME types, that's still pretty much true; with more elaborate metadata, it isn't.

Using the same identifier is important for queries like, "Which documents include this image?" If the three documents that use the image use three different kinds of IDs for it (because they refer to three different kinds of metadata), you're out of luck.
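To make the convergence point concrete, here is a minimal sketch in Python. The "urn:foo:content-hash:" namespace and the choice of SHA-1 are placeholders mirroring the hypothetical names used later in this thread, not the actual Storm scheme:

```python
import hashlib

def content_hash_urn(data: bytes) -> str:
    """Derive a content-addressed URN from raw bytes.

    The 'urn:foo:content-hash:' prefix and SHA-1 are illustrative
    assumptions; a real deployment would use its own registered
    namespace and hash choice.
    """
    digest = hashlib.sha1(data).hexdigest()
    return "urn:foo:content-hash:" + digest

# Two people independently storing the same bytes derive the same URN,
# so references from different documents converge on one identifier.
alice = content_hash_urn(b"the same image bytes")
bob = content_hash_urn(b"the same image bytes")
assert alice == bob
```

This is exactly the property that richer per-file metadata would destroy: the identifier depends only on the bytes, so queries like "which documents include this image?" can match across independent publishers.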

Rather theoretically: is it possible that the same sequence of bytes (a block) represents different contents when interpreted with a different mime-type/encoding? Should the host then store the block twice?

Up to the host. Since it *is* rather unlikely, I don't think there would be big penalties to storing the block twice in this case. I wouldn't do it anyway, but for a different reason: Other systems do not include the MIME type in their hash-based identifiers, and we should be able to find blocks and serve them to those systems even when we do not know the MIME type.

Higher level applications should not use block-uris anyway but deal with an abstraction representing the content (like http urls should).

You mean as in, with content negotiation applied? You use a single URI which maps to different representations of the same resource?

An example to be more explicit:
<urn:urn-5:G7Fj> <DC:title> "Ulisses"
<urn:urn-5:G7Fj> <DC:description> "bla bli"

This, for example, I would not include here. :-) Firstly, it is something I would want to be versioned independently: if I change the description of an image, that should not create a new version of the image. Secondly, I don't see a reason why the URI of the image would need to refer to this. Thirdly, I don't think that the time when a file is put into the system-- and thus given its identifier-- is necessarily the time to create this kind of metadata. It would seem to hold up the task at hand. Rather, I'd like to be able to add it later on, and maybe someone else can do that even better than me-- like a librarian who has a scientific background in creating metadata about stuff.

It seems like you could easily put this data in another block without losing much (assuming that the second block could be easily found through an appropriate query).

<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash: jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash: Dj&/fjkZRT68>
<urn:urn-5:lG5d> <ex:englishVersion> <urn:urn-5:G7Fj>
<urn:urn-5:lG5d> <ex:spanishVersion> <urn:urn-5:kA2L>

These, on the other hand, are very good cases, because they can be used by the computer in ways that require a certain level of trust: We want to retrieve only the data that the referrer intended to be retrieved, and we want to be able to check this cryptographically-- so this actually needs to be part of what we protect cryptographically.
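As a sketch of the kind of cryptographic check meant here (the URN layout and the use of SHA-1 are illustrative assumptions, not the actual Storm design):

```python
import hashlib

def verify_block(urn: str, data: bytes) -> bool:
    """Check that bytes fetched from an untrusted peer really match
    the content-hash URN they were requested under.

    Assumes an illustrative 'urn:foo:content-hash:<sha1-hex>' layout.
    """
    expected = urn.rsplit(":", 1)[-1]
    return hashlib.sha1(data).hexdigest() == expected

good = b"the referenced bytes"
urn = "urn:foo:content-hash:" + hashlib.sha1(good).hexdigest()
assert verify_block(urn, good)            # data matches the reference
assert not verify_block(urn, b"spoofed")  # tampered data is rejected
```

Because the reference itself carries the hash, a peer cannot substitute different content without the substitution being detectable; that is why the encoding links above need to live inside the cryptographically protected part of the reference.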

One technical side note, though. We'd have two types of URIs, something like,

    urn:foo:content-hash:jv24kt5
    urn:foo:ref:rs53h85p

The first would be just a plain byte stream identified by a content hash. The second would be a content hash, too, but we'd know that the target should be interpreted as an RDF file with data like you give above. Now, when we retrieve this block, we need to know at which node to start looking for the block we're interested in, so I think we'd need to write this as something like,

<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash: jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash: Dj&/fjkZRT68>
<> <ex:englishVersion> <urn:urn-5:G7Fj>
<> <ex:spanishVersion> <urn:urn-5:kA2L>

I.e., "this resource" is <> (the empty URI reference) and we start traversing the graph from there.
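A small sketch of that traversal, with the triples held as plain tuples (whitespace in the example hashes dropped; all URIs and predicate names are illustrative):

```python
# A ref block as (subject, predicate, object) triples;
# "" stands for <>, the empty URI reference meaning "this resource".
REF_BLOCK = [
    ("urn:urn-5:G7Fj", "ex:type", "ex:text"),
    ("urn:urn-5:G7Fj", "ex:utf8-encoding", "urn:content-hash:jhKHUL7HK"),
    ("urn:urn-5:G7Fj", "ex:latin1-encoding", "urn:content-hash:Dj&/fjkZRT68"),
    ("", "ex:englishVersion", "urn:urn-5:G7Fj"),
    ("", "ex:spanishVersion", "urn:urn-5:kA2L"),
]

def follow(triples, subject, predicate):
    """Return the first object for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None

# Start traversal at <> and pick the English version in UTF-8.
version = follow(REF_BLOCK, "", "ex:englishVersion")
block = follow(REF_BLOCK, version, "ex:utf8-encoding")
# block now names the content-hash block to actually fetch
```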

I found another use case for RDF metadata: Creative Commons licenses. It would make sense to me if this would be part of the reference, allowing the computer to automatically conclude how data may be copied and used.

In this example, applications should reference "urn:urn-5:G7Fj" (which does not have a mime type) rather than "urn:content-hash: Dj&/fjkZRT68" (which has a mime type in a specific context) wherever possible; in many cases an even higher abstraction, "urn:urn-5:lG5d", can be used.

Um, using a urn-5 doesn't work since it's just a random number-- if we use just a random number, we cannot check whether the data we may retrieve from a p2p network is really what the person making the reference wanted us to see. We would need to use "urn:foo:ref:[blah]", which would be the above RDF data, from which we could then get the specific representation.

While you can only deficiently use http to serve a block,

Why?

you could serve the URIs of both abstractions (urn:urn-5:G7Fj and urn:urn-5:lG5d) directly using HTTP 1.1 features.

(Again, you'd have to use hashes, or you could be arbitrarily spoofed.)

But am I right that this makes rdf-literals obsolete for everything but small decimals?

Hm, why? :-)

well, why use literal if you can make a block out of it, shortening queries and unifying handling?

Ah, that depends on many factors. Speed is one; you may need to load a lot of blocks to get the data for all the literals in a graph. Also, if we store each block as a file on a file system, there are some file systems that perform badly when faced with a large number of really small files.

And how do you split the metadata in blocks

Well, depends very much on the application. How do you split metadata into files? :-)

Not at all ;-). Splitting into files is rudimentarily represented metadata; if you use RDF, the filesystem is a legacy application.

Um, but if you put metadata on an http server, you split it too?

The rule of thumb is: Split it into units you would want to transfer independently. E.g., in Annotea, you would make one block = one annotation. When putting email into RDF, you might make one block = one email. You might want to put your FOAF data in one block. If you have metadata about many documents, you might make a metadata block for each document you process. If you publish your personal TV recommendations each week, you'd make one block each week.

Of course, if the granularity doesn't fit the task at hand-- you want to send a friend all love story recommendations of the last year-- the computer can split up those blocks automatically and reassemble them in a different way. It's just that for many applications a certain granularity fits usage patterns pretty well-- for example, you'd most of the time transmit an annotation as a whole. Then, if you've downloaded an annotation once, you never need to download it again (that's one of the benefits of putting them in blocks: you can cache them indefinitely).

So anyway, there are a number of reasons why we need to do powerful queries over a set of Storm blocks. For example, since we use hashes as the identifiers for blocks, we don't have file names as hints to humans about their content; instead, we'll use RDF metadata, stored in *other* blocks. As a second example, on top of the unchangeable blocks, we need to create a notion of updateable, versioned resources. We do this by creating metadata blocks saying e.g., "Block X is the newest version of resource Y as of 2003-03-20T22:29:25Z" and searching for the newest such statement.
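A minimal sketch of that "newest such statement" query, with the assertions as plain tuples (the resource names and version URNs are made up for illustration):

```python
# Hypothetical "current version" assertions, each coming from one
# metadata block: (resource, version_block, timestamp).
ASSERTIONS = [
    ("urn:res:Y", "urn:foo:content-hash:v1", "2003-03-18T10:00:00Z"),
    ("urn:res:Y", "urn:foo:content-hash:v2", "2003-03-20T22:29:25Z"),
    ("urn:res:Z", "urn:foo:content-hash:w1", "2003-03-19T09:30:00Z"),
]

def current_version(assertions, resource):
    """Pick the version named by the newest assertion about `resource`."""
    matching = [(ts, block) for res, block, ts in assertions
                if res == resource]
    if not matching:
        return None
    # ISO 8601 timestamps with a fixed 'Z' suffix sort lexicographically,
    # so the tuple max is the newest assertion.
    return max(matching)[1]
```

The immutable blocks never change; "updating" a resource just means publishing one more assertion block and letting the query pick the newest.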

I don't quite understand: isn't there a regression problem if the metadata is itself contained in blocks? Or is at least the timestamp of a block something external to the blocks?

A metadata block does not usually have a 'second-level' metadata block with information about the first metadata block, if you mean that;

Say you want to change the description of an entity, not just add a new one; I think you should then state in another metadata block that the old one is wrong (in the era starting now ;-)).

"Not usually" just meant that *most* metadata blocks do not have a second-level metadata block, in case you were worried that we'd need an infinite number of metametametameta blocks otherwise :)

no, timestamps are not external to the blocks.

When the user synchronizes his laptop with the home PC, I guess the metadata may be contradictory; I thought that with an external timestamp, contradictions could be handled (the newer one is the right one). If the timestamp is part of the metadata, the application should probably enforce it (while generally giving the user the maximum power to make all sorts of metadata constructs).

The timestamp is on the assertion, "Block X is the newest version of resource Y," and it gives the time when the user said X is the current version (i.e., when the user saved the document). If the user saves the document on the desktop, and then on the laptop, that would be different saves, made at different times, so the timestamps wouldn't be contradictory: they would simply be the timestamps of two different things.

(There's another problem in this scenario, though: If the user edited a document independently on desktop and laptop, it wouldn't be nice if the version saved later would supersede the other one; rather, the changes from both should be merged. We actually use a slightly different system for synchronization of independent systems; instead of storing a timestamp, we store a list of obsoleted versions... but that's leading us astray here :-) )
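The obsoleted-versions idea can be sketched like this (version names are illustrative): versions that no other version claims to supersede are the current "heads," and more than one head signals independent edits that should be merged rather than letting the later save silently win:

```python
# Each saved version records which earlier versions it supersedes.
VERSIONS = {
    "v1": [],          # original
    "v2": ["v1"],      # saved on the desktop
    "v3": ["v1"],      # saved on the laptop, unaware of v2
}

def heads(versions):
    """Return the versions that nothing else obsoletes."""
    obsoleted = {old for supersedes in versions.values()
                 for old in supersedes}
    return sorted(v for v in versions if v not in obsoleted)

# Two heads: the desktop and laptop edits were concurrent, so a
# merge is called for instead of a timestamp race.
assert heads(VERSIONS) == ["v2", "v3"]
```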

- Benja




