[Bug-gnupedia] Design Solutions Review


From:	Bob Dodd
Subject:	[Bug-gnupedia] Design Solutions Review
Date:	Thu, 25 Jan 2001 09:07:17 -0800 (PST)
I've tried to review the solutions presented (so far) for storing
encyclopedia entries. I've looked at a web/file-based approach and at a
database approach. I'm afraid the result is a little long, so you have a
choice... you can either read the text below, or you can read the
attached text file (same content).

I've tried to be as impartial as I can, but inevitably my own bias will
show; it's that sort of subject.

It's part of a larger submission, but I'd be interested in feedback on
this part first.

/Bob Dodd

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Pedia Back End

1. Introduction

Something, somewhere, is going to have to handle submissions to the
pedia, and present those submissions in an ordered form to users of
the pedia.

The somewhere is relatively easy: we are expecting it to be the GNU
servers (at least for now).

The something is an entirely different matter.  We need to balance
several different factors:

a) We need the submission process to be easy to use, have multiple
submission routes, and be relatively secure (so having some idea of who
sent what)

b) We need searching/browsing of the pedia to be relatively fast from
a user's perspective.

c) We need searching/browsing of the pedia to be relatively efficient
in terms of resources, from a server point of view

d) We need the whole back-end process to be relatively
low-maintenance: we don't want teams of system administrators toiling
away daily just to keep the servers up and running

e) We need a design that will scale well (up or down, depending on how
much info we want to mirror)

f)  We need a design that the majority of volunteers have the skills to
implement and maintain. And indeed have the enthusiasm to do so.

g) We need a system that will support our (as yet undefined)
submissions policy

h) We need a system that will support our (as yet undefined) ratings
policy

i) We need a design that allows for an incremental development cycle
with many and frequent releases

j) We need a mechanism for ensuring our design is properly documented,
and that tracks bugs

k) We need a system that supports multiple languages, both in terms of
submission/rating, and in terms of content storage/presentation of
content

l) We need a system that supports disabled users, especially the
visually impaired

m) We need a system that can track changes to a submission over time,
and which allows older versions to be referenced and viewed

n) We need a system that allows both programmatic access to submissions
(i.e. from other software), and appropriate user interfaces to view our
content

o) We need a system in place quickly

p) We need tools to allow for easy maintenance of all submissions


2. Submission Storage Solutions

There are two ways we can go here: 
a) A bespoke system, designed from scratch (or near to scratch)
b) Tailoring of existing back-ends to match our requirements

Before we can think of truly bespoke systems, we need first to consider
what already exists, and the pros and cons of those systems.

2.1. Web-based approaches

By web-based approaches, I mean HTML/WML/XML style file hierarchies,
description tags, and hyperlinks. Unless stated otherwise, I'll use the
term "XML" to mean any of these hypertext formats.

OK, there are several different ways we could approach this, and they
are discussed below. But first we need to ask: exactly what are
we storing here? Is it the basic submissions, or something that
includes information from the review/rating process? For now, I'm going
to make the assumption that these are just basic submissions.

2.1.1. Indexed Flat(ish)-File System

This really is as basic as saying we create an index in each directory
where we store the submissions by sorting the directory listing (I
think this has already been posted as a suggestion?) and allowing web
browsers to navigate the directories.  It does beg the question: one
directory or many? And if many, flat (all from the root) or
hierarchical? If hierarchical, then how do you choose the hierarchy?

From a practical point of view, you would want to limit the number of
files in a directory: you don't want Eunuchs/Windoesn't falling over
trying to do an "ls" command.

If you do split between directories, you then need a policy. It can't
be alphabetic, unless you know where Chinese pictograms go in your
alphabet (or you split first by language, at which point you then have
to ensure you hold enough information to keep the same article submitted
in multiple languages tied together). You can try very hard, but the only
simple and mechanical approach is to do it by date-stamp, so the first
512 entries go into the first directory, the next 512 into the following
one, etc. It also means you need an index of indexes at root level to
make the darned thing searchable.
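
The date-stamp split described above can be sketched in a few lines of
Python. This is only an illustration: the root name "pedia", the
zero-padded directory names, the ".xml" suffix and the function names
are all my assumptions, not anything the project has agreed on. Only the
512-per-directory figure comes from the text.

```python
from pathlib import PurePosixPath

BUCKET_SIZE = 512  # entries per directory, the figure used in the text


def bucket_path(sequence_number, root="pedia"):
    """Return the directory for the Nth accepted submission (0-based).

    Submissions are numbered in arrival (date-stamp) order, so the
    directory is purely mechanical: no alphabet or language needed.
    """
    if sequence_number < 0:
        raise ValueError("sequence_number must be non-negative")
    bucket = sequence_number // BUCKET_SIZE
    return PurePosixPath(root) / f"{bucket:06d}"


def entry_path(sequence_number, root="pedia"):
    """Return the (hypothetical) file path holding the Nth submission."""
    name = f"{sequence_number:09d}.xml"
    return bucket_path(sequence_number, root) / name
```

Note that the path tells you nothing about subject or author, which is
exactly why the root-level index of indexes is still needed.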

So I would suggest this solution scores well for simplicity, but very
low for scalability and performance (imagine trying to find your
submission to update it).  Also, there is the problem of maintaining
that master top-level index.

2.1.2. Indexed XML Wrapped Content

This approach builds on 2.1.1 above. It allows for more sophisticated
organisation of the information, so that as well as a basic directory
index, we have ways of keeping related content tied together, so that
articles with the same author and subject (which allows
multi-language support) could be hyperlinked together independent of
the directory in which they are physically stored.  It would therefore
be possible to keep multiple top-level indexes to help when searching
on language/author/submission status.  So you get an XML wrapper
around (presumably) an XML wrapped submission.

This is a step up on 2.1.1 above, in that we now have some way of
keeping related submissions together and, assuming the outer XML wrapper
is auto-generated by parsing the XML wrapper of the submission, a
relatively low-maintenance way of organising and accessing information.
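
To make the "multiple top-level indexes" idea concrete, here is a
minimal sketch that parses submission wrappers and builds one index per
header field. The wrapper layout (a <submission> element with id,
author, subject, language and status children) is purely hypothetical,
since no wrapper format has been defined yet. The parsing uses Python's
standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical wrapper fields; the real wrapper format is not yet defined.
FIELDS = ("id", "author", "subject", "language", "status")


def read_header(wrapper_xml):
    """Pull the indexable fields out of one submission wrapper."""
    root = ET.fromstring(wrapper_xml)
    return {name: root.findtext(name, default="") for name in FIELDS}


def build_indexes(wrappers):
    """Build one top-level index per field: value -> list of ids."""
    indexes = {name: defaultdict(list) for name in FIELDS if name != "id"}
    for wrapper_xml in wrappers:
        header = read_header(wrapper_xml)
        for name, index in indexes.items():
            index[header[name]].append(header["id"])
    return indexes
```

With this, the English and French versions of the same article end up
under one key in the subject index, wherever they sit on disk.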

2.1.3. Navigable XML Wrapped content

This approach builds on 2.1.2 above. If you can wrap the submission to
tag related submissions, you can also use the same mechanism to apply
arbitrary and multiple hierarchies to the submissions. If the basic XML
wrapper around the submission allows the user to identify both subject,
and parent subject(s), we could automatically generate a view of the
encyclopedia based on at least the author's perception of what they
think their article is about.

To my mind, this solution is basically a Wiki, at least in data
structure (albeit one with multiple hierarchical views, some
auto-generated based upon submission rules, some based on the author's
viewpoint).
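
As a rough illustration of generating a view from the author's subject
and parent-subject hints, the sketch below turns (subject, parents)
pairs into a parent-to-children map. The input shape and the "(root)"
placeholder for entries with no stated parent are my assumptions. An
entry with several parents simply appears under each of them, which is
what gives the multiple hierarchies.

```python
from collections import defaultdict

ROOT = "(root)"  # hypothetical label for entries with no parent hint


def build_view(entries):
    """Map each parent subject to the sorted list of its child subjects.

    entries: iterable of (subject, list_of_parent_subjects) pairs, as
    they might be read from the submission wrappers.
    """
    children = defaultdict(set)
    for subject, parents in entries:
        for parent in parents or [ROOT]:
            children[parent].add(subject)
    return {parent: sorted(kids) for parent, kids in children.items()}
```

For example, an article on "Tea" filed under both "Drinks" and "Plants"
would be reachable from either branch of the generated view.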

2.2. Database Approaches

By database approaches I mean storing content inside some form of
database (relational, hierarchical, object oriented).

As with the web-based approaches, we also need to ask: what do we think
we are storing in the database? It's possible to do one of two things:
either ALL the content goes physically inside the database, or just the
basic header information (author, subject etc.) goes into the database,
and the body gets saved externally with a link to it within the
database entry. We need to consider both scenarios, and they are listed
below.

We also need to consider what we think any database structure/indexing
would give us. For now I am assuming that, as with section 2.1, we are
interested only in the basic submission, and not a content or subject
oriented hierarchy.

2.2.1. All Content In The Database

With this approach, we dissect the submission information (however we
obtain it), and potentially at least break up the submission into small
parts to get at (and index) internal sections of the submission. How
far you take this depends on the granularity that you wish to apply to
indexing & referencing the individual sections of the submission. It is
interesting to compare this with (say) the Wiki-style approach in
2.1.3, where we would (probably) need multiple smaller submissions from
the author to allow equivalent indexing.

This approach definitely scores highly in terms of maintenance:
indexing, versioning, locking for edit, mirroring of data is pretty
much automatic and provided for free by the database. The equivalent
solution in 2.1 requires us to handle these issues ourselves.

Where the truly "all in the database" approach begins to fall down is
in sizing: once we get large entries, e.g. long blocks of text like this
article, pictures, or other multi-media content, we are likely to
seriously reduce the efficiency of the database. Most databases have ways
of handling very large blocks of content, but the user manuals usually
go on to suggest you don't do it.

The second problem with this approach is in presentation of the
material. With solution 2.1, we have content already well structured
for presentation in (say) Netscape/Mozilla. With this solution, we
need to rebuild the document's structure first (or store it whole and
create index entries to the locations of the decomposed elements).

The third problem is an implicit assumption: the content of the
submissions can be usefully broken down. That depends a great deal on
the content format, which is a subject in itself. If there is no useful
further decomposition, holding ALL the content (perhaps 10s of Kbytes
of picture, Mbytes of video) begins to look less appealing.

The fourth problem is that we can only decompose those content formats
we know and love: each time a content format is updated, or a new one
added to our list of supported formats, we have lots of maintenance to
do on the database and supporting code.

The fifth problem is what we do if there is a syntax error in the file
format of the content of the submission. How would we store a damaged
entry? The answer I guess, is that either we reject it outright (and
hence have yet another storage system to hold rejected submissions), or
we design a part of the database to hold damaged un-expandable
entries. 

An alternative approach that may minimize problem 5 is to create
entries in the database to represent decomposed elements of the content
(adding a level of indirection). If the content is damaged, these
entries are missing.

2.2.2. Only "Header" Information In The Database

With this approach, we hold only enough information about the
submission to allow indexing/searching of the entry; we don't hold the
body of the submission.

The immediate advantage of this approach is that we don't get the
sizing problems associated with 2.2.1.  But we lose some of the
advantages of the database too: versioning, mirroring of data. We now
need to do at least some of that ourselves. We still get to keep the
locking (if all access goes through the database).

If we don't keep the content in the database, then, if we wanted the
advantages of 2.2.1, we would need to use the proposed solution to
problem 5 above, and create rows in the database to represent
decomposed elements of the content. That's why "Header" is in quotes
in the title of this section. Of course, as in 2.2.1, this only makes
sense if we are prepared to parse the content for the information (with
the limits on content format that imposes).

There is also a question of what we consider header information. We
could think of the header like the information you would put in an XML
wrapper around a submission, or as all the text content of the
submission, with only pictures/video/animation held outside the
database. The second solution would require us to understand the
content format of the submission, the first would not.
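
For the narrower reading of "header" (wrapper-style fields only, body
held outside the database), a minimal sketch might look like the
following. It uses SQLite through Python's standard sqlite3 module
purely because it is easy to try out; the table name, column names and
the body_path convention are all hypothetical, and nothing here is a
recommendation of a particular database.

```python
import sqlite3

# Hypothetical schema: header fields in the database, body on disk.
SCHEMA = """
CREATE TABLE submission (
    id        INTEGER PRIMARY KEY,
    author    TEXT NOT NULL,
    subject   TEXT NOT NULL,
    language  TEXT NOT NULL,
    body_path TEXT NOT NULL   -- link to the externally stored body
);
CREATE INDEX submission_subject ON submission (subject, language);
"""


def open_catalog(path=":memory:"):
    """Open (and initialise) the header database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_submission(conn, author, subject, language, body_path):
    """Record one submission header and return its new id."""
    cur = conn.execute(
        "INSERT INTO submission (author, subject, language, body_path) "
        "VALUES (?, ?, ?, ?)",
        (author, subject, language, body_path),
    )
    conn.commit()
    return cur.lastrowid


def find_by_subject(conn, subject):
    """Return (language, body_path) pairs for every version of a subject."""
    rows = conn.execute(
        "SELECT language, body_path FROM submission "
        "WHERE subject = ? ORDER BY language",
        (subject,),
    )
    return rows.fetchall()
```

The database never needs to understand the content format here; it only
hands back the link, and versioning of the body files is left to us.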

3. Content Organization Solutions

To be a useful encyclopedia, we will finally have to provide some
organization of the content, either as given by the author, or by
another means. In selecting a back-end we need to consider how we would
support such an environment.

3.1. Organization As Part Of The Submissions Process

The reality is that we cannot escape this aspect of content
organization, even if only to group related submissions together (e.g.
same entry, multiple languages).  The real question comes in how much
more you do.

And how much more you do depends on who you think organizes
encyclopedia content. The Wiki approach is to allow the author a strong
say in where their submission should sit, although in order to do this,
you need to decide on the basic structure of the pedia: hierarchical,
horizontal, or both?

I would argue that hints from the author are useful, even if others
would organize the information differently, and whatever other
organization is applied, the author's hints should be taken on board.
One solution is there in 2.1.3: add an XML wrapper around the basic
submission's XML header to provide this view. It therefore becomes part
of the submission storage, and this is basically correct: if it is an
author's hint, he/she needs to be able to easily update that part of
the submission to correct it.  The same solution is also easily
encompassed in any database solution: it's just another (set of)
relation column(s) in the data model.

As for adding further organization/indexing of the submission, we have
to decide how many organizations/views of the submissions we expect to
support, and indeed where those views will be stored. If I understand
the charter of this project, I think we are expecting many people to
rate and index our content for their own use (e.g. making subject
specific encyclopedias, children's encyclopedias etc.).  On that basis,
we can expect at least some of that indexing to be done off our
servers.  If that is the case, such organization cannot be part of the
submission process.

3.2. Organization As Part Of A Rating System

Well, let's be more precise: our rating system, whatever that may be.

If we are talking about OUR rating system, then we can assume that we
can host the ratings system on the same servers we use for the
submissions system.  If this assumption is untrue, the following
material is not applicable.

First we have to decide: are we simply overlaying the ratings comments
over the submission data, or are we also organizing the submissions as
a result?

3.2.1. Overlaying The Submissions Data

This is the basic level of organization we will need to do. For each
submission we will have an entry (somewhere) which allows reviewers to
rate the content of the submission, make comments, suggest improvements
etc. Presumably each submission will also have a review lifecycle
(submitted, checked for spam, ratings comments received, superseded by
new draft) and we will need to record this also.
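
The review lifecycle could be recorded as a small state machine. In the
sketch below the four stage names come from the list above, but the
allowed transitions between them are only my guess at a sensible
policy; the real rules would come from the (as yet undefined)
submissions and ratings policies.

```python
from enum import Enum


class Stage(Enum):
    SUBMITTED = "submitted"
    SPAM_CHECKED = "checked for spam"
    REVIEWED = "ratings comments received"
    SUPERSEDED = "superseded by new draft"


# Assumed transitions: a policy placeholder, not an agreed design.
ALLOWED = {
    Stage.SUBMITTED: {Stage.SPAM_CHECKED},
    Stage.SPAM_CHECKED: {Stage.REVIEWED, Stage.SUPERSEDED},
    Stage.REVIEWED: {Stage.REVIEWED, Stage.SUPERSEDED},
    Stage.SUPERSEDED: set(),
}


def advance(current, target):
    """Return the new stage, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"cannot move from {current.value!r} to {target.value!r}"
        )
    return target
```

Whichever storage route we pick, the current stage is then just one
more field in the wrapper or one more column in a table.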

How we record such information will depend to some extent on how we
stored the original submission. If we have stored the submission in a
directory, wrapped in XML, we can:

1. Add an extra layer of XML that contains the additional reviewing
fields
2. Create a parallel review XML document to accompany the submission
and link to the original
3. Create a review database to hold the rating information.

The most flexible solution would be (2), in that people would be able
to submit reviews as they would articles for inclusion in the
encyclopedia.

If we don't already hold the submission in a database, it would be
difficult to justify creating one only for the review cycle. Since we
aren't talking about re-organizing the material, the additional power
to index the submissions would not particularly help us.

If we start off with either the whole submission in a database, or just
the header information, then clearly the review cycle is best viewed as
entries in a reviews table, each row referencing the submission under
review.  This would mean that if the internal structure of the
submission was modelled (as suggested earlier), review/rating comments
would be able to be more precisely targeted than in the case of the
Wiki model above.
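
A reviews table of this kind might look like the sketch below, again
using SQLite via Python's sqlite3 module only for convenience. The
table and column names, the optional section column (for targeting a
comment at one part of an entry) and the 1 to 5 rating scale are all
assumptions; no ratings policy has been defined.

```python
import sqlite3

# Hypothetical schema: each review row references the submission it rates.
SCHEMA = """
CREATE TABLE submission (
    id      INTEGER PRIMARY KEY,
    subject TEXT NOT NULL
);
CREATE TABLE review (
    id            INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL REFERENCES submission (id),
    section       TEXT,     -- NULL means the review covers the whole entry
    reviewer      TEXT NOT NULL,
    rating        INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    comment       TEXT
);
"""


def open_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def average_rating(conn, submission_id):
    """Return the mean rating for a submission, or None if unreviewed."""
    row = conn.execute(
        "SELECT AVG(rating) FROM review WHERE submission_id = ?",
        (submission_id,),
    ).fetchone()
    return row[0]
```

Because each row carries an optional section reference, a comment can
point at one part of an entry rather than the whole thing, provided the
internal structure has been modelled.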

3.2.2. Organizing Submissions Due To Rating/Reviewing

One part of a rating/reviewing system could reasonably be a suggested
alternative subject title, or parent subject (compare to the hints
given by the author). One thing we could do would be to allow that
view of the encyclopedia too. In other words, review submissions could
include alternative entries within their submissions. A
browser/search engine could then choose to present a view of the
encyclopedia based on either/both the author's original hints and the
new hints from the (possibly) many reviewers.

How we would organize such additional views of the submission would
depend heavily upon how the submission is stored.  Wiki-style they
become hyperlinks in the XML of the review submission.  Database style,
they become columns in the reviews table. So both are possible, and
solution neutral.

3.3. Organization Based On Personal Prejudice

For "personal prejudice" read "catalog". Hopefully there will be multiple
catalogs of our submissions, and few if any of them held by us.
Therefore in this case, we need to make the assumption that catalogs do
not run on our servers and design appropriately.

This immediately leads to the conclusion that the catalogs are not part
of the back-end we have been discussing, even if we choose to implement
a catalog as part of our project. What we have instead is the
requirement to provide an API to our submissions data.

So what sort of API? Well, that depends almost entirely on how we chose
to store the submissions.

3.3.1. Web-based Storage Solutions

We would presumably provide ftp get access to our directory
structure. Catalogers would then be at liberty to download and search
our indexes. We could also provide a Perl/A.N.Other Language API to
navigate the submissions data in a more abstract way.

3.3.2. Database-based Storage Solutions

This is one of the places a database solution really scores: we can set
up JDBC drivers (or whatever db access tools you want) to the database.
Job done.  It would also be a good way to handle the upload of bulk
submissions.








Pedia Back End.txt
Description: Pedia Back End.txt