[circle] circle web search wishlist (long)
From: Oskar Kosciusko
Subject: [circle] circle web search wishlist (long)
Date: Mon, 30 Jun 2003 19:54:12 +0200
Hello, folks -- I'm trying to design a web search protocol that will run on top
of the circle network. Below are my thoughts on it, and I'd like to know what
you think.
Thanks,
Oskar Kosciusko
mail:address@hidden
circle:bangpath
Circle web search.
a.) Design Goals
1. speed: use the network of nodes to parallelize the search
2. freshness: distributed, coordinated crawlers for frequent updates
3. independence: no one entity controls any large or contiguous part of the
database
Each node is expected to understand all parts of the proposed search protocol:
search-request(keyword, searchUID?, limit?)
e.g. please send me the top N web pages that match this keyword.
rank-request(url, keyword, search-node?, searchUID?)
please send me the 'score' of this url for the given keyword(s)
rank-result(data, searchUID?)
(response to rank-request) title, descr, url, rank, etc.
cache-store-request(url)
please store this URL in your cache
cache-read(url)
please send me this url from your cache
keyword-announce(url, keyword)
this url contains these keywords (top 5%? top 100?)
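To make the message shapes concrete, here is a minimal Python sketch of the six message types as dataclasses. The field names follow the list above; the class names and the use of dataclasses are my own illustration, not a wire format.

```python
# Sketch of the proposed protocol messages as Python dataclasses.
# Optional fields (marked '?' above) default to None; 'searchUID'
# correlates replies with the search that triggered them.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchRequest:
    keyword: str
    searchUID: Optional[str] = None
    limit: Optional[int] = None       # top-N cap

@dataclass
class RankRequest:
    url: str
    keyword: str
    search_node: Optional[str] = None # where to send the rank-result
    searchUID: Optional[str] = None

@dataclass
class RankResult:
    data: dict                        # title, descr, url, rank, etc.
    searchUID: Optional[str] = None

@dataclass
class CacheStoreRequest:
    url: str

@dataclass
class CacheRead:
    url: str

@dataclass
class KeywordAnnounce:
    url: str
    keyword: str
```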
b.) Anatomy of a search
A search-request is what starts the process. A user wants pages
matching a keyword, say, 'dvd'. The MD5 of that is
'ca59e84c56b9f8367276f899f722646a'. The user's node finds the node that
corresponds to that hash, and issues a search-request to it. This 'index' node
consults its index cache for urls containing that keyword. For each url's hash, the
index node sends a rank-request to the corresponding node. These 'doc' nodes
read the document from their cache and score it against the keyword(s). They
then send a rank-result back to the original node, which collates the results and
presents them to the user.
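The keyword-to-index-node step above can be sketched in a few lines of Python. `find_node` here is a hypothetical stand-in for the real Circle routing lookup; it just picks the node id numerically closest to the hash.

```python
# Sketch of the keyword -> index-node lookup: hash the keyword with MD5,
# then route to the node responsible for that point of the keyspace.
# find_node() is illustrative, not the real Circle routing call.
import hashlib

def keyword_hash(keyword):
    return hashlib.md5(keyword.encode("utf-8")).hexdigest()

def find_node(h, node_ids):
    """Pick the node id numerically closest to hash h (both hex strings)."""
    target = int(h, 16)
    return min(node_ids, key=lambda n: abs(int(n, 16) - target))
```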
Sounds ok, but where does the database come from? My first idea was to
have each node do a blind crawl, but there are several problems with this.
Self-interest would be hard to combat, as anyone could claim that their site
has a score of 1,000,000 for the keyword 'sex'. The data would be scattered all
over the place, making collation difficult. There would be a lot of needless
duplication of crawling and storage, and the load would not be evenly
distributed. It would also take many months before the database was large
enough to be worth the trouble.
The database needs to be self-collating, self-balancing, self-healing, and
useful from day one. So let's add to this scenario:
The corresponding node, the one whose 'arc' of the Circle contains the
hash of an index or document, will not just hold a pointer but the data itself.
There are some benefits to this approach:
1. the data will be more evenly distributed across the nodes
2. it will not concentrate 'contiguous' data (the indices for 'dvd' and
'dvds', or all pages from a website) on any one node
3. no one can predict beforehand what data a node will hold, so there is
little incentive to post bad data
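The 'arc' ownership rule can be made precise with a small sketch: a node owns a half-open arc of the circular 128-bit MD5 keyspace and stores every item whose hash falls inside it, including the wraparound case. The arc boundaries here are hypothetical.

```python
# Sketch of arc ownership on the circular 128-bit keyspace: the node
# whose arc [start, end) contains a hash stores the data itself, not
# just a pointer. Handles arcs that wrap past zero.
SPACE = 2 ** 128

def in_arc(h, start, end):
    """True if integer hash h lies in the arc [start, end), wrapping."""
    if start <= end:
        return start <= h < end
    return h >= start or h < end   # arc wraps past zero
```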
If an index node discovers it doesn't have any (or many) urls matching
a search-request, it performs a metasearch. It queries a few established search
engines and gathers the top urls. It saves these keyword-url mappings to its
cache, and continues with the search process. These urls become the seeds for
our crawlers, while making the search useful right away.
If a doc node does not have the url in its cache, it retrieves it and
continues with the search process. The node has now assumed responsibility for
that url. Periodically it will recheck the url for changes and send
keyword-announce messages to the right nodes. It will also parse links from the
document and send cache-store-requests to the nodes corresponding to the url
hashes. They will retrieve the referenced documents, send keyword-announce and
cache-store-requests, and so on down the line. The reasons for this are
four-fold:
1. it maintains the direct mapping of url hashes to nodes
2. distributes the burden of crawling
3. avoids duplication of effort
4. makes the crawling hard to detect
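The crawl step a doc node performs when it assumes responsibility for a url might look like the following sketch. `fetch`, `extract_links`, `top_keywords`, and `send` are hypothetical stand-ins for the real fetch/parse/transport code; only the message flow comes from the description above.

```python
# Sketch of a doc node taking responsibility for a url: cache the page,
# announce its keywords to the right index nodes, and hand each outgoing
# link to the node that owns that link's hash.
import hashlib

def on_cache_store_request(url, fetch, extract_links, top_keywords, send):
    doc = fetch(url)                          # retrieve and cache the page
    for kw in top_keywords(doc):              # e.g. the top 100 keywords
        idx = hashlib.md5(kw.encode()).hexdigest()
        send(idx, ("keyword-announce", url, kw))
    for link in extract_links(doc):           # propagate the crawl
        tgt = hashlib.md5(link.encode()).hexdigest()
        send(tgt, ("cache-store-request", link))
```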
c.) Implementation details
Building the database: since the database will not be blindly compiled,
it will at first tend to contain only documents that users search for the most.
As the 'crawl space' gets larger the database will have more complete coverage,
but since nodes will probably use LRU to prune their caches, popular pages will
tend to stick around more. Hopefully this will allow us to have the same
perceived keyword coverage as google without having to store all 3 billion
pages (3x10^9 pages * 30kB = 90TB). We can't require that each user donate 1GB or more,
though 100-200MB is probably not asking too much.
capacity faster than the database. Linking the database to actual use helps to
do this.
Client control: each client will be able to control how much disk space
it will donate for index and doc caches, and can keep them in check with LRU
and expiration sweeps. Each client may also choose to limit the number of
connections, pages, results, etc.
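The per-client cache described above amounts to LRU eviction under a byte budget. A minimal sketch (the budget and sizes are illustrative, and expiration sweeps are omitted):

```python
# Minimal sketch of the per-client doc cache: LRU eviction under a
# configurable byte budget, using OrderedDict insertion order to track
# recency of use.
from collections import OrderedDict

class LRUCache:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.items = OrderedDict()    # url -> document bytes

    def get(self, url):
        if url in self.items:
            self.items.move_to_end(url)   # mark as recently used
            return self.items[url]
        return None

    def put(self, url, doc):
        if url in self.items:
            self.used -= len(self.items.pop(url))
        self.items[url] = doc
        self.used += len(doc)
        while self.used > self.max_bytes:     # evict least recently used
            _, old = self.items.popitem(last=False)
            self.used -= len(old)
```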
Load balancing: Overlap may be desirable, to avoid putting too much
load on the unfortunate machine that handles the 'sex' keyword. Let's add a
rule: an index or doc request may be handled not only by the arc matching the
hash, but also some inverse or bit-shift of the hash.
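The overlap rule could work like this sketch: derive extra candidate locations for a hot keyword by rotating or inverting its 128-bit hash, so several arcs can serve the same index. The particular transforms (left rotations plus a bitwise inverse) are my own illustration.

```python
# Sketch of the proposed load-balancing rule: a keyword's index may live
# not only at its MD5 hash but also at rotated and inverted variants of
# that hash, spreading hot keywords across several arcs.
import hashlib

BITS = 128

def replica_hashes(keyword, shifts=(0, 1, 2)):
    h = int(hashlib.md5(keyword.encode()).hexdigest(), 16)
    out = []
    for s in shifts:
        # left-rotate the 128-bit hash by s bits (s=0 is the hash itself)
        rot = ((h << s) | (h >> (BITS - s))) % (2 ** BITS) if s else h
        out.append("%032x" % rot)
    out.append("%032x" % (h ^ (2 ** BITS - 1)))   # bitwise inverse
    return out
```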
d.) Todo
multi-keyword searches: the querystring should be parsed by the user's
node, and search-requests sent for each keyword. But we should also send the
complete querystring to each, so the doc nodes can rank on all of them.
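On the user's node, that multi-keyword fan-out might look like this sketch. `send` is a hypothetical transport stub, and deriving the searchUID from the querystring hash is just one possible choice.

```python
# Sketch of multi-keyword handling: one search-request per keyword, each
# carrying the full querystring so doc nodes can rank on every term.
import hashlib

def issue_search(querystring, send):
    keywords = querystring.lower().split()
    uid = hashlib.md5(querystring.encode()).hexdigest()  # searchUID
    for kw in keywords:
        index_node = hashlib.md5(kw.encode()).hexdigest()
        send(index_node, ("search-request", kw, uid, querystring))
    return uid
```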
page attributes: how to include language and geographic preferences in
the search?
growth: what happens when a new node splits my keyword space with me?
should it rebuild on its own or get it from me? how can the new node trust what
I give it? Maybe it should 'inherit' data from me, but expire it as soon as
possible and replace it with its own.