From: Oskar Kosciusko
Subject: [circle] circle web search wishlist (long)
Date: Mon, 30 Jun 2003 19:54:12 +0200

Hello, folks -- I'm trying to design a web search protocol that will run on top 
of the circle network. Below are my thoughts on it, and I'd like to know what 
you think.
  
  Thanks,
   Oskar Kosciusko
   mail:address@hidden
   circle:bangpath

Circle web search.

a.) Design Goals
1. speed: use the network of nodes to parallelize the search
2. freshness: distributed, coordinated crawlers for frequent updates
3. independence: no one entity controls any large or contiguous part of the 
database

Each node is expected to understand all parts of the proposed search protocol
(a code sketch follows the list):

    search-request(keyword, searchUID?, limit?)
        e.g. please send me the top N web pages that match this keyword.

    rank-request(url, keyword, search-node?, searchUID?)
        please send me the 'score' of this url for the given keyword(s)

    rank-result(data, searchUID?)
        (response to rank-request) title, descr, url, rank, etc.

    cache-store-request(url)
        please store this URL in your cache

    cache-read(url)
        please send me this url from your cache

    keyword-announce(url, keyword)
        this url contains these keywords (top 5%? top 100?)


b.) Anatomy of a search

        A search-request is what starts the process. A user wants pages 
matching a keyword, say, 'dvd'. The MD5 of that is 
'ca59e84c56b9f8367276f899f722646a'. The user's node finds the node that 
corresponds to that hash, and issues a search-request to it. This 'index' node 
consults its index cache for urls containing that keyword. For each hash, the 
index node sends a rank-request to the corresponding node. These 'doc' nodes 
read the document from their cache and score it based on the keyword(s). They 
then send a rank-result back to the original node, which collates the results 
and presents them to the user.
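
        In code, the flow might look roughly like this. find_node() and send() 
stand in for whatever routing and messaging primitives a circle node actually 
exposes; both are assumptions, as is the index_cache mapping.

    import hashlib

    def start_search(keyword, my_node, limit=20):
        # The keyword's MD5 picks the index node; per the example above,
        # md5('dvd') == 'ca59e84c56b9f8367276f899f722646a'.
        key_hash = hashlib.md5(keyword.encode()).hexdigest()
        index_node = my_node.find_node(key_hash)      # hypothetical routing call
        my_node.send(index_node, SearchRequest(keyword, limit=limit))

    def handle_search_request(req, sender, my_node):
        # Index node side: fan a rank-request out to the doc node that
        # owns each matching url, naming the original node as search-node.
        for url in my_node.index_cache.get(req.keyword, []):
            url_hash = hashlib.md5(url.encode()).hexdigest()
            my_node.send(my_node.find_node(url_hash),
                         RankRequest(url, [req.keyword],
                                     search_node=sender,
                                     search_uid=req.search_uid))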

        Sounds ok, but where does the database come from? My first idea was to 
have each node do a blind crawl, but there are several problems with this. 
Self-interest would be hard to combat, as anyone could claim that their site 
has a score of 1,000,000 for the keyword 'sex'. The data would be scattered all 
over the place, making collation difficult. There would be a lot of needless 
duplication of crawling and storage, and the load would not be evenly 
distributed. It would also take many months before the database was large 
enough to be worth the trouble.

The database needs to be self-collating, self-balancing, self-healing, and 
useful from day one. So let's add to this scenario:

        The corresponding node, the one whose 'arc' of the Circle contains the 
hash of an index or document, will not just hold a pointer but the data itself. 
There are some benefits to this approach (a toy illustration follows the list):

  1. the data will be more evenly distributed across the nodes
  2. it will not concentrate 'contiguous' data (the indices for 'dvd' and 
'dvds', or all pages from a website) on any one node
  3. no one can predict beforehand what data a node will hold, so there is 
little incentive to post bad data
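
        As a toy illustration, assuming node positions and MD5 values share 
one 2^128 ring, the owner of a piece of data is simply the first node 
clockwise from its hash:

    import hashlib

    RING = 2 ** 128   # the MD5 output space

    def owner(data_hash_hex, node_positions):
        # The node whose 'arc' contains the hash is its closest
        # clockwise successor on the ring.
        h = int(data_hash_hex, 16)
        return min(node_positions, key=lambda p: (p - h) % RING)

    # Placement is fixed by md5(url) or md5(keyword), not by whoever
    # publishes the data, so nobody can aim bad data at a chosen arc.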

        If an index node discovers it doesn't have any (or many) urls matching 
a search-request, it performs a metasearch. It queries a few established search 
engines and gathers the top urls. It saves these keyword-url mappings to its 
cache, and continues with the search process. These urls become the seeds for 
our crawlers, while making the search useful right away.
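
        A sketch of that fallback, assuming a configured list of engines and 
a placeholder fetch_top_urls() scraper (real scraping would be 
engine-specific):

    def handle_empty_index(keyword, my_node, min_urls=10):
        # Only metasearch when the local index is thin for this keyword.
        if len(my_node.index_cache.get(keyword, [])) >= min_urls:
            return
        seeds = []
        for engine in my_node.metasearch_engines:     # assumed config list
            seeds += fetch_top_urls(engine, keyword)  # hypothetical scraper
        my_node.index_cache.setdefault(keyword, []).extend(seeds)
        # The search then proceeds as usual; the rank-requests that
        # follow will make doc nodes fetch and adopt these seed urls.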

        If a doc node does not have the url in its cache, it retrieves it and 
continues with the search process. The node has now assumed responsibility for 
that url. Periodically it will recheck the url for changes and send 
keyword-announce messages to the right nodes. It will also parse links from the 
document and send cache-store-requests to the nodes corresponding to the url 
hashes. They will retrieve the referenced documents, send keyword-announce and 
cache-store-requests, and so on down the line. The reasons for this are 
four-fold (a sketch of the doc node's side follows the list):

  1. it maintains the direct mapping of url hashes to nodes
  2. distributes the burden of crawling
  3. avoids duplication of effort
  4. makes the crawling hard to detect
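
        Sketched in the same style, a doc node adopting a url might do 
something like this; fetch(), extract_keywords() and extract_links() are 
placeholders for an HTTP client and an HTML parser:

    import hashlib

    def adopt_url(url, my_node):
        doc = fetch(url)                              # hypothetical HTTP fetch
        my_node.doc_cache[url] = doc

        # Tell the right index nodes which keywords this page scores on.
        for kw in extract_keywords(doc):              # e.g. top-100 terms
            idx_hash = hashlib.md5(kw.encode()).hexdigest()
            my_node.send(my_node.find_node(idx_hash), KeywordAnnounce(url, [kw]))

        # Hand each outgoing link to the node that owns its url hash,
        # spreading the crawl with no central coordinator.
        for link in extract_links(doc):
            link_hash = hashlib.md5(link.encode()).hexdigest()
            my_node.send(my_node.find_node(link_hash), CacheStoreRequest(link))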


c.) Implementation details

        Building the database: since the database will not be blindly compiled, 
it will at first tend to contain only documents that users search for the most. 
As the 'crawl space' gets larger the database will have more complete coverage, 
but since nodes will probably use LRU to prune their caches, popular pages will 
tend to stick around more. Hopefully this will allow us to have the same 
perceived keyword coverage as Google without having to store all 3 billion 
pages: 3x10^9 pages * 30kB each = 90TB. We can't require that each user donate 
1GB or more, though 100-200MB is probably not asking too much; at that rate, 
full coverage would take on the order of half a million to a million users. 
The hope is to grow user capacity faster than the database. Linking the 
database to actual use helps to do this.

        Client control: each client will be able to control how much disk space 
it will donate for index and doc caches, and can keep them in check with LRU 
and expiration sweeps. Each client may choose to limit the number of 
connections, pages, results, etc.
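
        One possible shape for that policy is an LRU cache with a byte budget 
plus a periodic expiry sweep; the 150MB budget and one-week TTL below are only 
example knobs:

    import time
    from collections import OrderedDict

    class BoundedCache:
        """LRU cache with a byte budget and a TTL-based expiry sweep."""
        def __init__(self, max_bytes=150 * 2**20, ttl=7 * 86400):
            self.max_bytes, self.ttl = max_bytes, ttl
            self.items = OrderedDict()          # url -> (data, stored_at)
            self.used = 0

        def get(self, url):
            data, stamp = self.items.pop(url)   # KeyError if absent
            self.items[url] = (data, stamp)     # re-insert at MRU end
            return data

        def put(self, url, data):
            if url in self.items:
                old, _ = self.items.pop(url)
                self.used -= len(old)
            self.items[url] = (data, time.time())
            self.used += len(data)
            while self.used > self.max_bytes:   # LRU pruning
                _, (old, _) = self.items.popitem(last=False)
                self.used -= len(old)

        def sweep(self):
            # Periodic expiration sweep: drop entries older than ttl.
            now = time.time()
            for url in [u for u, (_, t) in self.items.items()
                        if now - t > self.ttl]:
                old, _ = self.items.pop(url)
                self.used -= len(old)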

        Load balancing: Overlap may be desirable, to avoid putting too much 
load on the unfortunate machine that handles the 'sex' keyword. Let's add a 
rule: an index or doc request may be handled not only by the arc matching the 
hash, but also by some inverse or bit-shift of the hash.
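
        A sketch of that rule: derive a few alternate hashes from the primary 
one by bitwise inverse and bit-rotation, then pick a replica arc at random per 
request. The exact derivation is a placeholder; any agreed-upon deterministic 
transform would do:

    import hashlib, random

    BITS = 128                                # MD5 output width

    def replica_hashes(keyword, extra=3):
        h = int(hashlib.md5(keyword.encode()).hexdigest(), 16)
        mask = (1 << BITS) - 1
        out = [h, h ^ mask]                   # primary arc and bitwise inverse
        for i in range(1, extra):
            out.append(((h << i) | (h >> (BITS - i))) & mask)  # bit-rotation
        return ['%032x' % x for x in out]

    def pick_index_node(keyword, my_node):
        # Any replica arc may serve the request, so choose one at
        # random to spread load on hot keywords.
        return my_node.find_node(random.choice(replica_hashes(keyword)))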


d.) Todo
        multi-keyword searches: the querystring should be parsed by the user's 
node, and search-requests sent for each keyword. But we should also send the 
complete querystring to each, so the doc servers can rank on all of them.
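
        Roughly, on the user's node (new_search_uid() is a hypothetical 
helper, and the full keyword list would ride along in an extra search-request 
field beyond the ones proposed above):

    import hashlib

    def start_multi_search(querystring, my_node, limit=20):
        keywords = querystring.lower().split()        # naive tokenizer
        uid = my_node.new_search_uid()                # hypothetical helper
        for kw in keywords:
            idx_hash = hashlib.md5(kw.encode()).hexdigest()
            # Each index node gets its own keyword, but the eventual
            # rank-requests should carry every term so doc nodes can
            # score documents against the whole query.
            my_node.send(my_node.find_node(idx_hash),
                         SearchRequest(kw, search_uid=uid, limit=limit))
        return uid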

        page attributes: how to include language and geographic preferences in 
the search?

        growth: what happens when a new node splits my keyword space with me? 
Should it rebuild on its own or get it from me? How can the new node trust what 
I give it? Maybe it should 'inherit' data from me, but expire it as soon as 
possible and replace it with its own.


