Managing Gigabytes (Book)

Sometimes it’s more than just ‘search’. We may want it ‘faster’, and many times we want it ‘smaller’.

(And in the case of database or index size, the smaller one is probably the faster one: there is less to look through.)

Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. (read reviews)

From the authors of the book: MG, an open-source indexing and retrieval system for text, images, and textual images. read more

Google File System

How to search for things in a collection is one problem.

How to store things (in a collection) so that they can be searched is another problem.

And the latter can be a really big problem if you have to keep “3,307,998,701 web pages”, as Google does.

Google File System: Technical paper, by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. This is a technical paper that explains Google’s custom scalable cluster filesystem for storing their gigantic database of the entire Web across thousands of low-cost PCs. read more

Web Graphs and P2P

Web Graphs

Everyone in computer science and in some fields of engineering (e.g. industrial engineering?) is very familiar with “graphs”: those nodes and arcs. And, actually, we can represent the web as a [huge] graph, where node = webpage and arc = (hyper)link.

This representation gives us a way to better understand the characteristics of the web (as we already do with ordinary graphs).
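To make the node = webpage, arc = (hyper)link idea concrete, here is a minimal sketch in Python. It assumes the links have already been extracted as (source page, target page) pairs; the page names and the helper functions `build_web_graph` and `in_degrees` are made up for illustration and do not come from any of the linked papers.

```python
from collections import defaultdict


def build_web_graph(links):
    """Build an adjacency list from (source_page, target_page) pairs."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)  # arc: src links to dst
        graph[dst]           # touch dst so pages with no outgoing links still appear as nodes
    return dict(graph)


def in_degrees(graph):
    """Count how many distinct pages link to each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


# Hypothetical pages and links, for illustration only.
links = [
    ("a.example/index", "a.example/about"),
    ("a.example/index", "b.example/home"),
    ("b.example/home", "a.example/index"),
    ("c.example/post", "a.example/index"),
]

graph = build_web_graph(links)
degrees = in_degrees(graph)

for page in sorted(graph):
    print(page, "out:", len(graph[page]), "in:", degrees[page])
```

Once the web is in this form, ordinary graph questions (which pages have many incoming links, which pages are reachable from which) become questions about the web itself.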

graph structure in the web | web graph | more on web graph

Peer-to-Peer

Speaking of representing a document/site as a node in a graph, the Peer-to-Peer people have been doing this since their early days. read more