Web Graphs and P2P

Web Graphs

Everyone in computer science, and in some fields of engineering (industrial engineering, for example?), is very familiar with “graphs”: those nodes and arcs. And we can actually represent the web as a [huge] graph, where node = web page and arc = (hyper)link.

This representation gives us a way to understand the characteristics of the web better, using what we already know about ordinary graphs.
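
To make the idea concrete, here is a minimal sketch in Python. The page names and links are made up for illustration. It stores a tiny web as a dictionary (node = page, arc = link) and counts incoming and outgoing links, which are about the simplest characteristics one can read off a graph.

```python
from collections import Counter

# Hypothetical toy web: each key is a page, each value lists the pages it links to.
web_graph = {
    "home.html": ["about.html", "blog.html"],
    "about.html": ["home.html"],
    "blog.html": ["home.html", "about.html", "post1.html"],
    "post1.html": ["blog.html"],
}

def out_degrees(graph):
    """Number of outgoing links per page."""
    return {page: len(links) for page, links in graph.items()}

def in_degrees(graph):
    """Number of incoming links per page."""
    counts = Counter({page: 0 for page in graph})
    for links in graph.values():
        for target in links:
            counts[target] += 1
    return dict(counts)
```

Calling in_degrees(web_graph) tells us how many pages point at each page, which is a first rough hint of how "popular" a page is.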

graph structure in the web | web graph | more on web graph


Speaking of representing a document or site as a node in a graph, the peer-to-peer (P2P) people have been doing this since their early days.

To make this more relevant to this blog: one of the most popular P2P applications is obviously an IR-like system, searching for an mp3 song or a DivX movie given a title or a singer’s name.

Searching a P2P network is not like a traditional search engine searching its database (which is a snapshot of part of the web at a particular time, collected by web spiders).

Rather, a P2P search visits a node, searches within that node, jumps to another node, and so on, in “real time”. Clearly, it is impossible to visit every node in the network; there are just too many nodes out there. To decide which nodes to visit, it needs a routing algorithm.

As a result, we can loosely simplify the search problem in a P2P network into a routing problem.

[ to find a document is to find a way to that document ]
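
Here is a rough sketch of that idea in Python, not any real P2P protocol. The peers, their files, and the function names (choose_next, search) are all hypothetical. The search hops from peer to peer, searches each one locally, and gives up after max_hops. The routing decision lives in choose_next, which here naively forwards to every unvisited neighbour; a smarter router would pick only the promising ones.

```python
from collections import deque

# Hypothetical toy network: each peer knows its own files and its neighbours.
peers = {
    "A": {"files": ["intro.txt"], "neighbours": ["B", "C"]},
    "B": {"files": ["song.mp3"], "neighbours": ["A", "D"]},
    "C": {"files": ["movie.avi"], "neighbours": ["A", "D"]},
    "D": {"files": ["another song.mp3"], "neighbours": ["B", "C", "E"]},
    "E": {"files": ["rare song.mp3"], "neighbours": ["D"]},
}

def choose_next(peer, visited):
    """Routing decision (deliberately naive): forward to every unvisited neighbour."""
    return [n for n in peers[peer]["neighbours"] if n not in visited]

def search(start, keyword, max_hops):
    """Visit peers hop by hop, searching each one locally, up to max_hops away."""
    hits = []
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        peer, hops = queue.popleft()
        # Local search on the peer we are currently visiting.
        for name in peers[peer]["files"]:
            if keyword in name:
                hits.append((peer, name))
        if hops == max_hops:
            continue  # hop budget used up on this path, do not forward further
        for nxt in choose_next(peer, visited):
            visited.add(nxt)
            queue.append((nxt, hops + 1))
    return hits
```

With a small hop budget, search("A", "song", 1) only reaches A's direct neighbours, so the file on peer E is never found. That is exactly why which nodes we route to matters.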

There are even some more advanced routing algorithms that use semantics!

bact’: I used to think about using NLP with P2P routing. But it was just “thinking” anyway, I never did it .. lazy me 🙁

Summarization for Search Engine

Speaking of document clustering/categorization/classification, another approach to help users get through mountains of pages may be summarization.

That is, instead of showing only the page title, the URL, and the first few (nonsense) paragraphs of the page.

Short summaries may help users decide which pages are whattheywant and which are whattheydontwant.

Besides grouping the retrieved documents so that they are easier to go through (with the user then deciding which ones to keep and which to skip),

if we also have a short summary of each page, users should be able to decide more easily and more quickly.

If you are interested in papers about summarization for search engines, try starting from here:

Dragomir R. Radev, Weiguo Fan (2000), “Automatic summarization of search engine hit lists”.

CiteSeer? Hey! The citation graph is another feature that we could use .. I have no idea how yet.

Actually, the way papers cite each other does tell us something about the “importance” and “relevance” of documents.

If two papers cite each other, they are probably related, and if a paper is cited often, it is probably important (much like PageRank?)
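
To play with the "cited often means important" idea, here is a simplified PageRank-style power iteration in Python. It is only a sketch: the citation graph is made up, and the damping value and iteration count are assumptions, not what CiteSeer or any search engine actually uses.

```python
# Hypothetical citation graph: paper -> papers it cites.
citations = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_a", "paper_c"],
    "paper_c": [],
    "paper_d": ["paper_c", "paper_a"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank: each paper shares its score among the papers it cites."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Score held by papers that cite nothing is spread evenly over all papers.
        dangling = sum(rank[node] for node, out in graph.items() if not out)
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in graph
        }
        for node, out in graph.items():
            for target in out:
                new_rank[target] += damping * rank[node] / len(out)
        rank = new_rank
    return rank
```

In this toy graph paper_c is cited by every other paper, so it ends up with the highest score, while paper_b and paper_d, which nobody cites, end up at the bottom.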


From whatwewant.www

Information Retrieval (and related) research groups in Thailand