International Conference on Information Technology and Computer Science, 3rd (ITCS 2011)
Document clustering techniques help internet users cope with large numbers of search results. Most of these techniques rely on statistical proximity or dependency between single terms in the documents. Because phrases typically represent the concepts expressed in text more accurately than single terms, a phrase-based document similarity measure can achieve higher clustering accuracy. This paper presents a phrase-based hierarchical clustering method for search engine results. The method consists mainly of a phrase-based document similarity measure and an improved hierarchical clustering algorithm. The similarity measure is motivated by a measure of semantic relatedness, the Extended Gloss Overlaps measure. It extracts matching phrases using a novel phrase-based document index model, the Document Index Graph (DIG). To emphasize the effect of these phrases, it assigns each matching phrase a score much greater than the sum of the scores assigned to its constituent terms. An improved hierarchical clustering algorithm (IHCA) is then proposed to cluster the search results. It seeks and merges eligible mutual nearest neighbor pairs at each level of the hierarchy. Once the state of mutual nearest neighbor pairs is stable, the intermediate results are clustered sequentially.
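The two ideas in the abstract can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes the squared-length scoring commonly associated with gloss-overlap measures (a shared phrase of n words scores n squared, which exceeds the n points its words would earn individually), and it finds shared phrases by a simple greedy longest-common-run search rather than with the Document Index Graph. The function names, the `min_sim` eligibility threshold, and the sentinel tokens are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def longest_common_phrase(a: List[str], b: List[str]) -> Tuple[int, int, int]:
    """Return (length, start_in_a, start_in_b) of the longest shared contiguous word run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # prev[j]: run length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def phrase_overlap_score(doc_a: List[str], doc_b: List[str]) -> int:
    """Greedy overlap score: each shared phrase of n words adds n**2 (assumed scoring)."""
    a, b = list(doc_a), list(doc_b)
    score = 0
    while True:
        n, ia, ib = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        # Replace each matched run with a side-specific sentinel so it cannot
        # match again and its neighbors cannot be joined into a false phrase.
        a[ia:ia + n] = ["\x00A"]
        b[ib:ib + n] = ["\x00B"]


def mutual_nearest_pairs(sim: List[List[float]], min_sim: float = 0.0) -> List[Tuple[int, int]]:
    """Return pairs (i, j), i < j, where each is the other's most similar item.

    min_sim is a hypothetical eligibility threshold, not taken from the paper.
    """
    n = len(sim)
    if n < 2:
        return []
    nearest = [max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
               for i in range(n)]
    return [(i, nearest[i]) for i in range(n)
            if i < nearest[i] and nearest[nearest[i]] == i
            and sim[i][nearest[i]] >= min_sim]


if __name__ == "__main__":
    d1 = "search engine results clustering".split()
    d2 = "clustering search engine results".split()
    print(phrase_overlap_score(d1, d2))
    sim = [[0, 9, 1, 0], [9, 0, 2, 1], [1, 2, 0, 5], [0, 1, 5, 0]]
    print(mutual_nearest_pairs(sim))
```

In this sketch, a clustering loop would merge the returned pairs, recompute similarities between the merged clusters, and repeat until the set of pairs stops changing. How the paper defines eligibility and how it clusters the intermediate results afterwards is not specified in the abstract, so those steps are left out.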