Intelligent Engineering Systems through Artificial Neural Networks Volume 18
64 Web Content Extraction through Histogram Clustering
Published: 2008
We describe a method for extracting content text from diverse Web pages using the HTML document's Text-To-Tag Ratio (TTR) rather than specific HTML cues, which are not consistent across Web pages. We show how to compute the TTR on a line-by-line basis and then cluster the results into content and non-content areas. The resulting TTR histogram is not easily clustered because it is one-dimensional; we therefore present a technique for representing the histogram in two dimensions. Next, on the resulting two-dimensional data, we compare clustering techniques such as EM, K-Means, and Farthest First (in density and distance modes) with a threshold partitioning technique. These clustering techniques are further enhanced with histogram smoothing. Finally, we evaluate our approach using standard accuracy, precision, and recall metrics.
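
The line-by-line TTR computation, smoothing, and threshold partitioning described above can be illustrated with a minimal Python sketch. The abstract does not give exact definitions, so everything here is an assumption: the ratio is taken as non-tag characters divided by the number of tags on the line (with a denominator of 1 when a line has no tags), smoothing is a simple moving average, and the threshold is the standard deviation of the smoothed ratios. The function names (text_to_tag_ratio, smooth, content_lines) are hypothetical, and the two-dimensional representation and the EM, K-Means, and Farthest First clusterers are not reproduced.

```python
import re
from statistics import pstdev

# Rough tag matcher; an assumption for illustration, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Non-tag characters divided by tag count (assumed definition)."""
    tag_count = len(TAG_RE.findall(line))
    text_chars = len(TAG_RE.sub("", line).strip())
    # Assumption: a line without tags uses a denominator of 1.
    return text_chars / max(tag_count, 1)


def smooth(values, radius=1):
    """Moving average over a window of up to 2 * radius + 1 values."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def content_lines(html: str, radius: int = 1):
    """Keep lines whose smoothed ratio exceeds the threshold."""
    lines = html.splitlines()
    ratios = smooth([text_to_tag_ratio(line) for line in lines], radius)
    # Assumption: the threshold is the population standard deviation.
    threshold = pstdev(ratios)
    return [line for line, r in zip(lines, ratios) if r > threshold]
```

Calling content_lines(page_source) on a page's HTML returns the lines this heuristic treats as content. A larger radius lets short lines inside a text-heavy region inherit a high ratio from their neighbours, at the cost of blurring the boundary with adjacent navigation markup.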