ASME Press Select Proceedings
Intelligent Engineering Systems through Artificial Neural Networks Volume 18
Editor
Cihan H. Dagli
ISBN-10:
0791802823
ISBN:
9780791802823
No. of Pages:
700
Publisher:
ASME Press
Publication date:
2008

We describe a method for extracting content text from diverse Web pages using the HTML document's Text-To-Tag Ratio (TTR) rather than specific HTML cues, which are not consistent across Web pages. We show how to compute the TTR on a line-by-line basis and then cluster the results into content and non-content areas. Because the resulting TTR histogram is one-dimensional, it is not easily clustered; we therefore present a technique that represents the histogram in two dimensions. Next, we compare clustering techniques such as EM, K-Means, and Farthest First (in density and distance modes) with a threshold partitioning technique on the resulting two-dimensional data. These clustering techniques are further enhanced with histogram smoothing. Finally, we evaluate our approach using standard accuracy, precision, and recall metrics.
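The line-by-line TTR computation, smoothing, and threshold partitioning named in the abstract could be sketched roughly as follows. This is a minimal illustration, not the chapter's implementation. The function names, the regular expression used to detect tags, the moving-average smoother, the treatment of tag-free lines (dividing by 1), and the caller-supplied threshold are all assumptions, since the abstract does not specify them.

```python
import re

# Assumption: anything between '<' and '>' on a single line counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return one text-to-tag ratio per line of the HTML source."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text_chars = len(TAG_RE.sub("", line).strip())
        # Assumption: a line with no tags is divided by 1 to avoid
        # division by zero, so its ratio equals its text length.
        ratios.append(text_chars / max(tag_count, 1))
    return ratios


def smooth(values, radius=1):
    """Moving-average smoothing (an assumed choice of smoother)."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def threshold_partition(ratios, threshold):
    """Label a line as content (True) when its ratio exceeds the threshold."""
    return [ratio > threshold for ratio in ratios]


if __name__ == "__main__":
    page = (
        '<div><a href="#">Home</a></div>\n'
        "<p>A long paragraph of article text with very little markup.</p>\n"
        "<br/>"
    )
    ratios = text_to_tag_ratios(page)
    # The threshold value 10 is arbitrary and chosen only for this toy input.
    print(threshold_partition(ratios, threshold=10))
```

Navigation lines tend to carry many tags and little text, so their ratios stay low, while article paragraphs produce high ratios. In practice the threshold would be derived from the data rather than fixed by hand, and the smoothed ratios would be passed to the clustering step instead of a simple cutoff.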

Abstract
Introduction
Threshold Partitioning
Histogram Clustering in 2-Dimensions
Experimentation and Results
Conclusion and Future Work
Acknowledgements
References
This content is only available via PDF.