ASME Press Select Proceedings

International Conference on Information Technology and Computer Science, 3rd (ITCS 2011)

Editor
V. E. Muhin
National Technical University of Ukraine
W. B. Hu
Wuhan University
ISBN:
9780791859742
No. of Pages:
656
Publisher:
ASME Press
Publication date:
2011

Several text representation methods, such as the bag-of-words and N-gram models, are widely used in natural language processing, text mining, and web data analysis. The bag-of-words representation is simple to implement and performs well. It becomes complicated, however, when processing documents in East Asian languages, because these languages do not provide word separators that can be relied on for tokenization. The N-gram representation can be applied to any language, whether or not separators are present: it processes a document by sliding a window through it one character at a time. Some problems, such as sparseness and the zero-frequency problem, remain unsolved in the N-gram model. In an earlier study, we proposed a pattern representation scheme using data compression (PRDC). The PRDC method not only processes text data independently but also handles multimedia data effectively. In this study, we introduce the proposed approach and compare it with the two text representation methods above. Performance is compared in terms of clustering ability, and the text representation methods are analyzed on the basis of the experimental results.
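
The contrast between the two baseline representations can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `bag_of_words` and `char_ngrams`, the whitespace tokenizer, and the sample strings are assumptions made for illustration only, and PRDC itself is not implemented here.

```python
from collections import Counter


def bag_of_words(text):
    """Count tokens split on whitespace (assumes separators mark word boundaries)."""
    return Counter(text.lower().split())


def char_ngrams(text, n=2):
    """Count character N-grams by sliding a window of width n one character at a time."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


english = "text mining of text"
japanese = "自然言語処理"  # written without spaces between words

# With separators, bag-of-words recovers individual words.
print(bag_of_words(english))

# Without separators, the whole string collapses into a single token.
print(bag_of_words(japanese))

# Character N-grams do not depend on separators at all.
print(char_ngrams(japanese, n=2))
```

In this sketch the whitespace tokenizer treats the unsegmented string as one token, whereas the sliding window still yields overlapping character pairs. The number of distinct N-grams grows quickly with `n`, which is one way to see the sparseness issue mentioned above.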
