International Conference on Information Technology and Computer Science, 3rd (ITCS 2011)
65 Comparative Study of Text Representation Methods
Several text representation methods, such as the bag-of-words and N-gram models, have been widely used in natural language processing, text mining, web data analysis, and related fields. The bag-of-words representation is simple to implement and performs well. However, it becomes complicated to apply to documents in oriental languages, because such text offers no usable intrinsic separators between words. The N-gram representation can be applied to different languages whether or not they have separators, since it processes a document by sliding a window over it one character at a time. However, some problems, such as sparseness and the zero-frequency problem, remain unsolved in the N-gram model. In an earlier study, we proposed a pattern representation scheme using data compression (PRDC). The PRDC method not only processes text data independently, but also handles multimedia data effectively. In this study, we introduce the proposed approach and compare it with the two text representation methods described above. Performance is compared in terms of clustering ability, and we analyze the text representation methods based on the experimental results.
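As a rough illustration of the representations contrasted above, the following minimal sketch builds a bag-of-words count and a character N-gram count for the same string. The function names are hypothetical and chosen only for this example. The third function is an assumption rather than the PRDC method itself: the abstract does not give the algorithm, so the sketch only shows the general idea of describing a text by how well it compresses against several base texts, using zlib preset dictionaries as a stand-in.

```python
from collections import Counter
import zlib


def bag_of_words(text):
    """Count whitespace-separated tokens; depends on explicit separators."""
    return Counter(text.lower().split())


def char_ngrams(text, n=2):
    """Count character n-grams by sliding a window one character at a time."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def compression_ratio_vector(text, base_texts):
    """Illustrative stand-in only, NOT the authors' PRDC algorithm.

    Each component is compressed size / original size when `text` is
    deflated with one base text supplied as a zlib preset dictionary.
    """
    data = text.encode("utf-8")
    ratios = []
    for base in base_texts:
        comp = zlib.compressobj(zdict=base.encode("utf-8"))
        compressed = comp.compress(data) + comp.flush()
        ratios.append(len(compressed) / len(data))
    return ratios


if __name__ == "__main__":
    english = "the cat saw the dog"
    japanese = "自然言語処理"  # no spaces between words

    print(bag_of_words(english))      # word counts; 'the' occurs twice
    print(bag_of_words(japanese))     # whole string becomes a single token
    print(char_ngrams(japanese, 2))   # five overlapping character bigrams

    # Hypothetical base texts; a real system would choose these carefully.
    bases = ["the cat and the dog", "自然言語処理と機械学習"]
    print(compression_ratio_vector(english, bases))
```

The Japanese string shows the separator problem directly: whitespace tokenization returns the whole string as one token, while the character window still yields usable features. The number of distinct N-grams grows quickly with N, which is one way the sparseness mentioned above arises.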