International Conference on Information Technology and Computer Science, 3rd (ITCS 2011)
The Internet now presents users with a huge amount of information in a variety of formats, so extracting information quickly and effectively from these sources has become very important. This paper investigates a novel approach for extracting data from HTML sites, based on an in-depth study of HTMLParser. With this approach, we can conveniently extract hyperlinks and other formatted information and translate relevant pieces of HTML pages into XML. Alternatively, after cleaning the extracted information, we can store the data in a SQL database. We also extend HTMLParser to extract custom tags, which supports a wider range of applications. Experimental results confirm the feasibility of the approach.
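The pipeline described above (extract hyperlinks, translate them into XML, store the cleaned data in a SQL database) can be illustrated with a minimal sketch. It is not the paper's implementation: it uses Python's standard-library `html.parser` module rather than the HTMLParser library studied in the paper, and the class `LinkExtractor`, the helpers `links_to_xml` and `store_links`, the `links` table layout, and the sample HTML are all hypothetical names chosen for illustration. "Cleaning" is assumed here to mean only whitespace normalization.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser  # Python stdlib, not the library from the paper


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumed cleaning step: collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def links_to_xml(links):
    """Serialize the extracted links as a small XML document."""
    root = ET.Element("links")
    for href, text in links:
        node = ET.SubElement(root, "link", href=href)
        node.text = text
    return ET.tostring(root, encoding="unicode")


def store_links(conn, links):
    """Store the extracted links in a SQL table (assumed schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, text TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", links)
    conn.commit()


# Sample input, invented for this sketch.
sample_html = (
    '<p>See <a href="https://example.com/a">  Page\n A </a> '
    'and <a href="/b">B</a>.</p>'
)

parser = LinkExtractor()
parser.feed(sample_html)
parser.close()

xml_text = links_to_xml(parser.links)

conn = sqlite3.connect(":memory:")
store_links(conn, parser.links)
```

Keeping extraction, XML translation, and database storage as separate steps mirrors the alternatives named in the abstract: the same list of extracted links can be written to XML, to SQL, or to both.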