ABSTRACT
Title: AN EFFICIENT APPROACH FOR TEMPLATE EXTRACTION
Authors: Pravallika.CH, Swapna Goud.N, Vishnu Murthy.G
Keywords: Template extraction, Clustering web pages, MDL principle.
Issue Date: August 2012
Abstract: The World Wide Web is a vast and rapidly growing source of useful information, used to publish and access information on the Internet. Web pages combine templates with content to give readers easy access. For a search engine, however, detecting the template and displaying the content to users is a major task in retrieving web pages. Templates are considered harmful because they compromise the performance of clustering and classification of web pages. In this paper, we present a novel algorithm for extracting templates from web documents that are generated from heterogeneous template structures. In the proposed approach, we cluster the web documents based on the similarity of their template structures, so that the template for each cluster is extracted simultaneously. The resulting clusters are given as input to the Roadrunner system, which is used to extract information from template-based web pages.
Page(s): 348-352
ISSN: 2229-3345
Source: Vol. 3, Issue 08
|
|