Web Content Extraction Using Hybrid Approach

ICTACT Journal on Soft Computing (Volume: 4, Issue: 2)

Abstract

The World Wide Web is a rich source of voluminous and heterogeneous information that continues to expand in size and complexity. Many Web pages are unstructured or semi-structured and therefore contain noisy information such as advertisements, links, headers and footers. This noise makes the extraction of Web content tedious. Many of the techniques proposed for Web content extraction are based on either automatic extraction or hand-crafted rule generation. Automatic extraction is performed through Web page segmentation, which increases the time complexity. Hand-crafted rule generation relies on string manipulation functions, and writing those rules is very difficult. A hybrid approach is proposed to extract the main content from Web pages. An HTML Web page is converted to a DOM tree, features are extracted from it, and rules are generated from the extracted features. Decision tree classification and Naïve Bayes classification are the machine learning methods used for rule generation. Using the rules, the noisy parts of the Web page are discarded and the informative content is extracted. The performance of both decision tree classification and Naïve Bayes classification is measured with metrics such as precision, recall, F-measure and accuracy.
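The pipeline outlined in the abstract (HTML to DOM-like blocks, per-block features, then evaluation by precision, recall, F-measure and accuracy) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not list the features used, so the block tag set, the text-length and link-density features, and the names `BlockFeatureParser`, `extract_features` and `evaluate` are all assumptions made here. The classifier training step (decision tree or Naïve Bayes) is omitted; the feature dictionaries would be its input.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate content blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}


class BlockFeatureParser(HTMLParser):
    """Collects simple text features for each block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks, in closing order
        self._stack = []   # currently open blocks
        self._in_link = 0  # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text_len": 0, "link_len": 0})
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS and self._stack:
            block = self._stack.pop()
            total = block["text_len"]
            # Link density: share of the block's text that sits inside links.
            block["link_density"] = block["link_len"] / total if total else 0.0
            self.blocks.append(block)

    def handle_data(self, data):
        n = len(data.strip())
        if n and self._stack:
            block = self._stack[-1]  # text is credited to the innermost block
            block["text_len"] += n
            if self._in_link:
                block["link_len"] += n


def extract_features(html):
    """Return one feature dictionary per block found in the HTML string."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def evaluate(predicted, actual):
    """Precision, recall, F-measure and accuracy for boolean labels
    (True = informative content, False = noise)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}
```

In this sketch a navigation block made up entirely of links gets a link density of 1.0, while a plain paragraph gets 0.0, which is the kind of signal a learned rule could use to separate noise from main content. The parser assumes reasonably well-formed HTML; real pages would likely need a more tolerant DOM builder.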

Authors

K. Nethra, J. Anitha, G. Thilagavathi
Sri Ramakrishna Engineering College, India

Keywords

Web Mining, Web Content Extraction, Decision Tree Learning, Naïve Bayes Classification, DOM Tree

Published By: ICTACT
Published In: ICTACT Journal on Soft Computing (Volume: 4, Issue: 2)
Date of Publication: January 2014
Pages: 692-696
