NORMALIZATION TECHNIQUES FOR IDENTIFYING DUPLICATE RECORDS FROM MULTIPLE DATA SOURCES

ICTACT Journal on Soft Computing (Volume: 10, Issue: 1)

Abstract

This paper uses K-Nearest Neighbor (K-NN), a supervised web-scale forum crawler. The approach helps determine whether the information in each forum is originally nested with the data it presents. It also removes anonymous informative links from forum data, which helps avoid anonymous web usage and reduces the time users spend crawling web pages. The goal is a systematic, novel implementation of deep Web learning using K-NN, directed toward real-time information with an exclusive stage of implications. A focused, online duplicate-record crawler analyzes its crawl boundary to find the hyperlinks most likely to be relevant to the crawl, and avoids irrelevant regions of the web. It identifies the next most important and relevant link to follow by relying on probabilistic models to predict the relevance of the document. It can mine a group of duplicate records before selecting a value for an attribute of a normalized record. The overall performance of focused duplicate-record web page crawling depends on the richness of links within the specific topic the user is searching. Based on this observation, the web forum crawling problem is reduced to a URL-type recognition problem. The paper also shows how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that K-NN achieved over 98% effectiveness and 97% coverage on a large set of test forums powered by over 150 different forum software packages.
In addition, the results of applying K-NN on more than 100 community Question and Answer sites and Blog sites demonstrated that the concept of implicit navigation path could apply to other social media sites.
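
To make the idea of URL-type recognition concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes three hypothetical URL types (index, thread, page_flip) and hand-written regular expressions; the abstract describes learning such patterns automatically from training sets, which this sketch does not attempt. The function name classify_url and the example URLs are likewise assumptions made for illustration.

```python
import re

# Hypothetical, hand-written patterns for illustration only. The paper
# describes learning patterns from automatically created training sets.
# Order matters: the first matching pattern decides the URL type.
URL_TYPE_PATTERNS = [
    ("page_flip", re.compile(r"[?&/]page[-=/]\d+")),
    ("thread", re.compile(r"/(?:thread|topic)[-/]\d+")),
    ("index", re.compile(r"/(?:forum|board)[-/]\d+")),
]


def classify_url(url: str) -> str:
    """Return the first URL type whose pattern matches, else 'other'."""
    for url_type, pattern in URL_TYPE_PATTERNS:
        if pattern.search(url):
            return url_type
    return "other"


if __name__ == "__main__":
    for example in [
        "http://forum.example.com/forum-12",
        "http://forum.example.com/thread-345",
        "http://forum.example.com/thread-345/page-2",
        "http://forum.example.com/login",
    ]:
        print(example, "->", classify_url(example))
```

Under these assumptions, a crawler would follow only URLs typed as index, thread, or page_flip and skip everything classified as other, such as login pages.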

Authors

P Abinaya, R Jayavadivel
Vivekanandha College of Engineering for Women, India

Keywords

Web learning, Neural Networks, Datasets, Regular Expression Patterns and Classifiers

Published By: ICTACT
Published In: ICTACT Journal on Soft Computing (Volume: 10, Issue: 1)
Date of Publication: October 2019
Pages: 1994-1998
