ICTACT Journals

NORMALIZATION TECHNIQUES FOR IDENTIFYING DUPLICATE RECORDS FROM MULTIPLE DATA SOURCES

ICTACT Journal on Soft Computing (Volume: 10, Issue: 1)

Abstract

This paper uses K-Nearest Neighbor (K-NN) as a supervised, web-scale forum crawler. The approach helps determine whether the information in each forum originates with the data it presents. It also removes anonymous informative links from forum data, which avoids anonymous web usage and reduces the time users spend crawling web pages. The goal is a systematic, novel implementation of deep Web learning using K-NN, directed at real-time information, with an exclusive stage of implications. A focused crawler for duplicate records in online information analyzes its crawl frontier to find the hyperlinks most likely to be relevant to the crawl, and avoids irrelevant regions of the web. It identifies the next most important and relevant link to follow by relying on probabilistic models to predict the relevance of a document. It can mine a group of duplicate records before selecting a value for an attribute of a normalized record. The overall performance of focused duplicate-record web page crawling depends on the richness of links within the specific topic the user is searching. Based on this observation, the web forum crawling problem is reduced to a URL-type recognition problem. The paper shows how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Test results show that K-NN achieved over 98% effectiveness and 97% coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying K-NN to more than 100 community question-and-answer sites and blog sites demonstrated that the concept of implicit navigation path could apply to other social media sites.
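
The step of mining a group of duplicate records before selecting a value for each attribute of a normalized record can be illustrated with a minimal Python sketch. The abstract does not state the selection rule, so this sketch assumes a simple majority vote per attribute, with ties broken in favor of the longer value. The function name normalize_group, the field names, and the sample records are hypothetical and do not come from the paper.

```python
from collections import Counter


def normalize_group(records):
    """Build one normalized record from a group of duplicate records.

    Assumption (not from the paper): each attribute takes the value
    that occurs most often in the group; ties go to the longest value.
    """
    attributes = sorted({key for record in records for key in record})
    normalized = {}
    for attr in attributes:
        # Collect the non-empty values for this attribute, trimmed of whitespace.
        values = [r[attr].strip() for r in records if r.get(attr, "").strip()]
        if not values:
            normalized[attr] = ""
            continue
        counts = Counter(values)
        top = max(counts.values())
        candidates = [value for value, count in counts.items() if count == top]
        # Tie-break: longest value first, then lexicographic order.
        normalized[attr] = max(candidates, key=lambda v: (len(v), v))
    return normalized


# Hypothetical group of records already judged to be duplicates.
duplicates = [
    {"name": "J. Smith", "city": "Salem", "phone": ""},
    {"name": "John Smith", "city": "Salem", "phone": "555-0100"},
    {"name": "John Smith", "city": "salem", "phone": "555-0100"},
]
normalized = normalize_group(duplicates)
```

For this sample group, the sketch selects "John Smith", "Salem", and "555-0100", since each of those values appears in two of the three records. A different rule, such as preferring the most recent source, would fit the same structure by changing only the selection step.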

Authors

P Abinaya, R Jayavadivel
Vivekanandha College of Engineering for Women, India

Keywords

Web learning, Neural Networks, Datasets, Regular Expression Patterns and Classifiers

Published By

ICTACT

Published In

ICTACT Journal on Soft Computing (Volume: 10, Issue: 1)

Date of Publication

October 2019

Pages

1994-1998

DOI

10.21917/ijsc.2019.0281