REPTREE CLASSIFIER FOR IDENTIFYING LINK SPAM IN WEB SEARCH ENGINES
Abstract
vioft2nntf2t|tblJournal|Abstract_paper|0xf4ff8b770e0000001124020001000600
Search Engines are used for retrieving the information from the web. Most of the times, the importance is laid on top 10 results sometimes it may shrink as top 5, because of the time constraint and reliability on the search engines. Users believe that top 10 or 5 of total results are more relevant. Here comes the problem of spamdexing. It is a method to deceive the search result quality. Falsified metrics such as inserting enormous amount of keywords or links in website may take that website to the top 10 or 5 positions. This paper proposes a classifier based on the Reptree (Regression tree representative). As an initial step Link-based features such as neighbors, pagerank, truncated pagerank, trustrank and assortativity related attributes are inferred. Based on this features, tree is constructed. The tree uses the feature inference to differentiate spam sites from legitimate sites. WEBSPAM-UK-2007 dataset is taken as a base. It is preprocessed and converted into five datasets FEATA, FEATB, FEATC, FEATD and FEATE. Only link based features are taken for experiments. This paper focus on link spam alone. Finally a representative tree is created which will more precisely classify the web spam entries. Results are given. Regression tree classification seems to perform well as shown through experiments.

Authors
S.K. Jayanthi1 and S. Sasikala2
1Vellalar College for Women, India,2KSR College of Arts and Science, India

Keywords
Web Link Spam, Classification, Reptree, Decision Tree, Search Engine
Yearly Full Views
JanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecember
000000000000
Published By :
ICTACT
Published In :
ICTACT Journal on Soft Computing
( Volume: 3 , Issue: 2 , Pages: 498-505 )
Date of Publication :
January 2013
Page Views :
112576
Full Text Views :
1

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.