AUTOMATIC TAGGING OF PERSIAN WEB PAGES BASED ON N-GRAM LANGUAGE MODELS USING MAPREDUCE

Abstract
Page tagging is one of the most important facilities for increasing the accuracy of information retrieval in the web. Tags are simple pieces of data that usually consist of one or several words, and briefly describe a page. Tags provide useful information about a page and can be used for boosting the accuracy of searching, document clustering, and result grouping. The most accurate solution to page tagging is using human experts. However, when the number of pages is large, humans cannot be used, and some automatic solutions should be used instead. We propose a solution called PerTag which can automatically tag a set of Persian web pages. PerTag is based on n-gram models and uses the tf-idf method plus some effective Persian language rules to select proper tags for each web page. Since our target is huge sets of web pages, PerTag is built on top of the MapReduce distributed computing framework. We used a set of more than 500 million Persian web pages during our experiments, and extracted tags for each page using a cluster of 40 machines. The experimental results show that PerTag is both fast and accurate

Authors
Saeed Shahrivari , Saeed Rahmani, Hooman Keshavarz
Freelancer, Iran

Keywords
Automatic Web Page Tagging, Persian Web, MapReduce
Published By :
ICTACT
Published In :
ICTACT Journal on Soft Computing
( Volume: 5 , Issue: 4 )
Date of Publication :
July 2015

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.