TRAINING TREE ADJOINING GRAMMARS WITH HUGE TEXT CORPUS USING SPARK MAP REDUCE

ICTACT Journal on Soft Computing ( Volume: 5 , Issue: 4 )

Abstract

vioft2nntf2t|tblJournal|Abstract_paper|0xf4ffa9a61b0000008dfc030001000200
Tree adjoining grammars (TAGs) are mildly context sensitive formalisms used mainly in modelling natural languages. Usage and research on these psycho linguistic formalisms have been erratic in the past decade, due to its demanding construction and difficulty to parse. However, they represent promising future for formalism based NLP in multilingual scenarios. In this paper we demonstrate basic synchronous Tree adjoining grammar for English-Tamil language pair that can be used readily for machine translation. We have also developed a multithreaded chart parser that gives ambiguous deep structures and a par dependency structure known as TAG derivation. Furthermore we then focus on a model for training this TAG for each language using a large corpus of text through a map reduce frequency count model in spark and estimation of various probabilistic parameters for the grammar trees thereafter; these parameters can be used to perform statistical parsing on the trained grammar.

Authors

Vijay Krishna Menon, S. Rajendran, K.P. Soman
Amrita Vishwa Vidyapeetham, India

Keywords

TAGs, Spark, Probabilistic Grammar, RDDs, Parsing

Published By
ICTACT
Published In
ICTACT Journal on Soft Computing
( Volume: 5 , Issue: 4 )
Date of Publication
July 2015
Pages
1021-1026

ICT Academy is an initiative of the Government of India in collaboration with the state Governments and Industries. ICT Academy is a not-for-profit society, the first of its kind pioneer venture under the Public-Private-Partnership (PPP) model

Contact Us

ICT Academy
Module No E6 -03, 6th floor Block - E
IIT Madras Research Park
Kanagam Road, Taramani,
Chennai 600 113,
Tamil Nadu, India

For Journal Subscription: journalsales@ictacademy.in

For further Queries and Assistance, write to us at: ictacademy.journal@ictacademy.in