LOSSLESS TEXT COMPRESSION FOR UNICODE TAMIL DOCUMENTS

Abstract
Data compression for the world's languages, including Indian languages, is in high demand. Tamil is one of the longest-surviving classical languages in the world. The use of Tamil for communication and storage has increased with the digitization of government documents and orders. Lossless text compression for Tamil documents involves substituting an ASCII character for each Unicode Tamil character, since an ASCII character occupies one byte whereas a Unicode character occupies between 1 and 4 bytes, depending on the encoding used to store the file. Decompression reverses the compression step, i.e., it replaces the ASCII characters with the original Unicode characters. This paper describes the architecture of the compression and decompression processes for Tamil text documents.
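
The substitution idea can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual substitution table, so the mapping here is an assumption made only for illustration: each code point in the Unicode Tamil block (U+0B80 to U+0BFF) is replaced by a single byte in the range 0x80 to 0xFF, and plain ASCII characters are passed through unchanged. Strictly speaking these are single-byte values above the 7-bit ASCII range, chosen so that Tamil and ASCII text cannot collide. The names compress, decompress, TAMIL_START and TAMIL_END are hypothetical.

```python
TAMIL_START = 0x0B80  # first code point of the Unicode Tamil block
TAMIL_END = 0x0BFF    # last code point of the Unicode Tamil block


def compress(text: str) -> bytes:
    """Substitute one byte per character (assumed mapping, not the paper's table)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)  # ASCII characters are kept as they are
        elif TAMIL_START <= cp <= TAMIL_END:
            out.append(0x80 + (cp - TAMIL_START))  # Tamil -> one byte in 0x80..0xFF
        else:
            raise ValueError(f"unsupported character U+{cp:04X}")
    return bytes(out)


def decompress(data: bytes) -> str:
    """Reverse the substitution and restore the original Unicode text."""
    return "".join(
        chr(b) if b < 0x80 else chr(TAMIL_START + (b - 0x80)) for b in data
    )


if __name__ == "__main__":
    sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd text"  # the word "Tamil" in Tamil script, then ASCII
    packed = compress(sample)
    assert decompress(packed) == sample  # round trip is lossless
    print(len(sample.encode("utf-8")), "bytes as UTF-8 ->", len(packed), "bytes substituted")
```

In UTF-8 every Tamil code point takes 3 bytes (2 bytes in UTF-16), so under this assumed mapping each Tamil character shrinks to one byte while the round trip stays exact. Text containing characters outside ASCII and the Tamil block is rejected rather than silently altered, which keeps the sketch lossless.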

Authors
B Vijayalakshmi, N Sasirekha
Vidyasagar College of Arts and Science, India

Keywords
Compression, Decompression, Unicode, ASCII, Substitution
Published By:
ICTACT
Published In:
ICTACT Journal on Soft Computing
(Volume: 8, Issue: 2)
Date of Publication:
January 2018

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.