ANALYSIS OF IMAGE PREPROCESSING TECHNIQUES TO IMPROVE OCR OF GARHWALI TEXT OBTAINED USING THE HINDI TESSERACT MODEL

ICTACT Journal on Image and Video Processing (Volume: 12, Issue: 2)

Abstract

A huge amount of information exists in textbooks, paper documents, newspapers, and other physical forms, and it must be digitized for effective access and long-term availability. Optical Character Recognition (OCR) is an effective way to digitize such text. In this study, we use Google’s Tesseract as the OCR tool. Our focus is to improve Tesseract’s accuracy on machine-printed Garhwali documents using image pre-processing techniques, including Super-Resolution (SR), binarization (Otsu and adaptive thresholding), skew correction, morphological operations, and ImageMagick methods. We propose three approaches: two differ only in the binarization method (Otsu or adaptive thresholding), and the third uses ImageMagick methods for pre-processing. For evaluation, we created a dataset by photographing a sample of five Garhwali textbooks with two mobile cameras of different resolutions; two books were captured with the high-resolution camera and the other three with the low-resolution camera. Our experiments showed good results in specific cases: for high-resolution images, Otsu thresholding without Super-Resolution achieved 88.13% accuracy, and for low-resolution images, ImageMagick with Super-Resolution achieved 87.44% accuracy.
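
To illustrate the difference between the two binarization methods named above, the following is a minimal sketch in pure Python, not the implementation used in the study. The function names (`otsu_threshold`, `binarize_global`, `binarize_adaptive_mean`), the block size, and the offset `c` are assumptions chosen for illustration; the adaptive variant here uses a local mean with the window clipped at the image border, which may differ from the variant the authors applied.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, pixel values in 0..255


def otsu_threshold(img: Image) -> int:
    """Return the global threshold t that maximises between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels in the dark class
    sum_dark = 0  # sum of intensities in the dark class
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_bright = total - w_dark
        if w_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        var_between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_global(img: Image, t: int) -> Image:
    """Apply one threshold to the whole image (white if value > t)."""
    return [[255 if v > t else 0 for v in row] for row in img]


def binarize_adaptive_mean(img: Image, block: int = 3, c: int = 5) -> Image:
    """Threshold each pixel against the mean of its neighbourhood minus c.

    The window is clipped at the image border (an illustrative choice).
    """
    h, w = len(img), len(img[0])
    r = block // 2
    out = []
    for y in range(h):
        row_out = []
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[j][i] for j in ys for i in xs]
            local_mean = sum(vals) / len(vals)
            row_out.append(255 if img[y][x] > local_mean - c else 0)
        out.append(row_out)
    return out
```

A global threshold such as Otsu's suits evenly lit pages, while a local threshold can cope better with shadows and uneven lighting in camera-captured pages, which is one plausible reason to compare the two.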

Authors

Sukhbindra Singh Rawat, Ashutosh Sharma, Rachana Gusain
Doon University, India

Keywords

Optical Character Recognition, Garhwali Language, Devanagari Script, Image Preprocessing, ImageMagick

Published By
ICTACT
Published In
ICTACT Journal on Image and Video Processing
(Volume: 12, Issue: 2)
Date of Publication
November 2021
Pages
2588-2594
