1Institute of Computer Science and Information Technology, Faculty of Management and Computer Sciences, The University of Agriculture, Peshawar, Khyber Pakhtunkhwa, Pakistan; 2Department of Computer Science, Shaheed Benazir Bhutto University, Sheringal, Upper Dir, 18200, Pakistan; 3FAST National University of Computer and Emerging Sciences, Islamabad, Pakistan.
ABSTRACT
This paper presents a novel approach to segmenting typed Pashto text images into individual characters, addressing a critical challenge in Optical Character Recognition (OCR) for this language. Pashto, a right-to-left, highly cursive language similar to Arabic and Urdu, poses unique segmentation difficulties because the shape of each character varies with its position in a word. The segmentation of Pashto characters remains an underdeveloped area in language processing, which significantly hinders OCR performance. To tackle this, an image database of isolated Pashto characters was created. Pashto text samples were generated in Microsoft Word, and the images were saved in Bitmap (BMP) format for processing. The images were preprocessed by converting them to binary form and removing noise, and were then segmented into their constituent characters by the proposed algorithm, which measures pixel strength to split words into characters. The algorithm achieved a segmentation accuracy of 84.6%, verified through manual analysis, although it also generated some new and unwanted characters (garbage). This work contributes a significant step toward improving OCR for the Pashto language by offering a method for character segmentation, which is fundamental to the development of an accurate Pashto OCR system.
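As an illustration only, the pipeline described in the abstract (binarization followed by splitting on pixel strength) might be sketched as below. This is not the authors' algorithm: it assumes that "pixel strength" means the number of ink pixels in each image column, and the function names, the grayscale threshold of 128, and the `max_gap_strength` parameter are all hypothetical choices made for this sketch.

```python
import numpy as np


def binarize(gray, threshold=128):
    """Return a boolean array where True marks ink (dark) pixels.

    The threshold value is an assumption, not taken from the paper.
    """
    return np.asarray(gray) < threshold


def column_strength(ink):
    """Count the ink pixels in each column of a binary image."""
    return ink.sum(axis=0)


def segment_columns(ink, max_gap_strength=0):
    """Split the image into runs of columns with strength above the limit.

    Returns (start, end) column pairs, end exclusive, ordered right to
    left to follow the Pashto reading direction.
    """
    strong = column_strength(ink) > max_gap_strength
    segments = []
    start = None
    for x, is_strong in enumerate(strong):
        if is_strong and start is None:
            start = x  # a new run of strong columns begins
        elif not is_strong and start is not None:
            segments.append((start, x))  # the run ended at a weak column
            start = None
    if start is not None:
        segments.append((start, len(strong)))
    return segments[::-1]
```

With `max_gap_strength=0` the sketch cuts only at completely blank columns, which separates unconnected glyphs. Raising the limit also cuts through thin connecting strokes, which is one plausible way a cursive word could be split further, and also one plausible source of the unwanted fragments mentioned above.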