Introduction
============
Product: Urdu OCR
Applicant's Name: Sawood Alam
Current Status of Urdu OCR: No working OCR for Urdu exists, not even a proprietary one. Several students have worked in this direction, but none ended up with a usable product.
Target: Development of an Urdu OCR capable of recognizing at least character-based Urdu fonts (if not ligature-based ones) with acceptable accuracy after proper training.
Feasibility
===========
Writing an OCR from scratch with custom algorithms is not feasible within a span of only six months or so. Hence, building it on top of an existing open source OCR (e.g. ) may be one option.
Tesseract (and other such tools) has some issues, particularly with right-to-left (RTL) and ligature-based (joined-character) languages, because their pointer flow is left-to-right and their character mapping is one-to-one. These issues arise mainly because the developers of such tools are mostly from European countries, where scripts are based on isolated characters and text flows left-to-right.
However, I have solutions to all these issues, which I will discuss briefly in the next section.
I also found people attempting to train Tesseract on RTL languages such as Arabic and Hebrew. In the case of Arabic in particular, only a few characters could be recognized. I expect that the results will improve after proper treatment and manipulation of the image file.
Issues and Possible Solutions
=============================
Issue 1: Right-to-Left flow of text.
Solution: Instead of changing the code to process the image from right to left, keep it intact, because the left-to-right assumption is embedded in several modules of the code. Due to this dependency, one might need to alter almost the entire codebase. A further side effect is that the tool would no longer be generic but specific to RTL languages. An alternative is to flip the image horizontally, supply the flipped image as the input file, and let the parser process it as if it were an LTR language.
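As a minimal sketch of the flip described above, the following assumes the page has already been loaded as a list of pixel rows; the function name and the tiny sample image are only illustrative. An imaging library would normally do this step (Pillow's ImageOps.mirror, for instance, performs the same left-to-right mirroring).

```python
def flip_horizontal(image):
    """Mirror a raster image left-to-right.

    `image` is a list of rows, each row a list of pixel values.
    A new image is returned; the input is left unchanged.
    """
    return [list(reversed(row)) for row in image]


# A 2x3 sample: the ink (1) sits on the right edge, where RTL text starts.
sample = [
    [0, 1, 1],
    [0, 0, 1],
]
flipped = flip_horizontal(sample)  # the ink now sits on the left edge
```

After the flip, an engine that scans left to right meets the symbols in their logical reading order. The glyphs themselves are mirrored, so training would have to use flipped samples as well.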
Issue 2: Character isolation from the joined ligature.
Solution: Do not isolate characters within a joined ligature. Instead, treat each ligature as a single symbol and map it to the corresponding sequence of characters in the mapping file. The drawback of this mechanism is performance degradation, as the mapping table will be several times larger than a one-to-one character mapping table and a linear search will take a little more time.
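To illustrate the idea of one symbol per ligature, here is a rough sketch that splits words into ligatures and assigns each distinct ligature a symbol id. The set of non-joining letters is an assumption and not exhaustive (hamza, for example, is ignored), and the function names are hypothetical; Tesseract's actual mapping file format is not modelled here.

```python
# Letters that do not connect to the letter following them.
# Illustrative assumption only; a real table would need to be complete.
NON_JOINERS = set("اآدڈذرڑزژوے")


def split_ligatures(word):
    """Split a word into ligatures: a ligature ends after a non-joiner."""
    ligatures, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINERS:
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


def build_mapping(words):
    """Give every distinct ligature a symbol id, in first-seen order."""
    table = {}
    for word in words:
        for ligature in split_ligatures(word):
            table.setdefault(ligature, len(table))
    return table


mapping = build_mapping(["پاکستان", "اردو"])
# Inverse table: recognized symbol id -> character sequence to output.
symbol_to_text = {symbol: text for text, symbol in mapping.items()}
```

Keeping the table in a hash map, as above, makes each lookup roughly constant-time, which may soften the linear-search concern even when the table grows large.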
Issue 3: Ligature isolation within a line. Several ligatures overlap (as seen vertically) even though they are not connected.
Solution: Most simple image manipulations are done with horizontal and vertical lines (forming rectangles), whereas Urdu and similar scripts demand a parallelogram, with horizontal parallel lines and slanted parallel side lines, to encapsulate a ligature. To tackle this issue we should not alter the code (that would be too complex and would lose generality). Rather, we can apply a slight rotation and a little shear to deform the image so that ligatures can be enclosed in rectangles.
Preprocessing: As per the above discussion, preprocessing of input images (a horizontal flip, a slight rotation, and a little shear) is needed to make them work with Tesseract for Urdu, Arabic, Persian, and similar languages.
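The shear and rotation steps could be sketched as a single affine resampling pass. This is only an illustration under stated assumptions: the image is a list of pixel rows, sampling is nearest-neighbour, and the function name, parameter names, and sign conventions are hypothetical. Suitable shear and angle values would have to be found experimentally for each font.

```python
import math


def shear_and_rotate(image, shear=0.0, angle_deg=0.0, background=0):
    """Apply a horizontal shear, then a rotation about the image centre.

    `image` is a list of equal-length rows of pixel values. The output has
    the same size; pixels that map outside the source get `background`.
    Each output pixel is filled by applying the inverse transform.
    """
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Undo the rotation first, then undo the shear.
            u = dx * cos_a + dy * sin_a
            v = -dx * sin_a + dy * cos_a
            sx = int(math.floor(u - shear * v + cx + 0.5))
            sy = int(math.floor(v + cy + 0.5))
            inside = 0 <= sx < width and 0 <= sy < height
            row.append(image[sy][sx] if inside else background)
        result.append(row)
    return result


# A slanted stroke becomes upright, so it fits inside a plain rectangle.
slanted = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
upright = shear_and_rotate(slanted, shear=-1.0)
```

In practice an imaging library's affine transform would replace the pixel loop; the sketch only shows that the deformation happens before the image reaches the OCR engine, so the engine itself stays untouched.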
Timeline
========
Month 1: Thorough study of the Tesseract codebase, and training it on isolated Urdu alphabet characters and numerals in different fonts, using both original and flipped images, and observing the results.
Month 2: Initial training on small ligatures (made of 2-3 characters) in the "Tahoma" font (a character-based TrueType font) with several preprocessing configurations, while maintaining a table of performance and accuracy.
Month 3: Writing a script to generate all possible ligatures (under the complex rules of the Persian script). This will make mapping and training an easy job, although it will degrade performance to a large extent (which is not the primary concern). Later this ligature set may be pruned using a dictionary word list to improve performance.
Month 4: Writing an input image preprocessing script based on the above observations. It may ask for the font name and other such parameters in order to preprocess the image properly.
Month 5: Handling the "number within text" issue. In the Persian script (Urdu, Persian, Arabic, and similar languages), numbers run left-to-right within right-to-left text. Extra care is needed to identify such runs (perhaps through a regular expression) and handle them properly. Otherwise 12345 will become 54321.
Month 6: Bundling everything into a package, documenting the work to help further enhancements, and listing the possible improvements and their requirements (in terms of people, time, and other resources).
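The "number within text" handling planned for Month 5 might look roughly like the sketch below, which reverses each numeric run in the recognized text. The function name, the choice of digit ranges (ASCII, Arabic-Indic, and Extended Arabic-Indic), and the treatment of "." and "," as separators inside a number are all assumptions made for illustration.

```python
import re

# ASCII digits, Arabic-Indic digits, and Extended Arabic-Indic digits.
DIGITS = "0-9\u0660-\u0669\u06F0-\u06F9"
# A numeric run: digits, optionally joined by "." or "," separators.
NUMBER_RUN = re.compile("[{0}]+(?:[.,][{0}]+)*".format(DIGITS))


def restore_digit_runs(text):
    """Reverse every numeric run in `text`, leaving other characters alone."""
    return NUMBER_RUN.sub(lambda match: match.group(0)[::-1], text)


fixed = restore_digit_runs("page 54321, total 5.21")
```

Reversing the whole run, separators included, keeps a value such as 12.5 intact; reversing only the individual digit groups would not.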
Deliverables
============
At the end of the fellowship period, an Urdu OCR (that may also be capable of learning Persian and Arabic) for at least character-based, horizontal-flow modern fonts (such as Tahoma, Nafees Web Naskh, Alqalam, etc.) with around 90% or better accuracy.
Comprehensive documentation on how to enhance it so that it becomes capable of processing traditional, complex Nastaleeq fonts (in which most books are printed), making it really useful for library digitization.
References
==========
1: http://code.google.com/p/tesseract-ocr/
2: http://en.wikipedia.org/wiki/Tesseract_(software)
3: http://sourceforge.net/projects/tesseract-ocr
4: http://crulp.org/
5: http://urduweb.org/mehfil
6: http://www.daniweb.com/forums/thread82641.html
7: http://www.ocr.org.uk/qualifications/assetlanguages/urdu/
8: http://www.waset.org/pwaset/v23/v23-93.pdf
9: http://pubs.iupr.org/DATA/2006-IUPR-24Nov_1031.pdf
10: http://sites.google.com/site/ocropus/languages/urdu