محب علوی
Moderator
I consider the scholarly robbery of the CRULP people's papers to be a downright virtuous deed. Just look at what they are researching and judge for yourself what state they are in.
URDU NASTALEEQ OPTICAL CHARACTER RECOGNITION (OCR)
Optical Character Recognition refers to the branch of computer science that involves reading text from paper and translating the images into a form that the computer can manipulate (for example, into Unicode). The Urdu Nastaleeq OCR is ligature-based and processes Nastaleeq script at a fixed font size of 36.
The Urdu Nastaleeq OCR reads printed text from a scanner, automatically finds and extracts information from the monochrome bitmap image, interprets this information, and writes it to a Unicode text file where it can be edited. All of this is handled with a minimum of manual intervention. The system is able to save time and perform the desired tasks efficiently. To implement the software, we used Visual Studio .NET C++ and the HTK toolkit, which is based on the Hidden Markov Model (HMM) technique. For each ligature, a Markov model is generated: a finite state machine in which the transition from one state to another is governed by probabilities.
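To make the "one Markov model per ligature" idea concrete, here is a minimal sketch in Python. It is not the CRULP implementation and does not use the HTK API: the class name, the ligature labels `lig_A` and `lig_B`, and all probabilities are made up for illustration, and the observations are simplified to discrete symbols (0 or 1) instead of real feature vectors. It scores an observation sequence against each model with the forward algorithm and picks the most likely ligature.

```python
from typing import Dict, List, Sequence


class DiscreteHMM:
    """A tiny discrete-observation HMM (hypothetical, for illustration only)."""

    def __init__(self, start: List[float], trans: List[List[float]],
                 emit: List[List[float]]) -> None:
        self.start = start  # start[s]: probability of starting in state s
        self.trans = trans  # trans[p][s]: probability of moving from p to s
        self.emit = emit    # emit[s][o]: probability of symbol o in state s

    def likelihood(self, observations: Sequence[int]) -> float:
        """Forward algorithm: P(observations | this model)."""
        n = len(self.start)
        alpha = [self.start[s] * self.emit[s][observations[0]]
                 for s in range(n)]
        for obs in observations[1:]:
            alpha = [
                sum(alpha[p] * self.trans[p][s] for p in range(n))
                * self.emit[s][obs]
                for s in range(n)
            ]
        return sum(alpha)


def classify(models: Dict[str, DiscreteHMM],
             observations: Sequence[int]) -> str:
    """Return the name of the model that best explains the observations."""
    return max(models, key=lambda name: models[name].likelihood(observations))


# Two made-up left-to-right models: lig_A tends to emit 0s and then 1s,
# lig_B tends to emit 1s and then 0s.
LEFT_TO_RIGHT = [[0.6, 0.4], [0.0, 1.0]]
MODELS = {
    "lig_A": DiscreteHMM([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.2, 0.8]]),
    "lig_B": DiscreteHMM([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.8, 0.2]]),
}

print(classify(MODELS, [0, 0, 1, 1]))
```

A real recognizer would train the probabilities from labelled ligature images instead of writing them by hand, and would normally work in log space to avoid underflow on long sequences.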
To train the HMMs, we use DCT (Discrete Cosine Transform) values calculated from the ligatures extracted from the image. The main benefit of using an HMM is its ability to predict the next observation or, more generally, a continuation of the sequence of observations. The beauty of the HMM is that it copes well with noise and variations in patterns, which is a core issue in pattern recognition. The software's interface is built using MFC. Our software generates an editable form of printed text with great efficiency and accuracy.
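For anyone wondering what "DCT values calculated from the ligatures" could look like in practice, below is a rough Python sketch. The abstract does not say how the image is windowed or which coefficients are kept, so the function names and the choice of keeping the top-left (low-frequency) block of an orthonormal 2D DCT-II are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence


def dct_1d(values: Sequence[float]) -> List[float]:
    """Orthonormal 1D DCT-II of a sequence."""
    n = len(values)
    out = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        total = sum(v * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                    for i, v in enumerate(values))
        out.append(scale * total)
    return out


def dct_2d(bitmap: Sequence[Sequence[float]]) -> List[List[float]]:
    """2D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in bitmap]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [u][v] order


def dct_features(bitmap: Sequence[Sequence[float]],
                 keep: int = 2) -> List[float]:
    """Flatten the keep x keep low-frequency corner into a feature vector.

    Which coefficients to keep is an assumption, not taken from the abstract.
    """
    coeffs = dct_2d(bitmap)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


# A made-up 4x4 monochrome patch (1 = ink, 0 = background).
patch = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
features = dct_features(patch, keep=2)
print(len(features))
```

The point of keeping only a few low-frequency coefficients is that they summarize the overall shape of the patch while discarding fine detail, which is one plausible reason such features tolerate scanning noise.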