Ms Tanveer Fatima*
GHOST CHARACTERS THEORY for orthographic representation of the Arabic Block
There are many constraints on the spread of knowledge, the most important of which is the language and communication barrier. About 45% of the volume of knowledge is in English, yet most people cannot understand English: with a literacy rate of 35%, only 2% of the literate can read and understand English. This is a major obstacle to reaching the unreached. The solution to this problem is localization, i.e. all I.T. products and other computer operations should be converted into the user’s native language. Recent history also shows that localization is now a global business.
Mark Davis (2003) of Unicode states in his Unicode research paper that many people in the software industry don’t realize how important it is to localize products for different languages around the world. While English is a major language, it accounts for only around 30% of the world Gross Domestic Product (GDP), and is likely to account for less in the future. Neglecting other languages means ignoring quite significant potential markets.
His short article provides a picture of the economic significance of different languages, with a breakdown of the percentages of world GDP by language. It shows not only the current breakdown but also data for the years 1975 to 2002, to show modern trends. The most notable features are the steady rise of Chinese and the slow relative decline of Japanese and most European languages. Korean and Indic languages also show growth over that period, though slower than Chinese.
The GDP values are expressed in terms of Purchasing Power Parity (PPP), which accounts for price differences between countries.
The Other field is the accumulated total for languages for which there is data, but where each has less than 0.9% of the world GDP. While each language separately corresponds to a small percentage, their total is significant (about the same as Chinese). In general, the data is less reliable for smaller languages, so their order should not be taken as significant.
Localization requires many technical approaches, e.g. translation, but the problem here is that only 10% of computer-literate users have ever gone beyond word processing or touched the other functions included in MS Office or other applications. Although MS VOLT has become a necessity for all the Arabic-based languages, for all their present and future character needs, fonts are never considered a tool of localization. Considering localization in practice for a while, another problem raises its head: orthographic or script processing on the computer in relation to the fonts of the languages concerned. As far as the Arabic-based scripts of these languages are concerned, there is an ever-growing need for characters in the Arabic script, but there is no room left in the various code pages of the computer standards. Unicode allotted the 06 page for this purpose, then the 07 page, and is now entering the 08 page. Space is a big problem for the ever-growing characters of the Arabic-based languages. But every problem has its solution. How? It is possible only on a new basis, i.e. the Ghost Character Theory: only 44 Ghost characters can do the whole job, with no need to find extra space for new characters. There are a few common items, or fractions of characters/letters, in any script.
J. Kew et al. (2003) state that the true structure of the script is better understood as a small set of underlying "skeleton" letter forms, to which patterns of dots ("nuqtas") are added to differentiate the sounds and letters needed to write a particular language.
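The skeleton-plus-nuqta view can be sketched in Python as a small lookup table. This is only an illustration: the table `SKELETON_TABLE` and the function `analyse` are hypothetical names, the table covers just four letters that share the BEH skeleton, and the dot counts and positions are plain labels rather than encoded characters.

```python
import unicodedata

# Skeleton (ghost) letter shared by every entry below.
DOTLESS_BEH = "\u066E"  # ARABIC LETTER DOTLESS BEH

# Hypothetical analysis table: letter -> (skeleton, dot count, dot position).
SKELETON_TABLE = {
    "\u0628": (DOTLESS_BEH, 1, "below"),  # BEH
    "\u062A": (DOTLESS_BEH, 2, "above"),  # TEH
    "\u062B": (DOTLESS_BEH, 3, "above"),  # THEH
    "\u067E": (DOTLESS_BEH, 3, "below"),  # PEH
}

def analyse(letter):
    """Return (letter name, skeleton name, dot count, dot position)."""
    skeleton, count, position = SKELETON_TABLE[letter]
    return unicodedata.name(letter), unicodedata.name(skeleton), count, position

for letter in SKELETON_TABLE:
    print(analyse(letter))
```

Four distinct letters reduce to one skeleton plus four dot patterns, which is the economy the theory relies on.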
It will have many benefits, e.g. universal coverage of Arabic-script languages within the block, limiting the explosion of the block, and easing data entry operations, especially on limited devices. This idea was demonstrated by Dr. Attash Durrani after the ASCII code plate for Urdu was devised in 1999. Full atomization was presented in its 2nd version.
It may introduce normalization issues in the code development process. The normalization transformations are not transient in nature, but they are present nonetheless. A user is expected to type in a hybrid of both forms. A character may fall into one of three cases: (i) collapsed case (characters having the diacritic (nuqta)), (ii) spread case (characters without the diacritic (nuqta)), (iii) hybrid case (a mixture of the collapsed and spread cases).
Arabic script was historically a “dotless” script. By this we mean that a single shape may have different sounds depending on the word. Here is an example:
·In the figure above, a native Arabic speaker is able to comprehend the meaning of the text from context and his/her vocabulary. However, anyone less familiar with the Arabic language will not be able to understand the correct meaning of the text, because of a limited vocabulary and an inability to follow the context. The main reason such a text is hard to read is that the sound of a character depends heavily on the context and content of the text.
·To overcome this problem, a Muslim Caliph introduced nuqtas. The sole purpose of the dot was to sit on a shape (which we call the basic or ghost shape, or kashti) and to depict its phonetic status. Below is the “dotted” version of the above-mentioned text.
Arabic phrase with dots. Sound of characters is not to be “guessed”
·Now, after the placement of dots, even a non-native reader can easily understand the text without any trial and error, because the dots sufficiently depict the exact sound of each character.
Later on, when new languages adopted the Arabic script as their script of choice, a new problem arose: unavailable sounds (phonemes). For example, Urdu has a sound exactly equal to the sound of “p” in English, but the Arabic language has no such sound and no means to depict it. Again the nuqta comes to the rescue: taking the basic shape of bey and placing three dots beneath it solved this problem. Here is how it looks:
These characters and the dots were included in the ASCII code plate of the National Language Authority.
This is the basis for the present and future of the Urdu alphabet as well as of other Pakistani languages. Dr. Durrani listed pages from Amir Khusro’s “Khaliq Bari”, Maulvi Abdul Haq’s dictionary and the pedagogical needs of Urdu primers of NWFP.
The reasons for encoding the new letterforms as units, and not encoding combining modifier forms separately, are historical, arising from the evolution of the Unicode Standard, and are simple: while vowels and punctuation marks have been encoded as combining marks, the consonantal base letters have consistently been encoded in Unicode as units. Changing that practice would open the door to multiple representations for the same letters.
Some new additions were also made to make it simpler.
ARABIC TRIPLE NUQTA ABOVE = ARABIC DOUBLE NUQTA ABOVE + ARABIC SINGLE NUQTA ABOVE
ARABIC TRIPLE INVERTED NUQTA ABOVE = ARABIC SINGLE NUQTA ABOVE + ARABIC DOUBLE NUQTA ABOVE
SINDHI QUADRUPLE NUQTA ABOVE = ARABIC DOUBLE NUQTA ABOVE + ARABIC DOUBLE NUQTA ABOVE
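The three equations above can be read as a small composition rule. A minimal Python sketch follows, assuming each mark is modelled as a tuple of simpler marks in the order the equation lists them; the names `DOTS`, `dot_count` and the tuple layout are illustrative assumptions, not part of the proposal.

```python
# Dots contributed by the two primitive marks.
DOTS = {"SINGLE": 1, "DOUBLE": 2}

# Composite marks, written in the same order as the equations above.
TRIPLE = ("DOUBLE", "SINGLE")
TRIPLE_INVERTED = ("SINGLE", "DOUBLE")
QUADRUPLE = ("DOUBLE", "DOUBLE")

def dot_count(mark):
    """Total number of dots in a composite mark."""
    return sum(DOTS[part] for part in mark)

for name, mark in [("TRIPLE", TRIPLE),
                   ("TRIPLE INVERTED", TRIPLE_INVERTED),
                   ("QUADRUPLE", QUADRUPLE)]:
    print(name, mark, dot_count(mark))
```

The triple and inverted triple marks have the same dot count and differ only in the order of their parts, which is why they need separate names.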
Dr. Durrani’s Ghost Characters were included in the international character standard, Unicode, but only partially: the dotless character set was completed by including dotless Bey, Fey and Qaf in Unicode Version 3.1. But there was no room for dots, and no Unicode numbers were allotted to the dots and other atoms. The proposal requested the addition of 22 new combining characters to the Arabic block of the Unicode Standard, which would make it possible to typeset almost all regional languages written in the Arabic script:
There were different constraints during the development of this project, i.e. feasibility/development constraints, financial constraints, resource constraints, personal constraints and system (hardware and software) constraints.
According to some researchers, the introduction of separate nuqta diacritics for Arabic would be problematic in the development of Unicode. They could not be added to the standard normalized forms because of the stability requirements, and having separate nuqta diacritics without normalization would be a security problem for which the technical committee has not found a solution.
These characters have an independent existence in the script and are often needed in the generation of electronic texts such as pedagogical material. Unicode had already added many entries from the ASCII code plate notification of the NLA, including the notion of ghost characters, thus completing the set of ghost characters of the Arabic script. It is now complementary to add support for these nuqta characters alongside the Ghost Characters in the code blocks, to realize the real benefit of the set.
Nuqtas are also present in the Quran as separate characters, such as 2, 3 and 4 nuqtas above used on their own. In these circumstances, the need for these nuqta marks as separate characters is of immense importance. Another rationale was also illustrated by Dr. Durrani in the following examples, where the nuqtas are red in color:
The project was rejected in 2003 by the Unicode Technical Committee (UTC) on the grounds that the addition of the combining nuqta characters would change the encoding model for Arabic. The proposal was not intended to change the system or to introduce a parallel or duplicate encoding system in the Arabic block. It is just the addition of these nuqta characters along with the proposed properties, and if it does introduce a parallel system, that is an additional benefit, yielding self-sufficiency of the Arabic script.
The matter was taken up again later, and it was accepted that the change would constitute an untenable destabilization of the Unicode Standard. It was precisely for that reason that the UTC was forced to reject the proposal, even though the committee as a whole agreed that a decomposed representation for Arabic script would have been preferable had it been done from the outset, before stability became a limiting factor.
Unicode could restrict the usage of the combining nuqtas in such a way that letters that already exist in their own right cannot be encoded as sequences. Thus, the sequence <DOTLESS BEY, NUQTA BELOW> would be defined not to combine and form a letter looking like BEY. No new ambiguities are therefore introduced; any given Arabic letter still has only one Unicode representation. There is no impact whatsoever on normalization.
This approach also requires implementers to deal with a specific “exclusion list” of apparently typical sequences that must not be rendered “normally”, nor interpreted as if they meant what they “ought” to mean. This would represent an unwelcome burden on every implementation that wants to handle Arabic script in any way.
The answer to this was that the ghost characters theory already exists in Unicode on different pages and there was no restriction on the usage of nuqtas, so no ambiguities would be introduced. It was suggested that the 08 page be given to this new set, i.e. that nuqtas are separate characters. The examples on page 06 were like:
To achieve this, it is proposed that rather than adding the decompositions of the current precomposed Arabic letters to the UCD as canonical decompositions (which seems natural, but contravenes published Unicode stability policy), a new property, which could be named “required composition”, should be defined. The existing precomposed Arabic letters would have their “decomposed forms” defined here. The intention is that the required composition property gives compositions that must always be used during normalization, even in NFD.
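The intended behaviour of such a property can be sketched in Python. This is a sketch under stated assumptions, not the mechanism of any Unicode release: the nuqtas are represented by private-use placeholder code points, and `REQUIRED_COMPOSITIONS` and `normalize_with_required` are hypothetical names. The point shown is that the listed skeleton + nuqta sequences are always replaced by the existing letter, whichever normalization form is requested.

```python
import unicodedata

DOTLESS_BEH = "\u066E"    # ARABIC LETTER DOTLESS BEH (an existing character)
NUQTA_1_BELOW = "\uE001"  # placeholder (private use) for a single nuqta below
NUQTA_3_BELOW = "\uE002"  # placeholder (private use) for a triple nuqta below
NUQTA_4_BELOW = "\uE003"  # placeholder (private use) for a quadruple nuqta below

# Hypothetical "required composition" data: sequence -> existing letter.
REQUIRED_COMPOSITIONS = {
    DOTLESS_BEH + NUQTA_1_BELOW: "\u0628",  # ARABIC LETTER BEH
    DOTLESS_BEH + NUQTA_3_BELOW: "\u067E",  # ARABIC LETTER PEH
}

def normalize_with_required(form, text):
    """Apply a standard normalization form, then the required compositions."""
    text = unicodedata.normalize(form, text)
    for sequence, letter in REQUIRED_COMPOSITIONS.items():
        text = text.replace(sequence, letter)
    return text

spread = DOTLESS_BEH + NUQTA_1_BELOW
print(normalize_with_required("NFD", spread) == "\u0628")
print(normalize_with_required("NFC", spread) == "\u0628")
```

A sequence that has no entry in the table, such as `DOTLESS_BEH + NUQTA_4_BELOW` here, passes through unchanged, so only the listed letters keep a single representation.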
The Unicode Standard allows for the dynamic composition of accented forms and Hangul syllables. Combining characters used to create composite forms are productive. Because the process of character composition is open-ended, new forms with modifying marks may be created from a combination of base characters followed by combining characters. For example, the diaeresis “¨” may be combined with all vowels and a number of consonants in languages using the Latin script and several other scripts.
There are many ways to categorize code points. This illustrates some of the categorizations and basic terminology used in the Unicode Standard. Not all assigned code points represent abstract characters; only Graphic, Format, Control and Private-use do. Surrogates and Noncharacters are assigned code points but are not assigned to abstract characters. Reserved code points are assignable: any may be assigned in a future version of the standard. The General Category provides a finer breakdown. Combining characters placed above a base character stack vertically, following the order of the character codes that come after the base character. For combining characters placed below a base character, the situation is reversed, with the combining characters starting from the base character and stacking downward.
Another example of multiple combining characters above the base character can be found in Thai, where a consonant letter can have above it one of the vowels U+0E34 through U+0E37 and, above that, one of four tone marks U+0E48 through U+0E4B. The order of character codes that produces this graphic display is base consonant character + vowel character + tone mark character.
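The Thai ordering can be inspected with Python's standard `unicodedata` module. The syllable chosen here (KO KAI + SARA I + MAI EK) is just one illustrative instance of the base + vowel + tone mark pattern.

```python
import unicodedata

# Logical (storage) order: base consonant, vowel above, tone mark above that.
syllable = "\u0E01\u0E34\u0E48"

for ch in syllable:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Normalization leaves this order untouched.
print(unicodedata.normalize("NFC", syllable) == syllable)
```

The base letter reports general category `Lo` (other letter), and both marks report `Mn` (nonspacing mark).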
Ligated base characters with multiple combining marks do not commonly occur in most scripts. However, in some scripts, such as Arabic, this situation occurs quite often when vowel marks are used. It arises because of the large number of ligatures in Arabic, where each element of a ligature is a consonant, which in turn can have a vowel mark attached to it. Ligatures can even occur with three or more characters merging; vowel marks may be attached to each part.
In cases involving two or more sequences considered to be equivalent, the Unicode Standard does not prescribe one particular sequence as being the correct one; instead, each sequence is merely equivalent to the others. The figure illustrates the two major forms of equivalent sequences formally defined by the Unicode Standard. In the first example, the sequences are canonically equivalent. Both sequences should display and be interpreted the same way. The second and third examples illustrate different compatibility sequences. Compatibility-equivalent sequences may have format differences in display and may be interpreted differently in some contexts.
A key part of normalization is to provide a unique canonical order for visually nondistinct sequences of combining characters. The figure shows the effect of canonical ordering for multiple combining marks applied to the same base character.
When combining characters do not interact typographically, the relative ordering of contiguous combining marks cannot result in any visual distinction and thus is insignificant.
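These equivalences can be checked with Python's standard `unicodedata.normalize`. The Latin examples below are illustrative choices, not taken from the figures referred to above.

```python
import unicodedata

def nfd(text):
    return unicodedata.normalize("NFD", text)

# Canonical equivalence: precomposed and decomposed spellings of one letter.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
print(nfd(precomposed) == nfd(decomposed))

# Canonical ordering: a dot below and a dot above do not interact, so either
# typing order normalizes to the same sequence.
below_then_above = "a\u0323\u0307"
above_then_below = "a\u0307\u0323"
print(nfd(below_then_above) == nfd(above_then_below))

# Compatibility equivalence: the "fi" ligature matches "fi" only under the
# compatibility form NFKD, not under the canonical form NFD.
ligature = "\uFB01"
print(nfd(ligature) == "fi")
print(unicodedata.normalize("NFKD", ligature) == "fi")
```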
Dr. Durrani then suggested that the 08 page, or another place, might be given to this new decomposed set, so that there would be no duplication or normalization problem. Later this place, i.e. 08, was allotted to Dr. Durrani’s proposal.
The question of appropriate combining classes for the nuqtas requires some attention. Given that the nuqtas are closely associated with the base letter, it seems natural to assign them a low combining class value; this would keep them close to the base letter in NFD, which could benefit analytical processes and rendering systems. It could also tend to help the efficiency of the NFC/NFD algorithms, which need to recombine base + nuqta sequences.
There is already a combining class, 7, used for “nuktas” in Indic scripts; these are consonant modifiers that go below the basic consonant, and are thus very similar to the proposed Arabic nuqtas. It was therefore suggested that this same combining class value be used for the nuqtas that are positioned below the base or ghost letter, and 6 (8 is already in use) for those that go above. Nuqta-like marks that actually attach to the base letter (ring, as seen on U+067C and others; stroke through, as seen on U+06C5) could have combining class 1 (also used for combining overlays). One drawback of these values is that they differ from the class of the combining HAMZA marks that are already in Unicode. U+0681 shows that the HAMZA form has been used as a nuqta-like mark to create a new letter in at least one instance, in addition to its conventional use on ALEF, WAW, and YEH. It therefore seems unfortunate for it not to share the combining class value of the other nuqtas.
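The existing combining classes mentioned above can be read from the Unicode Character Database with Python's `unicodedata.combining`. In the sketch below, the dictionary `PROPOSED_CLASSES` simply restates the values suggested in the text and is not part of any published standard.

```python
import unicodedata

# Existing characters referred to in the discussion.
existing_marks = [
    "\u093C",  # Devanagari nukta (Indic nukta class)
    "\u0336",  # a combining overlay
    "\u0654",  # Arabic hamza above
    "\u0655",  # Arabic hamza below
]
for ch in existing_marks:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# Values suggested in the text for the proposed Arabic nuqtas (assumption:
# restated here only for comparison).
PROPOSED_CLASSES = {"below": 7, "above": 6, "attached": 1}

hamza_classes = {unicodedata.combining("\u0654"), unicodedata.combining("\u0655")}
print(hamza_classes.isdisjoint(PROPOSED_CLASSES.values()))
```

The last line shows the mismatch the paragraph describes: the hamza marks carry the generic above and below classes, not the low values suggested for the nuqtas.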
Here there was no need for more discussion, as nuqtas are now declared to be separate characters.
Jonathan Kew (2003) has also noted a variety of letters that are not represented in Unicode 4.0. Some of the more “interesting” letters are highlighted. Note that in many cases, several different writing conventions for the same language are mentioned. Even if some characters are eventually dropped during orthographic standardization or reform of these languages, the fact that they have traditionally been used by some writers means that they need to be taken into consideration; otherwise existing texts cannot be encoded. For example, in Songhoy: BEY skeleton with two dots vertically above the right end; NOON or BEY skeleton (ambiguous, because the chart shows a linked initial form) with a dot above and two dots below.
HAH with two dots above; AIN with two dots above. Songhoy language.
After all this effort the success story begins: Dr. Durrani’s proposed “Ghost Characters Theory” was accepted, and all the proposed characters were given the 08 page in Unicode. The following codes were assigned to the ghost characters of the nuqtas.
According to the proposal, 22 additions were requested. The names were taken and aligned with the notion of being spacing characters. The names were also updated to the usual style for such characters, beginning with “Arabic” for the script, and then annotated where appropriate for the particular language.
·0880 ARABIC SINGLE NUQTA ABOVE
·0881 ARABIC SINGLE NUQTA BELOW
·0882 ARABIC DOUBLE NUQTA ABOVE
·0883 ARABIC DOUBLE NUQTA BELOW
·0884 ARABIC TRIPLE NUQTA ABOVE
·0885 ARABIC TRIPLE NUQTA BELOW
·0886 ARABIC TRIPLE INVERTED NUQTA ABOVE
·0887 ARABIC TRIPLE INVERTED NUQTA BELOW
·0888 ARABIC QUADRUPLE NUQTA ABOVE
·*Sindhi
·0889 ARABIC QUADRUPLE NUQTA BELOW
·*Sindhi
·088A ARABIC DOUBLE DANDA ABOVE
·*Sindhi
·088B ARABIC DOUBLE DANDA BELOW
·088C ARABIC DOUBLE NUQTA VERTICAL ABOVE
·*Sindhi
·088D ARABIC DOUBLE NUQTA VERTICAL BELOW
·*Sindhi
·088E ARABIC SINGLE KASHIDA ABOVE
·*Urdu
·088F ARABIC SINGLE KASHIDA BELOW
·*Urdu
·0890 ARABIC DOUBLE KASHIDA ABOVE
·*Urdu
·0891 ARABIC DOUBLE KASHIDA BELOW
·*Urdu
·0892 ARABIC SINGLE CIRCLE ABOVE
·*Pashto
·0893 ARABIC SINGLE CIRCLE BELOW
·*Pashto
·0894 ARABIC TOTA ABOVE
·*Urdu
·0895 ARABIC TOTA BELOW
·*Urdu
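For tooling purposes, the list above can be held as a simple lookup table. The Python sketch below treats the code points and names purely as data copied from this list (the language annotations are omitted); it does not consult the Unicode Character Database, and `PROPOSED_LIST` and `table` are illustrative names.

```python
# Code points and names exactly as listed above.
PROPOSED_LIST = """\
0880 ARABIC SINGLE NUQTA ABOVE
0881 ARABIC SINGLE NUQTA BELOW
0882 ARABIC DOUBLE NUQTA ABOVE
0883 ARABIC DOUBLE NUQTA BELOW
0884 ARABIC TRIPLE NUQTA ABOVE
0885 ARABIC TRIPLE NUQTA BELOW
0886 ARABIC TRIPLE INVERTED NUQTA ABOVE
0887 ARABIC TRIPLE INVERTED NUQTA BELOW
0888 ARABIC QUADRUPLE NUQTA ABOVE
0889 ARABIC QUADRUPLE NUQTA BELOW
088A ARABIC DOUBLE DANDA ABOVE
088B ARABIC DOUBLE DANDA BELOW
088C ARABIC DOUBLE NUQTA VERTICAL ABOVE
088D ARABIC DOUBLE NUQTA VERTICAL BELOW
088E ARABIC SINGLE KASHIDA ABOVE
088F ARABIC SINGLE KASHIDA BELOW
0890 ARABIC DOUBLE KASHIDA ABOVE
0891 ARABIC DOUBLE KASHIDA BELOW
0892 ARABIC SINGLE CIRCLE ABOVE
0893 ARABIC SINGLE CIRCLE BELOW
0894 ARABIC TOTA ABOVE
0895 ARABIC TOTA BELOW
"""

# Parse "HEX NAME" lines into {code point: name}.
table = {}
for line in PROPOSED_LIST.splitlines():
    code, name = line.split(" ", 1)
    table[int(code, 16)] = name

print(len(table), "entries")
print(table[0x0885])
```

The list is laid out in above/below pairs, so every even code point names an ABOVE mark and the following odd code point names its BELOW counterpart.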
This is a turning point in the history of Arabic fonts. Any character or letter of any language based on the Arabic script can be normalized or formed from only 44 atomized, or Ghost, characters; hence there is no need for a different font for each language. Any Pakistani language font developer or linguist can derive any character having any combination of atoms. Dr. Attash Durrani’s Ghost Theory is a revolutionary step in the field of font development.
Here is an example of the NLA’s Pak Nastaleeq font rendering Arabic, Urdu, Pushto, Persian and Sindhi with a single font; that is the fruit of this work.
Urdu, Arabic and Persian
بِسْمِ اللّٰہِ الرَّحْمٰنِ الرَّحِیْمِ تمام تعریفیں اللّٰہ ربُّ الْعِزَّت کے لیے ہیں جو تمام جہانوں کا ربّ ہے۔ وہی سزا وار حمدو ثنا ہے۔ اسی کے قبضے میں تمام کائنات ہے۔ اے اہلِ ایمان! صلٰوۃ و زکوٰۃ کا اہتمام کرو۔ نبی اکرمؐ۔ حضرت نوحؑ۔ حضرت عثمانؓ۔ قائد اعظمؒ۔ غالبؔ۔ ویکیپدیا پروژهای چندزبانه برای گردآوری دانشنامهای جامع و با محتویات آزاد است. این پروژه (به زبان انگلیسی) از ژانویهٔ ۲۰۰۱ آغاز شده و اکنون ۱۳٬۷۰۰ مقاله به زبان فارسی دارد. شما هم میتوانید مقالات را ویرایش کنید. برای فراگیری و تمرین این کار میتوانید نخست به صفحهٔ راهنما رفته و سپس در گودال ماسهبازی آزمایش کنید. لأن مصطلح الديمقراطية يستخدم لوصف أشكال الحكم و المجتمع الحر بالتناوب، فغالباً ما يُساء فهمه لأن المرء يتوقع عادة أن تعطيه زخارف حكم الأغلبيا كل مزايا المجتمع الحر. إذ في الوقت الذي يمكن فيه أن يكون للمجتمع الديمقراطي حكومتہ ديمقراطتہ فإن وجود حكومتہ ديمقراطيا لا يعني بالضرورة وجود مجتمع ديمقراطي. لقد إكتسب مصطلح الديمقراطيتہ إيحاءً إيجابياً جداً خلال النصف الثاني من القرن العشرين الى حد دفع بالحكام الدكتاتوريين الشموليين للتشدق بدعم "الديمقراطيا" وإجراء
Sindhi and Pushto
There are 44 Ghost Characters in total, of which 22 are kashtis.
References:
Attash Durrani, Dr. Letter to Jonathon (Rationale for Nuqta Proposal).
Attash Durrani, Dr. Letter to Mark Davis.
Attash Durrani, Dr. 2006. Nuqta Marks in Arabic Detailed Character Properties.
Jonathan Kew. 2003. Images of potential extended Arabic characters.
Mark Davis, Kamal Mansour. 2002. Proposal To Amend Arabic Repertoire.
Mark Davis. 2003. Unicode Technical Note #13. pp. 1-5.
--------------------------------------------------------------------------------------------------
* Assistant Informatics Officer, Center of Excellence for Urdu Informatics, National Language Authority, Islamabad, Pakistan