Unicode
In computing, Unicode is an international standard that aims to encode all the scripts used on the planet, so that text in any language can be stored in computer memory, including the symbols of sciences such as mathematics, physics, and so on.
Establishing Unicode is an ambitious undertaking, since it is intended to replace all existing character set encodings, which have limitations that make them problematic for use in multilingual computing systems.
Despite the technical problems it has presented, Unicode has become established as the most complete character set and as the preferred encoding for multilingual software. Many recent standards such as XML, as well as system software such as operating systems, have adopted Unicode to represent text internally.
Origin and development of the standard
The Unicode standard was explicitly intended to overcome the limitations of traditional character encodings such as those defined by the ISO 8859 standard, which was widely used in many countries around the world but suffered from incompatibilities between its different variants. Many traditional character encodings share a common problem: they support two scripts, usually Latin and one local script, but not many languages at once.
Unicode encodes abstract characters, assigning a code point to each of them, not the concrete forms they may take in various fonts. In other words, the Unicode standard leaves the visual representation of the characters (style, size, typeface) to the relevant software, such as a web browser or word processor.
The standard also covers related topics such as character properties, text normalization forms, and display direction (for languages that are read from right to left, such as Arabic and Hebrew).
Scripts covered
Unicode covers almost all the scripts in use today.
Unicode has also added other scripts, such as historical and extinct scripts, for academic purposes.
It also includes other symbols used in mathematics and music.
Miscellaneous
In 1997 Michael Everson proposed that the characters of the fictional Klingon language also be encoded, in Plane 1 of ISO/IEC 10646-2. This proposal was rejected, as was the proposal to include Tolkien's languages.
Encodings
Standard
The Unicode Consortium, based in California, develops the Unicode standard. Any organisation or individual can become a member by paying a subscription fee. Its members include almost all the major software and hardware companies with an interest in the field, such as Apple, Microsoft, IBM, Xerox, HP, Adobe Systems and many others.
The consortium first published The Unicode Standard (ISBN 0321185781) in 1991, and continues to develop the standard based on that initial work. Unicode was developed in cooperation with the International Organization for Standardization (ISO), and it shares its character set with ISO/IEC 10646. Unicode and ISO/IEC 10646 are equivalent as character encodings, but Unicode provides much more information for implementers, covering in depth topics such as bit-level encoding, the Unicode collation algorithms, and rendering. Unicode also lists many character properties, including those needed for BiDi support. The two standards use partly different terminology.
When writing Unicode code points, it is usual to use the form U+xxxx or U+xxxxxx, where xxxx or xxxxxx is the code point in hexadecimal.
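The notation can be illustrated with a minimal Python sketch. The helper names format_code_point and parse_code_point are hypothetical, chosen only for this example; the sketch assumes the convention of at least four hexadecimal digits, with more digits used when the code point needs them.

```python
def format_code_point(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # At least four hexadecimal digits; longer code points simply use more.
    return f"U+{ord(ch):04X}"


def parse_code_point(label: str) -> str:
    """Turn a label such as 'U+0394' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    return chr(int(label[2:], 16))


delta_label = format_code_point("\u0394")      # Greek capital letter Delta
clef_label = format_code_point("\U0001D11E")   # a code point beyond U+FFFF
eng = parse_code_point("U+014B")               # Latin small letter eng
```

Five or six digits appear only for code points above U+FFFF, which is why both the U+xxxx and U+xxxxxx forms are seen.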
Unicode revision history
- 1991 Unicode 1.0
- 1993 Unicode 1.1
- 1996 Unicode 2.0
- 1998 Unicode 2.1
- 1999 Unicode 3.0
- 2001 Unicode 3.1
- 2002 Unicode 3.2
- 2003 Unicode 4.0
- 2005 Unicode 4.1
Storage, transfer and processing
So far Unicode has been presented simply as a mapping of each character used in any of the world's scripts to a unique number, the code point. Storing these numbers when processing text is, however, an entirely different matter. Problems arise from the fact that much software written in the Western world handles only 8-bit encodings, with Unicode support added much later.
The internal logic of traditional 8-bit applications allows only 8 bits per character, making it impossible to use more than 256 code points without special handling. Software engineers have therefore proposed various mechanisms for implementing Unicode. Which implementation a programmer chooses depends on considerations of storage space, source code compatibility, and interoperability with other systems.
Unicode defines two mapping methods:
- the UTF (Unicode Transformation Format) encodings
- the UCS (Universal Character Set) encodings
These encodings include the following main ones:
(The number indicates the number of bits per unit (for UTF encodings) or bytes per unit (for UCS encodings).)
In UTF-32 or UCS-4, one unit suffices for any character; in the other cases, each character may use a variable number of units. UTF-8 provides the de facto standard encoding for interchange of Unicode text, with UTF-16 and UTF-32 occurring mainly in internal processing.
The UCS-2 and UTF-16 encodings specify the Unicode byte order mark (BOM) for use at the beginnings of text files. Some software developers have adopted it for other encodings, including UTF-8, which does not need an indication of byte order. In this case it attempts to mark the file as containing Unicode text. The BOM, code point U+FEFF, has the important property of unambiguity, regardless of the Unicode encoding used. The units FE and FF never appear in UTF-8; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF conveys the Zero-Width No-Break Space (a character with no appearance and no effect other than preventing the formation of ligatures). The same character converted to UTF-8 becomes the byte sequence EF BB BF.
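As a rough illustration of variable-length units and of the byte order mark, the following Python sketch counts the units one character needs in each UTF encoding and encodes U+FEFF in several forms. The function name unit_counts and the variable names are assumptions made for this example; the codec names are those of Python's standard library.

```python
import codecs


def unit_counts(ch: str) -> dict:
    """Count the code units one character needs in each UTF encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    }


ascii_units = unit_counts("A")
greek_units = unit_counts("\u0394")      # Greek capital letter Delta
clef_units = unit_counts("\U0001D11E")   # a code point beyond U+FFFF

# The byte order mark is simply U+FEFF run through each encoding.
BOM = "\ufeff"
bom_utf8 = BOM.encode("utf-8")           # bytes EF BB BF
bom_utf16_be = BOM.encode("utf-16-be")   # bytes FE FF
bom_utf16_le = BOM.encode("utf-16-le")   # bytes FF FE (byte-swapped)
matches_stdlib = bom_utf8 == codecs.BOM_UTF8
```

The explicit "-be" and "-le" codecs are used because Python's plain "utf-16" and "utf-32" codecs prepend a BOM themselves, which would distort the counts.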
See also: Mapping of Unicode characters
Precomposed and composite characters
Unicode includes a mechanism for modifying character shape and so greatly extending the supported glyph repertoire. This covers the use of combining diacritical marks. They get inserted after the main character (one can stack several combining diacritics over the same character). However, for reasons of compatibility, Unicode also includes a large quantity of precomposed characters. So in many cases, users have many ways of encoding the same character. To deal with this, Unicode provides the mechanism of canonical equivalence.
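Canonical equivalence can be observed with Python's standard unicodedata module. This is a minimal sketch: the helper canonically_equal is a hypothetical name, and comparing NFD forms is only one simple way to test equivalence.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute as a single precomposed code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


raw_equal = precomposed == combining                    # code points differ
equivalent = canonically_equal(precomposed, combining)  # same abstract text
recomposed = unicodedata.normalize("NFC", combining)    # back to one code point
```

A plain string comparison sees two different code point sequences, so software that searches or sorts text usually normalizes both sides first.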
A similar situation exists with Hangul. Unicode provides the mechanism for composing Hangul syllables with Hangul Jamo. However, it also provides the precomposed Hangul syllables (11,172 of them).
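The figure of 11,172 follows from the arithmetic used to compose syllables: 19 leading consonants, 21 vowels, and 28 trailing options (27 consonants plus "none"). The sketch below follows the composition arithmetic given in the Unicode Standard; the constant and function names are chosen for this example only.

```python
import unicodedata

# Base code points and counts for Hangul syllable composition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(lead: str, vowel: str, tail: str = "") -> str:
    """Combine conjoining Jamo into one precomposed Hangul syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    # Index 0 means "no trailing consonant", so T_BASE sits one position
    # below the first trailing Jamo.
    t_index = ord(tail) - T_BASE if tail else 0
    tail_ok = (1 <= t_index < T_COUNT) if tail else True
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT and tail_ok):
        raise ValueError("not a modern conjoining Jamo sequence")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


syllable_count = L_COUNT * V_COUNT * T_COUNT              # 19 * 21 * 28
han = compose_hangul("\u1112", "\u1161", "\u11ab")        # U+D55C
same_as_nfc = han == unicodedata.normalize("NFC", "\u1112\u1161\u11ab")
```

Because the mapping is purely arithmetic, the precomposed syllables and the Jamo sequences can be converted into each other without lookup tables.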
The CJK ideographs currently have codes only for their precomposed form. Still, most of those ideographs evidently comprise simpler elements, so in principle Unicode could decompose them just as happens with Hangul. This would greatly reduce the number of required codepoints, while allowing the display of virtually every conceivable ideograph (and so doing away with all problems of the Han unification). A similar idea covers some input methods, such as Cangjie and Wubi. However, attempts to do this for character encoding have stumbled over the fact that ideographs do not decompose as simply or as regularly as they seem to.
Combining marks, like the complex script shaping required to properly render Arabic text and many other scripts, are usually dependent on complex font technologies, like OpenType (by Adobe and Microsoft), Graphite (by SIL International), and AAT (by Apple), by which a font designer includes instructions in a font telling software how to properly output different character sequences. Another method sometimes employed in fixed-width fonts is to place the combining mark's glyph before its own left sidebearing; this method, however, only works for some diacritics and stacking will not occur properly.
As of 2004, most software still cannot reliably handle many features not supported by older font formats, so combining characters generally will not work correctly. Hypothetically, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) are identical in appearance, both giving an e with macron and acute accent, but appearance can vary greatly across software applications.
Also underdots, as needed in Indic Romanization, will often be placed incorrectly or worse. Sample:
- ṃ - ṇ - ḷ
Of course, this is in fact not a weakness in Unicode itself, but only uncovers gaps in rendering technology and fonts.
Issues
Some people, mostly in Japan, oppose Unicode in general, claiming technical limitations and political problems in process, which people working on the Unicode standard claim are simply misunderstandings of the Unicode standard and the process by which it was created. The most common mistake, according to this view, is confusion between abstract characters and their highly variable visual forms (glyphs). On the other hand, whereas Chinese can readily read most types of glyphs used by Japanese or Koreans, Japanese often can recognize only a particular variant. Unicode has been decried as a plot against Asian cultures perpetrated by Westerners with no understanding of the characters as used in Chinese, Korean, and Japanese, in spite of the presence of a majority of experts from all three countries in the Ideographic Rapporteur Group. The IRG advises the consortium and ISO on additions to the repertoire and on Han unification, the identification of forms in the three languages which will be treated as stylistic variations of the same historical character. This unification is one of the most controversial aspects of Unicode.
Unicode is criticized for failing to allow for older and alternate forms of kanji, which, it is said, complicates the processing of ancient Japanese and uncommon Japanese names, although it follows the recommendations of Japanese scholars of the language and of the Japanese government. There have been several attempts to create an alternative to Unicode. [ 1 ] Among them are TRON (although it is not widely adopted in Japan, some, particularly those who need to handle historical Japanese text, favor this), UTF-2000 and Giga Character Set (GCS). It is true that many older forms were not included in early versions of the Unicode standard, but Unicode 4.0 contains more than 90,000 Han characters, far more than any dictionary or any other standard, and work continues on adding characters from the early literature of China, Korea, and Japan.
Thai language support has been criticized for its illogical ordering of Thai characters. This complication is due to Unicode inheriting the Thai Industrial Standard 620, which worked in the same way. This ordering problem complicates the Unicode collation process. [ 2 ]
Opponents of Unicode sometimes claim even now that it cannot handle more than 65,535 characters, a limitation that was removed in Unicode 2.0.
Use of Unicode
Operating systems
Despite the technical problems, the limitations, and the criticism along the way, Unicode has prevailed as the dominant form of character encoding. Windows NT and its descendants Windows 2000 and Windows XP make extensive use of the UTF-16 encoding form for the internal representation of text. UNIX-like operating systems such as GNU/Linux, Plan 9 from Bell Labs, BSD and Mac OS X have adopted UTF-8 as the basis for representing multilingual text.
The MIME standard defines two different mechanisms for encoding non-ASCII characters in e-mail messages, depending on whether the characters are in the message headers, such as the "Subject:" header, or in the body text of the message. In both cases, the original character set is specified as well as the transfer encoding. For e-mail with Unicode characters, the UTF-8 encoding form and the Base64 transfer encoding are recommended. The details of the two mechanisms are specified in the MIME standard and are generally hidden from ordinary users of e-mail software.
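Both mechanisms can be tried out with Python's standard email package. The sketch below is only an illustration under stated assumptions: the Greek sample strings and variable names are invented for this example, and the exact header encoding chosen (Base64 or quoted-printable) is left to the library.

```python
from email.header import Header, decode_header
from email.mime.text import MIMEText

subject = "Γειά σου"   # sample header text (an assumption for this example)
body = "Καλημέρα"      # sample body text (an assumption for this example)

# Header mechanism: the text becomes an ASCII-only "encoded word" that
# names its character set.
encoded_subject = Header(subject, "utf-8").encode()

# Decoding reverses the process; each part may come back as bytes plus a
# charset name, or as already-decoded text.
decoded_subject = "".join(
    part.decode(charset or "ascii") if isinstance(part, bytes) else part
    for part, charset in decode_header(encoded_subject)
)

# Body mechanism: the part declares its charset and a transfer encoding.
message = MIMEText(body, "plain", "utf-8")
transfer_encoding = message["Content-Transfer-Encoding"]
body_charset = message.get_content_charset()
decoded_body = message.get_payload(decode=True).decode("utf-8")
```

In both cases the wire format is plain ASCII, which is why these details normally stay hidden from the person writing the message.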
The adoption of Unicode in e-mail has been very slow. Most text in East Asia is still encoded in local encodings such as Shift-JIS, and many popular e-mail programs, even if they have some Unicode support, still cannot handle Unicode data correctly. This situation is not expected to change in the near future.
Internet
Modern web browsers can display web pages containing Unicode characters correctly, provided that a suitable font is installed.
Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4.0 and XML 1.0 by definition support documents made up of characters from the entire range of Unicode code points, excluding only certain control characters, the permanently unassigned code points D800-DFFF, any code point ending in FFFE or FFFF, and any code point above 10FFFF.
These characters appear either directly as bytes according to the document's encoding, provided the encoding supports them, or they can be written as numeric character references based on the character's Unicode code point, provided the document's encoding allows the digits and symbols needed to write the references (which is the case for all encodings adopted on the Internet). For example, the references:
&#916; &#1049; &#1511; &#1605; &#3671; &#12354; &#21494; &#33865; &#45307; (or the same values in hexadecimal with the prefix &#x)
are displayed in your browser as Δ, Й, ק, م, ๗, あ, 叶, 葉 and 냻. Provided that you have a suitable font, these symbols appear as Greek capital letter "Delta", Cyrillic capital letter "Short I", Hebrew letter "Qof", Arabic letter "Meem", Thai numeral 7, Japanese Hiragana "A", simplified Chinese "Leaf", traditional Chinese "Leaf", and Korean Hangul syllable "Nyaelh", respectively.
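Numeric character references are easy to generate and decode programmatically. In this Python sketch the helper names to_decimal_ncr and to_hex_ncr are hypothetical; html.unescape is the standard library function that resolves such references.

```python
import html


def to_decimal_ncr(text: str) -> str:
    """Write every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def to_hex_ncr(text: str) -> str:
    """Write every character as a hexadecimal reference (prefix &#x)."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


delta_decimal = to_decimal_ncr("\u0394")       # Greek capital letter Delta
delta_hex = to_hex_ncr("\u0394")
decoded = html.unescape("&#916;&#x419;")       # Delta, then Cyrillic Short I
round_trip = html.unescape(to_decimal_ncr("\u8449\ub0fb"))
```

Since a reference uses only ASCII characters, it survives in a document stored in any encoding that can represent the ampersand, the digits, and the semicolon.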
Fonts
Free and commercial fonts based on the Unicode standard are common, first as TrueType and now as OpenType fonts, both of which support mapping Unicode code points to specific glyphs.
Thousands of fonts exist on the market, but fewer than a dozen attempt to support the majority of the Unicode character repertoire. Instead, Unicode-based fonts usually support only basic ASCII and certain specific scripts. This is mainly for reasons of economy on the part of font designers and of program performance, since font rendering is a process that consumes considerable computer resources and can slow programs to a crawl.
Unicode characters that cannot be rendered are displayed as a white square.
Multilingual text rendering engines
- Uniscribe - Windows
- Apple Type Services for Unicode Imaging - newer engine for Macintosh
- WorldScript - older engine for Macintosh
- Pango - open source software
- ICU Layout Engine - open source software
- Graphite - (open source renderer from SIL)
Input methods
The Microsoft Word word processor allows Unicode characters to be entered in two ways:
- by typing the hexadecimal code point, e.g. 014B (or U+014B) for ŋ, and then pressing Alt + X, so that the string to the left of the cursor is replaced by the corresponding Unicode character. The reverse usually works as well: if there is a Unicode character to the left of the cursor and you press Alt + X, Word replaces the character with its equivalent hexadecimal code point; or
- by typing Alt + #, where # is the decimal code point, e.g. Alt + 0331 gives the Unicode character ŋ.
GNOME 2 follows the ISO 14755 standard: hold down the Ctrl and Shift keys and type the hexadecimal code point of the Unicode character you want to insert.
See also
External links
- Collection of Greek fonts for the X Window System, including Unicode fonts: http://graphis.hellug.gr/el/index.html
- The Unicode Consortium
- Unicode versions: 3.1, 3.2, 4.0, 4.0.1, 4.1
- new characters, scripts and characters and scripts under investigation
- Code Charts (PDF)
- Table of Unicode characters from 1 to 65535
- UTF-8, UTF-16, UTF-32 Code Charts and a character map (JavaScript)
- The Letter Database Uses forms to present groups in list or grid format by hexadecimal.
- Example text files using Unicode
- Unicode special character map is similar to the Windows version. Click a symbol to obtain either the named or numeric code for HTML.
- ConScript Unicode Registry a project to standardize part of the Private Use Area for use with artificial scripts and artificial languages. An explanation of how to propose character names in Unicode is available here.
- The secret life of Unicode "A peek at Unicode's soft underbelly" Describes problems requiring resolution. Includes links to Unicode resources.
- Tim Bray's Characters vs Bytes explains how the different encodings work.
- Alan Wood's Unicode Resources Contains lists of word processors with Unicode capability; fonts and characters are grouped by type; characters are presented in lists, not grids.
- The strongest denunciation of Unicode, and a response to it
- Software engineering:
- International Components for Unicode (ICU) An open source set of libraries that provide robust and full-featured Unicode services for your applications on a wide variety of platforms.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky of JoelonSoftware.COM (somewhat outdated, but still a reasonable starting point).
- Freedesktop.org's Project UTF-8: its purpose is to document and promote proper Unicode support in free and open source software.
- Supplementary Characters in the Java Platform from Sun Microsystems
- Seeing the entirety of Unicode printed out as a single large poster gives a good feel for the size of the code.
