Unicode

Unicode is an international standard that, in the long term, specifies a digital code for every meaningful character or text element of all known writing cultures and character systems. It aims to eliminate the problem of different, incompatible encodings in different countries. Conventional computer character sets cover a repertoire of either 128 characters (7 bits), like the well-known ASCII, or 256 characters (8 bits). Unicode characters are written in hexadecimal with a preceding "U+". An "x" can be used as a placeholder when contiguous ranges are meant, as in "U+01Fx" for the code range U+01F0-U+01FF.
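The notation can be illustrated with a short Python sketch. The helper names `u_plus` and `expand_placeholder` are invented for this example and are not part of any standard library.

```python
def u_plus(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"


def expand_placeholder(pattern: str) -> range:
    """Expand a pattern such as 'U+01Fx' into the code points it covers.

    Each trailing 'x' stands for one arbitrary hexadecimal digit.
    """
    if not pattern.startswith("U+"):
        raise ValueError("pattern must start with 'U+'")
    digits = pattern[2:]
    stem = digits.rstrip("xX")               # fixed leading hex digits
    wildcards = len(digits) - len(stem)      # number of placeholder digits
    first = int(stem, 16) << (4 * wildcards)
    last = first + 16 ** wildcards - 1
    return range(first, last + 1)


print(u_plus("A"))                           # U+0041
block = expand_placeholder("U+01Fx")
print(u_plus(chr(block[0])), u_plus(chr(block[-1])))  # U+01F0 U+01FF
```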

The code space of Unicode originally covered 65,536 characters (UCS-2, 16 bits). This soon turned out to be insufficient. In version 2.0 the code space was extended by a further 16 ranges of equal size, so-called planes. A maximum of 1,114,112 (2^20 + 2^16) characters, or code points, in the range from U+0000 to U+10FFFF is therefore now provided for (UCS-4, 32 bits). So far, in Unicode 4.0, 96,382 codes have been assigned to individual characters. That corresponds to only about 9% of the code space.
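The arithmetic behind these figures can be checked in a few lines of Python. The constant names are chosen for this sketch only; the count of assigned characters is simply the figure quoted above.

```python
PLANES = 17                       # the original plane plus 16 further ones
CODE_POINTS_PER_PLANE = 2 ** 16   # 65,536 code points in each plane

code_space = PLANES * CODE_POINTS_PER_PLANE
# Same number, written as in the text and as the highest code point plus one.
assert code_space == 2 ** 20 + 2 ** 16 == 0x10FFFF + 1

assigned = 96_382                 # figure quoted in the text for Unicode 4.0
share = assigned / code_space

print(code_space)                 # 1114112
print(f"{share:.1%}")             # 8.7%
```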

The code blocks into which the Unicode planes are subdivided are specified in full in the list of Unicode blocks. Besides the valid encoded characters, that list also gives the long-term planning, some of which is still quite vague.

Unicode is stored and transmitted in different formats:

  • Unicode Transformation Format (UTF), of which UTF-8 is the most common, for example on the Internet. UTF-16 is also of great importance, for example as the character encoding in Java. It represents every UCS-2 code point by a single 16-bit unit with the same value as in UCS-4, and every other code point by a sequence of two such units, the so-called surrogate pairs.
  • SCSU (Standard Compression Scheme for Unicode, formerly also called RCSU, Reuters' Compression Scheme for Unicode) is a method for space-saving storage that exploits the arrangement of the different alphabets in blocks (see the external links).
  • UTF-EBCDIC is a Unicode extension that builds on the proprietary EBCDIC format of IBM mainframes.
  • Punycode serves to encode domain names that contain non-ASCII characters. See also: IDNA.
  • In addition there are the formats CESU-8 and GB18030.
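How UTF-16 forms a surrogate pair can be sketched in Python as follows. The function name `to_surrogate_pair` is made up for this illustration, and the sample character (U+1D11E, a musical clef) is just one arbitrary code point outside the 16-bit range.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


clef = "\U0001D11E"                      # sample character outside the 16-bit range
high, low = to_surrogate_pair(ord(clef))
print(f"{high:04X} {low:04X}")           # D834 DD1E

# Cross-check against Python's built-in codecs.
assert clef.encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert "A".encode("utf-16-be") == b"\x00A"   # a 16-bit code point stays one unit
assert len(clef.encode("utf-8")) == 4        # UTF-8 needs four bytes for it
```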

Institutions for standardisation

The non-profit Unicode Consortium is responsible for the industry standard Unicode. The ISO (International Organization for Standardization) publishes the international standard ISO 10646. The two institutions co-operate closely. Since 1993 Unicode and ISO 10646 have been identical as far as the character encoding is concerned. While ISO 10646 specifies only the actual character encoding, Unicode also includes a comprehensive set of rules which, among other things, clearly specifies for every character further properties that matter in concrete applications, such as sort order, writing direction and the rules for combining characters.

At present Unicode is, strictly speaking, still a subset of ISO 10646: while ISO 10646 permits character codes of up to 31 bits, Unicode permits at most 21 bits. In the near future, however, the ISO code space is likely to be reduced to that of Unicode.

Encoding criteria

In contrast to other standards, Unicode has the peculiarity that characters, once encoded, are never removed again, in order to ensure the longevity of digital data. If the standardisation of a character later proves to have been a mistake, its use is at most discouraged. The admission of a character into the standard therefore requires extremely careful examination, which can drag on for years.

Unicode encodes only "abstract characters" (English: character), not glyphs. The latter are the graphical representation of abstract characters, which can turn out extremely differently, in the Latin alphabet for example in Fraktur, Antiqua, Gaelic type and handwriting; see also Glyph. For glyph variants whose standardisation is shown to be meaningful and necessary, 256 "variation selectors" have been provided as a precaution; where needed, they can be placed after the actual code.

On the other hand, fonts that contain both the Latin alphabet and a further alphabet contain doubly encoded, identical glyphs for a number of ambiguous letters. For many characters there are not only variants determined by the typeface but also, within a single typeface, more or less necessary language-, script- or context-dependent glyph variants and ligatures. Rendering them requires so-called smart font technologies such as OpenType, but no Unicode encoding.

In borderline cases, however, there are hard struggles over the decision whether something is a glyph variant or a character worth encoding, i.e. a different grapheme. For example, quite a few specialists are of the opinion that the Phoenician alphabet can be regarded as glyph variants of the Hebrew one, since the entire character set of Phoenician has clear correspondences there and the two languages are also very closely related. In the end the view prevailed that it is a separate writing system, a "script" in Unicode terminology. The situation is different with CJK (Chinese, Japanese and Korean). Here the forms of many equivalent characters diverged during the 20th century. The language-specific glyphs nevertheless share the same codes in Unicode. In practice, language-specific fonts are probably what is predominantly used here, and these already stand out for their unusual file sizes. The unified encoding of the CJK characters (Han unification) was one of the most important and most extensive pieces of preparatory work for the development of Unicode. It is quite controversial, particularly in Japan. For details (in English) see the external links.

When the foundation stone for Unicode was laid, it had to be taken into account that a multitude of different encodings was already in common use. Unicode-based systems were meant to be able to handle conventionally encoded data with little effort. For this reason the widespread ISO 8859-1 encoding (Latin-1) was retained for the lower 256 characters, as were the encodings of various national standards, e.g. TIS-620 for Thai (almost identical to ISO 8859-11) or ISCII for Indic scripts, which were merely shifted, in their original order, into higher code ranges.
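Both kinds of compatibility can be observed with Python's built-in codecs. The following is a minimal sketch; the name `THAI_OFFSET` is introduced here for illustration, and only one contiguous run of assigned TIS-620 bytes is checked.

```python
# Latin-1: every byte value equals the code point it decodes to.
latin1_bytes = bytes(range(256))
assert [ord(c) for c in latin1_bytes.decode("latin-1")] == list(range(256))

# TIS-620: the Thai letters keep their order but sit at a fixed distance
# higher up in Unicode (byte 0xA1 corresponds to U+0E01).
THAI_OFFSET = 0x0E01 - 0xA1
thai_bytes = bytes(range(0xA1, 0xDB))        # 0xA1..0xDA, a contiguous assigned run
thai_text = thai_bytes.decode("tis_620")
assert all(ord(c) == b + THAI_OFFSET for c, b in zip(thai_text, thai_bytes))

print(hex(THAI_OFFSET))                      # 0xd60
```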

Every character of the relevant legacy encodings was taken over into the standard, even if it does not meet the criteria normally applied. For the most part these are characters composed of two or more characters, such as letters with diacritical marks. Incidentally, a large share of software still lacks the ability to assemble characters with diacritics properly even today. The exact definition of equivalent encodings is part of the extensive set of rules belonging to Unicode. Although the hexadecimal digits A to F formally fulfil the criteria for separate encoding, this had to be omitted, because in practice their function is always taken over by the letters A to F.
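The equivalence between a precomposed letter and its combining sequence can be demonstrated with the normalization forms in Python's `unicodedata` module. This is only a sketch using "é" as an arbitrary sample letter.

```python
import unicodedata

precomposed = "\u00E9"      # é as a single, precomposed code point
decomposed = "e\u0301"      # e followed by COMBINING ACUTE ACCENT

# The two strings differ code point by code point ...
assert precomposed != decomposed
# ... but normalization converts one equivalent form into the other.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

print(len(precomposed), len(decomposed))  # 1 2
```

Software that compares or searches text usually normalizes both sides to the same form first, so that equivalent encodings are treated alike.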

Many Unicode characters have no glyph assigned to them. They too count as "characters". Besides the control characters such as line feed (U+000A), tab (U+0009) and so on, 19 characters alone are explicitly defined as spaces, including ones without width, which are used among other things as word separators for languages such as Thai or Tibetan that are written without spaces between words. For bidirectional text, e.g. Arabic and Latin, seven formatting characters are required.
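A few such glyphless characters can be inspected with Python's `unicodedata` module. The selection of sample characters below is arbitrary, and the reported properties reflect the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

samples = ["\u000A", "\u0009", "\u0020", "\u200B", "\u202B"]
for ch in samples:
    # Control characters have no name in the database, hence the default.
    name = unicodedata.name(ch, "<control>")
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")

# ZERO WIDTH SPACE (U+200B) can mark word boundaries without showing a gap.
thai = "สวัสดี\u200bครับ"
words = thai.split("\u200b")
print(len(words))                            # 2

# U+202B is one of the formatting characters for bidirectional text.
print(unicodedata.bidirectional("\u202B"))   # RLE
```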

Example: Combining Grapheme Joiner (CGJ)

The CGJ is an invisible special character that is normally ignored completely by application programs (English: default ignorable). It is expressly not meant to be used for marking glyph variants or the like. Its use is defined as follows:

In some languages there are digraphs and trigraphs that are in principle treated as independent letters, which means in particular that they are sorted as such. In Hungarian, for example, these are: cs, dz, dzs, gy, ly, ny, sz, ty and zs. In order to mark exceptions to this where necessary, the "Combining Grapheme Joiner" CGJ (U+034F) was introduced. The name actually means the opposite, but once they belong to the standard, the names of encoded characters are never changed either.

If a letter carries several diacritics above or below it, these are normally stacked vertically. For exceptional cases in which two diacritics must stand next to each other, Unicode provides that a CGJ is placed between them. It is then up to the font developer to define the glyph shape for the character sequence "diacritic1 CGJ diacritic2", which can then be accessed by means of a font technology such as OpenType.

The "default ignorable" property specified in the standard also qualifies the CGJ for marking, in special cases, other fine distinctions that are otherwise unnecessary. For instance, the data processing of German libraries requires a distinction between umlaut and diaeresis (the latter usually in foreign-language names). Here Unicode recommends placing the CGJ in front of the diaeresis (U+0308) in order to mark it as an umlaut. The subsequent separate encoding of the umlaut dots, originally proposed by DIN, would have led to a hardly justifiable inconsistency in large data sets.
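Why a CGJ placed in front of U+0308 survives as a marker can be seen from how normalization treats the two sequences. The following Python sketch uses the letter "a" as an arbitrary base; the variable names are chosen for this example only.

```python
import unicodedata

CGJ = "\u034F"          # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"    # COMBINING DIAERESIS

assert unicodedata.name(CGJ) == "COMBINING GRAPHEME JOINER"

plain = "a" + DIAERESIS
marked = "a" + CGJ + DIAERESIS

# Without the CGJ the pair composes to the precomposed letter U+00E4 ...
assert unicodedata.normalize("NFC", plain) == "\u00E4"
# ... with it, normalization leaves the sequence alone, so the mark survives.
assert unicodedata.normalize("NFC", marked) == marked

# A program that ignores the CGJ sees the same text in both cases.
assert marked.replace(CGJ, "") == plain
print(len(plain), len(marked))  # 2 3
```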

Input methods

To use a Unicode character (for example "⊕") in HTML or XML, one first looks it up in the appropriate table (here: mathematical symbols). Its code number is given there. From this number one then forms a character entity by prefixing "&#x" and appending a semicolon, i.e. "&#x2295;". The number can also be given in decimal in the character entity, in which case the leading "x" is omitted, for example "&#8853;" for the same character. The Text Encoding Initiative (TEI) has compiled recommendations for entering Unicode in XML files in a more easily understandable form. These consist of a set of named characters (English: named entities) that is included in the stylesheet. Commonly used named characters are, for example, the umlauts, such as "&Auml;" instead of a numeric reference such as "&#xC4;" for Ä.
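Building and resolving such character entities can be sketched in Python with the standard `html` module. The helper names `hex_reference` and `decimal_reference` are invented for this example, and U+2295 (the circled plus from the mathematical symbols) serves as the sample character.

```python
import html


def hex_reference(ch: str) -> str:
    """Build a hexadecimal character entity such as '&#x2295;'."""
    return f"&#x{ord(ch):X};"


def decimal_reference(ch: str) -> str:
    """Build the decimal form of the same entity, without the 'x'."""
    return f"&#{ord(ch)};"


circled_plus = "\u2295"
print(hex_reference(circled_plus))       # &#x2295;
print(decimal_reference(circled_plus))   # &#8853;

# Both numeric forms, and a named entity, resolve back to plain characters.
assert html.unescape("&#x2295;") == html.unescape("&#8853;") == circled_plus
assert html.unescape("&Auml;") == "\u00C4"
```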

In Vi Improved, Unicode characters can be entered by their code (precondition: a suitable locale), for example

 

  > German to English > de.wikipedia.org (Machine translated into English)