2.1 Character Set

24
Dec

Character set

1/2 {character set} The character repertoire for the text of an Ada program consists of the entire coding space described by the ISO/IEC 10646:2003 Universal Multiple-Octet Coded Character Set. This coding space is organized in planes, each plane comprising 65536 characters. {plane (character)} {character plane}

1.c/2 Discussion: It is our intent to follow the terminology of ISO/IEC 10646:2003 where appropriate, and to remain compatible with the character classifications defined in §A.3, “Character Handling”.

Syntax

3.1/2 A character is defined by this International Standard for each cell in the coding space described by ISO/IEC 10646:2003, regardless of whether or not ISO/IEC 10646:2003 allocates a character to that cell. 

Static Semantics

4/2 The coded representation for characters is implementation defined [(it need not be a representation defined within ISO/IEC 10646:2003)]. A character whose relative code position in its plane is 16#FFFE# or 16#FFFF# is not allowed anywhere in the text of a program. 

4.a Implementation defined: The coded representation for the text of an Ada program.

4.b/2 Ramification: Note that this rule doesn't really have much force, since the implementation can represent characters in the source in any way it sees fit. For example, an implementation could simply define that what seems to be an other_private_use character is actually a representation of the space character. 
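The plane structure from 1/2 and the restriction in 4/2 can be illustrated with a small sketch. The function names below are hypothetical and not part of the Standard; code points are handled as plain integers, so the check does not depend on any particular coded representation.

```python
PLANE_SIZE = 0x10000  # 65536 cells per plane, as stated in 1/2


def plane_and_offset(code_point):
    """Split a code point into (plane number, relative position in its plane)."""
    return divmod(code_point, PLANE_SIZE)


def is_forbidden_cell(code_point):
    """True when the relative code position is 16#FFFE# or 16#FFFF# (see 4/2)."""
    _, offset = plane_and_offset(code_point)
    return offset in (0xFFFE, 0xFFFF)
```

For instance, `plane_and_offset(0x1F600)` yields plane 1 with relative position 16#F600#, and `is_forbidden_cell(0x1FFFF)` is true because the last cell of every plane is excluded, not only that of plane 0.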

4.1/2 The semantics of an Ada program whose text is not in Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003) is implementation defined.

4.c/2 Implementation defined: The semantics of an Ada program whose text is not in Normalization Form KC.
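As a rough illustration of 4.1/2, a tool could test whether program text is already in Normalization Form KC before deciding how to treat it. This is only a sketch: it relies on Python's `unicodedata` module, whose tables follow a newer Unicode version than the one referenced by ISO/IEC 10646:2003, and `is_nfkc` is a hypothetical helper name.

```python
import unicodedata


def is_nfkc(text):
    """True when normalizing to NFKC leaves the text unchanged."""
    return unicodedata.normalize("NFKC", text) == text
```

Text containing a compatibility character such as U+FB01 (LATIN SMALL LIGATURE FI) is not in Normalization Form KC, because NFKC replaces the ligature with the two letters "fi"; the semantics of a program containing it are therefore implementation defined.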

5/2 The description of the language definition in this International Standard uses the character properties General Category, Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition of the documents referenced by the note in section 1 of ISO/IEC 10646:2003 (see footnote 1). The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified. {unspecified [partial]}

6/2 Characters are categorized as follows:

6.a/2 Discussion: Our character classification considers that the cells not allocated in ISO/IEC 10646:2003 are graphic characters, except for those whose relative code position in their plane is 16#FFFE# or 16#FFFF#. This seems to provide the best compatibility with future versions of ISO/IEC 10646, as future characters can already be used in Ada character and string literals.

8/2 {letter_uppercase} letter_uppercase

Any character whose General Category is defined to be “Letter, Uppercase”.

9/2 {letter_lowercase} letter_lowercase

Any character whose General Category is defined to be “Letter, Lowercase”.

9.1/2 {letter_titlecase} letter_titlecase

Any character whose General Category is defined to be “Letter, Titlecase”.

9.2/2 {letter_modifier} letter_modifier

Any character whose General Category is defined to be “Letter, Modifier”.

9.3/2 {letter_other} letter_other

Any character whose General Category is defined to be “Letter, Other”.

9.4/2 {mark_non_spacing} mark_non_spacing

Any character whose General Category is defined to be “Mark, Non-Spacing”.

9.5/2 {mark_spacing_combining} mark_spacing_combining

Any character whose General Category is defined to be “Mark, Spacing Combining”.

10/2 {number_decimal} number_decimal

Any character whose General Category is defined to be “Number, Decimal”.

10.1/2 {number_letter} number_letter

Any character whose General Category is defined to be “Number, Letter”.

10.2/2 {punctuation_connector} punctuation_connector

Any character whose General Category is defined to be “Punctuation, Connector”.

10.3/2 {other_format} other_format

Any character whose General Category is defined to be “Other, Format”.

11/2 {separator_space} separator_space

Any character whose General Category is defined to be “Separator, Space”.

12/2 {separator_line} separator_line

Any character whose General Category is defined to be “Separator, Line”.

12.1/2 {separator_paragraph} separator_paragraph

Any character whose General Category is defined to be “Separator, Paragraph”.

13/2 {format_effector} format_effector

The characters whose code positions are 16#09# (CHARACTER TABULATION), 16#0A# (LINE FEED), 16#0B# (LINE TABULATION), 16#0C# (FORM FEED), 16#0D# (CARRIAGE RETURN), 16#85# (NEXT LINE), and the characters in categories separator_line and separator_paragraph. {control character: See also format_effector}

13.a/2 Discussion: ISO/IEC 10646:2003 does not define the names of control characters, but rather refers to the names defined by ISO/IEC 6429:1992. These are the names that we use here. 

13.1/2 {other_control} other_control

Any character whose General Category is defined to be “Other, Control”, and which is not defined to be a format_effector.

13.2/2 {other_private_use} other_private_use

Any character whose General Category is defined to be “Other, Private Use”.

13.3/2 {other_surrogate} other_surrogate

Any character whose General Category is defined to be “Other, Surrogate”.

14/2 {graphic_character} graphic_character

Any character that is not in the categories other_control, other_private_use, other_surrogate, format_effector, and whose relative code position in its plane is neither 16#FFFE# nor 16#FFFF#.
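The categorization in 8/2 to 14/2 could be approximated as follows. This is a sketch under stated assumptions: it uses Python's `unicodedata` module (which tracks a more recent Unicode version than 4.0), the names `is_format_effector`, `category_name` and `is_graphic_character` are hypothetical, and unallocated cells are treated as graphic characters as described in 6.a/2.

```python
import unicodedata

# Code positions listed explicitly in 13/2.
FORMAT_EFFECTOR_CODES = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}

# General Category abbreviation -> category name used in this clause.
CATEGORY_NAMES = {
    "Lu": "letter_uppercase", "Ll": "letter_lowercase",
    "Lt": "letter_titlecase", "Lm": "letter_modifier", "Lo": "letter_other",
    "Mn": "mark_non_spacing", "Mc": "mark_spacing_combining",
    "Nd": "number_decimal", "Nl": "number_letter",
    "Pc": "punctuation_connector", "Cf": "other_format",
    "Zs": "separator_space", "Zl": "separator_line",
    "Zp": "separator_paragraph", "Co": "other_private_use",
    "Cs": "other_surrogate",
}


def is_format_effector(code_point):
    """13/2: six listed control characters plus separator_line/paragraph."""
    gc = unicodedata.category(chr(code_point))
    return code_point in FORMAT_EFFECTOR_CODES or gc in ("Zl", "Zp")


def category_name(code_point):
    """Name from 8/2 to 13.3/2, or None when no named category applies."""
    gc = unicodedata.category(chr(code_point))
    if gc == "Cc":
        # 13.1/2: control characters that are not format effectors.
        return None if is_format_effector(code_point) else "other_control"
    return CATEGORY_NAMES.get(gc)


def is_graphic_character(code_point):
    """14/2: everything not excluded by category or by position in its plane."""
    if (code_point & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    if is_format_effector(code_point):
        return False
    return category_name(code_point) not in (
        "other_control", "other_private_use", "other_surrogate")
```

With these definitions the letter "A" and the space character are graphic characters, while LINE FEED, NUL, U+2028 (LINE SEPARATOR) and any private-use or surrogate code point are not.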

14.b/2 Discussion: We considered basing the definition of lexical elements on Annex A of ISO/IEC TR 10176 (4th edition), which lists the characters which should be supported in identifiers for all programming languages, but we finally decided against this option. Note that it is not our intent to diverge from ISO/IEC TR 10176, except to the extent that ISO/IEC TR 10176 itself diverges from ISO/IEC 10646:2003 (which is the case at the time of this writing [January 2005]).

14.c/2 More precisely, we intend to align strictly with ISO/IEC 10646:2003. It must be noted that ISO/IEC TR 10176 is a Technical Report while ISO/IEC 10646:2003 is a Standard. If one has to make a choice, one should conform with the Standard rather than with the Technical Report. And, it turns out that one must make a choice because there are important differences between the two:

  • 14.d/2 ISO/IEC TR 10176 is still based on ISO/IEC 10646:2000 while ISO/IEC 10646:2003 has already been published for a year. We cannot afford to delay the adoption of our amendment until ISO/IEC TR 10176 has been revised.
  • 14.e/2 There are considerable differences between the two editions of ISO/IEC 10646, notably in supporting characters beyond the BMP (this might be significant for some languages, e.g. Korean).
  • 14.f/2 ISO/IEC TR 10176 does not define case conversion tables, which are essential for a case-insensitive language like Ada. To get case conversion tables, we would have to reference either ISO/IEC 10646:2003 or Unicode, or we would have to invent our own. 

14.g/2 For the purpose of defining the lexical elements of the language, we need character properties like categorization, as well as case conversion tables. These are mentioned in ISO/IEC 10646:2003 as useful for implementations, with a reference to Unicode. Machine-readable tables are available on the web at URLs: 

14.h/2 http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt and http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt

14.i/2 with an explanatory document found at URL: 

14.j/2 http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html

14.k/2 The actual text of the standard only makes specific references to the corresponding clauses of ISO/IEC 10646:2003, not to Unicode.

15/2 The following names are used when referring to certain characters (the first name is that given in ISO/IEC 10646:2003) {quotation mark} {number sign} {ampersand} {apostrophe} {tick} {left parenthesis} {right parenthesis} {asterisk} {multiply} {plus sign} {comma} {hyphen-minus} {minus} {full stop} {dot} {point} {solidus} {divide} {colon} {semicolon} {less-than sign} {equals sign} {greater-than sign} {low line} {underline} {vertical line} {exclamation point} {percent sign}

15.a/2 Discussion: {graphic symbols} {glyphs} This table serves to show the correspondence between ISO/IEC 10646:2003 names and the graphic symbols (glyphs) used in this International Standard. These are the characters that play a special role in the syntax of Ada.

graphic symbol   name                     graphic symbol   name
"                quotation mark           :                colon
#                number sign              ;                semicolon
&                ampersand                <                less-than sign
'                apostrophe, tick         =                equals sign
(                left parenthesis         >                greater-than sign
)                right parenthesis        _                low line, underline
*                asterisk, multiply       |                vertical line
+                plus sign                /                solidus, divide
,                comma                    !                exclamation point
-                hyphen-minus, minus      %                percent sign
.                full stop, dot, point

Implementation Permissions

NOTES

17/2 [1] The characters in categories other_control, other_private_use, and other_surrogate are only allowed in comments.

18 [2] The language does not specify the source representation of programs.

18.a/2 Discussion: Any source representation is valid so long as the implementer can produce an (information-preserving) algorithm for translating both directions between the representation and the standard character set. (For example, every character in the standard character set has to be representable, even if the output devices attached to a given computer cannot print all of those characters properly.) From a practical point of view, every implementer will have to provide some way to process the ACATS. It is the intent to allow source representations, such as parse trees, that are not even linear sequences of characters. It is also the intent to allow different fonts: reserved words might be in bold face, and that should be irrelevant to the semantics. 

Extensions to Ada 83

18.b {extensions to Ada 83} Ada 95 allows 8-bit and 16-bit characters, as well as implementation-specified character sets. 

Wording Changes from Ada 83

18.c/2 The syntax rules in this clause are modified to remove the emphasis on basic characters vs. others. (In this day and age, there is no need to point out that you can write programs without using (for example) lower case letters.) In particular, character (representing all characters usable outside comments) is added, and basic_graphic_character, other_special_character, and basic_character are removed. Special_character is expanded to include Ada 83's other_special_character, as well as new 8-bit characters not present in Ada 83. Ada 2005 removes special_character altogether; we want to stick to ISO/IEC 10646:2003 character classifications. Note that the term “basic letter” is used in §A.3, “Character Handling” to refer to letters without diacritical marks.

18.d/2 Character names now come from ISO/IEC 10646:2003.

Extensions to Ada 95

18.f/2 {extensions to Ada 95} Program text can use most characters defined by ISO-10646:2003. This clause has been rewritten to use the categories defined in that Standard. This should ease programming in languages other than English.


[aada]

Alexis Ada

Accept any character in the range 1 to 16#7FFFFFFD# (see 3.1/2), except those that are clearly marked as illegal in ISO-10646:2003 (0xFFFE and 0xFFFF in each plane, see 4/2). We also refuse 0 (if I find the reference, I'll add it here); in practice, many control characters are not legal either.

We will use the most current Unicode tables as offered by http://unicode.org (at time of writing, this is version 5.2.0). See http://www.unicode.org/ucd/ for the latest version. The files used by AAda are usually found at a URL such as: http://www.unicode.org/Public/5.2.0/

All input files to the main compiler (i.e. not including any front-end filters) are treated as UTF-8 encoded text (see 4.a), which is not compatible with the general default for Ada, ISO-8859-1.

The UTF-8 encoding must be normalized to be accepted, i.e. each character must be encoded with the fewest bytes possible (see 4.1/2 and 4.c/2), although we support unnormalized characters on request (see pragmas).
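To illustrate the shortest-encoding requirement, the sketch below uses Python's built-in strict UTF-8 decoder, which rejects overlong sequences. It is not AAda's implementation, and `is_shortest_form_utf8` is a hypothetical name. Note that Python's decoder is stricter than the range given above: it also rejects surrogates and anything beyond U+10FFFF.

```python
def is_shortest_form_utf8(raw):
    """True when the bytes decode as strict UTF-8 (no overlong sequences)."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

The two-byte sequence C0 80 is an overlong encoding of character 0 and is rejected, as is E0 80 AF (an overlong "/"), while C3 A9 (the shortest encoding of U+00E9) is accepted.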

All characters are categorized in ISO-10646:2003. The category is available to the compiler to give each character a type (see 8/2 to 14/2). This is done using a character library that defines a character object type. The library makes use of the definitions given by the Unicode website; however, those tables are tweaked to match the Ada 2005 types directly, without further processing. The character object includes:

  • The character in UCS-4 encoding;
  • The other case of a lower or upper case character (§8/2, §9/2);
  • The ASCII decimal digit (§10/2);
  • The ASCII letter digit (§10.1/2);
  • The Ada 2005 character type (§8/2, §14/2);
  • The UTF-8 representation (since we already have it, we may as well keep it so that we can print the character without having to re-encode it).

NOTE  The ASCII encoding is managed internally for fast processing.
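A character object along the lines of the list above might look like the following sketch. The class and field names are hypothetical, the data comes from Python's `unicodedata` module instead of AAda's tweaked tables, and the number_letter field is left out because its exact content is not described here.

```python
from dataclasses import dataclass
from typing import Optional
import unicodedata


@dataclass(frozen=True)
class CharInfo:
    code_point: int               # the character as a UCS-4 value
    other_case: Optional[int]     # opposite-case code point, if a single one exists
    decimal_digit: Optional[int]  # digit value for number_decimal characters
    general_category: str         # Unicode General Category abbreviation
    utf8: bytes                   # UTF-8 representation, kept for printing


def make_char_info(ch):
    """Build a CharInfo record for a one-character string."""
    swapped = ch.swapcase()
    # Keep the other case only when it is a different, single character.
    other = ord(swapped) if len(swapped) == 1 and swapped != ch else None
    return CharInfo(
        code_point=ord(ch),
        other_case=other,
        decimal_digit=unicodedata.decimal(ch, None),
        general_category=unicodedata.category(ch),
        utf8=ch.encode("utf-8"),
    )
```

Keeping the UTF-8 bytes next to the code point mirrors the last bullet: the record can be printed without re-encoding the character.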

The UCS-4 character is computed and used by the lexer and tokenizer objects, but it is output back as UTF-8 for later stages that do not require detailed information about each character. However, we also keep the source case for error messages and an uppercase version of the string to allow case-insensitive comparisons.
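Keeping both spellings can be sketched as below. The `Identifier` type and its helpers are hypothetical, and Python's `str.upper()` stands in for the Uppercase Mapping tables that AAda would actually use.

```python
from typing import NamedTuple


class Identifier(NamedTuple):
    source: str   # spelling as written, used in error messages
    folded: str   # uppercase spelling, used for comparisons


def make_identifier(text):
    """Record the source spelling together with its uppercase form."""
    return Identifier(source=text, folded=text.upper())


def same_identifier(a, b):
    """Case-insensitive equality based on the uppercase spellings."""
    return a.folded == b.folded
```

Two spellings such as "Put_Line" and "PUT_LINE" compare equal through `folded`, while `source` still allows a diagnostic to quote the name exactly as the programmer wrote it.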

Numbers are not converted by the lexer and returned to the parser as numbers; that conversion will be done by the parser instead.

General Category Values (from Unicode UCD 4.0.0)

The values in this field are abbreviations for the following values. For more information, see the Unicode Standard.

Note: The Unicode Standard does not assign information to control characters (except for certain cases). Implementations will generally also assign categories to certain control characters, notably CR and LF, according to platform conventions. See Section §5.8 "Newline Guidelines" for more information.

Abbr. Description

Lu Letter, Uppercase
Ll Letter, Lowercase
Lt Letter, Titlecase
Lm Letter, Modifier
Lo Letter, Other
Mn Mark, Non-Spacing
Mc Mark, Spacing Combining
Me Mark, Enclosing
Nd Number, Decimal
Nl Number, Letter
No Number, Other
Pc Punctuation, Connector
Pd Punctuation, Dash
Ps Punctuation, Open
Pe Punctuation, Close
Pi Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po Punctuation, Other
Sm Symbol, Math
Sc Symbol, Currency
Sk Symbol, Modifier
So Symbol, Other
Zs Separator, Space
Zl Separator, Line
Zp Separator, Paragraph
Cc Other, Control
Cf Other, Format
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not Assigned (no characters in the file have this property)

Note: The term "L&" is used to stand for Uppercase, Lowercase or Titlecase letters (Lu, Ll, or Lt) in comments. The LC value in PropertyValueAliases.txt also stands for Uppercase, Lowercase or Titlecase letters.

[/aada]

  • 1. The note in section 1 of ISO/IEC 10646:2003 reads:
    NOTE – The Unicode Standard, Version 4.0 includes a set of characters, names, and coded representations that are identical with those in this International Standard. It additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers.