2.3 Identifiers

25
Dec

1 Identifiers are used as names. 

Syntax

2/2 identifier ::= identifier_start {identifier_start | identifier_extend}
3/2 identifier_start ::= 
     letter_uppercase
   | letter_lowercase
   | letter_titlecase
   | letter_modifier
   | letter_other
   | number_letter
3.1/2 identifier_extend ::= 
     mark_non_spacing
   | mark_spacing_combining
   | number_decimal
   | punctuation_connector
   | other_format

4/2 After eliminating the characters in category other_format, an identifier shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category.

4.a/2 Reason: This rule was stated in the syntax in Ada 95, but that has gotten too complex in Ada 2005. Since other_format characters usually do not display, we do not want to count them as separating two underscores.

Static Semantics

5/2 Two identifiers are considered the same if they consist of the same sequence of characters after applying the following transformations (in this order):

  • 5.1/2 The characters in category other_format are eliminated.

  • 5.2/2 The remaining sequence of characters is converted to upper case. {case insensitive}

5.3/2 After applying these transformations, an identifier shall not be identical to a reserved word (in upper case). 
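The two transformations above can be sketched in a few lines of Python. This is only an illustration: it assumes that Python's `unicodedata` tables (which follow a newer Unicode version than ISO-10646:2003) and `str.upper()` are acceptable stand-ins for the categories and case conversion the rule refers to, and the names `canonical` and `same_identifier` are hypothetical.

```python
import unicodedata


def canonical(identifier: str) -> str:
    """Apply 5.1/2 then 5.2/2 (assumed mapping onto Python facilities)."""
    # 5.1/2: eliminate characters in category other_format (Unicode "Cf").
    without_format = "".join(
        ch for ch in identifier if unicodedata.category(ch) != "Cf"
    )
    # 5.2/2: convert the remaining sequence to upper case.
    return without_format.upper()


def same_identifier(a: str, b: str) -> bool:
    """Two identifiers are the same if their canonical forms are equal."""
    return canonical(a) == canonical(b)
```

For instance, `same_identifier("Get_Symbol", "GET_SYMBOL")` holds, and so does `same_identifier("Page\u00adCount", "pagecount")`, because SOFT HYPHEN (U+00AD) is in category other_format.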

5.b/2 Implementation Note: We match the reserved words after doing these transformations so that the rules for identifiers and reserved words are the same. (This allows other_format characters, which usually don't display, in a reserved word without changing it to an identifier.) Since a compiler usually will lexically process identifiers and reserved words the same way (often with the same code), this will prevent a lot of headaches. 

5.c/2 Ramification: The rules for reserved words differ in one way: they define case conversion on letters rather than sequences. This means that some unusual sequences are neither identifiers nor reserved words. For instance, “ıf” and “acceß” have upper case conversions of “IF” and “ACCESS” respectively. These are not identifiers, because the transformed values are identical to a reserved word. But they are not reserved words, either, because the original values do not match any reserved word as defined or with any number of characters of the reserved word in upper case. Thus, these odd constructions are just illegal, and should not appear in the source of a program. 

Implementation Permissions

6 In a nonstandard mode, an implementation may support other upper/lower case equivalence rules for identifiers[, to accommodate local conventions].

6.a/2 Discussion: For instance, in most languages, the uppercase equivalent of LATIN SMALL LETTER I (a lower case letter with a dot above) is LATIN CAPITAL LETTER I (an upper case letter without a dot above). In Turkish, though, LATIN SMALL LETTER I and LATIN SMALL LETTER DOTLESS I are two distinct letters, so the upper case equivalent of LATIN SMALL LETTER I is LATIN CAPITAL LETTER I WITH DOT ABOVE, and the upper case equivalent of LATIN SMALL LETTER DOTLESS I is LATIN CAPITAL LETTER I. Take for instance the following identifier (which is the name of a city on the Tigris river in Eastern Anatolia): 

6.b/2 diyarbakır -- The first i is dotted, the second isn't.

6.c/2 Locale-independent conversion to upper case results in: 

6.d/2 DIYARBAKIR -- Both Is are dotless.

6.e/2 This means that the four following sequences of characters represent the same identifier, even though for a locutor of Turkish they would probably be considered distinct words: 

6.f/2 diyarbakir diyarbakır dıyarbakir dıyarbakır

6.g/2 An implementation targeting the Turkish market is allowed (in fact, expected) to provide a nonstandard mode where case folding is appropriate for Turkish. This would cause the original identifier to be converted to:

6.h/2 DİYARBAKIR -- The first I is dotted, the second isn't.

6.i/2 and the four sequences of characters shown above would represent four distinct identifiers.

6.j/2 Lithuanian and Azeri are two other languages that present similar idiosyncrasies. 
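The difference between the two conversions can be reproduced with a short Python sketch. It assumes that `str.upper()` behaves as the locale-independent conversion described above; `upper_turkish` is a hypothetical helper standing in for a nonstandard Turkish mode and is not part of any real compiler.

```python
DOTTED_SMALL_I = "i"            # U+0069 LATIN SMALL LETTER I
DOTLESS_SMALL_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"     # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The four spellings from 6.f/2, written with escapes so the dots are explicit.
SPELLINGS = [
    "diyarbakir",
    "diyarbak\u0131r",
    "d\u0131yarbakir",
    "d\u0131yarbak\u0131r",
]


def upper_default(identifier: str) -> str:
    """Locale-independent conversion: both small i letters become I."""
    return identifier.upper()


def upper_turkish(identifier: str) -> str:
    """Hypothetical Turkish mode: the dot is preserved across case."""
    mapped = identifier.replace(DOTTED_SMALL_I, DOTTED_CAPITAL_I)
    mapped = mapped.replace(DOTLESS_SMALL_I, "I")
    return mapped.upper()


# Number of distinct identifiers among the four spellings in each mode.
distinct_default = len({upper_default(s) for s in SPELLINGS})
distinct_turkish = len({upper_turkish(s) for s in SPELLINGS})
```

Under the default conversion the four spellings collapse into one identifier; under the Turkish-style conversion they remain four.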

NOTES

6.1/2 [3] Identifiers differing only in the use of corresponding upper and lower case letters are considered the same. 

Examples

7 Examples of identifiers:

8/2

Count      X    Get_Symbol   Ethelyn   Marion
Snobol_4   X1   Page_Count   Store_Next_Item
Πλάτων      -- Plato 
Чайковский  -- Tchaikovsky  
θ  φ        -- Angles

Wording Changes from Ada 83

8.a We no longer include reserved words as identifiers. This is not a language change. In Ada 83, identifier included reserved words. However, this complicated several other rules (for example, regarding implementation-defined attributes and pragmas, etc.). We now explicitly allow certain reserved words for attribute designators, to make up for the loss. 

8.b Ramification: Because syntax rules are relevant to overload resolution, it means that if it looks like a reserved word, it is not an identifier. As a side effect, implementations cannot use reserved words as implementation-defined attributes or pragma names. 

Extensions to Ada 95

8.c/2 {extensions to Ada 95} An identifier can use any letter defined by ISO-10646:2003, along with several other categories. This should ease programming in languages other than English.

[aada]

The compiler includes tables that directly define whether a character is an identifier_start or an identifier_extend [1].
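A minimal sketch of such a table, in Python, might look as follows. The flag values, the `build_table` helper and the 0x3000 limit are assumptions made for illustration, and Python's `unicodedata` follows a newer Unicode version than ISO-10646:2003, so a few code points may be classified differently from what Ada 2005 requires.

```python
import unicodedata

# General categories named by the syntax rules 3/2 and 3.1/2.
START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})
EXTEND_CATEGORIES = frozenset({"Mn", "Mc", "Nd", "Pc", "Cf"})

NOT_IDENTIFIER = 0
IS_START = 1
IS_EXTEND = 2


def build_table(limit: int = 0x3000) -> bytearray:
    """Precompute one flag per code point so lookups are a plain index."""
    table = bytearray(limit)
    for code_point in range(limit):
        category = unicodedata.category(chr(code_point))
        if category in START_CATEGORIES:
            table[code_point] = IS_START
        elif category in EXTEND_CATEGORIES:
            table[code_point] = IS_EXTEND
    return table


TABLE = build_table()


def char_class(ch: str) -> int:
    """Return the flag for ch (code points past the table are not handled here)."""
    return TABLE[ord(ch)]
```

With this table, `char_class("A")` is `IS_START`, while `char_class("_")` and `char_class("4")` are `IS_EXTEND`.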

To satisfy 4/2, the lexer first reads the identifier in full, then verifies that no two punctuation characters follow each other and that none appears at the end of the identifier. Each such error is reported. (Note: the check is run on the cleaned up identifier mentioned in the list below.)
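The check can be sketched as below. This is an illustration only: the function name and the shape of the returned error list are hypothetical, and Unicode categories "Cf" and "Pc" are assumed to correspond to other_format and punctuation_connector.

```python
import unicodedata


def connector_errors(identifier: str) -> list:
    """Return (index, message) pairs for every violation of rule 4/2.

    Indexes refer to the identifier after other_format characters
    have been removed.
    """
    cleaned = [ch for ch in identifier if unicodedata.category(ch) != "Cf"]
    errors = []
    for index, ch in enumerate(cleaned):
        if unicodedata.category(ch) != "Pc":
            continue
        if index + 1 == len(cleaned):
            errors.append((index, "trailing connector"))
        elif unicodedata.category(cleaned[index + 1]) == "Pc":
            errors.append((index, "consecutive connectors"))
    return errors
```

Every violation is collected rather than stopping at the first one, so that each erroneous instance can be reported.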

In order to satisfy 5.c/2, we want to:

  • Transform the identifier as per 5.1/2 and save the result as the source identifier,
  • Transform the source identifier as per 5.2/2 and save the result as the cleaned up identifier,
  • Determine whether the cleaned up identifier matches a reserved word,
  • If the cleaned up identifier matches a reserved word, verify that it is a valid reserved word by comparing each of its letters to the source identifier without changing the case of the source [2].
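The four steps can be sketched as follows. The reserved word set is a small illustrative subset, `classify` and its return values are hypothetical, and `str.upper()` is assumed to stand in for the upper case conversion of 5.2/2.

```python
import unicodedata

# Illustrative subset only; a real lexer would list every reserved word.
RESERVED_WORDS = frozenset({"ACCESS", "BEGIN", "END", "IF"})


def classify(token: str) -> tuple:
    """Return ("identifier" | "reserved" | "error", cleaned up identifier)."""
    # Step 1: remove other_format characters (5.1/2) -> source identifier.
    source = "".join(ch for ch in token if unicodedata.category(ch) != "Cf")
    # Step 2: convert to upper case (5.2/2) -> cleaned up identifier.
    cleaned = source.upper()
    # Step 3: does the cleaned up identifier match a reserved word?
    if cleaned not in RESERVED_WORDS:
        return ("identifier", cleaned)
    # Step 4: every source character must be the ASCII upper or lower
    # case form of the corresponding reserved word letter.
    if len(source) == len(cleaned) and all(
        s == r or s == chr(ord(r) | 0x20) for s, r in zip(source, cleaned)
    ):
        return ("reserved", cleaned)
    return ("error", cleaned)
```

With this sketch, `classify("If")` yields a reserved word, while the two odd spellings from 5.c/2, `"\u0131f"` (dotless i) and `"acce\u00df"` (sharp s), are both classified as errors.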

Note that an invalid identifier may represent a fatal error.

Important Note about 5.1/2 and 5.2/2

In Unicode, a character can be followed by one or more combining characters, that is, characters with a non-zero Canonical Combining Class (CCC). In most cases, a combining character is used as a diacritic. At times it changes the sound of the character it is attached to.

When removing format characters, we must check whether combining characters follow them and remove those too. Such a sequence is generally viewed as illegal, but if it does occur, removing only the format character would have the side effect of attaching the combining characters to the previous character, which is wrong.
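A minimal sketch of this removal, assuming that "Cf" identifies format characters and that `unicodedata.combining()` (which returns the canonical combining class) is the right test for what must be dropped along with them; the function name is hypothetical.

```python
import unicodedata


def strip_format(identifier: str) -> str:
    """Remove format characters together with any combining characters
    that immediately follow them."""
    result = []
    after_format = False
    for ch in identifier:
        if unicodedata.category(ch) == "Cf":
            after_format = True
            continue
        if after_format and unicodedata.combining(ch) != 0:
            # This mark was attached to a removed format character;
            # keeping it would attach it to the previous character.
            continue
        after_format = False
        result.append(ch)
    return "".join(result)
```

For example, `strip_format("a\u200d\u0301b")` gives `"ab"`, whereas the acute accent in `"a\u0301b"` is kept because it follows a letter and not a format character.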

Alexis Ada ignores the discussion in paragraphs 6 to 6.1/2. At this point, this implementation instead performs a transliteration, which means that an i with a dot is considered the same as an ı without the dot. This does not prevent anyone from writing words with the correct spelling, although the incorrect spelling will match as well, and it eliminates the problem of the same software not being compatible in two different countries. Note that the i character is the only one in Unicode that has such a major problem.

The character categories presented here can be simplified for the purpose of Ada 2005 (even though the documentation states all of those special characters...):

Category     Comments
letter       Any character that can be the first letter of an identifier: letter_uppercase, letter_lowercase, letter_titlecase, letter_modifier, letter_other, number_letter
other        mark_non_spacing, mark_spacing_combining, number_decimal
punctuation  punctuation_connector
format       other_format

The identifier definition becomes:

aada_identifier ::= letter {letter | other | punctuation | format}

The test for doubled punctuation is performed after removing all format characters from the identifier.
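The simplified grammar lends itself to a very small scanner, sketched here. The mapping from Unicode general categories to the four simplified categories is an assumption based on the table above, and `scan_aada_identifier` is a hypothetical name; the doubled punctuation test is deliberately left out since it runs as a separate pass.

```python
import unicodedata

# Unicode general category -> simplified category.
SIMPLIFIED = {
    "Lu": "letter", "Ll": "letter", "Lt": "letter",
    "Lm": "letter", "Lo": "letter", "Nl": "letter",
    "Mn": "other", "Mc": "other", "Nd": "other",
    "Pc": "punctuation",
    "Cf": "format",
}


def scan_aada_identifier(text: str, pos: int = 0) -> int:
    """Return the index just past the aada_identifier starting at pos,
    or pos itself when no identifier starts there."""
    if pos >= len(text):
        return pos
    if SIMPLIFIED.get(unicodedata.category(text[pos])) != "letter":
        return pos
    end = pos + 1
    # {letter | other | punctuation | format}
    while end < len(text) and unicodedata.category(text[end]) in SIMPLIFIED:
        end += 1
    return end
```

On the input `"Snobol_4 := 1"` the scanner stops at index 8, just after `Snobol_4`; on `"4x"` it returns 0 because a digit cannot start an identifier.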

[/aada]

  • 1. This is mentioned earlier and is done this way to make the compiler faster.
  • 2. This test requires us to compare the reserved word letters in upper and lower case against the source identifier. Since all reserved words are in ASCII (letters A to Z), the upper and lower case characters can be computed with a single machine instruction and thus this test is really fast and worth doing.