To manage the names assigned to Unicode character values.
§1. Parsing. The following is called only on excerpts from the source where it is a fairly safe bet that a Unicode character is referred to. For example, when the player types either of these:
"[unicode 321]odz Churchyard"
"[unicode Latin capital letter L with stroke]odz Churchyard"
...then the text after the word "unicode" is parsed by <s-unicode-character>.
    <s-unicode-character> ::=
        <cardinal-number-unlimited> |    ==> { -, Rvalues::from_Unicode(UnicodeLiterals::max(R[1]), W) }
        <unicode-character-name>         ==> { -, Rvalues::from_Unicode(R[1], W) }

    <unicode-character-name> internal {
        TEMPORARY_TEXT(N)
        WRITE_TO(N, "%W", W);
        for (int i=0; i<Str::len(N); i++)
            Str::put_at(N, i, Characters::toupper(Str::get_at(N, i)));
        int U = UnicodeLiterals::parse(N);
        DISCARD_TEXT(N)
        if (U >= 0) {
            if ((TargetVMs::is_16_bit(Task::vm())) && (U >= 0x10000)) {
                Issue PM_UnicodeOutOfRange1.2;
                U = 65;
            }
            ==> { UnicodeLiterals::max(U), - };
            return TRUE;
        }
        ==> { fail nonterminal };
    }
- This is Preform grammar, not regular C code.
§1.1. And here is the range check. Values above MAX_UNICODE_CODE_POINT are permitted, but need to be specified numerically.
    int UnicodeLiterals::max(int cc) {
        if (cc < 0) {
            Issue PM_UnicodeOutOfRange1.2;
            return 65;
        }
        return cc;
    }
§1.2. Issue PM_UnicodeOutOfRange1.2 =
    StandardProblems::sentence_problem(Task::syntax_tree(), _p_(PM_UnicodeOutOfRange),
        "this character value is beyond the range which the current story "
        "could handle",
        "which is from 0 to (hexadecimal) FFFF for stories compiled to the "
        "Z-machine, and otherwise 0 to 1FFFF.");
§2. Code points. Each distinct code point in the Unicode specification will correspond to one of these:
    define MAX_UNICODE_CODE_POINT 0x20000

    enum Cc_UNICODE_CAT from 1 /* Other, Control */
    enum Cf_UNICODE_CAT /* Other, Format */
    enum Cn_UNICODE_CAT /* Other, Not Assigned: no character actually has this */
    enum Co_UNICODE_CAT /* Other, Private Use */
    enum Cs_UNICODE_CAT /* Other, Surrogate */
    enum Ll_UNICODE_CAT /* Letter, Lowercase */
    enum Lm_UNICODE_CAT /* Letter, Modifier */
    enum Lo_UNICODE_CAT /* Letter, Other */
    enum Lt_UNICODE_CAT /* Letter, Titlecase */
    enum Lu_UNICODE_CAT /* Letter, Uppercase */
    enum Mc_UNICODE_CAT /* Mark, Spacing Combining */
    enum Me_UNICODE_CAT /* Mark, Enclosing */
    enum Mn_UNICODE_CAT /* Mark, Non-Spacing */
    enum Nd_UNICODE_CAT /* Number, Decimal Digit */
    enum Nl_UNICODE_CAT /* Number, Letter */
    enum No_UNICODE_CAT /* Number, Other */
    enum Pc_UNICODE_CAT /* Punctuation, Connector */
    enum Pd_UNICODE_CAT /* Punctuation, Dash */
    enum Pe_UNICODE_CAT /* Punctuation, Close */
    enum Pf_UNICODE_CAT /* Punctuation, Final quote */
    enum Pi_UNICODE_CAT /* Punctuation, Initial quote */
    enum Po_UNICODE_CAT /* Punctuation, Other */
    enum Ps_UNICODE_CAT /* Punctuation, Open */
    enum Sc_UNICODE_CAT /* Symbol, Currency */
    enum Sk_UNICODE_CAT /* Symbol, Modifier */
    enum Sm_UNICODE_CAT /* Symbol, Math */
    enum So_UNICODE_CAT /* Symbol, Other */
    enum Zl_UNICODE_CAT /* Separator, Line */
    enum Zp_UNICODE_CAT /* Separator, Paragraph */
    enum Zs_UNICODE_CAT /* Separator, Space */

    typedef struct unicode_point {
        int code_point; /* in the range 0 to MAX_UNICODE_CODE_POINT - 1 */
        struct text_stream *name; /* e.g. "RIGHT-FACING ARMENIAN ETERNITY SIGN" */
        int category; /* one of the *_UNICODE_CAT values above */
        int tolower; /* -1 if no mapping to lower case is available, or a code point */
        int toupper; /* -1 if no mapping to upper case is available, or a code point */
        int totitle; /* -1 if no mapping to title case is available, or a code point */
    } unicode_point;

    unicode_point UnicodeLiterals::new_code_point(int C) {
        unicode_point up;
        up.code_point = C;
        up.name = NULL;
        up.category = Cn_UNICODE_CAT;
        up.tolower = -1;
        up.toupper = -1;
        up.totitle = -1;
        return up;
    }
- The structure unicode_point is accessed in 2/spc, 2/rvl, 2/lvl, 5/dsh and here.
§3. Storage for these is managed on demand, in a flexibly-sized array:
    unicode_point *unicode_points = NULL; /* array indexed by code point */
    int unicode_points_extent = 0; /* current number of entries in that array */
    int max_known_unicode_point = 0;

    unicode_point *UnicodeLiterals::code_point(int U) {
        if ((U < 0) || (U >= MAX_UNICODE_CODE_POINT)) internal_error("Unicode point out of range");
        UnicodeLiterals::ensure_data();
        if (U >= unicode_points_extent) {
            int new_extent = unicode_points_extent;
            if (new_extent == 0) new_extent = 1;
            while (new_extent <= U) new_extent = 2*new_extent;
            unicode_point *new_unicode_points = (unicode_point *)
                (Memory::calloc(new_extent, sizeof(unicode_point), UNICODE_DATA_MREASON));
            for (int i=0; i<unicode_points_extent; i++)
                new_unicode_points[i] = unicode_points[i];
            for (int i=unicode_points_extent; i<new_extent; i++)
                new_unicode_points[i] = UnicodeLiterals::new_code_point(i);
            if (unicode_points_extent > 0)
                Memory::I7_array_free(unicode_points, UNICODE_DATA_MREASON,
                    unicode_points_extent, sizeof(unicode_point));
            unicode_points = new_unicode_points;
            unicode_points_extent = new_extent;
        }
        if (U > max_known_unicode_point) max_known_unicode_point = U;
        return &(unicode_points[U]);
    }
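The growth strategy in §3 can be tried outside Inform with a minimal sketch in plain C. Everything named here (point, point_at, MAX_POINT) is a hypothetical stand-in, and the standard calloc and free are assumed in place of Inform's Memory:: functions; only the doubling rule and the copy-then-initialise order are taken from the code above.

```c
#include <stdlib.h>

#define MAX_POINT 0x20000   /* assumption: mirrors MAX_UNICODE_CODE_POINT */

/* Hypothetical cut-down record: a code point and a category only. */
typedef struct point {
    int code_point;
    int category;   /* 0 stands in for "not assigned" in this sketch */
} point;

static point *points = NULL;   /* array indexed by code point */
static int points_extent = 0;  /* current number of entries in that array */

/* Return a pointer to entry U, enlarging the array first if necessary.
   Returns NULL if U is out of range or memory cannot be allocated. */
point *point_at(int U) {
    if ((U < 0) || (U >= MAX_POINT)) return NULL;
    if (U >= points_extent) {
        /* Double the extent until it is large enough to hold index U. */
        int new_extent = (points_extent == 0) ? 1 : points_extent;
        while (new_extent <= U) new_extent = 2*new_extent;
        point *new_points = calloc((size_t) new_extent, sizeof(point));
        if (new_points == NULL) return NULL;
        /* Copy the old entries, then give the new ones default values. */
        for (int i = 0; i < points_extent; i++) new_points[i] = points[i];
        for (int i = points_extent; i < new_extent; i++) {
            new_points[i].code_point = i;
            new_points[i].category = 0;
        }
        free(points);   /* free(NULL) is harmless on the first call */
        points = new_points;
        points_extent = new_extent;
    }
    return &points[U];
}
```

Because the extent doubles, asking for a scattering of high code points causes only a handful of reallocations, at the cost of the array moving in memory each time it grows.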
§4. The standard Inform distribution includes the current Unicode specification's main data file. Although parsing that file is relatively fast, we do it only on demand, because it's not small (about 2 MB of text) and is often not needed.
The UnicodeData_lookup dictionary really associates texts (names of characters) with non-negative integers (their code points), but our dictionary type only allows texts-to-pointers, so we wrap these integers up into unicode_lookup_value to which we can then have pointers.
(As noted by David Kinder in May 2023, it's unsafe to use this dictionary to associate texts with unicode_point * values, because the flexible-sized array holding those means that they will move around in memory. If we are lucky, the memory freed when the old version of the array is surpassed will be left intact and then the dictionary pointers to it will all work fine: if we are not lucky, for example if the memory environment is stressed because intest is running many simultaneous copies of Inform, then that space will be reused and the dictionary pointers will be invalid.)
    dictionary *UnicodeData_lookup = NULL;

    typedef struct unicode_lookup_value {
        int code_point;
    } unicode_lookup_value;

    void UnicodeLiterals::ensure_data(void) {
        if (UnicodeData_lookup == NULL) {
            UnicodeData_lookup = Dictionaries::new(65536, FALSE);
            filename *F = InstalledFiles::filename(UNICODE_DATA_IRES);
            TextFiles::read(F, FALSE, "can't open UnicodeData file", TRUE,
                &UnicodeLiterals::read_line, NULL, NULL);
            LOG("Read Unicode data to code point 0x%06x in %f\n", max_known_unicode_point, F);
        }
    }
- The structure unicode_lookup_value is private to this section.
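The hazard described in §4 can be illustrated with a small sketch. It is not Inform's dictionary: a hypothetical fixed-size name table (names, remember_name, find_name) stands in for it, and a hypothetical record_at accessor stands in for the flexible array, here grown with the standard realloc. The point is only that the table stores code points, which remain valid however often the array moves, rather than pointers into the array.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_RECORD 0x20000   /* assumption: mirrors MAX_UNICODE_CODE_POINT */
#define MAX_NAMES 16         /* hypothetical capacity of the name table */

typedef struct record { int code_point; int category; } record;

static record *records = NULL;
static int records_extent = 0;

/* Grow-on-demand accessor. The array may move whenever it grows, so a
   pointer returned earlier must not be kept across a later call. */
record *record_at(int U) {
    if ((U < 0) || (U >= MAX_RECORD)) return NULL;
    if (U >= records_extent) {
        int new_extent = U + 1;
        record *moved = realloc(records, (size_t) new_extent * sizeof(record));
        if (moved == NULL) return NULL;
        for (int i = records_extent; i < new_extent; i++) {
            moved[i].code_point = i;
            moved[i].category = 0;
        }
        records = moved;
        records_extent = new_extent;
    }
    return &records[U];
}

/* The name table holds code points, never record pointers. */
typedef struct name_entry { const char *name; int code_point; } name_entry;
static name_entry names[MAX_NAMES];
static int names_used = 0;

/* Returns 1 on success, 0 if the table is full. */
int remember_name(const char *name, int code_point) {
    if (names_used >= MAX_NAMES) return 0;
    names[names_used].name = name;
    names[names_used].code_point = code_point;
    names_used++;
    return 1;
}

/* Returns the code point registered under this name, or -1 if none. */
int find_name(const char *name) {
    for (int i = 0; i < names_used; i++)
        if (strcmp(names[i].name, name) == 0) return names[i].code_point;
    return -1;
}
```

A lookup then takes two steps: find_name yields a code point, and record_at turns that into a pointer only at the moment it is needed.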
§5. The format of this file is admirably stable. Lines look like so:
    0067;LATIN SMALL LETTER G;Ll;0;L;;;;;N;;;0047;;0047
    1C85;CYRILLIC SMALL LETTER THREE-LEGGED TE;Ll;0;L;;;;;N;;;0422;;0422
    1FAA1;SEWING NEEDLE;So;0;ON;;;;;N;;;;;
Each line corresponds to a code point. They're presented in the file in ascending order of these values, but we make no use of that fact. Each line contains fields divided by semicolons, and semicolon characters are illegal in any field.
    define CODE_VALUE_UNICODE_DATA_FIELD 0
    define NAME_UNICODE_DATA_FIELD 1
    define GENERAL_CATEGORY_UNICODE_DATA_FIELD 2
    define COMBINING_CLASSES_UNICODE_DATA_FIELD 3
    define BIDIRECTIONAL_CATEGORY_UNICODE_DATA_FIELD 4
    define DECOMPOSITION_MAPPING_UNICODE_DATA_FIELD 5
    define DECIMAL_DIGIT_VALUE_UNICODE_DATA_FIELD 6
    define DIGIT_VALUE_UNICODE_DATA_FIELD 7
    define NUMERIC_VALUE_UNICODE_DATA_FIELD 8
    define MIRRORED_UNICODE_DATA_FIELD 9
    define OLD_NAME_UNICODE_DATA_FIELD 10
    define ISO_10646_COMMENT_UNICODE_DATA_FIELD 11
    define UC_MAPPING_UNICODE_DATA_FIELD 12
    define LC_MAPPING_UNICODE_DATA_FIELD 13
    define TC_MAPPING_UNICODE_DATA_FIELD 14

    void UnicodeLiterals::read_line(text_stream *text, text_file_position *tfp, void *vm) {
        Str::trim_white_space(text);
        inchar32_t c = Str::get_first_char(text);
        if (c == 0) return;
        text_stream *name = Str::new();
        TEMPORARY_TEXT(category)
        int U[16], field_number = 0;
        for (int f=0; f<16; f++) U[f] = 0;
        Parse the fields5.1;
        if ((field_number > 1) &&
            (U[CODE_VALUE_UNICODE_DATA_FIELD] < MAX_UNICODE_CODE_POINT)) {
            int c = Cn_UNICODE_CAT;
            Determine the category code5.2;
            unicode_point *up = UnicodeLiterals::code_point(U[CODE_VALUE_UNICODE_DATA_FIELD]);
            Initialise the unicode point structure5.3;
            Add to the dictionary of character names5.4;
        }
        DISCARD_TEXT(category)
    }
§5.1. Parse the fields5.1 =
    for (int i=0; i<Str::len(text); i++) {
        inchar32_t c = Str::get_at(text, i);
        if (c == ';') field_number++;
        else switch (field_number) {
            case CODE_VALUE_UNICODE_DATA_FIELD:
            case UC_MAPPING_UNICODE_DATA_FIELD:
            case LC_MAPPING_UNICODE_DATA_FIELD:
            case TC_MAPPING_UNICODE_DATA_FIELD: {
                int H = -1;
                if ((c >= '0') && (c <= '9')) H = (int) (c - '0');
                if ((c >= 'A') && (c <= 'F')) H = (int) (c - 'A' + 10);
                if (H >= 0) U[field_number] = U[field_number]*16 + H;
                break;
            }
            case NAME_UNICODE_DATA_FIELD:
                PUT_TO(name, c); break;
            case GENERAL_CATEGORY_UNICODE_DATA_FIELD:
                PUT_TO(category, c); break;
        }
    }
- This code is used in §5.
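As a standalone illustration of the field-by-field scan, here is a minimal sketch in plain C which reads the code value, name and upper-case mapping from one line of the data file. The names parsed_line and parse_unicode_data_line are hypothetical, and ordinary C strings are assumed in place of Inform's text_stream; as in the scan above, an empty hexadecimal field is left as 0.

```c
#include <stddef.h>

#define CODE_FIELD 0   /* same positions as the *_UNICODE_DATA_FIELD constants */
#define NAME_FIELD 1
#define UC_FIELD 12

/* Hypothetical result type for this sketch. */
typedef struct parsed_line {
    int code_point;
    int uppercase;     /* 0 if the upper-case mapping field is empty */
    char name[128];
    int fields_seen;   /* number of semicolon-divided fields on the line */
} parsed_line;

parsed_line parse_unicode_data_line(const char *line) {
    parsed_line p = { 0, 0, "", 0 };
    int field = 0;
    size_t n = 0;   /* number of characters stored in p.name so far */
    for (size_t i = 0; line[i] != '\0'; i++) {
        char c = line[i];
        if (c == ';') { field++; continue; }
        if ((field == CODE_FIELD) || (field == UC_FIELD)) {
            /* Accumulate an upper-case hexadecimal number digit by digit. */
            int h = -1;
            if ((c >= '0') && (c <= '9')) h = c - '0';
            if ((c >= 'A') && (c <= 'F')) h = c - 'A' + 10;
            if (h >= 0) {
                if (field == CODE_FIELD) p.code_point = p.code_point*16 + h;
                else p.uppercase = p.uppercase*16 + h;
            }
        } else if (field == NAME_FIELD) {
            /* Copy the name, silently truncating if it is too long. */
            if (n + 1 < sizeof(p.name)) { p.name[n++] = c; p.name[n] = '\0'; }
        }
    }
    p.fields_seen = field + 1;
    return p;
}
```

For the first sample line in §5, this gives code point 0x67, upper-case mapping 0x47 and 15 fields.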
§5.2. Determine the category code5.2 =
    if (Str::eq(category, I"Cc")) c = Cc_UNICODE_CAT;
    else if (Str::eq(category, I"Cf")) c = Cf_UNICODE_CAT;
    else if (Str::eq(category, I"Cn")) c = Cn_UNICODE_CAT;
    else if (Str::eq(category, I"Co")) c = Co_UNICODE_CAT;
    else if (Str::eq(category, I"Cs")) c = Cs_UNICODE_CAT;
    else if (Str::eq(category, I"Ll")) c = Ll_UNICODE_CAT;
    else if (Str::eq(category, I"Lm")) c = Lm_UNICODE_CAT;
    else if (Str::eq(category, I"Lo")) c = Lo_UNICODE_CAT;
    else if (Str::eq(category, I"Lt")) c = Lt_UNICODE_CAT;
    else if (Str::eq(category, I"Lu")) c = Lu_UNICODE_CAT;
    else if (Str::eq(category, I"Mc")) c = Mc_UNICODE_CAT;
    else if (Str::eq(category, I"Me")) c = Me_UNICODE_CAT;
    else if (Str::eq(category, I"Mn")) c = Mn_UNICODE_CAT;
    else if (Str::eq(category, I"Nd")) c = Nd_UNICODE_CAT;
    else if (Str::eq(category, I"Nl")) c = Nl_UNICODE_CAT;
    else if (Str::eq(category, I"No")) c = No_UNICODE_CAT;
    else if (Str::eq(category, I"Pc")) c = Pc_UNICODE_CAT;
    else if (Str::eq(category, I"Pd")) c = Pd_UNICODE_CAT;
    else if (Str::eq(category, I"Pe")) c = Pe_UNICODE_CAT;
    else if (Str::eq(category, I"Pf")) c = Pf_UNICODE_CAT;
    else if (Str::eq(category, I"Pi")) c = Pi_UNICODE_CAT;
    else if (Str::eq(category, I"Po")) c = Po_UNICODE_CAT;
    else if (Str::eq(category, I"Ps")) c = Ps_UNICODE_CAT;
    else if (Str::eq(category, I"Sc")) c = Sc_UNICODE_CAT;
    else if (Str::eq(category, I"Sk")) c = Sk_UNICODE_CAT;
    else if (Str::eq(category, I"Sm")) c = Sm_UNICODE_CAT;
    else if (Str::eq(category, I"So")) c = So_UNICODE_CAT;
    else if (Str::eq(category, I"Zl")) c = Zl_UNICODE_CAT;
    else if (Str::eq(category, I"Zp")) c = Zp_UNICODE_CAT;
    else if (Str::eq(category, I"Zs")) c = Zs_UNICODE_CAT;
    else LOG("Unknown category '%S'\n", category);
- This code is used in §5.
§5.3. Initialise the unicode point structure5.3 =
    up->name = name;
    up->category = c;
    up->tolower = U[LC_MAPPING_UNICODE_DATA_FIELD];
    up->toupper = U[UC_MAPPING_UNICODE_DATA_FIELD];
    up->totitle = U[TC_MAPPING_UNICODE_DATA_FIELD];
- This code is used in §5.
§5.4. Control codes in Unicode, a residue of ASCII, are given no names by the standard. For example:
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
Indeed, at present every code with category Cc has the pseudo-name <control>. So we will mostly not allow these to be referred to by name in Inform. (In theory we could read the ISO-10646 comment as if it were a name: here, that would be "END OF TRANSMISSION", which isn't too bad. But "FORM FEED (FF)" and "CHARACTER TABULATION" are less persuasive, and anyway, we don't actually want users to insert control characters into Inform text literals.)
Add to the dictionary of character names5.4 =
    text_stream *index = NULL;
    if (c == Cc_UNICODE_CAT) {
        if (U[CODE_VALUE_UNICODE_DATA_FIELD] == 9) index = I"TAB";
        if (U[CODE_VALUE_UNICODE_DATA_FIELD] == 10) index = I"NEWLINE";
    } else {
        index = name;
    }
    if (index) {
        /* key the entry on index, not name: for control codes name is "<control>" */
        Dictionaries::create(UnicodeData_lookup, index);
        unicode_lookup_value *ulv = CREATE(unicode_lookup_value);
        ulv->code_point = U[CODE_VALUE_UNICODE_DATA_FIELD];
        Dictionaries::write_value(UnicodeData_lookup, index, (void *) ulv);
    }
- This code is used in §5.
§6. Using the Unicode data. The first lookup here is slow, since it requires us to parse the Unicode specification data file. But after that everything runs quite swiftly.
    int UnicodeLiterals::parse(text_stream *N) {
        if ((Str::get_at(N, 0) == 'U') && (Str::get_at(N, 1) == '+')) {
            int pos = 2;
            int t = 0;
            while (Str::get_at(N, pos)) {
                if (pos > 7) return -1;
                inchar32_t c = Str::get_at(N, pos), d = 0;
                if ((c >= '0') && (c <= '9')) d = c - '0';
                else if ((c >= 'a') && (c <= 'f')) d = c - 'a' + 10;
                else if ((c >= 'A') && (c <= 'F')) d = c - 'A' + 10;
                else return -1;
                if (d >= 16) return -1;
                t = t*16 + (int) d;
                pos++;
            }
            return t;
        } else if ((Str::len(N) == 3) &&
            (Str::get_at(N, 0) == '"') && (Str::get_at(N, 2) == '"')) {
            return (int) Str::get_at(N, 1);
        } else {
            UnicodeLiterals::ensure_data();
            if (Dictionaries::find(UnicodeData_lookup, N)) {
                unicode_lookup_value *ulv = Dictionaries::read_value(UnicodeData_lookup, N);
                return ulv->code_point;
            }
        }
        return -1;
    }
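The first branch of the function above, which reads the U+ notation, can be sketched on its own in plain C. The name parse_u_plus is hypothetical and ordinary C strings are assumed. One deliberate difference, which is an assumption of this sketch and not the behaviour of the code above: a bare "U+" with no digits is rejected here, whereas the function above would return 0 for it.

```c
/* Parse "U+" followed by one to six hexadecimal digits, in either case.
   Returns the code point, or -1 if the text is not in that form. */
int parse_u_plus(const char *s) {
    if ((s[0] != 'U') || (s[1] != '+')) return -1;
    int t = 0, digits = 0;
    for (int pos = 2; s[pos] != '\0'; pos++) {
        if (pos > 7) return -1;   /* more than six digits */
        char c = s[pos];
        int d;
        if ((c >= '0') && (c <= '9')) d = c - '0';
        else if ((c >= 'a') && (c <= 'f')) d = c - 'a' + 10;
        else if ((c >= 'A') && (c <= 'F')) d = c - 'A' + 10;
        else return -1;           /* not a hexadecimal digit */
        t = t*16 + d;
        digits++;
    }
    if (digits == 0) return -1;   /* assumption of this sketch: reject bare "U+" */
    return t;
}
```

So "U+0141" gives 321, the value used for the initial letter in the "odz Churchyard" example of §1.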
§7. We won't go too far down the Unicode rabbit-hole, but here are functions which may some day be useful:
    int UnicodeLiterals::tolower(int C) {
        unicode_point *up = UnicodeLiterals::code_point(C);
        int D = up->tolower;
        if (D >= 0) return D;
        return C;
    }

    int UnicodeLiterals::toupper(int C) {
        unicode_point *up = UnicodeLiterals::code_point(C);
        int D = up->toupper;
        if (D >= 0) return D;
        return C;
    }

    int UnicodeLiterals::totitle(int C) {
        unicode_point *up = UnicodeLiterals::code_point(C);
        int D = up->totitle;
        if (D >= 0) return D;
        return C;
    }

    int UnicodeLiterals::category(int C) {
        unicode_point *up = UnicodeLiterals::code_point(C);
        return up->category;
    }