An overview of the words module's role and abilities.
- §1. Prerequisites
- §2. Words, words, words
- §5. Meaning codes
- §6. Contiguous runs of words
- §7. Hypothetical words
- §8. Rock, paper, scissors
- §9. Traditional identifiers
- §10. Preform
§1. Prerequisites. The words module is a part of the Inform compiler toolset. It is presented as a literate program or "web". Before diving in:
- (a) It helps to have some experience of reading webs: see inweb for more.
- (b) The module is written in C, in fact ANSI C99, but this is disguised by the fact that it uses some extension syntaxes provided by the inweb literate programming tool, making it a dialect of C called InC. See inweb for full details, but essentially: it's C without predeclarations or header files, and where functions have names like Tags::add_by_name rather than add_by_name.
- (c) This module uses other modules drawn from the compiler (see structure), and also uses a module of utility functions called foundation. For more, see A Brief Guide to Foundation (in foundation).
§2. Words, words, words. Natural language text for use with Inform begins as text files written by human users, which are fed into the "lexer" (i.e., lexical analyser). The function TextFromFiles::feed_open_file_into_lexer reads such a file, converting it to a numbered stream of words. For indexing and error reporting purposes, we must not forget where these words came from: the function returns a source_file object representing the file as an origin, and the lexer assigns each word a source_location which is simply its SF together with a line number. Lexer::word_location returns this for a given word number.
Word numbers count upwards from 1 and are contiguous: for example —
    Mary   had   a    little   lamb   .    Everywhere   that   Mary   went   ,    the   lamb
    17     18    19   20       21     22   23           24     25     26     27   28    29
Repetitions are frequent: a typical source text of 50,000 words has an unquoted vocabulary of only about 2000 different words. Inform generates a vocabulary_entry object for each of these distinct words, and Lexer::word returns the VE for a given word number. In the above example,
    Lexer::word(17) == Lexer::word(25)    both are uses of "Mary"
    Lexer::word(21) == Lexer::word(29)    both are uses of "lamb"
    Lexer::word(20) != Lexer::word(24)    one is "little", the other "that"
The important point is that words at two positions can be tested for textual equality in an essentially instant process, by comparing vocabulary_entry * pointers. (See Numbered Words for just this sort of comparison.)
Nothing in life is free, and building the vocabulary efficiently is itself a challenge: see Vocabulary::hash_code_from_word. The key function is Vocabulary::entry_for_text, which takes a wide C string for a word and returns its vocabulary_entry. There are also issues with casing: in general we want "Lamb" and "lamb" to match, but not always.
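The interning idea can be sketched in plain C. The following is a toy illustration of why pointer comparison suffices, standing in for Vocabulary::entry_for_text and Vocabulary::hash_code_from_word; all names and sizes here (demo_entry_for_text, demo_hash, the bucket count) are inventions for this sketch, not the module's real API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A toy intern table: each distinct spelling gets exactly one entry,
   so textual equality reduces to pointer equality. */
typedef struct demo_ventry {
    char *text;
    struct demo_ventry *next;
} demo_ventry;

#define DEMO_BUCKETS 211
static demo_ventry *demo_table[DEMO_BUCKETS];

static unsigned demo_hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h*31 + (unsigned char) *s++;
    return h % DEMO_BUCKETS;
}

/* Return the unique entry for a word, creating it on first sight: so
   two calls with equal text return the same pointer. */
demo_ventry *demo_entry_for_text(const char *s) {
    unsigned h = demo_hash(s);
    for (demo_ventry *e = demo_table[h]; e; e = e->next)
        if (strcmp(e->text, s) == 0) return e;
    demo_ventry *e = malloc(sizeof *e);
    e->text = malloc(strlen(s) + 1);
    strcpy(e->text, s);
    e->next = demo_table[h];
    demo_table[h] = e;
    return e;
}
```

For simplicity this sketch is strictly case-sensitive, whereas the module applies the more nuanced casing rules just mentioned.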
§3. A few vocabulary_entry objects are hardwired into words, but only for punctuation. These have names like COMMA_V, which means just what you think it means. In our example,
    Lexer::word(27) == COMMA_V    the comma between "went" and "the"
See Vocabulary::create_punctuation, and also LoadPreform::create_punctuation, where further punctuation marks are created in order to parse Preform syntax — there are exotica such as COLONCOLONEQUALS_V there, for "::=".
§4. Lexical errors occur if words are too long, or quoted text continues without a close quote right to the end of a file, and so on. These are sent to the function Lexer::lexer_problem_handler, but can be intercepted by the user (see How To Include This Module).
§5. Meaning codes. Each vocabulary_entry has a bitmap of *_MC meaning codes assigned to it. (And Vocabulary::test_flags tests whether the Nth word has a given bit.) For example, ORDINAL_MC is applied to ordinal numbers like "sixth" or "15th" — see Vocabulary::an_ordinal_number, and NUMBER_MC to cardinals. The words module uses only a few bits in this map, but the linguistics module develops the idea much further: for example, any word which can be used in a particular semantic category — say, in a variable name — is marked with a bit representing that — say, VARIABLE_MC. The core module uses this for 15 or so of the most commonly used semantic categories in the Inform language. See What This Module Does (in linguistics) to pick up the story.
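The bitmap technique is simple enough to sketch. The names below echo the *_MC constants of the words module, but these particular bit values and the demo_entry type are invented for this example:

```c
#include <assert.h>

/* Illustrative meaning-code bits: the real constants live in the words
   and linguistics modules, with values assigned there. */
#define ORDINAL_MC  0x00000001
#define NUMBER_MC   0x00000002
#define VARIABLE_MC 0x00000004

typedef struct {
    const char *text;
    unsigned int flags; /* bitmap of *_MC meaning codes */
} demo_entry;

/* True if the entry has any of the given bits set, in the manner of
   Vocabulary::test_flags. */
int demo_test_flags(const demo_entry *e, unsigned int bits) {
    return (e->flags & bits) != 0;
}
```

Because a bitmap is used, one word can belong to several semantic categories at once, and membership tests cost a single AND instruction.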
§6. Contiguous runs of words. Natural languages are fundamentally unlike programming languages because a noun referring to, say, a variable is rarely a single lexical token. In C, a variable name like selected_lamb is one lexical unit. For us, though, "a little lamb" is three words.
However, multi-word snippets of text which have a joint meaning are almost always contiguous. The text "a little lamb" is word numbers 19, 20, 21. We deal with this using the wording type: it's essentially a pair of integers, (19, 21), and thus is very quick to form, compare, copy and pass as a parameter. Wordings provides an extensive API for this.
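Reduced to its essence, that pair-of-integers idea looks like this. The real wording type and its API live in the Wordings section of this module; demo_wording and the two functions here are illustrative stand-ins:

```c
#include <assert.h>

/* A wording as a pair of word numbers (w1, w2), inclusive at both
   ends: "a little lamb" above would be {19, 21}. */
typedef struct { int w1, w2; } demo_wording;

/* Number of words in the range. */
int demo_wording_length(demo_wording W) {
    return W.w2 - W.w1 + 1;
}

/* Two wordings denote the same snippet exactly when their endpoints
   agree: comparison is two integer tests, with no text involved. */
int demo_wordings_match(demo_wording A, demo_wording B) {
    return A.w1 == B.w1 && A.w2 == B.w2;
}
```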
§7. Hypothetical words. Sometimes Inform needs to make hypothetical passages of text. For example, suppose there is a kind called "paint colour" in the source text; Inform may then want to create a variable called "paint colour understood". But this text may not occur as such anywhere in the source.
If all the words needed are in the source somewhere, but not together, the user of the words module has two options:
- ● Create a word_assemblage object. This can represent any discontiguous list of word numbers: thus, the text "lamb went everywhere" could be a WA of numbers (21, 26, 23) in our example above.
- ● Use Lexer::splice_words to create duplicate snippets of text in the word stream, with new numbers. For example, call this on "lamb", then "went", then "everywhere"; the three new word numbers will then be contiguous, and can be represented by a wording:
    Mary   had   a    little   lamb   .    Everywhere   that   Mary   went   ,    the   lamb   lamb   went   everywhere
    17     18    19   20       21     22   23           24     25     26     27   28    29     30     31     32
If however we want to make "lamb tian with haricot beans", we need to use the Lexer's ability to read text internally as well as from external files. This is called a "feed": see Feeds. In particular, Feeds::feed_text will take the text I"tian with haricot beans" and treat it as fresh text for lexing, so that we now have
    ...   ,    the   lamb   lamb   went   everywhere   tian   with   haricot   beans   ...
          27   28    29     30     31     32           34     35     36        37
and now the word assemblage (21, 34, 35, 36, 37) would indeed represent "lamb tian with haricot beans". The return value of Feeds::feed_text is the wording (34, 37).
These new words do not originate in a file; their source_location therefore has a null source_file. Words which have been spliced, however, and thus duplicated in the word stream (like "lamb went everywhere", 30-32), retain their original origins.
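The contrast between the two representations can be sketched as follows. demo_assemblage and demo_to_wording are inventions for illustration, not the module's real API; note that this toy conversion simply fails on gappy input, whereas the module can always manufacture contiguity by splicing duplicate words into the stream, as described above.

```c
#include <assert.h>

/* A word_assemblage holds a possibly discontiguous list of word
   numbers; a wording must be a contiguous range. */
enum { MAX_WA_WORDS = 16 };
typedef struct {
    int no_words;
    int word_no[MAX_WA_WORDS];
} demo_assemblage;

/* Collapse an assemblage to the contiguous range (w1, w2) if its word
   numbers run consecutively; return 0 if there are gaps. */
int demo_to_wording(const demo_assemblage *wa, int *w1, int *w2) {
    if (wa->no_words == 0) return 0;
    for (int i = 1; i < wa->no_words; i++)
        if (wa->word_no[i] != wa->word_no[i-1] + 1) return 0;
    *w1 = wa->word_no[0];
    *w2 = wa->word_no[wa->no_words - 1];
    return 1;
}
```

So the spliced run "lamb went everywhere" (30-32) collapses to a wording, while the original scattered occurrences (21, 26, 23) do not.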
§8. Rock, paper, scissors. We now have three ways to represent text which may contain multiple words: as a text_stream, as a wording, as a word_assemblage. Each can be converted into the other two:
- ● Use Feeds::feed_text to turn a text_stream to a wording.
- ● Use WordAssemblages::from_wording to turn a wording to a word_assemblage.
- ● Use WordAssemblages::to_wording to turn a word_assemblage to a wording.
- ● Use Wordings::writer or use the formatted WRITE escape %W to write a wording into a text_stream.
- ● Use WordAssemblages::writer or use the formatted WRITE escape %A to write a word_assemblage into a text_stream.
As a general design goal, all Inform code uses wording to identify names of things: it is the fastest representation and the most economical in memory.
§9. Traditional identifiers. Imagine you're a compiler turning natural language into some sort of computer code, just hypothetically: then you probably want "a little lamb" to come out as a named location in memory, or object, or something like that: and this name must be a valid identifier for some other compiler or assembler — alphanumeric, not too long, and so on. Calling it "a little lamb" is not an option.
You could of course name it ref_15A40F, or some such, because the user will never see it anyway, so why have a helpful name? But that won't make debugging your output easy. The function Identifiers::compose therefore takes a wording and a unique ID number and makes something sensible: I15_a_little_lamb, say.
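A scheme in this spirit is easy to sketch: prefix a unique ID, then append the name with non-alphanumeric runs collapsed to single underscores. The function below is an invention for illustration; the exact composition rules of Identifiers::compose may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Compose a traditional identifier from a unique ID number and a
   multi-word name: e.g. (15, "a little lamb") -> "I15_a_little_lamb". */
void demo_compose_identifier(char *out, size_t n, int id, const char *name) {
    size_t len = (size_t) snprintf(out, n, "I%d_", id);
    for (const char *p = name; *p && len + 1 < n; p++) {
        if (isalnum((unsigned char) *p))
            out[len++] = (char) tolower((unsigned char) *p);
        else if (len > 0 && out[len-1] != '_')
            out[len++] = '_'; /* collapse runs of punctuation/spaces */
    }
    out[len] = 0;
}
```

The ID prefix guarantees uniqueness even when two different wordings sanitise to the same text, and keeping the lowercased words makes compiler output far easier to debug than opaque hexadecimal references.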
§10. Preform. Preform is a meta-language for writing a simple grammar: it's in some sense pre-Inform, because it defines the Inform language itself. See About Preform, where the story told in the present section continues...