1.4. Information Only

There is not really any unanimity about what regular expression language is. The unix tools sed and grep extend on Kleene's original grammar. Henry Spencer's regex library extended on this again, and was a foundation for Perl, but Perl once again went further. Philip Hazel's PCRE, despite the name Perl Compatible Regular Expressions, makes further extensions still, and so on.

Inform's regular expressions are modelled on those of Perl, as the best de facto standard we have, but a few omissions have been inevitable. Inform's regex matcher must occupy source code less than one hundredth the size of PCRE, and it has very little memory. Inform aims to behave exactly like Perl except as follows:

(i) Inform allows angle brackets as synonymous with square brackets, for reasons explained above. This means literal angle brackets have to be escaped as "\<" and "\>" in Inform regular expressions, which is unnecessary in Perl.

(ii) Inform only has single-line mode, not multiline mode: this removes need for the mode-switches "(?m)" and "(?s)" and the positional markers "\A" and "\Z". Multiline mode is idiosyncratic to Perl and is a messy compromise to do with holding long files of text as single strings, yet treating them as lists of lines at the same time: this would not be sensible for Inform. Similarly, because there is no ambiguity about how line breaks are represented in Inform strings (by a single "\n"), initial newline convention markers such as "(*ANYCRLF)" are unsupported.

(iii) The codes "\a", "\r", "\f", "\e", "\0" for alarm, carriage return, form feed, escape and the zero character are unsupported: none of these can occur in an Inform string.

(iv) Inform does not allow characters to be referred to by character code (whereas Perl allows "\036" for an octal character code, "\x7e" for a hexadecimal one, "\cD" for a control character). This is because we do not want the user to know whether text is internally stored as ZSCII or Unicode.

(v) Inform's character class "\p" (and its negation "\P") have no equivalent in Perl, and Inform's understanding of "\w" is different. Perl defines this as an upper or lower case English letter, underscore or digit, which is good for programming-language identifiers, but bad for natural language - for instance, "é" is not matched by "\w" in Perl, but unquestionably it appears in words. Inform therefore defines "\w" as the negation of "\s" union "\p".

(vi) Inform supports only single-digit grouping numbers "\1" to "\9", whereas Perl allows "\10", "\11", ...

(vii) POSIX named character ranges are not supported. These are only abbreviations in any case, and are not very useful. (Note that the POSIX range "[:punct:]", which is supposedly for punctuation, includes many things we do not want to think of that way - percentage signs, for instance - and so "\p" has a more natural-language-based definition.)

(viii) Character classes can be used inside ranges, so that "<\da-f>" is legal, but not as ends of contiguous runs, so that "<\d-X>" is not legal. (As reckless as this is, it is legal in Perl.)

(ix) For obvious reasons, escapes to Perl code using the notation "(?{...})" are unsupported, and so is the Perl iteration operator "\G".

(x) Perl's extended mode "(?x)", a lexical arrangement which allows expressions to be expanded out as little computer programs with comments, is unsupported. It would look awful syntax-coloured in the Inform interface and is not a style of coding to be encouraged.

Inform further does not support the Python extension of named subexpression groups, nor the Java extension of the possessive quantifier "++". There was only so much functionality we could squeeze in.

As verification of Inform's matching algorithm, we took the Perl 5 source code's notorious "re-test.txt" set of 961 test cases, removed the 316 using features unsupported by Inform (220 tested multiline mode, for instance), and ran the remaining 645 cases through Inform. It agrees with Perl on 643 of these: the two outstanding are -

(i) Perl is able to match "^(a\1?){4}$" against "aaaaaa" but Inform is not - Inform's backtracking is not as good when it comes to repetitions of groupings which are recursively defined. (Note that the optional "\1" match refers to the value of the bracketed expression which contains it, so that the interpretation is different on each repetition. Here to match we have to interpret "?" as 0, 0, 1, 0 repeats respectively as we work through the "{4}".)

(ii) Perl matches "((<a-c>)b*?\2)*" against "ababbbcbc" finding the match "ababb", whereas Inform finds the match "ababbbcbc". This is really a difference of opinion about whether the outer asterisk, which is greedy, should be allowed three matches rather than two if to do so requires the inner asterisk, which is not greedy, to eat more than it needs on one of those three matches.

Case (i) is a sacrifice to enable Inform's back-tracking to use less memory. Case (ii) simply seems unimportant.

An entire grammar for the whole mass of Inform would not be linguistically interesting: it contains many convenient wordings which are not really part of a grand pattern. Inform does, however, have a formal notion of a Sentence, a grammatical structure which we shall call S. It is almost true that conditions ("if the flowerpot is on the wall") have the same grammar as assertions ("The flowerpot is on the wall") and "now" phrases ("now the flowerpot is on the wall"). All three use the S grammar, so we could define an assertion as "S.", say that "if S", "while S", "when S" and so on are conditions, and say that "now S" defines the "now" declaration.

Grammatical sentences do not necessarily make sense, of course. Many perfectly grammatical assertions in fact give rise to problem messages:

The wicker basket is not in the kitchen. (Unhelpfully negative.)
The wicker basket has been in the kitchen. (Talks about a time which never existed.)
The wicker basket is full. (Full of what? Too vague.)
The wicker basket is the ginger cat. (Demonstrably false.)

Whereas the first three, at least, would be sensible as conditions. So saying that assertions are "just like" conditions is a little misleading: what they have in common is S, the underlying grammar they each use as a starting-point.

To define S, we break it up into subsidiary structures. The most important is the Description Phrase (DP), examples of which include "the red basket", "somewhere lighted" and "an empty open container". Clearly sentences include DPs, but they also include other ingredients. The general pattern used in Inform is very simple:

1. S = DP + VP
2. VP = Verb + DP

where VP is another structure, the Verb Phrase. For instance:

S (The horseman wears a riding helmet)
= DP (The horseman) + VP (wears a riding helmet)
VP (wears a riding helmet)
= Verb (wears) + DP (a riding helmet)

In that example, the Verb was the single word "wears". More generally, Inform allows a Verb to include adverbs and prepositions, to be negated, and to come in any of four tenses, so the following are all valid examples of Verb in our grammar:

wore
carries
is carried by
had not been inside

Although we are not going through the definition of Description Phrases in detail, it is worth noticing how "which" and "who" behave:

3a. DP = DP + which + VP
3b. DP = DP + who + VP

Thus "an open container which is in the Ballroom" can be broken down as:

DP (an open container) + which + VP (is in the Ballroom)

To understand compounds like "something in a container", we have to invent a new grammatical structure for "in a container" and similar: let's call this a Relative Phrase (RP).

4. DP = DP + RP

Thus "an open container in the Ballroom" is DP (an open container) + RP (in the Ballroom). Relative Phrases have two different forms:

5a. RP = Preposition + DP
5b. RP = Participle + DP

so that "in a container" is an example of 5a. An example of 5b would be

RP (worn by Mr Darcy) = Participle (worn by) + DP (Mr Darcy)

That is nearly it, but not quite: we must go back to the "almost" in the statement above that assertions and conditions "almost" have the same grammar S. The difference arises from a curious irregularity in English called subject-verb inversion (see the Oxford English Grammar at 3.22F), whereby assertions can be reversed but not conditions. For instance,

In the Garden is a sunflower.

This does not follow the pattern S = DP + VP, because "in the garden" is not a DP: indeed, it is not a noun at all. To make sense of this sentence, Inform reverses it to "A sunflower is in the Garden", which does indeed follow DP + VP. Hence the final rule:

6 (assertions only). S = RP + Verb + DP

So the condition "if in the garden is a sunflower..." fails because rule 6 does not apply to the grammar for conditions: while occasional poetic uses of subject-verb inversion do turn up in conditions ("If On A Winter's Night A Traveller", say), they are rare in ordinary English usage, and illegal in Inform. That completes the S grammar, so to recap:

1. S = DP + VP
2. VP = Verb + DP
3a. DP = DP + which + VP
3b. DP = DP + who + VP
4. DP = DP + RP
5a. RP = Preposition + DP
5b. RP = Participle + DP
6 (assertions only). S = RP + Verb + DP

Inform uses the term "relation" in a broader sense than mathematics. Properly speaking, the term "relation" in its mathematical sense only applies to the case where the domain for the left and right objects are the same: for simplicity's sake, let us talk only about the case where they are.

In mathematics, the properties most often looked for in a relation are that it should be:

(a) Reflexive: A <=> A for every A. This is not especially useful for Inform, and seldom appears in practical examples.

(b) Symmetric: A <=> B if and only if B <=> A. Generally, Inform relations are not symmetric, but there are two important cases which are:

Meeting relates people to each other.
Marriage relates one person to another.

These are automatically symmetric, so that to assert one way round is to assert the other as well.

(c) Transitive: A <=> B and B <=> C means that A <=> C as well. Again, Inform relations are not generally transitive. In many relations, there can be long chains of things, each perhaps related to the one in front and the one behind, so that there is some indirect sense in which the two ends of the chain are connected to each other: but they are not related as such. For instance, a journey across the map might pass through ten rooms, each adjacent to the last and next, but the two ends would not themselves be adjacent. The concept we need is the "transitive closure" of the original relation, defined as the smallest transitive relation including the original. If R is a relation between "things", then the following:

TC relates a thing (called A) to a thing (called B) when the number of steps via R from A to B is greater than 0.

is the transitive closure of R. In particular,

Accessibility relates a room (called A) to a room (called B) when the number of moves from B to A is greater than 0. The verb to be accessible from means the accessibility relation.

calculates the transitive closure of adjacency. Here, though, the way we normally understand "accessible from" suggests that it would be better to write:

Accessibility relates a room (called A) to a room (called B) when the number of moves from B to A is at least 0.

which is reflexive as well as transitive. The usefulness of Inform's "next step via R from A to B" construction, in a wide variety of settings, reflects the importance of transitivity as an idea.

A relation which has all three properties of being reflexive, symmetric and transitive is called an "equivalence relation". (If all the map connections are two-way, then the accessibility relation above is symmetric and therefore a full equivalence relation: but if not, it may not be.) Inform has a special construction for making equivalence relations:

Nationality relates people to each other in groups.

This language - "in groups" - relies on the standard theorem that every equivalence relation on a set naturally defines a partition of that set, and vice versa. The "groups" referred to are what are normally called "equivalence classes". (Inform does little with these equivalence classes: it might be interesting to do so, in effect forming quotient kinds.)

One way to look at a relation is to regard it as a directed graph: that is, a collection of things ("vertices") with arrows drawn between them ("edges"). We write our items A, B, C, ... on a piece of paper: then, if A relates to B, we draw an arrow pointing from A to B, and so on. If we made this drawing for the adjacency relation, we would more or less have reconstructed the map (or at least a simplified one which does not care about precise directions, like the famous diagram of the London Underground). But the drawing can be made for any relation. If we define:

Suspecting relates various people to one person.

then, in the corresponding graph, each "vertex" will have at most one arrow leading away from it - though there could be many (or none) leading towards. Conversely, a one-to-various relation produces a graph where each vertex has at most one arrow coming in. A one-to-one relation means that the picture consists of some vertices on their own, with no arrows, a few perhaps with looped arrows leading from and to themselves, and then a collection of pairs joined by arrows. On the other hand, a various-to-various relation is just a free-for-all, with no restrictions on the arrows. The relations:

Meeting relates people to each other.
Marriage relates one person to another.

always have the property of working both ways round, and these are easiest to visualise by forgetting the direction of the arrows, so that they just become lines joining the vertices.

Inform uses a different algorithm for finding routes ("the next step via R from A to B") in each of these cases, and internally it stores relations in different formats in the different cases, because it makes a big difference to the efficiency of Inform to minimise the storage required for a relation and the time taken to explore it.

All the cases are benign except for "various to various" - the most useful - and for its closely related symmetrical version, "relates... to each other". Inform cannot afford to assume that the relation will be "sparse" (that is: that no vertex will have more than a certain number of arrows, or that the total number of arrows will be small), because it has no idea how the arrows will come and go during play. It therefore uses 1 bit of storage for each pair of objects. This sounds harmless, but if there are 200 rooms, there are 40,000 pairs of rooms, which means a 5000-byte allocation of storage (plus a handful of bytes as overhead). Gratuitous various-to-various relations are therefore not a good idea.

There is a standard algorithm for calculating shortest paths through a directed graph, but Inform does not always use it, because there is not always memory to store the required matrix of partial results. Inform's slow method, likely to be used on the Z-machine, requires a storage overhead which is equal to the number of vertices, not the square of that number, but the worst-case running time can be bad: if there are N vertices, and the diameter of graph (the longest distance between vertices) is D, then the running time is proportional to D times N. The worst case in finding routes from A to B is when almost every vertex can reach B, some across long trails, but A cannot. In the case of finding routes across the game's map, this must be multiplied further by the number of possible directions - usually 16.

This does not sound too awful, but if one is trying to find (say) "the most distant room from A", that means a further loop and now the running time will be D times N squared. Extension writers will need to be careful of this kind of thing: it is easy to write highly cool prototypes which work terribly slowly on larger, more realistic maps.

	Start of Chapter 1: How to Use The Recipe Book
	Back to §1.3. Disenchantment Bay
	Onward to Chapter 2: Adaptive Prose: §2.1. Varying What Is Written