Nonterminals

The angle-bracketed terms appearing in Preform grammar.

§1. How nonterminals are stored. Each different nonterminal defined in the Syntax.preform code read in, such as <any-integer>, is going to correspond to a global variable in the program reading it in, such as any_integer_NTM. On the face of it, this is impossible. How can what happens at run-time affect what variables are named at compile time?

The answer is that the inweb literate programming tool looks through the complete source code, sees the Preform nonterminals described in it, and inserts declarations of the corresponding variables into the "tangled" form of the source code sent to a C compiler to make the actual program. (This is a feature of inweb available only for programs written in InC.)

In particular, the tangler of inweb replaces the [[nonterminals]] below with invocations of the REGISTER_NONTERMINAL and INTERNAL_NONTERMINAL macros. For example, it inserts the C line:

    INTERNAL_NONTERMINAL(U"<any-integer>", any_integer_NTM, 1, 1);

since this is an "internal" nonterminal; and the macro will then expand to code which sets up any_integer_NTM — see below.

void Nonterminals::register(void) {
    /* The following is not valid C, but causes Inweb to insert the macro invocations described above */
    [[nonterminals]];
    /* Back to regular C now */
    nonterminal *nt;
    LOOP_OVER(nt, nonterminal)
        if ((nt->marked_internal) && (nt->internal_definition == NULL))
            internal_error("internal nonterminal has no definition function");
}

§2. For any regular nonterminal, then, inweb tangles out code which uses the REGISTER_NONTERMINAL macro, and also tangles a compositor function for it, whose name is the name of the nonterminal's variable with a C appended. For example, suppose inweb sees the following in the web it is tangling:

    <competitor> ::=
        the pacemaker |              ==> { 1, - }
        <ordinal-number> runner |    ==> { pass 1 }
        runner no <cardinal-number>  ==> { pass 1 }

It then tangles this macro usage into Nonterminals::register above:

    REGISTER_NONTERMINAL(U"<competitor>", competitor_NTM);

And it also tangles matching declarations for:

(a) the global variable competitor_NTM, of type nonterminal *;
(b) the "compositor function" competitor_NTMC, which is a function to deal with what happens when a successful match is made against the grammar — this incorporates the material which inweb finds to the right of the ==> markers in the Preform definition.

But if we left things at that, we would find ourselves at run-time with a null variable, a function not called from anywhere, and an instance somewhere in memory of a nonterminal read in from Preform syntax and called "<competitor>", but which has no apparent connection to either the function or the variable. We clearly need to join these together.

And so the REGISTER_NONTERMINAL macro expands to code which initialises the variable to the nonterminal having its name, and then connects that to the compositor function:

define REGISTER_NONTERMINAL(quotedname, identifier)
    identifier = Nonterminals::find(Vocabulary::entry_for_text(quotedname));
    identifier->compositor_fn = identifier##C;

§3. For example, this might expand to:

    competitor_NTM = Nonterminals::find(Vocabulary::entry_for_text(U"<competitor>"));
    competitor_NTM->compositor_fn = competitor_NTMC;

Note that it is absolutely necessary that Nonterminals::find does return a nonterminal. But we can be sure that it does, since the function creates a nonterminal object of that name even if one does not already exist.
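
As a rough illustration of the mechanism, here is a minimal sketch in plain C of how such a registration macro can join a variable, a function and a named object together. It is not the real implementation: the cut-down nonterminal structure, the linked list, find_nonterminal and register_all are all hypothetical stand-ins, and names are keyed by C strings rather than by vocabulary entries.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in: the real structure has many more fields, and is
   keyed by a vocabulary_entry pointer rather than by a C string. */
typedef struct nonterminal {
    const char *name;
    int (*compositor_fn)(int *r);
    struct nonterminal *next;
} nonterminal;

static nonterminal *all_nonterminals = NULL;

/* Find-or-create (hypothetical helper): the result is never NULL. */
nonterminal *find_nonterminal(const char *name) {
    for (nonterminal *nt = all_nonterminals; nt; nt = nt->next)
        if (strcmp(nt->name, name) == 0) return nt;
    nonterminal *nt = calloc(1, sizeof(nonterminal));
    assert(nt != NULL);
    nt->name = name;
    nt->next = all_nonterminals;
    all_nonterminals = nt;
    return nt;
}

/* identifier##C pastes a C onto the variable's name, so that
   competitor_NTM becomes competitor_NTMC. */
#define REGISTER_NONTERMINAL(quotedname, identifier) do { \
    identifier = find_nonterminal(quotedname); \
    identifier->compositor_fn = identifier##C; \
} while (0)

/* The kind of declarations the tangler would supply for <competitor>. */
nonterminal *competitor_NTM = NULL;
int competitor_NTMC(int *r) { *r = 1; return 1; }

void register_all(void) {
    REGISTER_NONTERMINAL("<competitor>", competitor_NTM);
}
```

After register_all has run, competitor_NTM is no longer null, and the object named "<competitor>" points to competitor_NTMC. The do/while wrapper is only a convenience of this sketch, so that the macro behaves as a single statement.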

§4. The position for internal nonterminals (i.e. those defined by a function written by the programmer, not by Preform grammar lines) is similar:

(a) again there is a global variable, say any_integer_NTM, of type nonterminal *;
(b) but now there is no compositor, and instead there is a function any_integer_NTMR which actually performs the parse directly.

The INTERNAL_NONTERMINAL macro similarly initialises and connects these declarations. min and max are conveniences for speedy parsing, and supply the minimum and maximum number of words that the nonterminal can match; these are needed because the Preform optimiser can't see inside any_integer_NTMR to calculate those bounds for itself. max can be infinity, in which case we use the constant INFINITE_WORD_COUNT for it.

define INTERNAL_NONTERMINAL(quotedname, identifier, min, max)
    identifier = Nonterminals::find(Vocabulary::entry_for_text(quotedname));
    identifier->opt.nt_extremes = LengthExtremes::new(min, max);
    identifier->internal_definition = identifier##R;
    identifier->marked_internal = TRUE;
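
To make the idea of an internal definition concrete, here is a hedged sketch of what a function in the role of any_integer_NTMR might look like. Only the function signature comes from the declaration of internal_definition; the simplified wording type (an array of C strings) and the parsing logic are assumptions for illustration, and the real function may behave quite differently. Overflow is ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define TRUE 1
#define FALSE 0

/* Hypothetical simplified wording: an array of n words. */
typedef struct wording { const char **words; int n; } wording;

/* Plays the role of any_integer_NTMR: succeeds on exactly one word made
   up of decimal digits, which is why min and max would both be 1. */
int any_integer_NTMR(wording W, int *result, void **result_p) {
    assert(result != NULL);
    (void) result_p;  /* no pointer result in this sketch */
    if (W.n != 1) return FALSE;
    const char *p = W.words[0];
    if (*p == '\0') return FALSE;
    int value = 0;
    for (; *p; p++) {
        if (!isdigit((unsigned char) *p)) return FALSE;
        value = 10*value + (*p - '0');
    }
    *result = value;
    return TRUE;
}
```

Since nothing outside this function can tell that it only ever accepts one word, the bounds have to be passed to INTERNAL_NONTERMINAL by hand.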

§5. So, then, the following rather lengthy class declaration shows what goes into a nonterminal. Note that nonterminals are uniquely identifiable by their names: there can be only one called, say, <any-integer>. This is why its textual name is referred to as an "ID".

typedef struct nonterminal {
    struct vocabulary_entry *nonterminal_id;  /* e.g. "<any-integer>" */

    /* For internal nonterminals */
    int marked_internal;  /* has, or will be given, an internal definition... */
    int (*internal_definition)(wording W, int *result, void **result_p);  /* ...this one */
    int voracious;  /* if true, scans whole rest of word range */

    /* For regular nonterminals */
    struct production_list *first_pl;  /* if not internal, this defines it */
    int (*compositor_fn)(int *r, void **rp, int *i_s, void **i_ps, wording *i_W, wording W);
    int multiplicitous;  /* if true, matches are alternative syntax tree readings */
    int number_words_by_production;  /* this parses names for numbers, like "huit" or "zwei" */
    unsigned int flag_words_in_production;  /* all words in the production should get these flags */

    /* Storage for most recent correct match */
    struct wording range_result[MAX_RANGES_PER_PRODUCTION];  /* storage for word ranges matched */

    struct nonterminal_optimisation_data opt;  /* see The Optimiser */
    struct nonterminal_instrumentation_data ins;  /* see Instrumentation */

    CLASS_DEFINITION
} nonterminal;

The structure nonterminal is accessed in 4/lp, 4/to, 4/le, 4/ni, 4/prf, 4/ins, 4/pu and here.

§6. A few notes on this are in order:

(a) As noted above, every nonterminal is either "internal" or "regular". If internal, it is defined by a function; if regular, it is defined by lines of grammar (called "productions") and a compositor function.
(b) A few internal nonterminals are "voracious". These are given the entire word range for their productions to eat, and encouraged to eat as much as they like, returning a word number to show how far they got. While this effect could be duplicated with non-voracious nonterminals, that would be quite a bit slower, since it would have to test every possible word range.
(c) A few regular nonterminals are "multiplicitous". These composite their results in a way special to the Inform compiler's syntax tree, by stacking them up as alternative possible readings of the same text. Ordinarily, the result of parsing text against a nonterminal is that the first grammar line matching that text determines the meaning, but for a multiplicitous nonterminal, every line matching the text determines one of perhaps many possible meanings.
(d) For numbering and flagging on regular NTs, see Nonterminals::make_numbering below.
(e) The optimisation data helps the parser to reject non-matching text quickly. For example, if the optimiser can determine that <competitor> only ever matches texts of between 3 and 7 words in length, it can quickly reject any run of words outside that range. (However: note that a maximum of 0 means that the maximum and minimum word counts are disregarded.) The other fields are harder to explain — see The Optimiser.
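
The quick rejection described in note (e) can be sketched as follows. This is an illustration only: the length_extremes structure, the function length_in_range and the value chosen here for INFINITE_WORD_COUNT are hypothetical, and the real test lives in The Optimiser.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins for the optimiser's own definitions. */
#define INFINITE_WORD_COUNT INT_MAX
typedef struct length_extremes { int min_words, max_words; } length_extremes;

/* Returns 1 if a text of n words is worth parsing at all, and 0 if it
   can be rejected on length alone. */
int length_in_range(length_extremes E, int n) {
    assert(n >= 0);
    if (E.max_words == 0) return 1;  /* a maximum of 0 means: disregard the counts */
    return (n >= E.min_words) && (n <= E.max_words);
}
```

With extremes of 3 and 7, a two-word or eight-word text is turned away without any grammar being consulted, which is the whole point of recording the bounds.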

§7. As noted above, nonterminals are identified by their name-words. The following lookup is a linear search over all nonterminals, which is not especially fast but doesn't need to be: it's used only when Preform grammar is parsed, not when Inform text is parsed.

nonterminal *Nonterminals::detect(vocabulary_entry *name_word) {
    nonterminal *nt;
    LOOP_OVER(nt, nonterminal)
        if (name_word == nt->nonterminal_id)
            return nt;
    return NULL;
}

§8. And the following always returns one, creating it if necessary:

nonterminal *Nonterminals::find(vocabulary_entry *name_word) {
    nonterminal *nt = Nonterminals::detect(name_word);
    if (nt == NULL) {
        nt = CREATE(nonterminal);
        nt->nonterminal_id = name_word;

        nt->marked_internal = FALSE;  /* by default, nonterminals are regular */
        nt->internal_definition = NULL;
        nt->voracious = FALSE;

        nt->first_pl = NULL;
        nt->compositor_fn = NULL;
        nt->multiplicitous = FALSE;
        nt->number_words_by_production = FALSE;  /* i.e., don't */
        nt->flag_words_in_production = 0;  /* i.e., apply no flags */

        for (int i=0; i<MAX_RANGES_PER_PRODUCTION; i++)
            nt->range_result[i] = EMPTY_WORDING;

        Optimiser::initialise_nonterminal_data(&(nt->opt));
        Instrumentation::initialise_nonterminal_data(&(nt->ins));
    }
    return nt;
}

§9. Word ranges in a nonterminal. We now need to define the macros GET_RW and PUT_RW, which get and set the results of a successful match against a nonterminal (see About Preform for more on this).

We do so by giving each nonterminal a small array of wordings, which are lightweight structures incurring little time or space overhead. The fact that they are attached to the NT itself, rather than, say, being placed on a parsing stack of some kind, makes them faster to access, but is possible only because the parser never backtracks. Similarly, result word ranges are overwritten if a nonterminal calls itself directly or indirectly: that is, the inner one's results are wiped out by the outer one. But this is no problem, since we never extract word ranges from grammar which is recursive.

Word range 0 is reserved in case we ever need it for the entire text matched by the nonterminal, though at present we don't need that.

define MAX_RANGES_PER_PRODUCTION 5  /* in fact, one less than this, since range 0 is reserved */
define GET_RW(nt, N) (nt->range_result[N])
define PUT_RW(nt, N, W) { nt->range_result[N] = W; }
define INHERIT_RANGES(from, to) {
    for (int i=1; i<MAX_RANGES_PER_PRODUCTION; i++)  /* not copying range 0 */
        to->range_result[i] = from->range_result[i];
}
define CLEAR_RW(from) {
    for (int i=0; i<MAX_RANGES_PER_PRODUCTION; i++)  /* including range 0 */
        from->range_result[i] = EMPTY_WORDING;
}
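
For readers who want to try these macros outside a web, here is a small sketch of how they might look in plain C, with line continuations written out. The two-field wording type and the helper record_range are hypothetical simplifications, and the nonterminal structure is cut down to the one array that matters here.

```c
#include <assert.h>

/* Hypothetical simplification: a wording is just a pair of word numbers. */
typedef struct wording { int word_from, word_to; } wording;

#define MAX_RANGES_PER_PRODUCTION 5

typedef struct nonterminal {
    wording range_result[MAX_RANGES_PER_PRODUCTION];
} nonterminal;

#define GET_RW(nt, N) ((nt)->range_result[N])
#define PUT_RW(nt, N, W) { (nt)->range_result[N] = (W); }
#define INHERIT_RANGES(from, to) { \
    for (int i=1; i<MAX_RANGES_PER_PRODUCTION; i++) /* not copying range 0 */ \
        (to)->range_result[i] = (from)->range_result[i]; \
}

/* Record that words first..last matched as range N of nt. */
void record_range(nonterminal *nt, int N, int first, int last) {
    assert((N >= 1) && (N < MAX_RANGES_PER_PRODUCTION));
    wording W = { first, last };
    PUT_RW(nt, N, W);
}
```

Because each nonterminal owns exactly one such array, a second match against the same nonterminal simply overwrites what the first one recorded, which is the behaviour the text above relies on.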

§10. Other results. The parser records the result of the most recently matched nonterminal in the following global variables, which, unlike word ranges, are not attached to any single NT.

inweb translates the notation <<r>> and <<rp>> to these variable names:

int most_recent_result = 0;  /* the variable which inweb substitutes for <<r>> */
void *most_recent_result_p = NULL;  /* the variable which inweb substitutes for <<rp>> */

§11. Flagging and numbering. The following mechanism arranges for words used in the grammar for a NT to be given properties just because of that: either flags or numerical values. For example, if we wanted the numbers from Stoppard's play "Dogg's Hamlet", we might have:

    <dogg-numbers> ::=
        sun | dock | trog | slack | pan

And if <dogg-numbers> were made a "numbering" NT, the effect would be that these five words would pick up the numerical values 1, 2, 3, 4, 5 respectively, because they occur in production numbers 1, 2, 3, 4, 5 for the NT.

void Nonterminals::make_numbering(nonterminal *nt) {
    nt->number_words_by_production = TRUE;
}

§12. Similarly, we could flag this NT with NUMBER_MC, and then the five words sun, dock, trog, slack, pan would all pick up the NUMBER_MC flag automatically.

void Nonterminals::flag_words_with(nonterminal *nt, unsigned int flags) {
    nt->flag_words_in_production = flags;
}

§13. This is all done by the following function, which is called when a word ve is read as part of a production with match number pc for the nonterminal nt:

void Nonterminals::note_word(vocabulary_entry *ve, nonterminal *nt, int pc) {
    ve->flags |= (nt->flag_words_in_production);
    if (nt->number_words_by_production) ve->literal_number_value = pc;
}
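
Putting §11 to §13 together, the following sketch runs the Dogg example through a copy of the noting function. The cut-down structures, the bit value given to NUMBER_MC, the dogg_words array and note_dogg_numbers are all assumptions made for the sake of a self-contained example, and production numbers are taken to start at 1, as in the text above.

```c
#include <assert.h>
#include <stddef.h>

#define NUMBER_MC 0x00000001u  /* hypothetical bit value */

/* Cut-down stand-ins for the real structures. */
typedef struct vocabulary_entry {
    const char *text;
    unsigned int flags;
    int literal_number_value;
} vocabulary_entry;

typedef struct nonterminal {
    int number_words_by_production;
    unsigned int flag_words_in_production;
} nonterminal;

/* Same logic as Nonterminals::note_word, on the simplified types. */
void note_word(vocabulary_entry *ve, nonterminal *nt, int pc) {
    assert((ve != NULL) && (nt != NULL));
    ve->flags |= nt->flag_words_in_production;
    if (nt->number_words_by_production) ve->literal_number_value = pc;
}

/* Five one-word productions, in the order they appear in the grammar. */
vocabulary_entry dogg_words[5] = {
    { "sun", 0, 0 }, { "dock", 0, 0 }, { "trog", 0, 0 },
    { "slack", 0, 0 }, { "pan", 0, 0 }
};

/* Hypothetical driver: note each word with its production number. */
void note_dogg_numbers(nonterminal *nt) {
    for (int pc = 1; pc <= 5; pc++) note_word(&dogg_words[pc-1], nt, pc);
}
```

If the nonterminal is both numbering and flagged with NUMBER_MC, then after note_dogg_numbers the word "trog" carries the value 3 and every one of the five words has the NUMBER_MC bit set.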