Pattern Matching

To provide a limited regular-expression parser.

§1. Character types.We will define white space as spaces and tabs only, since the various kinds of line terminator will always be stripped out before this is applied.

int Regexp::white_space(inchar32_t c) {
    if ((c == ' ') || (c == '\t')) return TRUE;
    return FALSE;
}

§2. The presence of : here is perhaps a bit surprising, since it's illegal in C and has other meanings in other languages, but it's legal in C-for-Inform identifiers.

int Regexp::identifier_char(inchar32_t c) {
    if ((c == '_') || (c == ':') ||
        ((c >= 'A') && (c <= 'Z')) ||
        ((c >= 'a') && (c <= 'z')) ||
        ((c >= '0') && (c <= '9'))) return TRUE;
    return FALSE;
}

§3. Simple parsing.The following finds the earliest minimal-length substring of a string, delimited by two pairs of characters: for example, << and >>. This could easily be done as a regular expression using Regexp::match, but the routine here is much quicker.

int Regexp::find_expansion(text_stream *text, inchar32_t on1, inchar32_t on2,
    inchar32_t off1, inchar32_t off2, int *len) {
    for (int i = 0; i < Str::len(text); i++)
        if ((Str::get_at(text, i) == on1) && (Str::get_at(text, i+1) == on2)) {
            for (int j=i+2; j < Str::len(text); j++)
                if ((Str::get_at(text, j) == off1) && (Str::get_at(text, j+1) == off2)) {
                    *len = j+2-i;
                    return i;
                }
        }
    return -1;
}

§4. Still more simply:

int Regexp::find_open_brace(text_stream *text) {
    for (int i=0; i < Str::len(text); i++)
        if (Str::get_at(text, i) == '{')
            return i;
    return -1;
}

§5. Note that we count the empty string as being white space. Again, this is equivalent to Regexp::match(p, " *"), but much faster.

int Regexp::string_is_white_space(text_stream *text) {
    LOOP_THROUGH_TEXT(P, text)
        if (Regexp::white_space(Str::get(P)) == FALSE)
            return FALSE;
    return TRUE;
}

§6. A Worse PCRE.I originally wanted to call the function in this section a_better_sscanf, then thought perhaps a_worse_PCRE would be more true. (PCRE is Philip Hazel's superb C implementation of regular-expression parsing, but I didn't need its full strength, and I didn't want to complicate the build process by linking to it.)

This is a very minimal regular expression parser, simply for convenience of parsing short texts against particularly simple patterns. Here is an example of use:

    match_results mr = Regexp::create_mr();
    if (Regexp::match(&mr, text, U"fish (%d+) ([a-zA-Z_][a-zA-Z0-9_]*) *") {
        PRINT("Fish number: %S\n", mr.exp[0]);
        PRINT("Fish name: %S\n", mr.exp[1]);
    }
    Regexp::dispose_of(&mr);

Note the L at the front of the regex itself: this is a wide string.

This tries to match the given text to see if it consists of the word fish, then any amount of whitespace, then a string of digits which are copied into mr->exp[0], then whitespace again, and then an alphanumeric identifier to be copied into mr->exp[1], and finally optional whitespace. (If no match is made, the contents of the found strings are undefined.)

Note that this differs from, for example, Perl's regular expression matcher in several ways. The regular expression syntax is slightly different and in general simpler. A match has to be made from start to end, so it's as if there were an implicit ^ at the front and $ at the back (in Perl terms). The full match text is therefore always the entire text put in, so there's no need to record this. In Perl, matching against m/(.*) plus (.*)/ would set three subexpressions: number 0 would be the whole text matched, number 1 would be the first bracketed part, number 2 the second. Here, though, the corresponding regex would be written L"(%c*) plus (%c*)", and the bracketed terms would be subexpressions 0 and 1.

define MAX_BRACKETED_SUBEXPRESSIONS 5  this many bracketed subexpressions can be extracted

§7. The internal state of the matcher is stored as follows:

typedef struct match_position {
    int tpos;  position within text being matched
    int ppos;  position within pattern
    int bc;  count of bracketed subexpressions so far begun
    int bl;  bracket indentation level
    int bracket_nesting[MAX_BRACKETED_SUBEXPRESSIONS];
     which subexpression numbers (0, 1, 2, 3) correspond to which nesting
    int brackets_start[MAX_BRACKETED_SUBEXPRESSIONS], brackets_end[MAX_BRACKETED_SUBEXPRESSIONS];
     positions in text being matched, inclusive
} match_position;

The structure match_position is private to this section.

§8. It may appear that match texts are limited to 64 characters here, but they are not. They are simply a little faster to access if short.

define MATCH_TEXT_INITIAL_ALLOCATION 64

typedef struct match_result {
    inchar32_t match_text_storage[MATCH_TEXT_INITIAL_ALLOCATION];
    struct text_stream match_text_struct;
} match_result;
typedef struct match_results {
    int no_matched_texts;
    struct match_result exp_storage[MAX_BRACKETED_SUBEXPRESSIONS];
    struct text_stream *exp[MAX_BRACKETED_SUBEXPRESSIONS];
    int exp_at[MAX_BRACKETED_SUBEXPRESSIONS];
} match_results;

The structure match_result is private to this section.
The structure match_results is accessed in 3/cla, 4/prp, 4/jsn, 5/mv, 5/im, 8/ws, 8/bdfw, 8/wcp, 8/sw, 8/wm, 8/wr, 8/ls, 8/lc, 8/ws2, 8/bf, 9/pl, 9/ec, 9/taf, 9/tp, 9/as, 9/cl, 9/is, 10/pt, 10/apacs, 10/tc, 10/tw, 10/twot, 10/tf, 10/tu, 11/mk, 11/rw, 11/cl and here.

§9. Match result objects are inherently ephemeral, and we can expect to be creating them and throwing them away frequently. This must be done explicitly. Note that the storage required is on the C stack (unless some result strings grow very large), so that it's very quick to allocate and deallocate.

match_results Regexp::create_mr(void) {
    match_results mr;
    mr.no_matched_texts = 0;
    for (int i=0; i<MAX_BRACKETED_SUBEXPRESSIONS; i++) {
        mr.exp[i] = NULL;
        mr.exp_at[i] = -1;
    }
    return mr;
}

void Regexp::dispose_of(match_results *mr) {
    if (mr) {
        for (int i=0; i<MAX_BRACKETED_SUBEXPRESSIONS; i++)
            if (mr->exp[i]) {
                STREAM_CLOSE(mr->exp[i]);
                mr->exp[i] = NULL;
            }
        mr->no_matched_texts = 0;
    }
}

§10. So, then: the matcher itself.

int Regexp::match(match_results *mr, text_stream *text, inchar32_t *pattern) {
    if (mr) Regexp::prepare(mr);
    int rv = (Regexp::match_r(mr, text, pattern, NULL, FALSE) >= 0)?TRUE:FALSE;
    if ((mr) && (rv == FALSE)) Regexp::dispose_of(mr);
    return rv;
}

int Regexp::match_from(match_results *mr, text_stream *text, inchar32_t *pattern,
    int x, int allow_partial) {
    int match_to = x;
    if (x < Str::len(text)) {
        if (mr) Regexp::prepare(mr);
        match_position at;
        at.tpos = x; at.ppos = 0; at.bc = 0; at.bl = 0;
        match_to = Regexp::match_r(mr, text, pattern, &at, allow_partial);
        if (match_to == -1) {
            match_to = x;
            if (mr) Regexp::dispose_of(mr);
        }
    }
    return match_to - x;
}

void Regexp::prepare(match_results *mr) {
    if (mr) {
        mr->no_matched_texts = 0;
        for (int i=0; i<MAX_BRACKETED_SUBEXPRESSIONS; i++) {
            mr->exp_at[i] = -1;
            if (mr->exp[i]) STREAM_CLOSE(mr->exp[i]);
            mr->exp_storage[i].match_text_struct =
                Streams::new_buffer(
                    MATCH_TEXT_INITIAL_ALLOCATION, mr->exp_storage[i].match_text_storage);
            mr->exp_storage[i].match_text_struct.stream_flags |= FOR_RE_STRF;
            mr->exp[i] = &(mr->exp_storage[i].match_text_struct);
        }
    }
}

§11.

int Regexp::match_r(match_results *mr, text_stream *text, inchar32_t *pattern,
    match_position *scan_from, int allow_partial) {
    match_position at;
    if (scan_from) at = *scan_from;
    else { at.tpos = 0; at.ppos = 0; at.bc = 0; at.bl = 0; }
    while ((Str::get_at(text, at.tpos)) || (pattern[at.ppos])) {
        if ((allow_partial) && (pattern[at.ppos] == 0)) break;
        Parentheses in the match pattern set up substrings to extract11.1;

        int chcl,  what class of characters to match: a *_CHARCLASS value
            range_from, range_to,  for LITERAL_CHARCLASS only
            reverse = FALSE;  require a non-match rather than a match
        Extract the character class to match from the pattern11.2;

        if (chcl == BEGIN_CONTENT_CHARCLASS) {
            if (at.tpos == 0) continue;
            if (Str::get_at(text, at.tpos-1) == '\n') continue;
            return -1;
        }
        if (chcl == END_CONTENT_CHARCLASS) {
            if (Str::get_at(text, at.tpos) == 0) continue;
            if (Str::get_at(text, at.tpos) == '\n') continue;
            return -1;
        }

        int rep_from = 1, rep_to = 1;  minimum and maximum number of repetitions
        int greedy = TRUE;  go for a maximal-length match if possible
        Extract repetition markers from the pattern11.3;

        int reps = 0;
        Count how many repetitions can be made here11.4;
        if (reps < rep_from) return -1;

         we can now accept anything from rep_from to reps repetitions
        if (rep_from == reps) { at.tpos += reps; continue; }
        Try all possible match lengths until we find a match11.5;

         no match length worked, so no match
        return -1;
    }
    Copy the bracketed texts found into the global strings11.6;
    return at.tpos;
}

§11.1. Parentheses in the match pattern set up substrings to extract11.1 =

    if (pattern[at.ppos] == '(') {
        if (at.bl < MAX_BRACKETED_SUBEXPRESSIONS) at.bracket_nesting[at.bl] = -1;
        if (at.bc < MAX_BRACKETED_SUBEXPRESSIONS) {
            at.bracket_nesting[at.bl] = at.bc;
            at.brackets_start[at.bc] = at.tpos; at.brackets_end[at.bc] = -1;
        }
        at.bl++; at.bc++; at.ppos++;
        continue;
    }
    if (pattern[at.ppos] == ')') {
        at.bl--;
        if ((at.bl >= 0) && (at.bl < MAX_BRACKETED_SUBEXPRESSIONS) && (at.bracket_nesting[at.bl] >= 0))
            at.brackets_end[at.bracket_nesting[at.bl]] = at.tpos-1;
        at.ppos++;
        continue;
    }

This code is used in §11.

§11.2. Extract the character class to match from the pattern11.2 =

    if (pattern[at.ppos] == 0) return -1;
    int len = 0;
    chcl = Regexp::get_cclass(pattern, at.ppos, &len, &range_from, &range_to, &reverse);
    if (at.ppos+len > Wide::len(pattern)) internal_error("Yikes");
    else at.ppos += len;

This code is used in §11.

§11.3. This is standard regular-expression notation, except that I haven't bothered to implement numeric repetition counts, which we won't need:

Extract repetition markers from the pattern11.3 =

    if (chcl == WHITESPACE_CHARCLASS) {
        rep_from = 1; rep_to = Str::len(text)-at.tpos;
    }
    if (pattern[at.ppos] == '+') {
        rep_from = 1; rep_to = Str::len(text)-at.tpos; at.ppos++;
    } else if (pattern[at.ppos] == '*') {
        rep_from = 0; rep_to = Str::len(text)-at.tpos; at.ppos++;
    }
    if (pattern[at.ppos] == '?') { greedy = FALSE; at.ppos++; }

This code is used in §11.

§11.4. Count how many repetitions can be made here11.4 =

    for (reps = 0; ((Str::get_at(text, at.tpos+reps)) && (reps < rep_to)); reps++)
        if (Regexp::test_cclass(Str::get_at(text, at.tpos+reps), chcl,
            range_from, range_to, pattern, reverse) == FALSE)
            break;

This code is used in §11.

§11.5. Try all possible match lengths until we find a match11.5 =

    int from = rep_from, to = reps, dj = 1, from_tpos = at.tpos;
    if (greedy) { from = reps; to = rep_from; dj = -1; }
    for (int j = from; j != to+dj; j += dj) {
        at.tpos = from_tpos + j;
        int try = Regexp::match_r(mr, text, pattern, &at, allow_partial);
        if (try >= 0) return try;
    }

This code is used in §11.

§11.6. Copy the bracketed texts found into the global strings11.6 =

    if (mr) {
        for (int i=0; i<at.bc; i++) {
            Str::clear(mr->exp[i]);
            for (int j = at.brackets_start[i]; j <= at.brackets_end[i]; j++)
                PUT_TO(mr->exp[i], Str::get_at(text, j));
            mr->exp_at[i] = at.brackets_start[i];
        }
        mr->no_matched_texts = at.bc;
    }

This code is used in §11.

§12. So then: most characters in the pattern are taken literally (if the pattern says q, the only match is with a lower-case letter "q"), except that:

(a) a space means "one or more characters of white space";
(b) %d means any decimal digit;
(c) %c means any character at all;
(d) %C means any character which isn't white space;
(e) %i means any character from the identifier class (see above);
(f) %p means any character which can be used in the name of a Preform nonterminal, which is to say, an identifier character or a hyphen;
(g) %P means the same or else a colon;
(h) %t means a tab;
(i) %q means a double-quote.

% otherwise makes a literal escape; a space means any whitespace character; square brackets enclose literal alternatives, and note as usual with grep engines that []xyz] is legal and makes a set of four possibilities, the first of which is a literal close square; within a set, a hyphen makes a character range; an initial ^ negates the result; and otherwise everything is literal.

define ANY_CHARCLASS 1
define DIGIT_CHARCLASS 2
define WHITESPACE_CHARCLASS 3
define NONWHITESPACE_CHARCLASS 4
define IDENTIFIER_CHARCLASS 5
define PREFORM_CHARCLASS 6
define PREFORMC_CHARCLASS 7
define LITERAL_CHARCLASS 8
define TAB_CHARCLASS 9
define QUOTE_CHARCLASS 10
define NEWLINE_CHARCLASS 11
define BEGIN_CONTENT_CHARCLASS 12
define END_CONTENT_CHARCLASS 13

int Regexp::get_cclass(inchar32_t *pattern, int ppos, int *len, int *from, int *to, int *reverse) {
    if (pattern[ppos] == '^') { ppos++; *reverse = TRUE; } else { *reverse = FALSE; }
    switch (pattern[ppos]) {
        case '%':
            ppos++;
            *len = 2;
            switch (pattern[ppos]) {
                case 'd': return DIGIT_CHARCLASS;
                case 'c': return ANY_CHARCLASS;
                case 'C': return NONWHITESPACE_CHARCLASS;
                case 'i': return IDENTIFIER_CHARCLASS;
                case 'p': return PREFORM_CHARCLASS;
                case 'P': return PREFORMC_CHARCLASS;
                case 'q': return QUOTE_CHARCLASS;
                case 't': return TAB_CHARCLASS;
                case 'r': return NEWLINE_CHARCLASS;
                case 'a': return BEGIN_CONTENT_CHARCLASS;
                case 'z': return END_CONTENT_CHARCLASS;
            }
            *from = ppos; *to = ppos; return LITERAL_CHARCLASS;
        case '[':
            *from = ppos+1;
            ppos += 2;
            while ((pattern[ppos]) && (pattern[ppos] != ']')) ppos++;
            *to = ppos - 1; *len = ppos - *from + 2;
            return LITERAL_CHARCLASS;
        case ' ':
            *len = 1; return WHITESPACE_CHARCLASS;
    }
    *len = 1; *from = ppos; *to = ppos; return LITERAL_CHARCLASS;
}

§13.

int Regexp::test_cclass(inchar32_t c, int chcl, int range_from, int range_to, inchar32_t *drawn_from, int reverse) {
    int match = FALSE;
    switch (chcl) {
        case ANY_CHARCLASS: if (c) match = TRUE; break;
        case DIGIT_CHARCLASS: if (Characters::isdigit(c)) match = TRUE; break;
        case WHITESPACE_CHARCLASS: if (Characters::is_whitespace(c)) match = TRUE; break;
        case NEWLINE_CHARCLASS: if (c == '\n') match = TRUE; break;
        case TAB_CHARCLASS: if (c == '\t') match = TRUE; break;
        case NONWHITESPACE_CHARCLASS: if (!(Characters::is_whitespace(c))) match = TRUE; break;
        case QUOTE_CHARCLASS: if (c != '\"') match = TRUE; break;
        case IDENTIFIER_CHARCLASS: if (Regexp::identifier_char(c)) match = TRUE; break;
        case PREFORM_CHARCLASS: if ((c == '-') || (c == '_') ||
            ((c >= 'a') && (c <= 'z')) ||
            ((c >= '0') && (c <= '9')))
                match = TRUE;
            break;
        case PREFORMC_CHARCLASS: if ((c == '-') || (c == '_') || (c == ':') ||
            ((c >= 'a') && (c <= 'z')) ||
            ((c >= '0') && (c <= '9')))
                match = TRUE;
            break;
        case LITERAL_CHARCLASS:
            if ((range_to > range_from) && (drawn_from[range_from] == '^')) {
                range_from++; reverse = reverse?FALSE:TRUE;
            }
            for (int j = range_from; j <= range_to; j++) {
                inchar32_t c1 = drawn_from[j], c2 = c1;
                if ((j+1 < range_to) && (drawn_from[j+1] == '-')) { c2 = drawn_from[j+2]; j += 2; }
                if ((c >= c1) && (c <= c2)) {
                    match = TRUE; break;
                }
            }
            break;
    }
    if (reverse) match = (match)?FALSE:TRUE;
    return match;
}

§14. Replacement.And this routine conveniently handles searching and replacing. This time we can match at substrings of the text (i.e., we are not forced to match from the start right to the end), and multiple replacements can be made. For example,

    Regexp::replace(text, U"[aeiou]", U"!", REP_REPEATING);

will turn the text "goose eggs" into "g!!s! !ggs".

define REP_REPEATING 1
define REP_ATSTART 2

int Regexp::replace(text_stream *text, inchar32_t *pattern, inchar32_t *replacement, int options) {
    TEMPORARY_TEXT(altered)
    match_results mr = Regexp::create_mr();
    int changes = 0;
    for (int i=0, L=Str::len(text); i<L; i++) {
        match_position mp; mp.tpos = i; mp.ppos = 0; mp.bc = 0; mp.bl = 0;
        Regexp::prepare(&mr);
        int try = Regexp::match_r(&mr, text, pattern, &mp, TRUE);
        if (try >= 0) {
            if (replacement)
                for (int j=0; replacement[j]; j++) {
                    inchar32_t c = replacement[j];
                    if (c == '%') {
                        j++;
                        inchar32_t ind = replacement[j] - '0';
                        if (ind < MAX_BRACKETED_SUBEXPRESSIONS)
                            WRITE_TO(altered, "%S", mr.exp[ind]);
                        else
                            PUT_TO(altered, replacement[j]);
                    } else {
                        PUT_TO(altered, replacement[j]);
                    }
                }
            int left = L - try;
            changes++;
            Regexp::dispose_of(&mr);
            L = Str::len(text); i = L-left-1;
            if ((options & REP_REPEATING) == 0) { Add the rest14.1; break; }
            continue;
        } else PUT_TO(altered, Str::get_at(text, i));
        if (options & REP_ATSTART) { Add the rest14.1; break; }
    }
    Regexp::dispose_of(&mr);
    if (changes > 0) Str::copy(text, altered);
    DISCARD_TEXT(altered)
    return changes;
}

§14.1. Add the rest14.1 =

    for (i++; i<L; i++)
        PUT_TO(altered, Str::get_at(text, i));

This code is used in §14 (twice).