MSTA (syntax description translator): MSTA description language

2. MSTA description language

The MSTA description language is a superset of the YACC language. The major additional features are Extended Backus-Naur Form (EBNF) for more convenient description of languages, additional constructions in rules for more convenient description of scanners, and named attributes.

2.1 Layout of MSTA description

An MSTA description has the following layout, which is similar to that of a YACC file.


     DECLARATIONS
     %%
     RULES
     %%
     ADDITIONAL C/C++ CODE

The `%%' separates the sections of the description. All sections are optional. The first `%%' starts the section of rules and is obligatory even if that section is empty; the second `%%' may be omitted if the section of additional C/C++ code is absent too.

The full syntax of the MSTA description file, written in YACC notation, is given in Appendix 1.

2.2 Declarations

The section of declarations may contain the following construction:


   %start  identifier

which determines the axiom (start symbol) of the grammar. If no such construction is present, the axiom is taken to be the nonterminal on the left-hand side of the first rule. If there are several such constructions, all except the first are ignored.

By default, the attribute values of terminals (tokens) and nonterminals are integers. To use values of other types, specify


   <tag>

in the constructions that declare symbols (%token, %type, %left, ...) and list the corresponding union member names in the following construction:


   %union { body of union in C/C++ }

Alternatively, the union can be declared in an interface file, with a typedef defining the symbol YYSTYPE (see Generated code) to represent this union. The effect of %union is to provide the declaration of YYSTYPE directly from the input.
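
As a rough illustration of this alternative, the following C sketch shows what a hand-written union and YYSTYPE typedef in an interface file might look like. The member names `ival', `dval', and `sval' and the helpers `demo_make_integer' and `demo_make_string' are hypothetical and are not part of MSTA; only the name YYSTYPE comes from the text above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical union of attribute types.  The member names play the
   role of the <tag> names used in %token, %type, %left, ...  */
typedef union
{
  int ival;          /* e.g. for  %token <ival> INTEGER  */
  double dval;       /* e.g. for  %type <dval> expr      */
  const char *sval;  /* e.g. for  %token <sval> IDENT    */
} YYSTYPE;

/* Hypothetical helper: build an attribute value holding an integer.  */
YYSTYPE
demo_make_integer (int value)
{
  YYSTYPE attr;

  memset (&attr, 0, sizeof (attr));
  attr.ival = value;
  return attr;
}

/* Hypothetical helper: build an attribute value holding a string.  */
YYSTYPE
demo_make_string (const char *text)
{
  YYSTYPE attr;

  assert (text != NULL);
  memset (&attr, 0, sizeof (attr));
  attr.sval = text;
  return attr;
}
```

Each symbol's attribute occupies one YYSTYPE value, so an action only has to select the member that matches the symbol's tag.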

The following group of declarators takes token (terminal) or nonterminal names as arguments.


   %token [<tag>] name [number] [name [number]]...
   %left [<tag>] name [number] [name [number]]...
   %right [<tag>] name [number] [name [number]]...
   %nonassoc [<tag>] name [number] [name [number]]...
   %type <tag> name...

The names can optionally be preceded by the name of a C/C++ union member (called a tag, see above) enclosed in ``<'' and ``>''. A tag specifies that the tokens or nonterminals named in the construction have the same C/C++ type as the union member referenced by the tag.

If a symbol used in the grammar is not declared by a %token, %left, %right, or %nonassoc construction, it is considered a nonterminal.

The first occurrence of a given token in the constructions `%token', `%left', `%right', and `%nonassoc' can be followed by a positive integer. In this case the integer is the code that the scanner returns for the corresponding token.

Constructions `%left', `%right', and `%nonassoc' assign precedence and associativity to the corresponding tokens. All tokens in the same construction have the same precedence level and associativity; the constructions should be placed in order of increasing precedence. Construction `%left' denotes that the operators (tokens) in that construction are left associative, and construction `%right' similarly denotes right associative operators.

Construction `%nonassoc' means that the tokens cannot be used associatively. If the parser encounters an associative use of such a token, it reports an error.

The construction `%type' means that the attributes of the corresponding nonterminals are of the type given in the tag field.

Once the type, precedence, or token number of a symbol is specified, it shall not be changed. If the first declaration of a token does not assign a token number, MSTA will assign a token number. Once this assignment is made, the token number shall not be changed by explicit assignment.

Real grammars usually cannot be written without shift/reduce conflicts. To state the expected number of shift/reduce conflicts, the following construction can be used.


   %expect number

If this construction is present, MSTA reports an error when the number of shift/reduce conflicts differs from the number given in the construction. Remember that this is not a standard YACC construction.

The following construction in declarations means that the scanner should be generated.


   %scanner

A parser and a scanner generated by MSTA differ in the following major ways:

So that an MSTA-generated parser can be used together with an MSTA-generated scanner, all objects in an MSTA scanner (variables, types, macros, and so on) are named by adding the letter `s' or `S' after the prefix `yy' or `YY'.
An additional function `yylex_start' is generated. This function should be used to initialize the scanner (see Generated code).
Function `yylex' is generated instead of function `yyparse'. It can be called repeatedly to get the next token. The code of the next token should be returned by `return' statements in the actions. The input stream (look-ahead characters) is preserved from one call of `yylex' to the next.
Instead of function `yylex', a function `yyslex' is used to read the next character (a token in the terminology of the MSTA specification file) from the input stream. -1 rather than 0 denotes the end of file, because the scanner must be able to read and process zero characters.
Macro `YYSABORT' is -1 so that a token code can be distinguished from the flag signalling that the scanner has finished its work. Remember that the analogous macro `YYABORT' for an MSTA parser is 1.
To extract all regular parts of the scanner grammar, LR-sets are split (see MSTA implementation).

You can look at a scanner specification in Appendix 2.
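
The end-of-file convention mentioned in the list of differences can be sketched in C as follows. This is only an illustration under stated assumptions: the structure `demo_input', the macro `DEMO_END_OF_INPUT', and the function `demo_next_char' are hypothetical stand-ins, and the real interface of `yyslex' is not shown in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory input source for a scanner.  */
struct demo_input
{
  const unsigned char *data;  /* bytes to scan, may contain zero bytes */
  size_t length;              /* number of bytes in data */
  size_t position;            /* index of the next unread byte */
};

#define DEMO_END_OF_INPUT (-1)  /* distinct from every byte value 0..255 */

/* Return the next byte as a non-negative code, or -1 at end of input.
   A zero byte is returned as 0 and therefore stays an ordinary
   character that the scanner grammar can match.  */
int
demo_next_char (struct demo_input *in)
{
  assert (in != NULL);
  if (in->position >= in->length)
    return DEMO_END_OF_INPUT;
  return in->data[in->position++];
}
```

Because the end marker lies outside the range of character codes, input such as the three bytes 'a', 0, 'b' is delivered as three characters followed by -1.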

The declaration section may also contain the following constructions


     %{
        C/C++ DECLARATIONS
     %}

     %local {
        C/C++ DECLARATIONS
     }

     %import {
        C/C++ DECLARATION
     }

     and

     %export {
        C/C++ DECLARATION
     }

which contain any C/C++ declarations (types, variables, macros, and so on) used in the sections. Remember that only the first construction is a standard POSIX YACC construction.

The local C/C++ declarations are inserted at the beginning of the generated implementation file (see section `generated code') but after the include directive for the interface file (if present; see MSTA Usage). The more traditional YACC construction %{ ... %} can be used instead.

C/C++ declarations that start with `%import' are inserted at the beginning of the generated interface file. If the interface file is not generated, the code is inserted at the beginning of the part of the implementation file that would correspond to the interface file.

C/C++ declarations that start with `%export' are inserted at the end of the generated interface file. For example, such exported C/C++ code may contain definitions of external variables and functions that refer to definitions generated by MSTA. If the interface file is not generated, the code is inserted at the end of the part of the implementation file that would correspond to the interface file.

All C/C++ declarations are placed in the same order as in the section of declarations.

2.3 Rules

The section of declarations is followed by the section of rules.

The rules section defines the context-free grammar to be accepted by the function that MSTA generates, and associates C/C++ actions and additional precedence information with those rules. The grammar is described below; the formal syntax is given in Appendix 1.

The rules section contains one or more grammar rules. A grammar rule has the following form:


      nonterminal : pattern ;

The nonterminal on the left-hand side of the rule denotes a language construction, and the pattern describes what the nonterminal is derived into. The semicolon at the end of the rule can be omitted.

MSTA can use EBNF (Extended Backus-Naur Form) to describe the patterns. Because a pattern can be quite complex, MSTA internally transforms the rules of the description into simple rules and assigns a unique number to each simple rule. A simple rule can contain only a sequence of nonterminals and tokens. The simple rules and the numbers assigned to them appear in the description file (see MSTA usage). To obtain the simple rules, MSTA makes the following transformations (in the given order).

Alternatives


         nonterminal : pattern1 | pattern2

are transformed into


         nonterminal : pattern1
         nonterminal : pattern2

Lists


         nonterminal : ... pattern / s_pattern ...

are transformed into


         nonterminal : ... N ...
         N : N s_pattern pattern
         N : pattern

N denotes here a new nonterminal created during the transformation. This construction is very convenient for describing lists with separators, e.g. identifiers separated by commas. Remember that lists are not a feature of standard POSIX YACC.

Naming


         nonterminal : ... N @ identifier ...

is transformed into


         nonterminal : ... N ...

Here N denotes a nonterminal, a token, or one of the constructions described below. In actions, the identifier can be used instead of a number to refer to the attribute of the nonterminal, of the token, or of the nonterminal created during the transformation of those constructions. Remember that naming is not a feature of standard POSIX YACC.

Optional construction


         nonterminal : ... [ pattern ] ...

is transformed into


         nonterminal : ... N ...
         N : pattern
         N :

N denotes here a new nonterminal created during the transformation. This construction is very convenient for describing optional parts of a rule. Remember that the optional construction is not a feature of standard POSIX YACC.

Optional repetition


         nonterminal : ... pattern * ...

is transformed into


         nonterminal : ... N ...
         N : N pattern
         N :

N denotes here a new nonterminal created during the transformation. This construction is very convenient for describing zero or more occurrences of the pattern. Remember that the optional repetition is not a feature of standard POSIX YACC.

Repetition


         nonterminal : ... pattern + ...

is transformed into


         nonterminal : ... N ...
         N : N pattern
         N : pattern

N denotes here a new nonterminal created during the transformation. This construction is very convenient for describing one or more occurrences of the pattern. Remember that the repetition is not a feature of standard POSIX YACC.

Grouping


         nonterminal : ... ( pattern ) ...

is transformed into


         nonterminal : ... N ...
         N : pattern

N denotes here a new nonterminal created during the transformation. This construction is needed to change the priority of the transformations. Remember that grouping is not a feature of standard POSIX YACC.

String


         nonterminal : ... string ...

is transformed into


         nonterminal : ... '1st char' '2nd char' ... 'last char' ...

Here the string is simply the sequence of its characters written as MSTA literals. Remember that strings are not a feature of standard POSIX YACC.

Range


         nonterminal : ... token1 - tokenN ...

is transformed into


         nonterminal : ... N ...
         N : token1
         N : token2
         ...
         N : tokenN

N denotes here a new nonterminal created during the transformation. The range is simply any token with a code between the code of token1 and the code of tokenN (inclusive). The code of token1 must be less than or equal to the code of tokenN. Remember that ranges are not a feature of standard POSIX YACC.

Left open range


         nonterminal : ... token1 <- tokenN ...

is transformed into


         nonterminal : ... N ...
         N : token2
         N : token3
         ...
         N : tokenN

N denotes here a new nonterminal created during the transformation. The left open range is simply any token with a code between the code of token1 + 1 and the code of tokenN (inclusive). The code of token1 must be less than the code of tokenN. Remember that ranges are not a feature of standard POSIX YACC.

Right open range


         nonterminal : ... token1 -> tokenN ...

is transformed into


         nonterminal : ... N ...
         N : token1
         N : token2
         ...
         N : tokenN-1

N denotes here a new nonterminal created during the transformation. The right open range is simply any token with a code between the code of token1 and the code of tokenN - 1 (inclusive). The code of token1 must be less than the code of tokenN. Remember that ranges are not a feature of standard POSIX YACC.

Left right open range


         nonterminal : ... token1 <-> tokenN ...

is transformed into


         nonterminal : ... N ...
         N : token2
         N : token3
         ...
         N : tokenN-1

N denotes here a new nonterminal created during the transformation. The left right open range is simply any token with a code between the code of token1 + 1 and the code of tokenN - 1 (inclusive). The code of token1 must be less than the code of tokenN - 1. Remember that ranges are not a feature of standard POSIX YACC.
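
The four range forms differ only in whether each bound is included. The following C sketch expresses that as a predicate over token codes; it illustrates the meaning of the ranges and is not how MSTA implements them. The enumeration `demo_range_kind' and the function `demo_range_matches' are hypothetical names.

```c
#include <assert.h>

/* The four range forms, named here for illustration only.  */
enum demo_range_kind
{
  DEMO_RANGE_CLOSED,      /* token1 -   tokenN */
  DEMO_RANGE_LEFT_OPEN,   /* token1 <-  tokenN */
  DEMO_RANGE_RIGHT_OPEN,  /* token1 ->  tokenN */
  DEMO_RANGE_BOTH_OPEN    /* token1 <-> tokenN */
};

/* Return 1 if CODE is matched by the range of KIND between the codes
   FIRST (token1) and LAST (tokenN), and 0 otherwise.  */
int
demo_range_matches (enum demo_range_kind kind, int first, int last, int code)
{
  int low = first;
  int high = last;

  if (kind == DEMO_RANGE_LEFT_OPEN || kind == DEMO_RANGE_BOTH_OPEN)
    low = first + 1;     /* exclude token1 */
  if (kind == DEMO_RANGE_RIGHT_OPEN || kind == DEMO_RANGE_BOTH_OPEN)
    high = last - 1;     /* exclude tokenN */
  assert (low <= high);  /* mirrors the ordering requirement on the codes */
  return low <= code && code <= high;
}
```

For instance, with the bounds 'a' and 'z' the closed form matches both 'a' and 'z', while the left right open form matches neither but still matches 'm'.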

Action inside pattern


         nonterminal : ... action  something non empty

is transformed into


         nonterminal : ... N  something non empty
         N : action

N denotes here a new nonterminal created during the transformation. The action is a C/C++ block.

After all the transformations mentioned above, each rule contains only a sequence of tokens (literals or token identifiers) and nonterminals, optionally followed by a %prec and/or %la construction and/or an action.

The action is an arbitrary C/C++ block, i.e. declarations and statements enclosed in curly braces { and }. Certain pseudo-variables can be used in the action to refer to attributes. They are replaced by references to data structures known internally to MSTA. The pseudo-variables have the following forms:

$$: This pseudo-variable denotes the attribute of the nonterminal on the left-hand side of the simple rule.
$number: This pseudo-variable refers to the attribute of the sequence element (nonterminal, token, or action) specified by its number in the right-hand side of the rule, counting from left to right in the rule as it stands before actions inside the pattern are replaced (see the transformation above). The number can be zero or negative. In that case it refers to the attribute of a symbol (token or nonterminal) on the parser's stack preceding the leftmost symbol of the rule. (That is, $0 refers to the attribute of the symbol immediately preceding the leftmost symbol of the rule on the parser's stack, and $-1 refers to the symbol to its left.) If the number refers to an element past the current point in the rule (i.e. past the action) or beyond the bottom of the stack, the result is undefined.
$identifier: This pseudo-variable is analogous to the previous one, but the attribute name is used instead of the number. The attribute must have been given that name (see Naming above).
$<...>number: This pseudo-variable is used when there are attributes of different types in the grammar and the number corresponds to a nonterminal whose type is not known because the nonterminal was generated during the transformation of rules into simple rules. The type name of the attribute is placed in angle brackets.
$<...>identifier: This pseudo-variable is analogous to the previous one, but the attribute name is used instead of the number. The attribute must have been given that name (see Naming above).
$<...>$: This pseudo-variable is used when there are attributes of different types in the grammar and the type of the nonterminal is not known because the nonterminal was generated during the transformation of rules into simple rules.
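
The way zero and negative numbers reach below the rule can be pictured with the conventional layout of an LR attribute stack. The C sketch below assumes that layout; the data structures MSTA actually generates may differ, and `demo_attr_stack' and `demo_attr_ref' are hypothetical names. Where the text above says the result is undefined, the sketch stops with an assertion instead.

```c
#include <assert.h>

#define DEMO_STACK_SIZE 16

/* Hypothetical attribute stack; the attribute of the most recently
   parsed symbol is at index top.  */
struct demo_attr_stack
{
  int values[DEMO_STACK_SIZE];
  int top;
};

/* Return a pointer to the attribute that $NUMBER denotes in the action
   of a simple rule whose right-hand side has RULE_LENGTH symbols, all
   of which are already on the stack.  $1 is the leftmost symbol of the
   rule, $0 the symbol just below it, $-1 the one below that.  */
int *
demo_attr_ref (struct demo_attr_stack *stack, int rule_length, int number)
{
  int index = stack->top - rule_length + number;

  assert (number <= rule_length);              /* not past the end of the rule */
  assert (index >= 0 && index <= stack->top);  /* not beyond the stack bottom */
  return &stack->values[index];
}
```

With four attributes 10, 20, 30, 40 on the stack and a rule of length 2, $1 and $2 select 30 and 40, while $0 and $-1 select 20 and 10.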

Messages about some shift/reduce conflicts are not generated if the rule involved in the conflict has a priority and associativity. The priority and associativity of a rule are simply the precedence level and associativity of the last token in the rule that has a declared precedence level and associativity.
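
The C sketch below shows how a rule's priority can be derived from its last token with a declared precedence, and how a shift/reduce conflict is then conventionally decided in YACC. It assumes that MSTA follows the usual YACC resolution rule, which this section does not spell out; all names (`demo_prec', `demo_rule_prec', `demo_resolve', and the enumerations) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum demo_assoc { DEMO_NONE, DEMO_LEFT, DEMO_RIGHT, DEMO_NONASSOC };
enum demo_action { DEMO_SHIFT, DEMO_REDUCE, DEMO_ERROR, DEMO_UNRESOLVED };

/* Precedence information of one symbol; level 0 means "not declared".  */
struct demo_prec
{
  int level;
  enum demo_assoc assoc;
};

/* Precedence of a simple rule: that of the last right-hand-side symbol
   with a declared precedence, or level 0 if there is none.  */
struct demo_prec
demo_rule_prec (const struct demo_prec *rhs, size_t length)
{
  struct demo_prec none = { 0, DEMO_NONE };
  size_t i;

  assert (rhs != NULL || length == 0);
  for (i = length; i > 0; i--)
    if (rhs[i - 1].level > 0)
      return rhs[i - 1];
  return none;
}

/* Conventional YACC resolution of a shift/reduce conflict between a
   rule and the look-ahead token (assumed, not taken from MSTA).  */
enum demo_action
demo_resolve (struct demo_prec rule, struct demo_prec token)
{
  if (rule.level == 0 || token.level == 0)
    return DEMO_UNRESOLVED;   /* the conflict is reported */
  if (rule.level > token.level)
    return DEMO_REDUCE;
  if (rule.level < token.level)
    return DEMO_SHIFT;
  if (token.assoc == DEMO_LEFT)
    return DEMO_REDUCE;
  if (token.assoc == DEMO_RIGHT)
    return DEMO_SHIFT;
  return DEMO_ERROR;          /* %nonassoc: associative use is an error */
}
```

For a rule such as expr : expr '+' expr, the rule takes the precedence of '+', so a look-ahead token of higher precedence leads to a shift and another '+' (left associative) leads to a reduce.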

The optional construction `%prec ...' can be used to change the precedence level associated with a particular simple rule. Examples of this are in cases where a unary and binary operator have the same symbolic representation, but need to be given different precedences. The reserved keyword `%prec' can be followed by a token identifier or a literal. It shall cause the precedence of the grammar rule to become that of the following token identifier or literal.

The optional construction `%la number' can be used to change the maximal look-ahead associated with a particular simple rule. For example, suppose a grammar contains the classical if-then-else conflict, which is resolved correctly with a look-ahead of 1, and also a rule with a conflict that must be resolved with a look-ahead of 3. In this case you can call MSTA with a maximal look-ahead of 1 (the default) and place %la 3 in the rule that takes part in the conflict requiring a look-ahead of 3.

If a program section follows, the grammar rules shall be terminated by %%.
