The GOLD Meta-Language
Overview
The GOLD Meta-Language is used to define a grammar for the GOLD Parsing System.
The notation used by the GOLD Meta-Language closely follows the standards used in
language theory. This allows both students and professionals familiar with
language theory to write grammars without a large learning curve.
Syntax
Line Comments:
! Comment

Block Comments:
!* Comments *!
Details
Comments in the GOLD Meta-Language use the exclamation mark: a single '!' starts a
line comment, while '!*' and '*!' delimit block comments. Block comments can be nested.
Examples
! This is a comment
! This is also a comment
!*
Remember to always comment your code. This
can add a great deal to readability.
*!
Properties
The role of properties is nebulous by design. Some set the formal name of the
grammar, for instance, while others have a significant impact on how the system
constructs parse tables. Property names are delimited by double quotes and can be
set to any of the symbols and literals recognized by the GOLD Meta-Language. In
most cases, the value will be a string.
This feature was added in Version 2.2 of the Builder. In Version 5, you can set a
terminal to "Virtual" by assigning its attributes instead.
Start Symbol (required)   The starting symbol of the grammar. When LALR parse
tables are constructed by the GOLD Builder, an "accepted" grammar will reduce to
this nonterminal.
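For instance, assuming the grammar's start symbol is a nonterminal named <Program>
(the name is illustrative):

"Start Symbol" = <Program>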
Example
"Name" = 'My Programming Language'
"Version" = '1.0 beta'
"Author" = 'John Q. Public'
Character Sets
Literal sets of characters are delimited using the square brackets '[' and ']', and
pre-defined sets are delimited by the braces '{' and '}'. For instance, the text
"[abcde]" denotes a set of characters consisting of the first five letters of the
alphabet, while the text "{abc}" refers to a set named "abc".
Sets can then be declared by adding and subtracting previously declared sets and
literal sets. The GOLD Builder provides a collection of pre-defined sets that
contain characters often used to define terminals.
The developer can specify any character using either its decimal or hexadecimal
code point. Decimal values are denoted by a number-sign prefix (#) and hexadecimal
values are denoted by an ampersand (&).
Set ranges can be specified by using a ".." between two values. Both the start and
end values can be in either decimal or hexadecimal.
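For instance, a few declarations in this notation. The set names on the left are
illustrative; {Letter}, {Printable}, {Whitespace}, {CR} and {LF} are pre-defined
sets, and the bracketed range form in the last line is an assumption about how
".." ranges are written:

! Sets built from pre-defined sets, literal sets, and character codes
{ID Head}      = {Letter} + [_]
{String Chars} = {Printable} - ["]
{WS}           = {Whitespace} - {CR} - {LF}
{Octal Digit}  = [01234567]
{Upper A to Z} = [&41 .. &5A]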
Regular Expressions
The notation is rather simple, yet versatile enough to express any terminal needed.
Basically, regular expressions consist of a series of characters that define the
pattern of the terminal.
Literal sets of characters are delimited using the square brackets '[' and ']' and
defined sets are delimited by the braces '{' and '}'. For instance, the text
"[abcde]" denotes a set of characters consisting of the first five letters of the
alphabet; while the text "{abc}" refers to a set named "abc". Neither of these are
part of the "pure" notation for regular expressions, but are widely used in other
parser generators such as Lex/Yacc.
Sub-expressions are delimited by normal parenthesis '(' and ')'. The pipe character
'|' is used to denote alternate expressions.
Either a set, a sub expression, or a single character can be followed by any of the
following three symbols:
* Kleene Closure. This symbol denotes 0 or more or the specified character(s)
+ One or more. This symbol denotes 1 or more of the specified character(s)
? Optional. This symbol denotes 0 or 1 of the specified character(s)
For example, the regular expression ab* translates to "an a followed by zero or
more b's" and [abc]+ translates to "an series of one or more a's, b's or c's".
Note: When text is read by the Builder, all characters delimited by single quotes
are analyzed as literal strings. In other words, any text delimited by single
quotes is considered to be exactly as printed. This allows you to specify
characters that would normally be limited by the notation. For instance, when
defining a rule, angle brackets are used to delimit nonterminals. By typing '<' and
'>', you can specify these two characters without worrying about the system
misinterpreting them. A single quote character can be specified by typing a double
single quote ''.
In the case of regular expressions, single quotes allow you to specify the
following characters: ? * + ( ) { } [ ]
Special Terminals
Whitespace
In practically all programming languages, the parser recognizes (and usually
ignores) the spaces, new lines, and other meaningless characters that exist between
tokens. For instance, in the code:
If  Done Then
     Counter = 1;
End If
The fact that there are two spaces between the 'If' and 'Done', a new line after
'Then', and multiple spaces before 'Counter' is irrelevant.
From the parser's point of view (in particular the Deterministic Finite Automata
that it uses) these whitespace characters are recognized as a special terminal
which can be discarded. In GOLD, this terminal is simply called the Whitespace
terminal and can be defined to whatever is needed. If the Whitespace Terminal is
not defined explicitly in the grammar, it will be implicitly declared as one or
more of the characters in the pre-defined Whitespace set: {Whitespace}+.
Normally, you would not need to worry about the Whitespace terminal unless you are
designing a language where the end of a line is significant. This is the case with
Visual Basic, BASIC and many others. The proper declaration can be seen in the
example below.
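A minimal sketch of such a declaration; the {WS} set and NewLine terminal are
covered in detail in the line-based grammar discussion later in this document:

{WS}       = {Whitespace} - {CR} - {LF}
Whitespace = {WS}+
NewLine    = {CR}{LF} | {LF}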
Comment
Block and line comments are common in programming languages. The Comment terminal
is generated as the container for a comment group. Since comments are considered
whitespace, GOLD will set the Comment terminal to "noise" when it is created.
Examples
Declaration                              Valid strings
Example1 = a b c*                        ab, abc, abcc, abccc, abcccc, ...
Example2 = a b? c                        abc, ac
Example3 = a|b|c                         a, b, c
Example4 = a[12]*b                       ab, a1b, a2b, a12b, a21b, a22b, a111b, ...
Example5 = '*'+                          *, **, ***, ****, ...
Example6 = {Letter}+                     cat, dog, Sacramento, ...
Identifier = {Letter}{AlphaNumeric}*     e4, Param4b, Color2, temp, ...
ListFunction = c[ad]+r                   car, cdr, caar, cadr, cdar, cddr, caaar, ...
ListFunction = c(a|d)+r                  The same as the above, using a different yet equivalent regular expression.
NewLine = {CR}{LF}|{LF}                  Windows and DOS use {CR}{LF} for newlines; UNIX uses {LF} alone. This definition will detect both.
Lexical Groups
In GOLD, lexical groups are used for situations where a number of recognized tokens
should be organized into a single "group". This mechanism is most commonly used to
handle line and block comments. However, it is not limited to "noise"; it can be
used for any content.
Comment Groups
One of the key principles in programming languages is the ability to incorporate
comments and other documentation directly into the source code. Whether it is
FORTRAN, COBOL or C++, the ability exists, but in varying forms. Essentially, there
are two different types of comments used in programming languages: those that
tell the compiler to ignore the remaining text in the current line of code and
those used to denote the start and end of a multi-line comment.
To accommodate the intricacies of comments, the GOLD Parser Builder provides a
special class of terminals called "groups". These allow you to create comments
with little typing.
Block Comments
Any time a symbol is defined ending with Start, End, or Line, a lexical group
will be created. Each group acts as a container for data - such as block comments
or any other group the user may need. Once a group is read, the system will treat
it as a single token: the group's "container".
All groups have a start and end symbol. In the case of the following C-style
definition, both are quite obvious:
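Comment Start = '/*'
Comment End   = '*/'

Everything read between these two delimiters becomes part of the group and is
returned as a single token.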
Line Comments
Many programming languages have constructs, such as line comments, that read the
rest of a line and end at the newline. The same group mechanism is used for both
block comments and line comments; however, a few things happen "behind the scenes"
to make this possible. The following definition creates a line comment:
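Comment Line = '//'

Here the C++-style '//' is used as the delimiter; any other literal or regular
expression could be used instead.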
GOLD uses the first part of the definition - the "name" - to create the group name
and container. Note: line-based groups and block-based groups of the same name
share the same container.
Groups
Definition              Generated Group Name    Generated Container
name Start, name End    name Block              name
name Line               name Line               name
Example: Groups
Non-Noise Groups
The group approach is designed to allow the developer to create any number of
groups. The most common, naturally, will be the block comment. However, there will
be situations where the developer will want to create a group that will be
recognized by the parser as a regular terminal.
So, if the developer wants to create an HTML block in code, they can specify:
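A sketch of such a group declaration (the terminal name HTML is illustrative):

HTML Start = '<html>'
HTML End   = '</html>'

With this group defined, a block of HTML in the source code is read by the parser
as a single HTML token: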
page =
<html>
  <head>
    <title>Some page</title>
  </head>
  <body>
    This is a tad easier than concatenating a series of strings!
  </body>
</html>
Real-World Examples
ANSI-C (and its children)
ANSI C comments are pretty basic. They cannot be nested, and they only advance a
character at a time (untokenized). This format is used by all successors of ANSI-C
such as C++, Java and C#. Strangely, line comments are not part of the ANSI-C
language definition, but rarely has a compiler not recognized them.
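A sketch of the corresponding definitions; the '//' line comment is the common
extension noted above:

Comment Start = '/*'
Comment End   = '*/'
Comment Line  = '//'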
Pascal
Pascal has two block comment notations. In this case, a single comment group can be
defined, with the regular expressions for Comment Start and Comment End written so
that they accept either notation. Normally, Pascal comments cannot be nested, but
this varies by compiler. Since group definitions specify terminal names, extra
definitions are necessary if the two notations are to be kept separate - for
example, a second group named Comment2.
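A sketch of the two-group approach (the delimiters are Pascal's two block comment
notations):

Comment Start  = '{'
Comment End    = '}'
Comment2 Start = '(*'
Comment2 End   = '*)'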
To set these groups to noise, the developer uses 'noise' in the attributes. As a
result, both Comment and Comment2 will be ignored by the parser. The developer
could also manually add the 'noise' attribute to "Comment Attributes", but this is
not necessary.
The majority of the grammar that you write will be used to specify the syntactic
structure of the language. When an input string (such as the user's program) is
parsed, it is stored in a tree structure that follows the syntactic structure of
the language. There are several ways of specifying the structure of a grammar, and
different parsing systems use different notations and formats.
When text is read by the Builder, all characters delimited by single quotes are
analyzed as literal strings. Any text delimited by single quotes is considered to
be exactly as printed. This allows you to specify characters that would normally be
limited by the notation. For instance, when defining a rule, angle brackets are
used to delimit nonterminals. By typing '<' and '>', you can specify these two
characters without worrying about the system misinterpreting them. A single quote
character can be specified by typing two single quotes ''.
Productions
The syntax of the grammar is defined using a series of "productions". Each consists
of a single nonterminal called the "head", which is defined as a sequence of
symbols making up the production's "handle".
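For instance, the following production defines a generic If-Statement syntax (a
sketch; the exact spelling of the names is illustrative):

<If Statement> ::= if <Expression> then <Statements> end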
Here, the symbols 'if', 'then', and 'end' are terminals, while <Expression> and
<Statements> are nonterminals. The first nonterminal, before the ::= symbol, is the
"head". All the terminals and nonterminals after the ::= are called the "handle".
The "head" and "handle", together, are called a "production".
Rules
If you are declaring a series of productions that derive the same nonterminal (they
have the same head), you can use pipe characters '|' to create a series of
different handles. The pipe symbol basically means "or". So, in the example below,
<Statement> is defined using three different handles and can be equivalent to any
of them.
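For instance, a sketch with three illustrative handles:

<Statement> ::= if <Expression> then <Statements> end
             | while <Expression> do <Statements> end
             | Id ':=' <Expression>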
Nullable Rules
A rule can also be declared as "nullable". This basically means that the
nonterminal that the rule represents can contain zero terminals. For all intents
and purposes, the nonterminal can be seen as optional. In grammars, nullable rules
are often used for optional clauses (on statements) or for creating lists that
contain zero or more items.
Nullable Clause
The three clauses in the C-style For Statement are optional. The <Opt Exp> rule
defines an optional expression.
<For Statement> ::= for '(' <Opt Exp> ';' <Opt Exp> ';' <Opt Exp> ')' <Statements>
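A sketch of the <Opt Exp> rule itself, using an empty handle for the nullable case:

<Opt Exp> ::= <Expression>
            |

The <Value> rule below is a separate example of a nonterminal defined with several
handles.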
<Value> ::= ID
| Integer
| '(' <Expression> ')'
Example: Lists
Comma Delimited
The following declaration defines a list of identifiers where commas are used to
separate list items. The last item in the list will not be followed by a comma.
This definition does not allow the list to be completely empty.
Identifier = {Letter}{Alphanumeric}*
<List> ::= <List> ',' <List Item>
| <List Item>
Statement List
The following declaration defines a list of "Statements" where each <Statement>
will be followed by a semicolon. Unlike the example above, the last item in the
list will also be followed by the delimiter. However, since the <List> rule is
nullable (the blank handle), the list can be completely empty.
Identifier = {Letter}{Alphanumeric}*
<List> ::= <List> <Statement> ';'
|
The following variation requires the list to contain at least one statement:
Identifier = {Letter}{Alphanumeric}*
<List> ::= <List> <Statement> ';'
| <Statement> ';'
Consider the following grammar, which defines a simple if statement with an
optional else clause:
Id = {Letter}{AlphaNumeric}*
<Statement> ::= if Id then <Statement>
| if Id then <Statement> else <Statement>
| Id ':=' Id
When this grammar is analyzed by the GOLD Parser Builder (or any other generator
for that matter), a problem arises in the LALR(1) parse tables. Invariably, a
shift-reduce error occurs when the parser reaches the "else" option on the If-Then
statement.
This type of error is caused by ambiguity in the grammar itself; in the case of the
above grammar, the parser does not know whether it should reduce a rule or push the
token onto the parse stack. Technical issues aside, it is important to understand
why this grammar is ambiguous.
The ambiguity of the grammar can be seen with a very simple piece of source code:
if Enrolled then if Studied then Grade:=A else Grade:=B
The sample source code could be interpreted two distinct ways by the grammar. The
first interpretation would bind the "else" to the first "if".
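The two readings can be shown with parentheses added purely for illustration:

Interpretation 1 - the else binds to the first (outer) if:
    if Enrolled then ( if Studied then Grade:=A ) else Grade:=B

Interpretation 2 - the else binds to the second (inner) if:
    if Enrolled then ( if Studied then Grade:=A else Grade:=B )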
One common solution restricts which statements may appear in the "then" part of an
if-then-else. A sketch of such a grammar (the nonterminal names are illustrative):
Id = {Letter}{AlphaNumeric}*
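<Statement> ::= if Id then <Statement>
             | if Id then <Closed Statement> else <Statement>
             | Id ':=' Id

<Closed Statement> ::= if Id then <Closed Statement> else <Closed Statement>
                    | Id ':=' Id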
In other words, no "if" statements without "else" clauses can appear inside the
"then" part of an "if-then-else" statement. Using this solution, the "else" will
bind to the last "If" statement, and still allows chaining. This is the case with
the C/C++ programming language family.
In the GOLD Meta-language, you can control the attributes of terminals and groups.
These attributes control how a terminal is treated and how groups will behave. Each
attribute assignment consists of a set of attribute-value pairs.
Terminal Attributes
Overview
Symbol Attributes
Attribute Valid Values
Type Content / Noise
Source Lexer / Virtual
Details
Type
Terminal attributes allow you to control how the symbol is viewed by the parser.
It can be considered merely noise or an essential part of the grammar. Noise
symbols are ignored by the parser. Normally, the symbols 'Whitespace' and 'Comment'
are automatically defined as 'noise'; any other terminal defaults to 'Content'.
Source
This controls whether GOLD will generate DFA states to recognize the terminal. In
practically all cases it will, and the terminal will be recognized by the lexer.
However, in rare circumstances, the developer may want to create the terminal
manually at runtime. The alternative value, Virtual, will place the symbol into the
Symbol Table, but will not create any DFA states.
Languages such as Python do not use symbols to mark the start and end of a block
of statements. Instead, the indentation of each statement determines where a block
begins and ends. For Python, the content of whitespace is important - or at least
the position of a token, rather than solely its classification by the lexer.
If a program has an indent of 10 spaces, the grammar must contain a set of rules
for statements at this level. The same is true for every other level of indentation
- requiring an infinite number of rules to parse.
Virtual terminals are designed to resolve problems like this by allowing the
developer to define terminals that will be created manually. Each of these terminals
is entered into the Symbol Table, but will not be recognized by the Engine's
lexer. In other words, these terminals only exist in the Symbol Table and will
never be created by the lexer's Deterministic Finite Automaton.
As a result, the developer must create these terminals at runtime and push them
onto the front of the input queue. The actual semantics and method calls will vary
greatly between different implementations of the Engine. In the case of Python,
tokens could be created to represent an increase or decrease in indentation. Of
course, the developer can create special tokens for any number of reasons, not just
this one.
This is redundant with the old "Virtual Terminals" grammar property. Both can be
used.
Examples
Virtual
The following example sets the "source" attribute of IndentIncrease and
IndentDecrease to "virtual".
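A minimal sketch of such declarations; the '@=' attribute-assignment notation used
here is an assumption about the Version 5 syntax:

IndentIncrease @= { Source = Virtual }
IndentDecrease @= { Source = Virtual }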
Group Attributes
Overview
Group Attributes
Attribute Valid Values
Nesting All / None / Self / { ... }
Advance Token / Character
Ending Open / Closed
Details
Nesting
Any group can allow other groups to be nested within it. The Nesting attribute sets
which groups may be nested inside; the user can also specify a set of group names
using a braced list. This defaults to None.
Advance
This sets whether the lexical group will advance on each token or character. The
default is Character.
Ending
This attribute determines if the group is open-ended or closed. If the group is
closed, the end token is consumed when the group is complete. Otherwise, it is left
in the queue. Block groups default to 'closed'. Line groups default to 'open' to
leave the newline for languages that need it as part of their syntax.
Examples
Example 1
The following example declares Comment Start and Comment End. The group's nesting
is set to None and it advances by character.
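A sketch of the declarations. The group generated by Comment Start and Comment End
is named "Comment Block" (following the naming convention under Lexical Groups),
and the '@=' attribute-assignment notation on the last line is an assumption about
the Version 5 syntax:

Comment Start = '/*'
Comment End   = '*/'
Comment Block @= { Nesting = None, Advance = Character }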
Constants
Name      Codepoint   Description
{HT}      &09         Horizontal Tab character.
{LF}      &0A         Line Feed character.
{VT}      &0B         Vertical Tab character. This character is rarely used.
{FF}      &0C         Form Feed character, also known as "New Page".
{CR}      &0D         Carriage Return character (decimal #13).
{Space}   &20         Space character. Technically, this set is not needed since a "space" can be expressed using single quotes: ' '. The set was added to allow the developer to indicate the character more explicitly and to improve readability.
{NBSP}    &A0         No-Break Space character, used to represent a space where a line break is not allowed. It is often used in source code for indentation.
{LS}      &2028       Unicode Line Separator.
{PS}      &2029       Unicode Paragraph Separator.
Common Character Sets
Please see the Pre-Defined Character Set Chart for pictures of these characters.
Name Codepoints
{Basic Latin} &00 .. &7F
{Latin-1 Supplement} &80 .. &FF
{Latin Extended-A} &0100 .. &017F
{Latin Extended-B} &0180 .. &024F
{IPA Extensions} &0250 .. &02AF
{Spacing Modifier Letters} &02B0 .. &02FF
{Combining Diacritical Marks} &0300 .. &036F
{Greek and Coptic} &0370 .. &03FF
{Cyrillic} &0400 .. &04FF
{Cyrillic Supplement} &0500 .. &052F
{Armenian} &0530 .. &058F
{Hebrew} &0590 .. &05FF
{Arabic} &0600 .. &06FF
{Syriac} &0700 .. &074F
{Arabic Supplement} &0750 .. &077F
{Thaana} &0780 .. &07BF
{NKo} &07C0 .. &07FF
{Samaritan} &0800 .. &083F
{Devanagari} &0900 .. &097F
{Bengali} &0980 .. &09FF
{Gurmukhi} &0A00 .. &0A7F
{Gujarati} &0A80 .. &0AFF
{Oriya} &0B00 .. &0B7F
{Tamil} &0B80 .. &0BFF
{Telugu} &0C00 .. &0C7F
{Kannada} &0C80 .. &0CFF
{Malayalam} &0D00 .. &0D7F
{Sinhala} &0D80 .. &0DFF
{Thai} &0E00 .. &0E7F
{Lao} &0E80 .. &0EFF
{Tibetan} &0F00 .. &0FFF
{Myanmar} &1000 .. &109F
{Georgian} &10A0 .. &10FF
{Hangul Jamo} &1100 .. &11FF
{Ethiopic} &1200 .. &137F
{Ethiopic Supplement} &1380 .. &139F
{Cherokee} &13A0 .. &13FF
{Unified Canadian Aboriginal Syllabics} &1400 .. &167F
{Ogham} &1680 .. &169F
{Runic} &16A0 .. &16FF
{Tagalog} &1700 .. &171F
{Hanunoo} &1720 .. &173F
{Buhid} &1740 .. &175F
{Tagbanwa} &1760 .. &177F
{Khmer} &1780 .. &17FF
{Mongolian} &1800 .. &18AF
{Unified Canadian Aboriginal Syllabics Extended} &18B0 .. &18FF
{Limbu} &1900 .. &194F
{Tai Le} &1950 .. &197F
{New Tai Lue} &1980 .. &19DF
{Khmer Symbols} &19E0 .. &19FF
{Buginese} &1A00 .. &1A1F
{Tai Tham} &1A20 .. &1AAF
{Balinese} &1B00 .. &1B7F
{Sundanese} &1B80 .. &1BBF
{Lepcha} &1C00 .. &1C4F
{Ol Chiki} &1C50 .. &1C7F
{Vedic Extensions} &1CD0 .. &1CFF
{Phonetic Extensions} &1D00 .. &1D7F
{Phonetic Extensions Supplement} &1D80 .. &1DBF
{Combining Diacritical Marks Supplement} &1DC0 .. &1DFF
{Latin Extended Additional} &1E00 .. &1EFF
{Greek Extended} &1F00 .. &1FFF
{General Punctuation} &2000 .. &206F
{Superscripts and Subscripts} &2070 .. &209F
{Currency Symbols} &20A0 .. &20CF
{Combining Diacritical Marks for Symbols} &20D0 .. &20FF
{Letterlike Symbols} &2100 .. &214F
{Number Forms} &2150 .. &218F
{Arrows} &2190 .. &21FF
{Mathematical Operators} &2200 .. &22FF
{Miscellaneous Technical} &2300 .. &23FF
{Control Pictures} &2400 .. &243F
{Optical Character Recognition} &2440 .. &245F
{Enclosed Alphanumerics} &2460 .. &24FF
{Box Drawing} &2500 .. &257F
{Block Elements} &2580 .. &259F
{Geometric Shapes} &25A0 .. &25FF
{Miscellaneous Symbols} &2600 .. &26FF
{Dingbats} &2700 .. &27BF
{Miscellaneous Mathematical Symbols-A} &27C0 .. &27EF
{Supplemental Arrows-A} &27F0 .. &27FF
{Braille Patterns} &2800 .. &28FF
{Supplemental Arrows-B} &2900 .. &297F
{Miscellaneous Mathematical Symbols-B} &2980 .. &29FF
{Supplemental Mathematical Operators} &2A00 .. &2AFF
{Miscellaneous Symbols and Arrows} &2B00 .. &2BFF
{Glagolitic} &2C00 .. &2C5F
{Latin Extended-C} &2C60 .. &2C7F
{Coptic} &2C80 .. &2CFF
{Georgian Supplement} &2D00 .. &2D2F
{Tifinagh} &2D30 .. &2D7F
{Ethiopic Extended} &2D80 .. &2DDF
{Cyrillic Extended-A} &2DE0 .. &2DFF
{Supplemental Punctuation} &2E00 .. &2E7F
{CJK Radicals Supplement} &2E80 .. &2EFF
{Kangxi Radicals} &2F00 .. &2FDF
{Ideographic Description Characters} &2FF0 .. &2FFF
{CJK Symbols and Punctuation} &3000 .. &303F
{Hiragana} &3040 .. &309F
{Katakana} &30A0 .. &30FF
{Bopomofo} &3100 .. &312F
{Hangul Compatibility Jamo} &3130 .. &318F
{Kanbun} &3190 .. &319F
{Bopomofo Extended} &31A0 .. &31BF
{CJK Strokes} &31C0 .. &31EF
{Katakana Phonetic Extensions} &31F0 .. &31FF
{Enclosed CJK Letters and Months} &3200 .. &32FF
{CJK Compatibility} &3300 .. &33FF
{CJK Unified Ideographs Extension A} &3400 .. &4DBF
{Yijing Hexagram Symbols} &4DC0 .. &4DFF
{CJK Unified Ideographs} &4E00 .. &9FFF
{Yi Syllables} &A000 .. &A48F
{Yi Radicals} &A490 .. &A4CF
{Lisu} &A4D0 .. &A4FF
{Vai} &A500 .. &A63F
{Cyrillic Extended-B} &A640 .. &A69F
{Bamum} &A6A0 .. &A6FF
{Modifier Tone Letters} &A700 .. &A71F
{Latin Extended-D} &A720 .. &A7FF
{Syloti Nagri} &A800 .. &A82F
{Common Indic Number Forms} &A830 .. &A83F
{Phags-pa} &A840 .. &A87F
{Saurashtra} &A880 .. &A8DF
{Devanagari Extended} &A8E0 .. &A8FF
{Kayah Li} &A900 .. &A92F
{Rejang} &A930 .. &A95F
{Hangul Jamo Extended-A} &A960 .. &A97F
{Javanese} &A980 .. &A9DF
{Cham} &AA00 .. &AA5F
{Myanmar Extended-A} &AA60 .. &AA7F
{Tai Viet} &AA80 .. &AADF
{Meetei Mayek} &ABC0 .. &ABFF
{Hangul Syllables} &AC00 .. &D7AF
{Hangul Jamo Extended-B} &D7B0 .. &D7FF
{Private Use Area} &E000 .. &F8FF
{CJK Compatibility Ideographs} &F900 .. &FAFF
{Alphabetic Presentation Forms} &FB00 .. &FB4F
{Arabic Presentation Forms-A} &FB50 .. &FDFF
{Variation Selectors} &FE00 .. &FE0F
{Vertical Forms} &FE10 .. &FE1F
{Combining Half Marks} &FE20 .. &FE2F
{CJK Compatibility Forms} &FE30 .. &FE4F
{Small Form Variants} &FE50 .. &FE6F
{Arabic Presentation Forms-B} &FE70 .. &FEFF
{Halfwidth and Fullwidth Forms} &FF00 .. &FFEF
Miscellaneous Character Sets
Name Codepoints Description
{All Valid} &01 .. &D7FF, &E000 .. &FFEF   The {All Valid} character set contains
every valid character in the Basic Multilingual Plane of the Unicode Character Set.
This includes the characters from &01 to &D7FF and from &E000 to &FFEF.
Please note that this set also includes whitespace and control characters that are
rarely used in programming languages. It is only available in versions 2.6 and
later of the Builder.
{ANSI Mapped} &0152, &0153, &0160, &0161, &0178, &017D, &017E, &0192, &02C6,
&02DC, &2013, &2014, &2018 .. &201A, &201C .. &201E, &2020 .. &2022, &2026, &2030,
&2039, &203A, &20AC, &2122 This set contains the characters between 128 and 159
that have different values in Unicode. The set is only available in versions 2.0.6
and later of the Builder.
{ANSI Printable} &20 .. &7E, &A0 .. &FF This set contains all printable
characters available in ANSI. Essentially, this is a union of {Printable},
{Printable Extended} and {ANSI Mapped}. The set is only available in versions
2.0.6 and later of the Builder.
{Control Codes} &00 .. &1F, &7F .. &9F This set includes the characters from 1
to 31 and from 127 to 159. It is only available in versions 2.6 and later of the
Builder.
{Euro Sign} &20AC The Euro Currency Sign.
{Formatting} &200B .. &200F, &202A .. &202E, &2060 .. &2064, &206A .. &206F,
&FEFF, &FFF9 .. &FFFB Unicode formatting characters.
Character Mapping
The following is the mapping between the Windows-1252 characters between 128 and
159 and Unicode values. If the Character Mapping Property is set to Windows-1252,
the system will populate the Windows-1252 values.
Shift-Reduce Conflict
Overview
The Shift-Reduce Conflict is the most common type of conflict found in grammars. It
is caused when the grammar allows a rule to be reduced for a particular token while,
at the same time, allowing another rule to be shifted for that same token. As a
result, the grammar is ambiguous, since a program can be interpreted more than one
way. This error is often caused by recursive grammar definitions where the system
cannot determine when one rule is complete and another is just started.
When a token is read by the LALR algorithm, it looks up the token in the current
state and then performs the associated action. If an entry exists with a "shift"
action, the system will push the token onto an internal stack and jump to the
specified state. If a "reduce" action is found, the associated rule is reduced and
passed to the developer. If the token is not found in the state, a syntax error
occurs.
Naturally, there can only be one action for any given state and token. For example,
if a programming language contains a terminal for the reserved word "while", only
one entry for "while" can exist in the state. A shift-reduce conflict, therefore,
is caused when the system does not know whether to "shift" or "reduce" for a given
token.
Solutions
Versions 2.4 and later of the GOLD Parser Builder automatically resolve Shift-Reduce
Conflicts by not "reducing" when doing so would cause a conflict. This is the same
behavior as the YACC compiler-compiler. It is best to closely check these states to
determine if the "shift" is the action wanted.
The error is commonly caused by ambiguous grammars. This may not be the fault of
the language being defined. Modifying the grammar can ultimately resolve the
conflict. For an example of how to resolve Shift-Reduce conflicts, please see the
If Statement Example Page.
Reduce-Reduce Conflict
Overview
A Reduce-Reduce error is caused when a grammar allows two or more different rules
to be reduced at the same time, for the same token. When this happens, the grammar
becomes ambiguous since a program can be interpreted more than one way. This error
can be caused when the same rule is reached by more than one path.
When a token is read by the LALR algorithm, it looks up the token in the current
state and then performs the associated action. If an entry exists with a "shift"
action, the system will push the token onto an internal stack and jump to the
specified state. If a "reduce" action is found, the associated rule is reduced and
passed to the developer. If the token is not found in the state, a syntax error
occurs.
Naturally, there can only be one action for any given state and token. For example,
if a programming language contains a terminal for the reserved word "while", only
one entry for "while" can exist in the state. A reduce-reduce conflict, therefore,
is caused when the system cannot determine which of two or more rules to reduce for
a given token.
Glossary
Backus-Naur Form   Backus-Naur Form is a notation used to describe the syntax of
programming languages. In particular, it is used to define productions.
Configuration In parsing terms, a configuration is a production in the process
of being completed. Configurations play a major role in the construction of LR
parse tables.
Deterministic Finite
Automata   A Deterministic Finite Automaton is often used to analyze a series of
characters. Often it is implemented using a state-driven transition graph. Please
see the Deterministic Finite Automata page for more information.
Grammar   Please see the Grammar and Backus-Naur Form page.
LALR Parsing   LALR Parsing, or "Lookahead LR parsing", is a variant of LR
Parsing that combines different "configurations" to limit the size of the parse
tables. As a result, the algorithm is slightly less powerful than LR Parsing.
Grammars that can be parsed by an LR parser might be found to be "ambiguous" by
LALR. However, this is very rarely the case and real-world examples are few.
The number of states eliminated by LALR is sometimes huge. The C programming
language, for instance, has over 10,000 LR states. LALR drops this number to
slightly more than 200.
LR Parsing
LR Parsing, or "left-to-right, rightmost-derivation" parsing, uses tables to
determine when a rule is complete and when additional tokens must be read from the
source. The LR Parser does very little "thinking" at runtime; all decisions are
based on the content of the parse tables. The construction of these tables is where
all the "thinking" takes place. LR parser generators, such as YACC and GOLD,
construct these tables by analyzing the grammar and determining all the possible
"states" the system can be in when parsing.
Each state represents a point in the parse process where a number of tokens have
been read from the source and rules are in different states of completion. Each
production in a state of completion is called a "configuration" and each state is
really a configuration set.
When a LALR parser generator is analyzing a grammar and constructing the parse
tables, these conflicts are located immediately.
Semantics   Consider the following expression, which is valid in both C++ and
Visual Basic:
a & b
Even though the syntax is the same between the two languages, the semantics are
quite different. In C++, this expression is interpreted as a binary-and of "a" and
"b". In Visual Basic, however, this expression returns the concatenation of two
strings.
Symbol In parsing terms, a symbol is the building block of a grammar and can
be either a terminal or nonterminal. Essentially, the term "symbol" is used to
refer to either of its two forms.
Syntax The term "syntax" refers to the structure of a programming language, in
particular, the different series of symbols and words that make up the basic parts
of the language. This is quite different from the programming language's semantics
- the actual meaning of those different parts.
Shift   A "shift" is an action performed by a parsing engine when it reads a token
that is valid in the current state. Lookahead parsers maintain a list of terminals
which are expected to be read from each state, given that the syntax of the program
is correct. If the token is not expected, a syntax error occurs.
In the bottom-up parser engines, such as GOLD and YACC, a shift pushes the token
onto an internal stack that is used to hold tokens that are being used to construct
completed productions.
Shift-Reduce
Conflict   The Shift-Reduce Conflict is the most common type of conflict found in
grammars. It is caused when the grammar allows a production to be reduced for a
particular token while, at the same time, allowing another rule to be shifted for
that same token.
As a result, the grammar is ambiguous since a program can be interpreted more than
one way.
This error is often caused by recursive grammar definitions where the system cannot
determine when one rule is complete and another is just started. The Builder
documentation contains an example of the common if-then-else grammar problem and
how to fix it.
Line-Based Grammars
To make the end of a line significant, the grammar must be able to recognize the
Newline as a terminal rather than simply considering it whitespace. The characters
used to represent a newline differ between computer platforms. The Windows
operating system uses the combination of a Carriage Return followed by a Line Feed;
UNIX, on the other hand, merely uses the Line Feed. The definition of a Newline
terminal must take this into account.
In the grammars below, a Newline terminal is declared to accept both forms: a
Carriage Return followed by a Line Feed, and a solitary Line Feed. It may also be
advisable to recognize a solitary Carriage Return as a Newline terminal (for fault
tolerance). The Whitespace terminal must also be declared such that it does not
accept the newline characters as whitespace. Below, Whitespace is declared as a
series of the normal whitespace characters without the Carriage Return and Line
Feed.
{WS}       = {Whitespace} - {CR} - {LF}
Whitespace = {WS}+
NewLine    = {CR}{LF} | {LF}
A better solution is to use a rule that accepts NewLines rather than using the
NewLine terminal at the end of each statement.
The following solution replaces each NewLine with a new rule called <nl> - for
NewLines. The <nl> rule is designed to accept one or more NewLine tokens. This
solution makes it far easier to write complex line-based grammars. Each line is now
logically followed by one or more NewLines rather than just one. The rule that
accepted a blank line as a statement is no longer needed.
However, since NewLine characters are only acceptable following a statement, any
blank lines before the start of the program must be removed. In the grammar below,
the <nl opt> rule removes any NewLines before the start of the first actual line.
{WS}       = {Whitespace} - {CR} - {LF}
Whitespace = {WS}+
NewLine    = {CR}{LF} | {LF}
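A sketch of the two rules described above; the empty handle makes <nl opt>
nullable:

<nl opt> ::= NewLine <nl opt>
           |

<nl>     ::= NewLine <nl>
           | NewLine

Each statement-terminating NewLine in the rules is then replaced by <nl>, and the
program's first line is preceded by <nl opt>.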
Example: Strings
The following example declares the string terminal commonly used in programming
languages. It is declared as a series of zero or more printable characters, not
including the double quotes used as delimiters.
The following is a much more complex definition for strings, which implements an
escape-character approach. The backslash character '\' acts as an override and can
be used to represent the double-quote character. Essentially, the string terminal
is now a series of printable characters (not including the double-quote and
backslash) together with any printable character preceded by the backslash.
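A sketch of this version, redefining the character set to also exclude the
backslash and allowing a backslash followed by any printable character:

{String Chars} = {Printable} - ["\]
StringLiteral  = '"' ( {String Chars} | '\' {Printable} )* '"'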