-
Notifications
You must be signed in to change notification settings - Fork 6
Rule Versioning
Rule versioning refers to a feature in ANTLR 4 where Java code may declare explicit dependencies on grammar rules. Each declared dependency is annotated with a version number; when a version number does not match the actual version of a rule in the grammar then the Java code depending on the rule may not work. The feature is currently implemented in pull request #490: Rule versioning artifact and in the unofficial sharwell/optimized fork of ANTLR 4.
Note by Sam Harwell: Rule versioning has proven to be an essential productivity aid during the development of ANTLRWorks 2 and GoWorks. I highly recommend all projects considering the use of ANTLR 4 take full advantage of this feature from the start of the project to reduce the risk of regression bugs and improve long-term maintainability of a code base.
Large applications using complicated grammars, such as compilers or IDEs, frequently face productivity bottlenecks when changes to the core grammars are required.
- Changes to the language specification must be implemented
- An opportunity to alter the grammar for performance or clarity purposes is identified
- Evaluation of proposed changes to the language within a fully-functional environment
When a rule in the grammar is altered, all code affected by the change must be updated to reflect the
new form. In some applications, the Find Usages feature may be used to identify code either calling
the rule method or using the return value of the method (rule foo
could returns an instance of
foo_return
in ANTLR 3 or FooContext
in ANTLR 4). However, this is not sufficient for identifying
all dependent code. Consider the following code block from the code completion feature in
nFringe (using the CSharp3 target for ANTLR 3).
if (tree.ChildCount == 2) {
CommonTree decltree = tree.GetChild(1) as CommonTree;
if (decltree != null) {
CommonToken token = decltree.Token as CommonToken;
if (token != null && !string.IsNullOrEmpty(token.Text))
return declarations.Where(decl => decl.InsertText.Equals(token.Text, StringComparison.OrdinalIgnoreCase));
}
}
In this code, tree
is a CommonTree
created for a token of type DEFAULT
. The code handles
the tree produced by the following rule, which handles UnrealScript expressions like
ClassName.default.Property
.
postfix_expression
: ( root_expression
-> root_expression
)
( DOT DEFAULT DOT primary_expression
-> ^(DEFAULT $postfix_expression primary_expression)
)*
;
The nFringe code does not use a variable of type postfix_expression_return
, yet makes several
assumptions about the tree produced by this rule.
- The tree contains exactly two children.
- The
Token
property of the second child contains the name of the referenced default property, i.e."Property"
for the above UnrealScript expression.
If the grammar were modified to include the DOT
tokens in the AST (perhaps for assistance in
some form of error reporting), the code completion feature would stop working for these expressions
until the code was updated. Over many years of work on nFringe, these implicit dependencies have
resulted in many problems.
- Uses of properties (e.g.
ChildCount
) and methods (e.g.GetChild
) from the ANTLR 3 trees API:- Appear frequently,
- Are difficult to trace back to particular locations in the grammar,
- Successfully compile whether or not the implied dependencies are valid.
- Broken features rarely result in build failures, and often appear only in infrequent "edge-cases".
- Tracing a broken feature back to the original source is difficult.
- Fixing code affecting a broken feature often has other implicit dependencies, and any alteration often results in a different edge-case breaking.
The approach to rule versioning chosen for this implementation involves the following changes to
ANTLR. The features are disabled by default, and included by specifying the -rule-versioning
command line argument to the ANTLR Tool when generating parsers from a grammar.
- Each generated rule method is annotated
@RuleVersion(number)
. The version number defaults to 0 when not explicitly stated in the grammar. - The
@RuleDependency
annotation can be added to types, methods, and/or fields in Java code to declare dependencies. The@RuleDependencies
annotation provides a way to include multiple dependency annotations on a single code element. - The
RuleDependencyProcessor
annotation processor validates declared rule dependencies at compile time. The processor registered in the classpath by adding a file to theMETA-INF/services
folder inantlr4-dependency-analysis.jar
, allowing Javac to automatically enable the validation process during compilation.
In its simplest form, declared dependencies specify a rule and a version number.
Note: Due to an implementation limitation in
RuleDependencyProcessor
, dependencies must currently specify arecognizer
as well. If the related StackOverflow question produces a usable answer, this requirement may be removed in a future release. For the purposes of this article, therecognizer
is always omitted from theRuleDependency
annotations for brevity, since it is always just the declaring type of therule
constant.
fieldDef : typeRef id;
primaryExpr : id | INT;
typeRef : id;
id : IDENTIFIER;
One Java method depending on the id
rule might look like the following.
@RuleDependencies({
@RuleDependency(rule=RULE_id, version=0),
@RuleDependency(rule=RULE_fieldDef, version=0),
})
boolean isDefinition(IdContext id) {
return id.getParent() instanceof FieldDefContext;
}
The isDefinition
method makes the following assumptions about the parse tree produced by the
grammar rules.
- The
id
rule in the grammar is part of a definition if the parent context is a certain type of context. If a new rule were added to the grammar in the future, e.g. amethodDef
rule, the change to the parents of theid
rule could result inisDefinition
returning an incorrect result. - The
fieldDef
rule in the grammar invokes theid
rule. If thefieldDef
rule is modified in the future to not directly call theid
rule, theisDefinition
rule may produce an incorrect result.
While reviewing the XPath Axes, it became apparent that many
complicated rule dependencies could easily be expressed by providing support for specifying
dependency type(s) in the @RuleDependency
annotation. The most frequently encountered dependency
types can be expressed as follows.
- Self: the code depends on the specified rule.
- Parents: the code depends on the parents of the specified rule. This can also be seen as code depending on the rule in one level of context.
- Descendants: the code depends on all potential descendants of the specified rule.
- Ancestors: the code depends on all potential ancestors of the specified rule. This can also be seen as code depending on the rule with its complete context.
The default dependency type is Parents, matching the most frequent case encountered in code
using the initial implementation. Note that the Self dependency type is always implied, whether
or not the particular @RuleDependency
annotation specifies a set of dependency types. The
Dependents.SELF
enumeration value is provided so @RuleDependency
annotations which clearly
depend only on the rule itself can be documented as such. Consider the following grammar rule.
id : Identifier ('.' Identifier)*;
The following code shows an example of proper use of Dependents.SELF
.
@RuleDependency(rule=RULE_id, dependents=Dependents.SELF)
boolean isQualified(IdContext ctx) {
return ctx.Identifier().size() > 1;
}
All dependents other than Self are considered linked dependencies. Proper validation of linked dependencies requires rule version numbers to satisfy particular conditions. The following rules for managing version numbers in a grammar ensure that linked dependencies will properly validate at compile time.
-
When a rule is added to the grammar, it is given a version number 1 greater than the maximum version number of any rule previously in the grammar.
-
When a non-trivial change is made to an existing rule, it is assigned a new version number 1 greater than the maximum version number of any rule previously in the grammar. If the changed rule no longer invokes a rule which was previously invoked by the rule, the update should be performed in two steps to ensure all linked dependencies are validated. For example,
qualifiedName : id ('.' id)*;
First becomes the following. Note that the guard clause only includes
id
once, even ifid
can appear multiple times in the previous version of the rule.qualifiedName @version{1} : Identifier ('.' Identifier)* | // guard clause for dependencies linked to qualifiedName through id. {false}? id ;
After examining the affected rule dependencies, the guard clause can be removed. If the two steps are performed in succession, the rule version number does not need to be updated a second time. In addition, removing the guard clause may invalidate some rule dependency annotations in code, but if these two steps are performed in succession then the affected dependency annotations can be updated to match the versions reported by the annotation processor without rechecking the code again. After this step, the
qualifiedName
rule would be reduced to the following.qualifiedName @version{1} : Identifier ('.' Identifier)* ;
-
Before removing a rule, the name of the rule should be changed and the renamed rule should be given a version number 1 greater than the maximum version number of any rule previously in the grammar (e.g.
foo
could be renamed tofoo_removed
before deleting it). This two-step process provides validation for certain linked dependencies which may not be detected when a rule is simply deleted.
-
Methods taking a context as input (e.g. as a method argument) should include
PARENTS
at a minimum unless there is absolutely no way the behavior can be affected by later reusing the rule in another context. When no dependents are explicitly specified for a@RuleDependency
annotation,PARENTS
is included by default. -
Code which enumerates the ancestors of a rule context (or calls a
findAncestor
method, etc.) should includeANCESTORS
in the dependents for that rule dependency. -
Code which enumerates the descendents of a rule context (or calls a
findDescendant
method, etc.) should includeDESCENDANTS
in the dependents for that rule dependency. -
Code which accesses tokens indirectly matched by a rule (e.g. by calling the
getStart()
on the context for a grammar rule which contains at least one outer alternative with a non-terminal on the left edge) should includeDESCENDANTS
. -
Any rule context accessed by the method directly (e.g. via a generated accessor method) or indirectly (e.g. via the
getChild(n)
method) should declareSELF
at a minimum, but the default ofPARENTS
is generally a safe bet.
The rule versioning feature assumes that all code which directly depends on the grammar properly declares explicit rule dependencies. Any implicit dependencies which are not declared may result in program failures if the grammar is modified.
Rule version numbers should always monotonically increase across the entire grammar. For example,
suppose a change was made to the bar
rule of the following grammar.
foo
@version{2}
: 'x';
bar
@version{1}
: 'y';
After the change, the bar
rule should be marked at version 3, which is greater than the highest
version number previously appearing in the grammar. If the bar
rule was instead only incremented
to version 2 (greater than the previous version of the bar
rule, but not greater than the
highest version appearing in the grammar), linked dependencies in code may not be properly invalidated.
The initial implementation of rule versioning provided a minimal set of features. While the implementation was simpler and the API smaller, it soon became clear that this approach did not meet the needs of large applications. In this implementation, each rule was given a version number and a dependency consisted of a link from a code element to a particular version of a single rule.
This approach required particular attention when changes were made to the grammar. For example, consider the following grammar segment (repeated from above).
fieldDef : typeRef id;
primaryExpr : id | INT;
typeRef : id;
id : IDENTIFIER;
One Java method depending on the id
rule might look like the following.
@RuleDependencies({
@RuleDependency(rule=RULE_id, version=0),
@RuleDependency(rule=RULE_fieldDef, version=0),
})
boolean isDefinition(IdContext id) {
return id.getParent() instanceof FieldDefContext;
}
Now suppose the grammar is modified by the addition of a methodDef
rule.
methodDef : typeRef id '(' ')';
fieldDef : typeRef id;
primaryExpr : id | INT;
typeRef : id;
id : IDENTIFIER;
In this case, no change was made to the fieldDef
or id
rules in the grammar, yet the
isDefinition
method will no longer return the correct result for identifiers within a
methodDef
. The following rule-of-thumb became apparent.
In general, code which depends on a grammar rule tends to make assumptions about the immediate context of the rule.
The following practices provided a manual solution to this problem.
- When a rule is added to the grammar,
- Leave the new rule at the default version 0
- Increment the version numbers of each rule invoked by the new rule, e.g. adding
methodDef
above requires incrementing the version number of thetypeRef
andid
rules.
- When a rule is removed from the grammar, increment the version numbers of each rule
invoked by the rule being removed, e.g. removing
typeRef
above would require incrementing the version number of theid
rule. - When a rule is changed,
- Increment the version number of the changed rule
- Increment the version numbers of each rule which is invoked by the updated rule but was not previously invoked by that rule
- Increment the version numbers of each rule which was previously invoked by the rule but is no longer invoked by the updated rule
- Optionally increment the version numbers of rules previously invoked by the rule and still invoked by the updated rule (doing so appeared safer yet led to more false-positives during the compilation process)
When the dependency validator reported compilation errors due to a change in a rule version,
the code marked with the @RuleDependency
annotation should be reviewed and updated when
necessary to behave correctly with the updated grammar.
According to these rules, the updated grammar after adding the methodDef
rule might look
like the following.
methodDef : typeRef id '(' ')';
fieldDef : typeRef id;
primaryExpr : id | INT;
typeRef @version{1} : id;
id @version{1} : IDENTIFIER;
The dependency validator would then report an error on the isDefinition
method, which
could be updated as follows.
@RuleDependencies({
@RuleDependency(rule=RULE_id, version=1), // this version was incremented
@RuleDependency(rule=RULE_fieldDef, version=0),
@RuleDependency(rule=RULE_methodDef, version=0), // this dependency was added
})
boolean isDefinition(IdContext id) {
// the expression now handles both definition kinds
return id.getParent() instanceof FieldDefContext
|| id.getParent() instanceof MethodDefContext;
}
At this point, most code dependencies could be declared in a straightforward manner, although
some implicit dependencies still could not be expressed. For example, statements like the following
implicitly depend on a chain of rule invocations. In particular, this code depends on the
ancestors of the some
rule in the grammar since a change to any rule between some
and other
could result in some
no longer appearing in the context of other
.
void checkContext(SomeContext ctx) {
assert ctx.getAncestor(OtherContext.class) != null;
}
Code which directly references the tokens matched by a rule implicitly depends on a chain of rule
invocations in the other direction. In particular, the following code depends on the descendants
of the some
rule in the grammar since direct access to the start and stop tokens makes assumptions
about any non-terminals invoked between some
and the terminals matched while parsing the rule.
void checkContext(SomeContext ctx) {
assert ctx.getStart() == ctx.getStop();
}