CPSC411: COMPILER CONSTRUCTION I
Author: Robin Cockett
Date: Jan 28th 2002 (modified 2014)
Lecture: 7 & 8
Title: Transforming grammars
Reading:
We may see the problem as follows: we are given a grammar G and we wish to transform it to a grammar G' while providing a translation T: G' -> G such that

                INPUT
                /    \
        parse  /      \  parse
              /        \
            \/          \/
            G' ----------> G
                    T

commutes, in the sense that parsing an input for G' and then translating the parse tree into one for G is the same as parsing straight into G. In fact, furthermore, it is desirable for the translation T to be given by a linear time "fold" (an attribute grammar). As we shall see, we can do a number of useful transformation steps in the direction of obtaining an LL(1) grammar. These steps almost amount to a procedure for generating an equivalent LL(1) grammar (if one exists) -- the one aspect we shall not discuss is how to eliminate the follow set clashes.
The basic methods for transforming grammars which we shall look at are:

Testing for cycles:

Define x << y if and only if there is a production x -> w y w' where w and w' are nullable. The grammar has no cycles in case the transitive closure of this relation is irreflexive.
Question:  Can you explain why this works? 
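To make the test concrete, here is a small Python sketch (the dict-of-lists grammar encoding, with uppercase terminals and lowercase non-terminals, is my own convention, not the notes'): compute the nullable non-terminals, build the << relation, take its transitive closure, and look for a reflexive pair.

```python
def nullables(g):
    """Nonterminals that can derive the empty string (least fixed point)."""
    null = set()
    changed = True
    while changed:
        changed = False
        for x, prods in g.items():
            if x not in null and any(all(s in null for s in p) for p in prods):
                null.add(x)
                changed = True
    return null

def cycle_relation(g):
    """All pairs (x, y) with x << y: x -> w y w' with w and w' nullable."""
    null = nullables(g)
    rel = set()
    for x, prods in g.items():
        for p in prods:
            for i, y in enumerate(p):
                if (y in g
                        and all(s in null for s in p[:i])
                        and all(s in null for s in p[i + 1:])):
                    rel.add((x, y))
    return rel

def has_cycle(g):
    """Transitive closure of <<, then look for a reflexive pair."""
    rel = cycle_relation(g)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            for (c, d) in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return any(a == b for (a, b) in rel)

# The s/a/b grammar of Example 2 below (b -> s | epsilon) is cyclic:
grammar = {
    "s": [["A"], ["a"]],
    "a": [["B"], ["b"]],
    "b": [["s"], []],
}
print(has_cycle(grammar))   # True
```

Here s << a << b << s, so the closure relates s to itself and the grammar has a cycle.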
    
Testing for null ambiguity:

A grammar is null unambiguous if and only if each non-terminal has at most one nullable production. It is as easy as this!

If a grammar is not null unambiguous it must be ambiguous, so it is not of great interest in parsing applications. In particular, this means there is no chance it can be turned into an LL(1) grammar. It is possible, however, to make a grammar which recognizes the same language and is null unambiguous (this is essentially done using \epsilon-separation, see below) ... the equivalence at the level of parse trees, however, will be lost.
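The test is easy to code. A sketch (same dict-of-lists grammar encoding as before, which is my own convention): count, for each non-terminal, how many of its productions are nullable.

```python
def nullables(g):
    """Nonterminals that can derive the empty string (least fixed point)."""
    null = set()
    changed = True
    while changed:
        changed = False
        for x, prods in g.items():
            if x not in null and any(all(s in null for s in p) for p in prods):
                null.add(x)
                changed = True
    return null

def null_unambiguous(g):
    """Each nonterminal must have at most one nullable production."""
    null = nullables(g)
    return all(
        sum(1 for p in prods if all(s in null for s in p)) <= 1
        for prods in g.values()
    )

# x -> y | z with y -> epsilon and z -> epsilon: both productions of x are
# nullable, so the empty string has two parses from x.
print(null_unambiguous({"x": [["y"], ["z"]], "y": [[]], "z": [[]]}))   # False
print(null_unambiguous({"x": [["A"], []]}))                            # True
```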
    
      Removing left-recursion requires a number of steps which we now
      describe in some detail.  The algorithm starts by isolating
      where the "problem areas" are in the grammar.   This is
      done by partially ordering the nonterminals of the grammar and
      classifying the productions as either being "good" or "bad": the
      aim is then to focus on the bad productions and transform them
      into good productions. 
A feature of this algorithm is that it mimics how one might approach the problem by hand: you would inspect the grammar and choose an ordering on the non-terminals which causes the minimal amount of damage to the grammar. The algorithm does this by developing the partial order on the non-terminals as it is needed.
        
A production r, where pr(r) = x -> \alpha, is good in case whenever \alpha = w y w' with w nullable and y a non-terminal, then x < y.

A non-terminal x is good if all its associated productions x -> \alpha_1 | ... | \alpha_n are good. Once a non-terminal is good, in this sense, you need not change its productions again while transforming to remove left recursion ...
    
OBSERVATION:
A grammar has no left-recursion if and only if each nonterminal (or equivalently every production) is good with respect to some partial order on the nonterminals of the grammar.

Proof.
Suppose we have a derivation a =*=> a w. Then there must be a series of one step rules

a => w_1' a_1 w_1
a_1 => w_2' a_2 w_2
....
a_n => w_{n+1}' a w_{n+1}

where each w_i' is nullable. But when all rules are good this means a < a_1 < ... < a_n < a, so that a < a, which cannot be. Thus, when all nonterminals (or productions) are good, the grammar is not left-recursive.

Conversely, suppose we have a non-left-recursive grammar; then there is a relation induced on the non-terminals by demanding that each production be good: say x < y in case there is a production x -> w' y w where w' is nullable. Saying that the transitive reflexive closure of this relation is anti-symmetric is an equivalent way of defining what it means for the grammar to be not left-recursive!

The idea is then to transform the rules so that they all become good. Clearly this means that we need to modify the rules which fail to be good!
Most importantly one can create a partial order by specifying
      which elements are "better" than which others, x < y,
      (where  x<y if and only if x =< y and x =/= y) and then
      taking the transitive reflexive closure.   If you do
      this, however, you must be careful to check the relation is still
      anti-symmetric (i.e. that there are no cycles). 
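Putting these definitions together, here is a sketch which classifies productions as good or bad with respect to a chosen strict order (given as a set of pairs (x, y) meaning x < y; the grammar encoding is my own). It uses the grammar and ordering of the squeezing trace later in this lecture.

```python
def nullables(g):
    """Nonterminals that can derive the empty string (least fixed point)."""
    null = set()
    changed = True
    while changed:
        changed = False
        for x, prods in g.items():
            if x not in null and any(all(s in null for s in p) for p in prods):
                null.add(x)
                changed = True
    return null

def bad_productions(g, less):
    """A production x -> alpha is bad when some decomposition
    alpha = w y w' has w nullable, y a nonterminal, and not x < y."""
    null = nullables(g)
    bad = []
    for x, prods in g.items():
        for p in prods:
            for i, y in enumerate(p):
                if y in g and all(s in null for s in p[:i]) and (x, y) not in less:
                    bad.append((x, p))
                    break
    return bad

# The a/b/c/e grammar traced later, with the order chosen there:
grammar = {
    "a": [["b", "C"], ["c", "D"]],
    "b": [["e", "a", "E"], ["c", "B"]],
    "c": [["A"]],
    "e": [["F", "e"], []],
}
less = {("a", "b"), ("a", "c"), ("b", "e"), ("b", "c")}
print(bad_productions(grammar, less))   # [('b', ['e', 'a', 'E'])]
```

The only bad production is b -> e a E: because e is nullable, a can appear leftmost, and a is not above b in the order.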
    
A non-terminal x is almost good in case its productions have the form

x -> x \alpha_1 | ... | x \alpha_r | \beta_1 | ... | \beta_n

where each x -> \beta_i is good (note: each \alpha_i must be non-null as the grammar can have no cycles).
OBSERVATION:
A reachable, realizable grammar G with a total order on its nonterminals satisfies at least one of the following:
- all productions are good;
- there is an improvable production;
- there is an almost good non-terminal (that is, an immediate left recursion);
- there is a hidden immediate left recursion.

Proof.
Consider the least nonterminal: the only way one of its rules can fail to be good or improvable is if there is a production x -> w' x w where w' is nullable, which means it has a hidden left recursion.

Suppose now that x is the smallest nonterminal which is not good, that is, all y with y < x have all their productions good. Then any production which is not good fails to be good because it is of the form

x -> w' y w

where w' is nullable and y =< x. If y < x then y must be good; thus we would have found an improvable production. If this does not happen then y = x, but this makes the production squeezable or an immediate left recursion!

NOTE: A grammar with no null-productions will never require squeezing! However, if only squeezing is left, then for every y in w', in a production x -> w' x w with a hidden left recursion on x, we must have x < y as otherwise we would have a source of improvement. As we shall see shortly this can happen.
Example 1: Consider the grammar:

expr -> expr ADD term     --P_1
      | expr SUB term     --P_2
      | term              --P_3
term -> term MUL factor   --P_4
      | term DIV factor   --P_5
      | factor            --P_6
factor -> LPAR expr RPAR  --P_7
        | NUM             --P_8
Whatever partial order we put on the nonterminals notice that P_7 and P_8 will be good so that factor is good. To make expr almost good we must suppose expr < term. Finally, to make term almost good we must assume that term < factor.
Note how we choose the partial order to suit our grammar .... and, indeed, the strategy always should be to choose a partial order which makes as much as possible good!
Example 2: Consider the grammar:

s -> A   --R_1
   | a   --R_2
a -> B   --R_3
   | b   --R_4
b -> s   --R_5
   |     --R_6
Here we can make s < a, a < b but now we have to leave R_5 as an improvable production. In fact, whichever way we choose the partial order, we will always have to leave one production as improvable (because there is a left recursion).
Example 3: Consider the grammar:

s -> s s s B   --S_1
   |           --S_2
Here the start symbol s is almost good.
Example 5: Consider the grammars:

s -> s s s       Grammar (a)

s -> s           Grammar (b)

s -> s B         Grammar (c)

s -> a b C
a -> A |
b -> a b C | B   Grammar (d)
Here in (a) the start symbol is almost good while in (b) it is not and in (c) it is almost good. Exercise: what is wrong with the first three grammars? Work through (d) as an exercise!
(A) Improving an improvable production:

x -> \gamma y \alpha'        y -> \beta_1 | ... | \beta_n
---------------------------------------------------------- (y good, \gamma nullable)
x -> \gamma \beta_1 \alpha' | ... | \gamma \beta_n \alpha'

(B) Immediate left-recursion removal for an almost good nonterminal:

x -> x \alpha_1 | ... | x \alpha_n | \beta_1 | ... | \beta_m
------------------------------------------------------------ (x -> \beta_j good)
x -> \beta_1 x' | ... | \beta_m x'
x' -> \alpha_1 x' | ... | \alpha_n x' | \epsilon
and we add the non-terminal x' to the partial order with x < x' whenever any of the \beta_j are nullable. The \beta_1,...,\beta_m are called the base cases ... if there are no base cases then there will be no productions x -> \beta_j x' and so x' will not be reachable (and tacitly we are assuming that all non-terminals are reachable ...) so that the whole production set can be removed.
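Transformation (B) is mechanical. A sketch in Python (the grammar encoding and the priming convention for the fresh nonterminal are my own); applying it to the left-recursive expression grammar reproduces the transformed grammar shown later in this lecture:

```python
def remove_immediate_left_recursion(g, x):
    """Transformation (B): x -> x a_1 | ... | x a_n | b_1 | ... | b_m
    becomes x -> b_j x' and x' -> a_i x' | epsilon."""
    alphas = [p[1:] for p in g[x] if p and p[0] == x]    # the x -> x alpha_i
    betas = [p for p in g[x] if not (p and p[0] == x)]   # the base cases
    xp = x + "'"
    g[x] = [b + [xp] for b in betas]                     # x -> beta_j x'
    g[xp] = [a + [xp] for a in alphas] + [[]]            # x' -> alpha_i x' |

grammar = {
    "expr": [["expr", "ADD", "term"], ["expr", "SUB", "term"], ["term"]],
    "term": [["term", "MUL", "factor"], ["term", "DIV", "factor"], ["factor"]],
    "factor": [["LPAR", "expr", "RPAR"], ["NUM"]],
}
remove_immediate_left_recursion(grammar, "expr")
remove_immediate_left_recursion(grammar, "term")
print(grammar["expr"])   # [['term', "expr'"]]
print(grammar["expr'"])  # [['ADD', 'term', "expr'"], ['SUB', 'term', "expr'"], []]
```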
We apply (A) in preference to (B), so that we start by improving all our rules (enlarging the preorder to make as much as possible good) until there is no choice but to transform an almost good nonterminal.
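Transformation (A) is just substitution. A minimal sketch (the grammar encoding and the index arguments are my own): replace production g[x][i] by its expansions along the productions of the nonterminal at position j. Applied to the production b -> e a E from the trace below, it produces exactly the improved productions shown there.

```python
def improve(g, x, i, j):
    """Transformation (A): expand the (good) nonterminal at position j of
    production g[x][i] along all of that nonterminal's productions."""
    p = g[x][i]
    y = p[j]
    g[x][i:i + 1] = [p[:j] + beta + p[j + 1:] for beta in g[y]]

# The a/b/c/e grammar traced below; improving b -> e a E by expanding a:
grammar = {
    "a": [["b", "C"], ["c", "D"]],
    "b": [["e", "a", "E"], ["c", "B"]],
    "c": [["A"]],
    "e": [["F", "e"], []],
}
improve(grammar, "b", 0, 1)
print(grammar["b"])   # [['e', 'b', 'C', 'E'], ['e', 'c', 'D', 'E'], ['c', 'B']]
```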
    
Only if these transformations (and
      enlargement of the partial order) do not make all the productions
      good do we resort to squeezing a rule with a hidden left
      recursion:
    
(C) Squeezing productions with hidden immediate left-recursions:

Let us trace through what these transformations do on an example which requires squeezing but not \epsilon-separation before we return to discuss this transformation in more detail:
    
a -> b C      good    a < b
   | c D      good    a < c
b -> e a E    bad     b < e
   | c B      good    b < c
c -> A        good
e -> F e      good
   |          good
By trying to partially order the non-terminals to make as many productions good as possible we may come up with the partial ordering above (note that e is nullable). There is just one rule which causes a problem. However, note that it is improvable as a is a good non-terminal. So we improve the bad production:
    
a -> b C        good    a < b
   | c D        good    a < c
b -> e b C E    bad     b < e   (hidden left recursion)
   | e c D E    good    b < c
   | c B        good
c -> A          good
e -> F e        good
   |            good
    
This improvement now exposes a hidden left recursion.  
      We must squeeze the non-terminal e in the bad rule.   As
      e is a good non-terminal we may expand it:
    
a -> b C         good    a < b
   | c D         good    a < c
b -> F e b C E   good
   | b C E       bad     (immediate left recursion)
   | e c D E     good    b < e, b < c
   | c B         good
c -> A           good
e -> F e         good
   |             good
    
Now we can remove the immediate left recursion.
    
a -> b C            good    a < b
   | c D            good    a < c
b -> F e b C E b'   good
   | e c D E b'     good    b < e, b < c
   | c B b'         good
b' -> C E b'        good
    |               good
c -> A              good
e -> F e            good
   |                good
    
As all the rules are now good we have removed the left recursion.
    
Returning to the expression grammar:

expr -> expr ADD term     --P_1
      | expr SUB term     --P_2
      | term              --P_3
term -> term MUL factor   --P_4
      | term DIV factor   --P_5
      | factor            --P_6
factor -> LPAR expr RPAR  --P_7
        | NUM             --P_8
Removing the immediate left recursions on expr and term gives:

expr -> term expr'         --P_1
expr' -> ADD term expr'    --P_1'
       | SUB term expr'    --P_2'
       |                   --P_3'
term -> factor term'       --P_4
term' -> MUL factor term'  --P_4'
       | DIV factor term'  --P_5'
       |                   --P_6'
factor -> LPAR expr RPAR   --P_7
        | NUM              --P_8
There is no need to add any relations to the order in this case
      as now all the productions are good ... so we have removed left
      recursion.
    
Example 3: Consider the grammar:

s -> A   --R_1
   | a   --R_2
a -> B   --R_3
   | b   --R_4
b -> s   --R_5
   |     --R_6
This gives a good example of the role of the partial order. Remember we can choose the partial order, and we shall try to do so in such a manner as to make as many of the productions as possible good (or almost good); after all, we do wish to have all the productions good eventually.

Now rather than saying outright what the partial order is, we actually develop it as we go. Thus we may reason as follows: in order to make s -> a a good production we shall make s < a; this also has the desirable effect of ensuring that s is a good non-terminal. Excellent! Next, to ensure that a -> b is a good production we shall make a < b; this will also ensure that a is a good non-terminal. Marvelous!

Now it is perhaps tempting to think we can always continue in this vein, making productions good by adding to the ordering. However, to see what goes WRONG, consider the attempt to now make b -> s good. To do this we have to demand that b < s ... but here we run into a problem, for s is already less than b! So we cannot make b < s. This is where the partial order begins to play an important role, for it is stopping us from having a cycle s -> a -> b -> s -> ... etc.

So, realizing that s < b, we conclude that b -> s is an improvable production. We may improve it by expanding to the productions b -> a and b -> A. But now, while b -> A is good, we notice that b -> a has the same problem as before, namely a is already less than b. Thus b -> a is improvable. Expanding this production gives b -> b and b -> B. This is VERY bad news! The production b -> b is an immediate cycle and its presence after the transformations means that our original grammar had cycles!! This means we can never turn this grammar into an LL(1) grammar, let alone into one with no left-recursion!

Thus our attempt to transform this grammar must be abandoned as hopeless. In retrospect, inspecting the grammar more carefully, this really is no surprise as it is cyclic! What is interesting is how the transformation procedure discovers this problem ... and this is a little reminder that one should check for cycles before beginning ...
Example 4: Consider the grammar:

s -> s s s B   --S_1
   |           --S_2
          
Here the start symbol s is almost good so we transform to:

s -> s'         --S'_1
s' -> s s B s'  --S'_2
    |           --S'_3
          
where we add s < s'. This makes s' -> s s B s' improvable to s' -> s' s B s', which makes s' almost good, so we may expand to obtain:

s -> s'
s' -> s''
s'' -> s B s' s''
     |
          
where we add s' < s''. This makes s'' -> s B s' s'' improvable in two steps to s'' -> s'' B s' s'', whence making s'' almost good ... so we obtain:

s -> s'
s' -> s''
s'' -> s'''
s''' -> B s' s'' s'''
      |
          
where we add s'' < s'''. Now all our productions are good so we have removed left recursion.
    
Exercise: Check that you can transform the grammar below to remove left recursion!

stmts -> stmts stmt | stmt
stmt -> VAR ASSIGN expr SEMICOLON
expr -> expr addop term | term
addop -> ADD | SUB
term -> term mulop factor | factor
mulop -> MUL | DIV
factor -> LPAR expr RPAR | NUM
Consider again grammar (d):

s -> a b C
a -> A |
b -> a b C | B
Suppose we deliberately choose an inappropriate total ordering b < a < s of the non-terminals. This leaves a good. Notice that b -> a b C is not good as a is nullable. The production s -> a b C is not good either; however, it can be improved by expanding a:
      
s -> A b C | b C
a -> A |
b -> a b C | B
This introduces a new bad production s -> b C. However, we can also improve the last rule and this gives:

s -> A b C | b C
a -> A |
b -> A b C | b C | B
This makes the last rule almost good and we can remove the immediate left recursion:

s -> A b C | b C
a -> A |
b -> A b C b' | B b'
b' -> C b' |
This makes all the rules good except for the second production of s, which can now be improved:

s -> A b C | A b C b' C | B b' C
a -> A |
b -> A b C b' | B b'
b' -> C b' |
Now all rules are good, so we have removed the left recursion. However, do notice that the cost of choosing an inappropriate order on the non-terminals was significant: the resulting grammar is larger and less perspicuous than the grammar which results from simply eliminating the obvious left recursion!
    
Unfortunately there are grammars in which one has to squeeze a non-terminal, x, which is not already good, in order to make progress. In this case we are forced to resort to extreme measures. The idea is that we shall replace the non-terminal by a non-nullable version, x*, and by nulling it. This will cause the hidden recursion to be blocked in the first case and made more immediate in the second case. But the question is: how do we get the non-nullable version x* of x? Our strategy, in an attempt to be minimalistic, is to let x* inherit all the non-nullable productions of x and an \epsilon-separated version of the (one) null production.
    
First we explain \epsilon-separation:

The translation is indicated by placing T[ ] around the new parse tree. We have to show how each transformation gives a translation backward between the parse trees.
If we improve a -> b \alpha (production P) by expanding along b -> \beta (production Q) to get a new production a -> \beta \alpha (production R), then the synthesized translation is simply:

T[ R(\beta, \alpha) ] = P(Q(\beta), \alpha)

The translation for an expansion by an empty production is more complex (but still quite simple): if we improve a -> b \alpha (production P) by expanding along b -> \epsilon (production Q) we get a -> a' (production R) and a' -> \alpha (production S). The translation is:

T[ R(x) ] = x (the identity)
T[ S(\alpha) ] = P(Q, \alpha)

Now we turn to the translation required for the expansion which removes immediate left recursion. We have already discussed the idea when we looked at the translation from left associative form to right associative form. Here we are improving:

a -> a \alpha_1   --P'_1
   | ...
   | a \alpha_n   --P'_n
   | \beta_1      --P_1
   | ...
   | \beta_m      --P_m

to

a -> \beta_1 a'    --Q_1
   | ...
   | \beta_m a'    --Q_m
a' -> \alpha_1 a'  --Q'_1
    | ...
    | \alpha_n a'  --Q'_n
    |              --E

The translation is:

T[ Q_j(\beta_j, f) ] = f(P_j(\beta_j))
T[ Q'_i(\alpha_i, g) ] = \x -> g(P'_i(x, \alpha_i))
T[ E ] = \x -> x (the identity function)

where I have used lambda calculus style expressions.
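This fold can be run directly. A sketch for the expr/expr' instance of the rule (the tuple encoding of parse trees is my own): each expr' subtree translates to a *function* expecting the tree built so far, exactly as in T[Q'_i] and T[E], and the base production applies that function to the translated first term, as in T[Q_j].

```python
def t_exprp(node):
    """Translate an expr' subtree into a function on parse trees.

    () encodes the empty production E; (op, term, rest) encodes
    expr' -> op term expr' (a production Q'_i)."""
    if node == ():
        return lambda x: x                       # T[E] = identity
    op, term, rest = node
    # T[Q'_i]: attach the next operator on the LEFT of the tree so far.
    return lambda x: t_exprp(rest)((op, x, term))

def t_expr(node):
    """Translate expr -> term expr' (a production Q_j)."""
    term, rest = node
    return t_exprp(rest)(term)

# The right-recursive parse of "1 - 2 - 3":
tree = ("1", ("SUB", "2", ("SUB", "3", ())))
print(t_expr(tree))   # ('SUB', ('SUB', '1', '2'), '3'): left associated
```

Note how the accumulator function rebuilds the left-associative tree in a single linear pass, as promised by the "linear time fold" remark at the start of the lecture.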
When there are no cyclic derivations x =+=> x then we can sketch the argument that this process must terminate with a non-left-recursive grammar. The key steps in proving this are to realize that ...
Left factoring:

a -> \alpha \gamma_1 | ... | \alpha \gamma_r | \gamma_{r+1} | ... | \gamma_n
----------------------------------------------------------------------------
a -> \alpha a' | \gamma_{r+1} | ... | \gamma_n
a' -> \gamma_1 | ... | \gamma_r
      This has the benefit of removing the obvious first set
      clashes.  However, there is no guarantee that it will remove
      all first set clashes.  Note that this transformation is a
      right non-terminal expansion in reverse! 
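A sketch of the factoring step for one nonterminal (the grammar encoding and helper names are my own): productions sharing a first symbol have their longest common prefix factored out into a fresh nonterminal.

```python
def common_prefix(ps):
    """Longest common prefix of a list of productions."""
    pref = []
    for column in zip(*ps):
        if all(s == column[0] for s in column):
            pref.append(column[0])
        else:
            break
    return pref

def left_factor(g, x):
    """Group x's productions by first symbol and factor each sharing group."""
    groups = {}
    for p in g[x]:
        groups.setdefault(p[0] if p else None, []).append(p)
    new = []
    for first, ps in groups.items():
        if first is None or len(ps) == 1:
            new.extend(ps)
            continue
        pref = common_prefix(ps)
        xp = x + "'"                      # fresh name (one group assumed here)
        g[xp] = [p[len(pref):] for p in ps]
        new.append(pref + [xp])
    g[x] = new

# The right-recursive expression productions with a first set clash:
grammar = {"expr": [["term", "ADD", "expr"], ["term", "SUB", "expr"], ["term"]]}
left_factor(grammar, "expr")
print(grammar["expr"])   # [['term', "expr'"]]
print(grammar["expr'"])  # [['ADD', 'expr'], ['SUB', 'expr'], []]
```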
Note that an LL(1) grammar necessarily has no left recursion and is necessarily left-factored.
expr -> term ADD expr | term SUB expr | term
term -> factor MUL term | factor DIV factor | factor
factor -> LPAR expr RPAR | NUM
Leftmost nonterminal expansion
Given a nonterminal which occurs as the leftmost nonterminal in
      the body of a production:
    
x -> y \alpha
 we shall say that the occurrence has a first set clash
      in case: 
    
there is another production with the same head, x -> \gamma, such that first(y \alpha) is not disjoint from first(\gamma).
Notice that if there is a leftmost occurrence of a non-terminal
      with a first set clash then the grammar cannot be LL(1). Given
      such an occurrence we shall substitute the nonterminal to expose
      the first set clash:
    
x -> y \alpha        y -> \beta_1 | ... | \beta_n
---------------------------------------------------------------
x -> \beta_1 \alpha | ... | \beta_n \alpha
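Detecting which occurrences to expand requires first sets. A sketch (grammar encoding my own; nullability is handled by continuing past nullable symbols when accumulating first sets):

```python
def nullables(g):
    """Nonterminals that can derive the empty string (least fixed point)."""
    null = set()
    changed = True
    while changed:
        changed = False
        for x, prods in g.items():
            if x not in null and any(all(s in null for s in p) for p in prods):
                null.add(x)
                changed = True
    return null

def first_sets(g):
    """first(x) for each nonterminal x, to a fixed point."""
    null = nullables(g)
    first = {x: set() for x in g}
    changed = True
    while changed:
        changed = False
        for x, prods in g.items():
            for p in prods:
                for s in p:
                    add = first[s] if s in g else {s}
                    if not add <= first[x]:
                        first[x] |= add
                        changed = True
                    if s not in null:
                        break
    return first

def seq_first(seq, g, first, null):
    """first of a sentential form."""
    out = set()
    for s in seq:
        out |= first[s] if s in g else {s}
        if s not in null:
            break
    return out

def first_clashes(g):
    """Leftmost nonterminal occurrences whose first set meets a sibling's."""
    first, null = first_sets(g), nullables(g)
    clashes = []
    for x, prods in g.items():
        for i, p in enumerate(prods):
            if p and p[0] in g:          # leftmost nonterminal occurrence
                for q in prods[i + 1:]:
                    if seq_first(p, g, first, null) & seq_first(q, g, first, null):
                        clashes.append((x, p, q))
    return clashes

grammar = {
    "expr": [["term", "ADD", "expr"], ["term", "SUB", "expr"], ["term"]],
    "term": [["NUM"]],
}
print(first_clashes(grammar))   # all three expr productions pairwise clash
```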
    
Transforming to LL(1)

In trying to turn a grammar into an LL(1) grammar you should try the following steps:

Health warning: Recall there is no guarantee this process will terminate. In particular, the expansion of nonterminals could simply blow up the grammar, making the rules bigger and bigger ...