How Computer Compilers actually work  (http://www.patmoss.com/pcref Item 1.)

The famous linguist Noam Chomsky is credited with discovering the underlying
theory used to implement modern computer compilers. Professor Chomsky defines
four classes of "Phrase Structured Grammars" as follows: 

Type 0: Unrestricted grammar. Anything can be rewritten as anything. 
Type 1: Context sensitive grammar (CSG).
        Each rewriting must be simpler (and shorter) than the previous one. 
Type 2: Context free grammar (CFG).
        Same as Type 1, plus each statement must be complete (ie., it must
        contain no context references). 
Type 3: Regular grammar, aka Finite State Automaton (FSA).
        Common examples of FSAs are: a hand calculator, an LCD wrist watch,
        the coin changer in a vending machine, TV/VCR remote controls, and
        the direct dial telephone system. 

Because of limited memory capacity, modern computers can only afford to
implement Type 2 context free grammars (CFG). But users will not accept
CFG and insist on much more flexible context sensitive grammars (CSG).
For example, in a CFG, every time we refer to a variable "x", we must
define its data type, and its "complete life history". In a CSG, we can
define "x" in one place, and then use it in numerous other places in the
computer software program. 

Note: "Overloading" of names in the C++ language adds the capability for
      even more context sensitivity. 

A compiler implementation consists of a Scanner and a Parser. The Scanner
examines the source program and converts it into a series of "tokens". The
Parser reads the token stream, and creates a "parse tree". The parse tree
represents the complete "meaning" of the source program. Then the compiler
converts the parse tree into object code for a "target machine". 

The parser reads each statement, and rewrites it into a simpler form. Each
rewriting must be simpler (and shorter) than the previous one. This process
continues until the parser "understands" the complete meaning of a statement.
The simplest form of parser uses the "recursive descent" model. Most modern
parsers use much more complex parsing strategies, such as LALR(1). A popular
scanner generator is "Lex", and a popular parser generator is "Yacc" (Yet
Another Compiler Compiler). Both are available with the Unix operating system. 

Every modern compiler (parser) is implemented as a Type 2 context free grammar
(CFG) that appears to the user to behave as a much more sophisticated Type 1
context sensitive grammar (CSG). 

Professor Chomsky's famous discovery is that a compiler can simulate a Type 1
CSG with a Type 2 CFG plus a "symbol table"! Thus, when we declare the data
type for a variable "x", an entry is created in a local symbol table for the
current block. Then whenever we refer to "x" later on in this block, the
compiler looks in the symbol table to determine what "x" is, and also where
it is stored in memory. 

The use of local/global symbol tables leads directly to the concepts of "scope",
"visibility", and "longevity" of a variable. When we refer to a variable name,
the parser looks first in the current local symbol table. If the variable name
is not found there, the parser looks in the next higher level symbol table.
This process continues until either the variable name is found, or the parser
gives up and issues an "undefined symbol" complaint on the compilation report.
Note: The global symbol table is searched last, after all nested local symbol
tables have been searched. This process can be circumvented by using the "::"
global resolution operator in C++. Then the parser goes directly to the global
symbol table to find the variable name.  That's all there is to it. Now you
know some of the secrets of the Computer Compiler "gurus"!

   Note: If you find errors/omissions, or have suggestions to improve this
         documentation please send an email to patmoss@patmoss.com. Thanks.

Return to Top Return to Index Return to Home Page