If the output of a phase cannot be determined without looking at the remainder of the phase’s input, the phase can generate output with slots that can be filled in later, after more of the input is read.
Consider an assembler might have statement GOTO L
which precedes a statement with label L.
- The two-pass assembler uses its first pass to enter into its symbol table a list of all identifiers together with the machine address.
- Then the second pass replaces mnemonic operation codes, such as GOTO, by their machine language equivalent and replaces uses of identifiers by their machine language equivalent and replaces uses of identifiers by their machine addresses.
- Append the machine address for this instruction to a list of instructions to be backpatched once the machine address for L is determined. L: ADD X
- It scans the list of statements referring to L and places the machine address for statement L: ADD X in the address field of each such instruction.
- The distance over which backpatching occurs must remain accessible until backpatching is complete.
- It is the interface between the source program and the compiler.
- It reads the source program one character at a time.
- Divides source program into a sequence of atomic units called tokens.
- Token represents the sequence of characters treated as a single logical entity.
- Identifiers, keywords, constants, operators, and punctuation symbols (comma and parenthesis) are typical tokens.
Example (Fortran statement)
IF (5 .EQ. MAX) GOTO 100
Tokens are : IF, (, 5, .EQ., MAX, ), GOTO, 100
Types of Tokens:
- Specific Strings: IF or semicolon
- Classes of Strings: Identifiers, constants, or labels
- Token consists of two parts token type and token value.
- Specific strings such as semicolons are treated as having a type but no value.
- Token such as identifier MAX has a type “identifier” and a value consisting of string MAX.
- The lexical analyzer and syntax analyzer are grouped together into the same pass.
- Lexical analyzer operates either under control of parser or as a co-routine with the parser.
- Parser asks the lexical analyzer for the next token whenever the parser needs one.
- Lexical analyzer returns a code, for the token that it found, to the parser.
- If the token is an identifier or another token with a value, the value is also passed to the parser.
- The method of providing this information by lexical analyzer is called bookkeeping routine.
- This routine puts the actual value in the symbol table if it is not already present.
- Lexical analyzer passes two components of token:
- Code for the token type (identifier)
- A pointer to the place in the symbol table
- The lexical analyzer examines successive characters in the source program, starting from the first character not yet grouped into a token.
- Lexical analyzer may require to search many characters beyond the next token in order to determine what the next token actually is.
Example (Fortran statement)
- Consider the situation that strings to the left of the bracket are already broken up into tokens.
- When the parser asks for the next token, the lexical analyzer reads all the characters between
- 5 and Q including. (dot) to determine that the next token is constant 5
- The reason to read up to Q is until it sees Q, it is not sure that it has seen complete constant.
- After determining that the next token is constant 5, the lexical analyzer repositions its input pointer at first. (dot).
- The lexical analyzer return token type “constant” to the parser and the value associated with this constant could be numerical value 5 or a pointer to the string 5.
- After processing the statement, the token stream looks like
if ( [const, 341] eq [id, 729] ) goto [label, 554]
The relevant entries of symbol tables are
341 Constant, integer, value = 5
554 Label , value = 100
729 Variable, integer, value = MAX
- It has two functions It checks that tokens appearing in its input are in proper pattern as per the specification of the source language.
- Also imposes on tokens a tree-like structure used by subsequent phases of compiler
A + / B
After lexical analysis, this expression appears for syntax analyzer as
id + / id
On seeing the /, the parser should detect an error since the presence of these two adjacent binary operators violate the formation rules
Parser makes the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together.
A / B * C
has two possible interpretation
- Divide A by B then multiply by C
- Multiply B by C then use the result to divide A
- The interpretation can be represented in terms of the parse tree.
- The parse tree exhibits the syntactic structure of the expression.
- The language specification must tell us which interpretation is to be used.
- Context-free grammars are helpful in specifying the syntactic structure of a