Scanner, Parser, Compiler for a subset of the C language written in C
Running the scanner and parser together
For now just call scanner.c on the test_main_program.core files and then call parser.c.
.\scanner {filename}.core; .\parser;
The scanner writes the tokens into the symbol_table.txt, then the parser loads the tokens from it and writes the parse tree to parse_tree_output.ebnf.
Grammar rule the parser.c can parse
Will serve to show our progress on parser.c, and will eventually match our grammar rule. This shows the current grammar rule that the parser can read and output a working parse tree.
<program> ::= { <declaration> }
<declaration> ::= <variable_declaration>
| <array_declaration>
| <function_declaration>
<variable_declaration> ::= <data_type> <identifier> "=" <factor> { "," <identifier> "=" <factor> } ";"
| <data_type> <identifier> { "," <identifier> } ";"
<array_declaration> ::= <data_type> <identifier> "[" [ <const> ] "]" [ "=" "{" [ <argument_list> ] "}" ] ";"
<function_declaration> ::= <data_type> <identifier> "(" <parameter_list> ")" ( <block> | ";" )
<parameter_list> ::= ["void"] | <data_type> <identifier> { "," <data_type> <identifier> }
<argument_list> ::= <factor> { "," <factor> }
<data_type> ::= "int" | "float" | "char" | "bool"
<block> ::= "{" <block_item_list> "}"
<block_item_list> ::= { <block_item> }
<block_item> ::= <statement> | <variable_declaration> | <array_declaration>
<statement> ::= "return" <factor> ";"
| <factor> ";"
| ";"
| <block>
| <for_statement>
| "while" "(" <factor> ")" <block>
| "for" "(" (<variable_declaration> | <array_declaration> | <factor> ";") <factor> ";" <factor> ")" <block>
| <if_statement>
| <input_statement>
| <outout_statement>
<input_statement> ::= "scanf" "(" <string> { "," "&" <identifier> } ")" ";"
<output_statement> ::= "printf" "(" <string> ")" ";"
| "printf" "(" <string> "," <factor> ")" ";"
| "printf" "(" <string> { "," <factor> } ")" ";"
| "printf" "(" <identifier> ")" ";"
<if_statement> ::= "if" "(" <factor> ")" <block> [<else-clause>]
<else-clause> ::= "else" <block>
| "else" <if_statement>
<factor> ::= <const>
| <identifier>
| "(" <factor> ")"
| <identifier> "(" [ <argument_list> ] ")"
| <identifier> "[" <const> "]"
<const> ::= <int> | <float> | <char> | <bool>
<string> ::= STRING
<identifier> ::= identifier
<int> ::= integer_literal
<float> ::= FLOAT_LITERAL
<char> ::= CHARACTER_LITERAL
<bool> ::= "true" | "false"Test test_main_program.core
The following test code
bool array[5] = {1, 2};
int array[10];
// Test variables
int a, b, c;
int d = 5;
int c = bar(1000, 2, 3);
int b = a;
int arr[3] = {(foo(5)), a[20], multiply(a)};
// Function prototypes
int isValid(bool x, int y, char z);
int empty(void);
int main(void) {
// Test input statements
scanf("%d");
scanf("dog");
scanf("%f", &value);
scanf("%f %f", &value1, &value2);
scanf(" ");
// Test output statements
printf("Hello, Universe!");
printf(name)
printf("Hello, your grade is %d %d", grade);
printf("Hello, your grade is %d %d", grade, year);
// Test if statements
if (true) {
return 0;
}
if (false) {
bool a;
} else {
bool b;
}
}
...produces the following parse tree using my gawa-gawang notation ala Writing a C Compiler. The output of the parser is also stored at parse_tree_output.ebnf
Program(
Declaration(
Array_Declaration(
Data_Type(
bool
),
Identifier(
identifier: "array"
),
l,
Const(
Int(
integer_literal: "5"
)
),
RIGHT_BRACKET,
ASSIGN,
left_brace,
Argument_List(
Factor(
Const(
Int(
integer_literal: "1"
)
)
),
comma,
Factor(
Const(
Int(
integer_literal: "2"
)
)
)
),
RIGHT_BRACE,
SEMICOLON
)
),
Declaration(
Array_Declaration(
Data_Type(
INT
),
Identifier(
identifier: "array"
),
left_bracket,
Const(
Int(
integer_literal: "10"
)
),
RIGHT_BRACKET,
SEMICOLON
)
),
Declaration(
Variable_Declaration(
Data_Type(
INT
),
Identifier(
identifier: "a"
),
comma,
Identifier(
identifier: "b"
),
comma,
Identifier(
identifier: "c"
),
SEMICOLON
)
),
Declaration(
Variable_Declaration(
Data_Type(
INT
),
Identifier(
identifier: "d"
),
ASSIGN,
Factor(
Const(
Int(
integer_literal: "5"
)
)
),
SEMICOLON
)
),
Declaration(
Variable_Declaration(
Data_Type(
INT
),
Identifier(
identifier: "c"
),
ASSIGN,
Factor(
Identifier(
identifier: "bar"
),
left_parenthesis,
Argument_List(
Factor(
Const(
Int(
integer_literal: "1000"
)
)
),
comma,
Factor(
Const(
Int(
integer_literal: "2"
)
)
),
comma,
Factor(
Const(
Int(
integer_literal: "3"
)
)
)
),
RIGHT_PARENTHESIS
),
SEMICOLON
)
),
Declaration(
Variable_Declaration(
Data_Type(
INT
),
Identifier(
identifier: "b"
),
ASSIGN,
Factor(
Identifier(
identifier: "a"
)
),
SEMICOLON
)
),
Declaration(
Array_Declaration(
Data_Type(
INT
),
Identifier(
identifier: "arr"
),
left_bracket,
Const(
Int(
integer_literal: "3"
)
),
RIGHT_BRACKET,
ASSIGN,
left_brace,
Argument_List(
Factor(
left_parenthesis,
Factor(
Identifier(
identifier: "foo"
),
left_parenthesis,
Argument_List(
Factor(
Const(
Int(
integer_literal: "5"
)
)
)
),
RIGHT_PARENTHESIS
),
RIGHT_PARENTHESIS
),
comma,
Factor(
Identifier(
identifier: "a"
),
left_bracket,
Const(
Int(
integer_literal: "20"
)
),
RIGHT_BRACKET
),
comma,
Factor(
Identifier(
identifier: "multiply"
),
...The first two parts in the Scanner is the reading of the file, and the error handling. And only after is the reading of lexemes/tokens.
Only write valid tokens into Symbol Table
In parsing later on, we have to take a token one at a time, and it would be messy if we still have to clear and check for error/invalid tokens and also comments. The symbol table now only contains real tokens for our language.
Our scanner will now only print error tokens and invalid characters at the console, and no longer write them into the symbol table.
The same for comments. We'll also try not writing COMMENT tokens into the symbol table since it's of no use in parsing. But we still scan it and print into the console.
The tokens were not loaded into memory in our lexical analysis, only stored in the symbol table. It looks like we have to read from the symbol table one token at a time (a stream of tokens) for the parsing.
But it feels icky to read one-at-a-time directly from the symbol table file. Will try to load it as a list/array of tokens for now for cleaner access, and then call the next token from it.
Syntax Analyzer Rubric
- Input must be read one by one from the symbol table (one token at a time)
- A parser algorithm was implemented
- Algorithm: Recursive Descent Parsing
- Transition Diagram
- Parsing table
- Input Statement /1
- Output Statement /3
- Assignment Statement /3
- Condition Statement /3
- Iterative Statement /2
- Declaration Statement
- Error Recovery Method used:
- Error Messages
Recursive Descent parser and Pratt parsing
We will use recursive descent in parsing the program: a subprogram to parse each non-terminal symbol in the formal grammar.
Discussions and references seem to recommend utilizing Pratt parsing or Precedence climbing to parse expressions to tackle the difficulty of precedence in recrusive descent.
- Writing a C Compiler
- Simple but powerful Pratt parsing
- What are the advantages of pratt parsing over recursive descent parsing?
- Pratt Parsers: Expression Parsing Made Easy
The parser must output a parse tree. For now our goal is a pretty-printer outputting the parse tree (not necessarilly a working data structure). I think this is done by each parse function for each grammar rule outputting the rule and its contents.
I am not sure what notation we will use for the parse tree, I don't think it really matters.
Example code
int main() {
;
}
Example parse tree output
Made up the notation similar from Writing a C Compiler)
Program(
Declaration(
Function_Declaration(
Data_type(INT),
identifier("main"),
left_parenthesis,
Parameter_List(),
Block(
left_brace,
Block_Item_List(
Block_Item(
Statement(
semicolon
)
)
)
right_brace
)
right_parenthesis,
)
)
)Pretty-printing general rules It's a bit hard handling all the ways to close of non-terminals with a comma and newline. It seems like the formatting should be handled by the non-terminal function that called the next non-terminal instead of it being within the latter's code.
For example, does not have any sibling nodes when within , while within it can happen. So I'll try formatting be a responsibility of the calling function.
Can not do function declaration within functions
We can have <function_declaration> within <block>, and since the body of a function is itself a block our grammar rules allows for this. Either specificy the to only allow for <variable_declaration> and <array_declaration>.
Array declaration without assignment, without array size
int arr[]; // is this allowed?
The next stages of this project involve the following development milestones:
- Full Grammar Implementation: Complete implementation of all grammar rules according to our desired C subset.
- Improved Error Handling: Enhance the parser's error detection and recovery capabilities.
- Semantic Analysis: Implement a semantic analyzer to check for type compatibility, scope issues, and other semantic errors.
- Intermediate Representation (IR): Generate an intermediate code representation, such as three-address code, as a step before code generation.
- Code Generation: Develop a code generator to translate the intermediate representation into machine code or assembly.
- Optimization: Explore opportunities for optimization, such as constant folding or register allocation.
- Crafting Interpreters creates two implementations of their language
lox: a Java interpreterjloxand a C interpreter/compiler to bytecodeclox - Sebesta - Concepts of Programming Languages assigned book