Revision history for DevFbcLexer


Revision [21526]

Last edited on 2016-08-10 12:23:15 by DkLwikki [Try work-around weird indentation behaviour]
Additions:
the variable is added to the AST,
back to toplevel parser)
create a CALL to fb_PrintInt(), the expression is the argument)
Deletions:
the variable is added to the AST,
back to toplevel parser)
create a CALL to fb_PrintInt(), the expression is the argument)


Revision [21525]

Edited on 2016-08-10 12:17:41 by DkLwikki [Revert bad indentation changes]
Additions:
the variable is added to the AST,
back to toplevel parser)
create a CALL to fb_PrintInt(), the expression is the argument)
Deletions:
the variable is added to the AST,
back to toplevel parser)
create a CALL to fb_PrintInt(), the expression is the argument)


Revision [20787]

Edited on 2016-03-12 13:14:24 by fxm [Formatting]

No Differences

Revision [19981]

Edited on 2016-02-10 15:48:07 by DkLwikki [Update link format]
Additions:
the variable is added to the AST,
back to toplevel parser)
create a CALL to fb_PrintInt(), the expression is the argument)
Deletions:
the variable is added to the AST,
back to toplevel parser)
create a CALL to fb_PrintInt(), the expression is the argument)


Revision [15692]

Edited on 2012-01-16 02:18:36 by SirMud [Update link format]
Additions:
{{fbdoc item="back" value="DocToc|Table of Contents"}}


Revision [15670]

Edited on 2012-01-16 01:58:11 by SirMud [Update link format]
Additions:
{{fbdoc item="back" value="DevToc|FreeBASIC Developer Information"}}
Deletions:
{{fbdoc item="back" value="DevToc|Table of Contents"}}


Revision [15652]

Edited on 2012-01-14 12:27:21 by DkLwikki [Add links to sub topics]
Additions:
{{fbdoc item="title" value="Lexer & preprocessor"}}----
{{fbdoc item="keyword" value="DevFbcLexerTokens|Tokens"}}
{{fbdoc item="keyword" value="DevFbcLexerMacros|Macro storage and expansion"}}
{{fbdoc item="keyword" value="DevFbcLexerDirectives|Preprocessor directive parsing"}}
{{fbdoc item="keyword" value="DevFbcLexerFiles|File contexts"}}
{{fbdoc item="keyword" value="DevFbcLexerCallGraph|Quick overview of the call graph"}}
Deletions:
{{fbdoc item="title" value="Purpose"}}----


Revision [14939]

Edited on 2010-10-25 17:45:34 by DkLwikki [Add links to sub topics]
Additions:
{{fbdoc item="title" value="Purpose"}}----
Deletions:
{{fbdoc item="title" value="Lexer & preprocessor"}}----
<<{{fbdoc item="keyword" value="DevFbcLexerTokens|Tokens - The lexer's core"}}
{{fbdoc item="keyword" value="DevFbcLexerMacros|Macro storage and expansion"}}
{{fbdoc item="keyword" value="DevFbcLexerDirectives|Preprocessor directive parsing"}}
{{fbdoc item="keyword" value="DevFbcLexerFiles|File contexts"}}
{{fbdoc item="keyword" value="DevFbcLexerCallGraph|Quick overview of the call graph"}}>>::c::
{{fbdoc item="section" value="Motivation"}}


Revision [14933]

Edited on 2010-10-25 16:48:29 by DkLwikki [Add links to sub topics]
Additions:
<<{{fbdoc item="keyword" value="DevFbcLexerTokens|Tokens - The lexer's core"}}
{{fbdoc item="keyword" value="DevFbcLexerMacros|Macro storage and expansion"}}
{{fbdoc item="keyword" value="DevFbcLexerDirectives|Preprocessor directive parsing"}}
{{fbdoc item="keyword" value="DevFbcLexerFiles|File contexts"}}
{{fbdoc item="keyword" value="DevFbcLexerCallGraph|Quick overview of the call graph"}}>>::c::
Deletions:
{{fbdoc item="keyword" value="DevFbcLexerTokens|Tokens - The lexer's core"}}
{{fbdoc item="keyword" value="DevFbcLexerMacros|Macro storage and expansion"}}
{{fbdoc item="keyword" value="DevFbcLexerDirectives|Preprocessor directive parsing"}}
{{fbdoc item="keyword" value="DevFbcLexerFiles|File contexts"}}
{{fbdoc item="keyword" value="DevFbcLexerCallGraph|Quick overview of the call graph"}}


Revision [14930]

Edited on 2010-10-25 16:26:33 by DkLwikki [Add links to sub topics]
Additions:
{{fbdoc item="keyword" value="DevFbcLexerTokens|Tokens - The lexer's core"}}
{{fbdoc item="keyword" value="DevFbcLexerMacros|Macro storage and expansion"}}
{{fbdoc item="keyword" value="DevFbcLexerDirectives|Preprocessor directive parsing"}}
{{fbdoc item="keyword" value="DevFbcLexerFiles|File contexts"}}
{{fbdoc item="keyword" value="DevFbcLexerCallGraph|Quick overview of the call graph"}}
{{fbdoc item="section" value="Motivation"}}


Revision [14926]

Edited on 2010-10-25 16:23:08 by DkLwikki [Add links to sub topics]
Additions:
The lexer is an abstraction hiding the ugly details of user input (indentation, comments, keyword capitalization, #includes) from the parser. Additionally, it does preprocessing, consisting of macro expansion and preprocessor directive parsing. The general idea is to handle all preprocessing in the lexer, so the parser does not get to see it. The parser never calls preprocessor functions; the lexer functions do that.
Deletions:
==__Motivation__==
The lexer is an abstraction hiding the ugly details of user input (indentation, comments, keyword capitalization, #includes) from the parser.
==__Basics__==
The basic public interface of the lexer, found in ''##lex.bas##'', is (see the usage sketch below):
- ''##lexGetToken()##'': Retrieve the current token's id, an FB_TK_* value.
- ''##lexGetLookAhead(N)##'': Look ahead N tokens.
- ''##lexSkipToken()##'': Go to the next token.
- ''##lexGetText()##'': Returns a zstring ptr to the text of the current token, e.g. string/number literals (their values are retrieved like this), or the text representation of other tokens (e.g. operators).
- some more ''##lexGet*()##'' accessors to data of the current token
- ''##lexPeekLine()##'': Used by error reporting to retrieve the current line of code.
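A minimal usage sketch of this interface follows; the ##FB_TK_*## ids shown and the helper names are illustrative guesses, only the ''##lex*()##'' functions above are real:

%%' Hypothetical usage sketch (FB_TK_DIM/FB_TK_EOF and cVariableDecl() are
' placeholder names; the lex*() calls are the interface described above):
do
    select case lexGetToken()
    case FB_TK_EOF
        exit do
    case FB_TK_DIM
        lexSkipToken()          ' consume DIM, then parse the declaration
        ' cVariableDecl() ...
    case else
        print *lexGetText()     ' e.g. dump the current token's text
        lexSkipToken()
    end select
loop%%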
==Current token + look ahead tokens==
Tokens are short-lived: there is only the current token and a few look ahead tokens in the token queue. That is all the parser needs to decipher FB code. The usual pattern is to check the current token, decide what to do next based on what it is, then skip it and move on. Backward movement is not possible. The file name, line number and token position shown during error reporting also come from the current lexer state.

The token queue is a static array of tokens, containing space for the current token plus the few look ahead tokens. The token structures contain fairly huge (static) buffers for token text. Each token has a pointer to the next one, so they form a circular list. This is a cheap way to move forward and skip tokens, without having to take care of an array index. Copying around the tokens themselves is out of the question, because of the huge text buffers. The "head" points to the current token; the next "k" tokens are look ahead tokens; the rest is unused. When skipping, we simply do "head = head->next". Unless the new head already contains a token (from some look ahead done before), we load a new token into the new current token struct (via lexNextToken()).
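A minimal sketch of that circular queue idea, with made-up names and sizes (the real declarations live in the fbc sources):

%%' Hedged sketch of the circular token queue (names/sizes are made up):
const LOOKAHEAD = 3

type LexTokenSketch
    id   as integer               ' FB_TK_* value; 0 = empty slot in this sketch
    text as zstring * 1024        ' fairly huge static text buffer
    nxt  as LexTokenSketch ptr    ' link forming the circular list
end type

dim shared queue(0 to LOOKAHEAD) as LexTokenSketch
dim shared head as LexTokenSketch ptr

sub initQueue()
    for i as integer = 0 to LOOKAHEAD
        queue(i).nxt = @queue((i + 1) mod (LOOKAHEAD + 1))
    next
    head = @queue(0)
end sub

sub skipOne()
    head = head->nxt    ' moving forward is just one pointer step
    if head->id = 0 then
        ' slot not filled by an earlier look ahead: load a fresh token
        ' lexNextToken(head) ...
    end if
end sub%%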
==Tokenization==
''##lex.bas:lexNextToken()##''

The lexer breaks down the file input into tokens. Conceptually, a token is an identifier, a keyword, a string literal, a number literal, an operator, EOL or EOF, or another character like a parenthesis or comma. Each token has a unique value assigned to it that the parser uses to identify it, instead of doing string comparisons (which would be too slow).

''##lexNextToken()##'' uses the current char, and if needed also the look ahead char, to parse the input. Number and string literals are handled here too. Alphanumeric identifiers are looked up in the ''##symb##'' hash table, which will tell whether it's a keyword, a macro, or another FB symbol (type, procedure, variable, ...).

Identifiers containing dots (QB compatibility) and identifier type suffixes (as in stringvar$) are handled here too (but not namespace/structure member access). Tokens can have a data type associated with them. That is also used with number literals, which can have type suffixes (as in ##&hFFFFFFFFFFFFFFFFull##).
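A rough sketch of the kind of per-character dispatch this implies; the classification below is simplified and all names are made up:

%%' Hedged sketch of lexNextToken()'s top-level dispatch (names made up):
enum TOKENKIND
    TK_NUMLIT
    TK_STRLIT
    TK_IDENT
    TK_OTHER
end enum

function classify(byval ch as integer) as TOKENKIND
    select case ch
    case asc("0") to asc("9")
        return TK_NUMLIT    ' number literal, incl. suffixes like &h...ull
    case asc("""")
        return TK_STRLIT    ' string literal
    case asc("a") to asc("z"), asc("A") to asc("Z"), asc("_")
        return TK_IDENT     ' then looked up in the symb hash table
    case else
        return TK_OTHER     ' operators, parentheses, commas, EOL/EOF
    end select
end function%%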
==Side note on single-line comments==
Quite unusually, single-line comments are handled by the parser instead of being skipped in the lexer. This is done so that usage of ##REM## can easily be restricted as in QB; after all, REM is more like a statement than a comment. Besides that, comments can contain QB meta statements, so comments cannot just be ignored. Note that the parser will still skip the rest of a comment (without tokenizing it) if it does not find a QB meta statement.

(Multi-line comments are completely handled during tokenization though.)
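For illustration, user-level FB code where comments matter to the parser (assuming the QB-compatible ##-lang qb## dialect for the meta statements; ##mydefs.bi## is a made-up file name):

%%' A plain comment: the parser skips the rest of the line unparsed.
' The next two lines hide QB meta statements inside comments:
'$include: 'mydefs.bi'
rem $dynamic%%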
==File input==
''##lex.bas:hReadChar()##''

The input file is opened in ''##fb.bas:fbCompile()##''; the file number is stored in the global ''##env##'' context (similarly for #includes in ''##fb.bas:fbIncludeFile()##''). The lexer uses the file number from the ''##env##'' context to read input from. It has a static zstring buffer that is used to stream the file contents (instead of reading character by character), and for Unicode input, the lexer uses a wstring buffer and decodes UTF32 or UTF8 to UTF16. The lexer advances through the chars in the buffer and then reads in the next chunk from the file. EOF is represented by returning a NULL character.
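A simplified sketch of such buffered reading, using plain FB file I/O and made-up names (the real implementation is ''##lex.bas:hReadChar()##'' and also covers the Unicode paths):

%%' Hedged sketch of buffered character reading (names are made up):
const BUFFERLEN = 8192
dim shared buffer(0 to BUFFERLEN - 1) as ubyte
dim shared bufferPos as integer = BUFFERLEN    ' forces the first refill
dim shared fileNum as integer                  ' file number from the env context

function readChar() as integer
    if bufferPos >= BUFFERLEN then
        erase buffer                           ' zero-fill: unread bytes = EOF
        if eof(fileNum) = 0 then
            get #fileNum, , buffer()           ' stream in the next chunk
        end if
        bufferPos = 0
    end if
    function = buffer(bufferPos)               ' 0 (NULL char) represents EOF
    bufferPos += 1
end function%%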
==__Preprocessing__==
Preprocessing consists of macro expansion and directive parsing. The general idea is to handle all preprocessing in the lexer, so the parser does not get to see it. The parser never calls preprocessor functions; the lexer functions do that.
==__Macros (Preprocessing part 1)__==
Some terms used in the code:
- ##macro## (in general): the #defined/#macroed object that will be expanded to its replacement text
- ##macro## (specifically): a function-like macro, e.g. ###define m(a, b)##
- ##define##: an object-like macro, e.g. ###define simple##
- ##argless define## (should be parameter-less): a function-like macro without parameters, e.g. ###define f()##
==How macros are stored==
Macros are basically stored as raw text, not as token runs (as in GCC's libcpp, for example). The body of simple #defines without parameters is stored as one string. Macros with parameters are stored as a sequence of "macro tokens". There are four types of macro tokens:
- ##text("<text>")##
Raw text, but spaces and empty lines trimmed (like in a #define without parameters)
- ##textw("<wstring text>")##
Same as above, just for Unicode input.
- ##parameter(index)##
A macro parameter was used here in the declaration. The index specifies which one. During expansion, the text of argument(index) is inserted where the parameter was in the declaration.
- ##stringify_parameter(index)##
Same as above, except the argument will be stringified during expansion.

Note: macro tokens are actually ''##symb.bi:FB_DEFTOK##'' structures, and they contain an id field holding one of the ''##FB_DEFTOK_TYPE_*##'' values to tell what they contain.
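As a rough idea of the shape of such a structure, here is a hedged sketch; only ##FB_DEFTOK## and the ##FB_DEFTOK_TYPE_*## values are real names from ''##symb.bi##'', the fields below are guesses:

%%' Hedged sketch, not the real declaration from symb.bi:
type FB_DEFTOK_SKETCH
    id as integer                  ' one of the FB_DEFTOK_TYPE_* values
    union
        text       as zstring ptr  ' for text()
        textw      as wstring ptr  ' for textw()
        paramindex as integer      ' for parameter()/stringify_parameter()
    end union
    nxt as FB_DEFTOK_SKETCH ptr    ' a macro body is a list of these
end type%%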

%%For example:
#define add(x, y) x + y
becomes:
parameter(0), text(" + "), parameter(1)
And the expansion text will be:
argument(0) + " + " + argument(1)
%%

Storing macros as text is a fairly easy implementation, but it requires re-parsing the macro body over and over again. GCC, for example, works with preprocessing tokens and token runs: macros are stored as tokens, making expansion very fast, because there is no need to tokenize the macro body again and again. fbc's implementation is not as flexible and maybe not as efficient, but it is less complex (regarding code and memory management) and has an upside too: implementation of ##""##""## (PP token merge) is trivial. ##""##""## is simply omitted while recording the macro's body, whereas with token runs the tokens need to be merged explicitly.
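For example, at the user level, the merge costs nothing because the recorded body already contains ##var## directly followed by the parameter token:

%%#define makevar(n) dim shared as integer var##n
makevar(1)    ' expands to: dim shared as integer var1
makevar(2)    ' expands to: dim shared as integer var2%%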
==When are macros expanded?==
Because of token look ahead, macros must be expanded during tokenization, otherwise the wrong tokens might be loaded into the token queue. After all, the parser should only get to see the final tokens, even during look ahead.

In ''##lexNextToken()##'', each alphanumeric identifier is looked up in the symb module to check whether it is a keyword or a macro. Macros and keywords are kept in the same hash table. Note that macros cannot have the same name as a keyword; "#define integer" causes an error. If a macro is detected, it is immediately expanded, a process also called "loading" the macro (''##pp-define.bas:ppDefineLoad()##'').
==Macro call parsing==
If the macro takes arguments, the macro "call" must be parsed, much like a function call, syntax-wise. Since macro expansion already happens in ''##lexNextToken()##'', the source of tokens, the parsing here is a little tricky: the token queue and token look ahead cannot be relied upon, so the only way to move forward while parsing the macro's arguments is to replace (and lose) the current token.

Since ''##lexNextToken()##'' is used to parse the arguments, macros in the arguments themselves are recursively macro-expanded while the arguments are being parsed and recorded in text form. The argument texts are stored for use during the expansion.

So, a macro's arguments are expanded before that macro itself is expanded, which can be seen as both a good and a bad feature:

%%#define stringify(s) #s
stringify(__LINE__)%%

results in ##2## in FB, but ##""__LINE__""## in C, because in C, macro parameters are not expanded when used with ##### or ##""##""##. In C, two macros have to be used to get the ##2##:

%%#define stringize(s) #s
#define stringify(s) stringize(s)
stringify(__LINE__)%%
==Putting together the macro expansion text==
The expansion text is a string built up from the macro's body tokens. For macro parameters, the argument text is retrieved from the argument array created by the macro call parser, using the indices stored in the parameter tokens. Parameter stringification is done here.

There is a special case for the builtin defines (##""__LINE__""##, ##""__FUNCTION__""##, ##""__FB_DEBUG__""##, etc.):
a callback is used to retrieve their "value". For example, ##""__LINE__""##'s callback simply returns a string containing the lexer's current line number.
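A hedged sketch of that callback idea (every name here is made up):

%%' Sketch: __LINE__'s "body" is computed on demand rather than stored
dim shared currentLineNumber as integer

function lineCallback() as string
    return str(currentLineNumber)
end function%%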
==Expansion==
The macro expansion text (##deftext##) is stored by the lexer, and now it will read characters from there for a while, instead of reading from the file input buffer. Skipping chars in the macro text is like skipping chars in the file input: Once skipped it's lost, there is no going back. So, there never is "old" (parsed) macro text, only the current char and to-be-parsed text. New macro text is prepended to the front of existing macro text. That way macros inside macros are expanded.
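In sketch form, with made-up names, the prepending looks like this:

%%' Hedged sketch: nested expansion by prepending to the pending text
dim shared deftext as string

sub loadMacroText(byref expansion as string)
    deftext = expansion + deftext    ' inner macro text is read first
end sub

function readMacroChar() as integer
    if len(deftext) = 0 then return 0    ' fall back to file input
    dim as integer ch = deftext[0]
    deftext = mid(deftext, 2)            ' once skipped, a char is gone
    return ch
end function%%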

This implementation does not (easily) allow detecting macro recursion. It would be hard to keep track of which characters in the macro text buffer belong to which macro, but that would be needed to be able to push and pop macros properly. It could be done more easily with a token run implementation as seen in GCC's libcpp. However, C doesn't allow recursive macros in the first place: in C, a macro's identifier is undefined (does not trigger expansion) inside that macro's body. That is not the case in fbc, because (again) a way to detect when a macro body ends is not implemented.

Currently fbc only keeps track of the first (toplevel) macro expanded, because it's easy to detect when that specific macro's end is reached: as soon as there is no more macro text.

That's why the recursion is detected here:

%%#define a a
a%%

and here too:

%%#define a b
#define b a
a%%

but not here (note that fbc will run into an infinite loop):

%%#define a a
#define m a
m%%
==__Directive parsing (Preprocessing part 2)__==
Preprocessor directives (###if##, ###define##, ###include##, etc.) are parsed during ''##lex.bas:lexSkipToken()##'' by calling ''##pp.bas:ppCheck()##''. After moving to the next token (or loading a new token), ''##ppCheck()##'' checks whether the new current token is a ##'#'##, and if so, whether the previous token was an ##EOL##. If both hold, it found a ##'#'## at the beginning of a line, and it directly parses the PP directive, using the same ''##lexGetToken()##''/''##lexSkipToken()##'' interface used by the parser. This is necessary because some PP directives result in parser functions being called; for example, ''##parser-identifier.bas:cIdentifier()##'' is used by the ###ifdef## parser to recognize variables etc.:
%%dim as integer i
#ifdef i
#print yes, the variable will be recognized
#endif%%
So, ''##lexSkipToken()##'' is recursive because of the PP. ''##ppCheck()##'' will only call ''##pp.bas:ppParse()##'' for toplevel calls to ''##lexSkipToken()##''. This restriction is needed, for example, when parsing #macro/#endmacro blocks, because they can span multiple lines. It's also needed while the PP skips ###if FALSE## blocks.
As a result, every time the parser skips an EOL, ''##lexSkipToken()##'' might detect a ##'#'## and handle the PP directive, skipping ahead some more lines, so the parser stays fully unaware that the PP directives are even there. The PP parsing launched from ''##lexSkipToken()##'' might even encounter an ###include## and call ''##fb.bas:fbIncludeFile()##'' to parse it immediately, recursively starting a ''##parser-toplevel.bas:cProgram()##'' for that #include file. The parser has to be able to handle the recursion that might happen during every ''##lexSkipToken()##'' at ##EOL##, but luckily that is not a big deal. The parser needs a stack to keep track of compound statements anyways.
Note that PP directives are not handled during token look ahead (''##lex.bas:lexGetLookAhead()##''). If the parser were to look ahead across ##EOL##, it could very well see a PP directive. Luckily though looking ahead across lines is never necessary.
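A hedged sketch of that toplevel-only recursion guard (all names made up; the real logic lives in ''##pp.bas:ppCheck()##''):

%%' Sketch of the check done after each token skip:
dim shared ppLevel as integer

sub skipTokenSketch()
    ' moveToNextToken() ...              ' advance the circular queue
    if ppLevel > 0 then exit sub         ' only toplevel calls parse directives
    ' if current token = '#' and the previous token was an EOL then:
    ppLevel += 1
    ' ppParse()                          ' may skip lines or enter #includes
    ppLevel -= 1
end sub%%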
==Macro expansion in PP directives==
The beginning of directives, the keyword following the ##'#'##, is parsed without macro expansion. This means redefining PP keywords (intentionally) has no effect on the PP directives. For example:

%%#define define foo
#define bar baz%%

will //not// intermediately be seen as:

%%#foo bar baz%%

Directives like ###if## & co. make use of the PP expression parser, which does expand macros. After all, that's the point of PP expressions. For example:
%%#define foo 1
#if foo = 1
#endif%%

The ###define## and ###macro## directives don't do macro expansion at all. A macro's body is recorded as-is.
==#define/#macro parsing==
''##pp.bas:ppDefine()##'' first parses the macro's identifier. If there is a ##'('## following, without space in between, then the parameter list is parsed too.

Then the macro body is parsed. For each token, its text representation is retrieved via ''##lexGetText()##'', and it is appended to the macro body text. Space is preserved (but trimmed); comments are left out; in multi-line #macros empty lines are removed.

If the macro has parameters, the macro tokens will be created (as discussed under "How macros are stored" above). To do that, the macro parameters are added to a temporary hash table, which maps the parameter names to their indices. Then, identifiers in the macro body are looked up, and when a parameter is recognized, a parameter(index) macro token is created, instead of appending the token to the previous text() macro token (or creating a new text() for it). After that parameter(index), if there is other text again, a new text() macro token is created.

Using ##### on a parameter results in the creation of a stringify_parameter(index) macro token. The PP merge operator ##""##""## is simply omitted from the macro body, so ##""a##b""## becomes ##ab## in a text() macro token. All normal text before/after/between parameters goes into text() macro tokens.

%%For example:
#define add(x, y) foo bar x + y
And the actions of the #define/#macro parser will be:
'add' - The macro's name
'(' following the name, without space in between: Parse the parameter list.
'x' - Parameter 0.
',' - Next parameter.
'y' - Parameter 1.
')' - End of parameter list.
Create the macro body in form of macro tokens.
' ' - Create new text(" ").
'foo' - Append "foo".
' ' - Append " ".
'bar' - Append "bar".
' ' - Append " ".
'x' - Is parameter 0, create new param(0).
' ' - Create new text(" ").
'+' - Append "+".
' ' - Append " ".
'y' - Is parameter 1, create new param(1).
EOL - End of macro body.
Resulting in this macro body:
text(" foo bar "), param(0), text(" + "), param(1)
%%

The #define parser allows macros to be redefined, if the body is the same. For example:

%%#define a 1
#define a 1%%

does not result in a duplicated definition. However this would:

%%#define a 1
#define a 2%%

Since those are pure text #defines, the comparison of the bodies is a simple string comparison. Currently, this feature is not implemented for macros with parameters.
==PP expressions==
The preprocessor has its own (but fairly small and simple) expression parser (''##pp-cond.bas:ppExpression()##''). It works much like ''##parser-expression.bas:cExpression()##'', except instead of creating AST nodes, ''##ppExpression()##'' immediately evaluates the expressions.
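A hedged sketch of that immediate-evaluation style (##ppRelational()## and ##FB_TK_EQ## stand in for the real internals):

%%' Sketch: each level folds its result right away instead of building nodes
function ppEquality() as longint
    dim as longint lhs = ppRelational()
    while lexGetToken() = FB_TK_EQ
        lexSkipToken()
        dim as longint rhs = ppRelational()
        lhs = iif(lhs = rhs, -1, 0)    ' evaluate now, no AST node
    wend
    return lhs
end function%%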
==PP skipping==
The preprocessor uses a simple stack to manage ###if##/###endif## blocks. Those can be nested, and there may be #includes in them, but they cannot go across files. False blocks (#if 0, or the #else of an #if 1) are immediately skipped when parsing the #if 0 or the #else (''##pp-cond.bas:ppSkip()##''), before returning to ''##lexSkipToken()##''.

For example:

%%#if 1   (push to stack: is_true = TRUE, #else not visited yet,
           return to lexSkipToken())
...       (will be parsed)
#else     1) Set the #else visited flag for the current stack node,
             so further #else's are not allowed.
          2) Since the current stack node has is_true = TRUE,
             the #else block must be skipped, -> call ppSkip().
...       (skipped in ppSkip())
#endif    (parsed from ppSkip(), skipping ends, ppSkip() returns to the
           #else parser, which returns to lexSkipToken())%%
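A minimal sketch of what such a stack node needs to record, with guessed field names:

%%' Hedged sketch of an #if-stack node (field names are guesses):
type PPIfNodeSketch
    is_true      as integer    ' was this #if/#elseif branch taken?
    else_visited as integer    ' has an #else been seen already?
end type%%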

Note that there are a few tricky bits about PP skipping. Since macros are allowed to contain PP directives, macro expansion must be done even during PP skipping, because an #else or #endif could be inside a multi-line macro. Also, multi-line #macro declarations are not handled during PP skipping. That means something like this:

%%#if 0
    #macro test()
        #endif
    #endmacro%%

will be seen as:

%%#if 0
    #macro test()
#endif
#endmacro%%

Resulting in an error (#endmacro without #macro).

So, this:

%%#if 0
    #macro test()
        #endif
    #endmacro
#endif%%

will not work as suggested by the indentation.
==__Lexer contexts__==
Because #includes can occur in the middle of input files, the lexer needs to push file contexts to a stack. File input buffer, macro expansion buffer and the token queue form a so-called "context". It is file specific and thus it must be pushed onto a stack, so that the lexer can return to the parent (after parsing an #include), without losing any tokens or macro text. Note that macros can contain #includes too.
''##fb.bas:fbIncludeFile()##'' basically just consists of:
''##lexPush()##''
''##cProgram()##''
''##lexPop()##''
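A hedged sketch of what such a context could bundle (all field names are guesses):

%%' Sketch of a per-file lexer context saved by lexPush()/lexPop():
type LexContextSketch
    fileHandle as integer                ' file number from the env context
    deftext    as string                 ' pending macro expansion text
    ' ... the file input buffer and the token queue also live here ...
    prev       as LexContextSketch ptr   ' link to the parent file's context
end type%%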
==__fb/lex/pp/parser call chain__==
%%
                     +-------> lexGetLookAhead() ----------------+
                     |                                           |
                     |                                           v
  (begin)       (FB parsing)            (PP parsing)          (lexing)
fbCompile() ---> cProgram() ---------> lexSkipToken() ---> lexNextToken()
     |            ^     |                |        ^          |        ^
     v            |     |                v        |          v        |
fbPreIncludes()   |     | ('$include) ppCheck()   |       ppDefineLoad()
     |            |     |                |        |     (macro expansion)
     v            |     v                v        |
fbIncludeFile() <-------------------- ppParse() --+
  (#include)                        (directives)%%


Revision [14909]

Edited on 2010-10-25 15:41:29 by DkLwikki [Add links to sub topics]
Additions:
{{fbdoc item="title" value="Lexer & preprocessor"}}----
Deletions:
===Lexing/tokenization, preprocessing===


Revision [14903]

Edited on 2010-10-25 15:30:19 by DkLwikki [typo]
Additions:
A macro parameter was used here in the declaration. The index specifies which one. During expansion, the text of argument(index) is inserted where the parameter was in the declaration.
Deletions:
A macro parameters was used here in the declaration. The index specifies which one. During expansion, the text of argument(index) is inserted where the parameter was in the declaration.


Revision [14897]

Edited on 2010-10-25 14:57:38 by DkLwikki [typo]
Additions:
{{fbdoc item="back" value="DevToc|Table of Contents"}}
Deletions:
@@[[DevToc Back to Table Of Content]]@@


Revision [14885]

Edited on 2010-10-24 15:19:27 by AgSwikki [Removed some typos]
Additions:
The lexer is an abstraction hiding the ugly details of user input (indentation, comments, keyword capitalization, #includes) from the parser.
Storing macros as text is a fairly easy implementation, but it requires to re-parse the macro body over and over again. For example, since GCC works with preprocessing tokens and tokenruns, macros are stored as tokens, making expansion very fast, because there is no need to tokenize the macro body again and again. fbc's implementation is not as flexible and maybe not as efficient, but is less complex (regarding code and memory management) and has an upside too: Implementation of ##""##""## (PP token merge) is trivial. ##""##""## simply is omitted while recording the macro's body, where as in token runs the tokens need to be merged explicitly.
As a result, every time the parser skips an EOL, ''##lexSkipToken()##'' might detect a ##'#'## and handle the PP directive, skipping ahead some more lines, so the parser stays fully unaware that the PP directives are even there. The PP parsing launched from ''##lexSkipToken()##'' might even encounter an ###include## and call ''##fb.bas:fbIncludeFile()##'' to parse it immediately, recursively starting a ''##parser-toplevel.bas:cProgram()##'' for that #include file. The parser has to be able to handle the recursion that might happen during every ''##lexSkipToken()##'' at ##EOL##, but luckily that is not a big deal. The parser needs a stack to keep track of compound statements anyways.
Deletions:
The lexer is an abstraction hiding the ugly details of user input (indendation, comments, keyword capitalization, #includes) from the parser.
Storing macros as text is a fairly easy implementation, but it requires to re-parse the macro body over and over again. For example, since GCC works with preprocessing tokens and tokenruns, macros are stored as tokens, making expansion very fast, because there is no need to tokenize the macro body again and again. fbc's implementation is not as flexible and maybe not as efficient, but is less complex (regarding code and memory management) and has an upside too: Implementation of ##""##""## (PP token merge) is trivial. ##""##""## simply is ommitted while recording the macro's body, where as in token runs the tokens need to be merged explicitely.
As a result, everytime the parser skips an EOL, ''##lexSkipToken()##'' might detect a ##'#'## and handle the PP directive, skipping ahead some more lines, so the parser stays fully unaware that the PP directives are even there. The PP parsing launched from ''##lexSkipToken()##'' might even encounter an ###include## and call ''##fb.bas:fbIncludeFile()##'' to parse it immediately, recursively starting a ''##parser-toplevel.bas:cProgram()##'' for that #include file. The parser has to be able to handle the recursion that might happen during every ''##lexSkipToken()##'' at ##EOL##, but luckily that is not a big deal. The parser needs a stack to keep track of compound statements anyways.


Revision [14881]

Edited on 2010-10-21 14:21:40 by DkLwikki [Typos]
Additions:
==Side note on single-line comments==
Quite unusual, single-line comments are handled by the parser instead of being skipped in the lexer. This is done so that usage of ##REM## can easily be restricted as in QB, afterall REM is more like a statement than a comment. Besides that, comments can contain QB meta statements, so comments cannot just be ignored. Note that the parser will still skip the rest of a comment (without tokenizing it), if it does not find a QB meta statement.
(Multi-line comments are completely handled during tokenization though.)
Deletions:
==Side node on single-line comments==
Quite unusual, single-line comments are handled by the parser instead of being skipped in the lexer. This is done so that usage of ##REM## can be restricted as in QB easily, afterall it's more like a statement than a comment. Besides that, QB meta statements have to be parsed in comments. Note that the parser will still skip the rest of a comment (without tokenizing it), if it does not find a QB meta statement.
(Multi-line comments are handled during tokenization completely though.)


Revision [14871]

Edited on 2010-10-16 10:26:21 by DkLwikki [Typos]
Additions:
@@[[DevToc Back to Table Of Content]]@@


Revision [14870]

Edited on 2010-10-16 10:25:39 by DkLwikki [Typos]
Additions:
#define add(x, y) foo bar x + y
And the actions of the #define/#macro parser will be:
'add' - The macro's name
'(' following the name, without space in between: Parse the parameter list.
'x' - Parameter 0.
',' - Next parameter.
'y' - Parameter 1.
')' - End of parameter list.
Create the macro body in form of macro tokens.
' ' - Create new text(" ").
'foo' - Append "foo".
' ' - Append " ".
'bar' - Append "bar".
' ' - Append " ".
'x' - Is parameter 0, create new param(0).
' ' - Create new text(" ").
'+' - Append "+".
' ' - Append " ".
'y' - Is parameter 1, create new param(1).
EOL - End of macro body.
Resulting in this macro body:
text(" foo bar "), param(0), text(" + "), param(1)
... (will be parsed)
#else 1) Set the #else visited flag for the current stack node,
so further #else's are not allowed.
2) Since the current stack node has is_true = TRUE,
that means the #else block must be skipped, -> call ppSkip().
... (skipped in ppSkip())
which returns to lexSkipToken())%%
Deletions:
###define add(x, y) foo bar x + y##
Actions of the #define/#macro parser:
- ##'add'##: The macro's name
- ##'('## following the name, without space in between: Parse the parameter list.
- ##'x'##: Parameter 0.
- ##','##: Next parameter.
- ##'y'##: Parameter 1.
- ##')'##: End of parameter list.
- Create the macro body in form of macro tokens.
- ##' '##: Create new text(" ").
- ##'foo'##: Append "foo".
- ##' '##: Append " ".
- ##'bar'##: Append "bar".
- ##' '##: Append " ".
- ##'x'##: Is parameter 0, create new param(0).
- ##' '##: Create new text(" ").
- ##'+'##: Append "+".
- ##' '##: Append " ".
- ##'y'##: Is parameter 1, create new param(1).
- ##EOL##: End of macro body.
Resulting in this macro body:
##text(" foo bar "), param(0), text(" + "), param(1)##
... (will be parsed)
#else ( 1) Set the #else visited flag for the current stack node,
so further #else's are not allowed.
2) Since the current stack node has is_true = TRUE,
that means the #else block must be skipped,
-> call ppSkip() )
... (skipped in ppSkip())
which returns to lexSkipToken())%%


Revision [14869]

Edited on 2010-10-16 10:18:01 by DkLwikki [Typos]
Additions:
%%For example:
#define add(x, y) x + y
becomes:
parameter(0), text(" + "), parameter(1)
And the expansion text will be:
argument(0) + " + " + argument(1)
%%
Deletions:
###define add(x, y) x + y##

becomes:
##parameter(0), text(" + "), parameter(1)##

And the expansion text will be:
##argument(0) + " + " + argument(1)##


Revision [14868]

Edited on 2010-10-16 10:16:16 by DkLwikki [Added some highlighting for important functions]
Additions:
The basic public interface of the lexer is from ''##lex.bas##'':
- ''##lexGetToken()##'': Retrieve current token's id, an FB_TK_* value.
- ''##lexGetLookAhead(N)##'': Look ahead N tokens
- ''##lexSkipToken()##'': Go to next token
- ''##lexGetText()##'': Returns a zstring ptr to the text of the current token, e.g. string/number literals (their values are retrieved like this), or the text representation of other tokens (e.g. operators).
- some more ''##lexGet*()##'' accessors to data of the current token
- ''##lexPeekLine()##'': Used by error reporting to retrieve the current line of code.
==Tokenization==
''##lex.bas:lexNextToken()##''
''##lexNextToken()##'' uses the current char, and if needed also the look ahead char, to parse the input. Number and string literals are handled here too. Alphanumeric identifiers are looked up in the ''##symb##'' hash table, which will tell whether it's a keyword, a macro, or another FB symbol (type, procedure, variable, ...).
Identifiers containing dots (QB compatibility) and identifier type suffixes (as in stringvar$) are handled here too (but not namespace/structure member access). Tokens can have a data type associated with them. That is also used with number literals, which can have type suffixes (as in ##&hFFFFFFFFFFFFFFFFull##).
Quite unusual, single-line comments are handled by the parser instead of being skipped in the lexer. This is done so that usage of ##REM## can be restricted as in QB easily, afterall it's more like a statement than a comment. Besides that, QB meta statements have to be parsed in comments. Note that the parser will still skip the rest of a comment (without tokenizing it), if it does not find a QB meta statement.
==File input==
''##lex.bas:hReadChar()##''
The input file is opened in ''##fb.bas:fbCompile()##''; the file number is stored in the global ''##env##'' context (similar for #includes in ''##fb.bas:fbIncludeFile()##''). The lexer uses the file number from the ''##env##'' context to read input from. It has a static zstring buffer that is used to stream the file contents (instead of reading character per character), and for Unicode input, the lexer uses a wstring buffer and decodes UTF32 or UTF8 to UTF16. The lexer advances through the chars in the buffer and then reads in the next chunk from the file. EOF is represented by returning a NULL character.
Raw text, but spaces and empty lines trimmed (like in a #define without parameters)
Same as above, just for Unicode input.
A macro parameters was used here in the declaration. The index specifies which one. During expansion, the text of argument(index) is inserted where the parameter was in the declaration.
Same as above, except the argument will be stringified during expansion.
Note: macro tokens are actually ''##symb.bi:FB_DEFTOK##'' structures, and they contain an id field holding on of the ''##FB_DEFTOK_TYPE_*##'' values to tell what they contain.


In ''##lexNextToken()##'', each alphanumeric identifier is looked up in the symb module to check whether it is a keyword or a macro. Macros and keywords are kept in the same hash table. Note that macros cannot have the name of keywords; "#define integer" causes an error. If a macro is detected, it is immediately expanded, a process also called "loading" the macro (''##pp-define.bas:ppDefineLoad()##'').
If the macro takes arguments, the macro "call" must be parsed, much like a function call, syntax-wise. Since macro expansion already happens in ''##lexNextToken()##'', the source of tokens, the parsing here is a little tricky. Forward movement is only possible by replacing (and losing) the current token. The token queue and token look ahead cannot be relied upon. Instead it can only replace the current token to move forward while parsing the macro's arguments.
Since ''##lexNextToken()##'' is used to parse the arguments, macros in the arguments themselves are recursively macro-expanded while the arguments are being parsed and recorded in text form. The argument texts are stored for use during the expansion.
The macro expansion text (##deftext##) is stored by the lexer, and now it will read characters from there for a while, instead of reading from the file input buffer. Skipping chars in the macro text is like skipping chars in the file input: Once skipped it's lost, there is no going back. So, there never is "old" (parsed) macro text, only the current char and to-be-parsed text. New macro text is prepended to the front of existing macro text. That way macros inside macros are expanded.
Preprocessor directives (###if##, ###define##, ###include##, etc.) are parsed during ''##lex.bas:lexSkipToken()##'' by calling ''##pp.bas:ppCheck()##''. After moving to the next token (or loading a new token), ''##ppCheck()##'' will check whether the new current token is a ##'#'##. If so it will also check whether the previous token was an ##EOL##. If so, it found a ##'#'## at line begin, and directly parses the PP directive, using the same ''##lexGetToken()##''/''##lexSkipToken()##'' interface used by the parser. This is necessary because some PP directives result in parser functions being called, for example ''##parser-identifier.bas:cIdentifier()##'' is used by the ###ifdef## parser, to recognize variables etc.:
So, ''##lexSkipToken()##'' is recursive because of the PP. ''##ppCheck()##'' will only call ''##pp.bas:ppParse()##'' for toplevel calls to ''##lexSkipToken()##''. This is needed for example when parsing #macro/#endmacro blocks, because they can span across multiple lines. It's also needed while the PP skips ###if FALSE## blocks.
As a result, everytime the parser skips an EOL, ''##lexSkipToken()##'' might detect a ##'#'## and handle the PP directive, skipping ahead some more lines, so the parser stays fully unaware that the PP directives are even there. The PP parsing launched from ''##lexSkipToken()##'' might even encounter an ###include## and call ''##fb.bas:fbIncludeFile()##'' to parse it immediately, recursively starting a ''##parser-toplevel.bas:cProgram()##'' for that #include file. The parser has to be able to handle the recursion that might happen during every ''##lexSkipToken()##'' at ##EOL##, but luckily that is not a big deal. The parser needs a stack to keep track of compound statements anyways.
Note that PP directives are not handled during token look ahead (''##lex.bas:lexGetLookAhead()##''). If the parser were to look ahead across ##EOL##, it could very well see a PP directive. Luckily though looking ahead across lines is never necessary.
The beginning of directives, the keyword following the ##'#'##, is parsed without macro expansion. This means redefining PP keywords (intentionally) has no effect on the PP directives. For example:
Directives like ###if## & co. make use of the PP expression parser, which does expand macros. Afterall that's the point of PP expressions. For example:
The ###define## and ###macro## directives don't do macro expansion at all. A macro's body is recorded as-is.
''##pp.bas:ppDefine()##'' first parses the macro's identifier. If there is a ##'('## following, without space in between, then the parameter list is parsed too.
Then the macro body is parsed. For each token, its text representation is retrieved via ''##lexGetText()##'', and it is appended to the macro body text. Space is preserved (but trimmed); comments are left out; in multi-line #macros empty lines are removed.
The preprocessor has its own (but fairly small and simple) expression parser (''##pp-cond.bas:ppExpression()##''). It works much like ''##parser-expression.bas:cExpression()##'', except instead of creating AST nodes, ''##ppExpression()##'' immediately evaluates the expressions.
The preprocessor uses a simple stack to manage ###if##/###endif## blocks. Those can be nested, and there may be #includes in them, but they cannot go across files. False blocks (#if 0, or the #else of an #if 1) are immediately skipped when parsing the #if 0 or the #else (''##pp-cond.bas:ppSkip()##''), before returning to ''##lexSkipToken()##''.
''##fb.bas:fbIncludeFile()##'' basically just consists of:
''##lexPush()##''
''##cProgram()##''
''##lexPop()##''
Deletions:
The basic public interface of the lexer is:
- ##lexGetToken()##: Retrieve current token's id, an FB_TK_* value.
- ##lexGetLookAhead(N)##: Look ahead N tokens
- ##lexSkipToken()##: Go to next token
- ##lexGetText()##: Returns a zstring ptr to the text of the current token, e.g. string/number literals (their values are retrieved like this), or the text representation of other tokens (e.g. operators).
- some more ##lexGet*()## accessors to data of the current token
- ##lexPeekLine()##: Used by error reporting to retrieve the current line of code.
==Tokenization (lexNextToken())==
lexNextToken() uses the current char, and if needed also the look ahead char, to parse the input. Number and string literals are handled here too. Alphanumeric identifiers are looked up in the symb hash table, which will tell whether it's a keyword, a macro, or another FB symbol (type, procedure, variable, ...).
Identifiers containing dots (QB compatibility) and identifier type suffixes (as in stringvar$) are handled here too (but not namespace/structure member access). Tokens can have a data type associated with them. That is also used with number literals, which can have type suffixes (as in &hFFFFFFFFFFFFFFFFull).
Quite unusual, single-line comments are handled by the parser instead of being skipped in the lexer. This is done so that usage of REM can be restricted as in QB easily, afterall it's more like a statement than a comment. Besides that, QB meta statements have to be parsed in comments. Note that the parser will still skip the rest of a comment (without tokenizing it), if it does not find a QB meta statement.
==File input (hReadChar())==
The input file is opened in fbCompile(); the file number is stored in the global ##env## context (similar for #includes in fbIncludeFile()). The lexer uses the file number from the "env" context to read input from. It has a static zstring buffer that is used to stream the file contents (instead of reading character per character), and for Unicode input, the lexer uses a wstring buffer and decodes UTF32 or UTF8 to UTF16. The lexer advances through the chars in the buffer and then reads in the next chunk from the file. EOF is represented by returning a NULL character.
Raw text, but spaces and empty lines trimmed (like in a #define without parameters)
Same as above, just for Unicode input.
A macro parameters was used here in the declaration. The index specifies which one. During expansion, the text of argument(index) is inserted where the parameter was in the declaration.
Same as above, except the argument will be stringified during expansion.
Note: macro tokens are actually ##symb.bi:FB_DEFTOK## structures, and they contain an id field holding on of the ##FB_DEFTOK_TYPE_*## values to tell what they contain.
In lexNextToken(), each alphanumeric identifier is looked up in the symb module to check whether it is a keyword or a macro. Macros and keywords are kept in the same hash table. Note that macros cannot have the name of keywords; "#define integer" causes an error. If a macro is detected, it is immediately expanded, a process also called "loading" the macro (ppDefineLoad()).
If the macro takes arguments, the macro "call" must be parsed, much like a function call, syntax-wise. Since macro expansion already happens in lexNextToken(), the source of tokens, the parsing here is a little tricky. Forward movement is only possible by replacing (and losing) the current token. The token queue and token look ahead cannot be relied upon. Instead it can only replace the current token to move forward while parsing the macro's arguments.
Since lexNextToken() is used to parse the arguments, macros in the arguments themselves are recursively macro-expanded while the arguments are being parsed and recorded in text form. The argument texts are stored for use during the expansion.
The macro expansion text (##deftext##) is stored by the lexer, and now it will read characters from there for a while, instead of reading from the file input buffer. Skipping chars in the macro text is like skipping chars in the file input: Once skipped it's lost, there is no going back. So, there never is "old" (parsed) macro text, only the current char and not-yet-parsed text. New macro text is prepended to the front of existing macro text. That way macros inside macros are expanded.
Preprocessor directives (#if, #define, #includes, etc.) are parsed during lexSkipToken() by calling ppCheck(). Although, ppCheck() is also called from lexGetToken() and lexGetClass() in case those need to load a token, which will only happen if there is no current token, which in turn will only happen when lexGetToken()/lexGetClass() are called before the first lexSkipToken() for the current file.
However, the important place is lexSkipToken(). After moving to the next token (or loading a new token), ppCheck() will check whether the new current token is a '#'. If so it will also check whether the previous token was an EOL. If so, it found a '#' at linebegin, and it then directly parses the PP directive, using the same lexGetToken()/lexSkipToken() interface used by the parser. This is necessary because some PP directives result in parser functions being called, for example cIdentifier() is used by the #ifdef parser, to recognize variables etc.:
So, lexSkipToken() is recursive because of the PP. ppCheck() will only call ppParse() for toplevel calls to lexSkipToken(). This is needed for example when parsing #macro/#endmacro blocks, because they can span across multiple lines. It's also needed while the PP skips #if FALSE blocks.
As a result, everytime the parser skips an EOL, lexSkipToken() might detect a '#' and handle the PP directive, skipping ahead some more lines, so the parser stays fully unaware that the PP directives are even there. The PP parsing launched from lexSkipToken() might even encounter an #include and call fbIncludeFile() to parse it immediately, recursively starting a cProgram() for that #include. The parser has to be able to handle the recursion that might happen during every lexSkipToken() at EOL, but luckily the parser needs a stack to keep track of compound statements anyways.
Note that PP directives are not handled during token look ahead (lexGetLookAhead()). If the parser were to look ahead across EOL, it could very well see a PP directive. Luckily though looking ahead across lines is never necessary.
The beginning of directives, the keyword following the '#', is parsed without macro expansion. This means redefining PP keywords (intentionally) has no effect on the PP directives. For example:
Directives like #if & co. make use of the PP expression parser, which does expand macros. Afterall that's the point of PP expressions. For example:
The #define and #macro directives don't do macro expansion at all. A macro's body is recorded as-is.
ppDefine() first parses the macro's identifier. If there is a '(' following, without space in between, then the parameter list is parsed too.
Then the macro body is parsed. For each token, its text representation is retrieved via lexGetText(), and it is appended to the macro body text. Space is preserved (but trimmed); comments are left out; in multi-line #macros empty lines are removed.
The preprocessor has its own (but fairly small and simple) expression parser (##ppExpression()##). It works much like cExpression(), except instead of creating AST nodes, ppExpression() immediately evaluates the expressions.
The preprocessor uses a simple stack to manage #if/endif blocks. Those can be nested, and there may be #includes in them, but they cannot go across files. False blocks (#if 0, or the #else of an #if 1) are immediately skipped when parsing the #if 0 or the #else (ppSkip()), before turning to lexSkipToken().
fbIncludeFile() basically consists of lexPush(), cProgram(), lexPop().


Revision [14867]

The oldest known version of this page was created on 2010-10-16 09:58:24 by DkLwikki [Added some highlighting for important functions]