On the stack code, you caught me being lazy :) I might spend some time and update that in the future. Man, I wish FB had generics.
[EDIT] Ouch, tried the java example and got an error because "protected" was misspelled in the reserved keyword list. Fixed!
[EDIT] Oh man, it took me a while to realize that it was defaulting to the much, much slower polynomial minimizer. Why do these things only crop up the second after I post a release?
A lexical analyzer generator for FreeBASIC
Re: A lexical analyzer generator for FreeBASIC
After fixing some dumb errors in Poodle, I tried out your Java lexer.
I didn't observe any errors with Token.ToString(), but identifiers were not getting captured.
It did seem to choke on a string literal in the hello world example I tried. I looked a bit at the dot graph, and it's getting confused because it considered a double quote a valid string character. This boiled down to the character constant and the string constant using the same definition for "SingleCharacter", which was defined as every character except a backslash or a single quote.
I made a separate variable called "SingleStringCharacter" that matches every character except a backslash or a double quote and used it in the string constant. That fixed the issue for me.
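The effect is easy to reproduce with plain regular expressions. This is my own Python sketch of just the two character classes (escape sequences omitted for brevity), not the actual generated code:

```python
import re

# Buggy: the string constant reused the *character*-constant class, which
# excludes backslash and single quote but still treats `"` as an ordinary
# string character.
buggy = re.compile(r'"[^\\\']+"')

# Fixed: a separate class that excludes backslash and double quote.
fixed = re.compile(r'"[^\\"]+"')

src = '"a" + "b"'
print(buggy.match(src).group())  # "a" + "b"  -- greedily swallows both literals
print(fixed.match(src).group())  # "a"        -- stops at the closing quote
```

Since the buggy class accepts `"` itself, a maximal-munch lexer happily runs past the closing quote and merges adjacent string literals into one token.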
Here is the updated Java lexer. I deleted some comments and added skip directives to the whitespace tokens for my own sanity, but feel free to take them out.
Code: Select all
# Comments
Let MComment = "" +
'/\*' + '([^\*]|\*+[^/\*])*' + '\*+/' # Multi-line comments
Let SComment = "//[^\r\n]*" # Single-line comments
# Whitespace
Skip MultiComm: '{MComment}'
Skip SingleComm: '{SComment}'
Skip Ws: '[ \t]+'
Skip Newline: '(\r\n)|(\r)|(\n)'
Skip Vertical: '\v'
# Literals
Let OctalEscape = '(([0-3][0-7][0-7])|([0-7][0-7])|([0-7]))'
Let EscapeSequence = '(\\(([btnfr"\\''])|({OctalEscape})))'
Let SingleCharacter = "[^\\']"
Let SingleStringCharacter = '[^\\"]'
Let StringCharacters = '{SingleStringCharacter}|{EscapeSequence}'
Let StringConstant = '"{StringCharacters}+"|("")'
Let CharacterConstant = "'(({SingleCharacter})|({EscapeSequence}))'"
Let HexDigits = '(' +
'([[:xdigit:]]([[:xdigit:]]|_)+[[:xdigit:]])' + '|' +
'([[:xdigit:]][[:xdigit:]])' + '|' +
'[[:xdigit:]]' +
')'
Let HexNumeral = '0[Xx]{HexDigits}'
Let HexadecimalDigits = '0[Xx]([[:xdigit:]]|' +
'([[:xdigit:]][[:xdigit:]])|' +
'([[:xdigit:]]([[:xdigit:]]|_)+[[:xdigit:]]))'
Let Underscores = '[_]+'
#
Let BinaryDigits = '(([01]([01]|_)+[01])|([01][01])|([01]))'
Let BinaryNumeral = '0[Bb]{BinaryDigits}'
Let OctalDigits = '([0-7]([0-7]|_)+[0-7])|([0-7][0-7])|([0-7])'
Let OctalNumeral = '0{OctalDigits}|0{Underscores}{OctalDigits}'
Let Digits = '(([[:digit:]]([[:digit:]]|[_])+[[:digit:]])|([[:digit:]][[:digit:]])|([[:digit:]]))'
Let DecimalNumeral = '((0)|([1-9]{Digits}?)|([1-9]{Underscores}{Digits}))'
Let IntegerConstant = '(' +
'(' +
'({DecimalNumeral})' + '|' + # Decimal integer
'({HexNumeral})' + '|' + # Hexadecimal integer
'({OctalNumeral})' + '|' + # Octal integer
'({BinaryNumeral})' + # Binary integer
')' +
'[Ll]?' + # Optional long suffix, valid after any of the above
')'
Let Bool = '(true|false)'
Let Null = '(null)'
Capture Constant: '' +
'{IntegerConstant}' + '|' +
#'{FloatingPointConstant}' + '|' +
'{CharacterConstant}' + '|' +
'{StringConstant}' + '|' +
'{Bool}' + '|' +
'{Null}'
# Keywords
abstract: 'abstract'
assert: 'assert'
boolean: 'boolean'
break: 'break'
byte: 'byte'
case: 'case'
catch: 'catch'
char: 'char'
class: 'class'
const: 'const'
continue: 'continue'
default: 'default'
do: 'do'
double: 'double'
else: 'else'
enum: 'enum'
extends: 'extends'
final: 'final'
finally: 'finally'
float: 'float'
for: 'for'
goto: 'goto'
if: 'if'
implements: 'implements'
import: 'import'
instanceof: 'instanceof'
int: 'int'
interface: 'interface'
long: 'long'
native: 'native'
new: 'new'
package: 'package'
private: 'private'
protected: 'protected'
public: 'public'
return: 'return'
short: 'short'
static: 'static'
strictfp: 'strictfp'
super: 'super'
switch: 'switch'
synchronized: 'synchronized'
this: 'this'
throw: 'throw'
throws: 'throws'
transient: 'transient'
try: 'try'
void: 'void'
volatile: 'volatile'
while: 'while'
# Separators
dots: '\.\.\.'
left_parenthesis: '\('
right_parenthesis: '\)'
left_square_bracket: '\['
right_square_bracket: '\]'
left_curly_bracket: '\{'
right_curly_bracket: '\}'
semicolon: '\;'
comma: ','
dot: '\.'
at: '\@'
# Operators
plus_equals: '\+\='
min_equals: '\-\='
mul_equals: '\*\='
div_equals: '\/\='
and_equals: '\&\='
or_equals: '\|\='
xor_equals: '\^\='
mod_equals: '\%\='
shift_left_equals: '\<\<\='
shift_right_equals: '\>\>\='
shift_right_unsigned_equals: '\>\>\>\='
equ: '\=\='
not_equ: '\!\='
less_than_eq: '\<\='
greater_than_eq: '\>\='
less_than: '\<'
greater_than: '\>'
equals: '\='
logical_not: '\!'
binary_not: '\~'
question: '\?'
colon: '\:'
andand: '\&\&'
oror: '\|\|'
inc_plus: '\+\+'
inc_min: '\-\-'
add: '\+'
minus: '\-'
asterisk: '\*'
divide: '/'
binary_and: '\&'
binary_or: '\|'
binary_xor: '\^'
modulo: '\%'
unsigned_shift_right: '\>\>\>'
shift_left: '\<\<'
shift_right: '\>\>'
double_colon: '\:\:'
Capture identifier: '([[:alpha:]]|_)[[:word:]]*'
Re: A lexical analyzer generator for FreeBASIC
With the error lines, that turned out to be a case of lazy initialization. Variables aren't parsed until they're instanced, so the exception was caught while processing the line that actually instanced the variable, not the line that defined it.
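The failure mode can be sketched in a few lines of Python. This is my own illustration with made-up names (`LazyVariables`, `parse_regex`), not Poodle's actual code:

```python
import re

def parse_regex(text):
    # Hypothetical stand-in for Poodle's pattern parser.
    try:
        return re.compile(text)
    except re.error as e:
        raise ValueError(str(e))

class LazyVariables:
    """Definitions are stored as raw text and only parsed at first use."""
    def __init__(self):
        self.raw = {}     # name -> (definition text, defining line)
        self.parsed = {}  # name -> compiled pattern

    def define(self, name, text, line):
        # No validation happens here, so a bad definition is accepted silently.
        self.raw[name] = (text, line)

    def get(self, name):
        if name not in self.parsed:
            text, line = self.raw[name]
            try:
                self.parsed[name] = parse_regex(text)
            except ValueError as e:
                # Report against the defining line; otherwise the error
                # surfaces at whatever line happened to instance the variable.
                raise ValueError("line %d: %s" % (line, e))
        return self.parsed[name]
```

With the stored line number, a bad definition on line 3 still reports "line 3" even when it is first used much later in the file.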
The issue is fixed now in the git repository. I'll replace the installer I posted later today.
Re: A lexical analyzer generator for FreeBASIC
Looking at that wiki example I think you just want a parser generator.
With sections, Poodle could be used as a rudimentary parser generator, but a really terrible one. It's like writing an OS in a batch file: yes, it's possible, but yes, it's horrible (and after 16 levels of recursion you'll run out of stack space). I went ahead and fixed the stack size thing, though, because why not?
Sections work best when there's a clear trigger to enter and a clear trigger to leave. For instance, with the FreeBASIC example, "/*" enters a multi-line comment and "*/" leaves it. In C, "#anything" enters a preprocessor instruction, and a newline exits. "Switch Section" was added as a last resort, because I couldn't figure out a good syntax to let a rule trigger exiting multiple sections.
If the triggers get more complicated than that, the switching mechanism should probably be left to the parser. Because of that, I think I'm going to change "EnterSection", "ExitSection" and "SwitchSection" from protected to public, so that the parser can manually switch the lexer to different modes based on more complicated rules.
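A parser-driven mode switch could then look roughly like this. Only the method names EnterSection/ExitSection/SwitchSection come from Poodle; the rest is a hypothetical Python sketch of the idea, not the generated API:

```python
class Lexer:
    def __init__(self):
        self.sections = ["Main"]  # stack of active sections, "Main" at the bottom

    def EnterSection(self, name):
        # Push a nested section, e.g. when the parser sees "/*"
        self.sections.append(name)

    def ExitSection(self):
        # Pop back to the enclosing section, e.g. on "*/"
        if len(self.sections) > 1:
            self.sections.pop()

    def SwitchSection(self, name):
        # Replace the current section instead of nesting deeper
        self.sections[-1] = name

    @property
    def section(self):
        return self.sections[-1]

lexer = Lexer()
lexer.EnterSection("MultiLineComment")  # parser decided we're in a comment
lexer.ExitSection()                     # parser saw the closing trigger
```

With the methods public, rules the lexer can't express (say, context-dependent triggers) live in the parser, which just tells the lexer which mode to read the next tokens in.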
Re: A lexical analyzer generator for FreeBASIC
I applied your optimization for sink states in the FB lexer and submitted it to the git repository.
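For anyone curious what that optimization buys: a sink (dead) state is a DFA state from which no accepting state is reachable, so the moment the lexer enters one it can stop reading instead of scanning on pointlessly. Finding them is a reverse reachability sweep; here is a sketch under my own DFA encoding, not the actual patch:

```python
def find_sink_states(transitions, accepting):
    """transitions: {state: {symbol: next_state}}.
    Returns the states from which no accepting state is reachable."""
    # Build the reversed edge map.
    reverse = {state: set() for state in transitions}
    for src, edges in transitions.items():
        for dst in edges.values():
            reverse.setdefault(dst, set()).add(src)
    # Flood-fill backwards from the accepting states; anything
    # never reached is a sink.
    alive = set(accepting)
    stack = list(accepting)
    while stack:
        state = stack.pop()
        for pred in reverse.get(state, ()):
            if pred not in alive:
                alive.add(pred)
                stack.append(pred)
    return set(transitions) - alive
```

In the generated lexer, any transition into such a state can then be replaced by an immediate "no match" return.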
Re: A lexical analyzer generator for FreeBASIC
I tried the new version and got an error when generating xml code. The message I am getting looks like this
Code: Select all
Unable to load language plug-in 'xml': No module named xml.etree.cElementTree
I checked the library.zip you provided (I am guessing that whatever Python packages Poodle uses must be in the package), and xml.etree is not in library.zip. I extracted the zip file and added the following files to the zip file (taken from the Python distribution on my PC)
Code: Select all
PYTHON_INSTALL_PATH/lib/xml/*.*
And I copied the following two files, also from the Python distribution, to POODLE_LEX_INSTALL_PATH/
Code: Select all
PYTHON_INSTALL_PATH/DLLs/xml/_elementtree.pyd
PYTHON_INSTALL_PATH/DLLs/xml/pyexpat.pyd
And... it worked :) xml generation worked as expected.
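As a side note, a plug-in can often sidestep the missing compiled module with the standard fallback import. This is a general Python idiom, not necessarily what Poodle's xml plug-in does:

```python
try:
    # C-accelerated parser; needs _elementtree.pyd and pyexpat.pyd bundled
    import xml.etree.cElementTree as ElementTree
except ImportError:
    # Pure-Python fallback: slower, but has no compiled dependencies
    import xml.etree.ElementTree as ElementTree

root = ElementTree.fromstring("<lexer><token id='1'/></lexer>")
```

With the fallback in place, the plug-in still loads from a library.zip that only contains the pure-Python xml.etree package.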