I found an interesting optimization that can be performed by a small program operating on lexicalanalyzer.bas.
The optimization replaces the code representing part of the state machine with a smaller piece of code.
It operates on states in the state machine that:
- are final;
- have exactly one incoming arc;
- have no outgoing arcs.
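Such states can be found mechanically. A minimal sketch in Python (the dict-based DFA encoding and the state names are my own assumptions for illustration, not poodle-lex internals):

```python
# Hypothetical sketch: given a DFA as a transition table, find the final
# states that are reached by exactly one arc and have no outgoing arcs.

def removable_states(transitions, final_states):
    """transitions: dict mapping (state, char) -> next_state."""
    incoming = {}
    outgoing = {}
    for (state, _char), target in transitions.items():
        incoming[target] = incoming.get(target, 0) + 1
        outgoing[state] = outgoing.get(state, 0) + 1
    return {s for s in final_states
            if incoming.get(s, 0) == 1 and outgoing.get(s, 0) == 0}

# A tiny machine: start --'('--> lparen (final), start --'a'--> ident (final),
# ident --'a'--> ident (ident has an outgoing arc, so it is kept).
t = {('start', '('): 'lparen',
     ('start', 'a'): 'ident',
     ('ident', 'a'): 'ident'}
print(removable_states(t, {'lparen', 'ident'}))  # {'lparen'}
```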
It would be possible to eliminate the transition to such a final state completely. The final state could be removed too, by moving
the action associated with it to the state that transitions into it. But doing that would take more than
the simple code-replacement optimization I came up with. My 'optimization' merely reduces the size of the source
code; it does not necessarily shrink the resulting binary, nor will it improve performance. It works like this.
Whenever the following pattern appears in lexicalanalyzer.bas (where {identifier} refers to the usual definition of an identifier)
Code: Select all
Case StateMachineState.{identifier}
    Select Case This.Character
        Case Else
            Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Poodle.Unicode.Text())
    End Select
replace it with this
Code: Select all
Case StateMachineState.{identifier}
    Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Poodle.Unicode.Text())
A small FreeBASIC program that performs the above optimization (it's a bit of a hack):
Code: Select all
''search for sequence that looks like
'' Case StateMachineState.{identifier}
'' Select Case This.Character
'' Case Else
'' Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Text)
'' End Select
''and replace that sequence with
'' Case StateMachineState.{identifier}
'' Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Text)
var fh = freefile()
open "lexicalanalyzer.bas" for input as #fh
if (err()) then
    close
    end
end if

var gh = freefile()
open "lexicalanalyzer.opt.bas" for output as #gh
if (err()) then
    close
    end
end if

dim s as string
dim v as string
dim select_ as string
dim else_ as string
dim return_ as string
dim end_ as string
dim out_ as string

while (not eof(fh))
    s = ""
    out_ = ""
    line input #fh,s
    v = trim(s)
    out_ = s & !"\r\n"
    ''always output the first line regardless of content
    print #gh,out_;
    if (left(v,len("Case StateMachineState.")) <> "Case StateMachineState.") then
        continue while
    end if
    line input #fh,s
    v = trim(s)
    if (v <> "Select Case This.Character") then
        out_ = s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    select_ = s
    line input #fh,s
    v = trim(s)
    if (v <> "Case Else") then
        out_ = select_ & !"\r\n" & s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    else_ = s
    line input #fh,s
    v = trim(s)
    if (left(v,len("Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken")) _
        <> _
        "Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken") then
        out_ = select_ & !"\r\n" & else_ & !"\r\n" & s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    return_ = s
    line input #fh,s
    v = trim(s)
    if (v <> "End Select") then
        out_ = select_ & !"\r\n" & else_ & !"\r\n" & return_ & !"\r\n" & s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    ''pattern matched: emit only the Return line
    out_ = return_ & !"\r\n"
    print #gh,out_;
wend
The above 'optimization' does not compare to what can be achieved by actually removing states from the resulting code completely.
This is what a transition to a final state looks like for a separator (in this case a left parenthesis)
Code: Select all
Case POODLE_UCS_LEFT_PARENTHESIS
    State = StateMachineState.left_parenthesis
The above transition puts the machine into one of the final states referred to earlier. But State need not be set to left_parenthesis at all: the code associated with StateMachineState.left_parenthesis could be executed right away, without changing state.
Optimization using the above scheme is not without its problems. Poodle-lex adds characters to the current token at the end of a loop. That loop gets
cut short when the transition to StateMachineState.left_parenthesis does not take place, which means the text associated with the current token would be missing its final character. Changing that behaviour of the scanner takes a small rewrite of the routine that returns a token (Poodle.LexicalAnalyzerToken).
That routine should consume another character based upon a parameter (if the parameter is true, consume a character). The code added to Poodle.LexicalAnalyzer would be a copy of the code that sits at the end of the main loop:
Code: Select all
If parameter Then
    If Capture = 1 Then Text.Append(This.Character)
    This.Character = This.Stream->GetCharacter()
End If
This.Id = Id
This.Text = Text
The above is of course just an idea. I might have misunderstood your code, in which case my rewrite of Poodle.LexicalAnalyzerToken would mess up the scanner.
I wanted to find out how many states of the state machine could be removed by removing the final states mentioned above. So I counted how many
final states of that kind were in the state machine generated for the Java grammar. These are the numbers:
Total number of states: 126
Number of removable states: 45
That's a reduction of roughly 36%, a reduction in the number of generated states that matters.
Final states that have one incoming arc and no outgoing arcs are generated mostly for single-character punctuators (eg ( ) [ ] { }) but also for tokens
that consist of several characters (eg -> ++ -- -= +=). If poodle-lex is used to scan a programming language, chances are the language definition will contain quite a few punctuators (and the above optimization will yield a smaller state machine). If poodle-lex is used to scan something other than a programming
language, the optimization might not yield much of an improvement.
I noticed half the size of the Java scanner is made up of states that distinguish keywords from identifiers. If all rules for keywords are removed and only the rule for identifiers is kept, the state machine gets a lot smaller: with a grammar containing quite a few keywords, as much as half the state machine can be removed by leaving the keyword rules out. Some scanner generators have a section called 'keywords'. Upon finding an identifier that matches a keyword,
the scanner returns the keyword token associated with it (instead of returning a plain identifier token).
A hashtable or a sorted array would be enough to support that type of feature.
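A minimal sketch of that keyword-table idea in Python (the token names here are invented for illustration, not poodle-lex's actual output):

```python
# Hypothetical sketch: after the scanner matches an identifier, one hash
# lookup decides whether it is really a keyword. This replaces the dozens of
# DFA states that would otherwise spell out each keyword character by
# character.

KEYWORDS = {"if": "KW_IF", "else": "KW_ELSE", "while": "KW_WHILE",
            "return": "KW_RETURN"}

def classify(identifier_text):
    return KEYWORDS.get(identifier_text, "IDENTIFIER")

print(classify("while"))   # KW_WHILE
print(classify("whilst"))  # IDENTIFIER
```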
Last but not least, I tried to use poodle-lex to scan a wiki-like language.
Code: Select all
bold: '\*\*'
underline: '__'
italic: '//'
monospace: '##'
center: '@@'
header1: '======'
header2: '====='
header3: '===='
header4: '==='
header5: '=='
separator: '----'
line_break: '---'
let space = ' \t'
indent: '(~| |\t)'
special_tag: '{{' switch section
skip: '{space}+'
fbdoc: 'fbdoc' switch section fbdoc_section
end section
link_start: '\[\[' switch section
skip: '{space}*'
link_target: '{identifier}' switch section link_description
end section
section link_description:
skip: '{space}+' switch section link_description_text
end section
section link_description_text:
skip: '\]\]' exit section
link_char: '.'
end section
#and so forth and so on...
The problem is with link_description_text. The rule works, but it leaves the user with
the link description chopped up into x tokens (where x is the number of characters in the link description).
Defining link_char as '.+' would not work, as it would match everything up to the end of the input.
link_char: '[^\]]+' could work, but that assumes ] cannot be used in a link description, which may or may not
be true. If a single ] is allowed then '[^\]]+' would not work as expected (it would merely cut the link description into ]-delimited pieces of text).
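That chopping behaviour is easy to check with Python's re module (an illustration only; poodle-lex's regex dialect may differ):

```python
import re

# A character class that excludes ']' matches runs of characters, so a
# description containing ']' is split into ]-delimited pieces rather than
# being matched whole.
description = "see [also] this"
pieces = re.findall(r'[^\]]+', description)
print(pieces)  # ['see [also', ' this']
```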
The problem repeats on a global scale when using a catch-all rule like catch_all: '.'.
catch_all would match a single character at a time, creating one token for any character that
isn't matched by any of the other rules. Using a rule that matches any character
as a 'last resort' would therefore generate lots and lots of one-character tokens, and it would slow down scanning considerably.
What I'd want is a special variable (called, for example, {any_token}) that initiates the creation of an any_token when
none of the other rules match. It would act as if defined by:
'If none of the rules match, add the current character to the any_token (but do not return it yet).'
When some rule does match, the scanner could test whether the any_token has content. If it does, the any_token should be returned before the token belonging to the currently matched text is returned.
As an example, here is the token stream for the input ** bold text is ** good text ** when scanned with the bold rule from the rules file above.
Tokens (with content between parens):
Code: Select all
bold(**)
any_token( bold text is )
bold(**)
any_token( good text )
bold(**)
end_of_input()
It's a bit like the way poodle-lex returns Poodle.LexicalAnalyzerToken.InvalidCharacter when it finds a character that
does not fit the definition of any of the rules. But instead of consuming one character and returning it as an InvalidCharacter, several characters could be grouped and
returned as a single any_token. If the user does not define any_token (any_token does not occur on the right-hand
side of any rule) then poodle-lex could simply deal with invalid characters using Poodle.LexicalAnalyzerToken.InvalidCharacter, as it does now.
In terms of code changes you'd need to add support for the any_token. In terms of ease of use, the any_token would prove invaluable
when writing scanners where the user wants to use poodle-lex as a filter: one that, for example, processes certain parts of the
input but lets all other parts pass through unchanged (copied to the output verbatim).
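To make the idea concrete, here is a sketch (in Python, since poodle-lex itself is Python) of any_token implemented as a post-processing filter. The (kind, text) token shape and the 'invalid' kind are my own assumptions, not poodle-lex's real interface:

```python
# Hypothetical sketch: consecutive invalid-character tokens coming out of the
# scanner are merged into a single any_token before being handed to the
# caller.

def coalesce_any_token(tokens):
    buffer = ""
    for kind, text in tokens:
        if kind == "invalid":
            buffer += text          # accumulate, do not emit yet
            continue
        if buffer:                  # a real rule matched: flush the run first
            yield ("any_token", buffer)
            buffer = ""
        yield (kind, text)
    if buffer:                      # flush a trailing run at end of input
        yield ("any_token", buffer)

raw = [("bold", "**"), ("invalid", "b"), ("invalid", "o"),
       ("bold", "**"), ("end_of_input", "")]
print(list(coalesce_any_token(raw)))
# [('bold', '**'), ('any_token', 'bo'), ('bold', '**'), ('end_of_input', '')]
```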
It's a pity I seem to be the only one interested enough in your project to post messages. Afaik there is no fb equivalent of flex
(or bison or ...) other than my attempt at porting coco/r to FreeBASIC. Perhaps the fact that poodle-lex is written in Python is
what's keeping other fb programmers away from your tool.
Some final words on automata generation: follow automata. Very interesting.
http://blog.kerios.fr/wp-content/upload ... tomata.pdf
A follow automaton (an NFA) is created using the same kind of construction scheme as the Thompson construction (as used by poodle-lex?).
But it generates an epsilon-NFA that is much smaller than the one generated by the Thompson construction. A smaller NFA means less input for the subset construction (and thus faster generation of the minimized DFA). The paper also contains an algorithm to turn the generated epsilon-NFA into an epsilon-free NFA, which brings down the size of the input for subset construction even more.
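The inner step that a smaller epsilon-NFA speeds up is the epsilon-closure computed over and over during subset construction; a minimal sketch (the dict-of-sets NFA encoding is my own, purely for illustration):

```python
# Minimal epsilon-closure: the set of states reachable from 'states' by
# following only epsilon arcs. Subset construction calls this for every DFA
# state it builds, so fewer NFA states and epsilon arcs mean less work.

def epsilon_closure(states, eps):
    """eps: dict mapping state -> set of states reachable by one epsilon arc."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

eps = {0: {1, 2}, 2: {3}}
print(epsilon_closure({0}, eps))  # {0, 1, 2, 3}
```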