A lexical analyzer generator for FreeBASIC
Hey all,
It's been awhile since I touched anything FreeBASIC, so I started toying with a project called Poodle-Lex. It's a lexical analyzer generator which emits FreeBASIC code, similar to lex and its ilk:
https://github.com/parkertomatoes/poodl ... /0.5-alpha
It's written in Python, but generates FreeBASIC code for both Linux and Windows that takes in a stream of Unicode code points and spits out tokens and text. It's been something of a learning project for me, so it will also generate NFA and DFA graphs in GraphViz format for anyone who wants to experiment with it.
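For anyone curious what those graph dumps involve: a DFA transition table can be turned into GraphViz DOT in a few lines. This is an illustrative Python sketch of mine (the dict format and the function name are assumptions, not Poodle-Lex's actual emitter):

```python
def dfa_to_dot(transitions, start, accepting):
    """Serialize a DFA given as {(state, symbol): target} to GraphViz DOT.
    Hypothetical sketch; not Poodle-Lex's actual API."""
    states = {start} | set(accepting)
    for (src, _symbol), dst in transitions.items():
        states.update((src, dst))
    lines = ["digraph dfa {", "  rankdir=LR;"]
    for state in sorted(states):
        # Accepting states get the conventional double circle.
        shape = "doublecircle" if state in accepting else "circle"
        lines.append('  "%s" [shape=%s];' % (state, shape))
    for (src, symbol), dst in sorted(transitions.items()):
        lines.append('  "%s" -> "%s" [label="%s"];' % (src, dst, symbol))
    lines.append("}")
    return "\n".join(lines)

# A two-state DFA accepting one or more lowercase letters.
dot = dfa_to_dot({("s0", "a-z"): "s1", ("s1", "a-z"): "s1"}, "s0", {"s1"})
```

Feeding the resulting string to the `dot` tool renders the automaton.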
Anyway, Happy New Year to anyone that can recall me :)
EDIT:
Latest release:
https://github.com/parkertomatoes/poodl ... es/tag/1.0
Last edited by jofers on Apr 22, 2014 4:25, edited 2 times in total.
Re: A lexical analyzer generator for FreeBASIC
A couple of questions/remarks.
First one: what does the version number 0.5-alpha mean? Does the 'alpha' mean 'it does not work', or does it mean 'it works, but I need to test a bit more to make sure it works well'?
Secondly: why Python? Why not everything in FreeBASIC? I have experimented with tools like re2c, byacc, lemon etc., and I ended up adding a new back end to existing C code. But that C code was already written (albeit not by me). In your case nothing was written yet, so why not do everything in FB? I would have to install Python to get the tool working. I happen to have Python installed, but I am guessing most FB programmers would not be too thrilled about having to install Python just to use your analyzer generator.
And where is the goto? Or rather, where are the gotos? I have written some state machines by hand, starting with an idea of what I want, turning that into a GraphViz diagram, and creating code from there: one label per state and a couple of if ... then statements per transition. The input is a string. An identifier rule becomes something like this (s is the input string, idx is an integer index, slen is the length of the input, and end_ is a ubyte):
Code: Select all
start:
if (s[idx] >= asc("A") andalso s[idx] <= asc("Z")) then goto s1
if (s[idx] >= asc("a") andalso s[idx] <= asc("z")) then goto s1
if (s[idx] = asc("_")) then goto s1
''not an identifier and identifier is the only possible kind of lexeme
goto error_
s1:
''check to see if there is one more character left to scan
if (check(idx,1,slen)) then
end_ = 1
goto done_
end if
idx += 1
''check bounds first
if (s[idx] < asc("0") orelse s[idx] > asc("z")) then goto done_
''continue with checking upper bound of character classes first
if (s[idx] <= asc("9")) then goto s1
if (s[idx] <= asc("Z") andalso s[idx] >= asc("A")) then goto s1
if (s[idx] = asc("_")) then goto s1
if (s[idx] <= asc("z") andalso s[idx] >= asc("a")) then goto s1
done_:
''add a token to the token_stream
''and either end scanning or goto start
if (end_) then
return token_stream
else
goto start
end if
error_:
check(idx,1,slen) returns -1 when there are no more characters to process. You can minimize the number of calls to check.
If, for example, there are 5 characters needed to complete the scan of
some lexeme then check(idx,5,slen) could be used.
After check(idx,5,slen) 5 characters can be processed without having to check
bounds. Whether this is very useful depends upon the language recognized.
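A minimal Python sketch of the check() helper as described above (the name and semantics are taken from the description, so treat this as an interpretation rather than AGS's actual code):

```python
def check(idx, n, slen):
    """True when there is no character at position idx + n, i.e. fewer
    than n characters remain to be read past idx (mirrors check())."""
    return idx + n >= slen

s = "andalso"
idx, slen = 0, len(s)
# One bounds test covers the next five reads, so the per-character
# transitions guarded by it can skip their own bounds checks.
can_read_five_more = not check(idx, 5, slen)
```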
Replacing the state = ... / select case construction with a goto label: structure is mostly a matter of speed.
There is of course the problem of me writing state machines by hand. I perform
minimization by looking at the picture produced by graphviz. Whether it is
at all possible to generate state machines that resemble what I write by hand
I don't really know (my approach tends to suffer a bit from code replication).
A paper on the subject of using gotos (combined with switches) when creating
a lexical analyser can be found here (it's a link to a paper on the implementation of re2c)
http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf
The paper was written by the creators of re2c (dated 1994), which, even after so many years (1994 is a looonnnggg time ago), is still the lexing tool to beat. Few scanner generators can create scanners that run at a pace equal to the ones created by re2c. Re2c uses a combination of switches and if .. then statements combined with gotos to implement a scanner.
One other thing (in terms of performance) could be interesting.
Would it be possible to store positions of matched text rather than the matched
text itself? You have used memory mapping to access a file. I am guessing that's about
the same as copying the file to memory.
On every occasion a token gets created, some text gets copied from the mmap to
a token. That accumulates to a lot of copying per file.
Only storing the position of the token within the mapping would save lots of copying.
Position could be start and length (two integers) or start and end (again 2 integers).
Setting two integers per token takes a fraction of the time it takes to copy text from
mmap to token.
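The (start, length) idea might look like this in Python terms (TokenView is my name for the sketch, not anything in Poodle-Lex): the token stores two integers and slices the buffer only when the caller actually asks for the text.

```python
class TokenView:
    """Sketch of a 'zero-copy' token: it keeps (start, length) into the
    mapped buffer instead of copying the matched text eagerly."""
    __slots__ = ("buffer", "start", "length", "kind")

    def __init__(self, buffer, start, length, kind):
        self.buffer, self.start, self.length, self.kind = buffer, start, length, kind

    @property
    def text(self):
        # The copy happens only here, on demand.
        return self.buffer[self.start:self.start + self.length]

source = b"if idx > 9 then done"
tok = TokenView(source, 3, 3, "Identifier")
```

The trade-off AGS hints at: creating the token costs two integer stores instead of a string copy, but the token is only valid while the mapping is alive.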
And finally: the code contains a function minimize that converts a DFA into a minimized DFA. The file that contains the code for the minimization is called "NaiveDFAMinimizer". Why is function minimize considered naive? Have you considered non-naive methods of minimizing the DFA, and is O(n^2) an indication of the worst-case or the average-case behaviour of function minimize? And what about... Brzozowski?
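For reference, Brzozowski's method is literally reverse, determinize, reverse, determinize. A compact Python sketch over a dict-based NFA (the representation is mine, chosen for brevity, and is not how Poodle-Lex stores automata):

```python
def determinize(delta, starts, finals):
    """Subset construction; delta maps (state, symbol) -> set of states."""
    start = frozenset(starts)
    out, work, seen = {}, [start], {start}
    while work:
        cur = work.pop()
        by_symbol = {}
        for (q, a), targets in delta.items():
            if q in cur:
                by_symbol.setdefault(a, set()).update(targets)
        for a, targets in by_symbol.items():
            nxt = frozenset(targets)
            out.setdefault((cur, a), set()).add(nxt)
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    new_finals = {q for q in seen if q & frozenset(finals)}
    return out, {start}, new_finals

def reverse(delta, starts, finals):
    """Flip every edge; old finals become starts and vice versa."""
    rev = {}
    for (q, a), targets in delta.items():
        for t in targets:
            rev.setdefault((t, a), set()).add(q)
    return rev, set(finals), set(starts)

def brzozowski(delta, start, finals):
    """Minimal DFA via reverse, determinize, reverse, determinize."""
    d = determinize(*reverse(delta, {start}, finals))
    return determinize(*reverse(*d))

# Three states recognizing "one or more a's"; states 1 and 2 are equivalent,
# so the minimized machine has only two states.
delta = {(0, "a"): {1}, (1, "a"): {2}, (2, "a"): {1}}
min_delta, min_starts, min_finals = brzozowski(delta, 0, {1, 2})
min_states = set(min_starts) | min_finals
for (q, _a), targets in min_delta.items():
    min_states.add(q)
    min_states |= targets
```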
Re: A lexical analyzer generator for FreeBASIC
Extract words from a string (it also handles preprocessor tokens starting with # or containing ##, but there is no UTF-8 support here).
The output (one word per line) follows the code.
Code: Select all
VAR s = "#DEFINE test(_T_) if idx > A ANDALSO typ <> U#_T_ THEN done"
VAR size = 0, start = 1
FOR idx AS INTEGER = 0 TO LEN(s) - 1
SELECT CASE AS CONST s[idx]
CASE ASC("0") TO ASC("9")
IF size THEN size += 1
CASE ASC("#"), ASC("_"), ASC("A") TO ASC("Z"), ASC("a") TO ASC("z")
size += 1 : IF size = 1 THEN start = idx
CASE ELSE
IF 0 = size THEN CONTINUE FOR
?MID(s, start + 1, size)
size = 0
END SELECT
NEXT
IF size THEN ?MID(s, start + 1)
#DEFINE
test
_T_
if
idx
A
ANDALSO
typ
U#_T_
THEN
done
Re: A lexical analyzer generator for FreeBASIC
Thank you for looking so deeply at it, AGS.
What does the version number mean?
0.5 alpha just means that I'm the only one that's looked at it so far, and I'd be embarrassed if I threw out "1.0" and it didn't even work.
Why Python?
For two reasons. First, scripting languages are a natural fit for little text-parsing and code-generation tasks like this, because duck typing lets you get off the ground quickly at the expense of performance. Second, FreeBASIC doesn't yet have generics, making it tedious to create complex data structures.
Why not use gotos for the state machine? Why copy the text for each token?
Those are two solid optimizations, but I wanted the generated code to emphasize readability over performance. I think the text copying is a bigger sacrifice than the state machine method, but I didn't want to return encoded text to the user or give them something that was tied to the lifetime of the stream. You're right that it copies a lot of text, though, and it may make sense to provide both an optimized emitter and a readable emitter in the future if that becomes a bottleneck for people.
Why the O(N^2) algorithm for minimizing the DFA? Why not Brzozowski?
I started trying to use Brzozowski's algorithm, because it's such a cool weird algorithm. It worked, but it kept collapsing states from different rules and I couldn't identify which rule it had matched in the final state. I think the code's still there in the DFA module.
So I used a textbook algorithm that was O(N^2) to get it working, with the plan to go back and use Hopcroft's algorithm (O(N log N)). Buuut, it just worked well enough at that point and I was lazy.
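The textbook O(N^2) method described here is presumably the pair-marking (table-filling) algorithm; a Python sketch under that assumption:

```python
from itertools import combinations

def minimize_pairs(states, alphabet, delta, finals):
    """Textbook O(n^2) pair-marking minimization of a complete DFA;
    delta maps (state, symbol) -> state. Returns the partition of the
    states into equivalence classes."""
    # 1. Mark every pair where exactly one state is accepting.
    marked = {frozenset(p) for p in combinations(states, 2)
              if (p[0] in finals) != (p[1] in finals)}
    # 2. Propagate: a pair is distinguishable if some symbol leads
    #    to an already-distinguishable pair.
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for a in alphabet:
                if frozenset((delta[(p, a)], delta[(q, a)])) in marked:
                    marked.add(pair)
                    changed = True
                    break
    # 3. Unmarked pairs are equivalent; merge them into classes.
    parent = {s: s for s in states}
    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s
    for p, q in combinations(states, 2):
        if frozenset((p, q)) not in marked:
            parent[find(q)] = find(p)
    groups = {}
    for s in states:
        groups.setdefault(find(s), set()).add(s)
    return {frozenset(g) for g in groups.values()}

# The "one or more a's" machine again: states 1 and 2 collapse.
classes = minimize_pairs({0, 1, 2}, {"a"},
                         {(0, "a"): 1, (1, "a"): 2, (2, "a"): 1}, {1, 2})
```

Keeping per-rule accepting states distinct would mean seeding step 1 with pairs of states that accept different rules, which is exactly the issue Brzozowski's method glosses over.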
To TJF:
That's a handy tokenizer that can get you started parsing a chunk of text without any dependencies, and getting something going is way more important than most people realize, including myself. I've had projects die because I spent too much time on a lexer and my motivation petered out. Hopefully, this tool can be used to create a more complex set of tokens without adding too much complexity to the life of whoever is using it.
Re: A lexical analyzer generator for FreeBASIC
jofers wrote: ... parsing a chunk of text ...
No, sorry. It cannot handle real code. It's just the part AGS sent.
To parse real code you also have to separate string literals and comments. (The first are a bit tricky when they contain escape sequences.)
Code: Select all
VAR s = "#DEFINE test(_T_) if idx > A*9ANDALSO typ <> U#_T_ THEN done"
VAR size = 0, start = 1
FOR idx AS INTEGER = 0 TO LEN(s) - 1
SELECT CASE AS CONST s[idx]
CASE ASC("/") ' multi line comments
IF s[idx + 1] <> ASC("'") THEN EXIT SELECT
idx += 2
DO
SELECT CASE AS CONST s[idx]
CASE 0 : EXIT DO
CASE ASC("'")
SELECT CASE AS CONST s[idx + 1]
CASE 0 : EXIT DO
CASE ASC("/") : idx += 1 : EXIT DO
END SELECT
END SELECT : idx += 1
LOOP
CASE ASC("'") ' single line comments
DO
idx += 1
SELECT CASE AS CONST s[idx]
CASE 0, ASC(!"\n") : EXIT DO
END SELECT
LOOP
CASE ASC("""") ' string literals
VAR esc = IIF(s[idx - 1] = ASC("!"), 1, 0)
DO
idx += 1
SELECT CASE AS CONST s[idx]
CASE 0, ASC(!"\n") : idx -= 1 : EXIT DO
CASE ASC("\") : IF esc THEN idx += 1
CASE ASC("""") : IF s[idx + 1] = ASC("""") THEN idx += 1 ELSE EXIT DO
END SELECT
LOOP
CASE ASC("0") TO ASC("9")
IF size THEN size += 1
CASE ASC("#"), ASC("_"), ASC("A") TO ASC("Z"), ASC("a") TO ASC("z")
size += 1 : IF size = 1 THEN start = idx
CASE ELSE
IF 0 = size THEN CONTINUE FOR
?MID(s, start + 1, size)
size = 0
END SELECT
NEXT
IF size THEN ?MID(s, start + 1)
Re: A lexical analyzer generator for FreeBASIC
Cool. But it gets complicated as you scale up! For the sake of demonstration, let's do the same task with Poodle-Lex. First, we define our rules file, "Test.rules":
Code: Select all
# Comments
MultilineComment: "/'.*'/"
SingleLineComment: "'[^\n]*"
# Keywords
KwAndAlso: i"AndAlso"
KwIf: i"If"
KwThen: i"Then"
# Characters
LeftParenthesis: "\("
RightParenthesis: "\)"
NotEquals: "\<\>"
GreaterThan: "\>"
Asterisk: "\*"
Newline: "\n|\r\n"
# General tokens
StringLiteral: '"(\\.|[^"])*"'
PreProcessorCommand: "#[A-Za-z_]+"
Number: "[:digit:]+"
Identifier: "[a-zA-Z_][a-zA-Z0-9_#]*"
Whitespace: "[ ]+"
Then, we build our lexical analyzer:
Code: Select all
Poodle-Lex Test.rules OutputDirectory
Then, we write a program to print out the tokens:
Code: Select all
#include "UTF8Stream.bi"
#include "LexicalAnalyzer.bi"
Var Stream = Poodle.UTF8Stream("Test.txt")
Var LexicalAnalyzer = Poodle.LexicalAnalyzer(@Stream)
Do While LexicalAnalyzer.IsEndOfStream() = 0
    Var Token = LexicalAnalyzer.GetToken()
    If Token.Id = Poodle.Token.Newline Then
        Print Token.Id, "[Newline]"
    ElseIf Token.Id <> Poodle.Token.Whitespace Then
        Print Token.Id, Token.Text.ToUtf8String()
    End If
Loop
Then, we test it with a sample, "Test.txt":
Code: Select all
#DEFINE test(_T_) if idx > A*9ANDALSO typ <> U#_T_ THEN done
"\"Strings, man"
Build, and run. Output:
Code: Select all
14 #DEFINE
16 test
7 (
16 _T_
8 )
5 if
16 idx
10 >
16 A
11 *
15 9
4 ANDALSO
16 typ
9 <>
16 U#_T_
6 THEN
16 done
12 [Newline]
13 "\"Strings, man"
If it needs to parse more kinds of tokens, you just add more rules.
Re: A lexical analyzer generator for FreeBASIC
jofers wrote: But it gets complicated as you scale up!
Not really.
Use an outer SELECT block to separate the words and an inner one in CASE ELSE to handle the special characters. Then just add character definitions and a keyword function and you're ready to go.
This code also does some pretty printing by formatting the FB keywords in camel case. (Still just 100 LOC, using your token numbers.)
Code: Select all
VAR s = "", fnam = "lexeme.txt", fnr = FREEFILE
IF OPEN(fnam FOR INPUT AS fnr) THEN
?"Cannot open " & fnam
ELSE
s = STRING(LOF(fnr), 0)
GET #fnr, ,s
CLOSE #fnr
END IF
'' The characters
#DEFINE TOK_NL 12,"[Newline]"
#DEFINE TOK_BROP 7, "("
#DEFINE TOK_BRCL 8, ")"
#DEFINE TOK_UE 9, "<>"
#DEFINE TOK_GT 10, ">"
#DEFINE TOK_ASTX 11, "*"
FUNCTION Keywords(BYREF W AS STRING) AS ZSTRING PTR
SELECT CASE W
CASE "#DEFINE" : RETURN @!"\014#Define"
CASE "ANDALSO" : RETURN @!"\004AndAlso"
CASE "IF" : RETURN @!"\005If"
CASE "THEN" : RETURN @!"\006Then"
END SELECT
RETURN @!"\016"
END FUNCTION
VAR size = 0, start = 1
FOR idx AS INTEGER = 0 TO LEN(s) - 1
SELECT CASE AS CONST s[idx]
CASE ASC("0") TO ASC("9")
IF size THEN size += 1
CASE ASC("#"), ASC("_"), ASC("A") TO ASC("Z"), ASC("a") TO ASC("z")
size += 1 : IF size = 1 THEN start = idx
CASE ELSE
IF size THEN
VAR x = Keywords(UCASE(MID(s, start + 1, size)))
IF x[0] = 16 THEN ?16, MID(s, start + 1, size) _
ELSE ?ASC(*x), x[1]
size = 0
END IF
SELECT CASE AS CONST s[idx]
CASE ASC(!"\r") : IF s[idx + 1] = ASC(!"\n") THEN idx += 1 : ?TOK_NL
CASE ASC(!"\n") : ?TOK_NL
CASE ASC("(") : ?TOK_BROP
CASE ASC(")") : ?TOK_BRCL
CASE ASC("*")
SELECT CASE AS CONST s[idx + 1]
CASE ASC(!"=") : idx += 1 : ?999,"Not yet defined"
CASE ELSE : ?TOK_ASTX
END SELECT
CASE ASC(">") : ?TOK_GT
CASE ASC("<")
SELECT CASE AS CONST s[idx + 1]
CASE ASC(!">") : idx += 1 : ?TOK_UE
CASE ASC(!"=") : idx += 1 : ?999,"Not yet defined"
CASE ELSE : ?999,"Not yet defined"
END SELECT
CASE ASC("""") ' string literals
VAR esc = IIF(ASC(s, idx) = ASC("!"), 1, 0)
start = idx
DO
idx += 1
SELECT CASE AS CONST s[idx]
CASE 0, ASC(!"\n") : idx -= 1 : EXIT DO
CASE ASC("\") : IF esc THEN idx += 1
CASE ASC("""") : IF s[idx + 1] = ASC("""") THEN idx += 1 ELSE EXIT DO
END SELECT
LOOP
?13,MID(s, start + 1, idx - start + 1)
start = 0
CASE ASC("/") ' multi line comments
IF s[idx + 1] <> ASC("'") THEN EXIT SELECT
idx += 2
DO
SELECT CASE AS CONST s[idx]
CASE 0 : EXIT DO
CASE ASC("'")
SELECT CASE AS CONST s[idx + 1]
CASE 0 : EXIT DO
CASE ASC("/") : idx += 1 : EXIT DO
END SELECT
END SELECT : idx += 1
LOOP
CASE ASC("'") ' single line comments
DO
idx += 1
SELECT CASE AS CONST s[idx]
CASE 0, ASC(!"\n") : EXIT DO
END SELECT
LOOP
END SELECT
END SELECT
NEXT
IF size THEN
VAR x = Keywords(UCASE(MID(s, start + 1)))
IF x[0] = 16 THEN ?16, MID(s, start + 1) _
ELSE ?ASC(*x), x[1]
END IF
Strings only get escaped in FB source when they are preceded by an ! character.
Output (keywords without leading white space). lexeme.txt wrote:
#DEFINE test(_T_) if idx > A*9ANDALSO typ <> U#_T_ THEN done
!"\"Strings, man" ' escape
"\"Strings, man" ' no escape
Code: Select all
14 #Define
16 test
7 (
16 _T_
8 )
5 If
16 idx
10 >
16 A
11 *
4 AndAlso
16 typ
9 <>
16 U#_T_
6 Then
16 done
12 [Newline]
13 "\"Strings, man"
13 "\"
16 Strings
16 man
13 " ' no escape
12 [Newline]
Re: A lexical analyzer generator for FreeBASIC
What a wonderful answer, Jofers. I'm a bit 'flabbergasted' (really).
I am installing Python 2.7 (I had 2.6 installed) as I write this (done :)) and I will test your tool.
I have written a very bad C scanner myself in Python (I used Python because the editor I use comes with Python scripting). I used re.Scanner to tokenize a C function declaration
(http://code.activestate.com/recipes/457 ... re-module/)
It does a reasonable job of translating single C function declarations to single FB declarations (with lots of room for improvement, as it fails on numerous syntactically correct C declarations).
I don't use Python that often so the code looks a bit 'funny'.
Code
(import pn, import scintilla, and from pypn.decorators import script are editor-specific)
Code: Select all
##scanit tokenizes input using re.scanner
##newlines are ignored. output is one C function declaration
##the words union, struct and enum are dropped from the parameter list
##whitespace and newline are dropped as well.
##The declaration created either
##-- fits on one line if there is one parameter
##-- takes as many lines as there are parameters (one line per parameter)
## parameters all look like byval declarator as type,_newline
import pn
import scintilla
import string
import re
from pypn.decorators import script
table = {}
table["int"] = "integer"
table["float"] = "single"
table["double"] = "double"
table["char"] = "zstring"
table["*"] = "ptr"
table["long"] = "long"
table["struct"] = ""
table["union"] = ""
table["enum"] = ""
table["void"] = "any"
table["const"] = "const"
table["unsigned"] = "unsigned"
table["signed"] = "signed"
@script("Select Declaration","Cparse")
def select_declaration():
editor = scintilla.Scintilla(pn.CurrentDoc())
editor.BeginUndoAction()
startpos = editor.CurrentPos
endpos = editor.TextLength
editor.TargetStart = startpos
editor.TargetEnd = editor.Length
editor.SearchFlags = 0
findpos = editor.SearchInTarget(1,";")
if findpos != -1:
editor.SelectionStart = startpos
#cut of the ; (not needed)
editor.SelectionEnd = findpos+1
lsSelection = editor.SelText#GetTextRange(editor.SelectionStart,editor.SelectionEnd)
#lsSelection = "int **(*b)(void);"
lsSelection = editor.SelText
declaration = convert_declaration(lsSelection)#editor.SelText)
editor.ReplaceSel(declaration)
#pn.AddOutput(lsSelection)
#editor.ReplaceSel(lsSelection)
editor.EndUndoAction()
def convert_declaration(lsSelection):
myscanner = re.Scanner(
[(r"[ \t]+",s_space),
(r"[a-zA-Z_]\w*", s_ident),
(r"\*",s_pointer),
(r",", s_comma),
(r"\r\n",s_crlf),
(r"\r", s_cr),
(r"\n",s_lf),
(r"/\*.*?\*/",s_comment),
(r"[(]",s_lparen),
(r"[)]",s_rparen),
(r"\[[^\]]*?\]",s_brackets),
(r"\.\.\.",s_dot),
(r";",s_semi)])
tokens = myscanner.scan(lsSelection)
ct = 0
tfunc = "declare function {name}("
tsub = "declare sub {name}("
tfuncptr = "as function cdecl {name}("
tsubptr = "as sub cdecl {name}("
tparameter = "byval {name} as {type}"
tanonparameter = "byval as {type}"
tfuncsuf = ") as {type}"
tsubsuf = ")"
return_list = []
return_value = []
#get the type of newline by searching for first newline
#FIXME this is not very efficient!
newline = ""
for j in range(len(tokens[0])):
token = tokens[0][j]
if token[0] == 'lf':
newline = '\x0a'
break
elif token[0] == 'cr':
newline = '\x0d'
elif token[0] == 'crlf':
newline = '\x0a\x0d'
elif token[0] == 'pb':
# replace a[] ==> ptr a ([] <==> () does not hold)
tmp = tokens[0][j-1]
tokens[0][j-1] = token
tokens[0][j] = tmp
newline = '\x0a'
while tokens[0][ct][0] != "(":
if tokens[0][ct][0] == "spc":
pass
elif tokens[0][ct][0] == "std":
return_list.append(tokens[0][ct])
return_value.append(tokens[0][ct])
elif tokens[0][ct][0] == "p":
return_list.append(tokens[0][ct])
return_value.append(tokens[0][ct])
elif tokens[0][ct][0] == "id":
return_list.append(tokens[0][ct])
return_value.append(tokens[0][ct])
index = len(return_list) - 1
ct = ct + 1
parameters = []
parameter = []
#skip the (
ct = ct + 1
mark = ct
sname = ""
first = 0
fp = 0
#check for function/sub ptr (looks like (id|[*]|spc)+([*][spc]*id)(parameter_list) but ([*] id) can also be [*]!)
if tokens[0][ct][0] == 'spc':
ct = ct + 1
if tokens[0][ct][0] == 'p':
ct = ct + 1
if tokens[0][ct][0] == 'spc':
ct = ct + 1
if tokens[0][ct][0] == 'id':
sname = tokens[0][ct][1]
fp = 1
ct = ct + 1
while tokens[0][ct][0] != "(":
ct = ct + 1
#skip the "("
ct = ct + 1
#check for function pointer failed: backtrack
else:
ct = mark
#remove last item (it should be the declarator)
#don't remove when it's a pointer to a proc as
#the declarator will not be part of the proc then
if fp != 1:
return_value = return_value[0:len(return_value)-1]
#fp = 1 ==> function pointer (simple declaration otherwise)
if fp != 1:
#proctype = 1 ==> function
#proctype = 0 ==> sub
proctype = 1
if len(return_value) == 1 and return_value[0][1] == "any" :
proc = tsub.format(name=return_list[index][1])
proctype = 0
else:
proc = tfunc.format(name=return_list[index][1])
#function/sub pointer
elif (fp == 1):
proctype = 1
if len(return_value) == 1 and return_value[0][1] == "any":
proc=tsubptr.format(name=sname)
proctype = 0
else:
proc = tfuncptr.format(name=sname)
proctype = 1
indent = ""
indent = indent.rjust(len(proc),' ')
#mark position for backtracking
mark = ct
#check for parameterlist without content first
empty = 0
if tokens[0][ct][0] == 'spc':
ct = ct + 1
if tokens[0][ct][0] == ')':
empty = 1
elif tokens[0][ct][1] == 'any':
ct = ct + 1
if tokens[0][ct][0] == 'spc':
ct = ct + 1
if tokens[0][ct][0] == ')':
empty = 1
#empty parameterlist (early exit)
if empty == 1:
tdeclaration = "{prefix}{parameters}{suffix}"
retval = ""
if proctype == 0:
declaration = tdeclaration.format(prefix=proc,parameters="",suffix=tsubsuf)
return declaration
else:
for declspec in return_value:
retval = retval + " " + declspec[1]
retval = retval.strip()
suf = tfuncsuf.format(type = retval)
declaration = tdeclaration.format(prefix=proc,parameters="",suffix=suf)
return declaration
ct = mark
while 1:
if tokens[0][ct][0] == ')':
break
while 1:
if tokens[0][ct][0] == ',':
ct = ct + 1
break
elif tokens[0][ct][0] == ')':
break
elif tokens[0][ct][0] == 'spc':
pass
elif tokens[0][ct][0] == 'std':
parameter.append(tokens[0][ct])
elif tokens[0][ct][0] == 'dots':
parameter.append(tokens[0][ct])
elif tokens[0][ct][0] == 'p' or tokens[0][ct][0] == 'pb':
#pb is result of conversion [] -> ptr
#you cannot use this in struct declarations because of [e]
#(the e needs to be translated/copied)
parameter.append(tokens[0][ct])
elif tokens[0][ct][0] == 'id':
parameter.append(tokens[0][ct])
index = len(parameter) - 1
ct = ct + 1
parameter.append(('idx',index))
parameters.append(parameter)
parameter = []
typ = ""
#sparams contains string representation of individual parameters
sparams = []
for j in range(len(parameters)):
param = parameters[j]
idx = param[len(param) - 1][1]
for i in range(len(param)-1):
#idx should denote the position of the declarator
if i == idx:
name = param[idx][1]
else:
typ = typ + param[i][1]
typ = typ + " "
typ = typ.rstrip()
#byval as ...? No, use ...
if typ == "...":
sparam = "..."
else:
sparam = tparameter.format(name=name,type=typ)
sparams.append(sparam)
param = []
sparam = ""
typ = ""
string_params = ""
if len(sparams) == 1:
#only 1 parameter --> strip leading spaces
string_params = sparams[0].strip()
else:
for j in range(len(sparams)):
s = sparams[j]
#first parameter
if j == 0:
string_params = string_params + "_" + newline + indent + s + ",_" + newline
#parameter [1..number of parameters - 1]
elif j != len(sparams) - 1:
string_params = string_params + indent + s + ",_" + newline
else:
#final parameter
string_params = string_params + indent + s
s = ""
tdeclaration = "{prefix}{parameters}{suffix}"
retval = ""
#proctype = 0 ==> void return type
if proctype == 0:
declaration = tdeclaration.format(prefix=proc,parameters=string_params,suffix=tsubsuf)
else:
for declspec in return_value:
retval = retval + " " + declspec[1]
retval = retval.strip()
suf = tfuncsuf.format(type = retval)
declaration = tdeclaration.format(prefix=proc,parameters=string_params,suffix=suf)
return declaration
def s_ident(scanner, token):
if token in table:
if table[token] != "":
return ("std",table[token])
else:
return ("id",token)
def s_comma(scanner, token):
return (",",token)
def s_cr(scanner,token):
return ("cr",token)
def s_crlf(scanner,token):
return ("crlf",token)
def s_lf(scanner,token):
return ("lf",token)
def s_space(scanner,token):
return ("spc",token)
def s_pointer(scanner,token):
return ("p",table["*"])
def s_lparen(scanner,token):
return ("(",token)
def s_rparen(scanner,token):
return (")",token)
def s_semi(scanner,token):
return (";",token)
def s_comment(scanner,token):
token = token.replace("/*","/'")
token = token.replace("*/","'/")
return ("mc",token)
def s_brackets(scanner,token):
# suffix handling (int a(int b[]))
return ("pb","ptr")
def s_dot(scanner,token):
#handling of ...
return("dots","...")
Back to poodle. And I found some typos (in Python comments). They're in Regex.py, lines 26 and 30 (sting should be string?).
and there are some occurrences of epsilong (epsilon?) in
the following files
DeterministicFiniteAutomataBuilder.py line 27, 28 and 47
_NonDeterministicFiniteAutomata.py line 146
Re: A lexical analyzer generator for FreeBASIC
Thanks for finding the typos, should be updated now.
Converting declarations without having to use Swig sounds handy. I'll have to check Scintilla out. I'm using Notepad++ on Windows and Geany on Linux at the moment.
Re: A lexical analyzer generator for FreeBASIC
jofers wrote: I'll have to check Scintilla out. I'm using Notepad++ on Windows and Geany on Linux at the moment.
I don't know about Notepad++, but Geany uses Scintilla lexers (and will come with an improved FB lexer soon).
jofers wrote: Converting declarations without having to use Swig sounds handy.
Both can act as custom commands in Geany: select the relevant code and translate with a single keystroke (<control>-[1 | 2 | 3] -- declarations only).
Re: A lexical analyzer generator for FreeBASIC
I have tried your package with a set of regular expressions that represent a large subset of the C preprocessing tokens. String/char literals are kinda hard to express as regular expressions, as the content of a string literal can be any of ASCII, UTF-8, UTF-16, UTF-32 or wchar_t. The prefix of a string literal (either none, L, u8, u or U) determines the type of its content.
What I have thus far looks like this:
Code: Select all
#A scanner for the C programming language.
#It needs more elaborate escape sequences
#\u[:xdigit+:][:xdigit:][:xdigit+:][:xdigit:]
#\U[:xdigit+:][:xdigit:][:xdigit+:][:xdigit:][:xdigit+:][:xdigit:][:xdigit+:][:xdigit:]
#\x[:xdigit:]+
#\[0-7]
#\[0-7][0-7]
#\[0-7][0-7][0-7]
#should also support u8, u, U and L prefixes
string: '""((\\([\\''""abfnrtv]))|([^""]))*""'
character: '''((\\([\\''""abfnrtv]))|([^'']))*""'
pp_number: '((\.[0-9])|([0-9]))(([0-9])|([a-df-zA-DF-Z])|(E[\+\-])|(e[\+\-])|(P[\+\-])|(p[\+\-])|([Pp])|([Ee])|(\.))*'
line_continuation: '\\\n'
multi_comment: '/\*([^\*]|[\r\n]|(\*+([^\*]|[\r\n])))*\*+/'
single_comment: '//([^\r\n]*)([\r\n]+)'
identifier: '[a-zA-Z_][a-zA-Z0-9_]*'
whitespace: '[ \t]+'
#in case of punctuators that share a prefix put longest punctuator first
shiftleft_eq:'<<='
shiftright_eq: '>>='
shift_left:'<<'
shift_right:'>>'
lt_eq:'<='
gt_eq:'>='
gt:'>'
lt:'<'
plusplus:'\+\+'
minmin:'--'
assign_plus:'\+='
assign_min:'-='
assign_mul:'\*='
assign_xor:'\^='
assign_and:'&='
assign_or:'\|='
assign_not:'~='
not_eq:'!='
eq: '=='
log_and:'&&'
log_or:'\|\|'
mul:'\*'
not_:'~'
lbrack:'\['
rbrack:'\]'
lparen:'\('
rparen: '\)'
lcurly: '\{'
rcurly: '\}'
dots: '\.\.\.'
dot:'\.'
deref: '->'
min:'-'
assign:'='
plus: '\+'
bang: '!'
question: '\?'
xor_: '\^'
colon: ':'
semi: ';'
comma: ','
concat: '##'
sharp_include_lt_gt: '#[ \t]*include[ \t]*<[^>]+>'
sharp_include_string: '#[ \t]*include[ \t]*"[^"]+"'
sharp_define: '#[ \t]*define[ \t]*[a-zA-Z_][a-zA-Z0-9_]*'
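Poodle-Lex has its own regex dialect, but as a rough sanity check some of the simpler rules above can be tried against Python's re module. A sketch, assuming these particular patterns carry over unchanged:

```python
import re

# Rough sanity checks for a few of the rules above, using Python's re
# module. Poodle-Lex's dialect differs, so this is only an approximation.
identifier = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')
pp_number = re.compile(
    r'((\.[0-9])|([0-9]))'
    r'(([0-9])|([a-df-zA-DF-Z])|(E[\+\-])|(e[\+\-])'
    r'|(P[\+\-])|(p[\+\-])|([Pp])|([Ee])|(\.))*\Z')
multi_comment = re.compile(r'/\*([^\*]|[\r\n]|(\*+([^\*]|[\r\n])))*\*+/\Z')

print(bool(identifier.match('my_var42')))          # an identifier
print(bool(pp_number.match('1.5e+10')))            # a pp-number
print(bool(multi_comment.match('/* a\n * b */')))  # a multi-line comment
```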
Whether the above will work (eg whether it will tokenize C source) I don't know yet, as I ran into a serious problem: Poodle did not generate code for the above file. I get an NFA and a (minimized) DFA, but no code.
There is something wrong with the regex on line 37 of FileTemplate.py
Code: Select all
match = re.search(r"(^([ \t]*))?\$([a-zA-Z0-9_]+)", line)
The exact message I get reads:
Code: Select all
raise error, v # invalid expression
sre_constants.error: nothing to repeat

The origin of the error is _simple, which can be found in file sre_compile.py, line 354 (it's part of Python 2.7):
Code: Select all
def _simple(av):
    # check if av is a "simple" operator
    lo, hi = av[2].getwidth()
    if lo == 0 and hi == MAXREPEAT:
        raise error, "nothing to repeat"
    return lo == hi == 1 and av[2][0][0] != SUBPATTERN
I'm sure it's an easy to fix problem.
I also found that identifier (the name of one of the rules) gets changed into dentifier (in the .dot output). Why is that?
And I found a bug: when a pattern ends with a single \ followed by ' I get a string index error. At first I thought it was just an exception thrown by your program, but that was not the case. An example:
Code: Select all
pattern: 'a\

The above yields an index error. I don't think an error in the input should kill the parser.
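A sketch of how a pattern reader could diagnose that case cleanly instead of dying with an index error. The names here are hypothetical, not Poodle-Lex's actual internals:

```python
# Hypothetical sketch: reading escape sequences from a pattern without
# letting a trailing backslash cause an IndexError.
class RuleSyntaxError(Exception):
    pass

def read_pattern(pattern):
    """Return the list of (possibly escaped) characters in a pattern."""
    chars = []
    i = 0
    while i < len(pattern):
        if pattern[i] == '\\':
            if i + 1 >= len(pattern):
                # report a clean parse error instead of indexing past the end
                raise RuleSyntaxError("pattern ends with a lone backslash")
            chars.append(pattern[i:i + 2])
            i += 2
        else:
            chars.append(pattern[i])
            i += 1
    return chars

print(read_pattern(r'a\n b'))  # ['a', '\\n', ' ', 'b']
```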
I also found some typos and I found those by accident (I was not looking for them).
I'll go through all the code and pick out all the typos I can find and then post a
bug report on typos.
Now for some ideas or, if you will, observations.
First up some small things.
Currently there can be no space between the identifier at the lhs of a rule and the colon. It would be nice if that could be loosened up a bit: allow space between the rule name and the colon (just like space is allowed between the colon and the rhs).
And what about line continuations? A fairly simple scheme would utilize the same line
continuation syntax as FreeBASIC.
Code: Select all
rule: "first part of rule and"_
"second part of rule"
Ultimately the rule will end with a newline, but allowing _ at the end of the line and some space at the start of the next line would be nice. Something like
Code: Select all
rule: "first|second|third|fifth" _
"sixth|seventh|eight"
should be legal. A line ending with a line continuation can end in error (for example when one line ends with [^' and the next starts with 'a-z]).
A first pass to expand physical lines into logical lines could be a way to
deal with line continuations.
It's a bit time consuming to first expand lines but looking at the rate at which poodle
spits out code I don't think time is an issue.
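The expansion pass could be sketched as follows, assuming FreeBASIC's rule that _ must be the last non-blank character on the line:

```python
# A minimal sketch of expanding physical lines into logical lines, using
# FreeBASIC-style "_" line continuations.
def expand_lines(physical_lines):
    logical = []
    buffer = ''
    for line in physical_lines:
        stripped = line.rstrip('\r\n').rstrip()
        if stripped.endswith('_'):
            # drop the continuation marker and keep accumulating
            buffer += stripped[:-1].rstrip() + ' '
        else:
            logical.append(buffer + stripped.lstrip() if buffer else stripped)
            buffer = ''
    if buffer:  # file ended on a continuation line
        logical.append(buffer.rstrip())
    return logical

lines = ['rule: "first|second" _', '      "third|fourth"']
print(expand_lines(lines))  # ['rule: "first|second" "third|fourth"']
```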
Over to the harder-to-incorporate changes.
It gets kinda hard to define patterns like a C comment. What would be nice to have are explicit states. That way it becomes somewhat easier to implement non-greedy matching, which is what you want in the case of matching multiline C comments.
Defined using non-greedy matching, a comment in C looks like
Code: Select all
comment: "/\* .*? \*/"
I tried the above regex using Kodos and it seems to work (you have to remove the whitespace to make it work in Kodos, but it's easier to read the regex with the whitespace).
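The same non-greedy pattern can be tried with Python's re module; DOTALL lets . span newlines, so a comment may run over multiple lines while still stopping at the first */:

```python
import re

# Non-greedy comment matching: .*? stops at the FIRST */ rather than the
# last one, which is exactly what you want for /* ... */ comments.
comment = re.compile(r'/\*.*?\*/', re.DOTALL)

source = 'a = 1; /* first\ncomment */ b = 2; /* second */'
print(comment.findall(source))  # two separate comments, not one huge match
```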
What would help as well is the ability to use previously defined rules on the rhs of rules.
It can get tedious to copy-paste patterns (and allowing rules on the rhs can help to keep the
regular expressions nice and readable).
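One possible scheme, sketched here with a hypothetical {name} reference syntax (an assumption, not Poodle-Lex's actual notation), is plain textual substitution in definition order:

```python
import re

# Hypothetical sketch: letting a rule's right-hand side reference earlier
# rules by name, e.g. {digit}, and expanding them by textual substitution.
def expand_rules(rules):
    expanded = {}
    for name, pattern in rules:  # rules must be in definition order
        def substitute(match):
            # wrap in a group so the reference composes safely
            return '(' + expanded[match.group(1)] + ')'
        expanded[name] = re.sub(r'\{([a-zA-Z_][a-zA-Z0-9_]*)\}',
                                substitute, pattern)
    return expanded

rules = [
    ('digit', '[0-9]'),
    ('letter', '[a-zA-Z_]'),
    ('identifier', '{letter}({letter}|{digit})*'),
]
print(expand_rules(rules)['identifier'])
```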
That's it for me (for now). A few last words on poodle related things.
About the DFA minimizer: I think that despite it being called 'naive' the DFA minimizer does a nice job of minimizing the DFA. It tends to shave off quite a number of states. And generation of the DFA is fast as well, so that's good.
Great idea to have the scanner generate graphviz files (very informative).
Re: A lexical analyzer generator for FreeBASIC
Just to clarify the 'import scintilla' in my Python code: I am using Programmer's Notepad, and Programmer's Notepad uses Scintilla.
About Geany: you can write a Lua script (using the geanylua plugin) that accesses Scintilla. GeanyLua also comes with some GUI support (you can create a dialog using Lua scripting).
Geanylua is absolutely brilliant. If only it were called geanyBASIC.
Then it would have been almost perfect.
Geany also comes with a GDB plugin nowadays, which almost works on Windows (almost but not quite).
Re: A lexical analyzer generator for FreeBASIC
Cool, thanks for stress-testing it. I really need to set up some unit tests for this thing and target something like your example for the next release.
Re: A lexical analyzer generator for FreeBASIC
I have fixed the bug that prevented Poodle from generating code. Removing the ? from the regular expression was enough to get the code running.
The regex on line 38 now reads:
Code: Select all
match = re.match(r"(^([ \t]*))\$([a-zA-Z0-9_]+)", line)
I just dropped the ? from the regular expression. I figured it's the whitespace the program is after, and using the above regex the whitespace gets captured.
Not all lines containing a $ start with nothing but whitespace.
There is one line that starts with lots of code before the $ occurs:
Code: Select all
Dim State As Poodle.LexicalAnalyzerState = Poodle.$INITIAL_STATE
That line does not get emitted correctly, so I replaced $INITIAL_STATE by hand, guessing that it should be Poodle.InitialState (InitialState is a member of enum LexicalAnalyzerState).
The generated code is quite clear. And I feel like something more could be done.
I'll give an example of a piece of code generated by poodle.
Code: Select all
Select Case This.Character
Case 65 To 90, 95, 97 To 122
What are 65 to 90, 95, 97 and 122? Why not write
Code: Select all
Case LATIN_CAPITAL_LETTER_A To LATIN_CAPITAL_LETTER_Z, _
LOW_LINE, _
LATIN_SMALL_LETTER_A To LATIN_SMALL_LETTER_Z
It's much more readable, albeit a little verbose. Poodle can use the names from the Unicode standard.
You can use two functions to find out the name of a codepoint
(a program needs to import unicodedata to get access to functions category and
name (unichr is built-in)).
Code: Select all
name(unichr(codepoint))
category(unichr(codepoint))
where codepoint is a number.
Category should be used first to check whether Python can return a name for the given codepoint. For codepoints in the category Cc (control characters) you cannot get Python to give a name, as name(unichr(codepoint)) fails on codepoints from category Cc.
The Unicode standard does define names for codepoints from category Cc; you could create a small dictionary to look up the names for codes 0 to 1F and 80 to 9F.
Unicode names on the web:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
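In Python 3 the same lookup works with chr instead of unichr. A sketch with a partial, purely illustrative alias table for the Cc range (the real table would cover codes 0 to 1F and 80 to 9F):

```python
import unicodedata

# Looking up Unicode names and categories (Python 3: chr, not unichr).
# Control characters (category Cc) have no formal name, so fall back to
# a small alias table for those. This table is deliberately incomplete.
CONTROL_ALIASES = {0x00: 'NULL', 0x01: 'START OF HEADING', 0x0A: 'LINE FEED'}

def codepoint_name(codepoint):
    ch = chr(codepoint)
    if unicodedata.category(ch) == 'Cc':
        return CONTROL_ALIASES.get(codepoint, 'CONTROL-%04X' % codepoint)
    # name() takes a default, so unnamed codepoints don't raise
    return unicodedata.name(ch, 'UNNAMED-%04X' % codepoint)

print(codepoint_name(0x41))  # LATIN CAPITAL LETTER A
print(codepoint_name(0x01))  # START OF HEADING
```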
As an example of what is possible I turned the entire Unicode file into a mass of #define lines (codepoints that fall within category Cc do have a name, but that name is not in the same place as the name of codepoints that are not control characters):
Code: Select all
#define DIGIT_ZERO &h0030
#define DIGIT_ONE &h0031
#define DIGIT_TWO &h0032
#define DIGIT_THREE &h0033
#define DIGIT_FOUR &h0034
#define DIGIT_FIVE &h0035
#define DIGIT_SIX &h0036
#define DIGIT_SEVEN &h0037
#define DIGIT_EIGHT &h0038
#define DIGIT_NINE &h0039
#define LATIN_CAPITAL_LETTER_A &h0041
#define LATIN_CAPITAL_LETTER_B &h0042
#define LATIN_CAPITAL_LETTER_C &h0043
#define LATIN_CAPITAL_LETTER_D &h0044
#define LATIN_CAPITAL_LETTER_E &h0045
#define LATIN_CAPITAL_LETTER_F &h0046
#define LATIN_CAPITAL_LETTER_G &h0047
#define LATIN_CAPITAL_LETTER_H &h0048
#define LATIN_CAPITAL_LETTER_I &h0049
#define LATIN_CAPITAL_LETTER_J &h004A
#define LATIN_CAPITAL_LETTER_K &h004B
#define LATIN_CAPITAL_LETTER_L &h004C
#define LATIN_CAPITAL_LETTER_M &h004D
#define LATIN_CAPITAL_LETTER_N &h004E
#define LATIN_CAPITAL_LETTER_O &h004F
#define LATIN_CAPITAL_LETTER_P &h0050
#define LATIN_CAPITAL_LETTER_Q &h0051
#define LATIN_CAPITAL_LETTER_R &h0052
#define LATIN_CAPITAL_LETTER_S &h0053
#define LATIN_CAPITAL_LETTER_T &h0054
#define LATIN_CAPITAL_LETTER_U &h0055
#define LATIN_CAPITAL_LETTER_V &h0056
#define LATIN_CAPITAL_LETTER_W &h0057
#define LATIN_CAPITAL_LETTER_X &h0058
#define LATIN_CAPITAL_LETTER_Y &h0059
#define LATIN_CAPITAL_LETTER_Z &h005A
#define LEFT_SQUARE_BRACKET &h005B
#define REVERSE_SOLIDUS &h005C
#define RIGHT_SQUARE_BRACKET &h005D
#define CIRCUMFLEX_ACCENT &h005E
#define LOW_LINE &h005F
#define GRAVE_ACCENT &h0060
The input file at unicode.org is a line-structured file. A line starts with a codepoint. After that comes a semicolon, followed by the name of the codepoint, followed by another semicolon, followed by the rest of the line. For control characters a 'translated' line would look like this:
Code: Select all
#define <control> &h\d+
Looking at the original line from the Unicode standard, a line that contains a codepoint from the category Cc looks like so:
Code: Select all
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
As you can see, the codepoint at the start is there, but the description is not in the second field.
The name of a codepoint cannot start with a <, so if a name starts with a < then that codepoint is either a control character or 'something else'. It's a control character when the name says so (<control>). Otherwise (if the name starts with a < but does not continue with control) the name starts with <Plane, and a codepoint with a name that starts with <Plane does not have a name.
I think it's very doable to turn the file from unicode.org into #define lines
(but beware of lines that do not conform to the standard format of codepoint;name;rest_of_line).
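A sketch of the conversion for a single line. The field layout follows UnicodeData.txt: the name is in field 1, and for control characters the old Unicode 1.0 name sits in field 10:

```python
# Sketch: turning UnicodeData.txt lines into #define lines. For category
# Cc the name field holds "<control>" and the old-style name, when the
# file provides one, sits in field 10 (the Unicode_1_Name field).
def to_define(line):
    fields = line.rstrip('\n').split(';')
    code, name = fields[0], fields[1]
    if name == '<control>' and len(fields) > 10 and fields[10]:
        name = fields[10]              # e.g. START OF HEADING
    elif name.startswith('<'):
        return None                    # plane/range markers have no name
    identifier = name.replace(' ', '_').replace('-', '_')
    return '#define %s &h%s' % (identifier, code)

print(to_define('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'))
print(to_define('0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;'))
```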
Adding symbol names for characters to Poodle should be doable, but only if the scanner is to be used on either an ASCII file or a Unicode file (for a binary file it would make no sense to emit Unicode names).
As I said earlier: the output of Poodle is readable. I can see the state machine doing its work. Nothing gets obscured; it's all perfectly understandable. That's good, I like code generators that generate readable code.
Re: A lexical analyzer generator for FreeBASIC
I've taken a closer look at the generated code and found something interesting.
Whenever a transition can only lead to one target state (which is a final state without outgoing edges), four statements get executed:
--> A new state number is assigned to the current state.
--> The current character is appended to the text.
--> A new character is read (update of This.Character).
--> On the next iteration the transition is completed and a token is returned using a block of code that looks like
Code: Select all
Case Poodle.TokenName
Select Case This.Character
Case Else
Return Poodle.Token(Poodle.Token.TokenName, Text)
End Select
For single-character tokens (and some other tokens) no further iteration is needed: as the character is a single-character token, the token can immediately be returned. With a second loop around the first one it would be possible to set the token using a variable (token_type).
As an example of what I think could work, let's look at the code for recognizing a single colon (leaving out all other code):
Code: Select all
Do
    Do
        Select Case State
            Case Poodle.InitialState
                Select Case This.Character
                    Case Asc(":")
                        token_type = Poodle.Token.Colon
                        Exit Do, Do
                    ''lots of other case statements....
                End Select
        End Select
    Loop
Loop
Text.Append(This.Character)
This.Character = This.Stream->GetCharacter()
Return Poodle.Token(token_type, Text)
The above code can be used instead of blocks that look like
Code: Select all
Case Poodle.StateName
Select Case This.Character
Case Else
Return Poodle.Token(Poodle.Token.TokenName, Text)
End Select
The state assignment becomes a token type assignment, and an exit do, do is added to jump to code that appends the token text to the output and updates the lookahead character.
This cuts down the size of the code a bit (it removes all blocks that deal with a final state that has no outgoing edges).
For errors in the input, a pattern similar to the one used for final states without outgoing edges can be used. Right now Poodle deals with errors in the following manner:
Code: Select all
Case Else
Text.Append(This.Character)
This.Character = This.Stream->GetCharacter()
Return Poodle.Token(Poodle.Token.InvalidCharacter, Text)
end select
The code after the else can be replaced with
Code: Select all
Case Else
token_type = Poodle.Token.InvalidCharacter
exit do,do
end select
The exit do, do makes the code jump to the same block as used before:
Code: Select all
Text.Append(This.Character)
This.Character = This.Stream->GetCharacter()
Return Poodle.Token(token_type,Text)
Put in the context of LexicalAnalyzer.bas this becomes:
Code: Select all
dim token_type as integer
Do
Do
Select Case State
Case Poodle.InitialState
Select Case This.Character
''lots of code
End Select
''lots of code
End Select
Text.Append(This.Character)
This.Character = This.Stream->GetCharacter()
Loop
Loop
Text.Append(This.Character)
This.Character = This.Stream->GetCharacter()
Return Poodle.Token(token_type,Text)
When the above optimizations are used on 'my' C scanner example, they shave off quite a few lines of code.
I don't think the code would run faster using the optimization I am proposing, but cutting down the number of lines of code is always a good idea, right?
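The shared-epilogue idea translates to other languages too. Here is a rough Python analogue (illustrative only; the real generator emits FreeBASIC and uses exit do, do, and the state dispatch is collapsed to a dictionary for brevity):

```python
# A Python sketch of the proposed refactoring: instead of one accepting
# state per single-character token, the dispatch sets token_type and falls
# through to ONE shared epilogue that appends the lookahead character,
# advances the stream, and returns the token.
def make_scanner(source):
    chars = iter(source)
    lookahead = next(chars, '')

    def next_token():
        nonlocal lookahead
        text = ''
        single = {':': 'Colon', ';': 'Semi', ',': 'Comma'}
        if lookahead in single:
            token_type = single[lookahead]
        else:
            token_type = 'InvalidCharacter'
        # shared epilogue: one copy instead of one per accepting state
        text += lookahead
        lookahead = next(chars, '')
        return (token_type, text)

    return next_token

scan = make_scanner(':;')
print(scan())  # ('Colon', ':')
print(scan())  # ('Semi', ';')
```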