I found an interesting optimization that can be performed by a small program operating on lexicalanalyzer.bas.
The optimization replaces the code representing part of the state machine with a smaller piece of code.
It operates on states in the state machine that:
- are final;
- have exactly one incoming arc;
- have no outgoing arcs.
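Such states can be found mechanically. A minimal sketch in Python (the dict-based DFA encoding and the state names are my own assumptions for illustration, not poodle-lex internals):

```python
# Hypothetical sketch: given a DFA as a transition table, find the final
# states that are reached by exactly one arc and have no outgoing arcs.

def removable_states(transitions, final_states):
    """transitions: dict mapping (state, char) -> next_state."""
    incoming = {}
    outgoing = {}
    for (state, _char), target in transitions.items():
        incoming[target] = incoming.get(target, 0) + 1
        outgoing[state] = outgoing.get(state, 0) + 1
    return {s for s in final_states
            if incoming.get(s, 0) == 1 and outgoing.get(s, 0) == 0}

# A tiny machine: start --'('--> lparen (final), start --'a'--> ident (final),
# ident --'a'--> ident (ident has an outgoing arc, so it is kept).
t = {('start', '('): 'lparen',
     ('start', 'a'): 'ident',
     ('ident', 'a'): 'ident'}
print(removable_states(t, {'lparen', 'ident'}))  # {'lparen'}
```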
It would be possible to eliminate the transition to such a final state completely. The final state could be removed too, by moving
the action associated with it to the state that transitions into it. But doing that would take more than
the simple code-replacement optimization I came up with. My 'optimization' merely reduces the size of the source
code; it does not necessarily shrink the resulting binary, nor will it improve performance. It works like this.
Whenever the following pattern appears in lexicalanalyzer.bas (where {identifier} refers to the usual definition of an identifier)
Code: Select all
Case StateMachineState.{identifier}
    Select Case This.Character
        Case Else
            Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Poodle.Unicode.Text())
    End Select
replace it with this
Code: Select all
Case StateMachineState.{identifier}
    Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Poodle.Unicode.Text())
A small FreeBASIC program that performs the above optimization (it's a bit of a hack):
Code: Select all
''search for sequence that looks like
'' Case StateMachineState.{identifier}
'' Select Case This.Character
'' Case Else
'' Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Text)
'' End Select
''and replace that sequence with
'' Case StateMachineState.{identifier}
'' Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken.{identifier}, Text)
var fh = freefile()
open "lexicalanalyzer.bas" for input as #fh
if (err()) then
    close
    end
end if

var gh = freefile()
open "lexicalanalyzer.opt.bas" for output as #gh
if (err()) then
    close
    end
end if

dim s as string
dim v as string
dim select_ as string
dim else_ as string
dim return_ as string
dim end_ as string
dim out_ as string

while (not eof(fh))
    s = ""
    out_ = ""
    line input #fh,s
    v = trim(s)
    out_ = s & !"\r\n"
    ''always output the first line regardless of content
    print #gh,out_;
    if (left(v,len("Case StateMachineState.")) <> "Case StateMachineState.") then
        continue while
    end if
    line input #fh,s
    v = trim(s)
    if (v <> "Select Case This.Character") then
        out_ = s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    select_ = s
    line input #fh,s
    v = trim(s)
    if (v <> "Case Else") then
        out_ = select_ & !"\r\n" & s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    else_ = s
    line input #fh,s
    v = trim(s)
    if (left(v,len("Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken")) _
        <> _
        "Return Poodle.LexicalAnalyzerToken(Poodle.LexicalAnalyzerToken") then
        out_ = select_ & !"\r\n" & else_ & !"\r\n" & s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    return_ = s
    line input #fh,s
    v = trim(s)
    if (v <> "End Select") then
        out_ = select_ & !"\r\n" & else_ & !"\r\n" & return_ & !"\r\n" & s & !"\r\n"
        print #gh,out_;
        continue while
    end if
    ''pattern matched: emit only the Return line
    out_ = return_ & !"\r\n"
    print #gh,out_;
wend
The above 'optimization' does not compare to what can be achieved by actually removing states from the resulting code completely.
This is what a transition to a final state looks like for a separator (in this case a left parenthesis)
Code: Select all
Case POODLE_UCS_LEFT_PARENTHESIS
    State = StateMachineState.left_parenthesis
The above transition puts the machine into one of the final states referred to earlier. But State need not be set to left_parenthesis at all: the code associated with StateMachineState.left_parenthesis could be executed right away, without changing state.
Optimization using the above scheme is not without its problems. Poodle-lex adds characters to the current token at the end of a loop. That loop gets
cut short when the transition to StateMachineState.left_parenthesis does not take place, which means the text associated with the current token would be missing its final character. Changing that behaviour of the scanner takes a small rewrite of the routine that returns a token (Poodle.LexicalAnalyzerToken).
That routine should consume another character based upon a parameter (if the parameter is true, consume a character). The code added to Poodle.LexicalAnalyzer would be a copy of the code that sits at the end of the main loop:
Code: Select all
If parameter Then
    If Capture = 1 Then Text.Append(This.Character)
    This.Character = This.Stream->GetCharacter()
End If
This.Id = Id
This.Text = Text
The above is of course just an idea. I might have misunderstood your code, in which case my rewrite of Poodle.LexicalAnalyzerToken would mess up the scanner.
I wanted to find out how many states of the state machine could be removed by removing the final states mentioned above. So I counted how many
final states of that kind were in the state machine generated for the Java grammar. These are the numbers:
Total number of states: 126
Number of removable states: 45
That's a reduction of roughly 36%, a reduction in the number of generated states that matters.
Final states that have one incoming arc and no outgoing arcs are generated mostly for single-character punctuators (eg ( ) [ ] { }) but also for tokens
that consist of several characters (eg -> ++ -- -= +=). If poodle-lex is used to scan a programming language, chances are the language definition will contain quite a few punctuators (and the above optimization will yield a smaller state machine). If poodle-lex is used to scan something other than a programming
language, the optimization might not yield much of an improvement.
I noticed half the size of the Java scanner is made up of states that distinguish keywords from identifiers. If all rules for keywords are removed and only the rule for identifiers is kept, the state machine gets a lot smaller: with a grammar containing quite a few keywords, as much as half the state machine can be removed by leaving the keyword rules out. Some scanner generators have a section called 'keywords'. Upon finding an identifier that matches a keyword,
the scanner returns the keyword token associated with it (instead of returning a plain identifier token).
A hashtable or a sorted array would be enough to support that type of feature.
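A minimal sketch of that keyword-table idea in Python (the token names here are invented for illustration, not poodle-lex's actual output):

```python
# Hypothetical sketch: after the scanner matches an identifier, one hash
# lookup decides whether it is really a keyword. This replaces the dozens of
# DFA states that would otherwise spell out each keyword character by
# character.

KEYWORDS = {"if": "KW_IF", "else": "KW_ELSE", "while": "KW_WHILE",
            "return": "KW_RETURN"}

def classify(identifier_text):
    return KEYWORDS.get(identifier_text, "IDENTIFIER")

print(classify("while"))   # KW_WHILE
print(classify("whilst"))  # IDENTIFIER
```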
Last but not least, I tried to use poodle-lex to scan a wiki-like language.
Code: Select all
bold: '\*\*'
underline: '__'
italic: '//'
monospace: '##'
center: '@@'
header1: '======'
header2: '====='
header3: '===='
header4: '==='
header5: '=='
separator: '----'
line_break: '---'
let space = ' \t'
indent: '(~| |\t)'
special_tag: '{{' switch section
skip: '{space}+'
fbdoc: 'fbdoc' switch section fbdoc_section
end section
link_start: '\[\[' switch section
skip: '{space}*'
link_target: '{identifier}' switch section link_description
end section
section link_description:
skip: '{space}+' switch section link_description_text
end section
section link_description_text:
skip: '\]\]' exit section
link_char: '.'
end section
#and so forth and so on...
The problem is with link_description_text. The rule works, but it leaves the user with
the link description chopped up into x tokens (where x is the number of characters in the link description).
Defining link_char as '.+' would not work, as it would match everything up to the end of the input.
link_char: '[^\]]+' could work, but that assumes ] cannot be used in a link description, which may or may not
be true. If a single ] is allowed then '[^\]]+' would not work as expected (it would merely cut the link description into ]-delimited pieces of text).
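That chopping behaviour is easy to check with Python's re module (an illustration only; poodle-lex's regex dialect may differ):

```python
import re

# A character class that excludes ']' matches runs of characters, so a
# description containing ']' is split into ]-delimited pieces rather than
# being matched whole.
description = "see [also] this"
pieces = re.findall(r'[^\]]+', description)
print(pieces)  # ['see [also', ' this']
```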
The problem repeats on a global scale when using a catch-all rule like catch_all: '.'.
catch_all would match a single character at a time, creating one token for any character that
isn't matched by any of the other rules. Using a rule that matches any character
as a 'last resort' would therefore generate lots and lots of one-character tokens, and it would slow down scanning considerably.
What I'd want is a special variable (called, for example, {any_token}) that initiates the creation of an any_token when
none of the other rules match. It would act as if defined by:
'If none of the rules match, add the current character to the any_token (but do not return it yet).'
When some rule does match, the scanner could test whether the any_token has content. If it does, the any_token should be returned before the token belonging to the currently matched text is returned.
As an example, here is the token stream for the input ** bold text is ** good text ** when scanned with the bold rule from the rules file above.
Tokens (with content between parens):
Code: Select all
bold(**)
any_token( bold text is )
bold(**)
any_token( good text )
bold(**)
end_of_input()
It's a bit like the way poodle-lex returns Poodle.LexicalAnalyzerToken.InvalidCharacter when it finds a character that
does not fit the definition of any of the rules. But instead of consuming one character and returning it as an InvalidCharacter, several characters could be grouped and
returned as a single any_token. If the user does not define any_token (any_token does not occur on the right-hand
side of any rule) then poodle-lex could simply deal with invalid characters using Poodle.LexicalAnalyzerToken.InvalidCharacter, as it does now.
In terms of code changes you'd need to add support for the any_token. In terms of ease of use, the any_token would prove invaluable
when writing scanners where the user wants to use poodle-lex as a filter: one that, for example, processes certain parts of the
input but lets all other parts pass through unchanged (copied to the output verbatim).
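To make the idea concrete, here is a sketch (in Python, since poodle-lex itself is Python) of any_token implemented as a post-processing filter. The (kind, text) token shape and the 'invalid' kind are my own assumptions, not poodle-lex's real interface:

```python
# Hypothetical sketch: consecutive invalid-character tokens coming out of the
# scanner are merged into a single any_token before being handed to the
# caller.

def coalesce_any_token(tokens):
    buffer = ""
    for kind, text in tokens:
        if kind == "invalid":
            buffer += text          # accumulate, do not emit yet
            continue
        if buffer:                  # a real rule matched: flush the run first
            yield ("any_token", buffer)
            buffer = ""
        yield (kind, text)
    if buffer:                      # flush a trailing run at end of input
        yield ("any_token", buffer)

raw = [("bold", "**"), ("invalid", "b"), ("invalid", "o"),
       ("bold", "**"), ("end_of_input", "")]
print(list(coalesce_any_token(raw)))
# [('bold', '**'), ('any_token', 'bo'), ('bold', '**'), ('end_of_input', '')]
```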
It's a pity I seem to be the only one interested enough in your project to post messages. Afaik there is no fb equivalent of flex
(or bison or ...) other than my attempt at porting coco/r to FreeBASIC. Perhaps the fact that poodle-lex is written in Python is
what's keeping other fb programmers away from your tool.
Some final words on automata generation: follow automata. Very interesting.
http://blog.kerios.fr/wp-content/upload ... tomata.pdf
A follow automaton (an NFA) is created using the same kind of construction scheme as the Thompson construction (as used by poodle-lex?).
But it generates an epsilon-NFA that is much smaller than the one generated by the Thompson construction. A smaller NFA means less input for the subset construction (and thus faster generation of the minimized DFA). The paper also contains an algorithm to turn the generated epsilon-NFA into an epsilon-free NFA, which brings down the size of the input for subset construction even more.
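The inner step that a smaller epsilon-NFA speeds up is the epsilon-closure computed over and over during subset construction; a minimal sketch (the dict-of-sets NFA encoding is my own, purely for illustration):

```python
# Minimal epsilon-closure: the set of states reachable from 'states' by
# following only epsilon arcs. Subset construction calls this for every DFA
# state it builds, so fewer NFA states and epsilon arcs mean less work.

def epsilon_closure(states, eps):
    """eps: dict mapping state -> set of states reachable by one epsilon arc."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

eps = {0: {1, 2}, 2: {3}}
print(epsilon_closure({0}, eps))  # {0, 1, 2, 3}
```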