A lexical analyzer generator for FreeBASIC

User projects written in or related to FreeBASIC.
jofers
Posts: 1525
Joined: May 27, 2005 17:18

A lexical analyzer generator for FreeBASIC

Post by jofers »

Hey all,

It's been awhile since I touched anything FreeBASIC, so I started toying with a project called Poodle-Lex. It's a lexical analyzer generator which emits FreeBASIC code, similar to lex and its ilk:
https://github.com/parkertomatoes/poodl ... /0.5-alpha

It's written in Python, but generates FreeBASIC code for both Linux and Windows which takes in a stream of Unicode code-points and spits out tokens and text. It's been something of a learning project for me, so it will also generate NFA and DFA graphs as GraphViz graphs for anyone that wants to experiment with it.

Anyway, Happy New Year to anyone that can recall me :)

EDIT:

Latest release:
https://github.com/parkertomatoes/poodl ... es/tag/1.0
Last edited by jofers on Apr 22, 2014 4:25, edited 2 times in total.
AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Re: A lexical analyzer generator for FreeBASIC

Post by AGS »

A couple of questions/remarks.

First one: what does the version number 0.5-alpha mean? Does the 'alpha' mean 'it does not work', or does it
mean 'it works, but I need to test a bit more to make sure it works well'?

Secondly: why Python? Why not everything in FreeBASIC? I have experimented with
tools like re2c, byacc, lemon etc... and I ended up adding a new back end to C code. But
that C code was already written (albeit not by me). In your case nothing was written yet,
so why not do everything in FB? I would have to install Python to get the tool working.
I happen to have Python installed, but I am guessing most FB programmers would not
be too thrilled about having to install Python just to use your analyser generator.

And where is the goto? Or where are the gotos? I have written some
state machines by hand just by starting with an idea of what I want, turning that
into GraphViz, and creating code from there, using one label per state and a couple of
if...then statements per transition. The input is a string.

An identifier becomes something like this (s is the input string, idx is an integer, slen is the length
of the input, and end_ is a ubyte):

Code: Select all

start: 
     if (s[idx] >= asc("A") andalso s[idx] <= asc("Z")) then goto s1
     if (s[idx] >= asc("a") andalso s[idx] <= asc("z")) then goto s1
     if (s[idx] = asc("_")) then goto s1
     ''not an identifier and identifier is the only possible kind of lexeme
     goto error_
s1: 
     ''check to see if there is one more character left to scan
     if (check(idx,1,slen)) then
       end_ = 1
       goto done_
     end if
     idx += 1
     ''check bounds first
     if (s[idx] < asc("0") orelse s[idx] > asc("z")) then goto done_
     ''continue with checking upper bound of character classes first
     if (s[idx] <= asc("9")) then goto s1
     if (s[idx] <= asc("Z") andalso s[idx] >= asc("A")) then goto s1
     if (s[idx] = asc("_")) then goto s1
     if (s[idx] <= asc("z") andalso s[idx] >= asc("a")) then goto s1
     
done_:
    ''add a token to the token_stream
    ''and either end scanning or goto start
    if (end_) then
      return token_stream
    else
      goto start
    end if

error_:

check(idx,1,slen) returns -1 when there are no more characters to process.

You can minimize the number of calls to check.
If, for example, 5 more characters are needed to complete the scan of
some lexeme, then check(idx,5,slen) could be used.

After check(idx,5,slen) succeeds, 5 characters can be processed without having to check
bounds. Whether this is very useful depends upon the language being recognized.
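
For illustration, a minimal sketch of what check could look like (hypothetical; it assumes idx is zero-based and slen is the length of the input):

Code: Select all

''hypothetical bounds helper: returns -1 (true) when fewer than n unread
''characters remain after position idx, and 0 otherwise
function check(byval idx as integer, byval n as integer, byval slen as integer) as integer
    if (idx + n >= slen) then return -1
    return 0
end function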

Replacing the state = ... plus select case structure with a goto/label structure
is mostly a matter of speed.

There is of course the problem that I write my state machines by hand. I perform
minimization by looking at the picture produced by GraphViz. Whether it is
at all possible to generate state machines that resemble what I write by hand,
I don't really know (my approach tends to suffer a bit from code replication).

A paper on the subject of using gotos (combined with switches) when creating
a lexical analyser can be found here (it's a link to a paper on the implementation of re2c):
http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf

The paper was written by the creators of re2c (dated 1994), which, even after so many years
(1994 is a looonnnggg time ago), is still the lexing tool to beat.
Few scanner generators can create scanners that run at a pace equal to the ones created by
re2c. Re2c uses a combination of switches and if...then combined with gotos to implement
a scanner.

One other thing (in terms of performance) could be interesting.

Would it be possible to store the positions of matched text rather than the matched
text itself? You have used memory mapping to access a file. I am guessing that's about
the same as copying the file to memory.

Every time a token gets created, some text gets copied from the mmap to
a token. That adds up to a lot of copying per file.

Storing only the position of the token within the mapping would save a lot of copying.
The position could be start and length (two integers) or start and end (again two integers).
Setting two integers per token takes a fraction of the time it takes to copy text from
the mmap to the token.
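
A minimal sketch of what such a position-only token could look like (hypothetical, not the Token type Poodle-Lex actually generates):

Code: Select all

''hypothetical token that records where it matched instead of the matched text
type TokenSpan
    Id as integer      ''token kind
    Start as integer   ''offset of the first matched character within the mapping
    Size as integer    ''number of matched characters
end type
''the text can still be recovered on demand by reading Size characters
''starting at offset Start in the mapping, but only when it is actually needed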

And finally: the code contains a function minimize that converts a DFA into a minimized DFA.
The file that contains the code for the minimization is called "NaiveDFAMinimizer".
Why is the minimize function considered naive? Have you considered non-naive methods
of turning the DFA into a minimized DFA? And is O(n^2) an indication of the worst-case
or the average-case behaviour of minimize? And what about... Brzozowski?
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: A lexical analyzer generator for FreeBASIC

Post by TJF »

Extract words from a string (this also handles preprocessor tokens starting with # or containing ##, but there is no UTF-8 support here):

Code: Select all

VAR s = "#DEFINE test(_T_) if idx > A ANDALSO typ <> U#_T_ THEN done"

VAR size = 0, start = 1
FOR idx AS INTEGER = 0 TO LEN(s) - 1
  SELECT CASE AS CONST s[idx]
  CASE ASC("0") TO ASC("9")
    IF size THEN size += 1
  CASE ASC("#"), ASC("_"), ASC("A") TO ASC("Z"), ASC("a") TO ASC("z")
    size += 1 : IF size = 1 THEN start = idx
  CASE ELSE
    IF 0 = size THEN CONTINUE FOR
    ?MID(s, start + 1, size)
    size = 0
  END SELECT
NEXT
IF size THEN ?MID(s, start + 1)
Output (one word per line here)
#DEFINE
test
_T_
if
idx
A
ANDALSO
typ
U#_T_
THEN
done
jofers
Posts: 1525
Joined: May 27, 2005 17:18

Re: A lexical analyzer generator for FreeBASIC

Post by jofers »

Thank you for looking so deeply at it, AGS.

What does the version number mean?
0.5 alpha just means that I'm the only one that's looked at it so far, and I'd be embarrassed if I threw out "1.0" and it didn't even work.

Why Python?
For two reasons. First, scripting languages are a natural fit for little text-parsing and code-generation tasks like this, because duck typing lets you get off the ground quickly at the expense of performance. Second, FreeBASIC doesn't yet have generics, which makes it tedious to create complex data structures.

Why not use gotos for the state machine? Why copy the text for each token?
Those are two solid optimizations, but I wanted the generated code to emphasize readability over performance. I think the text copying is a bigger sacrifice than the state machine method, but I didn't want to return encoded text to the user or give them something that was tied to the lifetime of the stream. You're right that it copies a lot of text, though, and it may make sense to provide both an optimized emitter and a readable emitter in the future if that becomes a bottleneck for people.

Why the O(N^2) algorithm for minimizing the DFA? Why not Brzozowski?
I started out trying to use Brzozowski's algorithm, because it's such a cool, weird algorithm. It worked, but it kept collapsing states from different rules, and I couldn't identify which rule had been matched in the final state. I think the code's still there in the DFA module.

So I used a textbook algorithm that was O(N^2) to get it working, with the plan to go back and use Hopcroft's algorithm (O(N log N)). Buuut, it just worked well enough at that point and I was lazy.
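
For anyone curious, here is a rough sketch of that kind of textbook refinement on a toy four-state DFA (this is not the actual code in the DFA module, and it isn't tuned for complexity; the point is that the initial groups are seeded with the rule id, so states belonging to different rules can never be merged):

Code: Select all

''hypothetical sketch of partition refinement on a toy DFA:
''4 states over symbols {a, b}; state 3 accepts rule 0, states 1 and 2 are equivalent
Const NSTATES = 4, NSYMBOLS = 2
Dim trans(NSTATES - 1, NSYMBOLS - 1) As Integer => {{1, 2}, {3, 3}, {3, 3}, {3, 3}}
Dim accept(NSTATES - 1) As Integer => {-1, -1, -1, 0}  ''rule id per state, -1 = none

''seed the groups with the accepting rule so different rules never collapse together
Dim grp(NSTATES - 1) As Integer
For s As Integer = 0 To NSTATES - 1 : grp(s) = accept(s) + 1 : Next

Dim changed As Integer = -1
While changed
    changed = 0
    ''signature of a state: its own group plus the group each symbol leads to
    Dim sig(NSTATES - 1) As String
    For s As Integer = 0 To NSTATES - 1
        sig(s) = Str(grp(s))
        For c As Integer = 0 To NSYMBOLS - 1
            sig(s) &= "," & Str(grp(trans(s, c)))
        Next
    Next
    ''states with equal signatures stay together; the rest get split apart
    Dim newgrp(NSTATES - 1) As Integer
    Dim ngroups As Integer = 0
    For s As Integer = 0 To NSTATES - 1
        newgrp(s) = -1
        For t As Integer = 0 To s - 1
            If sig(t) = sig(s) Then newgrp(s) = newgrp(t) : Exit For
        Next
        If newgrp(s) = -1 Then newgrp(s) = ngroups : ngroups += 1
    Next
    For s As Integer = 0 To NSTATES - 1
        If newgrp(s) <> grp(s) Then changed = -1
        grp(s) = newgrp(s)
    Next
Wend

''states 1 and 2 should end up in the same group
For s As Integer = 0 To NSTATES - 1 : Print "state"; s; " -> group"; grp(s) : Next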

To TJF:
That's a handy tokenizer that can get you started parsing a chunk of text without any dependencies, and getting something going is way more important than most people realize, including myself. I've had projects die because I spent too much time on a lexer and my motivation petered out. Hopefully, this tool can be used to create a more complex set of tokens without adding too much complexity to the life of whoever is using it.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: A lexical analyzer generator for FreeBASIC

Post by TJF »

jofers wrote:... parsing a chunk of text ...
No, sorry. It cannot handle real code. It's just the part AGS sent.

To parse real code you also have to separate string literals and comments. (The former are a bit tricky when they contain escape sequences.)

Code: Select all

VAR s = "#DEFINE test(_T_) if idx > A*9ANDALSO typ <> U#_T_ THEN done"

VAR size = 0, start = 1
FOR idx AS INTEGER = 0 TO LEN(s) - 1
  SELECT CASE AS CONST s[idx]
  CASE ASC("/") '                                    multi line comments
    IF s[idx + 1] <> ASC("'") THEN EXIT SELECT
    idx += 2
    DO
      SELECT CASE AS CONST s[idx]
      CASE 0 : EXIT DO
      CASE ASC("'")
        SELECT CASE AS CONST s[idx + 1]
        CASE 0 : EXIT DO
        CASE ASC("/") : idx += 1 : EXIT DO
        END SELECT
      END SELECT : idx += 1
    LOOP
  CASE ASC("'") '                                   single line comments
    DO
      idx += 1
      SELECT CASE AS CONST s[idx]
      CASE 0, ASC(!"\n") : EXIT DO
      END SELECT
    LOOP
  CASE ASC("""") '                                       string literals
    VAR esc = IIF(s[idx - 1] = ASC("!"), 1, 0)
    DO
      idx += 1
      SELECT CASE AS CONST s[idx]
      CASE 0, ASC(!"\n") : idx -= 1 : EXIT DO
      CASE ASC("\") : IF esc THEN idx += 1
      CASE ASC("""") : IF s[idx + 1] = ASC("""") THEN idx += 1 ELSE EXIT DO
      END SELECT
    LOOP
  CASE ASC("0") TO ASC("9")
    IF size THEN size += 1
  CASE ASC("#"), ASC("_"), ASC("A") TO ASC("Z"), ASC("a") TO ASC("z")
    size += 1 : IF size = 1 THEN start = idx
  CASE ELSE
    IF 0 = size THEN CONTINUE FOR
    ?MID(s, start + 1, size)
    size = 0
  END SELECT
NEXT
IF size THEN ?MID(s, start + 1)
jofers
Posts: 1525
Joined: May 27, 2005 17:18

Re: A lexical analyzer generator for FreeBASIC

Post by jofers »

Cool. But it gets complicated as you scale up! For the sake of demonstration, let's do the same task with Poodle-Lex. First, we define our rules file, "Test.rules":

Code: Select all

# Comments 
MultilineComment: "/'.*'/"
SingleLineComment: "'[^\n]*"

# Keywords
KwAndAlso: i"AndAlso"
KwIf: i"If"
KwThen: i"Then"

# Characters
LeftParenthesis: "\("
RightParenthesis: "\)"
NotEquals: "\<\>"
GreaterThan: "\>"
Asterisk: "\*"
Newline: "\n|\r\n"

# General tokens
StringLiteral: '"(\\.|[^"])*"'
PreProcessorCommand: "#[A-Za-z_]+"
Number: "[:digit:]+"
Identifier: "[a-zA-Z_][a-zA-Z0-9_#]*"
Whitespace: "[ ]+"
Then, we build our lexical analyzer:

Code: Select all

Poodle-Lex Test.rules OutputDirectory
Then, we write a program to print out the tokens:

Code: Select all

#include "UTF8Stream.bi"
#include "LexicalAnalyzer.bi"

Var Stream = Poodle.UTF8Stream("Test.txt")
Var LexicalAnalyzer = Poodle.LexicalAnalyzer(@Stream)

Do While LexicalAnalyzer.IsEndOfStream() = 0
    Var Token = LexicalAnalyzer.GetToken()
    If Token.Id = Poodle.Token.Newline Then
        Print Token.Id, "[Newline]"
    ElseIf Token.Id <> Poodle.Token.Whitespace Then
        Print Token.Id, Token.Text.ToUtf8String()
    End If
Loop
Then, we test it with a sample, "Test.txt":

Code: Select all

#DEFINE test(_T_) if idx > A*9ANDALSO typ <> U#_T_ THEN done
"\"Strings, man"
Build, and run. Output:

Code: Select all

 14           #DEFINE
 16           test
 7            (
 16           _T_
 8            )
 5            if
 16           idx
 10           >
 16           A
 11           *
 15           9
 4            ANDALSO
 16           typ
 9            <>
 16           U#_T_
 6            THEN
 16           done
 12           [Newline]
 13           "\"Strings, man"
If it needs to parse more kinds of tokens, you just add more rules.
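
For example, if you later need = and , tokens as well, that's just two more lines in Test.rules (hypothetical additions, same syntax as the rules above):

Code: Select all

# Hypothetical additions
Equals: "="
Comma: ","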
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: A lexical analyzer generator for FreeBASIC

Post by TJF »

jofers wrote:But it gets complicated as you scale up!
Not really.

Use an outer SELECT block to separate the words and an inner one in CASE ELSE to handle the special characters. Then just add character definitions and a keyword function and you're ready to go.

This code also does some pretty printing by formatting the FB keywords in camel case. (Still just 100 LOC, using your token numbers.)

Code: Select all

VAR s = "", fnam = "lexeme.txt", fnr = FREEFILE
IF OPEN(fnam FOR INPUT AS fnr) THEN
  ?"Cannot open " & fnam
ELSE
  s = STRING(LOF(fnr), 0)
  GET #fnr, ,s
  CLOSE #fnr
END IF

'' The characters
#DEFINE   TOK_NL 12,"[Newline]"
#DEFINE TOK_BROP  7, "("
#DEFINE TOK_BRCL  8, ")"
#DEFINE   TOK_UE  9, "<>"
#DEFINE   TOK_GT 10, ">"
#DEFINE TOK_ASTX 11, "*"

FUNCTION Keywords(BYREF W AS STRING) AS ZSTRING PTR
  SELECT CASE W
  CASE "#DEFINE" : RETURN @!"\014#Define"
  CASE "ANDALSO" : RETURN @!"\004AndAlso"
  CASE      "IF" : RETURN @!"\005If"
  CASE    "THEN" : RETURN @!"\006Then"
  END SELECT
  RETURN @!"\016"
END FUNCTION

VAR size = 0, start = 1
FOR idx AS INTEGER = 0 TO LEN(s) - 1
  SELECT CASE AS CONST s[idx]
  CASE ASC("0") TO ASC("9")
    IF size THEN size += 1
  CASE ASC("#"), ASC("_"), ASC("A") TO ASC("Z"), ASC("a") TO ASC("z")
    size += 1 : IF size = 1 THEN start = idx
  CASE ELSE
    IF size THEN
      VAR x = Keywords(UCASE(MID(s, start + 1, size)))
      IF x[0] = 16 THEN ?16, MID(s, start + 1, size) _
                   ELSE ?ASC(*x), x[1]
      size = 0
    END IF

    SELECT CASE AS CONST s[idx]
    CASE ASC(!"\r") : IF s[idx + 1] = ASC(!"\n") THEN idx += 1 : ?TOK_NL
    CASE ASC(!"\n") : ?TOK_NL
    CASE ASC("(") : ?TOK_BROP
    CASE ASC(")") : ?TOK_BRCL
    CASE ASC("*")
      SELECT CASE AS CONST s[idx + 1]
      CASE ASC(!"=") : idx += 1 : ?999,"Not yet defined"
      CASE ELSE : ?TOK_ASTX
      END SELECT
    CASE ASC(">") : ?TOK_GT
    CASE ASC("<")
      SELECT CASE AS CONST s[idx + 1]
      CASE ASC(!">") : idx += 1 : ?TOK_UE
      CASE ASC(!"=") : idx += 1 : ?999,"Not yet defined"
      CASE ELSE : ?999,"Not yet defined"
      END SELECT
    CASE ASC("""") '                                     string literals
      VAR esc = IIF(ASC(s, idx) = ASC("!"), 1, 0)
      start = idx
      DO
        idx += 1
        SELECT CASE AS CONST s[idx]
        CASE 0, ASC(!"\n") : idx -= 1 : EXIT DO
        CASE ASC("\") : IF esc THEN idx += 1
        CASE ASC("""") : IF s[idx + 1] = ASC("""") THEN idx += 1 ELSE EXIT DO
        END SELECT
      LOOP
      ?13,MID(s, start + 1, idx - start + 1)
      start = 0
    CASE ASC("/") '                                  multi line comments
      IF s[idx + 1] <> ASC("'") THEN EXIT SELECT
      idx += 2
      DO
        SELECT CASE AS CONST s[idx]
        CASE 0 : EXIT DO
        CASE ASC("'")
          SELECT CASE AS CONST s[idx + 1]
          CASE 0 : EXIT DO
          CASE ASC("/") : idx += 1 : EXIT DO
          END SELECT
        END SELECT : idx += 1
      LOOP
    CASE ASC("'") '                                 single line comments
      DO
        idx += 1
        SELECT CASE AS CONST s[idx]
        CASE 0, ASC(!"\n") : EXIT DO
        END SELECT
      LOOP
    END SELECT
  END SELECT
NEXT

IF size THEN
  VAR x = Keywords(UCASE(MID(s, start + 1)))
  IF x[0] = 16 THEN ?16, MID(s, start + 1) _
               ELSE ?ASC(*x), x[1]
END IF
BTW:
Strings in FB source only get escape processing when they are prefixed with an ! character.
lexeme.txt wrote: #DEFINE test(_T_) if idx > A*9ANDALSO typ <> U#_T_ THEN done
!"\"Strings, man" ' escape
"\"Strings, man" ' no escape
Output (keywords without leading white space)

Code: Select all

14            #Define
 16           test
 7            (
 16           _T_
 8            )
5             If
 16           idx
 10           >
 16           A
 11           *
4             AndAlso
 16           typ
 9            <>
 16           U#_T_
6             Then
 16           done
 12           [Newline]
 13           "\"Strings, man"
 13           "\"
 16           Strings
 16           man
 13           " ' no escape
 12           [Newline]
AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Re: A lexical analyzer generator for FreeBASIC

Post by AGS »

What a wonderful answer, Jofers. I'm a bit 'flabbergasted' (really).

I am installing Python 2.7 (I had 2.6 installed) as I write this (done :)) and
I will test your tool.

I have written a very bad C scanner myself using Python (I used Python because the editor
I use comes with Python scripting). I used re.Scanner to tokenize a C function declaration
(http://code.activestate.com/recipes/457 ... re-module/).

It does a reasonable job of translating single C function declarations to single FB declarations
(lots of room for improvement, as it fails on numerous syntactically correct C declarations).

I don't use Python that often so the code looks a bit 'funny'.

Code
(import pn, import scintilla, and from pypn.decorators import script are
editor-specific)

Code: Select all

##scanit tokenizes input using re.scanner
##newlines are ignored. output is one C function declaration
##the words union, struct and enum are dropped from the parameter list
##whitespace and newline are dropped as well.
##The declaration created either
##-- fits on one line if there is one parameter
##-- takes as many lines as there are parameters (one line per parameter)
##   parameters all look like byval declarator as type,_newline

import pn
import scintilla
import string
import re
from pypn.decorators import script

table = {}
table["int"] = "integer"
table["float"] = "single"
table["double"] = "double"
table["char"] = "zstring"
table["*"] = "ptr"
table["long"] = "long"
table["struct"] = ""
table["union"] = ""
table["enum"] = ""
table["void"] = "any"
table["const"] = "const"
table["unsigned"] = "unsigned"
table["signed"] = "signed"

@script("Select Declaration","Cparse")
def select_declaration():
  editor = scintilla.Scintilla(pn.CurrentDoc())
  editor.BeginUndoAction()
  startpos = editor.CurrentPos
  endpos = editor.TextLength
  editor.TargetStart = startpos
  editor.TargetEnd = editor.Length
  editor.SearchFlags = 0
  findpos = editor.SearchInTarget(1,";")
  if findpos != -1:
    editor.SelectionStart = startpos
    #cut off the ; (not needed)
    editor.SelectionEnd = findpos+1
    lsSelection = editor.SelText#GetTextRange(editor.SelectionStart,editor.SelectionEnd)
  #lsSelection = "int **(*b)(void);"
  lsSelection = editor.SelText
  declaration = convert_declaration(lsSelection)#editor.SelText)
  editor.ReplaceSel(declaration)
  #pn.AddOutput(lsSelection)
  #editor.ReplaceSel(lsSelection)
  editor.EndUndoAction()


def convert_declaration(lsSelection):
  myscanner = re.Scanner(
  [(r"[ \t]+",s_space),
  (r"[a-zA-Z_]\w*", s_ident),
  (r"\*",s_pointer),
  (r",", s_comma),
  (r"\r\n",s_crlf),
  (r"\r", s_cr),
  (r"\n",s_lf),
  (r"/\*.*?\*/",s_comment),
  (r"[(]",s_lparen),
  (r"[)]",s_rparen),
  (r"\[[^\]]*?\]",s_brackets),
  (r"\.\.\.",s_dot),
  (r";",s_semi)])
  tokens = myscanner.scan(lsSelection)
  ct = 0
  tfunc = "declare function {name}("
  tsub = "declare sub {name}("
  tfuncptr = "as function cdecl {name}("
  tsubptr = "as sub cdecl {name}("
  tparameter = "byval {name} as {type}"
  tanonparameter = "byval as {type}"
  tfuncsuf = ") as {type}"
  tsubsuf = ")"
  return_list = []
  return_value = []
  #get the type of newline by searching for first newline
  #FIXME this is not very efficient!
  newline = ""
  for j in range(len(tokens[0])):
    token = tokens[0][j]
    if token[0] == 'lf':
        newline = '\x0a'
        break
    elif token[0] == 'cr':
        newline = '\x0d'
    elif token[0] == 'crlf':
        newline = '\x0d\x0a'
    elif token[0] == 'pb':
        # replace a[] ==> ptr a ([] <==> () does not hold)
        tmp = tokens[0][j-1]
        tokens[0][j-1] = token
        tokens[0][j] = tmp
  newline = '\x0a'
  while tokens[0][ct][0] != "(":
    if tokens[0][ct][0] == "spc":
        pass
    elif tokens[0][ct][0] == "std":
        return_list.append(tokens[0][ct])
        return_value.append(tokens[0][ct])
    elif tokens[0][ct][0] == "p":
        return_list.append(tokens[0][ct])
        return_value.append(tokens[0][ct])
    elif tokens[0][ct][0] == "id":
        return_list.append(tokens[0][ct])
        return_value.append(tokens[0][ct])
        index = len(return_list) - 1
    ct = ct + 1
  parameters = []
  parameter = []
  #skip the (
  ct = ct + 1
  mark = ct
  sname = ""
  first = 0
  fp = 0
  #check for function/sub ptr (looks like (id|[*]|spc)+([*][spc]*id)(parameter_list) but ([*] id) can also be [*]!)
  if tokens[0][ct][0] == 'spc':
    ct = ct + 1
  if tokens[0][ct][0] == 'p':
    ct = ct + 1
    if tokens[0][ct][0] == 'spc':
      ct = ct + 1
    if tokens[0][ct][0] == 'id':
        sname = tokens[0][ct][1]
        fp = 1
        ct = ct + 1
        while tokens[0][ct][0] != "(":
          ct = ct + 1
        #skip the "("
        ct = ct + 1
  #check for function pointer failed: backtrack
  else:
    ct = mark

  #remove last item (it should be the declarator)
  #don't remove when it's a pointer to a proc as
  #the declarator will not be part of the proc then
  if fp != 1:
    return_value = return_value[0:len(return_value)-1]

  #fp = 1 ==> function pointer (simple declaration otherwise)
  if fp != 1:
    #proctype = 1 ==> function
    #proctype = 0 ==> sub
    proctype = 1
    if len(return_value) == 1 and return_value[0][1] == "any" :
      proc = tsub.format(name=return_list[index][1])
      proctype = 0
    else:
      proc = tfunc.format(name=return_list[index][1])
  #function/sub pointer
  elif (fp == 1):
    proctype = 1
    if len(return_value) == 1 and return_value[0][1] == "any":
      proc=tsubptr.format(name=sname)
      proctype = 0
    else:
      proc = tfuncptr.format(name=sname)
      proctype = 1

  indent = ""
  indent = indent.rjust(len(proc),' ')

  #mark position for backtracking
  mark = ct
  #check for parameterlist without content first
  empty = 0
  if tokens[0][ct][0] == 'spc':
    ct = ct + 1
  if tokens[0][ct][0] == ')':
    empty = 1
  elif tokens[0][ct][1] == 'any':
    ct = ct + 1
    if tokens[0][ct][0] == 'spc':
        ct = ct + 1
    if tokens[0][ct][0] == ')':
        empty = 1

  #empty parameterlist (early exit)
  if empty == 1:
    tdeclaration = "{prefix}{parameters}{suffix}"
    retval = ""
    if proctype == 0:
        declaration = tdeclaration.format(prefix=proc,parameters="",suffix=tsubsuf)
        return declaration
    else:
        for declspec in return_value:
            retval = retval + " " + declspec[1]
        retval = retval.strip()
        suf = tfuncsuf.format(type = retval)
        declaration = tdeclaration.format(prefix=proc,parameters="",suffix=suf)
        return declaration

  ct = mark
  while 1:
    if tokens[0][ct][0] == ')':
        break
    while 1:
        if tokens[0][ct][0] == ',':
            ct = ct + 1
            break
        elif tokens[0][ct][0] == ')':
            break
        elif tokens[0][ct][0] == 'spc':
            pass
        elif tokens[0][ct][0] == 'std':
            parameter.append(tokens[0][ct])
        elif tokens[0][ct][0] == 'dots':
            parameter.append(tokens[0][ct])
        elif tokens[0][ct][0] == 'p' or tokens[0][ct][0] == 'pb':
            #pb is result of conversion [] -> ptr
            #you cannot use this in struct declarations because of [e]
            #(the e needs to be translated/copied)
            parameter.append(tokens[0][ct])
        elif tokens[0][ct][0] == 'id':
            parameter.append(tokens[0][ct])
            index = len(parameter) - 1
        ct = ct + 1
    parameter.append(('idx',index))
    parameters.append(parameter)
    parameter = []
  typ = ""
  #sparams contains string representation of individual parameters
  sparams = []
  for j in range(len(parameters)):
    param = parameters[j]
    idx = param[len(param) - 1][1]
    for i in range(len(param)-1):
        #idx should denote the position of the declarator
        if i == idx:
            name = param[idx][1]
        else:
            typ = typ + param[i][1]
            typ = typ + " "
    typ = typ.rstrip()
    #byval as ...? No, use ...
    if typ == "...":
      sparam = "..."
    else:
      sparam = tparameter.format(name=name,type=typ)
    sparams.append(sparam)
    param = []
    sparam = ""
    typ = ""
  string_params = ""
  if len(sparams) == 1:
    #only 1 parameter --> strip leading spaces
    string_params = sparams[0].strip()
  else:
    for j in range(len(sparams)):
        s = sparams[j]
        #first parameter
        if j == 0:
            string_params = string_params + "_" + newline + indent + s + ",_" + newline
        #parameter [1..number of parameters - 1]
        elif j != len(sparams) - 1:
            string_params = string_params + indent + s + ",_" + newline
        else:
        #final parameter
            string_params = string_params + indent + s
        s = ""
  tdeclaration = "{prefix}{parameters}{suffix}"
  retval = ""
  #proctype = 0 ==> void return type
  if proctype == 0:
    declaration = tdeclaration.format(prefix=proc,parameters=string_params,suffix=tsubsuf)
  else:
    for declspec in return_value:
        retval = retval + " " + declspec[1]
    retval = retval.strip()
    suf = tfuncsuf.format(type = retval)
    declaration = tdeclaration.format(prefix=proc,parameters=string_params,suffix=suf)
  return declaration

def s_ident(scanner, token):
  if token in table:
    if table[token] != "":
      return ("std",table[token])
  else:
    return ("id",token)

def s_comma(scanner, token):
  return (",",token)

def s_cr(scanner,token):
  return ("cr",token)

def s_crlf(scanner,token):
  return ("crlf",token)

def s_lf(scanner,token):
  return ("lf",token)

def s_space(scanner,token):
  return ("spc",token)

def s_pointer(scanner,token):
  return ("p",table["*"])

def s_lparen(scanner,token):
  return ("(",token)

def s_rparen(scanner,token):
  return (")",token)

def s_semi(scanner,token):
  return (";",token)

def s_comment(scanner,token):
  token = token.replace("/*","/'")
  token = token.replace("*/","'/")
  return ("mc",token)

def s_brackets(scanner,token):
  # suffix handling (int a(int b[]))
  return ("pb","ptr")

def s_dot(scanner,token):
  #handling of ...
  return("dots","...")
Back to poodle. And I found some typos (in Python comments).

They're in
Regex.py, lines 26 and 30 (sting should be string?).

And there are some occurrences of epsilong (epsilon?) in
the following files:

DeterministicFiniteAutomataBuilder.py, lines 27, 28 and 47
_NonDeterministicFiniteAutomata.py, line 146
jofers
Posts: 1525
Joined: May 27, 2005 17:18

Re: A lexical analyzer generator for FreeBASIC

Post by jofers »

Thanks for finding the typos; they should be updated now.

Converting declarations without having to use Swig sounds handy. I'll have to check Scintilla out. I'm using Notepad++ on Windows and Geany on Linux at the moment.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: A lexical analyzer generator for FreeBASIC

Post by TJF »

jofers wrote:I'll have to check Scintilla out. I'm using Notepad++ on Windows and Geany on Linux at the moment.
I don't know about Notepad++, but Geany uses Scintilla lexers (and will come with an improved FB lexer soon).
jofers wrote:Converting declarations without having to use Swig sounds handy.
Both can act as custom commands in Geany: select the relevant code and translate with a single keystroke (<control>-[1 | 2 | 3] -- declarations only).
AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Re: A lexical analyzer generator for FreeBASIC

Post by AGS »

I have tried your package with a set of regular expressions that represent a large subset of
the C preprocessing tokens. String/char literals are kinda hard to express as regular
expressions, as the content of a string literal can be any of ASCII, UTF-8, UTF-16, UTF-32 or wchar_t.
The prefix of a string literal (either none, L, u8, u or U) determines the type of its content.

What I have thus far looks like this

Code: Select all

#A scanner for the C programming language.
#It needs more elaborate escape sequences 
#\u[:xdigit+:][:xdigit:][:xdigit+:][:xdigit:]  
#\U[:xdigit+:][:xdigit:][:xdigit+:][:xdigit:][:xdigit+:][:xdigit:][:xdigit+:][:xdigit:]
#\x[:xdigit:]+
#\[0-7]
#\[0-7][0-7]
#\[0-7][0-7][0-7]
#should also support u8, u, U and L prefixes
string:            '""((\\([\\''""abfnrtv]))|([^""]))*""'
character:         '''((\\([\\''""abfnrtv]))|([^'']))*""'
pp_number:         '((\.[0-9])|([0-9]))(([0-9])|([a-df-zA-DF-Z])|(E[\+\-])|(e[\+\-])|(P[\+\-])|(p[\+\-])|([Pp])|([Ee])|(\.))*'
line_continuation: '\\\n'
multi_comment:     '/\*([^\*]|[\r\n]|(\*+([^\*]|[\r\n])))*\*+/'
single_comment:    '//([^\r\n]*)([\r\n]+)'
identifier:        '[a-zA-Z_][a-zA-Z0-9_]*'
whitespace:        '[ \t]+'
#in case of punctuators that share a prefix put longest punctuator first
shiftleft_eq:'<<='
shiftright_eq: '>>='
shift_left:'<<'
shift_right:'>>'
lt_eq:'<='
gt_eq:'>='
gt:'>'
lt:'<'
plusplus:'\+\+'
minmin:'--'
assign_plus:'\+='
assign_min:'-='
assign_mul:'\*='
assign_xor:'\^='
assign_and:'&='
assign_or:'\|='
assign_not:'~='
not_eq:'!='
eq: '=='
log_and:'&&'
log_or:'\|\|'
mul:'\*'
not_:'~'
lbrack:'\['
rbrack:'\]'
lparen:'\('
rparen: '\)'
lcurly: '\{'
rcurly: '\}'
dots: '\.\.\.'
dot:'\.'
deref: '->'
min:'-'
assign:'='
plus: '\+'
bang: '!'
question: '\?'
xor_: '\^'
colon: ':'
semi: ';'
comma: ','
concat: '##'
sharp_include_lt_gt:  '#[ \t]*include[ \t]*<[^>]+>'
sharp_include_string: '#[ \t]*include[ \t]*"[^"]+"'
sharp_define: '#[ \t]*define[ \t]*[a-zA-Z_][a-zA-Z0-9_]*'
Whether the above will work (e.g. whether it will tokenize C source) I don't know
yet, as I ran into a serious problem: Poodle did not generate code for the above
file. I get an NFA and a (minimized) DFA but no code.
There is something wrong with the regex on line 37 of FileTemplate.py:

Code: Select all

match = re.search(r"(^([ \t]*))?\$([a-zA-Z0-9_]+)", line)
The exact message I get reads
raise error, v # invalid expression
sre_constants.error: nothing to repeat
The origin of the error:

Code: Select all

def _simple(av):
    # check if av is a "simple" operator
    lo, hi = av[2].getwidth()
    if lo == 0 and hi == MAXREPEAT:
        raise error, "nothing to repeat"
    return lo == hi == 1 and av[2][0][0] != SUBPATTERN
_simple can be found in the file sre_compile.py at line 354 (it's part of Python 2.7).
I'm sure it's an easy-to-fix problem.

I also found that identifier (the name of one of the rules) gets changed into
dentifier (in the .dot output). Why is that?

And I found a bug. When a pattern ends with a single \ followed by ' I get a string index
error. At first I thought it was just an exception thrown by your program, but that was not the
case. An example:

Code: Select all

pattern: 'a\'
The above yields an index error. I don't think an error in the input should kill the parser.

I also found some typos and I found those by accident (I was not looking for them).
I'll go through all the code and pick out all the typos I can find and then post a
bug report on typos.

Now for some ideas or, if you will, observations.

First up some small things.

Currently there can be no space between the identifier at the lhs of a rule and the colon.
It would be nice if that could be loosened up a bit: allow space between the rule name
and the colon (just like space is allowed between the colon and the rhs).

And what about line continuations? A fairly simple scheme would utilize the same line
continuation syntax as FreeBASIC.

Code: Select all

rule: "first part of rule and"_
"second part of rule"
Ultimately the rule will end with a newline but allowing _ at the end of the line and
some space at the start of the next line would be nice. Something like

Code: Select all

rule: "first|second|third|fifth" _
      "sixth|seventh|eight"      
should be legal. A line ending with a line continuation can still end in an error
(for example when one line ends with [^' and the next starts with 'a-z]).
A first pass that expands physical lines into logical lines could be a way to
deal with line continuations.
It's a bit time consuming to expand the lines first, but looking at the rate at which Poodle
spits out code, I don't think time is an issue.
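
A minimal sketch of such a pre-pass (hypothetical; it just joins every physical line ending in " _" with the line that follows):

Code: Select all

''hypothetical pre-pass: join physical lines ending in " _" into logical lines
function expand_continuations(phys() as string) as string
    dim as string result, pending
    for i as integer = lbound(phys) to ubound(phys)
        dim as string l = rtrim(phys(i))
        ''allow leading whitespace at the start of a continued line
        if (len(pending) > 0) then l = ltrim(l)
        if (right(l, 2) = " _") then
            pending &= left(l, len(l) - 2) & " "
        else
            result &= pending & l & !"\n"
            pending = ""
        end if
    next
    if (len(pending) > 0) then result &= pending & !"\n"
    return result
end function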

Over to the harder-to-incorporate changes.
It gets kinda hard to define patterns like a C comment. What would be nice to have are
explicit states. That way it becomes somewhat easier to implement
non-greedy matching, which is what you want in the case of matching multiline C comments.

Defined using non-greedy matching, a comment in C looks like

Code: Select all

comment: "/\* .*? \*/" 
I tried the above regex using Kodos and it seems to work (you have to remove the
whitespace to make it work in Kodos but it's easier to read the regex with the whitespace).

What would help as well is the ability to use previously defined rules on the rhs of rules.
It can get tedious to copy-paste patterns (and allowing rules on the rhs can help to keep the
regular expressions nice and readable).
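
Something like this is what I have in mind (purely hypothetical syntax, not something the current rules format accepts); it would make the escape sequences mentioned at the top of my rules file much easier to write:

Code: Select all

#hypothetical syntax: {rulename} stands for the pattern of an earlier rule
hexdigit:  '[0-9a-fA-F]'
ucn_short: '\\u{hexdigit}{hexdigit}{hexdigit}{hexdigit}'
ucn_long:  '\\U{hexdigit}{hexdigit}{hexdigit}{hexdigit}{hexdigit}{hexdigit}{hexdigit}{hexdigit}'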

That's it for me (for now). A few last words on Poodle-related things.
About the DFA minimizer: I think that despite it being called 'naive' the DFA
minimizer does a nice job of minimizing the DFA. It tends to shave off quite a number of
states. And generation of the DFA is fast as well, so that's good.

Great idea to have the scanner generate graphviz files (very informative).
AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Re: A lexical analyzer generator for FreeBASIC

Post by AGS »

Just to clarify the import scintilla in my Python code.

I am using Programmer's Notepad. And Programmer's Notepad uses
scintilla.

About Geany: you can write a Lua script (using the geanylua plugin)
that accesses scintilla.

geanylua also comes with some gui support (you can create a dialog
using Lua scripting).

Geanylua is absolutely brilliant. If only it were called geanyBASIC.
Then it would have been almost perfect.

Geany also comes with a GDB plugin nowadays which almost works
on Windows (almost but not quite).
jofers
Posts: 1525
Joined: May 27, 2005 17:18

Re: A lexical analyzer generator for FreeBASIC

Post by jofers »

Cool, thanks for stress-testing it. I really need to set up some unit tests for this thing and target something like your example for the next release.
AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Re: A lexical analyzer generator for FreeBASIC

Post by AGS »

I have fixed the bug that prevented poodle from generating code.
Removing the ? from the regular expression was enough to get
the code running.

The regex on line 38 now reads:

Code: Select all

                match = re.match(r"(^([ \t]*))\$([a-zA-Z0-9_]+)", line)
I just dropped the ? from the regular expression. I figured it's the
whitespace the program is after. And using the above regex the
whitespace gets captured.
Not all lines containing a $ start with nothing but whitespace.
There is one line that starts with lots of code before the $ occurs:

Code: Select all

    Dim State As Poodle.LexicalAnalyzerState = Poodle.$INITIAL_STATE
That line does not get emitted correctly. So I replaced $INITIAL_STATE by
hand guessing that it should be Poodle.InitialState (InitialState is a member
of enum LexicalAnalyzerState).

The generated code is quite clear. And I feel like something more could be done.
I'll give an example of a piece of code generated by poodle.

Code: Select all

            Select Case This.Character
                Case 65 To 90, 95, 97 To 122
What are 65 to 90, 95, and 97 to 122? Why not write

Code: Select all

Case LATIN_CAPITAL_LETTER_A To LATIN_CAPITAL_LETTER_Z, _
     LOW_LINE, _
     LATIN_SMALL_LETTER_A To LATIN_SMALL_LETTER_Z
It's much more readable albeit a little verbose.

Poodle could use the names from the Unicode standard.

You can use two functions to find out the name of a codepoint
(a program needs to import unicodedata to get access to the functions category and
name; unichr is built-in):

Code: Select all

name(unichr(codepoint))
category(unichr(codepoint))
where codepoint is a number.

Category should be used first, to check whether Python can
return a name for the given codepoint.
For codepoints in the category Cc (control characters) you cannot
get Python to give a name, as name(unichr(codepoint)) fails on codepoints
from that category.

The Unicode standard does define names for codepoints from category
Cc. You could create a small dictionary to look up the names for
codes 0 to 1F and 80 to 9F.

Unicode names on the web:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

As an example of what is possible, I turned the entire Unicode file
into a mass of #define lines (codepoints that fall within category Cc
do have a name, but that name is not in the same place as the name of
codepoints that are not control characters):

Code: Select all

#define DIGIT_ZERO &h0030 
#define DIGIT_ONE &h0031 
#define DIGIT_TWO &h0032 
#define DIGIT_THREE &h0033 
#define DIGIT_FOUR &h0034 
#define DIGIT_FIVE &h0035 
#define DIGIT_SIX &h0036 
#define DIGIT_SEVEN &h0037 
#define DIGIT_EIGHT &h0038 
#define DIGIT_NINE &h0039 
#define LATIN_CAPITAL_LETTER_A &h0041 
#define LATIN_CAPITAL_LETTER_B &h0042 
#define LATIN_CAPITAL_LETTER_C &h0043 
#define LATIN_CAPITAL_LETTER_D &h0044 
#define LATIN_CAPITAL_LETTER_E &h0045 
#define LATIN_CAPITAL_LETTER_F &h0046 
#define LATIN_CAPITAL_LETTER_G &h0047 
#define LATIN_CAPITAL_LETTER_H &h0048 
#define LATIN_CAPITAL_LETTER_I &h0049 
#define LATIN_CAPITAL_LETTER_J &h004A 
#define LATIN_CAPITAL_LETTER_K &h004B 
#define LATIN_CAPITAL_LETTER_L &h004C 
#define LATIN_CAPITAL_LETTER_M &h004D 
#define LATIN_CAPITAL_LETTER_N &h004E 
#define LATIN_CAPITAL_LETTER_O &h004F 
#define LATIN_CAPITAL_LETTER_P &h0050 
#define LATIN_CAPITAL_LETTER_Q &h0051 
#define LATIN_CAPITAL_LETTER_R &h0052 
#define LATIN_CAPITAL_LETTER_S &h0053 
#define LATIN_CAPITAL_LETTER_T &h0054 
#define LATIN_CAPITAL_LETTER_U &h0055 
#define LATIN_CAPITAL_LETTER_V &h0056 
#define LATIN_CAPITAL_LETTER_W &h0057 
#define LATIN_CAPITAL_LETTER_X &h0058 
#define LATIN_CAPITAL_LETTER_Y &h0059 
#define LATIN_CAPITAL_LETTER_Z &h005A 
#define LEFT_SQUARE_BRACKET &h005B 
#define REVERSE_SOLIDUS &h005C 
#define RIGHT_SQUARE_BRACKET &h005D 
#define CIRCUMFLEX_ACCENT &h005E 
#define LOW_LINE &h005F 
#define GRAVE_ACCENT &h0060 
The input file at unicode.org is a line-structured file.
A line starts with a codepoint.
After that comes a semicolon followed by the name of the
codepoint, followed by another semicolon, followed by the
rest of the line. For control characters a 'translated' line
would look like this:

Code: Select all

#define <control> &h\d+
Looking at the original line from the Unicode standard, a line
that contains a codepoint from the category Cc looks like this:

Code: Select all

0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
As you can see, the codepoint at the start is there, but the description is
not in the second field.

The name of a codepoint cannot start with a <, so if a name starts with a
< then that codepoint is either a control character or 'something else'.
It's a control character when the name says so (<control>).

Otherwise (if the name starts with a < but does not continue with control)
the name starts with <plane or similar, and a codepoint with such a name
does not have an individual name.

I think it's very doable to turn the file from unicode.org into #define lines
(but beware of lines that do not conform to the standard format of codepoint;name;rest_of_line).
Adding symbolic names for characters to Poodle should be doable, but only if the scanner is to
be used on either an ASCII file or a Unicode file (for a binary file it would make no sense to
emit Unicode names).
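
A minimal sketch of such a converter in FreeBASIC (hypothetical; it simply skips the <control> and range entries instead of looking their names up elsewhere):

Code: Select all

''hypothetical sketch: turn UnicodeData.txt lines (codepoint;name;rest) into #define lines
''entries whose name starts with < (<control>, range markers) are skipped here
dim as integer fin = freefile
if open("UnicodeData.txt" for input as fin) then
    print "Cannot open UnicodeData.txt"
else
    dim as string l
    do until eof(fin)
        line input #fin, l
        dim as integer p1 = instr(l, ";")
        dim as integer p2 = instr(p1 + 1, l, ";")
        if (p1 = 0 orelse p2 = 0) then continue do
        dim as string code_ = left(l, p1 - 1)
        dim as string name_ = mid(l, p1 + 1, p2 - p1 - 1)
        if (left(name_, 1) = "<") then continue do
        ''unicode names use spaces and hyphens; make them identifier-friendly
        for i as integer = 0 to len(name_) - 1
            if (name_[i] = asc(" ") orelse name_[i] = asc("-")) then name_[i] = asc("_")
        next
        print "#define " & name_ & " &h" & code_
    loop
    close #fin
end if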

As I said earlier: the output of Poodle is readable. I can see the state machine doing its work.
Nothing gets obscured; it's all perfectly understandable. That's good, I like code generators
that generate readable code.
AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Re: A lexical analyzer generator for FreeBASIC

Post by AGS »

I've taken a closer look at the generated code and found
something interesting.
Whenever a transition can only lead to one target state
(which is a final state without outgoing edges), four statements
get executed:
--> A new state number is assigned to the current state.
--> The current character is appended to the text.
--> A new character is read (This.Character is updated).
--> On the next iteration the transition is completed and
a token is returned using a block of code that looks like

Code: Select all

            Case Poodle.TokenName
            Select Case This.Character
                Case Else
                Return Poodle.Token(Poodle.Token.TokenName, Text)
            End Select
For single-character tokens (and some other tokens) no further iteration
is needed: since the character by itself is a complete token, the token can
be returned immediately. With a second loop around the first one
it would be possible to set the token type using a variable (token_type).

As an example of what I think could work, let's look at the code for recognizing
a single colon (leaving out all other code):

Code: Select all

Do
  Do
        Select Case State:
        Case Poodle.InitialState
          Select Case This.Character
          case asc(":")
            token_type = Poodle.Token.Colon
            exit do,do
          ''lots of other case statements....
  Loop
Loop
Text.Append(This.Character)
This.Character = This.Stream->GetCharacter()
Return Poodle.Token(token_type,Text)
The above code can be used instead of blocks that look like

Code: Select all

            Case Poodle.StateName
            Select Case This.Character
                Case Else
                Return Poodle.Token(Poodle.Token.TokenName, Text)
            End Select
The state assignment becomes a token type assignment.
And an exit do, do is added to jump to code that appends the token text
to the output and updates the lookahead character.
This cuts down the size of the code a bit (it removes all blocks
that deal with a final state that has no outgoing edges).

For errors in the input, a pattern similar to the one used for final states
without outgoing edges can be used. Right now Poodle deals with
errors in the following manner:

Code: Select all

                Case Else
                Text.Append(This.Character)
                This.Character = This.Stream->GetCharacter()
                Return Poodle.Token(Poodle.Token.InvalidCharacter, Text)
              end select
The code after the else can be replaced with:

Code: Select all

Case Else
  token_type = Poodle.Token.InvalidCharacter
  exit do,do
end select
exit do,do makes the code jump to the same block as used before

Code: Select all

Text.Append(This.Character)
This.Character = This.Stream->GetCharacter()
Return Poodle.Token(token_type,Text)
Put in the context of LexicalAnalyzer.bas this becomes:

Code: Select all

  dim token_type as integer
  Do
    Do
        Select Case State:
            Case Poodle.InitialState
            Select case This.Character
            ''lots of code
            End Select
          ''lots of code
        End Select
        Text.Append(This.Character)
        This.Character = This.Stream->GetCharacter()
    Loop
  Loop
  Text.Append(This.Character)
  This.Character = This.Stream->GetCharacter()
  Return Poodle.Token(token_type,Text)
If the above optimizations were used on 'my' C scanner example,
they would shave off quite a few lines of code.
I don't think the code would run faster using the optimization I am proposing,
but cutting down the number of lines of code is always a good idea, right?