CaSe-InSensitive InStr using extended regexp

Zippy
Posts: 1295
Joined: Feb 10, 2006 18:05

Post by Zippy »

PInStr: "Patterned" case-insensitive InStr using extended regular expressions.
PInStr usage:

first = PInStr( [ start, ] str, pattern [, flags ] )


Parameters:

str

..The string to be searched.

pattern

..The pattern to find. Use extended regular expressions.

start

..The position in str at which the search will begin. Optional.

flags

..Only one currently, integer, optional:
....1. 128 = turn on case-sensitivity

Return Values:

first

..The integer return value from PInStr, where:

....<0 = an error occurred
......-1 = str is empty, -2 = pattern is empty, -3 = start is greater than len(str)
......-4 = the regexp compilation failed (your regexp is bad) and
........the error/reason string will be in PInStrData.errStr

....0 = match not found

....>0 = the position of the matched pattern and
......the matched substring will be in PInStrData.match

..Also returns a UDT which is populated in the global namespace. See pinstr.bi, "tagPInStrData"

In its simplest form PInStr is nearly identical to InStr, except that PInStr searches case-insensitively by default, whereas InStr is always case-sensitive. This default can be overridden; see "flags" above.

Simple:

Code: Select all

#include once "pinstr.bi" 
print PInStr("John Doe","doe")
returns "6". Note the case-insensitive search where "doe" matches "Doe". Turn on case-sensitivity like this:

Code: Select all

#include once "pinstr.bi" 
print PInStr("John Doe","Doe",128) '128 is defined as PInStrCaseSenSe
This again will return "6", where "Doe" matches "Doe".
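
For plain substrings (no regexp metacharacters), the default behaviour can be approximated with built-ins only. This is a minimal sketch for comparison; InStrNoCase is a hypothetical helper name and is not part of pinstr.bi:

```freebasic
' Hypothetical helper, not part of pinstr.bi: case-insensitive InStr
' for literal substrings, built only from LCase and InStr.
function InStrNoCase(byref s as string, byref substring as string) as integer
    return instr(lcase(s), lcase(substring))
end function

print instr("John Doe", "doe")        ' InStr is case-sensitive: no match
print InStrNoCase("John Doe", "doe")  ' matches "Doe" at position 6
```

Unlike PInStr, this copies and lowercases both strings on every call and cannot handle patterns.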

Moving beyond simple, PInStr can utilize the power of regular expressions. It's beyond the current scope and my patience to try to explain regular expressions; if needed, you can start your education with Wikipedia:

http://en.wikipedia.org/wiki/Regular_expression

There are regexp examples on the wiki too, but they are Perl-oriented. PInStr uses regcomp()/regexec(), which are (theoretically) POSIX.2 compliant, whereas Perl, Python, and other languages extend POSIX.2 regexps in non-conforming ways, so not every Perl pattern will work unchanged.

I'll try to offer better details of how to use grouping, (), later, as PInStr can use grouping and the usage is not obvious. Hopefully the example progs below will provide enough clues to get someone started who is familiar with regexps.
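
As a rough illustration of how grouping maps onto the underlying library, the sketch below calls regcomp()/regexec() directly and pulls one group out of the regmatch_t array. GroupOf is a hypothetical helper, not part of pinstr.bi. It assumes regex.bi declares REG_EXTENDED alongside the regex_t, regmatch_t and REG_ICASE names that pinstr.bi already uses:

```freebasic
#include once "regex.bi"

' Hypothetical helper: returns capture group number grp (0 = whole match),
' or "" when nothing matched or the group took no part in the match.
function GroupOf(byref text as string, byref pattern as string, _
                 byval grp as integer) as string
    dim re as regex_t
    dim pm(0 to 9) as regmatch_t
    dim result as string
    if grp < 0 or grp > 9 then return ""
    if len(text) = 0 or len(pattern) = 0 then return ""
    if regcomp(@re, strptr(pattern), REG_EXTENDED or REG_ICASE) <> 0 then return ""
    if regexec(@re, strptr(text), 10, @pm(0), 0) = 0 then
        ' rm_so/rm_eo are 0-based byte offsets; -1 marks an unused group
        if pm(grp).rm_so >= 0 then
            result = mid(text, pm(grp).rm_so + 1, pm(grp).rm_eo - pm(grp).rm_so)
        end if
    end if
    regfree(@re)
    return result
end function

print GroupOf("Brown, Robert Z. 808-555-1212", "^([^,]*), ([^0-9]*) ", 1)
print GroupOf("Brown, Robert Z. 808-555-1212", "^([^,]*), ([^0-9]*) ", 2)
```

Groups are numbered by the position of their opening parenthesis, which is why PInStrData.group(1) holds the first "(...)" in the pattern. This sketch recompiles the pattern on every call, so it is slower than PInStr when the same pattern is reused.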

You will need pinstr.bi:

Code: Select all

'pinstr.bi
'CaSe-InSensitive InStr using extended regexp
'PInStr, "Patterned" case-insensitive InStr using extended regular expressions
'UpDate:
'   2009-04-24 Added PInput,PSplit_csv,PSearchReplace
'
'lib:
#include once "regex.bi"
'flags:
#define PInStrCaseSenSe 128 'override using default case-insensitivity
'
'====================================== PInStr()
type tagPInStrData
    match as string            'the portion of the target string that matched
    groups as integer          'the number of groups, (), in your regexp
    group(1 to 12) as string   'the matched groups/substrings
    position as integer        'the position of the first match in target,
                               '  or errorcode
    errCode as integer         'same as .position but errorcode only
    errStr as string*64        'if PInStr returns -4 then your regexp is flawed,
                               '  it wouldn't compile, this is an English
                               '  explanation of the error 
end type
dim shared PInStrData as tagPInStrData
'
declare function PInStr overload (_
    ByRef targStr as string,_
    ByRef pattern as string,_
    ByVal flags as integer=0) as integer
'
declare function PInStr (_
    ByVal start as integer,_   'same as instr(start,
    ByRef targStr as string,_  'same as instr(,str,
    ByRef pattern as string,_  'same as instr(,,substring
    ByVal flags as integer=0) as integer '128 = CaSe-Sensitive search
'
function PInStr overload (_
    ByRef targStr as string,_
    ByRef pattern as string,_
    ByVal flags as integer=0) as integer
'
    return PInStr(1,targStr,pattern,flags)
'
end function
'
function PInStr (_
    ByVal start as integer,_
    ByRef targStr as string,_
    ByRef pattern as string,_
    ByVal flags as integer=0) as integer
'
    With PInStrData
    '
    if targStr="" then
        .errStr="Target string is BLANK"
        .errCode= -1 
        return -1
    end if
    '
    if pattern="" then 
        .errStr="Pattern string is BLANK"
        .errCode= -2        
        return -2
    end if
    '
    .match=""
    '.group(1 to remax)
    .position=0
    .errCode=0
    .errStr=space(128)
    '
    dim as integer c,l,remax=12,res,rflag,tn,tp,tres
    dim as string zs
    static as string lastpattern 
    dim as zstring ptr pbuff
    static as regex_t re
    dim pm(0 to remax) as regmatch_t
    '
    if start<1 then start=1
    l=len(targStr)
    if start>l then 
        .errStr="START must be <= len(Target)"
        .errCode= -3        
        return -3
    end if
    '
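    'REG_EXTENDED selects POSIX extended syntax; matching is case-insensitive
    '  unless the PInStrCaseSenSe flag is set
    rflag=REG_EXTENDED
    if (flags and PInStrCaseSenSe)=0 then rflag or=REG_ICASE
    '
    'recompile only when the pattern or the compile flags change
    static as integer lastflag=-1
    if lastpattern<>pattern or lastflag<>rflag then
        if lastflag<>-1 then regfree(@re)
        lastpattern="":lastflag=-1  'forget the old regexp in case regcomp fails
        res=regcomp(@re,pattern,rflag)
        if res<>0 then
            'print res;" ==> regcomp failure: ";pattern
            .errCode=res
            tres=regerror(res,@re,strptr(.errStr),64)
            return -4
        end if
        lastpattern=pattern:lastflag=rflag
    end if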
    '
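    pbuff=strptr(targStr)
    'NOTE: for start>1 the scan begins at the character AFTER start;
    '  PSplit_csv relies on this to step over the delimiter
    if start>1 then pbuff+=start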
    '
    tn=re.re_nsub
    if tn>remax then tn=remax
    remax=tn
    .groups=tn
    if tn=0 then tn=1 else tn+=1
    '
    res=regexec(@re,pbuff,tn,@pm(0),0)
    '
    if res<>0 then return 0 'aw, not found..
    '
    zs=mid(*pbuff,1+pm(0).rm_so,pm(0).rm_eo-pm(0).rm_so)
    tp=pm(0).rm_so
    if start>1 then tp+=start
    '
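    'copy the matched groups, clear the rest; pm(c) belongs to group(c)
    for c=1 to 12
        if c<=remax andalso pm(c).rm_so>=0 then
            .group(c)=mid(*pbuff,1+pm(c).rm_so,pm(c).rm_eo-pm(c).rm_so)
        else
            .group(c)=""
        end if
    next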
    '
    tp+=1
    .match=zs
    .position=tp
    '
    return tp
    '
    end With
'
end function
'
'====================================== PInput()
declare function _
    PInput(_
        pt as string,_      'pattern for entry
        r as integer,_      'row for entry
        c as integer,_      'column for entry
        imax as integer,_   'max len of entry not including prompt
        iprompt as string,_ 'prompt for entry
        caseS as integer=0_ 'if 128 then match is Case-SenSitive
        ) as string
'
function _
    PInput(_
        pt as string,_
        r as integer,_
        c as integer,_
        imax as integer,_
        iprompt as string,_
        caseS as integer=0_
        ) as string
'
    dim as integer f,k,lm,res,t
    dim as string ks,ts
    '
    if caseS>0 then caseS=128
    '
    locate r,c
    print iprompt;
    locate r,c+len(iprompt)
    print string(imax,"_");
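    locate r,c
    print iprompt;      'reprint the prompt to park the cursor at the entry field
    lm=pos              'left margin: first column of the entry field
    while (1)
        k=getkey
        select case k
            case 8      'backspace, but never back over the prompt
                if pos>lm then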
                    locate r,pos-1
                    print " ";
                    locate r,pos-1
                    ts=left(ts,len(ts)-1)
                end if
            case 32 to 122
                if pos<=imax+lm then
                    print chr(k);
                    ts+=chr(k)
                else
                    beep
                end if
            case 13
                ts=trim(ts)
                if ts>"" then
                    res=pinstr(ts,pt,caseS)
                    if res then
                        exit while
                    else
                        beep
                    end if
                else
                    beep
                end if
            case 27
                f=1
                exit while
            case else
                beep
        end select
    wend
    '
    if f=1 then
        return "*Entry Canceled*"
    else
        return ts
    end if
'
end function
'
'====================================== psplit_csv()
declare function _
    PSplit_csv(_
        targStr as string,_     'string to parse/delmit
        targArray() as string,_ 'array to parse targStr TO
        delimiter as string=","_'delimiter used as.. delimiter
    ) as integer 'returns number of array els used, ONE-based
'
function _
    PSplit_csv(_
        targStr as string,_
        targArray() as string,_
        delimiter as string=","_
    ) as integer
'
    if TargStr="" then return -1
    if delimiter="" or len(delimiter)>1 then return -2
    if delimiter=!"\"" or delimiter="\" then return -3
    '
    dim as integer c,p,maxarray,res
    dim as string pt,ts
    '
    maxarray=ubound(targArray)
    '
    pt= !"(^|" _
        & "\" & delimiter & _
        !")(\"([^\"]+|\"\")*\"|[^" _
        & "\" & delimiter & _
        "]*)"
    '
    with PInStrData
    '
    c=1
    p=pinstr(targStr,pt)
    while p>0

        if left(.match,1)=delimiter and len(.match)>1 then
            targArray(c)=""
            c+=1
            targArray(c)=trim(.match,ANY delimiter & " ")
        else
            targArray(c)=.match
        end if

        if .match=delimiter then
            targArray(c)=""
            c+=1
            targArray(c)=""
        end if

        c+=1:if c>=maxarray then exit while
        p+=len(.match)

        p=pinstr(p,targStr,pt)
    wend
    '
    end with
    '
    return iif(c>1,c-1,0)
'
end function
'
'====================================== PSearchReplace()
declare function _
    PSearchReplace(_
        targStr as string,_             'string to parse
        params as string,_              'parameter string, replacements
        paramdelimiter as string="=",_  'char used to delimit param string
        caseS as integer=0_             'if 128 then is Case-SenSitive search/replace
    ) as string
'
function _
    PSearchReplace(_
        targStr as string,_
        params as string,_
        paramdelimiter as string="=",_
        caseS as integer=0_
    ) as string
'
    dim as integer c,cta,lp,op,p,res,tres
    dim as string ns,pt,ts
    dim as string ma(1 to 2,256),pa(24),ta(256)
    '
    if paramdelimiter="" then paramdelimiter="="
    '
    res=PSplit_csv(params,pa())
    if res=0 then return targStr
    for c=1 to res
        tres=PSplit_csv(trim(pa(c),chr(34)),ta(),paramdelimiter)
        ma(1,c)=ta(1)':print ta(1)
        ma(2,c)=ta(2)':print ta(2)
    next
    '
    ns=targStr
    '
 with PInStrData
    '
    for cta=1 to res
        pt=ma(1,cta)
        ts=""
        lp=1
        op=0
        p=pinstr(ns,pt,caseS)':print p
        if p=0 then continue for
        while p>0
            if p then
                ts+=mid(ns,lp,p-lp)
                if len(ma(2,cta))>0 then
                    ts+=ma(2,cta)
                end if
                lp=p+len(.match)
            end if
            p+=len(.match)
            op=p
            p=pinstr(p,ns,pt,caseS)
        wend
        '    
        'append whatever follows the last match
        ts+=mid(ns,lp)
        '    
        ns=ts
    next cta
    '
    return ns
    '
 end with
'
end function
No separate dll is needed, libtre.a is distributed with fb.

Examples:

Code: Select all

'pinstr.bas Example
'CaSe-InSensitive InStr using extended regexp
'PInStr, "Patterned" case-insensitive InStr using extended regular expressions
'
#include once "pinstr.bi"
'
dim as integer p,res
dim as string targStr,pattern,uf
'
'return all matches
targStr="Now is the time for all good men to come to the aid of the party"
print targStr
pattern="no|go|to|party"
'
res=99:p=1
uf="At: ### found: \          \  using: &"
while res>0
    'search from position p, no grouping, default case insensitive
    res=PInstr(p,targStr,pattern)
    if res>0 then
        print using uf;res,PInStrData.match,pattern
        p=res+1
    end if
wend
'
'
print
print "==========================================="
targStr="Brown, Robert Z. 808-555-1212 123 Easy Street  Lahaina, HI 96761"
print targStr
'
'get last name, first &
pattern="^([^0-9]*) "
'   search from default start position 1, return 1st matching group
res=PInStr(targStr,pattern)
print PInStrData.group(1)
'
'get last name
pattern="^[^0-9,]*"
'   search using defaults, start position 1 no grouping
res=PInStr(targStr,pattern)
print PInStrData.match
'
'get first name
pattern="(, )(\D*)"
'   search from default start position 1, return 2nd matching group
res=PInStr(targStr,pattern)
print PInStrData.group(2)
'
'get phone number US format no country code
pattern="[0-9]{3}-[0-9]{3}-[0-9]{4}"
res=PInStr(targStr,pattern)
print PInStrData.match
'
'get state code/abbreviation, trims leading space
pattern=" ([A-Z]{2} .*)$"
'   default start position 1, return 1st matching group, CASE SENSITIVE search
res=PInStr(targStr,pattern,128)
print PInStrData.group(1)
'
'get zip code
pattern="[-0-9]{5,9}$" 
res=PInStr(targStr,pattern)
print PInStrData.match
'
'using one regexp
dim z as tagPInStrData ptr=@PInStrData 'shorten calls..
print
pattern="^([^0-9,]*), ([^0-9,]*) +([0-9]{3}-[0-9]{3}-[0-9]{4}) +" & _
        "(.*)  (.*)$"
'
res=PInStr(targStr,pattern)
if res then
    print z->groups;" Groupings found in string/pattern:"
    print
    for c as integer=1 to z->groups
        print c,z->group(c)
    next
    print
    print "2+1 ";z->group(2);" ";z->group(1)
    print "  4 ";z->group(4)
    print "  5 ";z->group(5)
    print
    print "  3 ";z->group(3)
else
    print "Match not found"
end if

sleep
end
More examples to follow.
ETA: New version 2009-04-21
ETA: New Version 2009-04-24 PInstr(),PInput(),PSplit_csv() and PSearchReplace() all moved
.. into pinstr.bi, examples modified accordingly
Last edited by Zippy on Apr 25, 2009 19:23, edited 2 times in total.
AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Re: CaSe-InSensitive InStr using extended regexp

Post by AGS »

Zippy wrote: No separate dll is needed, libtre.a is distributed with fb. I'm not certain about tre support on Unix-like operating systems.
libtre.a is available on Linux (Debian/Ubuntu).
Zippy wrote: PInStr is not a speed demon. If you don't have a use for regexps and/or case-insensitive searches then InStr is a better choice, or strstr() from the CRT.
Why is it not a speed demon and what (if anything) could be changed to make it a speed demon?
Zippy

Re: CaSe-InSensitive InStr using extended regexp

Post by Zippy »

AGS wrote: <snip>
Zippy wrote: PInStr is not a speed demon. If you don't have a use for regexps and/or case-insensitive searches then InStr is a better choice, or strstr() from the CRT.
Why is it not a speed demon and what (if anything) could be changed to make it a speed demon?
[Windows] Where I noticed the speed difference (from instr and a separate grep util) was when attempting matches line-by-line against a 5000 line test file. Worst-case, PInStr was 3x slower than instr using a simple regexp.

The speed slug is regcomp() from lib tre. I was able to reduce the overhead from 300% to 50% slower than instr by reusing the already-compiled regexp across iterations. If the pattern/regexp is the same for 5000 iterations through a file then there's no need to recompile (regcomp()) the regexp.

Then, the second bottleneck is regexec() itself (again from lib tre).

I've minimized the regcomp() hit, I'm ok with a bit slower than instr to allow regexps. regexec(), the lesser hit, can't be circumvented.
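
A minimal sketch of the reuse idea described above, separate from pinstr.bi: keep the compiled regexp in static storage and call regcomp() again only when the pattern or the flags change. CachedMatch and compileCount are hypothetical names used for illustration. REG_EXTENDED and REG_NOSUB are assumed to be declared by regex.bi, as they are in POSIX regex.h:

```freebasic
#include once "regex.bi"

dim shared as integer compileCount   ' illustration only: counts regcomp() calls

' Hypothetical helper: reports whether pattern matches text, recompiling
' only when the pattern or the flags differ from the previous call.
function CachedMatch(byref text as string, byref pattern as string, _
                     byval cflags as long = REG_EXTENDED or REG_ICASE) as integer
    static as regex_t re
    static as string lastPattern
    static as long lastFlags
    static as integer haveRe
    if len(text) = 0 or len(pattern) = 0 then return 0
    if haveRe = 0 orelse pattern <> lastPattern orelse cflags <> lastFlags then
        if haveRe then
            regfree(@re)
            haveRe = 0
        end if
        if regcomp(@re, strptr(pattern), cflags or REG_NOSUB) <> 0 then return 0
        compileCount += 1
        lastPattern = pattern
        lastFlags = cflags
        haveRe = -1
    end if
    return (regexec(@re, strptr(text), 0, 0, 0) = 0)
end function

print CachedMatch("John Doe", "doe")   ' first call compiles the pattern
print CachedMatch("Jane Doe", "doe")   ' same pattern: compiled regexp is reused
print compileCount
```

Keying the cache on the flags as well as the pattern matters when the same pattern is searched both case-sensitively and case-insensitively.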
Zippy

Post by Zippy »

Another example:

Code: Select all

'pinstr example, "parsing commandline", test for each param
#include once "pinstr.bi"
'
dim as integer c,p,res
dim as string ts,pt
'
print "Usage: filemonitor -ddirspec -s0|1 -f[1-383] -r"".*"" -q"
'
ts="-d""e:\wall-lo gs\temp"" -s1 -f8 -r""\.log$"" -q"
print using "CmdLn: &";ts
print
'
with pinstrdata

if (pinstr(ts,"[^ ]?-d ?""")) then
    res=pinstr(ts,!"[^ ]?-(d) ?[\"]{0,1}([^\"]+)[\"]{0,1}[- ]|$")
else
    res=pinstr(ts,"[^ ]?-(d) ?([^ ]+)")
end if
'
if .match > "" then
print using "Match: &";.match
print using "Param: &   Value: & &";.group(1),.group(2)
print
end if

res=pinstr(ts,"-(s) ?(\d{1})")
if .match > "" then
print using "Match: &";.match
print using "Param: &   Value: &";.group(1),.group(2)
print
end if

res=pinstr(ts,"-(f) ?(\d{1,3})")
if .match > "" then
print using "Match: &";.match
print using "Param: &   Value: &";.group(1),.group(2)
print
end if

res=pinstr(ts,!"-(r) ?[\"]{0,1}([^\"]+)[\"]{0,1}[- ]|$")
if .match > "" then
print using "Match: &";.match
print using "Param: &   Value: &";.group(1),.group(2)
print
end if

res=pinstr(ts,"-(q)")
if .match > "" then
print using "Match: &";.match
print using "Param: &   Value: &";.group(1),.group(2)
print
end if

end with

sleep
end
Zippy

Post by Zippy »

Another:

Code: Select all

'pinstr example, parsing commandline, parameter iteration
#include once "pinstr.bi"
'
dim as integer c,np,p,res
dim as string ts,pt
'
print "Usage: filemonitor -ddirspec -s0|1 -f[1-383] -r"".*"" -q"
'
ts="-d""e:\wall-lo gs\temp"" -s1 -f8 -r""\.log$"" -q"
print using "CmdLn: &";ts
print
'
with pinstrdata
'
pt=!" {0,}-([a-z]) {0,}(\"([^\"]+)\")|(([^ -]{1}) {0,}([^ ]{0,}))"
'
p=pinstr(ts,pt)
'
while p>0    
' 
    if .match > "" then
        print using "Match: &";.match
        print using "Param: &   Value: & &";_
            .group(1) & .group(5),_
            .group(3) & .group(6)
        print
    end if
    '
    p+=len(.match)
    p=pinstr(p,ts,pt)
'
wend
'
end with
'
sleep
end
Zippy

Post by Zippy »

And another:

Code: Select all

'pinstr example, parsing "file data"
' data is Citizen Weather stations
#include once "pinstr.bi"
'
dim as integer res
dim as string pt,ts,uf
'
with pinstrdata
'
print "Data parse/grouping test, press a key to start"
print
sleep
'
uf="& \                 \ \ \ \ \###.#### ####.####"
'
pt="^[^|]+\|[^|]+\|([^ ]+) (.{1,27}) *([A-Z]{2}) ?" & _
   "([A-Z]{0,2})?\|[.0-9]+\| {0,}([.0-9-]+)\| {0,}([.0-9-]+)\|"
'
restore MyData
read ts
while ts<>"Zippy"
    res=pinstr(ts,pt,128)
    if res then    
        if .group(4)="" then swap .group(3),.group(4)    
        print using uf;.group(1),.group(2),.group(3),.group(4),_
                       val(.group(5)),val(.group(6))
    else
        print "eRroR"
    end if
    read ts
wend
'

print
print "Get latitude range >=40 (field 5), press a key"
sleep
print
'
pt="[^\|]+\|[^\|]+\|[^\|]+\|[^\|]+\| {0,}([4-9][0-9]\.\d+)\|"
'
restore MyData
read ts
while ts<>"Zippy"
    res=pinstr(ts,pt)
    if res then 
        print using "####.##### &";val(.group(1)),mid(ts,14,64)
    end if
    read ts
wend
'
print
print "Get altitude >199 meters (field 4) in US, press a key"
sleep
print
'
pt="[^\|]+\|[^\|]+\|[^\|]+US\|(\d?[2-9]\d{2})"
'
restore MyData
read ts
while ts<>"Zippy"
    res=pinstr(ts,pt)
    if res then 
        print using "##### &";val(.group(1)),mid(ts,14,64)
    end if
    read ts
wend
'
end with
'
print
print "Sleeping to Exit.."
sleep
end
'
MyData:
data "CW0001|C0001|CW0001 Evansville                 IN US|125.9|37.97597|-87.56822|GMT|||1||||"
data "CW0003|C0003|CW0003 Carlisle                   MA US|61|  42.5445|  -71.3735|GMT|||1||||"
data "CW0004|C0004|CW0004 Denver                     CO US|1615|39.7257|  -104.95904|GMT|||1||||"
data "CW0005|C0005|CW0005 Des Moines                 IA US|287| 41.60651| -93.69564|GMT|||1||||"
data "CW0007|C0007|CW0007 Amsterdam                     NL|2|   52.51833| 4.97333|GMT|||1||||"
data "CW0008|C0008|CW0008 Franklin                   NC US|677| 35.20033| -83.41917|GMT|||1||||"
data "CW0009|C0009|CW0009 Antwerp                       BE|32|  51.1395|  4.48483|GMT|||1||||"
data "CW0013|C0013|CW0013 Inverness                  CA US|289.6|38.07338|-122.84375|GMT|||1||||"
data "CW0018|C0018|CW0018 Smithville                 TX US|140| 29.985|   -97.30333|GMT|||1||||"
data "CW0020|C0020|CW0020 Trondheim                     NO|215| 63.42402| 10.75459|GMT|||1||||"
data "CW0028|C0028|CW0028 Muttontown                 NY US|46.9|40.82662| -73.56199|GMT|||1||||"
data "CW0041|C0041|CW0041 Brunswick                  GA US|4.6| 31.23733| -81.49517|GMT|||1||||"
data "CW0045|C0045|CW0045 Northfield                 NJ US|10|  39.3675|  -74.54833|GMT|||1||||"
data "CW0046|C0046|CW0046 Vienna                     VA US|119| 38.8915|  -77.294|GMT|||1||||"
data "CW0052|C0052|CW0052 Bath                       ME US|10|  43.84917| -69.815|GMT|||1||||"
data "CW0053|C0053|CW0053 Wamic                      OR US|694.5|45.21846|-121.3974|GMT|||1||||"
data "CW0060|C0060|CW0060 Wurzburg                      DE|270| 49.80583| 9.65467|GMT|||1||||"
data "CW0062|C0062|CW0062 Decatur                    TX US|249.9|33.26342|-97.66513|GMT|||1||||"
data "CW0065|C0065|CW0065 Olive Branch               MS US|116| 34.8805|  -89.83077|GMT|||1||||"
data "CW0066|C0066|CW0066 Hoffman Estates            IL US|253| 42.05829| -88.12915|GMT|||1||||"
data "CW0069|C0069|CW0069 Heath                      TX US|155.4|32.8208| -96.4662|GMT|||1||||"
data "CW0080|C0080|CW0080 Lawrence Sta - SWSWC       NB CA|132.1|45.46324|-67.12402|GMT|||1||||"
data "CW0082|C0082|CW0082 Albuquerque                NM US|1768|35.04648| -106.49206|GMT|||1||||"
data "CW0099|C0099|CW0099 Affton                     MO US|146| 38.54958| -90.31689|GMT|||1||||"
data "CW0104|C0104|CW0104 Dedham                     MA US|49|  42.24336| -71.19785|GMT|||1||||"
data "CW0115|C0115|CW0115 Kyle                       TX US|204| 29.97131| -97.8492|GMT|||1||||"
data "CW0117|C0117|CW0117 Thornlie                      AU|16|  -32.057|  115.95333|GMT|||1||||"
data "CW0118|C0118|CW0118 Scituate                   MA US|12|  42.17943| -70.71956|GMT|||1||||"
data "CW0120|C0120|CW0120 Sandy                      UT US|1579|40.54278| -111.811667|GMT|||1||||"
data "CW0121|C0121|CW0121 Groton                     SD US|401| 45.44083| -98.09217|GMT|||1||||"
data "CW0126|C0126|CW0126 Machynlleth                   UK|92|  52.66883| -3.80917|GMT|||1||||"
data "CW0133|C0133|CW0133 Austin                     TX US|235| 30.18115| -97.86115|GMT|||1||||"
data "CW0134|C0134|CW0134 Oulu                          FI|5|   65.101|   25.40467|GMT|||1||||"
data "CW0136|C0136|CW0136 Stockton                   CA US|12.5|38.03317| -121.34733|GMT|||1||||"
data "CW0146|C0146|CW0146 Round Rock                 TX US|216| 30.54861| -97.62608|GMT|||1||||"
data "CW0149|C0149|CW0149 Richmond                   MN US|332| 45.439|   -94.5085|GMT|||1||||"
data "CW0150|C0150|CW0150 Colonie                    NY US|94.2|42.75164| -73.87899|GMT|||1||||"
data "Zippy"
Zippy

Post by Zippy »

And an example using PInStr to editcheck user entry:

Code: Select all

'pinstr example
'PInput, "patterned input", international format phone number
' this function incorporated into pinstr.bi 2009-04-24
' The #ifndef..#endif portions 
'   may be removed from this example if using newer pinstr.bi
'
#include once "pinstr.bi"
'
dim as integer imax,c,r
dim as string iprompt,pt,ts
'
#ifndef PInput
declare function _
    PInput(_
        pt as string,_      'pattern for entry
        r as integer,_      'row for entry
        c as integer,_      'column for entry
        imax as integer,_   'max len of entry not including prompt
        iprompt as string,_ 'prompt for entry
        caseS as integer=0_ 'if 128 then match is Case-SenSitive
        ) as string
'
#endif
'pattern for entry
pt="(\+\d{1,3} )?\d{2,3} \d{3} \d{4}"
'row/col to place entry
r=4:c=1
'max len of entry
imax=17
'prompt for entry
iprompt="Phone: "
'
'country code must be prefixed with +, may have 1-3 digits, optional
'area/city code may have 2-3 digits, is not optional
'local number is 7 digits, 3 + space + 4, not optional
locate 1,1
print "Example intl phone entry, () means part is optional:"
print "(+041 )022 730 5989"
'
ts=PInput(pt,r,c,imax,iprompt)
'
print
print
print "Entered: ";ts
'
print
print
print "Sleeping to Exit.."
sleep
end
'
#ifndef PInput
'====================================== PInput()
function _
    PInput(_
        pt as string,_
        r as integer,_
        c as integer,_
        imax as integer,_
        iprompt as string,_
        caseS as integer=0_
        ) as string
'
    dim as integer f,k,lm,res,t
    dim as string ks,ts
    '
    if caseS>0 then caseS=128
    '
    locate r,c
    print iprompt;
    locate r,c+len(iprompt)
    print string(imax,"_");
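    locate r,c
    print iprompt;      'reprint the prompt to park the cursor at the entry field
    lm=pos              'left margin: first column of the entry field
    while (1)
        k=getkey
        select case k
            case 8      'backspace, but never back over the prompt
                if pos>lm then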
                    locate r,pos-1
                    print " ";
                    locate r,pos-1
                    ts=left(ts,len(ts)-1)
                end if
            case 32 to 122
                if pos<=imax+lm then
                    print chr(k);
                    ts+=chr(k)
                else
                    beep
                end if
            case 13
                ts=trim(ts)
                if ts>"" then
                    res=pinstr(ts,pt,caseS)
                    if res then
                        exit while
                    else
                        beep
                    end if
                else
                    beep
                end if
            case 27
                f=1
                exit while
            case else
                beep
        end select
    wend
    '
    if f=1 then
        return "*Entry Canceled*"
    else
        return ts
    end if
'
end function
#endif
ETA: mod to accommodate new pinstr.bi 2009-04-24
Last edited by Zippy on Apr 25, 2009 19:27, edited 1 time in total.
AGS

Re: CaSe-InSensitive InStr using extended regexp

Post by AGS »

Zippy wrote:The speed slug is regcomp() from lib tre. I was able to reduce the overhead from 300% to 50% slower than instr by reusing the already-compiled regexp across iterations. If the pattern/regexp is the same for 5000 iterations through a file then there's no need to recompile (regcomp()) the regexp.

Then, the second bottleneck is regexec() itself (again from lib tre).

I've minimized the regcomp() hit, I'm ok with a bit slower than instr to allow regexps. regexec(), the lesser hit, can't be circumvented.
From 300% down to 50% is an enormous improvement. I'm not so sure about the performance of libtre. Have you tried PCRE?
Zippy

Post by Zippy »

Introducing PSplit_csv, "patterned split of delimited data"

I could have called it "Split".. Or just "PSplit".. But I didn't. What it is, is a method using PInStr to parse and return values from delimited data. The data must be in CSV format, but the delimiter (the "C" in CSV) is user-selectable: it defaults to a comma but may be any single printable character other than a double quote (") or a backslash (\).

This is standard format CSV:

"Young","George",32,1.888.555.1212,"123 Easy Street",96761,"HI"

Other delimiters are often used, like a vertical bar:

"Young"|"George"|32|1.888.555.1212|"123 Easy Street"|96761|"HI"

Either of these formats is trivial to parse using regular expressions. What is more difficult to parse is CSV data that contains null values, where 2 or more delimiters occur sequentially or a delimiter appears as the first or last "value":

"Young","George",,,"123 Easy Street",96761,

I resorted to looking for examples of parsing nulls on the internets. What I found worked for nulls but then compromised quoting. The regular expression I ended up with is partly from the wild and partly mine, and may not be perfect.. I know I haven't addressed malformed data.
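
For comparison only, here is a sketch of a different technique: a character-by-character scanner with no regular expression at all. It handles the same cases (quoted fields, embedded delimiters, consecutive delimiters). SplitCsvLine is a hypothetical helper, not part of pinstr.bi. Unlike PSplit_csv it strips the surrounding quotes and treats a doubled quote inside a quoted field as one literal quote:

```freebasic
' Hypothetical helper, not part of pinstr.bi: splits one delimited line by
' scanning characters. Returns the number of fields found (one-based);
' fields beyond ubound(fields) are counted but not stored.
function SplitCsvLine(byref s as string, fields() as string, _
                      byref delim as string = ",") as integer
    dim as integer i, n = 0, inQuotes = 0
    dim as string cur, ch
    for i = 1 to len(s)
        ch = mid(s, i, 1)
        if ch = chr(34) then
            if inQuotes andalso mid(s, i + 1, 1) = chr(34) then
                cur += chr(34)            ' doubled quote inside a quoted field
                i += 1
            else
                inQuotes = not inQuotes   ' opening or closing quote
            end if
        elseif ch = delim andalso inQuotes = 0 then
            n += 1                        ' delimiter outside quotes ends a field
            if n <= ubound(fields) then fields(n) = cur
            cur = ""
        else
            cur += ch
        end if
    next
    n += 1                                ' the last field has no trailing delimiter
    if n <= ubound(fields) then fields(n) = cur
    return n
end function

dim as string csvFields(1 to 16)
dim as integer csvCount
csvCount = SplitCsvLine("""now"",123,""is the time"",,456.78,""for, all, good"",-37.24", csvFields())
for i as integer = 1 to csvCount
    print using "### &"; i, csvFields(i)
next
```

Like the regexp version, this sketch does not try to detect malformed data such as an unterminated quote.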

PSplit_csv example 1:

Code: Select all

'pinstr example
'PSplit_csv, "patterned split of delimited data"
' this function incorporated into pinstr.bi 2009-04-24
' The #ifndef..#endif portions 
'   may be removed from this example if using newer pinstr.bi
'
#include once "pinstr.bi"
'
dim as integer c,res
dim as string pt,ts
dim as string ta(64) 'array for delimited values, num fields!
'
#ifndef PSplit_csv
declare function _
    PSplit_csv(_
        targStr as string,_     'string to parse/delmit
        targArray() as string,_ 'array to parse targStr TO
        delimiter as string=","_'delimiter used as.. delimiter
    ) as integer 'returns number of array els used, One-based
'
#endif
ts="""now"",123,""is the time"",,456.78,""for, all, good"",-37.24"
print ts
res=PSplit_csv(ts,ta(),",")
if res>0 then
    for c=1 to res
        print using "### &";c,ta(c)
    next
    print
end if
'
ts="""now""|123|""is the time""||456.78|""for, all, good""|-37.24"
print ts
res=PSplit_csv(ts,ta(),"|")
if res>0 then
    for c=1 to res
        print using "### &";c,ta(c)
    next
    print
end if
'
sleep
end
'
#ifndef PSplit_csv
'====================================== psplit_csv()
function _
    PSplit_csv(_
        targStr as string,_
        targArray() as string,_
        delimiter as string=","_
    ) as integer
'
    if TargStr="" then return -1
    if delimiter="" or len(delimiter)>1 then return -2
    if delimiter=!"\""or delimiter="\" then return -3
    '
    dim as integer c,p,res
    dim as string pt,ts
    '
    pt= !"(^|" _
        & "\" & delimiter & _
        !")(\"([^\"]+|\"\")*\"|[^" _
        & "\" & delimiter & _
        "]*)"
    '
    with PInStrData
    '
    c=1
    p=pinstr(targStr,pt)
    while p>0
        '
        if left(.match,1)=delimiter and len(.match)>1 then
            targArray(c)=""
            c+=1
            targArray(c)=trim(.match,ANY delimiter & " ")
        else
            targArray(c)=trim(.match," ")
        end if
        '
        if .match=delimiter then
            targArray(c)=""
            c+=1
            targArray(c)=""
        end if
        '
        c+=1
        p+=len(.match)
        '
        p=pinstr(p,targStr,pt)
        '
    wend
    '
    end with
    '
    return iif(c>1,c-1,0)
'
end function
#endif

PSplit_csv example 2:

Code: Select all

'pinstr example
'PSplit_csv, "patterned split of delimited data"
' this function incorporated into pinstr.bi 2009-04-24
' The #ifndef..#endif portions 
'   may be removed from this example if using newer pinstr.bi
'
#include once "pinstr.bi"
'
dim as integer c,res
dim as string ts
dim as string ta(64) 'array for delimited values, num fields!
'
#ifndef PSplit_csv
declare function _
    PSplit_csv(_
        targStr as string,_     'string to parse/delmit
        targArray() as string,_ 'array to parse targStr TO
        delimiter as string=","_ 'delimiter used as.. delimiter
    ) as integer 'returns number of array els used, ONE-based
'
#endif
restore MyData
'
read ts
while ts<>"Zippy"
res=PSplit_csv(ts,ta())
print using "####-##-## ###.#F &";val(ta(3)),val(ta(2)),val(ta(1)),val(ta(7)),ta(16)
read ts
wend
'
print
print "Sleeping to Exit.."
sleep
end
'
#ifndef PSplit_csv
'====================================== psplit_csv()
function _
    PSplit_csv(_
        targStr as string,_
        targArray() as string,_
        delimiter as string=","_
    ) as integer
'
    if TargStr="" then return -1
    if delimiter="" or len(delimiter)>1 then return -2
    if delimiter=!"\""or delimiter="\" then return -3
    '
    dim as integer c,p,res
    dim as string pt,ts
    '
    pt= !"(^|" _
        & "\" & delimiter & _
        !")(\"([^\"]+|\"\")*\"|[^" _
        & "\" & delimiter & _
        "]*)"
    '
    with PInStrData
    '
    c=1
    p=pinstr(targStr,pt)
    while p>0

        if left(.match,1)=delimiter and len(.match)>1 then
            targArray(c)=""
            c+=1
            targArray(c)=trim(.match,ANY delimiter & " ")
        else
            targArray(c)=trim(.match," ")
        end if

        if .match=delimiter then
            targArray(c)=""
            c+=1
            targArray(c)=""
        end if

        c+=1
        p+=len(.match)

        p=pinstr(p,targStr,pt)
    wend
    '
    end with
    '
    return iif(c>1,c-1,0)
'
end function
#endif
'
'
'PARM = MON,DAY,YEAR,HR,MIN,TMZN,TMPF,DWPF,RELH,SKNT,DRCT,QFLG,PMSL,
'ALTI,P03D,WNUM,VSBY,P01I,P03I,P06I,P24I,CIG,HI6,LO6,HI24,LO24
'
MyData:
DATA "4,23,2009, 3,54,HST, 70,64,81,12,20,OK,30.03,30.02,,overcast,10,,,,,10000,,,,"
DATA "4,23,2009, 2,54,HST, 70,64,81,14,30,OK,30.04,30.03,,overcast,10,,,,,7000,,,,"
DATA "4,23,2009, 1,54,HST, 70,63,78,14,30,OK,30.06,30.04,236.68,overcast,10,,,,0.02,6000,71.1,66,,"
DATA "4,23,2009, 0,54,HST, 66,61,84,0,,OK,30.08,30.07,,overcast,10,,,,,9500,,,,"
DATA "4,22,2009, 23,54,HST, 70,62.1,76,10,40,OK,30.09,30.07,,mostly cloudy,10,,,,,9500,,,77,66"
DATA "4,22,2009, 22,54,HST, 70,62.1,76,13,30,OK,30.1,30.09,0.18,partly cloudy,10,,,,,,,,,"
DATA "4,22,2009, 21,54,HST, 70,61,73,13,20,OK,30.11,30.09,,partly cloudy,10,,,,,,,,,"
DATA "4,22,2009, 20,54,HST, 70,61,73,14,20,OK,30.1,30.08,,clear,10,,,,,,,,,"
DATA "4,22,2009, 19,54,HST, 69.1,61,75,10,40,OK,30.08,30.07,89.18,mostly cloudy,10,,,,,4500,77,69.1,,"
DATA "4,22,2009, 18,54,HST, 70,61,73,10,50,OK,30.06,30.05,,mostly cloudy,10,,,,,4500,,,,"
DATA "4,22,2009, 17,54,HST, 71.1,61,71,13,40,OK,30.04,30.03,,mostly cloudy,10,,,,,3500,,,,"
DATA "4,22,2009, 16,54,HST, 72,61,68,18,30,OK,30.03,30.01,147.89,mostly cloudy,10,,,,,3000,,,,"
DATA "4,22,2009, 15,54,HST, 73,61,66,18,30,OK,30.02,30.01,,mostly cloudy,10,,,,,3000,,,,"
DATA "4,22,2009, 14,54,HST, 73.9,60.1,62,20,30,OK,30.03,30.01,,partly cloudy,10,,,,,,,,,"
DATA "4,22,2009, 13,54,HST, 75,61,62,17,30,OK,30.05,30.04,236.71,partly cloudy,10,,,,,,75.9,68,,"
DATA "4,22,2009, 12,54,HST, 73.9,61,64,16,30,OK,30.08,30.06,,mostly cloudy,10,,,,,3900,,,,"
DATA "4,22,2009, 11,54,HST, 73.9,61,64,16,30,OK,30.09,30.08,,mostly cloudy,10,,,,,3100,,,,"
DATA "4,22,2009, 10,54,HST, 75,61,62,15,30,OK,30.09,30.08,236.36,partly cloudy,10,,,,,,,,,"
DATA "4,22,2009, 9,54,HST, 73.9,62.1,66,15,20,OK,30.11,30.1,,partly cloudy,10,,,,,,,,,"
DATA "4,22,2009, 8,54,HST, 71.1,64,79,14,30,OK,30.11,30.1,,mostly cloudy,10,,,,,5000,,,,"
DATA "4,22,2009, 7,54,HST, 68,64.4,88,10,30,OK,30.11,30.09,89.24,overcast,10,,,0.02,,3000,69.1,66,,"
DATA "4,22,2009, 7,40,HST, 68,64.4,88,9,30,OK,,30.09,,overcast,10,,,,,3000,,,,"
DATA "4,22,2009, 7,02,HST, 66.2,62.6,88,8,40,OK,,30.08,,overcast,10,,,,,2600,,,,"
DATA "4,22,2009, 6,54,HST, 66.9,64,90,10,40,OK,30.09,30.07,,overcast,10,,,,,3000,,,,"
DATA "4,22,2009, 5,54,HST, 66,64,93,8,40,OK,30.06,30.05,,overcast,10,0,,,,2200,,,,"
DATA "4,22,2009, 4,54,HST, 66,64,93,10,50,OK,30.04,30.03,147.71,lt rain,10,0.02,0.02,,,1800,,,,"
DATA "4,22,2009, 4,25,HST, 66.2,64.4,94,12,50,OK,,30.03,,lt rain,10,0.02,,,,1800,,,,"
DATA "Zippy"
ETA: mods to accommodate new pinstr.bi 2009-04-24
Last edited by Zippy on Apr 25, 2009 19:37, edited 1 time in total.
Zippy

Post by Zippy »

AGS wrote:
Zippy wrote:The speed slug is regcomp() from lib tre. I was able to reduce the overhead from 300% to 50% slower than instr by reusing the already-compiled regexp across iterations. If the pattern/regexp is the same for 5000 iterations through a file then there's no need to recompile (regcomp()) the regexp.

Then, the second bottleneck is regexec() itself (again from lib tre).

I've minimized the regcomp() hit, I'm ok with a bit slower than instr to allow regexps. regexec(), the lesser hit, can't be circumvented.
From 300% downto 50% is an enormous improvement. I'm not so sure about the performance of libtre. Have you tried PCRE?
I looked briefly at PCRE, then abandoned the idea because it requires a separate dll. I would have preferred it: this would all have been easier for me with PCRE, as I have more experience with Perl regexps.

I'm ok with PInstr being slower than instr(). I patterned PInstr on instr(), but the comparison of the 2 beyond the most trivial usage isn't.. a valid comparison.

I wanted a case-insensitive instr (PInStr), then PInput and PSplit, for my use. I got'em.