(mis)understanding Isalnum

fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

(mis)understanding Isalnum

Post by fzabkar »

Why does the Isalnum function in crt/ctype.bi not return 0 for all ubyte values greater than 0x7F? Why is it not limited to 7-bit ASCII characters?

This table lists Hex(ubytvar) and Hex(Isalnum(ubytvar)):

Code:

00 00    01 00    02 00    03 00    04 00    05 00    06 00    07 00    
08 00    09 00    0A 00    0B 00    0C 00    0D 00    0E 00    0F 00    
10 00    11 00    12 00    13 00    14 00    15 00    16 00    17 00    
18 00    19 00    1A 00    1B 00    1C 00    1D 00    1E 00    1F 00    
20 00    21 00    22 00    23 00    24 00    25 00    26 00    27 00    
28 00    29 00    2A 00    2B 00    2C 00    2D 00    2E 00    2F 00    
30 04    31 04    32 04    33 04    34 04    35 04    36 04    37 04    
38 04    39 04    3A 00    3B 00    3C 00    3D 00    3E 00    3F 00    
40 00    41 01    42 01    43 01    44 01    45 01    46 01    47 01    
48 01    49 01    4A 01    4B 01    4C 01    4D 01    4E 01    4F 01    
50 01    51 01    52 01    53 01    54 01    55 01    56 01    57 01    
58 01    59 01    5A 01    5B 00    5C 00    5D 00    5E 00    5F 00    
60 00    61 02    62 02    63 02    64 02    65 02    66 02    67 02    
68 02    69 02    6A 02    6B 02    6C 02    6D 02    6E 02    6F 02    
70 02    71 02    72 02    73 02    74 02    75 02    76 02    77 02    
78 02    79 02    7A 02    7B 00    7C 00    7D 00    7E 00    7F 00    
80 00    81 00    82 00    83 02    84 00    85 00    86 00    87 00    
88 00    89 00    8A 01    8B 00    8C 01    8D 00    8E 01    8F 00    
90 00    91 00    92 00    93 00    94 00    95 00    96 00    97 00    
98 00    99 00    9A 02    9B 00    9C 02    9D 00    9E 02    9F 01    
A0 00    A1 00    A2 00    A3 00    A4 00    A5 00    A6 00    A7 00    
A8 00    A9 00    AA 02    AB 00    AC 00    AD 00    AE 00    AF 00    
B0 00    B1 00    B2 04    B3 04    B4 00    B5 02    B6 00    B7 00    
B8 00    B9 04    BA 02    BB 00    BC 00    BD 00    BE 00    BF 00    
C0 01    C1 01    C2 01    C3 01    C4 01    C5 01    C6 01    C7 01    
C8 01    C9 01    CA 01    CB 01    CC 01    CD 01    CE 01    CF 01    
D0 01    D1 01    D2 01    D3 01    D4 01    D5 01    D6 01    D7 00    
D8 01    D9 01    DA 01    DB 01    DC 01    DD 01    DE 01    DF 02    
E0 02    E1 02    E2 02    E3 02    E4 02    E5 02    E6 02    E7 02    
E8 02    E9 02    EA 02    EB 02    EC 02    ED 02    EE 02    EF 02    
F0 02    F1 02    F2 02    F3 02    F4 02    F5 02    F6 02    F7 00    
F8 02    F9 02    FA 02    FB 02    FC 02    FD 02    FE 02    FF 02
adeyblue
Posts: 300
Joined: Nov 07, 2019 20:08

Re: (mis)understanding Isalnum

Post by adeyblue »

Because FreeBASIC sets the CRT's LC_CTYPE locale to your computer's default locale instead of the "C" locale. The "C" locale does work as you describe.
Or, as the person who made it do that wrote:
/**
* With the default "C" locale (which is just plain 7-bit ASCII),
* our mbstowcs() calls (from fb_wstr_ConvFromA()) fail to convert
* zstrings specific to the user's locale to Unicode wstrings.
*
* To fix this we must tell the CRT to use the user's locale setting,
* i.e. the locale given by LC_* or LANG environment variables.
*
* We should change the LC_CTYPE setting only, to affect the behaviour
* of the codepage <-> Unicode conversion functions, but not for
* example LC_NUMERIC, which would affect things like the decimal
* separator used by float <-> string conversion functions.
*
* Don't bother doing it under DJGPP - there we don't really support
* wstrings anyways, and the setlocale() reference increases .exe size.
*/
Because of this, and because of how complicated all of this is on Windows, if you print out the character each byte represents with the alnum values alongside, the alnum values won't necessarily correspond to the characters printed out.

For instance, for me this code:

Code:

#define _WIN32_WINNT &h0601
#include "windows.bi"
#include "crt/ctype.bi"
#include "crt/locale.bi"

dim as zstring ptr curCTypeLocale = setlocale(2, 0) '' LC_CTYPE value is wrong in the bi, it's 2 on Windows, not 0
dim as zstring * 90 locname
GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_SNAME, @locname, 90)
dim as ulong defCP, defAnsiCP
GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE Or LOCALE_RETURN_NUMBER, @defCP, 4)
GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE Or LOCALE_RETURN_NUMBER, @defAnsiCP, 4)

'' SetConsoleOutputCP(1252)
'' SetConsoleCP(1252)

for i as long = 0 to 255
    dim chstr as zstring * 2
    chstr[0] = i
    dim as ushort chtype
    GetStringTypeA(LOCALE_USER_DEFAULT, CT_CTYPE1, @chstr, 1, @chType)
    Print Using "hex = &, char = &, alnum = &, alp = &, num = &, chType = &"; Hex(i); chstr; Hex(isalnum(i)); Hex(isalpha(i)); Hex(isdigit(i)); Hex(chType)
next
If curCTypeLocale <> 0 Then Print "Current CType locale = " & *curCTypeLocale
Print "Current user default locale = " & locname
Print Using "Locale default codepage = &, default ansi code page = &"; defCP; defAnsiCP
Print Using "Actual output cp = &, input cp = &"; GetConsoleOutputCP(); GetConsoleCP()
outputs this:
...
hex = 9C, char = £, alnum = 102, alp = 102, num = 0, chType = 302
...
hex = A3, char = ú, alnum = 0, alp = 0, num = 0, chType = 210
...
Current CType locale = English_United Kingdom.1252
Current user default locale = en-GB
Locale default codepage = 850, default ansi code page = 1252
Actual output cp = 437, input cp = 437
So on code page 437 the £ sign is at 9C, which isalnum claims is a lowercase letter (alnum 102, in hex, is alpha + lower), while A3, which is a lowercase letter (ú) on that code page, is claimed to be nothing of the sort (chType 210 is 'it exists' + punctuation).
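To make the hex values easier to read, here is a rough sketch that decodes the bits isalnum returns. The bit values are the Microsoft CRT's ctype flags as I understand them (upper = 1, lower = 2, digit = 4, alpha = &h100); other C runtimes use different values. DescribeAlnumBits is a made-up name for illustration, not something in crt/ctype.bi.

```freebasic
'' Hypothetical helper: name the ctype bits in an isalnum() result.
'' Assumes the Microsoft CRT flag layout; not valid for other runtimes.
Function DescribeAlnumBits(ByVal v As Long) As String
    Dim s As String
    If v And &h100 Then s &= "alpha "
    If v And &h001 Then s &= "upper "
    If v And &h002 Then s &= "lower "
    If v And &h004 Then s &= "digit "
    If Len(s) = 0 Then s = "none"
    Return RTrim(s)
End Function

Print DescribeAlnumBits(&h102)   '' alpha lower
Print DescribeAlnumBits(&h4)     '' digit
Print DescribeAlnumBits(0)       '' none
```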

If we change the output code page to 1252, though, which is the one the CType locale is using for its values (on older Windows versions you may also need to change the console font at this point), the output becomes:
...
hex = 9C, char = œ, alnum = 102, alp = 102, num = 0, chType = 302
...
hex = A3, char = £, alnum = 0, alp = 0, num = 0, chType = 210
...
Current CType locale = English_United Kingdom.1252
Current user default locale = en-GB
Locale default codepage = 850, default ansi code page = 1252
Actual output cp = 1252, input cp = 1252
A3 is now the pound sign, classed as punctuation, and 9C is a lowercase Latin ligature. The alnum values now correspond to the characters they are printed alongside.
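If you want the plain 7-bit behaviour back, one option is to switch LC_CTYPE to the "C" locale yourself. This is only a sketch: CTYPE_CATEGORY is a name I made up, the hard-coded 2 relies on LC_CTYPE being 2 on Windows, and I'm assuming the bi's LC_CTYPE is usable on other platforms. Going by the runtime comment quoted above, doing this is likely to break conversion of non-ASCII zstrings to wstrings, so treat it as an experiment.

```freebasic
#include "crt/ctype.bi"
#include "crt/locale.bi"

'' Assumption: the Windows CRT uses 2 for LC_CTYPE, and the constant in the
'' .bi may not match, so it is hard-coded for Windows only.
#ifdef __FB_WIN32__
Const CTYPE_CATEGORY = 2
#else
Const CTYPE_CATEGORY = LC_CTYPE
#endif

'' Switch character classification to the plain "C" locale.
'' setlocale returns a null pointer if the request could not be honoured.
Dim As ZString Ptr newLocale = setlocale(CTYPE_CATEGORY, "C")
If newLocale = 0 Then Print "setlocale failed"

'' With the "C" locale, bytes above 7F should no longer count as alnum.
Print "41 -> "; Hex(isalnum(&h41))
Print "E9 -> "; Hex(isalnum(&hE9))
```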
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: (mis)understanding Isalnum

Post by fzabkar »

Thank you for the detailed explanation.

I would argue that the behaviour of this function is counterintuitive. In any case, it's inappropriate for my application, so I've had to write my own trivial 7-bit ASCII version.
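For anyone else who needs this, a minimal sketch of such an ASCII-only check could look like the following. IsAsciiAlnum is just an illustrative name, not a library function, and it ignores the locale entirely.

```freebasic
'' Hypothetical helper (not part of crt/ctype.bi): locale-independent,
'' 7-bit ASCII test for letters and digits.
Function IsAsciiAlnum(ByVal c As UByte) As Boolean
    Select Case c
    Case &h30 To &h39, &h41 To &h5A, &h61 To &h7A   '' 0-9, A-Z, a-z
        Return True
    Case Else
        Return False
    End Select
End Function

Print IsAsciiAlnum(Asc("A"))   '' letter
Print IsAsciiAlnum(Asc("7"))   '' digit
Print IsAsciiAlnum(&hE9)       '' above 7F, never alnum here
```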

Thanks again.