asc/unicode

speedfixer · Post by **speedfixer** » Oct 03, 2015 22:35

I recently ran into a type-mismatch error that sent me on a trail looking at Asc vs. Unicode (and other stuff).

I don't use Unicode and have no knowledge of its use in programs.
I know what it is and what it can do for other languages.

The question:

ASC returns integer -- 32 bit? 64 bit?

I see no form of unicode that needs 64 bit.

Should this be true? This caused a problem at least once:

http://www.freebasic.net/forum/viewtopi ... start=4185

The Docs forum isn't the place to raise this question, but maybe ASC isn't a full integer in 64 bit? (I only play with 32 bit at this time.)

David

speedfixer · Post by **speedfixer** » Oct 04, 2015 5:04

And also, on the ASC example that includes Russian, shouldn't there be ',2' to specify the second character?

DOS386 · Post by **DOS386** » Dec 15, 2015 8:16

http://www.freebasic.net/wiki/wikka.php?wakka=KeyPgAsc

speedfixer wrote:on the ASC example that includes Russian, shouldn't there be ',2' to specify the second character?

It seems so, YES.

http://www.freebasic.net/wiki/wikka.php ... yPgWstring

The wiki says: "On Win32 wstrings are encoded in UCS-2 (UTF-16 LE) and a character takes up 2 bytes"

This is incorrect / confusing (see http://en.wikipedia.org/wiki/UTF-16): UCS-2 and UTF-16 are not the same encoding.

If indeed "a character takes up 2 bytes" (fixed-width encoding), then characters with codes / codepoints > $FFFF (extra Chinese stuff, smileys, ...) will NOT work.
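To make the 2-byte limit concrete, here is a minimal sketch in Python (not FreeBASIC; used only because its codecs are easy to check) that counts how many 16-bit code units a character needs in UTF-16. The helper name `utf16_units` is made up for this illustration.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units are needed to store ch as UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

# Latin A, Cyrillic Zhe, and a smiley (U+1F600, outside the BMP)
samples = ["A", "\u0416", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} code unit(s)")
```

Anything above U+FFFF needs two code units (4 bytes), so a fixed 2-bytes-per-character assumption only holds for the BMP.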

speedfixer wrote:I see no form of unicode that needs 64 bit.

Indeed. The most generous / wasteful encoding (UTF-32) uses 32 bits, and there is absolutely no point in using even more. BTW, currently approx. 110'000 characters are defined, occupying approx. 10 % of the 1'114'112 available "codepoints" (a few of them are "reserved"). Theoretically, only 20.087462841250339408254066011 bits per codepoint would be enough.
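The bits-per-codepoint figure is just the base-2 logarithm of the size of the code space (U+0000 to U+10FFFF). A quick check in Python, as a sketch; the variable names are only for illustration:

```python
import math

CODE_POINTS = 0x110000           # U+0000 .. U+10FFFF inclusive = 1'114'112 values
bits = math.log2(CODE_POINTS)    # fractional bits needed to number them all

print(CODE_POINTS)
print(bits)
print(math.ceil(bits))           # whole bits needed in practice
```

In whole bits that rounds up to 21, which is why 32 bits is already more than enough and 64 would add nothing.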

caseih · Post by **caseih** » Dec 15, 2015 22:22

The FB wiki isn't completely wrong about Win32 wstrings. Older versions of Windows (2000 I think, and early versions of XP) did indeed define a WString as being made up of UCS-2 characters. Later on, Win32 switched to UTF-16 to allow for non-BMP characters. This of course came at the expense of fast string indexing, as you can no longer assume a character is 2 bytes (though it often is). I'm sure there are plenty of programs and programmers out there that just assume UCS-2.

I would guess that FB's unicode support in ASC() is working with UTF-16 input.

MrSwiss · Post by **MrSwiss** » Dec 16, 2015 14:58

speedfixer wrote:ASC returns integer -- 32 bit? 64 bit?

According to the Doc: ASC() returns a UInteger, which is 32-bit with 32-bit FBC or 64-bit with 64-bit FBC,
independent of String/WString ...

Post by **dkl** » Dec 16, 2015 17:39

Hi,

I've just adjusted KeyPgAsc a bit to clarify the result type and the returned value.

Currently it always returns a ULong, not a UInteger. But it doesn't really matter, since it just returns the raw char value, only widened to a ULong. Furthermore, the docs also said that it returned a Unicode value, which was partially wrong as well. With UTF-16-encoded wstrings on Windows it returns the raw 16-bit value, which is not always a Unicode codepoint (Asc() does not decode UTF-16 surrogate pairs for us...).
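To illustrate what "raw 16-bit value" means here, the following Python sketch (not FreeBASIC, and not how Asc() is implemented) lists the UTF-16 code units of a non-BMP character and then combines the surrogate pair by hand. The helper names `code_units` and `combine_surrogates` are hypothetical.

```python
import struct

def code_units(text: str) -> list:
    """Return the raw 16-bit UTF-16 code units of text (little-endian)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

def combine_surrogates(high: int, low: int) -> int:
    """Decode one UTF-16 surrogate pair into a Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

units = code_units("\U0001F600")          # a smiley outside the BMP
print([hex(u) for u in units])            # two separate 16-bit values
print(hex(combine_surrogates(*units)))    # the actual code point
```

A function that returns one code unit at a time would hand back each half of the pair separately; turning them into the code point is left to the caller.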

Asc(): function call that does bounds checking, and converts to ULong.
String indexing: direct memory access, no function call, no bounds checking, no conversion.

MrSwiss · Post by **MrSwiss** » Dec 17, 2015 14:56

@dkl,

thanks for clarifying/rewriting that.
There is something else that is puzzling to me:

This will be a 7-bit ASCII code, or even a 8-bit character value from some code-page, depending on the string data stored in str.

I've been under the impression that the 8-bit stuff relies on CP437 or, in other words, IBM-Extended?
Isn't that internal to FBC? As defined in the ASCII table in the Doc?

In case I'm wrong:
how does the mechanism work that decides how translation is handled, or which CP is used?

caseih · Post by **caseih** » Dec 17, 2015 15:01

The character displayed in the console window depends entirely on the code page or encoding the console is using. FB assigns no special meaning to characters above ASCII. They are all just bytes to FB. The Windows console very well could be using CP437, but it's by no means guaranteed.
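A small Python sketch of the "just bytes" point: the same byte value comes out as a different character depending on which code page is used to interpret it. The helper name `decode_byte` is made up, and the three code pages are only examples.

```python
import unicodedata

def decode_byte(value: int, codec: str) -> str:
    """Interpret a single byte value under the given code page."""
    return bytes([value]).decode(codec)

# The same byte, 0x80, read under three different code pages.
for codec in ("cp437", "cp866", "cp1252"):
    ch = decode_byte(0x80, codec)
    # Print code point and name rather than the glyph itself, so the output
    # does not depend on the encoding of the terminal running this script.
    print(f"{codec}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The program that stores the byte never knows which of these the user will see; that is decided by whatever displays it.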

Graphics mode, however, does implement the cp437 code page. This is a potential problem as it means that programs in graphics mode cannot accept input from the keyboard in some languages like Hebrew or Chinese, nor of course can such characters be printed to the screen.

On my Linux UTF-8 terminal emulator, for example, I just get question marks when I print out chr() over 128. If I want cp437 characters, I would have to tweak my emulator's settings. Or use the UTF-8 byte sequences that correspond to the cp437 characters.
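As a sketch of that last option, this Python snippet maps a cp437 byte to the UTF-8 bytes a UTF-8 terminal would need in order to show the same glyph. The function name `cp437_byte_to_utf8` is hypothetical. A lone byte above 127 is not valid UTF-8 on its own, which is presumably why the terminal shows question marks.

```python
def cp437_byte_to_utf8(value: int) -> bytes:
    """Return the UTF-8 encoding of the character that cp437 assigns to value."""
    return bytes([value]).decode("cp437").encode("utf-8")

# 0x41 is plain ASCII, 0x80 is C-cedilla, 0xDB is the full block character.
for value in (0x41, 0x80, 0xDB):
    print(f"{value:#04x} -> {cp437_byte_to_utf8(value).hex(' ')}")
```

Note that the result is not always two bytes: the box-drawing and block characters of cp437 take three bytes in UTF-8.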

MrSwiss · Post by **MrSwiss** » Dec 17, 2015 15:54

caseih wrote:The windows console ...

is definitely using CP437, as it's displaying what we have in the ASCII table. Just tested on Windows 8.1
64-bit, using FBC 1.04.0 WIN64 standalone.

caseih · Post by **caseih** » Dec 17, 2015 16:07

I suspect that depends on your locale. For example, if your system is Russian, the command console would doubtless be set to either UTF-8 or cp866.

Also you can change the code page from the command prompt with the chcp command:
http://ss64.com/nt/chcp.html

We kind of need to wean ourselves from the old notion that bytes have specific, unchanging meanings, I think. I never understood Albert's need to display exactly all the bytes from a binary file, for example, in CP437, as if the characters held some special meaning.

MrSwiss · Post by **MrSwiss** » Dec 18, 2015 13:57

caseih wrote:I suspect that depends on your locale.

And I suspect you're wrong on that.
Remember: on Windows versions after XP, the language setting is no longer part of the locale setting.
I'd think it is more related to the language setting and NOT the locale used.

My Locale: Switzerland
My Language: US-EN (for the OS)

BTW: my test was carried out in "Open-Console", which I think was part of an earlier FBC distro.
chcp returns: Active Code Page = 437

Post by **dkl** » Dec 18, 2015 21:38

And on top of that confusion, it looks like programs can call SetConsoleCP() and SetConsoleOutputCP() at any time to override the default/user setting. Though I don't think FB ever tried to do that internally.
