asc/unicode
-
- Posts: 606
- Joined: Nov 28, 2012 1:27
- Location: CA, USA moving to WA, USA
- Contact:
asc/unicode
I recently ran into a type-mismatch error that sent me on a trail looking at asc vs. unicode (and other stuff).
I don't use unicode - and have no knowledge of its use in programs.
I know what it is and what it can do for other languages.
The question:
ASC returns integer -- 32 bit? 64 bit?
I see no form of unicode that needs 64 bit.
Should this be true? This caused a problem at least once:
http://www.freebasic.net/forum/viewtopi ... start=4185
Docs isn't the place to raise this question - but maybe ASC isn't a full integer in 64 bit? (I only play with 32 bit at this time.)
David
Re: asc/unicode
And also, on the ASC example that includes Russian, shouldn't there be ',2' to specify the second character?
Re: asc/unicode
speedfixer wrote:on the ASC example that includes Russian, shouldn't there be ',2' to specify the second character?
http://www.freebasic.net/wiki/wikka.php?wakka=KeyPgAsc
Seems that YES.
http://www.freebasic.net/wiki/wikka.php ... yPgWstring
"On Win32 wstrings are encoded in UCS-2 (UTF-16 LE) and a character takes up 2 bytes"
This is ^^^ incorrect / confusing ( http://en.wikipedia.org/wiki/UTF-16 ) and means that characters with codes / codepoints > $FFFF (extra Chinese stuff, smileys, ...) will NOT work if indeed "a character takes up 2 bytes" (fixed-width encoding).
speedfixer wrote:I see no form of unicode that needs 64 bit.
Indeed. The most generous / wasteful encoding uses 32 bits and there is absolutely no point in using even more. BTW, currently about 110'000 characters are defined, occupying about 10 % of the 1'114'112 available "codepoints" (a few of them are "reserved"). Theoretically, only 20.087462841250339408254066011 bits per codepoint would be enough.
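The 20.087... figure is simply the base-2 logarithm of the number of code points. A quick check, using Python purely as a calculator (it has nothing to do with FB itself; the variable names are made up for illustration):

```python
import math

# Unicode code points run from U+0000 to U+10FFFF inclusive.
CODE_POINTS = 0x10FFFF + 1            # 17 planes * 65,536 = 1,114,112

# Information-theoretic minimum number of bits per code point.
bits_exact = math.log2(CODE_POINTS)

# Whole bits needed to hold the largest code point in an integer.
bits_whole = (0x10FFFF).bit_length()

print(CODE_POINTS, bits_exact, bits_whole)
```

So any integer type of 21 bits or more is enough to hold a code point; 32 bits already leaves room to spare.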
Re: asc/unicode
The FB wiki isn't completely wrong about Win32 wstrings. Older versions of Windows (2000 I think, and early versions of XP) did indeed define a WString as being made up of UCS-2 characters. Later on, Win32 switched to UTF-16, to allow for non-BMP characters. This of course came at the expense of fast string indexing, as you can no longer assume a character is 2 bytes (though it often is). I'm sure there are plenty of programs and programmers out there that just assume UCS-2.
I would guess that FB's unicode support in ASC() is working with UTF-16 input.
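To make the "no longer 2 bytes per character" point concrete, here is a small Python sketch (not FB code; `utf16_units` is a hypothetical helper name) that lists the 16-bit code units UTF-16 LE uses for a BMP character and for a non-BMP one:

```python
def utf16_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16 LE."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

bmp = "\u042F"          # Cyrillic capital YA, inside the BMP
astral = "\U0001F600"   # an emoji, outside the BMP

print([hex(u) for u in utf16_units(bmp)])     # one code unit
print([hex(u) for u in utf16_units(astral)])  # two code units (a surrogate pair)
```

Code that assumes UCS-2 would treat the second string as two separate "characters".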
Re: asc/unicode
speedfixer wrote:ASC returns integer -- 32 bit? 64 bit?
According to Doc: ASC() returns UInteger -- 32 bit with FBC-32bit -- or -- 64bit with FBC-64bit
independent of String/WString ...
Re: asc/unicode
Hi,
I've just adjusted KeyPgAsc a bit to clarify the result type and the returned value.
Currently it always returns a ULong, not UInteger. But it doesn't really matter, since it just returns the raw char value, just in a ULong. Furthermore, the docs also said that it returned a Unicode value, which was partially wrong as well. With UTF-16-encoded wstrings on Windows it returns the raw 16-bit value, which is not always a Unicode codepoint (Asc() does not decode UTF-16 surrogate pairs for us...).
Asc(): function call that does bounds checking, and converts to ULong.
String indexing: direct memory access, no function call, no bounds checking, no conversion.
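Since raw 16-bit values are what comes back, turning them into code points is left to the caller. A minimal sketch of that decoding step in Python (not FB, and not how FB does anything internally; `decode_utf16_units` is a hypothetical name, and unpaired surrogates are simply passed through, which is an assumption made here for simplicity):

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit UTF-16 code units into Unicode code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High + low surrogate: combine the two 10-bit halves.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through unchanged.
            code_points.append(unit)
            i += 1
    return code_points

print([hex(cp) for cp in decode_utf16_units([0x0041, 0xD83D, 0xDE00])])
```

The per-position values (0x41, 0xD83D, 0xDE00) correspond to what a raw 16-bit read at each index would give; only the combined result is an actual code point.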
Re: asc/unicode
@dkl,
thanks, for clarifying/rewriting that.
There is something else that is puzzling to me:
"This will be a 7-bit ASCII code, or even a 8-bit character value from some code-page, depending on the string data stored in str."
I've been under the Impression, that the 8bit Stuff relies on CP437 or, in other Words, IBM-Extended?
Isn't that internal to FBC? As defined in the ASCII-Table in Doc?
In Case I'm wrong:
how is the Mechanism working, that decides how Translation is handled, or which CP is used??
Re: asc/unicode
The character displayed in the console window depends entirely on the code-page or encoding the console is using. FB assigns no special meaning to characters above ASCII. They are all just bytes to FB. The windows console very well could be using CP437, but it's by no means guaranteed.
Graphics mode, however, does implement the cp437 code page. This is a potential problem as it means that programs in graphics mode cannot accept input from the keyboard in some languages like Hebrew or Chinese, nor of course can such characters be printed to the screen.
On my Linux UTF-8 terminal emulator, for example, I just get question marks when I print out chr() over 128. If I want cp437 characters, I would have to tweak my emulator's settings. Or use the UTF-8 multi-byte sequences that correspond with the cp437 characters.
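As a rough illustration of that last option, this Python sketch (again not FB; `cp437_byte_to_utf8` is a made-up helper) maps a single cp437 byte to the UTF-8 bytes for the same character, and shows why a lone high byte is not valid UTF-8 on its own:

```python
def cp437_byte_to_utf8(value):
    """Interpret one byte as CP437 and return the UTF-8 bytes for that character."""
    return bytes([value]).decode("cp437").encode("utf-8")

print(cp437_byte_to_utf8(0x82))   # e-acute: two UTF-8 bytes
print(cp437_byte_to_utf8(0xDB))   # full block: three UTF-8 bytes

# The same byte sent as-is to a UTF-8 decoder is an incomplete sequence,
# so it is replaced with U+FFFD (often shown as a question mark).
print(bytes([0xDB]).decode("utf-8", errors="replace"))
```

Note that the UTF-8 form can be two or three bytes depending on the character, so a fixed "pair" cannot be assumed.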
Re: asc/unicode
caseih wrote:The windows console ...
is definitely using CP437, as it's displaying what we have in ASCII-Table. Just now tested on WIN 8.1 64bit, using FBC 1.04.0 WIN64 standalone.
Re: asc/unicode
I suspect that depends on your locale. For example, if your system is Russian, the command console would doubtless be set to either UTF-8 or cp866.
Also you can change the code page from the command prompt with the chcp command:
http://ss64.com/nt/chcp.html
We kind of need to wean ourselves from the old notion that bytes have specific, unchanging meanings, I think. I never understood Albert's need to display exactly all the bytes from a binary file, for example, in CP437, as if the characters held some special meaning.
Re: asc/unicode
caseih wrote:I suspect that depends on your locale.
And I suspect you're wrong on that.
Remember: on WIN > XP the Language Setting is no longer Part of the Locale Setting.
I'd think it is more related to the Language Setting and NOT the Locale used.
- My Locale: Switzerland
My Language: US-EN (for the OS)
chcp returns: Active Code Page = 437
Re: asc/unicode
And on top of that confusion, it looks like programs can call SetConsoleCP() and SetConsoleOutputCP() at any time to override the default/user setting. Though I don't think FB ever tried to do that internally.