asc/unicode

Forum for discussion about the documentation project.
speedfixer
Posts: 606
Joined: Nov 28, 2012 1:27
Location: CA, USA moving to WA, USA

asc/unicode

Post by speedfixer »

I recently ran into a type-mismatch error that sent me on a trail looking at ASC vs. Unicode (and other stuff).

I don't use Unicode and have no knowledge of its use in programs.
I know what it is and what it can do for other languages.



The question:

ASC returns integer -- 32 bit? 64 bit?

I see no form of unicode that needs 64 bit.

Should this be true? This caused a problem at least once:

http://www.freebasic.net/forum/viewtopi ... start=4185


The docs forum may not be the place to raise this question, but maybe ASC isn't a full integer in 64 bit? (I only play with 32 bit at this time.)

David
speedfixer
Posts: 606
Joined: Nov 28, 2012 1:27
Location: CA, USA moving to WA, USA

Re: asc/unicode

Post by speedfixer »

And also, on the ASC example that includes Russian, shouldn't there be ',2' to specify the second character?
DOS386
Posts: 798
Joined: Jul 02, 2005 20:55

Re: asc/unicode

Post by DOS386 »

http://www.freebasic.net/wiki/wikka.php?wakka=KeyPgAsc
speedfixer wrote:on the ASC example that includes Russian, shouldn't there be ',2' to specify the second character?
It seems so, yes.

http://www.freebasic.net/wiki/wikka.php ... yPgWstring
The wiki says: "On Win32 wstrings are encoded in UCS-2 (UTF-16 LE) and a character takes up 2 bytes"
This is incorrect / confusing ( http://en.wikipedia.org/wiki/UTF-16 ): UCS-2 and UTF-16 are not the same encoding.

If indeed "a character takes up 2 bytes" (fixed-width encoding), then characters with code points > $FFFF (extra Chinese characters, smileys, ...) will NOT work.
speedfixer wrote:I see no form of unicode that needs 64 bit.
Indeed. The most generous / wasteful encoding uses 32 bits, and there is absolutely no point in using more. BTW, currently ca. 110,000 characters are defined, occupying ca. 10 % of the 1,114,112 available code points (a few of them are reserved). Theoretically, only 20.087462841250339408254066011 bits per code point would be enough.
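
The "20.087..." figure above is simply the base-2 logarithm of the size of the Unicode code space. A quick check of that arithmetic, sketched in Python rather than FreeBASIC purely for illustration (the variable names are made up for this example):

```python
import math

# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
CODE_SPACE = 0x10FFFF + 1            # 1,114,112 code points

# Information-theoretic minimum number of bits per code point.
bits_needed = math.log2(CODE_SPACE)  # roughly 20.0875

# A real fixed-width encoding needs a whole number of bits.
whole_bits = math.ceil(bits_needed)

print(CODE_SPACE, bits_needed, whole_bits)
```

So 21 whole bits are enough for any code point, which is why a 32-bit return value already has room to spare and 64 bits add nothing.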
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: asc/unicode

Post by caseih »

The FB wiki isn't completely wrong about Win32 wstrings. Older versions of Windows (2000, I think, and early versions of XP) did indeed define a WString as being made up of UCS-2 characters. Later on, Win32 switched to UTF-16, to allow for non-BMP characters. This of course came at the expense of fast string indexing, as you can no longer assume a character is 2 bytes (though it often is). I'm sure there are plenty of programs and programmers out there that just assume UCS-2.

I would guess that FB's Unicode support in ASC() is working with UTF-16 input.
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: asc/unicode

Post by MrSwiss »

speedfixer wrote:ASC returns integer -- 32 bit? 64 bit?
According to the docs, ASC() returns UInteger: 32 bit with FBC-32bit, or 64 bit with FBC-64bit,
independent of String/WString ...
dkl
Site Admin
Posts: 3235
Joined: Jul 28, 2005 14:45
Location: Germany

Re: asc/unicode

Post by dkl »

Hi,

I've just adjusted KeyPgAsc a bit to clarify the result type and the returned value.

Currently it always returns a ULong, not a UInteger. But it doesn't really matter, since it just returns the raw char value, just in a ULong. Furthermore, the docs also said that it returned a Unicode value, which was partially wrong as well. With UTF-16-encoded wstrings on Windows it returns the raw 16-bit value, which is not always a Unicode code point (Asc() does not decode UTF-16 surrogate pairs for us...).

Asc(): function call that does bounds checking, and converts to ULong.
String indexing: direct memory access, no function call, no bounds checking, no conversion.
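
To make the surrogate-pair point concrete, here is a small sketch in Python (not FreeBASIC) of what the raw 16-bit units look like and what a caller would have to do to get a code point back. The helper names `utf16_units` and `decode_pair` are hypothetical, invented for this example; nothing here is FB's actual implementation.

```python
def utf16_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16 LE."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def decode_pair(high, low):
    """Combine a high and a low surrogate into one Unicode code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# A BMP character (Cyrillic Zhe, U+0416) is a single unit equal to its code point.
bmp_units = utf16_units("\u0416")

# A non-BMP character (U+1F600, a smiley) becomes two surrogate units.
smiley_units = utf16_units("\U0001F600")

# Reading only the first unit yields a surrogate, not a code point;
# the pair has to be combined by hand.
first_unit = smiley_units[0]
code_point = decode_pair(smiley_units[0], smiley_units[1])

print([hex(u) for u in bmp_units], [hex(u) for u in smiley_units], hex(code_point))
```

For BMP text the raw unit and the code point coincide, which is why the difference rarely shows up in practice.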
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: asc/unicode

Post by MrSwiss »

@dkl,

Thanks for clarifying/rewriting that.
There is something else that is puzzling me. The docs now say:
"This will be a 7-bit ASCII code, or even a 8-bit character value from some code-page, depending on the string data stored in str."
I've been under the impression that the 8-bit stuff relies on CP437 or, in other words, IBM Extended ASCII.
Isn't that internal to FBC, as defined in the ASCII table in the docs?

In case I'm wrong:
how does the mechanism work that decides how translation is handled, or which code page is used?
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: asc/unicode

Post by caseih »

The character displayed in the console window depends entirely on the code-page or encoding the console is using. FB assigns no special meaning to characters above ASCII. They are all just bytes to FB. The windows console very well could be using CP437, but it's by no means guaranteed.

Graphics mode, however, does implement the cp437 code page. This is a potential problem as it means that programs in graphics mode cannot accept input from the keyboard in some languages like Hebrew or Chinese, nor of course can such characters be printed to the screen.

On my Linux UTF-8 terminal emulator, for example, I just get question marks when I print out Chr() values of 128 and above. If I want cp437 characters, I would have to tweak my emulator's settings, or use the UTF-8 byte sequences that correspond to the cp437 characters.
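
As an illustration of "they are all just bytes", the sketch below (Python, not FreeBASIC; variable names are made up) decodes the same byte under three code pages and then shows the UTF-8 bytes a UTF-8 terminal would need in order to display the cp437 glyph.

```python
# One and the same byte value, as Chr(128) would produce it.
BYTE = bytes([0x80])

# Its meaning depends entirely on the code page used to interpret it.
as_cp437 = BYTE.decode("cp437")    # Latin capital C with cedilla
as_cp866 = BYTE.decode("cp866")    # Cyrillic capital A
as_cp1252 = BYTE.decode("cp1252")  # Euro sign

# To show the cp437 glyph on a UTF-8 terminal, send its UTF-8 encoding instead.
utf8_for_c_cedilla = as_cp437.encode("utf-8")

# Not every cp437 character fits in two UTF-8 bytes: the full block (0xDB) needs three.
full_block = bytes([0xDB]).decode("cp437")
utf8_for_full_block = full_block.encode("utf-8")

print(as_cp437, as_cp866, as_cp1252, utf8_for_c_cedilla, utf8_for_full_block)
```

Sending the raw byte 0x80 on its own to a UTF-8 terminal is not a valid UTF-8 sequence, which is consistent with the question marks described above.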
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: asc/unicode

Post by MrSwiss »

caseih wrote:The windows console ...
is definitely using CP437 here, as it's displaying what we have in the ASCII table. Just now tested on Windows 8.1 64-bit, using FBC 1.04.0 WIN64 standalone.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: asc/unicode

Post by caseih »

I suspect that depends on your locale. For example, if your system is Russian, the command console would doubtless be set to either UTF-8 or cp866.

Also you can change the code page from the command prompt with the chcp command:
http://ss64.com/nt/chcp.html

We kind of need to wean ourselves from the old notion that bytes have specific, unchanging meanings, I think. I never understood Albert's need to display exactly all the bytes from a binary file, for example, in CP437, as if the characters held some special meaning.
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: asc/unicode

Post by MrSwiss »

caseih wrote:I suspect that depends on your locale.
And I suspect you're wrong on that.
Remember: on Windows versions after XP, the language setting is no longer part of the locale setting.
I'd think it is more related to the language setting and NOT the locale used.
  • My Locale: Switzerland
    My Language: US-EN (for the OS)
BTW: my test was carried out on "Open-Console", which I think was part of an earlier FBC distro.
chcp returns: Active Code Page = 437
dkl
Site Admin
Posts: 3235
Joined: Jul 28, 2005 14:45
Location: Germany

Re: asc/unicode

Post by dkl »

And on top of that confusion, it looks like programs can call SetConsoleCP() and SetConsoleOutputCP() at any time to override the default/user setting. Though I don't think FB ever tried to do that internally.