[SOLVED] Pass UTF-8 strings

TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: [SOLVED] Pass UTF-8 strings

Post by TJF »

3oheicrw wrote: Apr 04, 2022 16:37 BTW you were right. Geany is the best editor for FreeBASIC.
OK (for now). Let us talk again once you are familiar with the basics and have had time to evaluate the sugar features (i.e. custom commands).

And please forgive me that I didn't check your assumption regarding the fbc bug. First I concentrated on getting your code running.

Regarding coderJeff's statement (Apr 03, 2022 15:51): a BOM results in different readings on different platforms, and a UTF-8 BOM makes string literals in the source unreadable. That's a bug for sure.

@fxm:
Please add a note to the docs that using a BOM in source code files is currently experimental, and that reliable cross-platform code has to be encoded without a BOM, in ASCII/UTF-8 characters. (I guess that this cannot be fixed in the short term.)
fxm
Moderator
Posts: 12107
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: [SOLVED] Pass UTF-8 strings

Post by fxm »

Done:
- KeyPgWstring → fxm [added note on 'BOM' usage]
- ProPgSourceFiles → fxm [added note on 'BOM' usage]
- ProPgLiterals → fxm [added note on 'BOM' usage]
coderJeff
Site Admin
Posts: 4326
Joined: Nov 04, 2005 14:23
Location: Ontario, Canada

Re: [SOLVED] Pass UTF-8 strings

Post by coderJeff »

Identifying BOMs and parsing source is neither experimental nor bugged (*).

Imagine you have this valid source in any encoding:

Code: Select all

var s = "test 中国語"
rtlib_func( s )
What kind of encoding (bytes in memory) should rtlib_func() expect to receive and why?

(*) Unless you start digging into wstring -> ASCII localized conversion, and probably Unicode code points > FFFF on Windows.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: [SOLVED] Pass UTF-8 strings

Post by caseih »

Now that Windows has solid support for UTF-8, and in fact is moving towards that being the default, I think the only sane choice for the future is to *require* that all FB source code files be UTF-8 encoded, BOM or no BOM. There are already popular languages that do this on all platforms. Then FB need do nothing with string literals: leave them as they are, UTF-8 encoded bytes. Also, I believe that with Windows 10 you can now use UTF-8 bytes when interacting with the ANSI versions of win32 calls, provided the application has set UTF-8 as its default encoding. Plus on Mac, Linux, and any other platform, all system calls and library APIs expect UTF-8 bytes. If a source code file is detected to be UTF-16 (wide text on Windows), all input would be converted to UTF-8 before parsing. This would make string literals work the same even if someone is using a "legacy" Windows editor.
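As a rough illustration of the "detect, then convert" step described above, here is a minimal sketch of BOM detection on the first bytes of a file. The function name DetectBom and its return labels are made up for this example and are not part of fbc or the FB runtime; UTF-32 BOMs are deliberately left out to keep it short.

```freebasic
' Hypothetical helper: reports which BOM (if any) starts the byte string `head`.
' Only UTF-8 and UTF-16 are recognised; UTF-32 is not handled in this sketch.
Function DetectBom(ByRef head As Const String) As String
    If Left(head, 3) = Chr(&hEF, &hBB, &hBF) Then Return "UTF-8"
    If Left(head, 2) = Chr(&hFF, &hFE) Then Return "UTF-16LE"
    If Left(head, 2) = Chr(&hFE, &hFF) Then Return "UTF-16BE"
    Return "none"
End Function

' A source line with and without a leading UTF-8 BOM.
Print DetectBom(Chr(&hEF, &hBB, &hBF) & "var s = 1")   ' prints UTF-8
Print DetectBom("var s = 1")                           ' prints none
```

Under the proposal, a "UTF-16LE" or "UTF-16BE" result would trigger a conversion to UTF-8 before parsing, while "UTF-8" and "none" would be parsed as they are.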

Of course the FB runtime would require some additional support, such as versions of LEN, MID, RIGHT, LEFT, etc. that can decode UTF-8. We would still need versions of those functions that can count and slice raw bytes, of course, but the majority of use cases would be with UTF-8 encoded strings. With this approach there is little need for "WIDE" strings at all. The problem with UTF-8, though, is that LEN, MID, RIGHT, LEFT, etc. all become O(n) operations, because UTF-8 must be decoded to identify the code points for counting, slicing, etc. But in my opinion, this is a small price to pay for making FB more at home in a modern Unicode world.
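To make the O(n) point concrete, here is a minimal sketch of what a UTF-8 aware LEN could look like. Utf8Len is a hypothetical name, not an existing runtime function, and the sketch assumes the input is well-formed UTF-8. It counts code points (not user-perceived characters) by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```freebasic
' Hypothetical helper: counts code points in a UTF-8 encoded byte string.
' Every byte has to be inspected, hence O(n) instead of LEN's O(1).
Function Utf8Len(ByRef s As Const String) As Integer
    Dim n As Integer = 0
    For i As Integer = 0 To Len(s) - 1
        ' s[i] is the raw byte at zero-based index i;
        ' count it unless it is a continuation byte (10xxxxxx).
        If (s[i] And &hC0) <> &h80 Then n += 1
    Next
    Return n
End Function

' "test " followed by the three Hanzi from the example, built from raw
' UTF-8 bytes so the result does not depend on the source file encoding.
Dim s As String = "test " & Chr(&hE4, &hB8, &hAD, &hE5, &h9B, &hBD, &hE8, &hAA, &h9E)
Print Len(s)       ' prints 14 (bytes)
Print Utf8Len(s)   ' prints 8 (code points)
```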

Alternatively, but not ideally, a notation to indicate a Unicode literal (a WIDE string literal) could be used, not unlike the !"" notation used to denote escaped strings.

I recognize that all of these ideas are breaking changes. But I also acknowledge that the current state of string literal interpretation is less than ideal and can appear inconsistent to the programmer, who will likely be unaware of whether their editor is writing BOMs or not.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: [SOLVED] Pass UTF-8 strings

Post by TJF »

coderJeff wrote: Apr 05, 2022 23:20 Imagine you have this valid source in any encoding:

Code: Select all

var s = "test 中国語"
rtlib_func( s )
What kind of encoding (bytes in memory) should rtlib_func() expect to receive and why?
When there is no BOM, the source reads as ASCII, one byte per character. In this case the Hànzì characters in the string literal are UTF-8 encoded. The function receives the byte sequence !"test \xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E". This is currently working.

When the exact same source is prefixed with the matching BOM (!"\xEF\xBB\xBF" for UTF-8), nothing should change. But fbc currently fails, and the function receives !"test \x4E\x2D\x56\xFD\x8A\x9E" on Windows and (according to your first post here) something different on Linux. This is a bug.

In general: fbc should not internally change the encoding of string literals at all.
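The two byte sequences quoted above are related: 4E2D, 56FD and 8A9E are the Unicode code points of the three Hanzi, and encoding them as UTF-8 yields the E4 B8 AD ... sequence of the no-BOM case. A minimal sketch to check this, with helper names (Utf8FromBmp, HexDump) invented for this example, and limited to code points up to U+FFFF without any surrogate checks:

```freebasic
' Hypothetical helper: encodes one code point (U+0000..U+FFFF) as UTF-8 bytes.
Function Utf8FromBmp(ByVal cp As ULong) As String
    If cp < &h80 Then Return Chr(cp)
    If cp < &h800 Then Return Chr(&hC0 Or (cp Shr 6), &h80 Or (cp And &h3F))
    Return Chr(&hE0 Or (cp Shr 12), &h80 Or ((cp Shr 6) And &h3F), &h80 Or (cp And &h3F))
End Function

' Hypothetical helper: formats the raw bytes of s as space-separated hex pairs.
Function HexDump(ByRef s As Const String) As String
    Dim text As String
    For i As Integer = 0 To Len(s) - 1
        If i > 0 Then text &= " "
        text &= Hex(s[i], 2)
    Next
    Return text
End Function

Print HexDump(Utf8FromBmp(&h4E2D) & Utf8FromBmp(&h56FD) & Utf8FromBmp(&h8A9E))
' prints E4 B8 AD E5 9B BD E8 AA 9E
```

So with the BOM present, the string literal appears to arrive as code point values rather than as the UTF-8 bytes that are in the source file.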
fxm
Moderator
Posts: 12107
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: [SOLVED] Pass UTF-8 strings

Post by fxm »

fxm wrote: Apr 05, 2022 11:34 Done:
- KeyPgWstring → fxm [added note on 'BOM' usage]
- ProPgSourceFiles → fxm [added note on 'BOM' usage]
- ProPgLiterals → fxm [added note on 'BOM' usage]

Moderation done:
- KeyPgWstring → fxm [moderated the note on 'BOM' usage]
- ProPgSourceFiles → fxm [moderated the note on 'BOM' usage]
- ProPgLiterals → fxm [moderated the note on 'BOM' usage]

The final note text is now pending the conclusion of your exchanges.