[SOLVED] Pass UTF-8 strings

Post by **TJF** » Apr 05, 2022 11:13

3oheicrw wrote: ↑Apr 04, 2022 16:37 BTW you were right. Geany is the best editor for FreeBASIC.

OK (for now). Let's talk again once you are familiar with the basics and have had time to evaluate the sugar features (i.e. custom commands).

And please forgive me for not checking your assumption regarding the fbc bug. I concentrated first on getting your code running.

According to coderJeff's statement (Apr 03, 2022 15:51), a BOM results in different readings on different platforms, and a UTF-8 BOM makes string literals in the source unreadable. That's a bug for sure.

@fxm:
Please add a note to the docs that using a BOM in source code files is currently experimental, and that reliable cross-platform code has to be saved without a BOM, in ASCII/UTF-8 encoding. (I guess that this cannot be fixed in the short term.)

Post by **fxm** » Apr 05, 2022 11:34

Done:
- KeyPgWstring → fxm [added note on 'BOM' usage]
- ProPgSourceFiles → fxm [added note on 'BOM' usage]
- ProPgLiterals → fxm [added note on 'BOM' usage]

Post by **coderJeff** » Apr 05, 2022 23:20

Identifying BOMs and parsing source is neither experimental nor bugged (*).

Imagine you have this valid source in any encoding:

Code:

var s = "test 中国語"
rtlib_func( s )

What kind of encoding (bytes in memory) should rtlib_func() expect to receive and why?

(*) unless you start digging into the localized wstring -> ASCII conversion, and probably Unicode code points > FFFF on Windows.

Post by **caseih** » Apr 06, 2022 0:19

Now that Windows has solid support for UTF-8, and in fact is moving towards making it the default, I think the only sane choice for the future is to *require* that all FB source code files be UTF-8 encoded, BOM or no BOM. There are already popular languages that do this on all platforms. Then FB need do nothing with string literals: leave them as they are, UTF-8-encoded bytes. Also, I believe that with Windows 10 you can now use UTF-8 bytes when interacting with the ANSI versions of Win32 calls, provided the application has set UTF-8 as its default encoding. Plus, on Mac, Linux, and any other platform, all system calls and library APIs expect UTF-8 bytes. If a source code file is detected to be UTF-16 (wide text on Windows), all input would be converted to UTF-8 before parsing. This would make string literals work the same even if someone is using a "legacy" Windows editor.
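
A minimal sketch of the "normalize everything to UTF-8 before parsing" step described above. It is written in Python rather than FreeBASIC so it can be run without fbc, the helper name `source_to_utf8` is hypothetical, and it does not describe what fbc does today. It assumes that a file without a BOM is already UTF-8 (or plain ASCII).

```python
import codecs

def source_to_utf8(raw: bytes) -> bytes:
    """Return the source bytes as UTF-8 without a BOM (hypothetical helper)."""
    # UTF-8 BOM: the payload is already UTF-8, so only the BOM is dropped.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    # UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
    # begins with the UTF-16 LE BOM (FF FE).
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return raw.decode("utf-32").encode("utf-8")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order and drops the BOM.
        return raw.decode("utf-16").encode("utf-8")
    # No BOM: assumed to be UTF-8 already, passed through unchanged.
    return raw

source = 'var s = "test 中国語"\n'
plain = source.encode("utf-8")

# All three inputs should end up as the same byte sequence.
print(source_to_utf8(plain) == plain)
print(source_to_utf8(codecs.BOM_UTF8 + plain) == plain)
print(source_to_utf8(codecs.BOM_UTF16_LE + source.encode("utf-16-le")) == plain)
```

With a step like this in front of the parser, a string literal would reach the runtime as the same bytes no matter how the editor saved the file.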

Of course the FB runtime would require some additional support, such as versions of LEN, MID, RIGHT, LEFT, etc. that can decode UTF-8. We would still need versions of those functions that count and slice raw bytes, but the majority of use cases would be with UTF-8 encoded strings. With this approach there is little need for "WIDE" strings at all. The problem with UTF-8, though, is that LEN, MID, RIGHT, LEFT, etc. all become O(n) operations, because UTF-8 must be decoded to identify the code points for counting, slicing, etc. But in my opinion, this is a small price to pay for making FB more at home in a modern Unicode world.
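
To illustrate why code-point-aware versions of LEN and MID have to scan the string, here is a rough sketch in Python standing in for the runtime. The names `utf8_len` and `utf8_mid` are made up for this example and are not existing FB functions. The sketch assumes valid UTF-8 input and a start position of at least 1, and it counts code points, not user-perceived characters.

```python
def utf8_len(data: bytes) -> int:
    """Count code points: every byte that is not a continuation byte
    (bit pattern 10xxxxxx) starts a new code point."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

def utf8_mid(data: bytes, start: int, count: int) -> bytes:
    """Return `count` code points beginning at the 1-based code point `start`,
    similar in spirit to MID. Positions past the end give an empty result."""
    # Byte offsets at which a code point starts; found by scanning all bytes.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))  # sentinel: end of the last code point
    first = min(start - 1, len(starts) - 1)
    last = min(first + count, len(starts) - 1)
    return data[starts[first]:starts[last]]

data = "test 中国語".encode("utf-8")
print(len(data))                              # raw byte count
print(utf8_len(data))                         # code point count
print(utf8_mid(data, 6, 2).decode("utf-8"))   # two code points from position 6
```

Both helpers walk the whole byte string, which is the O(n) cost mentioned above; a byte-oriented LEN can simply read the stored length.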

Alternatively, but not ideally, a notation to indicate a Unicode literal (a WIDE string literal) could be used, not unlike the !"" notation used to denote escaped strings.

I recognize that all of these ideas are breaking changes. But I also acknowledge that the current state of string literal interpretation is less than ideal and can appear inconsistent to the programmer, who will likely be unaware of whether the editor they are using writes a BOM or not.

Post by **TJF** » Apr 06, 2022 4:33

coderJeff wrote: ↑Apr 05, 2022 23:20 Imagine you have this valid source in any encoding:
Code:
var s = "test 中国語"
rtlib_func( s )
What kind of encoding (bytes in memory) should rtlib_func() expect to receive and why?

When there is no BOM, the source reads as ASCII, one byte per character. In this case the Hànzì characters in the string literal are UTF-8 encoded. The function receives the byte sequence !"test \xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E". This is currently working.
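
The byte sequence quoted above can be checked independently of fbc. The following is a small Python sketch (Python is used only as a neutral tool here) that encodes the literal as UTF-8 and prints it in the same escaped notation.

```python
literal = "test 中国語"
utf8_bytes = literal.encode("utf-8")

# Keep ASCII bytes as characters and render every other byte as \xNN,
# mirroring the notation used in the post.
escaped = "".join(chr(b) if b < 0x80 else f"\\x{b:02X}" for b in utf8_bytes)

print(escaped)
print(len(literal), "code points,", len(utf8_bytes), "bytes")
```

Each of the three non-ASCII characters takes three bytes in UTF-8, so the eight characters of the literal occupy 14 bytes.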

When the exact same source is prefixed with the matching BOM (!"\xEF\xBB\xBF" for UTF-8), nothing should change. But fbc currently fails, and the function receives !"test \x4E\x2D\x56\xFD\x8A\x9E" on Windows and (regarding your first post here) something different on Linux. This is a bug.
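
As a cross-check of the byte values quoted above, this Python sketch compares them with the UTF-8 and UTF-16 forms of the three non-ASCII characters. It only shows which encoding the quoted values correspond to. It makes no claim about what fbc does internally or about the byte order actually found in memory.

```python
import codecs

tail = "中国語"

# The UTF-8 BOM and the UTF-8 form of the three characters.
bom = codecs.BOM_UTF8                      # the three bytes EF BB BF
as_utf8 = tail.encode("utf-8")             # 9 bytes, 3 per character

# The same characters as UTF-16 code units, written high byte first.
as_utf16_units = tail.encode("utf-16-be")  # 6 bytes, 2 per character

# Byte values quoted in the post for the BOM case on Windows.
reported = bytes.fromhex("4E2D56FD8A9E")

print(reported == as_utf16_units)
print(reported == as_utf8)
```

The quoted values match the UTF-16 code units U+4E2D, U+56FD and U+8A9E rather than the UTF-8 bytes, which fits the observation that the literal is no longer passed through unchanged once a BOM is present.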

In general: fbc should not internally change the encoding of string literals at all.

Post by **fxm** » Apr 06, 2022 5:10

fxm wrote: ↑Apr 05, 2022 11:34 Done:
- KeyPgWstring → fxm [added note on 'BOM' usage]
- ProPgSourceFiles → fxm [added note on 'BOM' usage]
- ProPgLiterals → fxm [added note on 'BOM' usage]

Moderation done:
- KeyPgWstring → fxm [moderated the note on 'BOM' usage]
- ProPgSourceFiles → fxm [moderated the note on 'BOM' usage]
- ProPgLiterals → fxm [moderated the note on 'BOM' usage]

The final note text is now pending the conclusion of your exchanges.
