[SOLVED] Pass UTF-8 strings

TJF · Post by **TJF** » Apr 03, 2022 7:45

Just out of curiosity, what's the output from

VAR c_str = @CHR(&h74, &h65, &h73, &h74, &h20 _
               , &he4, &hb8, &had, &he5, &h9b, &hbd, &he8, &haa, &h9e)
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)

3oheicrw · Post by **3oheicrw** » Apr 03, 2022 9:53

@TJF In the Windows Command Prompt it prints meaningless Unicode characters; after "chcp 65001" it prints nothing but squares. But if I redirect the output to test.txt (without changing the code page first), it outputs the correct string 中国語.

BTW, I decided my FreeBASIC code will be English only as I don't want to deal with this trouble.

3oheicrw · Post by **3oheicrw** » Apr 03, 2022 10:02

TJF wrote: ↑Apr 02, 2022 12:20 const char* translates to CONST ZSTRING PTR

Example
Code:
test_C_function(@"test 中国語")

VAR c_str = @"test 中国語" '' creates a ZSTRING PTR
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)

This code was wrong. It compiled, but the compiler reported this message: warning 3(2): Passing different pointer types. The displayed text is also incorrect.

TJF · Post by **TJF** » Apr 03, 2022 11:05

3oheicrw wrote: ↑Apr 03, 2022 9:53 @TJF In the Windows Command Prompt it prints meaningless Unicode characters; after "chcp 65001" it prints nothing but squares. But if I redirect the output to test.txt (without changing the code page first), it outputs the correct string 中国語.

Your reports are a horror. My last code produces three outputs:

1. The window header bar text (test_C_function)
2. The LEFT function
3. The MID function

You didn't report on each one separately. Instead you reported a mixture of the console outputs #2 and #3, and nothing about #1.

Anyhow, it seems that your editor is responsible for the issue. It doesn't encode the string literal as UTF-8, so fbc didn't get the desired input. Your test doesn't show this, since the wrong encoding is present in two places: in the source and in the output file test.txt. In your test the two errors cancel each other out.

For a reliable test execute on the command line

Code:

fbc -gen gcc -r test.bas

That doesn't compile, but creates a new source file named test.c. In that file you should find a line

Code:

	C_STR$0 = (uint8*)"test \xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E";

for the code

Code:

VAR c_str = @CHR(&h74, &h65, &h73, &h74, &h20 _
               , &he4, &hb8, &had, &he5, &h9b, &hbd, &he8, &haa, &h9e)

If you find a different line for

Code:

VAR c_str = @"test 中国語"

that indicates an editor failure.

3oheicrw wrote: ↑Apr 03, 2022 10:02 This code was wrong. It compiled, but the compiler reported this message: warning 3(2): Passing different pointer types.

My code matches your description in the first post. So either your description or your header is wrong. Why didn't you report earlier?

3oheicrw wrote: ↑Apr 03, 2022 2:40 After the command "chcp 65001" instead of squares it outputs squares with ? inside.

This indicates that the C library isn't UTF-8 only. Instead it takes the system code page setting into account, which is one more parameter complicating the issue.

3oheicrw wrote: ↑Apr 03, 2022 9:53 BTW, I decided my FreeBASIC code will be English only as I don't want to deal with this trouble.

Every source should originally be written in English (UTF-8 encoded). Later it gets adapted by native translators using the libintl tools.

BTW: You did already mention that developing on LINUX is much less trouble.

3oheicrw · Post by **3oheicrw** » Apr 03, 2022 12:40

@TJF Finally I found it. It's not the editor's fault. It's fbc's fault. fbc mistakenly recognized UTF-8 with BOM as UTF-16. If I use Geany to remove the Unicode BOM to make it UTF-8 without BOM, then everything works as expected.

TJF · Post by **TJF** » Apr 03, 2022 12:58

3oheicrw wrote: ↑Apr 03, 2022 12:40
uint16* C_STR$0;
C_STR$0 = (uint16*)L"test \x4E2D\x56FD\x8A9E";

This confirms my assumption. In case of a UTF-8 source file you'll get

Code:

C_STR$0 = (uint8*)"test \xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E";

Note: your characters are two bytes in length, the UTF-8 encoded characters are three bytes each.
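
To make the byte counts concrete, here is a minimal sketch in C of how one code point is turned into UTF-8 bytes. The function name utf8_encode is made up for this illustration and is not part of fbc or any library mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   'out' must have room for at least 4 bytes.
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: covers CJK such as U+4E2D */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Calling utf8_encode(0x4E2D, buf) writes the three bytes E4 B8 AD, which is the first character of the (uint8*) line above; the same 中 is the single 16-bit unit \x4E2D in the (uint16*) line.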

When switching to Geany a lot of your trouble will be solved.

Edit: Sorry, our posts overlapped.

It's not fbc's fault. fbc treats the source according to its BOM. Obviously your BOM is UTF-16. No BOM means UTF-8.

Post by **coderJeff** » Apr 03, 2022 13:51

TJF wrote: ↑Apr 03, 2022 12:58 fbc treats the source according to its BOM.

This is correct. Internally however fbc doesn't track UTF-8 usage.

If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian -> UTF32BE

If there is no BOM, then the file is read as ASCII, and a UTF-8 encoded string in a string literal of the source file is stored internally as-is, without conversion, as if the bytes were an ASCII string. So, in this usage, the string literal is passed around as a ZSTRING.
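
A consequence of this byte-for-byte storage is that lengths and positions are counted in bytes, not characters. The following C sketch illustrates the difference; utf8_codepoints and utf8_sample are names invented for this example, and the counter assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The thread's test string "test 中国語" written as explicit UTF-8 bytes. */
const char utf8_sample[] = "test \xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E";

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;                      /* lead byte or ASCII byte */
    }
    return n;
}
```

For utf8_sample, strlen() reports 14 bytes while utf8_codepoints() reports 8 characters. Byte-oriented functions therefore only give sensible results when the cut falls on a character boundary, as it does after the ASCII prefix "test ".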

3oheicrw · Post by **3oheicrw** » Apr 03, 2022 14:03

coderJeff wrote: ↑Apr 03, 2022 13:51
TJF wrote: ↑Apr 03, 2022 12:58 fbc treats the source according to its BOM.
This is correct. Internally however fbc doesn't track UTF-8 usage.

If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian -> UTF32BE

If there is no BOM, then the file is read as ASCII, and a UTF-8 encoded string in a string literal of the source file is stored internally as-is, without conversion, as if the bytes were an ASCII string. So, in this usage, the string literal is passed around as a ZSTRING.

This should be documented. It will save a lot of trouble.

Post by **fxm** » Apr 03, 2022 14:28

Look at (for example):
- Literals (String Literals)
- WString
- Source Files (.bas) (Unicode support)

TJF · Post by **TJF** » Apr 04, 2022 11:02

coderJeff wrote: ↑Apr 03, 2022 13:51 If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian -> UTF32BE

This is a bug in fbc!

The BOM can be either two, three or four bytes. It should be interpreted as follows (platform independent):

FE FF => UTF-16BE
FF FE => UTF-16LE
EF BB BF => UTF-8 (no conversion)
00 00 FE FF => UTF-32BE
FF FE 00 00 => UTF-32LE
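
As a rough illustration of such platform-independent detection, here is a sketch in C. It is not how fbc implements it; detect_bom and the bom_encoding enum are names made up for this example. Per the Unicode standard the UTF-32LE mark is the byte sequence FF FE 00 00, so it shares its first two bytes with UTF-16LE and has to be tested first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ENC_NONE,      /* no BOM found */
    ENC_UTF8,
    ENC_UTF16BE,
    ENC_UTF16LE,
    ENC_UTF32BE,
    ENC_UTF32LE
} bom_encoding;

/* Hypothetical helper: inspect the first bytes of a file buffer and
   report which byte order mark, if any, it starts with. */
bom_encoding detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* Four-byte marks first: FF FE 00 00 would otherwise match UTF-16LE. */
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return ENC_UTF32BE;
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return ENC_UTF32LE;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return ENC_UTF8;
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return ENC_UTF16BE;
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return ENC_UTF16LE;
    return ENC_NONE;
}
```

One known ambiguity remains: a UTF-16LE file whose first character is U+0000 also starts with FF FE 00 00. That is rare in source files, so the sketch accepts it.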

Munair · Post by **Munair** » Apr 04, 2022 11:14

TJF wrote: ↑Apr 02, 2022 12:20 const char* translates to CONST ZSTRING PTR

Example
Code:
test_C_function(@"test 中国語")

VAR c_str = @"test 中国語" '' creates a ZSTRING PTR
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)

The UTF8 library pointed to by Vortex has both zstring ptr and const zstring ptr defined as:

Code:

type PChar as zstring ptr
type PCChar as const zstring ptr

While I have not tested this, you should be able to pass them to C functions.
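
To check what actually arrives on the C side, a small debugging helper can dump the received bytes as hex. This is only a sketch; hex_dump_bytes is a name invented here and is not part of the library mentioned above or of the original test_C_function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical debugging helper: write the bytes of the NUL-terminated
   string 's' into 'out' as space-separated uppercase hex pairs.
   Stops early if 'out' is too small. Returns the number of bytes dumped. */
size_t hex_dump_bytes(const char *s, char *out, size_t out_size)
{
    size_t n = 0;    /* bytes consumed from s */
    size_t pos = 0;  /* characters written to out */
    assert(s != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    /* Each step needs at most 3 characters (" XX") plus the terminator. */
    while (s[n] != '\0' && pos + 4 <= out_size) {
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                n ? " %02X" : "%02X", (unsigned char)s[n]);
        ++n;
    }
    return n;
}
```

If a correctly encoded "test 中国語" is passed in, the dump should start with 74 65 73 74 20 and continue with E4 B8 AD for 中. A dump that shows 2D 4E instead would point to a UTF-16LE string being passed.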

3oheicrw · Post by **3oheicrw** » Apr 04, 2022 16:37

TJF wrote: ↑Apr 04, 2022 11:02
coderJeff wrote: ↑Apr 03, 2022 13:51 If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian -> UTF32BE
This is a bug in fbc!

The BOM can be either two, three or four bytes. It should be interpreted as follows (platform independent):

FE FF => UTF-16BE
FF FE => UTF-16LE
EF BB BF => UTF-8 (no conversion)
00 00 FE FF => UTF-32BE
FF FE 00 00 => UTF-32LE

I agree. This must be a bug. BTW you were right. Geany is the best editor for FreeBASIC.

Post by **coderJeff** » Apr 04, 2022 23:23

TJF wrote: ↑Apr 04, 2022 11:02 This is a bug in fbc!

3oheicrw wrote: ↑Apr 04, 2022 16:37 I agreed. This must be a bug.

fbc will parse source files encoded with any of the BOMs listed.
fbc supports only one encoding for unicode strings on any given platform, and all string literals are converted to it.
fbc doesn't have a built-in UTF-8 encoded string type or a way to indicate in source (in any encoding) that quoted string literals should be stored as UTF-8 encoded.

caseih · Post by **caseih** » Apr 05, 2022 0:04

On Linux BOM is rarely seen in text files, source files especially. It's reasonable to assume that such files are UTF-8, although guessing encodings is always a bit fragile. Some languages let you put an encoding description in a comment in the first few lines to help the compiler or interpreter know whether it's UTF-8, ASCII, or some other legacy 1-byte encoding.

I suppose the fact that UTF-8 string literals in files without BOMs remain UTF-8-encoded strings is something of a lucky coincidence, and one probably shouldn't assume that it will always be so. Perhaps it's worth changing fbc to assume that a file without a BOM is UTF-8, which will be right 99% of the time. Of course that would also mean that Unicode in string literals makes the literals WSTRING, and that you would use the conversion functions in utf_conv.bi to explicitly convert Unicode to UTF-8-encoded bytes for passing to C functions or writing to files. And of course when reading input, you would decode it from bytes to Unicode. But there's not yet a dynamic WString type.

3oheicrw · Post by **3oheicrw** » Apr 05, 2022 6:14

IMHO the current behavior of fbc is really not right. If the string's type is inferred from the source file's encoding, why is there a need to specify it as zstring or wstring at all? Why not just use var and let the compiler handle it for you?
