[SOLVED] Pass UTF-8 strings

TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: Pass UTF-8 strings

Post by TJF »

Just out of curiosity, what is the output of

Code: Select all

VAR c_str = @CHR(&h74, &h65, &h73, &h74, &h20 _
               , &he4, &hb8, &had, &he5, &h9b, &hbd, &he8, &haa, &h9e)
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)
3oheicrw
Posts: 25
Joined: Mar 21, 2022 6:41

Re: Pass UTF-8 strings

Post by 3oheicrw »

@TJF The Windows Command Prompt prints meaningless Unicode characters; after "chcp 65001" it prints nothing but squares. But if I redirect the output to test.txt (without changing the code page first), the file contains the correct string 中国語.

BTW, I decided my FreeBASIC code will be English only as I don't want to deal with this trouble.
3oheicrw
Posts: 25
Joined: Mar 21, 2022 6:41

Re: Pass UTF-8 strings

Post by 3oheicrw »

TJF wrote: Apr 02, 2022 12:20 const char* translates to CONST ZSTRING PTR

Example

Code: Select all

test_C_function(@"test 中国語")

VAR c_str = @"test 中国語" '' creates a ZSTRING PTR
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)
This code is wrong. It compiles, but the compiler reports this message: warning 3(2): Passing different pointer types. The displayed text is also incorrect.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: Pass UTF-8 strings

Post by TJF »

3oheicrw wrote: Apr 03, 2022 9:53 @TJF The Windows Command Prompt prints meaningless Unicode characters; after "chcp 65001" it prints nothing but squares. But if I redirect the output to test.txt (without changing the code page first), the file contains the correct string 中国語.
Your reports are a horror. My last code produces three outputs:
  1. The window header bar text (test_C_function)
  2. The LEFT function
  3. The MID function
You didn't report on each one separately. Instead you reported a mixture of console outputs #2 and #3, and nothing about #1.

Anyhow, it seems that your editor is responsible for the issue. It doesn't encode the string literal as UTF-8, so fbc doesn't get the intended input. Your test doesn't show this, because the wrong encoding is present in two places: in the source and in the output file test.txt. In your test the two errors cancel each other out.

For a reliable test execute on the command line

Code: Select all

fbc -gen gcc -r test.bas
That doesn't compile to an executable; it only creates a new source file named test.c. In that file you should find a line

Code: Select all

	C_STR$0 = (uint8*)"test \xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E";
for the code

Code: Select all

VAR c_str = @CHR(&h74, &h65, &h73, &h74, &h20 _
               , &he4, &hb8, &had, &he5, &h9b, &hbd, &he8, &haa, &h9e)
If you find a different line for

Code: Select all

VAR c_str = @"test 中国語"
that indicates an editor failure.

3oheicrw wrote: Apr 03, 2022 10:02 This code is wrong. It compiles, but the compiler reports this message: warning 3(2): Passing different pointer types.
My code matches your description in the first post. So either your description or your header is wrong. Why didn't you report earlier?
3oheicrw wrote: Apr 03, 2022 2:40 After the command "chcp 65001" instead of squares it outputs squares with ? inside.
This indicates that the C-lib isn't UTF-8 only. Instead it considers the setting of the system code page -> one more parameter, complicating the issue.
3oheicrw wrote: Apr 03, 2022 9:53 BTW, I decided my FreeBASIC code will be English only as I don't want to deal with this trouble.
Every source should originally be written in English (UTF-8 encoded). Later it gets adapted by native-speaker translators using the libintl tools.

BTW: You did already mention that developing on LINUX is much less trouble.
3oheicrw
Posts: 25
Joined: Mar 21, 2022 6:41

Re: Pass UTF-8 strings

Post by 3oheicrw »

@TJF Finally I found it. It's not the editor's fault. It's fbc's fault. fbc mistakenly recognized UTF-8 with BOM as UTF-16. If I use Geany to remove the Unicode BOM to make it UTF-8 without BOM, then everything works as expected.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: Pass UTF-8 strings

Post by TJF »

3oheicrw wrote: Apr 03, 2022 12:40
uint16* C_STR$0;
C_STR$0 = (uint16*)L"test \x4E2D\x56FD\x8A9E";
This confirms my assumption. In the case of a UTF-8 source file you'll get

Code: Select all

C_STR$0 = (uint8*)"test \xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E";
Note: your characters are two bytes long each (UTF-16), whereas the UTF-8 encoded characters are three bytes each.
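To make the size difference concrete, here is a minimal C sketch of how one code point becomes UTF-8 bytes. The function name utf8_encode is made up for this illustration; it is not part of fbc or of any library mentioned in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns the byte count,
   or 0 if cp is not a valid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond U+10FFFF */
}
```

Calling utf8_encode(0x4E2D, buf) for the first character 中 returns 3 and fills buf with E4 B8 AD, which is exactly the start of the escaped literal in the UTF-8 version of test.c.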

When switching to Geany a lot of your trouble will be solved.

Edit: Sorry, our posts overlapped.

It's not fbc's fault. fbc treats the source according to the BOM. Obviously your BOM is UTF-16. No BOM means UTF-8.
coderJeff
Site Admin
Posts: 4326
Joined: Nov 04, 2005 14:23
Location: Ontario, Canada
Contact:

Re: Pass UTF-8 strings

Post by coderJeff »

TJF wrote: Apr 03, 2022 12:58 fbc treats the source according to the BOM.
This is correct. Internally however fbc doesn't track UTF-8 usage.

If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian ->UTF32BE

If there is no BOM, the file is read as ASCII, and a UTF-8 encoded string literal in the source file is stored internally as-is, without conversion, as if its bytes were an ASCII string. So, in this usage, the string literal is passed around as a ZSTRING.
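A small C sketch of what the receiving side sees in the no-BOM case: the ZSTRING arrives as a plain const char* holding the raw UTF-8 bytes, so the byte length and the character count differ. The function name utf8_codepoints is hypothetical, and the code assumes its input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count the code points in a NUL-terminated
   UTF-8 string. Every byte that is not a continuation byte
   (bit pattern 10xxxxxx) starts a new code point.
   Assumption: the input is valid UTF-8; nothing is validated here. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the bytes of "test 中国語", strlen reports 14 while utf8_codepoints reports 8. A C function that expects UTF-8 gets exactly what it needs, but byte-oriented functions such as LEFT and MID can cut a character in the middle.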
3oheicrw
Posts: 25
Joined: Mar 21, 2022 6:41

Re: Pass UTF-8 strings

Post by 3oheicrw »

coderJeff wrote: Apr 03, 2022 13:51
TJF wrote: Apr 03, 2022 12:58 fbc treats the source according to the BOM.
This is correct. Internally however fbc doesn't track UTF-8 usage.

If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian ->UTF32BE

If there is no BOM, the file is read as ASCII, and a UTF-8 encoded string literal in the source file is stored internally as-is, without conversion, as if its bytes were an ASCII string. So, in this usage, the string literal is passed around as a ZSTRING.
This should be documented. It will save a lot of trouble.
fxm
Moderator
Posts: 12107
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: [SOLVED] Pass UTF-8 strings

Post by fxm »

Look at (for example):
- Literals (String Literals)
- WString
- Source Files (.bas) (Unicode support)
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Re: Pass UTF-8 strings

Post by TJF »

coderJeff wrote: Apr 03, 2022 13:51 If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian ->UTF32BE
This is a bug in fbc!

The BOM can be either two, three or four bytes. It should give (platform independent)
  • FE FF => UTF-16BE
  • FF FE => UTF-16LE
  • EF BB BF => UTF-8 (no conversion)
  • 00 00 FE FF => UTF-32BE
  • FF FE 00 00 => UTF-32LE
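A rough C sketch of BOM detection along these lines. This is only an illustration, not fbc's actual code; the enum and the function name detect_bom are made up here. The four-byte marks have to be tested first, because the UTF-32LE mark FF FE 00 00 begins with the UTF-16LE mark FF FE.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_NONE,      /* no BOM found */
    ENC_UTF8,
    ENC_UTF16BE,
    ENC_UTF16LE,
    ENC_UTF32BE,
    ENC_UTF32LE
} bom_encoding;

/* Inspect the first bytes of a file buffer and report which BOM,
   if any, it starts with. Longer marks are checked before shorter
   ones so that FF FE 00 00 is not mistaken for UTF-16LE. */
bom_encoding detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
                    buf[2] == 0xFE && buf[3] == 0xFF)
        return ENC_UTF32BE;
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
                    buf[2] == 0x00 && buf[3] == 0x00)
        return ENC_UTF32LE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;
    return ENC_NONE;
}
```

One known ambiguity remains: a UTF-16LE file whose first character after the BOM is U+0000 also starts with FF FE 00 00 and would be reported as UTF-32LE. For source files that case should not matter in practice.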
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands
Contact:

Re: Pass UTF-8 strings

Post by Munair »

TJF wrote: Apr 02, 2022 12:20 const char* translates to CONST ZSTRING PTR

Example

Code: Select all

test_C_function(@"test 中国語")

VAR c_str = @"test 中国語" '' creates a ZSTRING PTR
test_C_function(c_str)
?LEFT(*c_str, 4)
?MID(*c_str, 6)
The UTF8 library pointed to by Vortex has both zstring ptr and const zstring ptr defined as:

Code: Select all

type PChar as zstring ptr
type PCChar as const zstring ptr
While I have not tested this, you should be able to pass them to C functions.
3oheicrw
Posts: 25
Joined: Mar 21, 2022 6:41

Re: Pass UTF-8 strings

Post by 3oheicrw »

TJF wrote: Apr 04, 2022 11:02
coderJeff wrote: Apr 03, 2022 13:51 If there is a BOM (including UTF-8), the string is converted to a WSTRING depending on platform and the original UTF-8 encoding is lost.
Windows -> UTF16LE
Linux -> UTF32LE
Big Endian ->UTF32BE
This is a bug in fbc!

The BOM can be either two, three or four bytes. It should give (platform independent)
  • FE FF => UTF-16BE
  • FF FE => UTF-16LE
  • EF BB BF => UTF-8 (no conversion)
  • 00 00 FE FF => UTF-32BE
  • FF FE 00 00 => UTF-32LE
I agree. This must be a bug. BTW you were right. Geany is the best editor for FreeBASIC.
coderJeff
Site Admin
Posts: 4326
Joined: Nov 04, 2005 14:23
Location: Ontario, Canada
Contact:

Re: Pass UTF-8 strings

Post by coderJeff »

TJF wrote: Apr 04, 2022 11:02 This is a bug in fbc!
3oheicrw wrote: Apr 04, 2022 16:37 I agree. This must be a bug.
fbc will parse source files encoded with any of the BOMs listed.
fbc supports only one encoding for unicode strings on any given platform, and all string literals are converted to it.
fbc doesn't have a built-in UTF-8 encoded string type or a way to indicate in source (in any encoding) that quoted string literals should be stored as UTF-8 encoded.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: [SOLVED] Pass UTF-8 strings

Post by caseih »

On Linux BOM is rarely seen in text files, source files especially. It's reasonable to assume that such files are UTF-8, although guessing encodings is always a bit fragile. Some languages let you put an encoding description in a comment in the first few lines to help the compiler or interpreter know whether it's UTF-8, ASCII, or some other legacy 1-byte encoding.

I suppose the fact that UTF-8 string literals in files without BOMs remain UTF-8-encoded strings is somewhat of a lucky coincidence, and one probably shouldn't assume that it will always be so. Perhaps it's worth changing FBC to assume that if there's no BOM, the file must be UTF-8; that will be right 99% of the time. Of course that would also mean that Unicode in string literals makes the literals WSTRING, and that one uses the conversion functions in utf_conv.bi to explicitly convert Unicode to UTF-8-encoded bytes for passing to C functions or writing to files. And of course, when reading input, one would decode it from bytes to Unicode. But there's not yet a dynamic WString type.
3oheicrw
Posts: 25
Joined: Mar 21, 2022 6:41

Re: [SOLVED] Pass UTF-8 strings

Post by 3oheicrw »

IMHO the current behavior of fbc is really not right. If the string's type is inferred from the source file's encoding, why is there a need to specify it as zstring or wstring at all? Why not just use var and let the compiler handle it for you?