Ansi, Utf8 & Utf16 encoding problems

jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Post by jj2007 »

Following a confused thread about a "simple" GUI application (Simple tutorial to create first Windows applications), here is a little testbed for your specific IDE, version of FB, toolchain, whatever:

Code: Select all

'#define Unicode
#include "Windows.bi"
#include "crt.bi"

const szAppName = "Добро пожаловать"
printf( "[%s - printed with CRT printf]\n", szAppName)	' the \n is not expanded: FB only processes escapes in !"..." literals
Print	' just a CrLf
Print "[";szAppName;"] (this works if the file is saved with a Utf8 BOM)"
MessageBox(0, @szAppName, "A: Should be Russian, with @:", MB_OK)
MessageBox(0, szAppName, "A: Should be Russian, no @:", MB_OK)
MessageBoxW(0, @szAppName, "W: Should be Russian, with @:", MB_OK)
MessageBoxW(0, szAppName, "W: Should be Russian, no @:", MB_OK)
Save the file as Ansi, Utf8 with BOM, or Utf16 with or without BOM, and try to find a pattern or a rule that could improve the FB manual. Perhaps somebody can even find out why FB, when it sees a Utf8 BOM, passes strings to Windows APIs as UTF16 ;-)
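
As a rough illustration of what those save options put on disk, here is a minimal sketch in Python (used only because it makes raw bytes easy to inspect; it is not FB code). The code page cp1251 stands in for "Ansi" on a Russian system and cp1252 for a Western one. Both are assumptions, and the names `encoded_variants` and `fits_code_page` are made up for this example.

```python
import codecs

TEXT = "Добро пожаловать"

def encoded_variants(text):
    """Return the bytes an editor would write for each save option."""
    return {
        "ansi-cp1251": text.encode("cp1251"),  # assumption: Russian ANSI code page
        "utf8-bom": codecs.BOM_UTF8 + text.encode("utf-8"),
        "utf16le-bom": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        "utf16le-nobom": text.encode("utf-16-le"),
    }

def fits_code_page(text, code_page):
    """True if every character of text exists in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

if __name__ == "__main__":
    # Print name, byte length and the first eight bytes as hex (ASCII-only output).
    for name, raw in encoded_variants(TEXT).items():
        print(name, len(raw), raw[:8].hex())
    print("fits cp1252:", fits_code_page(TEXT, "cp1252"))
```

On a Western code page the "Ansi" variant cannot hold the Russian text at all, which may be one reason results differ between machines.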

P.S.: I have not yet found a scenario where printf("Добро пожаловать\n") would not show garbage; in contrast, it works perfectly in Masm32:

Code: Select all

include \masm32\include\masm32rt.inc
.code
start:
  cls
  printf("Добро пожаловать\n")
  printf("that was Hello World")
  exit
end start
Output:

Добро пожаловать
that was Hello World
Last edited by jj2007 on Feb 24, 2021 11:50, edited 1 time in total.
Xusinboy Bekchanov
Posts: 789
Joined: Jul 26, 2018 18:28

Post by Xusinboy Bekchanov »

jj2007 wrote: P.S.: I have not yet found a scenario where printf("Добро пожаловать\n") would not show garbage; in contrast, it works perfectly in Masm32:
Here:

Code: Select all

wprintf(@"Добро пожаловать\n")   ' the \n escape gets happily ignored
The first argument of printf is declared as a ZString Ptr:

Code: Select all

Declare Function printf (ByVal As ZString Ptr, ...) As Long
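
One possible reading of why the ZString Ptr parameter matters: a byte-oriented printf stops at the first zero byte, and UTF-16 text contains one after every ASCII character. A minimal Python sketch of that effect follows; the helper `as_zstring` is hypothetical and only mimics how C reads a zero-terminated byte string.

```python
def as_zstring(raw):
    """Return the bytes a C function would see before the first NUL byte."""
    end = raw.find(b"\x00")
    return raw if end < 0 else raw[:end]

if __name__ == "__main__":
    # '[' is 5B 00 in UTF-16 LE, so the byte string ends right after it.
    print(as_zstring("[ Добро".encode("utf-16-le")))
    # Pure Cyrillic has no zero bytes (e.g. 14 04 for the first letter),
    # so all bytes get through, but they are not valid narrow text.
    print(as_zstring("Добро".encode("utf-16-le")).hex())
```

So a UTF-16 string handed to the narrow printf would either be cut short or come out as unrelated byte values, depending on its first characters.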

Post by jj2007 »

Code: Select all

' #define Unicode
#include "Windows.bi"
#include "crt.bi"

const szAppName = "Добро пожаловать is Russian and means 'Welcome'"
'printf( "[%s - printed with CRT printf]\n", szAppName)	' the \n escape gets happily ignored
'wprintf( "[%s - printed with CRT printf]\n", @szAppName)	' the \n escape gets happily ignored

wprintf(@"[ Добро пожаловать ] is Russian and means 'Welcome'")

Print	' just a CrLf
Print "[";szAppName;"] (this works if the file is saved with a Utf8 BOM)"
MessageBox(0, @szAppName, "A: Should be Russian, with @:", MB_OK)
MessageBox(0, szAppName, "A: Should be Russian, no @:", MB_OK)
MessageBoxW(0, @szAppName, "W: Should be Russian, with @:", MB_OK)
MessageBoxW(0, szAppName, "W: Should be Russian, no @:", MB_OK)
Output:

[ ] is Russian and means 'Welcome'
[Добро пожаловать is Russian and means 'Welcome'] (this works if the file is saved with a Utf8 BOM)

The Russian part gets dropped by wprintf. Under the hood one can see that a Utf-16 string with a correct Russian part is being passed to wprintf.
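
A possible explanation, offered as an assumption rather than a verified diagnosis: with stdout in its default text mode, the CRT's wprintf converts each wide character to the current C locale, and in the default "C" locale characters above U+00FF cannot be converted. The Python sketch below only imitates such a lossy conversion (Latin-1 with unconvertible characters dropped stands in for the CRT behaviour); it does not call the CRT, and `narrow_like_c_locale` is a made-up name.

```python
def narrow_like_c_locale(text):
    """Imitate a wide-to-narrow conversion that drops unconvertible characters."""
    return text.encode("latin-1", errors="ignore")

if __name__ == "__main__":
    # Brackets, spaces and ASCII survive; the Cyrillic letters vanish.
    print(narrow_like_c_locale("[ Добро пожаловать ] is Russian"))
```

Under that assumption the brackets, the spaces and the English text survive while the Cyrillic letters disappear, which would match the output above.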
Last edited by jj2007 on Feb 24, 2021 12:01, edited 1 time in total.

Post by Xusinboy Bekchanov »

jj2007 wrote:I see only the \n
I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)

Post by jj2007 »

Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)
My Italian OS displays Russian perfectly when told to do so from Assembly: printf("Добро пожаловать\n")
With Fbc32 and with or without #define UNICODE, this happens under the hood:

Code: Select all

asm int 3
wprintf(@"[ Добро пожаловать ] is Russian and means 'Welcome'")
asm nop

Code: Select all

 int3
 mov dword ptr [local.11], offset 0040702 ; /format => "[ Добро пожаловать ] is Russian and means 'Welcome'" (UTF-16!!!)
 call <jmp.&msvcrt.wprintf>               ; \MSVCRT.wprintf
 nop
VANYA
Posts: 1834
Joined: Oct 24, 2010 15:16
Location: Ярославль

Post by VANYA »

I do not know how the function is declared in other languages, but in FB the printf declaration is:

Code: Select all

declare function printf (byval as Zstring ptr, ...) as long
If it were declared like this, it would work as it should:

Code: Select all

#define UNICODE
extern "C"
   #ifdef UNICODE 
      declare function printf alias "wprintf"(byval as wstring ptr, ...) as long    
   #else  
      declare function printf (byval as Zstring ptr, ...) as long
   #endif
End Extern

dim as wstring ptr sz = @"Добро пожаловать"

printf(sz)

sleep

Post by VANYA »

Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)
So it seems Windows has no proper Unicode console (at least in the Russian version). It displays only characters of the current locale.

Post by jj2007 »

VANYA wrote:
Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)
So it seems Windows has no proper Unicode console (at least in the Russian version). It displays only characters of the current locale.
The Windows console displays Russian just fine, even though the locale is Italian (or rather: CP 850). With Assembly and/or MasmBasic, that works just fine by setting the codepage to 65001 alias Utf8. With FB, wprintf() prints the non-Russian part but leaves a few spaces where Russian should appear.

Post by Xusinboy Bekchanov »

VANYA wrote:
Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)
So it seems Windows has no proper Unicode console (at least in the Russian version). It displays only characters of the current locale.
The console shows Unicode strings:

Code: Select all

Const szAppName = "ЎҚҒҲ"
Print szAppName
Sleep
Only wprintf does not show.

Post by jj2007 »

Xusinboy Bekchanov wrote:Only wprintf does not show.
Yessss! But it does show the non-exotic part, as in
[Добро пожаловать] is Russian
which gets printed as
[ ] is Russian. And that is weird, because it works fine from Assembly.

Post by VANYA »

Xusinboy Bekchanov wrote:The console shows Unicode strings:

Code: Select all

Const szAppName = "ЎҚҒҲ"
Print szAppName
Sleep
Only wprintf does not show.
Yes, it displays them, but with limits (not all symbols), for example:

Code: Select all

#define UNICODE
#include "windows.bi"
dim as wstring*1000 szAppName = "Testing unicode -- English -- Русский -- Ελληνικά -- Español " & wchr(&h4e27)
Print szAppName
messagebox(0, szAppName ,  "" ,0)
Xusinboy Bekchanov wrote:Only wprintf does not show.
Right, although in the debugger the string passed as the parameter is correct:
0012F6F4 Testing unicode -- English -- Ру
0012F734 сский -- Ελληνικά -- Español 丧.^
0012F774 _`
What really matters is that PRINT displays it correctly :)
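
The addresses in that dump are consistent with a UTF-16 string: the first row holds 32 characters and the second row starts 0x40 bytes later. A small Python check of the arithmetic follows; the addresses and text are copied from the dump, while the helper name is made up.

```python
FIRST_ROW = 0x0012F6F4
SECOND_ROW = 0x0012F734
ROW_TEXT = "Testing unicode -- English -- Ру"  # first row of the dump, 32 characters

def bytes_per_char(start, end, text):
    """Bytes consumed per character between two dump addresses."""
    return (end - start) / len(text)

if __name__ == "__main__":
    print(bytes_per_char(FIRST_ROW, SECOND_ROW, ROW_TEXT))
    print(len(ROW_TEXT.encode("utf-16-le")), SECOND_ROW - FIRST_ROW)
```

Two bytes per character, including the Cyrillic ones, is what UTF-16 would give; in UTF-8 the two Cyrillic letters alone would make the row longer than the ASCII part suggests.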
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Post by caseih »

The C standard library printf knows nothing about unicode byte encodings at all. It just emits bytes. That is all. The Windows console may or may not interpret UTF-8 bytes correctly. I believe in recent versions of Windows 10 it now properly decodes UTF-8. But it never did before. You could specify a single-byte code page. And it might understand UTF-16. It's a mess that's only recently being addressed by Microsoft, thanks to their renewed interest in the Windows Console.

Also, under some circumstances, printf might not emit all the bytes at once. It may emit bytes one at a time, which can confuse the console's unicode decoding. This could explain some of the difference between your assembly use of printf and calling printf from FB code.
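
To show why byte-at-a-time output can matter, here is a hedged Python sketch. It does not model the Windows console itself; it only contrasts a consumer that decodes every chunk in isolation with one that keeps state across chunks. Both function names are invented for the example.

```python
import codecs

def decode_each_chunk_alone(chunks):
    """Decode every chunk in isolation, as a stateless consumer would."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)

def decode_with_state(chunks):
    """Decode chunks with a decoder that remembers incomplete sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = "".join(decoder.decode(c) for c in chunks)
    return text + decoder.decode(b"", final=True)

if __name__ == "__main__":
    raw = "Да".encode("utf-8")  # two characters, four bytes
    single_bytes = [raw[i:i + 1] for i in range(len(raw))]
    print(ascii(decode_each_chunk_alone(single_bytes)))
    print(ascii(decode_with_state(single_bytes)))
```

A stateless consumer turns every lone byte of a two-byte sequence into a replacement character, while the stateful one recovers the text.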

In short, this has nothing to do with FB. See https://stackoverflow.com/questions/108 ... ws-console

Post by jj2007 »

caseih wrote:The C standard library printf knows nothing about unicode byte encodings at all. It just emits bytes. That is all. The Windows console may or may not interpret UTF-8 bytes correctly. I believe in recent versions of Windows 10 it now properly decodes UTF-8. But it never did before.
Oh dear, now tell me how it is possible that in the last 12 years or so I've been successfully printing Utf8 to the console?

Code: Select all

include \masm32\MasmBasic\MasmBasic.inc
  Init
  PrintLine "Russian:", Tb$, "Введите текст здесь"
  printf(cfm$("Russian (CRT):\tВведите текст здесь\n"))
  PrintLine "Greek:  ", Tb$, "Το ελληνικό κρασί έχει καλή γεύση!"
  printf(cfm$( "Greek (CRT):  \tΤο ελληνικό κρασί έχει καλή γεύση!\n"))
EndOfCode
Output (to a standard console under Windows XP, installed in April 2008):

Code: Select all

Russian:        Введите текст здесь
Russian (CRT):  Введите текст здесь
Greek:          Το ελληνικό κρασί έχει καλή γεύση
Greek (CRT):    Το ελληνικό κρασί έχει καλή γεύση
As you can see, both Print (alias WriteFile) and printf() work fine with two alphabets that are not native to this Italian machine.

Post by caseih »

I can only surmise that you weren't actually using UTF-8 but UTF-16. It's likely that your source file was UTF-16, so the string literals were already encoded. I see nothing about UTF-8 in your little MASM code example. Instead of using string literals, why not try actual UTF-8-encoded bytes and see what happens?

0xd0, 0x92, 0xd0, 0xb2, 0xd0, 0xb5, 0xd0, 0xb4, 0xd0, 0xb8,
0xd1, 0x82, 0xd0, 0xb5, 0x20, 0xd1, 0x82, 0xd0, 0xb5, 0xd0,
0xba, 0xd1, 0x81, 0xd1, 0x82, 0x20, 0xd0, 0xb7, 0xd0, 0xb4,
0xd0, 0xb5, 0xd1, 0x81, 0xd1, 0x8c
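
For anyone wanting to reproduce such a byte list, here is a small Python sketch. It assumes the intended string is "Введите текст здесь" from the MasmBasic example, and the helper name `utf8_byte_list` is made up. Bytes are listed in stream order, so each Cyrillic letter starts with its lead byte 0xd0 or 0xd1.

```python
def utf8_byte_list(text, per_line=10):
    """Format the UTF-8 bytes of text as comma-separated hex, in stream order."""
    hexed = ["0x%02x" % b for b in text.encode("utf-8")]
    lines = [", ".join(hexed[i:i + per_line])
             for i in range(0, len(hexed), per_line)]
    return ",\n".join(lines)

if __name__ == "__main__":
    print(utf8_byte_list("Введите текст здесь"))
```

A 16-bit hex dump on a little-endian machine shows each pair swapped, so it is worth checking the order before pasting such a list into a db or Data statement.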

Windows XP certainly has no UTF-8 support at all. I know that for a fact. UTF-16, yes.

As recently as 2018, a development blog at Microsoft said the following:
"Alas, the Windows Console is not (currently) able to support UTF-8 text!

Windows Console was created way back in the early days of Windows, back before Unicode itself existed! Back then, a decision was made to represent each text character as a fixed-length 16-bit value (UCS-2)."
Shrug. I believe this post was accurate. Actual UTF-8 support in the console arrived in Windows 10 1809. And now Windows has a new Terminal app that fully supports UTF-8, released last year.
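
To make the "fixed-length 16-bit value" point concrete, the Python sketch below counts the 16-bit code units UTF-16 needs per character (the function name is made up). Characters that need two units cannot be stored in a single UCS-2 cell.

```python
def utf16_code_units(text):
    """Number of 16-bit code units UTF-16 needs for the given text."""
    return len(text.encode("utf-16-le")) // 2

if __name__ == "__main__":
    print(utf16_code_units("\u0414"))      # Cyrillic capital De
    print(utf16_code_units("\u4e27"))      # the CJK character used with wchr(&h4e27)
    print(utf16_code_units("\U0001F600"))  # an emoji outside the BMP
```

Russian, Greek and the CJK character from this thread all fit in one unit each, so a UCS-2 console buffer by itself would not rule them out.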

My previous statement is accurate. Printf() emits bytes. Period. It has no knowledge of any unicode encoding and will simply pass the bytes on through. If the target understands the encoding then you get the expected characters. If not, you get mojibake.
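
A minimal Python sketch of the "target understands the encoding or not" point: the same bytes are decoded once as UTF-8 and once as CP 850, the Italian OEM code page mentioned earlier in the thread. The function name is hypothetical, and decoding here only stands in for what a console does when it picks glyphs.

```python
def shown_by_console(raw, console_code_page):
    """Decode raw output bytes the way a console set to that code page would."""
    return raw.decode(console_code_page, errors="replace")

if __name__ == "__main__":
    raw = "Добро".encode("utf-8")
    print(ascii(shown_by_console(raw, "utf-8")))  # the original five letters
    print(ascii(shown_by_console(raw, "cp850")))  # ten unrelated characters
```

The bytes never change; only the code page the consumer assumes does.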

Post by jj2007 »

caseih wrote:I can only surmise that you weren't actually using UTF-8 but UTF-16
Surmise, but remember that I have the habit of looking at what my code does at assembly level:

Code: Select all

include \masm32\include\masm32rt.inc
.code
start:
  int 3
  printf("Добро пожаловать\n")
  nop
  printf("that was Hello World")
  exit
end start

Code: Select all

 int3
 push offset 00402000                     ; /format = "Добро пожаловать <<<<<<<< that is commonly called "UTF-8"
 call near [<&msvcrt.printf>]             ; \MSVCRT.printf
 add esp, 4
caseih wrote:Windows XP certainly has no UTF-8 support at all. I know that for a fact. UTF-16, yes.
I love facts! And as a matter of fact, looking under the hood of the little proggie that spits out Russian text correctly, in WinXP installed in April 2008, the bytes that are being sent to WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...) appear to be, to my untrained eye, UTF-8, not UTF-16. Which is what I would expect, since I wrote the code myself, and since one line earlier I set the codepage to 65001 alias UTF-8:

Code: Select all

mov eax, esp                             ; PTR to UTF-8 "Russian:	Введите текст здесь
push 0                                   ; | Overlapped = NULL
push eax                                 ; | pBytesWritten
push ecx                                 ; | Arg1
call MbStrLen                            ; | get #bytes (yes, BYTES)
push eax                                 ; | Size
push ecx                                 ; | Buffer
mov edx, offset MbFlags                  ; |
movzx ecx, byte ptr [edx-14]             ; |
mov eax, [ecx*4+edx-10]                  ; |
mov [edx-18], eax                        ; |
push eax                                 ; | 65001 alias CP_UTF8
push eax                                 ; | 65001 alias CP_UTF8
call SetConsoleCP                        ; | kernel32.SetConsoleCP
call SetConsoleOutputCP                  ; | kernel32.SetConsoleOutputCP
push -0B                                 ; | StdHandle = STD_OUTPUT_HANDLE
call GetStdHandle                        ; | KERNEL32.GetStdHandle
push eax                                 ; | hFile
call WriteFile                           ; | KERNEL32.WriteFile
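
The "get #bytes (yes, BYTES)" remark in the listing is worth a small illustration: WriteFile takes a byte count, and for UTF-8 that differs from the character count. A Python sketch with a made-up helper name, using the Russian part of the string only:

```python
def chars_and_bytes(text, encoding="utf-8"):
    """Return (character count, encoded byte count) for text."""
    return len(text), len(text.encode(encoding))

if __name__ == "__main__":
    print(chars_and_bytes("Введите текст здесь"))
    print(chars_and_bytes("Введите текст здесь", "utf-16-le"))
```

Passing the character count as the size would cut the UTF-8 output roughly in half.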