Ansi, Utf8 & Utf16 encoding problems

jj2007 · Post by **jj2007** » Feb 24, 2021 11:16

Following a confused thread about a "simple" GUI application (Simple tutorial to create first Windows applications), here is a little testbed for your specific IDE, version of FB, toolchain, whatever:

Code: Select all

'#define Unicode
#include "Windows.bi"
#include "crt.bi"

const szAppName = "Добро пожаловать"
printf( "[%s - printed with CRT printf]\n", szAppName)	' the \n escape gets happily ignored
Print	' just a CrLf
Print "[";szAppName;"] (this works if the file is saved with a Utf8 BOM)"
MessageBox(0, @szAppName, "A: Should be Russian, with @:", MB_OK)
MessageBox(0, szAppName, "A: Should be Russian, no @:", MB_OK)
MessageBoxW(0, @szAppName, "W: Should be Russian, with @:", MB_OK)
MessageBoxW(0, szAppName, "W: Should be Russian, no @:", MB_OK)

Save as Ansi, Utf8 with BOM, Utf16 with or without BOM, and try to find a pattern or a rule that could enhance the FB manual. Perhaps somebody can even find out why FB, when it sees a Utf8 BOM, passes strings as UTF16 to Windows APIs ;-)
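
To keep the "save as" variants apart while testing, it may help to check which BOM a saved file actually starts with. The following is a minimal sketch in Python (used only to inspect bytes, not as a replacement for FB); the name `sniff_bom` and the returned labels are invented for this example. That fbc switches to wide string literals when it sees a Unicode BOM is an assumption here, to be checked against the manual.

```python
import codecs

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> str:
    """Return a label for the BOM at the start of data, or 'no BOM'."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return "no BOM"

if __name__ == "__main__":
    import sys
    # Usage: python sniff_bom.py test_ansi.bas test_utf8.bas ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, "->", sniff_bom(f.read(4)))
```

A file saved as Ansi, or as Utf16 without BOM, reports "no BOM", so those two cases cannot be told apart this way.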

P.S.: I have not yet found a scenario where printf("Добро пожаловать\n") would not show garbage; in contrast, it works perfectly in Masm32:

Code: Select all

include \masm32\include\masm32rt.inc
.code
start:
  cls
  printf("Добро пожаловать\n")
  printf("that was Hello World")
  exit
end start

Добро пожаловать
that was Hello World

Xusinboy Bekchanov · Post by **Xusinboy Bekchanov** » Feb 24, 2021 11:41

jj2007 wrote: P.S.: I have not yet found a scenario where printf("Добро пожаловать\n") would not show garbage; in contrast, it works perfectly in Masm32:
Code: Select all
include \masm32\include\masm32rt.inc
.code
start:
  cls
  printf("Добро пожаловать\n")
  printf("that was Hello World")
  exit
end start
Добро пожаловать
that was Hello World

Here:

Code: Select all

wprintf(@"Добро пожаловать\n")   ' the \n escape gets happily ignored

The first argument of printf is declared as a ZString pointer:

Code: Select all

Declare Function printf (ByVal As ZString Ptr, ...) As Long

jj2007 · Post by **jj2007** » Feb 24, 2021 11:55

Code: Select all

' #define Unicode
#include "Windows.bi"
#include "crt.bi"

const szAppName = "Добро пожаловать is Russian and means 'Welcome'"
'printf( "[%s - printed with CRT printf]\n", szAppName)	' the \n escape gets happily ignored
'wprintf( "[%s - printed with CRT printf]\n", @szAppName)	' the \n escape gets happily ignored

wprintf(@"[ Добро пожаловать ] is Russian and means 'Welcome'")

Print	' just a CrLf
Print "[";szAppName;"] (this works if the file is saved with a Utf8 BOM)"
MessageBox(0, @szAppName, "A: Should be Russian, with @:", MB_OK)
MessageBox(0, szAppName, "A: Should be Russian, no @:", MB_OK)
MessageBoxW(0, @szAppName, "W: Should be Russian, with @:", MB_OK)
MessageBoxW(0, szAppName, "W: Should be Russian, no @:", MB_OK)

[ ] is Russian and means 'Welcome'
[Добро пожаловать is Russian and means 'Welcome'] (this works if the file is saved with a Utf8 BOM)

The Russian part gets ignored. Under the hood one can see that a UTF-16 string, with a correct Russian part, is being passed to wprintf.
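
One plausible explanation, not confirmed anywhere in this thread: when writing to a text-mode stream, the Microsoft CRT's wprintf converts each wide character through the current C locale, and in the default "C" locale characters outside the single-byte range cannot be converted. The Python sketch below only models that idea (it does not call the CRT); `model_c_locale_wprintf` is a made-up name and the cutoff at code point 255 is an assumption.

```python
def model_c_locale_wprintf(text: str) -> str:
    """Hypothetical model: keep only characters with code points below 256."""
    return "".join(ch for ch in text if ord(ch) < 256)

sample = "[ Добро пожаловать ] is Russian and means 'Welcome'"
print(model_c_locale_wprintf(sample))
```

In this model the Cyrillic letters vanish while the spaces around them survive, which would match the output above.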

Xusinboy Bekchanov · Post by **Xusinboy Bekchanov** » Feb 24, 2021 12:01

jj2007 wrote:I see only the \n

I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)

jj2007 · Post by **jj2007** » Feb 24, 2021 12:03

Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)

My Italian OS displays Russian perfectly when told to do so from Assembly: printf("Добро пожаловать\n")
With Fbc32 and with or without #define UNICODE, this happens under the hood:

Code: Select all

asm int 3
wprintf(@"[ Добро пожаловать ] is Russian and means 'Welcome'")
asm nop

Code: Select all

 int3
 mov dword ptr [local.11], offset 0040702 ; /format => "[ Добро пожаловать ] is Russian and means 'Welcome'" (UTF-16!!!)
 call <jmp.&msvcrt.wprintf>               ; \MSVCRT.wprintf
 nop

VANYA · Post by **VANYA** » Feb 24, 2021 12:54

I do not know how the function is declared in other languages, but in FB the printf declaration is:

Code: Select all

declare function printf (byval as Zstring ptr, ...) as long

If it were declared like this, it would work as it should:

Code: Select all

#define UNICODE
extern "C"
   #ifdef UNICODE 
      declare function printf alias "wprintf"(byval as wstring ptr, ...) as long    
   #else  
      declare function printf (byval as Zstring ptr, ...) as long
   #endif
End Extern

dim as wstring ptr sz = @"Добро пожаловать"

printf(sz)

sleep

VANYA · Post by **VANYA** » Feb 24, 2021 13:18

Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)

So it seems Windows has no proper Unicode console (at least in the Russian version). It displays only characters of the current locale.

jj2007 · Post by **jj2007** » Feb 24, 2021 13:53

VANYA wrote:
Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)
So it seems Windows has no proper Unicode console (at least in the Russian version). It displays only characters of the current locale.

The Windows console displays Russian just fine, even though the locale is Italian (or rather: CP 850). With Assembly and/or MasmBasic, that works just fine by setting the codepage to 65001 alias Utf8. With FB, wprintf() prints the non-Russian part but leaves a few spaces where Russian should appear.

Xusinboy Bekchanov · Post by **Xusinboy Bekchanov** » Feb 24, 2021 14:06

VANYA wrote:
Xusinboy Bekchanov wrote:I have a Russian operating system. Mine, too, does not show Unicode strings (except Russian)
So it seems Windows has no proper Unicode console (at least in the Russian version). It displays only characters of the current locale.

The console shows Unicode strings:

Code: Select all

Const szAppName = "ЎҚҒҲ"
Print szAppName
Sleep

Only wprintf does not show.

jj2007 · Post by **jj2007** » Feb 24, 2021 14:12

Xusinboy Bekchanov wrote:Only wprintf does not show.

Yessss! But it does show the non-exotic part, as in
[Добро пожаловать] is Russian
which gets printed as
[ ] is Russian. And that is weird, because it works fine from Assembly.

VANYA · Post by **VANYA** » Feb 24, 2021 15:16

Xusinboy Bekchanov wrote:The console shows Unicode strings:
Code: Select all
Const szAppName = "ЎҚҒҲ"
Print szAppName
Sleep
Only wprintf does not show.

Yes, it displays them, though only to a limited extent (not all symbols), for example:

Code: Select all

#define UNICODE
#include "windows.bi"
dim as wstring*1000 szAppName = "Testing unicode -- English -- Русский -- Ελληνικά -- Español " & wchr(&h4e27)
Print szAppName
messagebox(0, szAppName ,  "" ,0)

Only wprintf does not show.

Right. Yet the string passed as the parameter is correct, as seen in the debugger:

0012F6F4 Testing unicode -- English -- Ру
0012F734 сский -- Ελληνικά -- Español 丧.^
0012F774 _`

In fact, it is important that PRINT displays correctly :)

caseih · Post by **caseih** » Feb 24, 2021 15:36

The C standard library printf knows nothing about unicode byte encodings at all. It just emits bytes. That is all. The Windows console may or may not interpret UTF-8 bytes correctly. I believe in recent versions of Windows 10 it now properly decodes UTF-8. But it never did before. You could specify a single-byte code page. And it might understand UTF-16. It's a mess that's only recently being addressed by Microsoft, thanks to their renewed interest in the Windows Console.

Also, under some circumstances, printf might not emit all the bytes at once. It may emit bytes one at a time, which can confuse the console's unicode decoding. This could explain some of the difference between your assembly use of printf and calling printf from FB code.

In short, this has nothing to do with FB. See https://stackoverflow.com/questions/108 ... ws-console
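
To make "it just emits bytes" concrete for the test string in this thread, the sketch below lists the byte sequences that a UTF-8 and a UTF-16 (little-endian) encoding of the same word would hand to an output function. It is Python rather than FB, used purely to inspect the bytes; `hex_bytes` is a helper invented here.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

text = "Добро"

utf8 = text.encode("utf-8")        # two bytes per Cyrillic letter
utf16 = text.encode("utf-16-le")   # one 16-bit unit per letter, low byte first

print(len(utf8), hex_bytes(utf8))
print(len(utf16), hex_bytes(utf16))
```

Both encodings happen to need ten bytes for this word, but the sequences differ, so whatever receives them has to know which interpretation to apply.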

jj2007 · Post by **jj2007** » Feb 24, 2021 17:07

caseih wrote:The C standard library printf knows nothing about unicode byte encodings at all. It just emits bytes. That is all. The Windows console may or may not interpret UTF-8 bytes correctly. I believe in recent versions of Windows 10 it now properly decodes UTF-8. But it never did before.

Oh dear, now tell me how it is possible that in the last 12 years or so I've been successfully printing Utf8 to the console?

Code: Select all

include \masm32\MasmBasic\MasmBasic.inc
  Init
  PrintLine "Russian:", Tb$, "Введите текст здесь"
  printf(cfm$("Russian (CRT):\tВведите текст здесь\n"))
  PrintLine "Greek:  ", Tb$, "Το ελληνικό κρασί έχει καλή γεύση!"
  printf(cfm$( "Greek (CRT):  \tΤο ελληνικό κρασί έχει καλή γεύση!\n"))
EndOfCode

Output (to a standard console under Windows XP, installed in April 2008):

Code: Select all

Russian:        Введите текст здесь
Russian (CRT):  Введите текст здесь
Greek:          Το ελληνικό κρασί έχει καλή γεύση
Greek (CRT):    Το ελληνικό κρασί έχει καλή γεύση

As you can see, both Print (alias WriteFile) and printf() work fine with two alphabets that are not native to this Italian machine.

caseih · Post by **caseih** » Feb 25, 2021 3:12

I can only surmise that you weren't actually using UTF-8 but UTF-16. It's likely that your source file was UTF-16, so the string literals were already encoded. I see nothing about UTF-8 in your little MASM code example. Instead of using string literals, why not try actual UTF-8-encoded bytes and see what happens?

0xd0, 0x92, 0xd0, 0xb2, 0xd0, 0xb5, 0xd0, 0xb4, 0xd0, 0xb8,
0xd1, 0x82, 0xd0, 0xb5, 0x20, 0xd1, 0x82, 0xd0, 0xb5, 0xd0,
0xba, 0xd1, 0x81, 0xd1, 0x82, 0x20, 0xd0, 0xb7, 0xd0, 0xb4,
0xd0, 0xb5, 0xd1, 0x81, 0xd1, 0x8c

Windows XP certainly has no UTF-8 support at all. I know that for a fact. UTF-16, yes.

As recently as 2018, a development blog at Microsoft said the following:

Alas, the Windows Console is not (currently) able to support UTF-8 text!

Windows Console was created way back in the early days of Windows, back before Unicode itself existed! Back then, a decision was made to represent each text character as a fixed-length 16-bit value (UCS-2).

Shrug. I believe this post was accurate. Actual UTF-8 support in the console arrived in Windows 10 1809. And now Windows has a new Terminal app that fully supports UTF-8, released last year.

My previous statement is accurate. Printf() emits bytes. Period. It has no knowledge of any unicode encoding and will simply pass the bytes on through. If the target understands the encoding then you get the expected characters. If not, you get mojibake.
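
As an illustration of that last point, the same UTF-8 bytes can be decoded once under the right assumption and once under a wrong one. The sketch uses Python, with code page 850 (the OEM code page mentioned earlier in the thread) standing in for a target that does not expect UTF-8; nothing here touches a real console.

```python
text = "Добро пожаловать"
raw = text.encode("utf-8")       # the bytes a byte-oriented printf would pass through

as_utf8 = raw.decode("utf-8")    # target understands UTF-8
as_cp850 = raw.decode("cp850")   # target assumes a single-byte OEM code page

print(as_utf8)
print(as_cp850)                  # mojibake: one glyph per byte instead of per letter
```

The bytes are identical in both cases; only the interpretation on the receiving side changes.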

jj2007 · Post by **jj2007** » Feb 25, 2021 10:38

caseih wrote:I can only surmise that you weren't actually using UTF-8 but UTF-16

Surmise, but remember that I have the habit of looking at what my code does at assembly level:

Code: Select all

include \masm32\include\masm32rt.inc
.code
start:
  int 3
  printf("Добро пожаловать\n")
  nop
  printf("that was Hello World")
  exit
end start

Code: Select all

 int3
 push offset 00402000                     ; /format = "Добро пожаловать <<<<<<<< that is commonly called "UTF-8"
 call near [<&msvcrt.printf>]             ; \MSVCRT.printf
 add esp, 4

caseih wrote:Windows XP certainly has no UTF-8 support at all. I know that for a fact. UTF-16, yes.

I love facts! And as a matter of fact, looking under the hood of the little proggie that spits out Russian text correctly, in WinXP installed in April 2008, the bytes that are being sent to WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...) appear to be, to my untrained eye, UTF-8, not UTF-16. Which is what I would expect, since I wrote the code myself, and since one line earlier I set the codepage to 65001 alias UTF-8:

Code: Select all

mov eax, esp                             ; PTR to UTF-8 "Russian:	Введите текст здесь
push 0                                   ; | Overlapped = NULL
push eax                                 ; | pBytesWritten
push ecx                                 ; | Arg1
call MbStrLen                            ; | get #bytes (yes, BYTES)
push eax                                 ; | Size
push ecx                                 ; | Buffer
mov edx, offset MbFlags                  ; |
movzx ecx, byte ptr [edx-14]             ; |
mov eax, [ecx*4+edx-10]                  ; |
mov [edx-18], eax                        ; |
push eax                                 ; | 65001 alias CP_UTF8
push eax                                 ; | 65001 alias CP_UTF8
call SetConsoleCP                        ; | kernel32.SetConsoleCP
call SetConsoleOutputCP                  ; | kernel32.SetConsoleOutputCP
push -0B                                 ; | StdHandle = STD_OUTPUT_HANDLE
call GetStdHandle                        ; | KERNEL32.GetStdHandle
push eax                                 ; | hFile
call WriteFile                           ; | KERNEL32.WriteFile
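
For readers who do not read disassembly, the sequence above (set the code page to 65001, then write raw bytes to the standard output handle) can be approximated as follows. This is an untested sketch in Python, not the MasmBasic implementation: `utf8_payload` and `write_utf8_to_console` are names invented here, the `ctypes` calls only run on Windows, and how a given Windows version renders the result is not guaranteed.

```python
import os
import sys

def utf8_payload(text: str) -> bytes:
    """Encode text as UTF-8 with a CRLF line ending, ready for a raw write."""
    return (text + "\r\n").encode("utf-8")

def write_utf8_to_console(text: str) -> int:
    """Switch the console to code page 65001 (Windows only), then write raw bytes."""
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.windll.kernel32
        kernel32.SetConsoleCP(65001)        # input code page
        kernel32.SetConsoleOutputCP(65001)  # output code page
    data = utf8_payload(text)
    # Assumption: file descriptor 1 is the console; the bytes go out unchanged.
    return os.write(1, data)

if __name__ == "__main__":
    write_utf8_to_console("Russian:\tВведите текст здесь")
```

The length passed to the write call is a byte count, not a character count, which is the same distinction the `MbStrLen` comment in the listing makes.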
