A new Unicode & Newbie problem

newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Postby newbieforever » Jan 17, 2019 10:12

...
I just can't deal with all this Unicode stuff that's so complicated in FB...

I hope the following ad-hoc demo code describes my problem understandably enough (if there are bugs, please ignore them).

Thank you very much in advance for your tips.

Code:

' (Edited!)
' NOTE: This code should be saved as Unicode encoded text file,
' and the name of the compiled code should be Сергeй.exe.

Dim As Wstring * 500   txtin  = "C:\Test\TextIn.txt"
Dim As Wstring * 500   txtout = "C:\Test\TextOut.txt"

#Define unicode     
#Include once "windows.bi"
#Include once "win/shellapi.bi"
#Include once "file.bi"
' ==================================================
Declare Function LoadFile(file as Wstring) As String
Declare Sub      Savefile(file As Wstring, o_p As String)
' ==================================================

' NOTE: The name (w/o extension) of the executable should be extracted from the command line launching the executable.

Dim As Wstring * 10000 cmdl = *GetCommandLineW()
Dim As Long            narg
Dim As Wstring Ptr Ptr argl = Cast(Wstring ptr ptr, CommandLineToArgvW(cmdl, @narg))
Dim As Wstring * 500   appp = *argl[0]
Dim As Wstring * 500   appn = Mid(appp, InStrRev(appp, "\") + 1)
Dim As Wstring * 500   namx = Left(appn, len(appn) - 4)

' NOTE: namx is "Сергeй".
' print namx would result in "Сергeй" on screen!
' (I need this variable in my code for other purposes too, not only for what is following here below.)

' NOTE: TextIn.txt is a Unicode encoded text file containing e.g. this line:
' His name was Sergej Сергeевич.
' Now the content of this file is read:

Dim As String          cont

cont = LoadFile(txtin)

' NOTE: cont is the Unicode encoded string "His name was Sergej Сергeевич."
' (I need this variable in my code for other purposes too, not only for what is following here below.)
' This string should be changed by replacing a substring, "Sergej", by namx set above, i.e. "Сергeй"

' ====================================================================================
' MY PROBLEM: How to "adapt" namx to make this replacement in the Unicode string cont?
' ====================================================================================

' After this replacement is done, the changed string will be written to a Unicode encoded text file:

SaveFile(txtout, cont)

' =================

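Function LoadFile(file as Wstring) as String
  ' Reads the whole file into a String that serves as a raw byte buffer.
  Dim As String  i_p
  Dim As HANDLE  f
  Dim As DWORD   fsize, bytesread
  f = CreateFileW(file, GENERIC_READ, 0, 0, OPEN_EXISTING, 0, 0)
  ' CreateFileW returns INVALID_HANDLE_VALUE (not 0) on failure.
  If f <> INVALID_HANDLE_VALUE Then
    fsize = GetFileSize(f, 0)
    If fsize > 0 Then
      i_p = String(fsize, Asc("x"))
      ReadFile(f, StrPtr(i_p), fsize, @bytesread, 0)
    End If
    CloseHandle(f)
  End If
  Return i_p
End Function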

Sub SaveFile(file As Wstring, o_p As String)
  ' Writes the bytes of o_p to the file unchanged.
  Dim As HANDLE  n
  Dim As DWORD   byteswritten
  n = CreateFileW(file, GENERIC_WRITE, 0, 0, CREATE_ALWAYS, 0, 0)
  If n <> INVALID_HANDLE_VALUE Then
    WriteFile(n, StrPtr(o_p), Len(o_p), @byteswritten, 0)
    CloseHandle(n)
  End If
End Sub
Last edited by newbieforever on Jan 18, 2019 15:39, edited 2 times in total.
newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Re: A new Unicode & Newbie problem

Postby newbieforever » Jan 17, 2019 21:54

.
Either my question is too trivial...or so difficult that even the friendly gurus here, who have helped me so often, can't help me this time!!??

If I see this correctly, I would first need a conversion of "Сергeй" (namx) to a Unicode string. This seems to be a problem in FB. For example, the letter й is represented by 39 04 in Unicode. How do I get to this Unicode string?
Munair
Posts: 834
Joined: Oct 19, 2017 15:00
Location: 't Zand, NL

Re: A new Unicode & Newbie problem

Postby Munair » Jan 18, 2019 6:22

There is no such thing as a Unicode string. You would have a string containing a Unicode encoding, either UTF8, UTF16 or UTF32. While Linux and the internet primarily use UTF8, Windows doesn't. So it is the encoding conversion that matters.
newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Re: A new Unicode & Newbie problem

Postby newbieforever » Jan 18, 2019 7:27

Thank you, Munair.

To satisfy your terminological requirements: "Unicode string" in this (or my) case (e.g. when it comes to saving texts in Windows Notepad) means a string of characters encoded in UTF-16LE. Is this correct enough, at least for interpreting my problem and my question?

Back to my problem. Maybe the following is the most reduced example to describe it.

Suppose I write "Сергeй" in Notepad and save the file as TxtA.txt in Unicode (yes, UTF-16LE). In the file, the string of these six characters is represented as "21 04 35 04 40 04 33 04 65 00 39 04". (BOM should be ignored in this example.)

My string variable namx (which receives its value by extracting it from the command line by which the executable was launched) has the same value, "Сергeй". Print namx would display this character string correctly.

The problem is how to convert this variable namx to a string namy which via SaveFile(TxtB, namy) would result in the exact copy of TxtA.txt (again, BOM should be ignored here).

Apart from terminological complaints, are there really no suggestions for a solution in FB?
Munair
Posts: 834
Joined: Oct 19, 2017 15:00
Location: 't Zand, NL

Re: A new Unicode & Newbie problem

Postby Munair » Jan 18, 2019 7:58

If you use a normal string data type, the bytes should be preserved.
Josep Roca
Posts: 415
Joined: Sep 27, 2016 18:20
Location: Valencia, Spain

Re: A new Unicode & Newbie problem

Postby Josep Roca » Jan 18, 2019 9:14

Use dynamic unicode strings instead of ansi strings.

See: viewtopic.php?f=8&t=26856&hilit=dwstring
for DWSTRING.bi and DWStrProcs.bi

Code:

'#CONSOLE ON
#INCLUDE ONCE "Afx/DWSTRING.bi"   ' --> change me
#INCLUDE ONCE "Afx/DWStrProcs.bi"   ' --> change me

' // Read the file
DIM dwsFileName AS DWSTRING = "C:\Programs\Tests\TextIn.txt"   ' --> change me
DIM hFile AS HANDLE
DIM dwsOut AS DWSTRING
hFile = CreateFileW(dwsFilename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL)
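IF hFile <> INVALID_HANDLE_VALUE THEN   ' CreateFileW does not return 0 on failure
   DIM dwFileSize AS DWORD = GetFileSize(hFile, NULL)
   IF dwFileSize THEN
      dwsOut = WSPACE(dwFileSize \ 2)
      DIM dwBytesRead AS DWORD
      DIM bSuccess AS LONG = ReadFile(hFile, *dwsOut, dwFileSize, @dwBytesRead, NULL)
      PRINT dwsOut
   END IF
   CloseHandle(hFile)   ' close the handle even if the file is empty
END IF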

' // Replace the name
Dim As Wstring * 500   namx = "Сергeй"
dwsOut = DWStrReplace(dwsOut, "Sergej", namx)

' // Writing to a file
dwsFileName = "C:\Programs\Tests\TextOut.txt"   ' --> change me
hFile = CreateFileW(dwsFileName, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL)
IF hFile <> INVALID_HANDLE_VALUE THEN
   DIM dwBytesWritten AS DWORD
   DIM bSuccess AS LONG = WriteFile(hFile, dwsOut, LEN(dwsOut) * 2, @dwBytesWritten, NULL)
   CloseHandle(hFile)
END IF

PRINT "Press any key..."
SLEEP
newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Re: A new Unicode & Newbie problem

Postby newbieforever » Jan 18, 2019 9:49

Thank you, Josep!

Before I rush to convert all my code to the functions you suggested (which in itself would be an enormous challenge for me!), I'd rather ask again:

In my real code I do not read a simple text file TextIn.txt (as shown in demo), but a file that contains binary data and also a Unicode encoded text. What I read with LoadFile(), I absolutely need in the code for some other things, so I can not do without it. What is called cont in the demo, in my real code is just a substring that contains the Unicode encoded text.

And in my real code, the changed cont should not be written directly into a text file, but integrated into other (binary) data, which is then saved via SaveFile().

(Btw: Also in the first part of my demo I extract several other variables from the CommandLine, which I cannot do without, and namx is only one of them.)

So at the moment I already have a finished complex code that works wonderfully (with LoadFile(), SaveFile(), namx as WString and cont as String). Only afterwards has it become necessary to replace a substring of cont with namx.
Josep Roca
Posts: 415
Joined: Sep 27, 2016 18:20
Location: Valencia, Spain

Re: A new Unicode & Newbie problem

Postby Josep Roca » Jan 18, 2019 10:17

Then forget my code.
newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Re: A new Unicode & Newbie problem

Postby newbieforever » Jan 18, 2019 10:22

I understand. But maybe in the end I will be forced to completely change my code to your solution...
Josep Roca
Posts: 415
Joined: Sep 27, 2016 18:20
Location: Valencia, Spain

Re: A new Unicode & Newbie problem

Postby Josep Roca » Jan 18, 2019 11:00

If you're working with a binary file, your problem is not a Unicode problem. In binary, you work with bytes. Therefore, you have to replace the bytes used by "Sergej" with the bytes that represent "Сергeй".

> But maybe in the end I will be forced to completely change my code to your solution...

No. It is not advisable to use a unicode string to store binary data.
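
A byte-level replacement along these lines could look roughly like the sketch below. WStrToBytes and ReplaceBytes are hypothetical helper names (they are not part of FreeBASIC or of the code posted in this thread), and the sketch assumes that the text inside cont uses the same code unit size as WString (2 bytes, UTF-16LE, on Windows). The name is built with WChr from code points so that the example does not depend on the encoding of the source file.

```freebasic
#Include Once "crt/string.bi"   ' memcpy, memcmp

' Hypothetical helper: copy the code units of a WString into a String
' that is used purely as a byte buffer.
Function WStrToBytes(ByRef w As WString) As String
  Dim As Integer nbytes = Len(w) * SizeOf(WString)
  If nbytes = 0 Then Return ""
  Dim As String s = Space(nbytes)
  memcpy(StrPtr(s), StrPtr(w), nbytes)
  Return s
End Function

' Hypothetical helper: replace every occurrence of the byte sequence pat
' in buf with repl. Works on raw bytes, so embedded zero bytes are fine.
Function ReplaceBytes(ByRef buf As String, ByRef pat As String, ByRef repl As String) As String
  Dim As Integer n = Len(buf), m = Len(pat)
  If m = 0 OrElse m > n Then Return buf
  Dim As String dest
  Dim As Integer i = 0, start = 0   ' start = first byte not yet copied
  Do While i + m <= n
    If memcmp(StrPtr(buf) + i, StrPtr(pat), m) = 0 Then
      dest &= Mid(buf, start + 1, i - start) & repl
      i += m
      start = i
    Else
      i += 1
    End If
  Loop
  Return dest & Mid(buf, start + 1)
End Function

' Stand-in for the text part of cont (in the real program it comes from LoadFile).
Dim As WString * 64  textw = "His name was Sergej."
Dim As WString * 64  oldw  = "Sergej"
Dim As WString * 500 namx  = WChr(&h421, &h435, &h440, &h433, &h65, &h439)

Dim As String cont = WStrToBytes(textw)
cont = ReplaceBytes(cont, WStrToBytes(oldw), WStrToBytes(namx))
Print Len(cont) \ SizeOf(WString); " code units after the replacement"
```

Since the search runs over plain bytes, it can in principle also match inside unrelated binary data, or at an odd offset that straddles two characters. It is probably safer to apply it only to the substring that is known to hold the text.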
newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Re: A new Unicode & Newbie problem

Postby newbieforever » Jan 18, 2019 11:29

Josep Roca wrote: It is not advisable to use a unicode string to store binary data.

I understand.

Josep Roca wrote: ...you have to replace the bytes used by "Sergej" with bytes that represent "Сергeй".

Exactly. I actually thought that I had expressed exactly this as my problem and my question.

I am experimenting with converting cont to a Wstring contw to do the replacement...
Josep Roca
Posts: 415
Joined: Sep 27, 2016 18:20
Location: Valencia, Spain

Re: A new Unicode & Newbie problem

Postby Josep Roca » Jan 18, 2019 11:38

> Exactly. I actually thought that I had expressed exactly this as my problem and my question.

But you forgot to say that you were working with a binary file. Everything you said, including the .txt extension of the file, indicated that you were working with a text file.
newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Re: A new Unicode & Newbie problem

Postby newbieforever » Jan 18, 2019 12:01

I understand. But I was thinking that by using LoadFile() it is clear that (even when file is a text file) I get exactly this, a string of bytes. So that was a misconception? Sorry!
newbieforever
Posts: 117
Joined: Jun 21, 2018 11:14

Re: A new Unicode & Newbie problem

Postby newbieforever » Jan 18, 2019 15:55

But no, Josep, this must be so! I see now that cont received by LoadFile() from TextIn.txt is a string of bytes where two bytes represent a Unicode encoded character.

As I said, Print namx displays correctly "Сергeй", and I see now that e.g. mid(namx, 6, 1) results correctly in "й".

So my naive question (please be patient with me!) is:

What "byte structure" (or encoding?) does namx have? Or in other words: Should it really be so difficult (or even impossible) in FB to convert a string "Сергeй", which can be output correctly on the screen, into a string of bytes to be written by SaveFile() into a text file?
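
One way to look at the byte structure directly is a small dump routine. The following is only a sketch: WStrHexDump is a made-up helper name, and it assumes a Windows build of FreeBASIC, where one WString character occupies 2 bytes (UTF-16LE). The name is built with WChr from code points so that the result does not depend on how the source file is saved.

```freebasic
' Hypothetical helper: return the raw bytes behind a WString as hex text.
Function WStrHexDump(ByRef w As WString) As String
  Dim As UByte Ptr p = Cast(UByte Ptr, StrPtr(w))
  Dim As Integer nbytes = Len(w) * SizeOf(WString)   ' characters * bytes per character
  Dim As String dump
  For i As Integer = 0 To nbytes - 1
    If i > 0 Then dump &= " "
    dump &= Hex(p[i], 2)
  Next
  Return dump
End Function

Dim As WString * 500 namx = WChr(&h421, &h435, &h440, &h433, &h65, &h439)
Print WStrHexDump(namx)
```

On Windows this should print 21 04 35 04 40 04 33 04 65 00 39 04, i.e. the same twelve bytes that Notepad writes after the BOM. On Linux a WString character is 4 bytes wide, so the dump would look different there.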
Josep Roca
Posts: 415
Joined: Sep 27, 2016 18:20
Location: Valencia, Spain

Re: A new Unicode & Newbie problem

Postby Josep Roca » Jan 18, 2019 17:31

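#Include Once "crt/string.bi"   ' for memcpy
Dim As Wstring * 500 namx = "Сергeй"
' 6 characters * 2 bytes per UTF-16 code unit (Windows) = 12 bytes
DIM s AS STRING = SPACE(12)
memcpy STRPTR(s), STRPTR(namx), 12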
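
The fixed length of 12 only fits this particular six-character name. A possible generalization, given here as a sketch rather than a tested library routine, computes the byte count from the string itself. WStrToBytes is a hypothetical helper name, and SizeOf(WString) is used instead of a literal 2 so that the byte count also stays correct where WString characters are wider than on Windows.

```freebasic
#Include Once "crt/string.bi"   ' memcpy

' Hypothetical helper: copy the code units of a WString into a String
' that is used purely as a byte buffer.
Function WStrToBytes(ByRef w As WString) As String
  Dim As Integer nbytes = Len(w) * SizeOf(WString)
  If nbytes = 0 Then Return ""
  Dim As String s = Space(nbytes)
  memcpy(StrPtr(s), StrPtr(w), nbytes)
  Return s
End Function

' Built from code points so the example does not depend on the source file encoding.
Dim As WString * 500 namx = WChr(&h421, &h435, &h440, &h433, &h65, &h439)
Dim As String namy = WStrToBytes(namx)
Print Len(namy); " bytes"
```

namy can then be passed to SaveFile() or spliced into cont like any other byte string.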
