using unicode (tips and tricks)

Windows specific questions.
sallecta
Posts: 1
Joined: Nov 23, 2012 8:02
Location: Russia

using unicode (tips and tricks)

Post by sallecta »

After forum research and investigation of FB Unicode examples I'd like to discuss the requirements that ensure the Unicode program will work as expected.

1. The source text itself must be UTF-8 encoded (or, more correctly: it must use the same UTF encoding that the Windows version uses)
2. The source text must use a "Unicode BOM"
3. We must tell FBC that we are going to use Unicode:
3.1.

Code: Select all

#define UNICODE
4. Then we can include the windows headers:
4.1

Code: Select all

#include "windows.bi"
5. There is no default Unicode string type in FB. I've found the following workarounds:
5.1. Define the string as

Code: Select all

dim strUnicode as wstring * SomeNumber
5.2. Create user defined type

Code: Select all

Type UtfStr as wstring * SomeNumber
5.2.1. Then use this type, when you need to deal with Unicode strings, ex.

Code: Select all

dim txtBuffer as UtfStr
My summary
As we can see, heavy use of the "5.1" method will make the code less readable.
On the other hand, using the "5.2" method fixes the size of the Unicode string (or am I wrong, and there is a way to "enlarge" the size of that variable?)

Discussion scope
Is there an "optimal" value for the wstring multiplier? I've found that 20 is not enough for me, so I used 200.
Is it possible to "enlarge" the size of the Unicode string when needed?
I would like you to check whether there are any other possibilities or a more elegant way to do Unicode stuff.
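
On the question of enlarging a wide string: one possible workaround is to keep the text on the heap behind a WSTRING PTR and resize it by hand. The following is only a minimal sketch under that assumption. The helper name ResizeWide and the sizes 20 and 200 are made up for illustration, and the caller stays responsible for freeing the memory.

```freebasic
' Hypothetical helper (not part of FB): resizes a heap-allocated wide string
' so that it can hold `chars` characters plus the terminating null.
' Returns 0 on failure, in which case the old block is still valid.
Function ResizeWide(ByVal p As WString Ptr, ByVal chars As Integer) As WString Ptr
    Return Reallocate(p, (chars + 1) * SizeOf(WString))
End Function

' Start with room for 20 characters (zero-filled, so it is an empty string).
Dim buf As WString Ptr = CAllocate(21, SizeOf(WString))
If buf = 0 Then End 1
*buf = "short text"

' Later, more room is needed: grow the buffer to 200 characters.
Dim bigger As WString Ptr = ResizeWide(buf, 200)
If bigger <> 0 Then
    buf = bigger
    *buf = *buf & " followed by a much longer continuation"
End If

Print *buf
Deallocate(buf)
```

Note that assigning through the pointer is not bounds-checked, so the buffer has to be grown before longer text is stored in it.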
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: using unicode (tips and tricks)

Post by TJF »

Hello sallecta, welcome to the forum!
sallecta wrote:After forum research and investigation of FB Unicode examples I'd like to discuss the requirements that ensure the Unicode program will work as expected.

1. The source text must have UTF-8 encoding itself ...
As far as I know you need further tools (libintl) to use UTF-8 encoding on windows (it may be different in current versions ?!?).

Using UTF-8 in FB code doesn't have any requirements. Just switch your IDE (Geany is a good choice) to UTF-8 encoding and start to code. Use simple STRING variables (variable size or fixed length) or ZSTRING PTRs, but note that a character may be more than one byte in size (1 to 4 bytes). WSTRINGs don't work for UTF-8 (since WSTRING always uses 2 bytes per character and a UTF-8 character may have just one byte)!!!

UTF-8 is the best choice if you need to do I18N. You can use the tools for the C programming language (like gettext, PoEdit, ...). Find an example here: Also, you can search for GTK+ applications. They all have UTF-8 and I18N support. For example, GladeToBac can auto-call the xgettext tool for your FB source (that's a bit tricky since xgettext doesn't support FB syntax), so you can easily create one (UTF-8 encoded) translation file *.pot for all source files.
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: using unicode (tips and tricks)

Post by marcov »

TJF wrote:Hello sallecta, welcome to the forum!
sallecta wrote:After forum research and investigation of FB Unicode examples I'd like to discuss the requirements that ensure the Unicode program will work as expected.

1. The source text must have UTF-8 encoding itself ...
As far as I know you need further tools (libintl) to use UTF-8 encoding on windows (it may be different in current versions ?!?).
Afaik Windows has supported UTF-8 since Windows XP, which is close to 10 years now.
... than one byte in size (1 to 4 bytes). WSTRINGs don't work for UTF-8 (since WSTRING always uses 2 bytes per character and a UTF-8 character may have just one byte)!!!
If WStrings map to COM BSTR there might also be other (performance) problems with heavy use. If one goes in the direction of making a 2-byte Unicode type, it is better to make it native (entirely handled by the runtime), and use wstring only for COM and intermodule interfacing.
UTF-8 is the best choice if you need to do I18N. You can use the tools for the C programming language (like gettext, PoEdit, ...). Find an example here:
Afaik Apple's Cocoa, QT, Java, .NET and Windows in general are 2-byte based.
Ascend4nt
Posts: 5
Joined: Apr 05, 2012 18:58
Location: NJ, USA

Re: using unicode (tips and tricks)

Post by Ascend4nt »

Huh, how'd I miss this. I was just discussing unicode text handling in Windows in another forum.

The problem with FreeBASIC is it has weird half-arsed support for WString.. you can create WStrings on the fly by combining two values in a function call, but otherwise you have to be very specific about allocating and deallocating memory for them. I understand Basic was originally an ASCII/ANSI based language, but WString support is pretty important for anyone working on Windows. Most if not all of the internal FreeBASIC O/S interface code is ANSI based and will give crap results if any non-English characters are in filenames or registry keys or any other O/S component.

So basically, using WString throughout FreeBASIC is very important when programming Windows...

To that end, I have already created a 'WideString' object that handles all the allocation/resizing/deallocation with overrides for '+' and other operators. It's still a work in progress, but it has the potential to alleviate at least some of this Windows WString problem in FreeBASIC. I'll try to clean it up and start a thread with it in the (hopefully) near future.

As far as UTF-8<->UTF-16 conversions, there are two Windows functions (MultiByteToWideChar & WideCharToMultiByte) that handle this quite easily. I've already coded up functions for calling these. The remaining problems are replacing common built-in functions with Windows calls to the wide string variants of functions, which in effect makes it look a lot less like a 'Basic' language. But alas, it HAS to be done this way.. coding Windows without UTF-16 strings is a big mistake.
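
For readers who want to see what such a wrapper might look like, here is a minimal sketch of the UTF-8 to UTF-16 direction using MultiByteToWideChar. It is not the code referred to above: the function name Utf8ToWide is an assumption made for illustration, error handling is reduced to returning a null pointer or an empty string, and it only builds on Windows.

```freebasic
#include "windows.bi"

' Hypothetical helper: converts a UTF-8 encoded STRING into a newly allocated
' UTF-16 wide string. The caller must Deallocate() the result.
Function Utf8ToWide(ByRef utf8 As String) As WString Ptr
    Dim bytes As Long = CLng(Len(utf8))
    Dim units As Long = 0

    ' First call: ask how many UTF-16 code units are needed. Because an
    ' explicit byte length is passed, the count excludes the terminator.
    If bytes > 0 Then
        units = MultiByteToWideChar(CP_UTF8, 0, StrPtr(utf8), bytes, NULL, 0)
    End If

    ' Zero-filled buffer, one extra unit for the terminating null.
    Dim w As WString Ptr = CAllocate(units + 1, SizeOf(WString))
    If w = 0 Then Return 0

    ' Second call: do the actual conversion into the buffer.
    If units > 0 Then
        MultiByteToWideChar(CP_UTF8, 0, StrPtr(utf8), bytes, w, units)
    End If
    Return w
End Function

' "café" as UTF-8 bytes: the é takes two bytes (C3 A9).
Dim source As String = "caf" & Chr(&hC3, &hA9)
Dim wide As WString Ptr = Utf8ToWide(source)
If wide <> 0 Then
    Print Len(source), Len(*wide)   ' 5 bytes in, 4 UTF-16 code units out
    Deallocate(wide)
End If
```

The two-call pattern (query the size, then convert) avoids guessing a buffer size up front.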

Oh and btw, an annoying caveat: Windows' Wide strings (UTF-16) consist *mostly* of just 2-byte units. Code points above U+FFFF are encoded as a surrogate pair, i.e. 2 units of 2 bytes each, which yields 4 bytes, though that's seen more often with some East Asian characters.
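
To make the 4-byte case concrete, here is a small sketch of the arithmetic that turns one code point above U+FFFF into two UTF-16 units. The Sub name ToSurrogates is hypothetical, and no validation of the input range is done.

```freebasic
' Hypothetical helper: splits a code point above U+FFFF into the two UTF-16
' code units (surrogate pair) that represent it.
Sub ToSurrogates(ByVal cp As ULong, ByRef highUnit As UShort, ByRef lowUnit As UShort)
    Dim v As ULong = cp - &h10000                ' 20 significant bits remain
    highUnit = CUShort(&hD800 + (v Shr 10))      ' upper 10 bits
    lowUnit  = CUShort(&hDC00 + (v And &h3FF))   ' lower 10 bits
End Sub

Dim As UShort highUnit, lowUnit
ToSurrogates(&h1F600, highUnit, lowUnit)         ' U+1F600, an emoji
Print Hex(highUnit); " "; Hex(lowUnit)           ' D83D DE00
```

Any code that treats one WString element as one character will see such a pair as two "characters".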

Anyway.. good discussion.. I wish this O/S supported UTF-8 like Linux and Mac (or at least I thought).. that's a much easier Unicode standard to deal with
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: using unicode (tips and tricks)

Post by TJF »

Ascend4nt wrote:... I wish this O/S supported UTF-8 like linux and Mac (or at least I thought).. thats a much easier unicode standard to deal with
Sure, it's much easier to use simple STRINGs for UTF-8 encoding.

You can have UTF-8 support on Windows by using libraries like libintl, SQL, GTK (PangoCairo), ... GLib, for example, comes with all the helper functions like character count or transformation to another encoding. GLib is well supported since fbc-0.24.0.
St_W
Posts: 1619
Joined: Feb 11, 2009 14:24
Location: Austria

Re: using unicode (tips and tricks)

Post by St_W »

@TJF: please don't write nonsense, if you have no idea - in fact Microsoft implemented Unicode in Windows quite early and Windows was even the first OS that used Unicode in system calls.

Regarding Unicode and Windows:
  • Windows doesn't use UTF-8 internally, but UTF-16 instead (UCS-2 in versions before Windows 2000).
  • Windows NT (including Windows NT 4.0, 2000, XP, Vista, 7, 8 and onwards) supports Unicode out of the box; Windows 95, 98, ME (you shouldn't use these nowadays anyway) support Unicode through MSLU (the Microsoft Layer for Unicode, unicows)
  • Apart from unicows on Windows 95,98,ME no additional libraries are needed; in particular no GTK, libintl, etc.
  • UTF-8 is supported on Windows NT 4.0, 2000, XP and onwards out of the box
Regarding Unicode and FreeBasic:
  • FreeBasic supports Unicode (source code) files
  • FreeBasic does not support variable-length Unicode strings in the way the "String" datatype does for ANSI strings. The wide-character datatype "WString" is comparable to "ZString" for ANSI strings. Of course one could use the ANSI "String" datatype to store Unicode strings, but FreeBasic built-in functions like LEN would probably give unexpected results in that case (e.g. LEN returns the byte length instead of the character count)
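
The LEN mismatch is easy to demonstrate. The sketch below stores UTF-8 bytes in an ordinary STRING and counts code points by skipping continuation bytes (those of the form 10xxxxxx). The function name Utf8Count is hypothetical, and the input is assumed to be valid UTF-8.

```freebasic
' Hypothetical helper: counts code points in a UTF-8 encoded STRING by
' counting every byte that is not a continuation byte (10xxxxxx).
Function Utf8Count(ByRef s As String) As Integer
    Dim n As Integer = 0
    For i As Integer = 0 To Len(s) - 1
        If (s[i] And &hC0) <> &h80 Then n += 1
    Next
    Return n
End Function

' "café" as UTF-8 bytes: the é takes two bytes (C3 A9).
Dim s As String = "caf" & Chr(&hC3, &hA9)
Print Len(s)         ' 5 (bytes)
Print Utf8Count(s)   ' 4 (code points)
```

This counts code points, not what a user perceives as characters; combining marks would still be counted separately.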
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: using unicode (tips and tricks)

Post by TJF »

St_W wrote:@TJF: please don't write nonsense, if you have no idea
Ditto.
St_W wrote:
  • FreeBasic does not support variable-length Unicode Strings like "String"-Datatype for ANSI Strings.
I have used UTF-8 encoded variable-length STRINGs for years (e.g. Data2App, OpenGlPlayer, GladeToBac, ...). Please note: Unicode doesn't have to be UTF-16. It may also be UTF-8 or UTF-32.
St_W wrote:
  • Apart from unicows on Windows 95,98,ME no additional libraries are needed; in particular no GTK, libintl, etc.
  • UTF-8 is supported on Windows NT 4.0, 2000, XP and onwards out of the box
I don't know which kind of support you have in mind here. But did you try to use UTF-8 encoded file names in any of these OS versions yet (including native characters)? Or did you check UTF-8 output in a console?

There are good reasons why I translated 1.5 MB of header files for GTK and its dependencies. And I have good reasons why I use them instead of the win API.

If you think you can have reasonable I18N or L10N in another way, that's OK for you. But don't call statements you can't follow "nonsense", please!
Ascend4nt
Posts: 5
Joined: Apr 05, 2012 18:58
Location: NJ, USA

Re: using unicode (tips and tricks)

Post by Ascend4nt »

One thing that usually stops me from including large support libraries is the resultant executable size. However, I'm always curious just what will be pulled into the executable..

Out of curiosity, I tried a really simple GLib test:

Code: Select all

#include "glib.bi"
Dim OSVer As guint
OSVer = g_win32_get_windows_version()
    
Print "O/S Version = ";OSVer
Sleep
The executable unfortunately is looking for "(null).dll". wtf? haha.. that makes no sense.

Oh, one thing - the remark "UTF-8 is supported on Windows NT 4.0, 2000, XP and onwards out of the box" by St_W.. that's misleading. No Windows API functions support UTF-8 aside from the Unicode conversion functions. Otherwise it's just 'ANSI' or 'Wide' (UTF-16)
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: using unicode (tips and tricks)

Post by TJF »

Ascend4nt wrote:The executable unfortunately is looking for "(null).dll". wtf? haha.. that makes no sense.
That's a known issue, the fbc-0.24.0 import libraries are broken. See
and find download links to fix it in the post below.
Ascend4nt wrote:However, I'm always curious just what will be pulled into the executable..
The executable gets just a few additional bytes (name of library, declarations of the used functions). It's much smaller than it would be if you put the used library code in your FB source.
St_W
Posts: 1619
Joined: Feb 11, 2009 14:24
Location: Austria

Re: using unicode (tips and tricks)

Post by St_W »

TJF wrote:I have used UTF-8 encoded variable-length STRINGs for years (e.g. Data2App, OpenGlPlayer, GladeToBac, ...). Please note: Unicode doesn't have to be UTF-16. It may also be UTF-8 or UTF-32.
Sure, you may misuse ordinary "String"s for storing Unicode data (no matter which encoding: UTF-8, UTF-16, etc.), but FreeBasic and especially the rtlib don't support it; it will still be treated as an ordinary ANSI String, whether it's Unicode or not. I've mentioned that in my last post.
TJF wrote:I don't know which kind of support you have in mind here. But did you try to use UTF-8 encoded file names in any of these OS versions yet (including native characters)? Or did you check UTF-8 output in a console?
Ascend4nt wrote:No Windows API functions support UTF-8 aside from the unicode conversion functions. Otherwise its just 'ANSI' or 'Wide' (UTF-16)
Of course you cannot call system functions and pass data using a different Unicode encoding. Windows uses UTF-16 internally, thus you have to call Win API functions using that encoding. UTF-8 support is provided through the MultiByteToWideChar and WideCharToMultiByte system functions, which can be used to convert between the UTF-8 and UTF-16 Unicode encodings.
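
As an illustration of the UTF-16 to UTF-8 direction, here is a minimal sketch around WideCharToMultiByte. The function name WideToUtf8 is an assumption made for this example; it returns an empty STRING on any failure and only builds on Windows.

```freebasic
#include "windows.bi"

' Hypothetical helper: converts a null-terminated UTF-16 wide string into a
' STRING holding the equivalent UTF-8 bytes.
Function WideToUtf8(ByVal w As WString Ptr) As String
    Dim utf8 As String
    If w = 0 Then Return utf8

    Dim units As Long = CLng(Len(*w))
    If units = 0 Then Return utf8

    ' First call: ask how many bytes the UTF-8 form needs. For CP_UTF8 the
    ' last two parameters must be NULL.
    Dim bytes As Long = WideCharToMultiByte(CP_UTF8, 0, w, units, NULL, 0, NULL, NULL)
    If bytes <= 0 Then Return utf8

    ' Second call: convert into a STRING of exactly that length.
    utf8 = Space(bytes)
    WideCharToMultiByte(CP_UTF8, 0, w, units, StrPtr(utf8), bytes, NULL, NULL)
    Return utf8
End Function

' "Grün" built from code points so the example does not depend on the
' encoding of the source file.
Dim wide As WString * 8 = WChr(&h47, &h72, &hFC, &h6E)
Dim narrow As String = WideToUtf8(@wide)
Print Len(wide), Len(narrow)   ' 4 UTF-16 code units in, 5 UTF-8 bytes out
```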
TJF wrote:There're good reasons why I translated 1,5 MB of header files for GTK and its dependencies. And I have good reasons why I use them instead of win API.
I haven't doubted that there are reasons for using GTK: e.g. platform interoperability and internationalization are some. But just for writing Unicode programs for Windows you do not need GTK.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: using unicode (tips and tricks)

Post by TJF »

St_W wrote:Sure, you may misuse ordinary "String"s for storing Unicode data (no matter which encoding: UTF-8, UTF-16, etc.), but FreeBasic and especially the rtlib don't support it; it will still be treated as an ordinary ANSI String, whether it's Unicode or not. I've mentioned that in my last post.
Do you really think a STRING must contain ANSI characters and any other usage is a misuse? What about the MKI function family? For me a STRING is a flexible data type (similar to an ALLOCATEd memory block, but easier to handle). E.g. my linear algebra library libFBla is based on the STRING data type.

UTF-8 encoding is designed to be used as an ANSI string. There is no difference in the lower page (0 to 127). The native characters are at special codes in the upper page. So UTF-8 is supported by fbc's rtlib, e.g. for input and output at the console. When your command shell is set to UTF-8 you can use this encoding out of the box. (The LEN function doesn't count the number of characters, it returns the size in bytes -- as you mentioned. GLib closes that gap -- as I mentioned.)

On a Windows box it's hard to find a command shell with UTF-8 encoding. That's why I say it isn't supported.
St_W wrote:But just for writing Unicode programs for Windows ...
fbc is a cross-platform compiler. Why not use all its features?

BTW:
marcov wrote:Afaik Apple's Cocoa, QT, Java, .NET and Windows in general are 2-byte based.
AFAIK QT uses Cairo for rendering and PangoCairo for text output. So it cannot be 2-byte-based in general. At least the output is based on UTF-8 encoding. I think it's similar with Apple OS (as Ascend4nt mentioned). I recommend using just one encoding in a project if at all possible.
Ascend4nt
Posts: 5
Joined: Apr 05, 2012 18:58
Location: NJ, USA

Re: using unicode (tips and tricks)

Post by Ascend4nt »

TJF wrote:That's a known issue, the fbc-0.24.0 import libraries are broken
Cool. Thanks for that! Search doesn't always get me what I need :)
TJF wrote:
Ascend4nt wrote:However, I'm always curious just what will be pulled into the executable..
The executable gets just a few additional bytes (name of library, declarations of the used functions). It's much smaller than it would be if you put the used library code in your FB source.
Hmm.. are we talking about statically linked or dynamically? I can't figure out how to statically link libglib-2.0-0's object code in.. would you know the compiler options for that? If it's not possible, then the size is going to be way too much for a simple application. 1.33MB for libglib-2.0-0.dll plus its dependency intl.dll..
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: using unicode (tips and tricks)

Post by marcov »

TJF wrote:
marcov wrote:Afaik Apple's Cocoa, QT, Java, .NET and Windows in general are 2-byte based.
AFAIK QT uses Cairo for rendering and PangoCairo for text output.
Afaik QT has some options to integrate with Cairo, but that doesn't mean it is Cairo based. It merely allows you to paint on Cairo canvases. They are mostly meant to keep QT functioning in Gnome-centric environments, and might not even be enabled on KDE-centric systems.
So it cannot be 2-byte-based in general. At least the output is based on UTF-8 encoding. I think it's similar with Apple OS (as Ascend4nt mentioned). I recommend using just one encoding in a project if at all possible.
The FreeBSD parts of OS X are 1-byte (and thus UTF-8, though like most Unices they are better described as mostly encoding agnostic, storing filenames as binary rather than in a defined encoding). Aqua and other application-centric parts of OS X are based on Objective-C, which has a two-byte string type.
TJF wrote: fbc is a cross-platform compiler. Why not using all it's features?
(Using GTK to avoid Windows conventions is the opposite of cross platform. It is emulation)

Note that the UTF8 <-> Ansi compatibility is fairly shallow, and any accent will be mutilated (rendering this equivalency useless for e.g. continental European languages, and maybe even English, since personal names might have accents).

Basically it only allows to recycle relatively encoding agnostic code that assumes 1-byte per char, and doesn't render anything and avoids per character transformations.
Last edited by marcov on Jan 04, 2013 19:48, edited 1 time in total.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Re: using unicode (tips and tricks)

Post by TJF »

Ascend4nt wrote:Hmm.. are we talking about statically linked or dynamically? I can't figure out how to statically link libglib-2.0-0's object code in.. would you know the compiler options for that? If it's not possible, then the size is going to be way too much for a simple application. 1.33MB for libglib-2.0-0.dll plus its dependency intl.dll..
I'm talking about dynamic linking. I don't use static linking at all. (And sorry, I don't know how to compile GLib for static linkage -- it must be possible in a way. You may need GLibObject and GIO as well, depending on the features you use.)

GLib usually gets updated twice a year. Static linking means you have to recompile your software (and re-install it on each user's box) when you want to use a new feature. And users who already use Geany or Inkscape or GIMP or Pidgin or ... get a further copy with your software as well. (That's not just a waste of hard disk space. You'll waste time when making back-ups, searching in the back-up, ...)

I never had problems with deprecated functions from old versions. When a feature is to be removed they announce it two or three years before they finally remove it. You'll have enough time to update (and also improve) your code.