unicode variable length string type?

Postby marpon » Oct 22, 2015 16:02

Looking at the GitHub repository, I found
https://github.com/freebasic/fbc/blob/m ... r/dstr.bas and other helper .bas files,
with this code:

Code: Select all

'' dynamic wstrings
''::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

'':::::
sub DWstrZero _
   ( _
      byref dst as DWSTRING _
   )

    dst.data = NULL
    dst.len = 0
    dst.size = 0

end sub


Is this the upcoming variable-length WString type?

Like some others, I would be very interested in a Unicode variable-length type, or at least a variable-length WString type,
and in string manipulation functions for these types.
It would simplify writing programs that handle Unicode text.

Has anyone tried to create such types and functions to play with?

Currently, because I don't know how to create such types, I'm trying to use the variable-length String type as a container,
with functions to cast a wstring ptr to a zstring on the data element of a UDT, and back to a wstring ptr.
Concatenation with "+" or "&" works correctly, and InStr works at the zstring ptr level.
It's better than nothing, but I have to recreate a specific StrPtr to work with the Unicode Windows API.

I am interested in your attempts (especially for Chinese/Japanese ...).

Postby dkl » Oct 23, 2015 8:45

Hi,

that code from the compiler sources is internal to the compiler; it's a set of helper functions for variable-length wstrings, used by the preprocessor and such. It's actually one of the oldest parts of the compiler, from 2005/2006...

Sadly the rtlib doesn't have support for dynamic wstrings yet, otherwise that could have been used internally in the compiler, instead of the dstr.bas code...

It would definitely be nice to have a dynamic wstring in FB. These are the problems that I think need to be solved first, though:

1. The name "WString" is already used for null-terminated wstrings. We need a new name for wstring descriptors, to be able to differentiate between descriptor and string data. In case of String/ZString, we have String=descriptor, ZString=string data. It needs to be UString or DWString or something like that...

2. Lots of work on the rtlib sources will be needed to implement the new type, and possibly adjust existing functions to make use of it. The latter would also mean a backwards compatibility break.

Postby PaulSquires » Oct 23, 2015 11:26

Adding a dynamic-length Unicode variable type is my #1 wish at the moment. Everything in Windows is Unicode, and working with new COM-based technologies is so much easier with such a variable. I'm not sure how important or useful such a variable type would be on Linux because I do not program on that platform. I wish I understood the inner workings of the compiler code well enough to make a valid code contribution for this dynamic Unicode variable issue.

Postby marpon » Oct 23, 2015 14:52

@dkl
Yes, I saw that dstr.bas was intended to cover the compiler's own need for a Unicode variable-length type; that's why I was surprised that nothing more has been done in that direction.
I was looking into it to understand the mechanism behind variable-length strings.
I understood more or less how the buffer is created, allocated, reallocated ..., but what I did not get is the mechanism that frees the allocated memory automatically when the program ends.
The pointer has to be stored somewhere! I'd be very interested to see how.
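
For what it's worth, one way to get automatic freeing in user code is to wrap the buffer in a UDT with a destructor. As far as I know, FB calls the destructor when a variable goes out of scope, and for module-level variables when the program ends normally. The following is only a minimal sketch: the type name DynWStr and its fields are hypothetical and are not taken from dstr.bas or the rtlib.

```freebasic
'' Hypothetical type for illustration; not the layout used by dstr.bas or the rtlib.
Type DynWStr
    Declare Constructor()
    Declare Constructor(ByRef s As Const WString)
    Declare Destructor()
    As WString Ptr buf      '' heap buffer holding a null-terminated wstring
    As Integer length       '' number of wstring units, terminator excluded
End Type

Constructor DynWStr()
    buf = 0
    length = 0
End Constructor

Constructor DynWStr(ByRef s As Const WString)
    length = Len(s)
    buf = Allocate((length + 1) * SizeOf(WString))
    If buf <> 0 Then
        *buf = s            '' copies the characters and the terminator
    Else
        length = 0          '' allocation failed: stay empty
    End If
End Constructor

Destructor DynWStr()
    '' Called automatically; releases the buffer if there is one.
    If buf <> 0 Then Deallocate(buf)
    buf = 0
End Destructor

Sub Demo()
    Dim w As DynWStr = WStr("hello")
    Print *w.buf; " has"; w.length; " units"
End Sub   '' w goes out of scope here, so its destructor frees the buffer

Demo()
```

A real type would also need a copy constructor and an Operator Let. Without them, copying the object copies only the pointer and the same buffer gets freed twice.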

@Paul
I am sure you are also interested in having such a type for Windows, but I would say that even on Linux (without the COM need)
the need is obvious. Any program aimed at an international audience needs to address Unicode; it is not possible to ignore two thirds of humanity.

Postby diskay » Oct 23, 2015 15:28

Use -UNICODE to open
"string" for UNICODE encoding

Postby caseih » Oct 24, 2015 3:44

No, having string default to Unicode would probably not be good, mainly because, except for string, FB has no other way of addressing actual bytes, which are required for communicating with the rest of the computing universe. If string defaulted to Unicode then you'd have to have a bytestring type and a complete suite of encoding and decoding tools in the runtime library just to get input from a text file. Python defaults string to Unicode now, but has a bytes type for doing I/O.

The issue is also more complicated than simply adding a 2-byte dynamic string. Different platforms support wchar differently. See https://www.gnu.org/software/libunistri ... 005ft-mess for a brief talk about the perils of wchar. Also there's UCS-2 vs UTF-16. UCS-2 is always 2 bytes per character, whereas UTF-16 may not be (similar to UTF-8). The Windows API is pretty much all UTF-16 now. I am not sure about .NET, but as far as I know its strings are UTF-16 internally too.

Also, UTF-16 (Windows) and UTF-8 (everyone else) both suffer from the problem that you can't index strings based on character position without scanning from the beginning of the string every time. This is O(n) and can be a performance issue. If the runtime library kept an index of characters that need additional bytes or sets of bytes (UTF-8 is 1 to 4 bytes per character) when they were added to the string, that could make an operation like MID() fast again, but at the expense of some additional memory.
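
To make the O(n) cost concrete, here is a minimal sketch of what finding the n-th character in a UTF-16 buffer involves. The function name Utf16UnitIndex is made up for this example, and the buffer is taken as a plain UShort array rather than a wstring so that it behaves the same on every platform.

```freebasic
'' Returns the offset (in 16-bit units) at which code point number n (0-based)
'' starts, or -1 when the buffer holds fewer than n + 1 code points.
Function Utf16UnitIndex(ByVal p As Const UShort Ptr, ByVal units As Integer, _
                        ByVal n As Integer) As Integer
    Dim As Integer i = 0, cp = 0
    While i < units
        If cp = n Then Return i
        Dim As Integer stepUnits = 1
        '' A high surrogate (D800..DBFF) followed by a low surrogate
        '' (DC00..DFFF) encodes a single code point in two units.
        If (i + 1 < units) AndAlso ((p[i] And &hFC00) = &hD800) Then
            If (p[i + 1] And &hFC00) = &hDC00 Then stepUnits = 2
        End If
        i += stepUnits
        cp += 1
    Wend
    Return -1
End Function

'' "A", U+1F600 (as the pair D83D DE00), "B"
Dim u(0 To 3) As UShort = {&h0041, &hD83D, &hDE00, &h0042}
Print Utf16UnitIndex(@u(0), 4, 2)   '' the third code point ("B") starts at unit 3
```

Every lookup walks the buffer from the start, which is exactly the cost an index or a "no surrogates" shortcut would try to avoid.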

FB could simply adopt 4-byte strings, and then re-encode as necessary, automatically, when casting to wstring. UTF-8 would still be manual. But such a scheme will increase RAM usage fairly considerably.

Another alternative is to adopt a scheme similar to Python, which uses a flexible representation to get an O(1) string that optimally uses either 1, 2, or 4 bytes per character as required for an individual string. MID() can be done with simple pointer arithmetic. However, this is not ideal for FB. Python's strings are immutable, so once a string is created it is never added to or changed, and the optimal encoding never has to change. In FB's case, if you had a string that was encoded with 1 byte per character, and then you added a character that required 2 or 4 bytes, then the entire string would have to be re-encoded. (This also happens in Python when people misunderstand strings and continually do something like my_string += "something new": normally some kind of StringIO buffer is recommended instead.)
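
As a rough illustration of the flexible representation, the width per character can be derived from the largest code point in the string. This is only a sketch of the idea, not Python's actual implementation; the function name BytesPerChar is hypothetical and the input is assumed to be an array of already decoded code points.

```freebasic
'' Picks the narrowest fixed width (1, 2 or 4 bytes) able to hold every
'' code point in the array, in the spirit of Python's flexible strings.
Function BytesPerChar(ByVal cp As Const ULong Ptr, ByVal count As Integer) As Integer
    Dim As ULong maxCp = 0
    For i As Integer = 0 To count - 1
        If cp[i] > maxCp Then maxCp = cp[i]
    Next
    If maxCp <= &hFF Then Return 1      '' fits in one byte
    If maxCp <= &hFFFF Then Return 2    '' Basic Multilingual Plane
    Return 4                            '' anything above U+FFFF
End Function

Dim latin(0 To 1) As ULong = {&h41, &hE9}       '' "A", e with acute accent
Dim cjk(0 To 1) As ULong = {&h41, &h4E2D}       '' "A", a CJK ideograph
Dim emoji(0 To 1) As ULong = {&h41, &h1F600}    '' "A", U+1F600
Print BytesPerChar(@latin(0), 2); BytesPerChar(@cjk(0), 2); BytesPerChar(@emoji(0), 2)
```

Appending U+1F600 to the first string would push its width from 1 to 4, which is the full re-encode described above.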

This is way too long, so to make it longer here's a summary:
- FB either adopts the platform's native encoding (see the wchar mess link above), or uses its own encoding internally.
- A complete suite of byte decoding to unicode and unicode encoding to byte utilities would have to exist. Even printing to the screen with the print statement requires such things, as the output device may or may not support whatever FB's internal encoding is. Probably best to tie the runtime into an existing unicode library for doing encoding and decoding.
- string would continue to be a byte string, used for reading or writing bytes.
- All of FB's string runtime functions would need a *lot* of work to support unicode. Such as ucase() and lcase().

Postby Tourist Trap » Oct 24, 2015 9:29

Isn't it possible to just #include "someOpensourceUnicodeLib.bi"? Is there such a library?

Postby caseih » Oct 24, 2015 14:36

Yes, that is possible now if you make bindings to one of the C Unicode libraries out there, and it may be the best course of action. An object-oriented wrapper around it could provide a reasonably native-looking way of doing dynamic Unicode strings. And the hack I saw for overriding built-in functions could make all the standard string functions work.

Postby marpon » Oct 24, 2015 14:42

@caseih
"No having string default to unicode would probably not be good"

I'm of the same view; I think it's better to add something like ustring.

I also agree that UTF-8 and UTF-16 use multiple units per character; only UTF-32 gives one unit (4 bytes) per character.
That simplifies direct indexing, but in return it uses up to twice as much memory as UTF-16 to hold the string, and on Windows you still have to convert to UTF-16LE when working with the Win API.

Not so simple, as you said.

Today, what I am using is string as a container for UTF-16LE elements (to get the complete allocate/reallocate/free handling without risk), plus some functions to convert to and from that string as needed.
When it is only a few strings for the interface, that is enough, but string manipulation is another story...
UTF-32 is probably the best choice for Unicode string manipulation.

Postby caseih » Oct 25, 2015 16:14

Thinking on it, if I were going to create a Unicode string class for FB, I'd choose UTF-16 as the internal storage encoding (we can't actually use WCHAR as that is platform-specific: it is only 16 bits on Windows, while on Linux and most other platforms it's 32 bits). I'd use one of the C Unicode libraries for manipulating it. I would implement one optimization to start, though. If the encoded string has no surrogate pairs in it, I'd set a flag. That would mean there's a 1:1 correlation between character and position, so MID() and friends could be O(1). This would handle all the European languages and also the major Asian ones, which use the Basic Multilingual Plane, but probably not some of the South Asian ones, which require surrogate pairs as they use what's called the Supplementary Multilingual Plane.
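
A minimal sketch of how such a flag could be computed: scan the units once when the string is created or modified, and remember whether any of them falls in the surrogate range. The function name HasSurrogates is hypothetical, and the buffer is a plain UShort array for the sake of the example.

```freebasic
'' Returns -1 (true) if any 16-bit unit lies in the surrogate range
'' D800..DFFF, 0 (false) otherwise.
Function HasSurrogates(ByVal p As Const UShort Ptr, ByVal units As Integer) As Integer
    For i As Integer = 0 To units - 1
        '' Masking with F800 keeps the top five bits, which are 11011
        '' for every surrogate unit (high or low).
        If (p[i] And &hF800) = &hD800 Then Return -1
    Next
    Return 0
End Function

Dim bmpOnly(0 To 2) As UShort = {&h0041, &h00E9, &h4E2D}
Dim withPair(0 To 2) As UShort = {&h0041, &hD83D, &hDE00}
Print HasSurrogates(@bmpOnly(0), 3); HasSurrogates(@withPair(0), 3)
```

When the flag is clear, character index and unit offset are the same number, so MID() can jump straight to the offset. When it is set, the code has to fall back to scanning or to some extra index.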

At that point, perhaps a different indexing scheme would be of assistance to deal with surrogate pairs. Profiling would have to be done to see if it's worth it I guess.

In any case, I feel that UTF-16 is a good compromise between speed and memory storage.

While unicode support could be done entirely in FB code without any support from the compiler and runtime, without some support from the compiler itself, there's no way to define unicode string literals. Adding support for that would mean making fbc unicode-aware and able to decode from a variety of encodings (UTF-8, UTF-16, etc). That could be quite an undertaking. Makes me tired thinking about it. Internally, FB's parse tree would have to be able to deal with unicode.

Currently FB keywords and variable names are limited to using ASCII characters, and that would not have to change. Personally I disagreed when Python developers decided to allow unicode characters in variable names but that's another story I suppose.

Postby v1ctor » Oct 27, 2015 13:41

Just a correction: the compiler can already parse Unicode files encoded in UTF-8 (with BOM), UTF-16 (BE/LE) and UTF-32 (BE/LE).

There are some examples at http://sourceforge.net/p/fbc/code/ci/master/tree/examples/unicode/

Postby caseih » Oct 27, 2015 16:04

Okay that's good then. The compiler seemed to have no difficulty with a bare UTF-8 file (no BOM). Am I right in thinking, then, that fbc assumes if it's not UTF-32 or UTF-16, then it has to be UTF-8? If so, that's a good way of doing it, I think. Working with legacy 1-byte encodings is messy.
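
For reference, telling these encodings apart by their byte order mark could look like the sketch below. I have not checked how fbc actually does it, so this is not the compiler's code; the enum and the function name DetectBom are made up for the example, and only the BOM byte sequences themselves are standard.

```freebasic
Enum TextEncoding
    ENC_UNKNOWN     '' no BOM found (could be ASCII, BOM-less UTF-8, a legacy code page...)
    ENC_UTF8
    ENC_UTF16LE
    ENC_UTF16BE
    ENC_UTF32LE
    ENC_UTF32BE
End Enum

'' Inspects the first n bytes of a file for a byte order mark.
Function DetectBom(ByVal b As Const UByte Ptr, ByVal n As Integer) As TextEncoding
    '' UTF-32 is tested first: its LE mark FF FE 00 00 begins with the UTF-16LE mark.
    If n >= 4 Then
        If (b[0] = &hFF) AndAlso (b[1] = &hFE) AndAlso (b[2] = 0) AndAlso (b[3] = 0) Then Return ENC_UTF32LE
        If (b[0] = 0) AndAlso (b[1] = 0) AndAlso (b[2] = &hFE) AndAlso (b[3] = &hFF) Then Return ENC_UTF32BE
    End If
    If n >= 3 Then
        If (b[0] = &hEF) AndAlso (b[1] = &hBB) AndAlso (b[2] = &hBF) Then Return ENC_UTF8
    End If
    If n >= 2 Then
        If (b[0] = &hFF) AndAlso (b[1] = &hFE) Then Return ENC_UTF16LE
        If (b[0] = &hFE) AndAlso (b[1] = &hFF) Then Return ENC_UTF16BE
    End If
    Return ENC_UNKNOWN
End Function

Dim utf8Mark(0 To 2) As UByte = {&hEF, &hBB, &hBF}
If DetectBom(@utf8Mark(0), 3) = ENC_UTF8 Then Print "UTF-8 with BOM"
```

A file without any BOM ends up as ENC_UNKNOWN here; what to assume in that case (UTF-8 or a 1-byte encoding) is a separate decision.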

Postby marpon » Nov 13, 2015 9:30

I'm still working on creating this variable-length Unicode type based on wstring, able to work on Windows or Linux (the difference: on Windows a wstring unit is 2 bytes; on Linux it is 4 bytes).

I'm currently using a type with a wstring ptr, len, size, and a surrogate flag (for UTF-16 wstrings with 2-byte units).

I've almost finished the automatic alloc/realloc/free process.
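
To give an idea of the alloc/realloc part, here is a minimal sketch of an append with geometric growth. It is not the code from my .bi file: the names UStr, UStrAppend and UStrFree are hypothetical, and the surrogate flag is left out to keep it short.

```freebasic
'' Hypothetical descriptor; field names are illustrative only.
Type UStr
    As WString Ptr p        '' buffer (0 while empty)
    As Integer length       '' units in use, terminator excluded
    As Integer capacity     '' units allocated, terminator included
End Type

'' Appends s to u, doubling the capacity when the buffer is too small.
'' Returns -1 on success, 0 if the allocation failed (u is then unchanged).
Function UStrAppend(ByRef u As UStr, ByRef s As Const WString) As Integer
    Dim As Integer addLen = Len(s)
    Dim As Integer need = u.length + addLen + 1
    If need > u.capacity Then
        Dim As Integer newCap = u.capacity * 2
        If newCap < need Then newCap = need
        '' Reallocate with a null pointer behaves like Allocate
        Dim As WString Ptr q = Reallocate(u.p, newCap * SizeOf(WString))
        If q = 0 Then Return 0
        u.p = q
        u.capacity = newCap
    End If
    For i As Integer = 0 To addLen - 1
        u.p[u.length + i] = s[i]        '' copy unit by unit
    Next
    u.length += addLen
    u.p[u.length] = 0                   '' keep the buffer null-terminated
    Return -1
End Function

Sub UStrFree(ByRef u As UStr)
    If u.p <> 0 Then Deallocate(u.p)
    u.p = 0 : u.length = 0 : u.capacity = 0
End Sub

Dim us As UStr                          '' fields start out zeroed
UStrAppend(us, WStr("abc"))
UStrAppend(us, WStr("def"))
Print *us.p; us.length
UStrFree(us)
```

SizeOf(WString) takes care of the 2-byte versus 4-byte unit difference between Windows and Linux.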

I have a specific question on that subject: is it mandatory to free the allocated memory at the end of the program?
I've run tests with and without freeing the memory and have not noticed any difference on Win XP.
The point is to avoid the freeing step at termination; the code would be simpler without that feature.
And if the program crashes for any reason, how can the automatic freeing be ensured?


To continue, I need some other information:
1_ The WStr function makes it easy to convert from an ANSI string, and the escape sequence !"\uXXYY" works too, but how do I enter a Unicode character > U+FFFF? It seems not to be possible directly with that escape sequence, at least on Windows.
Is it possible on Linux, and how?
Today, when the Unicode character is > U+FFFF, I convert it with a ubyte array; is that the only way?

2_ Direct Unicode input: today, with my French Win XP configuration, I can only simulate input from the keyboard.
How is it for languages using characters > FF and characters > FFFF: does the input arrive directly in the right Unicode encoding?
On Windows, in UTF-16 with surrogate pairs when needed?
On Linux, in UTF-32 or UTF-8?

3_ Can some of you post sample texts using Unicode characters, including characters > FFFF? I'm only using the ones in the examples/unicode folder of FB, and with these few samples it is not easy for me to produce test cases and measure the conversion time, the string manipulation, and so on.
Please post them here or send them to me directly at marpon@aliceadsl.fr
They can be in UTF-8, UTF-16 or UTF-32, as you prefer.

4_ Last point, a question for Linux users: what do you think a Unicode variable-length string should be:
internally encoded in UTF-8, or directly in UTF-32?
For Windows I decided to go with UTF-16 (direct to the API), even if I have to manage the surrogate pair story.

Your comments will be very appreciated.
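
Regarding point 1, one workaround that avoids the ubyte array is to compute the UTF-16 surrogate pair from the code point. This is only a sketch under the standard UTF-16 rules; the Sub name ToSurrogates is made up, and it assumes the caller passes a code point in the range U+10000..U+10FFFF.

```freebasic
'' Splits a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
Sub ToSurrogates(ByVal cp As ULong, ByRef hiUnit As UShort, ByRef loUnit As UShort)
    Dim As ULong v = cp - &h10000       '' 20 significant bits remain
    hiUnit = &hD800 + (v Shr 10)        '' top 10 bits go into the high surrogate
    loUnit = &hDC00 + (v And &h3FF)     '' low 10 bits go into the low surrogate
End Sub

Dim As UShort hiU, loU
ToSurrogates(&h1F600, hiU, loU)
Print Hex(hiU); " "; Hex(loU)           '' prints D83D DE00
```

On Windows the two units can then be written into consecutive positions of the 2-byte wstring buffer; on Linux, with 4-byte units, the code point can be stored as it is.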

Postby marpon » Nov 13, 2015 17:45

In the meantime ...

Here is the first version of the dynamic wstring: https://db.tt/vRc5Kx3j

Everything is in the .bi file,
plus 2 test files to evaluate it.

As you will notice, the internal encoding is
UTF-16 on Windows, for direct access to the Win API,
and UTF-32 on Linux. The Linux part is not tested at all; it is implemented to have cross-platform code.

The output is hard-coded:
for UTF-16, as UTF-16LE with BOM, the normal endian format for 16 bits
for UTF-32, as UTF-32BE with BOM, the normal endian format for 32 bits
for UTF-8, as UTF-8 with BOM

On input, the byte order is normally recognized by the function.

The #define __VERBOSE_MODE__, if uncommented, shows the alloc/reallocate/deallocate management as it happens.

I will soon open another topic in General or Tips & Tricks.

I'll be interested in your test results and comments, especially from people in Asian and Eastern European countries and from Linux users.

Thanks in advance

Postby caseih » Nov 14, 2015 1:59

I didn't see this until now, but to answer questions in your other post:

1. I have no idea
2. In Linux, on the console (stdin) it would be UTF-8. In graphics mode, it all depends on how the keyboard input is done. Currently it looks like the runtime uses X directly, and as near as I can tell there's zero support for non-ASCII layouts. I can switch to Arabic, and I get nothing at all in an FB app. If unicode is to be supported in the graphics runtime, that will require a ton of work.
3. Here's a link to a bunch of code points above the BMP. http://stackoverflow.com/questions/5567 ... actual-use
4. That's the million dollar question, isn't it. For your first attempt UTF-32 is the best way to go. Long term, if you're going to manage the possibility of surrogate pairs on Windows, you could do so on Linux too, and go with a UTF-16 encoding internally.

By the way my condolences to your country for the shocking attacks today.
