unicode variable length string type?

TeeEmCee
Posts: 375
Joined: Jul 22, 2006 0:54
Location: Auckland

Re: unicode variable length string type?

Post by TeeEmCee »

Cool! I haven't really looked at the API yet, but tried the demo.

However, many people (like me) can't use your library with the license you stated. Claiming that the code is freeware and then saying that you aren't allowed to distribute it as a .lib or sell it or modify it is totally contradictory. Why don't you release it as Free/Libre/Open Source Software (FLOSS) instead?

What encoding is DWSTR.inc in? It doesn't display correctly for me.

linux_test_1_dwstr.bas runs fine on 64-bit GNU/Linux, but crashes when compiled as 32-bit, on line "48 mydw111 = dw_str( "test")"

I had a look at your code and see that you seem to incorrectly write "bytes" everywhere when you mean "wide characters". For example

Code: Select all

m_BufferLen AS LONG = 0    		' Length in bytes of the current string in the buffer
That comment is false: the length is counted in wide characters, not bytes. Why?

Using valgrind, I tracked down the crash to ResizeBuffer being broken. It should read:

Code: Select all

		IF m_pBuffer THEN
			m_pBuffer = reAllocate(m_pBuffer, (nValue + 1) * _MY_SIZE_WSTRING_)
		else
			m_pBuffer = Allocate((nValue + 1) * _MY_SIZE_WSTRING_)
		END IF
I don't understand why it doesn't crash on other platforms. If I compile the program 64 bit, valgrind doesn't even produce any warnings! Maybe this is because integer/long are used incorrectly somewhere.
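To illustrate the bytes-versus-wide-characters point, here is a minimal sketch, not the DWSTR code itself. The helper name alloc_wchars is made up for this example. It only assumes that a buffer for n wide characters needs (n + 1) * sizeof(wstring) bytes, the same arithmetic as in the ResizeBuffer snippet above.

```freebasic
' Hypothetical helper, not part of DWSTR.inc.
' Returns a zeroed buffer able to hold nChars wide characters plus the terminator.
function alloc_wchars(byval nChars as integer) as wstring ptr
    ' element count = characters + 1 terminator; element size = one wide character
    return callocate(nChars + 1, sizeof(wstring))
end function

dim as wstring ptr buf = alloc_wchars(4)
if buf then
    *buf = "test"

    dim as integer nChars = len(*buf)                     ' length in wide characters
    dim as integer nBytes = (nChars + 1) * sizeof(wstring) ' bytes used, terminator included

    print "wide characters: "; nChars
    print "bytes in buffer: "; nBytes                     ' depends on sizeof(wstring)

    deallocate(buf)
end if
```

The two printed numbers differ by a factor that depends on the platform, which is why a length field documented as "bytes" but holding a character count is easy to misuse.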

In 32 bit builds, valgrind does give one more warning:

Code: Select all

==26910== Conditional jump or move depends on uninitialised value(s)
==26910==    at 0x403582C: wcslen (vg_replace_strmem.c:1660)
==26910==    by 0x804ED65: DWSTR::ADD(wchar_t*, long) (in /mnt/common/src/dwstr/linux_test_1_dwstr)
==26910==    by 0x804EAEF: DWSTR::operator=(wchar_t*&) (in /mnt/common/src/dwstr/linux_test_1_dwstr)
==26910==    by 0x804B62E: main (in /mnt/common/src/dwstr/linux_test_1_dwstr)
That probably indicates another bug.

You should remove all mentions of __FB_LINUX__ from your code, and replace them with __FB_UNIX__. __FB_LINUX__ should almost never be used. Even better would be to check "#if sizeof(wstring) = 4" instead of "#ifdef __FB_UNIX__" (I realise that you do want to check for UNIX in some places)

Why do you use _MY_SIZE_WSTRING_ instead of sizeof(wstring)?
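A rough sketch of the compile-time check suggested above, assuming (as the post implies) that sizeof is accepted in #if expressions. The names WCHAR_DESC and WCHAR_BYTES are invented for illustration and do not come from DWSTR.inc.

```freebasic
' Select behaviour from the actual size of a wide character
' instead of guessing it from the platform name.
#if sizeof(wstring) = 4
    const WCHAR_DESC = "4-byte wide characters"
#elseif sizeof(wstring) = 2
    const WCHAR_DESC = "2-byte wide characters"
#else
    const WCHAR_DESC = "unexpected wide character size"
#endif

' sizeof is resolved by the compiler, so this is an ordinary constant.
const WCHAR_BYTES = sizeof(wstring)

print WCHAR_DESC
print "sizeof(wstring) ="; WCHAR_BYTES
```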


Unfortunately, building a dynamic wstring type into FB is completely different and far more work than writing a dynamic wstring class in FB. Personally I no longer care too much about it, because I'm using UTF8 stored in STRINGs instead.
marpon
Posts: 342
Joined: Dec 28, 2012 13:31
Location: Paris - France

Re: unicode variable length string type?

Post by marpon »

@TeeEmCee

Thanks for your comments and debugging.

> What encoding is DWSTR.inc in? It doesn't display correctly for me.

It is ANSI, Windows codepage 1252.

> bytes
True, it's wchars; the comments are old ones (from when it was a ubyte buffer).

> ResizeBuffer
Thanks for noticing it, you are right.

> __FB_UNIX__
I'm not familiar with Unix/Linux, as I normally code only for Windows; what I want is to separate the Windows and Linux platforms clearly.
Do Unix and Linux behave exactly the same, especially regarding the charset? I don't know. I treat xx-XX.UTF-8 as the default charset for Linux,
and to do the conversion it is important to be sure.

> _MY_SIZE_WSTRING_ instead of sizeof(wstring)
It's just because it's a define to a constant; I imagine it could be faster than calling sizeof(wstring) every time. My quest for speed...

> the valgrind warning on 32 builds
I'm not familiar with valgrind, so I can't identify where it is in the code.

> freeware
It's clear: you can use it as it is, but it's not open source. It's free to use as an include file for FreeBASIC, nothing more.

You can create your executables with it, and distribute or sell these executables without any problem, as long as you use that include file to produce the executable.

But you are not allowed to make a lib from that code, or to distribute/sell it as a lib, without agreement.
Again, it's freeware, not open source. Sorry if it is not suitable for you as it is. I'm curious, what would you like to do with it?

As you said you are using 'UTF8 stored in STRINGs', you probably do not need it.
I can imagine you don't use Ucase/Lcase, Mid, reverse, replace, ... or even Len to work with code units; at least for that kind of function, wstrings are easier.

Again, thanks for your interest, comments and proposals. I hope that include file can ease some of your tasks.
fxm
Moderator
Posts: 12158
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: unicode variable length string type?

Post by fxm »

TeeEmCee wrote:

Code: Select all

		IF m_pBuffer THEN
			m_pBuffer = reAllocate(m_pBuffer, (nValue + 1) * _MY_SIZE_WSTRING_)
		else
			m_pBuffer = Allocate((nValue + 1) * _MY_SIZE_WSTRING_)
		END IF
A remark about the following line of code, which can also replace Allocate and, conveniently, Deallocate:
p = Reallocate(p, n)

- If pointer 'p' is null, then 'Reallocate' behaves identically to the instruction: 'p = Allocate(n)',
- otherwise (p > 0), if count 'n' is zero, then 'Reallocate' behaves identically to the two instructions: 'Deallocate(p)' followed by 'p = 0',
- otherwise (p > 0 and n > 0), normal reallocation.

Example:

Code: Select all

Dim As Zstring Ptr pz
Dim As Integer n
Print pz, "'" & *pz & "'"

n = 9+1
pz = Reallocate(pz, n)            '' as pz = 0, equivalent to: "pz = Allocate(n)"
*pz = "FreeBasic"
Print pz, "'" & *pz & "'"

n = 29+1
pz = Reallocate(pz, n)            '' as pz > 0 and n > 0, normal reallocation
*(pz+9) = " 1.06.0 for win32/64"
Print pz, "'" & *pz & "'"

n = 0
pz = Reallocate(pz, n)            '' as pz > 0 and n = 0, equivalent to: "Deallocate(pz)" plus "pz = 0"
Print pz, "'" & *pz & "'"

Sleep

Code: Select all

0             ''
2502456       'FreeBasic'
2502456       'FreeBasic 1.06.0 for win32/64'
0             ''
TeeEmCee
Posts: 375
Joined: Jul 22, 2006 0:54
Location: Auckland

Re: unicode variable length string type?

Post by TeeEmCee »

Oops, yes, I wasn't thinking about realloc. But I think the code needs more work; I would not be happy including messy code full of false comments (confusion between bytes and characters) in my project.

Was it really necessary to reimplement OPEN ... ENCODING?

GNU/Linux and other Unices more or less all behave the same, with some exceptions like Android, which has an incomplete and non-standard libc, especially when it comes to widechar strings. However I'm not sure whether they all use "UTF-8" in the locale name. I read "On some UNIX-type systems, non-standard names are used for encodings". I'm also not certain whether they all use 4-byte wstrings, but you can use "sizeof(wstring)" to test this instead of assuming (it seems a pretty safe assumption, though). As I wrote in the other thread, it's wrong to assume that the default encoding on Unix is UTF8, which your code does, in the Dw_Str function.

setlocale is very unpleasant, because it controls both external encodings of text (filenames, text printed to console), and internal encodings of strings in your program. Which is why I'm not using mbstowcs/etc at all, I need internal strings to be UTF8 regardless of what the external encoding is! That's also why I don't like wstring: no control over whether it's 16 bit or 32 bit.

Unfortunately, calling setlocale at runtime in Dwstr_To_Str2 isn't thread-safe, but I don't know how to fix that. mbstowcs_l, which takes the locale as an argument, isn't standard.
marpon wrote: it's clear you can use it as it is, it's not open source.
Sorry, I can't use it (if I had wanted to use wstrings), and neither can anyone else who writes free/open source software, such as FreeBASIC itself. Your license is incompatible with ALL open source/free software, by the OSI and FSF standard definitions of open source and of free software! They all disallow putting restrictions on selling the executables or source code.
marpon wrote: > the valgrind warning on 32 builds
I'm not familiar with valgrind, so I do not identify where it is in the code
Unfortunately valgrind doesn't print the line numbers; but maybe this gdb backtrace will help:

Code: Select all

(gdb) bt full
#0  0x0403582c in _vgr20370ZU_libcZdsoZa_wcslen (str=0x433f218) at ../shared/vg_replace_strmem.c:1660
#1  0x0804ed66 in DWSTR::ADD (THIS=..., PWSZSTR=<error reading variable>, NLEN=<error reading variable>) at /mnt/common/src/dwstr/DWSTR.inc:454
        NLENSTRING = 0
#2  0x0804eaf0 in DWSTR::operator= (THIS=..., PWSZSTR=<error reading variable>) at /mnt/common/src/dwstr/DWSTR.inc:334
No locals.
#3  0x0804b62f in main (__FB_ARGC__=<error reading variable>, __FB_ARGV__=<error reading variable>) at linux_test_1_dwstr.bas:105
        ...
        DW2 = {M_PBUFFER = 0x4339288, M_BUFFERLEN = 0, M_CAPACITY = 260, M_GROWSIZE = 260, M_FLAG = 0}
(gdb) frame 1
#1  0x0804ed66 in DWSTR::ADD (THIS=..., PWSZSTR=<error reading variable>, NLEN=<error reading variable>) at /mnt/common/src/dwstr/DWSTR.inc:454
454				nLenString = .LEN(*pwszStr)
(gdb) p PWSZSTR 
$1 = (uinteger *) 0x433f218
(gdb) p NLEN
$2 = -1
marpon wrote: it's just because it's a define to a constant; I imagine it could be faster than calling sizeof(wstring) every time, my quest for speed...
sizeof is always evaluated at compile time, so it costs nothing at runtime (even if you use inheritance and do sizeof(*baseclassptr), it doesn't figure out the size of the actual object at runtime).
marpon wrote: I can imagine you don't use Ucase/Lcase, mid, reverse, replace, ... even len, to play with unit codes; at least for that kind of function wstrings are easier.
If you want to handle surrogate pairs correctly, then UTF16 isn't really easier than UTF8.
Implementing LEN/MID/LEFT/RIGHT/ASC/CHR for UTF8 is not too hard, but almost always I care about the number of pixels long a string is (with variable-width fonts), or drawing just the right/left X pixels of it, instead of the number of characters.
You can just use normal LCASE/UCASE on UTF8 STRINGs if you only want them to work on ASCII characters; implementing them for unicode is far too complex and I don't need that.
Normal INSTR also works with UTF8. The result is bytes, not characters, but actually that's usually what you want anyway.
I don't think I've ever had to reverse a string. But there are many other operations, like excluding certain characters or replacing substrings or comparing strings. Actually, it looks like most of them will just work on UTF8. It's a clever encoding.
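As a small sketch of the LEN and INSTR remarks above (the function name utf8_len is made up for this example, and it assumes the STRING already holds valid UTF-8): counting code points only requires skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```freebasic
' Count code points in a UTF-8 encoded STRING.
' Assumes the input is valid UTF-8; no validation is done.
function utf8_len(byref s as const string) as integer
    dim as integer count = 0
    for i as integer = 0 to len(s) - 1
        ' Every byte that is not a continuation byte (10xxxxxx) starts a code point.
        if (s[i] and &hC0) <> &h80 then count += 1
    next
    return count
end function

' "naïve" with the ï written out as its two UTF-8 bytes, C3 AF
dim as string s = "na" & chr(&hC3, &hAF) & "ve"

print len(s)           ' 6: plain LEN counts bytes
print utf8_len(s)      ' 5: code points
print instr(s, "ve")   ' 5: INSTR reports a byte position, not a character position
```

The byte position from INSTR can be passed straight to MID or LEFT on the same STRING, which is one reason byte offsets are usually what is wanted.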
StringEpsilon
Posts: 42
Joined: Apr 09, 2015 20:49

Re: unicode variable length string type?

Post by StringEpsilon »

marpon wrote: its a freeware, a gift to the FreeBasic community!

1_ You, and everyone, is allowed to use it everywhere freely as an include file for freebasic,
2_ You, and everyone, can distribute it freely as it is : again as an include file for freebasic
3_ You are not allowed to create a lib file, nor to distribute/sell it in any lib form.
4_ You are not allowed to modify the code, nor to distribute/sell it modified

so, you are free to use it in all projects you do, if the intention is not only to bypass the points 3 or 4

Again, my intention is to simplify the use of wstring for FreeBasic coders.

If that class is incorporated as a native type for the compiler ,better !
I won't use your project for anything then, because the terms you laid out here are too vague and too restrictive. Not being able to modify it is a big deal (bug fixing, refactoring, maintenance, adding new features, improving performance, ...). The "no commercial use" doesn't bother me on its own, but that clause makes it incompatible with any FLOSS license. I couldn't use your code in fbJSON (MPL 2.0), even if I wanted to, without changing my license to something equally restrictive.

If you change your mind, have a look here: http://choosealicense.com/licenses/