UTF-8 Variable Length String Library

Postby Munair » Dec 06, 2017 16:58

For the IDE/RAD I'm working on for Linux, it is essential to have full UTF-8 support for all possible string operations. Fortunately, UTF-8 encoded text can be stored in ordinary variable-length strings. But one cannot use Left, Right, Mid etc. on those strings, because they count bytes rather than characters and can cut a multi-byte sequence in half.
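To illustrate the byte-versus-character problem, here is a minimal sketch. CountCodePoints is a hypothetical helper written for this example only; it is not part of the library and makes no claim about how the library counts characters. It assumes the string holds valid UTF-8.

```freebasic
' Sketch only: CountCodePoints is a hypothetical helper, not a library routine.
' It assumes s contains valid UTF-8.
function CountCodePoints(byref s as const string) as integer
    dim as integer n = 0
    for i as integer = 0 to len(s) - 1
        ' Bytes of the form 10xxxxxx are continuation bytes;
        ' every other byte starts a new code point.
        if (s[i] and &hC0) <> &h80 then n += 1
    next
    return n
end function

' "Sale €5,-" built from explicit bytes, so the source file encoding does not matter.
dim as string s = "Sale " & chr(&hE2, &h82, &hAC) & "5,-"

print len(s)               ' byte count: 11
print CountCodePoints(s)   ' code point count: 9
print len(left(s, 6))      ' 6 bytes: "Sale " plus only the first byte of the euro sign
```

The last line shows why the native Left is unsafe here: it returns six bytes, which ends in the middle of the three-byte euro sign.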

With a little help from the Lazarus project (Free Pascal) I was able to write basic string routines that support UTF-8. I decided to use the 'U' prefix to distinguish from normal string routines. So Left will be ULeft, Instr will be UInstr etc. It may be a bit confusing because of the existing 'unsigned' prefix, but we're talking string routines here so it should be OK. I couldn't think of anything more appropriate without going verbose.

Additional routines that are not native to FreeBASIC are included: ResizeStr, UIsAscii, URemove, UInsert, UReplace and UReplaceAll. The beauty here is that all functions, including these additions, can also be used on 'normal' or ASCII strings. This means that one can simply concatenate strings as always: str3 = str2 + str1 (provided each string contains valid ASCII or UTF-8 byte sequences), because the functions work out for themselves where the multi-byte encoded characters are.
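As a rough illustration of how a routine can locate multi-byte characters without any extra bookkeeping, the lead byte of each UTF-8 sequence already encodes its length. SeqLen and IsAsciiOnly below are hypothetical names for this sketch; the library's own UIsAscii may be implemented differently.

```freebasic
' Sketch only: SeqLen and IsAsciiOnly are hypothetical helpers, not library routines.

' Length in bytes of the UTF-8 sequence that starts with lead byte b,
' or 0 if b cannot start a sequence.
function SeqLen(byval b as ubyte) as integer
    if b < &h80 then return 1               ' 0xxxxxxx: ASCII
    if (b and &hE0) = &hC0 then return 2    ' 110xxxxx
    if (b and &hF0) = &hE0 then return 3    ' 1110xxxx
    if (b and &hF8) = &hF0 then return 4    ' 11110xxx
    return 0                                ' continuation byte or invalid lead byte
end function

' True when every byte is below &h80, i.e. the string is plain ASCII.
function IsAsciiOnly(byref s as const string) as boolean
    for i as integer = 0 to len(s) - 1
        if s[i] >= &h80 then return false
    next
    return true
end function

dim as string s = "a" & chr(&hE2, &h82, &hAC)   ' "a" followed by the euro sign
print SeqLen(s[0])   ' 1
print SeqLen(s[1])   ' 3
print SeqLen(s[2])   ' 0: continuation byte
```

Because ASCII bytes are valid one-byte UTF-8 sequences, the same scan handles plain ASCII strings and concatenations of ASCII and UTF-8 text alike.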

One routine that's missing is the UMid statement, i.e. the assignment form of Mid. Perhaps someone can help out there, but I doubt there is a real need for it. Meanwhile, you're welcome to test, use and improve the code. MLGPL License included. ;)
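For anyone who wants to experiment with the missing statement form, here is one possible starting point. ByteOffset and UMidAssign are hypothetical names, and the behaviour is deliberately simpler than the native Mid statement: the native statement never changes the length of the target, whereas this sketch replaces a range of code points with the whole replacement text. It assumes valid UTF-8 input.

```freebasic
' Sketch only: ByteOffset and UMidAssign are hypothetical, not library routines.

' 0-based byte offset of the code point at 1-based position cp,
' or len(s) if the string has fewer code points.
function ByteOffset(byref s as const string, byval cp as integer) as integer
    dim as integer n = 0
    for i as integer = 0 to len(s) - 1
        if (s[i] and &hC0) <> &h80 then   ' not a continuation byte
            n += 1
            if n = cp then return i
        end if
    next
    return len(s)
end function

' Replace count code points of s, starting at code point start, with newText.
sub UMidAssign(byref s as string, byval start as integer, byval count as integer, byref newText as const string)
    if start < 1 or count < 1 then exit sub
    dim as integer b0 = ByteOffset(s, start)
    if b0 >= len(s) then exit sub          ' start lies beyond the last code point
    dim as integer b1 = ByteOffset(s, start + count)
    s = left(s, b0) & newText & mid(s, b1 + 1)
end sub

dim as string s = "Sale " & chr(&hE2, &h82, &hAC) & "5,-"   ' "Sale €5,-"
UMidAssign(s, 6, 1, "$")
print s   ' Sale $5,-
```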

On a final note, I decided not to hard code the string type so it can be changed or renamed at any time. It also makes the code more readable IMO.

Download zip: http://www.basicstudio.org/downloads/fxutf8.bi.zip
Download tarball: http://www.basicstudio.org/downloads/fxutf8.bi.tar.gz
Last updated: 12 December 2017, 19:23 UTC
Last edited by Munair on Dec 12, 2017 19:25, edited 13 times in total.

Postby jj2007 » Dec 06, 2017 18:39

This looks very promising, compliments for your hard work!

Postby TeeEmCee » Dec 07, 2017 6:56

Cool!
I also used "type ustring as string" to denote a string containing UTF8 characters. I haven't gotten around to writing replacements for all the builtin string functions/statements yet, though.
I think UTF8 strings really should be added to FB as a native data type, with ustring overloads for all the string functions. Because wstrings are incomplete, terribly misnamed, and UTF16 on Windows, using them for Unicode support is pretty painful, so UTF8 seems the way to go in FB.

Postby St_W » Dec 07, 2017 15:44

I agree that UTF-8 (and ideally also UTF-16 and UTF-32) support should be added to FB, as Unicode is pretty common nowadays. Unfortunately your code is GPL licensed, so its usefulness is quite restricted (e.g. it is incompatible with the license used for FreeBASIC's internal runtime library).

Postby Munair » Dec 07, 2017 15:49

St_W wrote: I agree that UTF-8 (and ideally also UTF-16 and UTF-32) support should be added to FB, as Unicode is pretty common nowadays. Unfortunately your code is GPL licensed, so its usefulness is quite restricted (e.g. it is incompatible with the license used for FreeBASIC's internal runtime library).

Could you point me to that specific license? I perhaps mistakenly thought that "FreeBASIC is a multiplatform, free/open source (GPL) BASIC compiler."

Postby marcov » Dec 07, 2017 16:17

Munair wrote:
Could you point me to that specific license? I perhaps mistakenly thought that "FreeBASIC is a multiplatform, free/open source (GPL) BASIC compiler."


So is the compiler source of Free Pascal, but the libraries have a different license, which should have been near the top of lazutf8, by the way:

See the file COPYING.modifiedLGPL.txt, included in this distribution,
for details about the license.


which also applies to derivatives.

The modified (L) GPL is key here. The modification is so significant that the L doesn't matter anymore. See also http://wiki.freepascal.org/licensing

Note that even GCC does not license its runtime libraries under the plain GPL; they carry a runtime library exception.

Postby Munair » Dec 07, 2017 16:21

Thanks, marcov. I will look into it and adjust the license accordingly. I admit I overlooked it.

Postby St_W » Dec 07, 2017 16:25

Munair wrote: Could you point me to that specific license? I perhaps mistakenly thought that "FreeBASIC is a multiplatform, free/open source (GPL) BASIC compiler."

The compiler itself is GPL, but the runtime library is LGPL with an additional static linking exception. Otherwise every application created with FB would have to use the GPL or a compatible license. And as you may know, the GPL can be a real problem because of its restrictions and the resulting compatibility issues with other libraries.
You can find the license in FB's readme file:
https://github.com/freebasic/fbc/blob/master/readme.txt

Postby marcov » Dec 07, 2017 16:48

Munair wrote:
Thank marcov. I will look into it and adjust the license accordingly. I admit I overlooked it.


No problem. There is nothing wrong with taking source from FPC/Lazarus. It is just how it generally goes with OSS development tools: the libraries are licensed more permissively than the compiler, so that generated applications are not chained to the GPL.

Postby Munair » Dec 07, 2017 16:54

I overlooked the fact that the FB runtime library is LGPL. Although some of this library's code is "translated" from the Lazarus project, it also includes rewritten and newly written code. So I'm not sure what rules are in effect here. I would be happy to go with LGPL for this and future libraries if that doesn't conflict with other licenses so that it can be included in the FB project.

Postby Munair » Dec 07, 2017 18:06

License adjusted.

Postby Munair » Dec 07, 2017 22:35

Lowercase function added. See the first post for the updated source code and description. The MLGPL License can be found here: http://www.ditrianum.org/mlgpl.html and as text: http://www.ditrianum.org/mlgpl.txt

The Lowercase function was only tested on the standard ASCII capital letters &h41 to &h5A. There's no guarantee that all languages are properly covered. The Lazarus project got a few fixes in the process too. ;)
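For reference, the ASCII part of that range is the easy case. The sketch below (AsciiLower is a hypothetical name, not the library's Lowercase routine) only touches bytes &h41 to &h5A. All bytes of a multi-byte UTF-8 sequence are &h80 or higher, so they pass through unchanged; the flip side is that non-ASCII capitals such as Cyrillic letters are not converted at all.

```freebasic
' Sketch only: AsciiLower is a hypothetical helper, not the library's routine.
' It lowercases ASCII capitals and leaves every other byte untouched.
function AsciiLower(byref s as const string) as string
    dim as string r = s
    for i as integer = 0 to len(r) - 1
        ' &h41..&h5A is "A".."Z"; adding &h20 maps to "a".."z".
        if r[i] >= &h41 andalso r[i] <= &h5A then r[i] += &h20
    next
    return r
end function

' "Sale €5 EUR" with the euro sign given as explicit UTF-8 bytes.
dim as string s = "Sale " & chr(&hE2, &h82, &hAC) & "5 EUR"
print AsciiLower(s)   ' sale €5 eur (on a UTF-8 terminal)
```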

Postby jj2007 » Dec 08, 2017 19:44

Hi Munair,
You inspired me to implement uLeft$() and friends, see Unicode and UTF-8: Using non-Latin charsets in Assembler. Thanks ;-)

Postby Munair » Dec 08, 2017 20:24

jj2007 wrote:Hi Munair,
You inspired me to implement uLeft$() and friends, see Unicode and UTF-8: Using non-Latin charsets in Assembler. Thanks ;-)

Nice. Well done!

Postby Munair » Dec 08, 2017 20:37

UUCase is now included, which makes the library complete as far as basic string operations are concerned. See the updated source code in the first post of this thread.

Be sure to read the MLGPL license if you plan to use or modify the library: http://www.basicstudio.org/mlgpl.html Or download the text version if you wish: http://www.basicstudio.org/mlgpl.txt

Updated example code:

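Code:

#include "fxutf8.bi"  'assumed file name, taken from the download archive name

dim as ustring s, t

s = "Sale €5,-"
t = "€5"
print UIsAscii(s)            'not ASCII-only: the string contains "€"
print ULen(s)                'prints 9 (characters, not bytes)
print UInstr(s, t)           'prints 6
print UMid(s, 6, 4)          'prints "€5,-"
print ULeft(s, 4)            'prints "Sale"
print UIsAscii(ULeft(s, 4))  '"Sale" is plain ASCII
print URight(s, 4)           'prints "€5,-"
print URemove(s, 6, 4)       'prints "Sale " (characters 6 to 9 removed)
s = UInsert(s, "€ ", 6)
print s
print len(s)                 'byte length, for comparison with ULen
UReplaceAll(s, "€", "$")     'replace every "€" with "$"
print s
print len(s)
UReplace(s, "$", "€")        'replace "$" with "€"
print s
print len(s)
s = "ПРИВЕТ"
print s
s = ULCase(s)
print s
print UUCase(s)
sleep
end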
