UTF-8 Variable Length String Library

Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands


Post by Munair »

With a little help from the Lazarus project (Free Pascal) I was able to write basic string routines that support UTF-8. I decided to use the 'U' prefix to distinguish from normal string routines. So Left will be ULeft, Instr will be UInstr etc. It may be a bit confusing because of the existing 'unsigned' prefix, but we're talking string routines here so it should be OK. I couldn't think of anything more appropriate without going verbose.

The library also includes routines that are not native to FreeBASIC: ResizeStr, UIsAscii, URemove, UInsert, UReplace and UReplaceAll. The beauty here is that all functions, including these additions, can also be used on 'normal' or ASCII strings. This means that one can simply concatenate strings as always: str3 = str2 + str1 (provided each string contains valid ASCII or UTF-8 byte sequences), because the functions find out for themselves where the multi-byte encoded characters are.
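To illustrate how a routine can find the multi-byte characters on its own, here is a minimal sketch of a character count. It is not the library's actual code: the names SketchCharBytes and SketchULen are made up for this example, and it assumes the input is valid UTF-8.

```freebasic
' Hypothetical sketch, NOT the library's implementation.
type ustring as string

' Byte length of the UTF-8 sequence that starts with lead byte b.
' Assumes valid UTF-8; anything unexpected is treated as a single byte.
function SketchCharBytes(byval b as ubyte) as integer
    if b < &h80 then return 1               ' 0xxxxxxx: plain ASCII
    if (b and &hE0) = &hC0 then return 2    ' 110xxxxx
    if (b and &hF0) = &hE0 then return 3    ' 1110xxxx
    if (b and &hF8) = &hF0 then return 4    ' 11110xxx
    return 1
end function

' Counts characters (code points) instead of bytes.
function SketchULen(byref s as ustring) as integer
    dim as integer i = 0, n = 0
    while i < len(s)
        i += SketchCharBytes(s[i])   ' jump to the next lead byte
        n += 1
    wend
    return n
end function

' The Euro sign is written as its three UTF-8 bytes so the example
' does not depend on the encoding of the source file.
dim as ustring s = "Sale " + chr(&hE2, &h82, &hAC) + "5,-"
print len(s)          ' bytes: 11
print SketchULen(s)   ' characters: 9
```

Because every ASCII character is a one-byte sequence, the same loop works unchanged on plain ASCII strings.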

One routine that's missing is the UMid statement. Perhaps someone can help out there, but I doubt there is a real need for it. Meanwhile, you're welcome to test, use and improve the code. MLGPL License included. ;)

On a final note, I decided not to hard code the string type so it can be changed or renamed at any time. It also makes the code more readable IMO.
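A minimal sketch of what such an alias looks like, assuming it is declared as a plain type alias (the exact declaration in the library header may differ):

```freebasic
' Assumption: the library declares its string type roughly like this.
type ustring as string

dim as ustring a = "abc"
dim as string b = a + "def"   ' the alias and the native type mix freely
print b        ' abcdef
print len(a)   ' 3 (the native len still counts bytes)
```

Changing the alias in one place would then switch the type used throughout the library.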

Although the files are on the SharpBASIC server, the code is written in FreeBASIC and tested with fbc 1.08.

Download zip: https://sharpbasic.com/downloads/utf8.zip
Download tarball: https://sharpbasic.com/downloads/utf8.tar.gz
Last updated: 10 January 2022
Last edited by Munair on Jan 11, 2022 7:32, edited 20 times in total.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: UTF-8 Variable Length String Library

Post by jj2007 »

This looks very promising, compliments on your hard work!
TeeEmCee
Posts: 375
Joined: Jul 22, 2006 0:54
Location: Auckland

Re: UTF-8 Variable Length String Library

Post by TeeEmCee »

Cool!
I also used "type ustring as string" to denote a string containing UTF-8 characters. I haven't gotten around to writing replacements for all the builtin string functions/statements yet, though.
I think UTF-8 strings really should be added to FB as a native data type, with ustring overloads for all the string functions. Since wstrings are incomplete, terribly misnamed, and UTF-16 on Windows, it's pretty painful to use them for Unicode support, and UTF-8 seems the way to go in FB.
St_W
Posts: 1619
Joined: Feb 11, 2009 14:24
Location: Austria

Re: UTF-8 Variable Length String Library

Post by St_W »

I agree that UTF-8 (and ideally also UTF-16 and UTF-32) support should be added to FB, as Unicode is pretty common nowadays. Unfortunately your code is GPL licensed and thus its usefulness is quite restricted (e.g. it is incompatible with the license used for FreeBASIC's internal runtime library).
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

St_W wrote:I agree that UTF-8 (and ideally also UTF-16 and UTF-32) support should be added to FB, as Unicode is pretty common nowadays. Unfortunately your code is GPL licensed and thus its usefulness is quite restricted (e.g. it is incompatible with the license used for FreeBASIC's internal runtime library).
Could you point me to that specific license? I perhaps mistakenly thought that "FreeBASIC is a multiplatform, free/open source (GPL) BASIC compiler."
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by marcov »

Munair wrote:
Could you point me to that specific license? I perhaps mistakenly thought that "FreeBASIC is a multiplatform, free/open source (GPL) BASIC compiler."
So is the compiler source of Free Pascal, but the libraries have a different license. Which should have been near the top of lazutf8 btw:
See the file COPYING.modifiedLGPL.txt, included in this distribution,
for details about the license.
which also applies to derivatives.

The modified (L) GPL is key here. The modification is so significant that the L doesn't matter anymore. See also http://wiki.freepascal.org/licensing

Note that even GCC does not license its libraries under the GPL.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

marcov wrote:
So is the compiler source of Free Pascal, but the libraries have a different license. Which should have been near the top of lazutf8 btw:
See the file COPYING.modifiedLGPL.txt, included in this distribution,
for details about the license.
which also applies to derivatives.

The modified (L) GPL is key here. The modification is so significant that the L doesn't matter anymore. See also http://wiki.freepascal.org/licensing

Note that even GCC does not license its libraries under the GPL.
Thanks, marcov. I will look into it and adjust the license accordingly. I admit I overlooked it.
St_W
Posts: 1619
Joined: Feb 11, 2009 14:24
Location: Austria

Re: UTF-8 Variable Length String Library

Post by St_W »

Munair wrote:Could you point me to that specific license? I perhaps mistakenly thought that "FreeBASIC is a multiplatform, free/open source (GPL) BASIC compiler."
The compiler itself is GPL, but the runtime library is LGPL with an additional static linking exception. Otherwise every application created with FB would have to use the GPL or a compatible license. As you may know, the GPL can be a real problem because of its restrictions and the resulting compatibility issues with other libraries.
You can find the license in FB's readme file:
https://github.com/freebasic/fbc/blob/master/readme.txt
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by marcov »

Munair wrote:
Thanks, marcov. I will look into it and adjust the license accordingly. I admit I overlooked it.
No problem. There is nothing wrong with taking the source from FPC/Lazarus. It is just how it generally goes with OSS development tools: the libraries are licensed more freely than the compiler, so that generated applications are not chained to the GPL.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

I overlooked the fact that the FB runtime library is LGPL. Although some of this library's code is "translated" from the Lazarus project, it also includes rewritten and newly written code. So I'm not sure what rules are in effect here. I would be happy to go with LGPL for this and future libraries if that doesn't conflict with other licenses so that it can be included in the FB project.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

License adjusted.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

Lowercase function added. See the first post for the updated source code and description. The MLGPL License can be found here: http://www.ditrianum.org/mlgpl.html and as text: http://www.ditrianum.org/mlgpl.txt

The Lowercase function was only tested on the standard ASCII capital letters &h41 to &h5A. There's no guarantee that all languages are properly covered. The Lazarus project got a few fixes in the process too. ;)
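For reference, the ASCII range mentioned above is the easy part. The following is a hypothetical ASCII-only sketch (SketchAsciiLower is a made-up name, not the library's ULCase) that shows why it is safe to handle that range byte by byte in a UTF-8 string:

```freebasic
' Hypothetical ASCII-only sketch, NOT the library's ULCase.
function SketchAsciiLower(byref s as const string) as string
    dim as string r = s
    for i as integer = 0 to len(r) - 1
        ' Only bytes &h41..&h5A ("A".."Z") are changed. Bytes of multi-byte
        ' UTF-8 sequences are all >= &h80, so they are never touched.
        if r[i] >= &h41 andalso r[i] <= &h5A then r[i] = r[i] + &h20
    next
    return r
end function

' The Euro sign is written as its three UTF-8 bytes.
dim as string euro = chr(&hE2, &h82, &hAC)
print SketchAsciiLower("SALE " + euro + "5,-")
```

Letters outside ASCII need per-language mapping tables, which is where incomplete coverage is possible.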
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: UTF-8 Variable Length String Library

Post by jj2007 »

Hi Munair,
You inspired me to implement uLeft$() and friends, see Unicode and UTF-8: Using non-Latin charsets in Assembler. Thanks ;-)
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

jj2007 wrote:Hi Munair,
You inspired me to implement uLeft$() and friends, see Unicode and UTF-8: Using non-Latin charsets in Assembler. Thanks ;-)
Nice. Well done!
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

UUCase is now included, which makes the library complete as far as basic string operations are concerned. See the updated source code in the first post of this thread.

Be sure to read the MLGPL license if you plan to use or modify the library: http://www.basicstudio.org/mlgpl.html Or download the text version if you wish: http://www.basicstudio.org/mlgpl.txt

Updated example code:

Code: Select all

dim as ustring s, t

s = "Sale €5,-"
t = "€5"
print UIsAscii(s)
print ULen(s) 'prints 9
print UInstr(s, t) 'prints 6
print UMid(s, 6, 4) 'prints "€5,-"
print ULeft(s, 4) 'prints "Sale"
print UIsAscii(ULeft(s, 4))
print URight(s, 4) 'prints "€5,-"
print URemove(s, 6, 4) 'prints "Sale " (with trailing space)
s = UInsert(s, "€ ", 6)
print s
print len(s)
UReplaceAll(s, "€", "$")
print s
print len(s)
UReplace(s, "$", "€")
print s
print len(s)
s = "ПРИВЕТ"
print s
s = ULCase(s)
print s
print UUCase(s)
sleep
end