UTF-8 Variable Length String Library

Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

A new update can be downloaded. See the links in the first post of this thread. The code is becoming too large (1700+ lines) to share inline. ;-)

Added:
- UAsc() returns the Unicode value of a UTF-8 codepoint
- UChr() returns the UTF-8 codepoint for a Unicode value
- UIvstr() returns True if a string is valid UTF-8
- URepair() tries to repair a UTF-8 string

Note: the functions UAsc() and UChr() will generate an "illegal function call" error when invalid codepoints are passed to them. So before using these functions on a UTF-8 string, be sure to have it checked first. See the following example on how to do that.

UIvstr() / URepair() example:

Code: Select all

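#include "utf8.bi" ' library header; file name as mentioned later in this thread

dim as ustring s
's = chr(&hE2, &h82, &hAC) 'EURO sign (U+20AC) as UTF-8 bytes
s = chr(&hC0) 'invalid UTF-8: a lead byte without a continuation byte
if UIvstr(s) then
	print "valid UTF-8 string"
else
	URepair(s) 'attempt a repair, then validate again
	if UIvstr(s) then
		print "UTF-8 string repaired"
	else
		print "UTF-8 string cannot be repaired"
	end if
end if
end
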
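For readers wondering why chr(&hC0) fails the check: &hC0 announces a two-byte sequence, but nothing follows it (and C0 could only ever start an overlong encoding anyway). Below is a minimal sketch of such a validity test on a plain FreeBASIC string. It is not the library's UIvstr() implementation; the name IsValidUtf8 is made up for this illustration.

```freebasic
' Hypothetical helper, not part of the library: checks a byte string
' for well-formed UTF-8.
function IsValidUtf8(byref s as const string) as boolean
	dim as integer i = 0, n = len(s)
	while i < n
		dim as ubyte b = s[i]
		dim as integer extra      ' number of continuation bytes expected
		dim as ulong cp           ' decoded code point
		dim as ulong lowest       ' smallest value allowed for this length
		if b < &h80 then
			extra = 0 : cp = b : lowest = 0
		elseif (b and &hE0) = &hC0 then
			extra = 1 : cp = b and &h1F : lowest = &h80
		elseif (b and &hF0) = &hE0 then
			extra = 2 : cp = b and &h0F : lowest = &h800
		elseif (b and &hF8) = &hF0 then
			extra = 3 : cp = b and &h07 : lowest = &h10000
		else
			return false          ' stray continuation byte or invalid lead byte
		end if
		if i + extra >= n then return false   ' truncated sequence
		for k as integer = 1 to extra
			dim as ubyte c = s[i + k]
			if (c and &hC0) <> &h80 then return false
			cp = (cp shl 6) or (c and &h3F)
		next
		if cp < lowest then return false                    ' overlong encoding
		if cp > &h10FFFF then return false                  ' beyond the Unicode range
		if cp >= &hD800 and cp <= &hDFFF then return false  ' UTF-16 surrogate
		i += extra + 1
	wend
	return true
end function

print IsValidUtf8(chr(&hC0))              ' input: lone lead byte
print IsValidUtf8(chr(&hE2, &h82, &hAC))  ' input: EURO sign as UTF-8
```

A check like this only tells you whether the bytes are well-formed; what a repair step should do with the bad bytes (drop them, replace them) is a separate design decision.
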
UAsc() / UChr() example:

Code: Select all

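#include "utf8.bi" ' library header; file name as mentioned later in this thread

dim as uinteger a
dim as ustring c
a = &h20AC 'Unicode value of the EURO sign
c = UChr(a) 'encode the value as a UTF-8 codepoint
print c
a = UAsc(c) 'decode the codepoint back to its Unicode value
print hex(a)
end
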
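To show what the conversion from a Unicode value to UTF-8 bytes involves, here is a minimal sketch using only plain FreeBASIC strings. It is not the library's UChr(): the name EncodeUtf8 is hypothetical, and unlike UChr() it returns an empty string for invalid input instead of raising an error.

```freebasic
' Hypothetical helper, not the library's UChr(): encodes one Unicode
' code point as a UTF-8 byte string. Returns "" for invalid input.
function EncodeUtf8(byval cp as ulong) as string
	if cp < &h80 then
		return chr(cp)
	elseif cp < &h800 then
		return chr(&hC0 or (cp shr 6), &h80 or (cp and &h3F))
	elseif cp < &h10000 then
		if cp >= &hD800 and cp <= &hDFFF then return ""   ' surrogates are not encodable
		return chr(&hE0 or (cp shr 12), &h80 or ((cp shr 6) and &h3F), &h80 or (cp and &h3F))
	elseif cp <= &h10FFFF then
		return chr(&hF0 or (cp shr 18), &h80 or ((cp shr 12) and &h3F), _
		           &h80 or ((cp shr 6) and &h3F), &h80 or (cp and &h3F))
	end if
	return ""   ' beyond U+10FFFF
end function

dim as string euro = EncodeUtf8(&h20AC)
' Print the bytes in hex: the EURO sign is E2 82 AC in UTF-8.
for i as integer = 0 to len(euro) - 1
	print hex(euro[i], 2); " ";
next
print
```
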
TeeEmCee
Posts: 375
Joined: Jul 22, 2006 0:54
Location: Auckland

Re: UTF-8 Variable Length String Library

Post by TeeEmCee »

Awesome; I'll just use these rather than continue my own project. However, having support built into FB would still be a huge improvement, for type checking.

Why did you make this a single .bi file rather than a .bas and .bi pair? Anyone who wants to use it will have to manually split it into two files.

Also, you have a lot of mixed tabs and spaces.

So much of this code was translated from Pascal? Are there C versions of any of these functions? (I do actually recognise a couple of simple functions as being from cutef8, which is public domain.) I'm asking for the purpose of adding these functions to FB's runtime library.
Nouba
Posts: 7
Joined: Sep 07, 2014 4:12

Re: UTF-8 Variable Length String Library

Post by Nouba »

@TeeEmCee,

there is a github project utf8string (MIT licensed) with a similar set of functions. Translating the headers with fbfrog works fine for me under Windows 7.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

TeeEmCee wrote:Why did you make this a single .bi file rather than a .bas and .bi pair? Anyone who wants to use it will have to manually split it into two files.
You can add support to your project by simply including the BI file (at the top). No need to split the procedures to a separate BAS file. The compiler sees BAS files as separate modules with individual name spaces.
TeeEmCee wrote:Also, you have a lot of mixed tabs and spaces.
That would have to be Geany not changing according to settings (when setting from tab to spaces). I will check this. Default indentation is 2, so if you set your indent to 2 in your editor, it should be fine.
TeeEmCee wrote:So much of this code was translated from Pascal? Are there C versions of any of these functions? (I do actually recognise a couple of simple functions as being from cutef8, which is public domain.) I'm asking for the purpose of adding these functions to FB's runtime library.
Yes, some UTF-8-specific functions were ported from the Lazarus project (lazutf8.pas). In addition, I wrote some functions and adjusted things to make it more 'BASIC'. No other source was used as an example.
Last edited by Munair on Dec 13, 2017 8:27, edited 1 time in total.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

Nouba wrote:there is a github project utf8string (MIT licensed) with a similar set of functions. Translating the headers with fbfrog works fine for me under Windows 7.
It is my opinion that one should use as much native FreeBASIC code as possible. It is far easier to maintain (fixing bugs, extending functionality, etc.) and it allows for BASIC-like structuring and naming, which is easier to use. For example, the library you point to uses a word-by-word buffer that needs to be allocated and deallocated explicitly. I, for one, have no desire for that kind of structuring in my projects. It is much more error-prone (memory leaks, etc.).

One of the few reasons to use C libraries is when having to deal with specific bindings, like with GTK or Qt (C++). Other than that I may sneak in a few CRT functions like MEMCMP or MEMCHR, but that's it.
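
As an illustration of what "sneaking in" a CRT function looks like, here is a small sketch that calls memcmp and memchr on the bytes of ordinary FreeBASIC strings through the crt/string.bi header. The variable names are made up for the example, and it says nothing about how the library itself uses these functions.

```freebasic
#include once "crt/string.bi"   ' FreeBASIC's binding for the C <string.h> functions

dim as string s = "caf" + chr(&hC3, &hA9)   ' "café" as UTF-8 bytes
dim as string prefix = "caf"

' memcmp compares raw bytes; 0 means the first len(prefix) bytes are equal
if memcmp(strptr(s), strptr(prefix), len(prefix)) = 0 then
	print "s starts with 'caf'"
end if

' memchr finds the first occurrence of a byte value, here the lead byte &hC3
dim as ubyte ptr p = memchr(strptr(s), &hC3, len(s))
if p <> 0 then
	print "lead byte found at byte offset"; p - cast(ubyte ptr, strptr(s))
end if
```

Both calls work on bytes, not codepoints, so any offset they yield is a byte offset into the UTF-8 data.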
Last edited by Munair on Dec 13, 2017 8:41, edited 1 time in total.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

Indentation (TABS/Spaces) issue should be solved. Archives are updated.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: UTF-8 Variable Length String Library

Post by jj2007 »

Nouba wrote:there is a github project utf8string (MIT licensed) with a similar set of functions.
"A C library to manipulate UTF-8 strings. It uses the external library utf8proc".

A library based on a library based on a library... ;-)
Nouba
Posts: 7
Joined: Sep 07, 2014 4:12

Re: UTF-8 Variable Length String Library

Post by Nouba »

Strictly speaking, the GitHub project implements the code of the utf8proc library and does not depend on it as an external library (why should one reinvent the wheel if it's already done). From the user's point of view (only speaking for me), I do not really care if I link a handful of bas files or a library into my program. The main thing is that the code used is reliable, and that seems to be true, at least for utf8proc, which is also being used by many other trustworthy projects.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

Nouba wrote:and does not depend on it as an external library
Yes it does. There are library functions in utf8proc, such as scanning for grapheme clusters, that are not supported by the library you pointed to. In other words, procedures are shipped with the library that are never used.
Nouba wrote:(why should one reinvent the wheel if it's already done).
The Unicode project is a moving target. New developments may require code maintenance. It's better not to be dependent on third parties in order to see specific changes applied.

There are several implementations of Unicode out there, each in their own language with their own language specifics, for several reasons. ;-)
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

An updated version of the library can be downloaded. See the links in the first post of this thread.

Some code has been optimized and some local identifiers were renamed to conform to Unicode terminology. The archive now contains working examples (examples.bas). Just compile and run. ;-)
TeeEmCee
Posts: 375
Joined: Jul 22, 2006 0:54
Location: Auckland

Re: UTF-8 Variable Length String Library

Post by TeeEmCee »

Munair wrote:
TeeEmCee wrote:Why did you make this a single .bi file rather than a .bas and .bi pair? Anyone who wants to use it will have to manually split it into two files.
You can add support to your project by simply including the BI file (at the top). No need to split the procedures to a separate BAS file. The compiler sees BAS files as separate modules with individual name spaces.
If you import the file into more than one .bas, then when you link them you will get linker errors like "multiple definition of `ULEN'". So it's necessary to create a header file manually.
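
The reason is that every module that includes a file containing procedure bodies compiles its own copy of them, and the linker then sees the same symbol more than once. A minimal sketch of the usual split follows; the file names mylib.bi / mylib.bas and the function ByteLen are hypothetical and only serve to show the layout. Pasted into one file the listing still compiles; for real use, cut it at the markers and enable the commented #include lines.

```freebasic
' ---- mylib.bi (hypothetical header: declarations only) ----
declare function ByteLen(byref s as const string) as integer

' ---- mylib.bas (hypothetical module: the one and only definition) ----
' #include once "mylib.bi"
function ByteLen(byref s as const string) as integer
	return len(s)
end function

' ---- main.bas (any number of modules may include the header) ----
' #include once "mylib.bi"
print ByteLen("abc")

' Build both modules together, e.g.: fbc main.bas mylib.bas
```
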
Munair wrote:The Unicode project is a moving target. New developments may require code maintenance. It's better not to be dependent on third parties in order to see specific changes applied.
To play devil's advocate: Yes. If you used a popular library like utf8proc, then whenever the Unicode standard is updated (which, e.g., expands the normalisation rules and the tables of upper/lower case characters) I could just slot in a new version of utf8proc rather than wait for utf8.bi to be updated.

In fact, I went down the route of cobbling together my own set of UTF-8 routines and custom minimal normalisation because I didn't care about all the complexities of Unicode and don't care to do normalisation or upper/lowercasing of Malay script. A single small .bi file is a lot nicer.
I care about UTF-8 as an encoding and for compatibility, not Unicode as a way of representing text (it's a nightmare for that).
Nouba wrote:there is a github project utf8string (MIT licensed) with a similar set of functions. Translating the headers with fbfrog works fine for me under Windows 7.
Thanks. Seems a lot more relevant than utf8proc.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

New update can be downloaded. See links in the first post of this thread.

- Added: UReverse()
- Fix: trailing space character with ULCase()

Procedures and declarations are now in two separate files.
StringEpsilon
Posts: 42
Joined: Apr 09, 2015 20:49

Re: UTF-8 Variable Length String Library

Post by StringEpsilon »

Cool project (I started something similar last year). And props for tackling the hard stuff proper! I chickened out and just worked on dealing with codepoints.

From glancing over your code, it seems your reversal function does not preserve grapheme clusters. See this article.

Code: Select all

s = UReverse("mañana")
print s ' outputs: anãnam - should be anañam

I found a reversal function written in Rust that does the job, though it's not license-compatible.

Also, if you want a function for decoding \u-notation, you can take a peek in fbJson. If you want to, I'll dual-licence the functions you need under the mlgpl, but the MPL should be compatible anyway.

Edit: Also, UUCase("ß") returns "SS". That's not the only valid transformation anymore. ẞ is also valid and the preferred transformation for proper names.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

StringEpsilon wrote:From glancing over your code, it seems your reversal function does not preserve grapheme clusters.
Correct. Full grapheme cluster support is A LOT of work, involving the registration of diacritical marks (there are a lot of them, and new ones are added all the time). In this respect Unicode is quite complicated. The code that ICU develops to cover all rules is some 15MB in size. It is easiest to use composed characters, but unfortunately the use of separate diacritical marks is encouraged. So for now, the library primarily supports codepoints, rather than grapheme clusters. But I will dive into this when I have more time. ;-)
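
To make the difference concrete, here is a rough sketch of a reversal that keeps a base character together with the combining marks that follow it. It is deliberately limited: it only knows the Combining Diacritical Marks block (U+0300..U+036F), assumes valid UTF-8 input, and is nowhere near the full segmentation rules of Unicode Standard Annex #29. All function names are hypothetical, and this is not how UReverse() works.

```freebasic
' Hypothetical helper: byte length of the UTF-8 sequence that starts with lead byte b.
function SeqLen(byval b as ubyte) as integer
	if b < &h80 then return 1
	if (b and &hE0) = &hC0 then return 2
	if (b and &hF0) = &hE0 then return 3
	if (b and &hF8) = &hF0 then return 4
	return 1   ' invalid byte: treat as a single unit
end function

' True if a combining mark from U+0300..U+036F (bytes CC 80 .. CD AF) starts at s[i].
function IsCombiningAt(byref s as const string, byval i as integer) as boolean
	if i + 1 >= len(s) then return false
	if s[i] = &hCC then return true                           ' U+0300..U+033F
	if s[i] = &hCD andalso s[i + 1] <= &hAF then return true  ' U+0340..U+036F
	return false
end function

' Reverses s cluster by cluster, where a cluster is one code point
' plus any combining marks from the block above that follow it.
function ReverseKeepMarks(byref s as const string) as string
	dim as string reversed
	dim as integer i = 0, n = len(s)
	while i < n
		dim as integer clusterStart = i
		i += SeqLen(s[i])              ' base code point
		while IsCombiningAt(s, i)      ' plus any marks that follow it
			i += 2
		wend
		if i > n then i = n            ' guard against a truncated tail
		reversed = mid(s, clusterStart + 1, i - clusterStart) + reversed
	wend
	return reversed
end function

' "mañana" with a decomposed ñ: n followed by U+0303 (bytes CC 83)
dim as string manana = "man" + chr(&hCC, &h83) + "ana"
print ReverseKeepMarks(manana)
```

Scripts that build clusters from other kinds of codepoints (Hangul jamo, emoji sequences, and so on) need the real rule tables, which is where the bulk of the work mentioned above comes from.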
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

StringEpsilon wrote:Edit: Also, UUCase("ß") returns "SS". That's not the only valid transformation anymore. ẞ is also valid and the preferred transformation for proper names.
Being half German myself, I have never seen a good use of capital "ß". It's a rule, yes. I'll see if I can adjust it. But double S is still widely used, both lower and upper case.
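
One way to offer both transformations is to let the caller choose. The sketch below only handles the ß itself (it does not uppercase anything else), works on the raw UTF-8 bytes of a plain string, and uses a hypothetical function name and parameter; it is not how UUCase() is implemented.

```freebasic
' Hypothetical helper: replaces every ß (U+00DF, bytes C3 9F) in a UTF-8 string
' with "SS" or, on request, with capital ẞ (U+1E9E, bytes E1 BA 9E).
function UpperSharpS(byref s as const string, byval useCapitalEszett as boolean = false) as string
	dim as string sharpS = chr(&hC3, &h9F)
	dim as string replacement = "SS"
	if useCapitalEszett then replacement = chr(&hE1, &hBA, &h9E)

	dim as string result
	dim as integer i = 1                 ' 1-based position where the next search starts
	do
		dim as integer p = instr(i, s, sharpS)
		if p = 0 then exit do
		result += mid(s, i, p - i) + replacement
		i = p + 2                        ' skip the two bytes of ß
	loop
	return result + mid(s, i)            ' append whatever follows the last ß
end function

dim as string strasse = "Stra" + chr(&hC3, &h9F) + "e"
print UpperSharpS(strasse)          ' ß replaced by SS
print UpperSharpS(strasse, true)    ' ß replaced by capital ẞ
```

Searching for the two bytes directly is safe in valid UTF-8, because &hC3 can only appear as a lead byte.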

Keep in mind that most software has difficulty in handling Unicode properly, including web browsers. This becomes apparent with fancy characters. It's because it's quite complicated and costly (performance wise) to tackle all problems.

Looking at your link, I can see you've been busy too. :-)