UTF-8 Variable Length String Library

jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: UTF-8 Variable Length String Library

Post by jj2007 »

Munair wrote: - Added: UReverse()
You inspired me, thanks ;-)
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by marcov »

Munair wrote: The code that ICU develops to cover all rules is some 15MB in size. It is easiest to use composed characters, but unfortunately the use of separate diacritical marks is encouraged.
And afaik on OS X, the system APIs expect and return decomposed (denormalized) strings.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

With the recent discussion of how GUI libraries targeting multiple platforms usually lead to bloatware, it may be interesting to note that support for today's Unicode and for converting between codepages can easily 'blow up' an executable to 230 kB. Here is a piece of code that I've been testing to demonstrate some common tasks with UTF-8:

Code:

#include once "encodings.bi"

dim buffer as string
dim s as string

' UTF-32 file
open "textfile.txt" for input as #1
  line input #1, buffer
close #1

' decode the raw file contents to UTF-8
s = Encodings.Decode(buffer)
if Encodings.Invalid then
  print "Binary file."
  end
end if

' convert UTF-8 to UTF-16 and write to file
buffer = Encodings.EncodeUTF16BE(s)
open "textfile16.txt" for output as #1
  print #1, chr(&hFE, &hFF) + buffer
close #1

' convert UTF-8 to UTF-32 and write to file
buffer = Encodings.EncodeUTF32BE(s)
open "textfile32.txt" for output as #1
  print #1, chr(&h0, &h0, &hFE, &hFF) + buffer
close #1
end

#include once "encodings.bas"
It may not be something that some of us westerners realize, accustomed as we are to good old plain ASCII. But Unicode has become a standard and should be supported by any application dealing with exchanging text.
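
As a side note to the byte order marks written in the example above, here is a minimal sketch of how a reader could guess the encoding from those same marks. DetectBOM is a hypothetical helper and not part of encodings.bi; it only looks at the first bytes and does not validate the rest of the data.

```freebasic
' Hypothetical helper, not part of encodings.bi: guess an encoding from the
' byte order mark (BOM) at the start of a raw byte buffer.
function DetectBOM(byref buffer as const string) as string
  dim as integer n = len(buffer)

  ' Test the 4-byte UTF-32 marks first: the UTF-32LE mark (FF FE 00 00)
  ' begins with the same two bytes as the UTF-16LE mark (FF FE).
  if n >= 4 then
    if buffer[0] = &h00 andalso buffer[1] = &h00 andalso _
       buffer[2] = &hFE andalso buffer[3] = &hFF then return "UTF-32BE"
    if buffer[0] = &hFF andalso buffer[1] = &hFE andalso _
       buffer[2] = &h00 andalso buffer[3] = &h00 then return "UTF-32LE"
  end if
  if n >= 3 then
    if buffer[0] = &hEF andalso buffer[1] = &hBB andalso _
       buffer[2] = &hBF then return "UTF-8"
  end if
  if n >= 2 then
    if buffer[0] = &hFE andalso buffer[1] = &hFF then return "UTF-16BE"
    if buffer[0] = &hFF andalso buffer[1] = &hFE then return "UTF-16LE"
  end if
  return "unknown"
end function

' Build the test bytes at run time so the zero bytes are kept as data.
dim as string sample = string(4, 0)      ' four zero bytes
sample[2] = &hFE : sample[3] = &hFF      ' now 00 00 FE FF
print DetectBOM(sample)
print DetectBOM(chr(&hFE, &hFF) + "ab")
```

The bytes are compared one at a time rather than as strings, so embedded zero bytes cannot cut a comparison short. A UTF-16LE text whose first character after the mark is U+0000 would be reported as UTF-32LE; that ambiguity is inherent in the marks themselves.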
Iczer
Posts: 99
Joined: Jul 04, 2017 18:09

Re: UTF-8 Variable Length String Library

Post by Iczer »

About converting an ASCII string into a normal UTF-8 string: what can be done with a numeric character reference, "&#nnnn;" or "&#xhhhh;",
where nnnn is the code point in decimal form and hhhh is the code point in hexadecimal form?
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by marcov »

Iczer wrote: About converting an ASCII string into a normal UTF-8 string: what can be done with a numeric character reference, "&#nnnn;" or "&#xhhhh;",
where nnnn is the code point in decimal form and hhhh is the code point in hexadecimal form?
That form is not plain ASCII text, but a document-specific escape sequence (as in HTML). You would need an interpreter for the relevant document format to translate it into proper UTF-8.
Last edited by marcov on Jan 16, 2018 19:14, edited 1 time in total.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: UTF-8 Variable Length String Library

Post by jj2007 »

Iczer wrote: About converting an ASCII string into a normal UTF-8 string: what can be done with a numeric character reference, "&#nnnn;" or "&#xhhhh;",
where nnnn is the code point in decimal form and hhhh is the code point in hexadecimal form?
You can parse the string and translate the sequences to their UTF-8 or UTF-16 equivalents. Can you provide an example for testing?
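
For what it's worth, here is a minimal sketch of that parsing approach. The names ParseReference, CodePointToUTF8 and DecodeNCR are hypothetical and not part of the library in this thread. It only handles the numeric forms "&#nnnn;" and "&#xhhhh;"; named entities such as "&amp;" and anything that does not parse as a valid code point are copied through unchanged.

```freebasic
' Hypothetical helpers, not part of the library discussed in this thread.

' Parse the text between "&#" and ";" (e.g. "233" or "xE9").
' Returns -1 when it is not a valid Unicode scalar value.
function ParseReference(byref body as const string) as long
  dim as ulong radix = 10, cp = 0
  dim as integer first = 0
  if len(body) > 0 andalso (body[0] = asc("x") orelse body[0] = asc("X")) then
    radix = 16 : first = 1
  end if
  if first >= len(body) then return -1          ' no digits at all

  for k as integer = first to len(body) - 1
    dim as ulong c = body[k], d
    if c >= asc("0") andalso c <= asc("9") then
      d = c - asc("0")
    elseif c >= asc("a") andalso c <= asc("f") then
      d = c - asc("a") + 10
    elseif c >= asc("A") andalso c <= asc("F") then
      d = c - asc("A") + 10
    else
      return -1                                 ' not a digit
    end if
    if d >= radix then return -1                ' hex letter in a decimal number
    cp = cp * radix + d
    if cp > &h10FFFF then return -1             ' beyond the Unicode range
  next
  if cp >= &hD800 andalso cp <= &hDFFF then return -1   ' surrogate
  return cp
end function

' Encode one Unicode code point as a UTF-8 byte sequence.
function CodePointToUTF8(byval cp as ulong) as string
  if cp < &h80 then
    return chr(cp)
  elseif cp < &h800 then
    return chr(&hC0 or (cp shr 6), &h80 or (cp and &h3F))
  elseif cp < &h10000 then
    return chr(&hE0 or (cp shr 12), &h80 or ((cp shr 6) and &h3F), _
               &h80 or (cp and &h3F))
  else
    return chr(&hF0 or (cp shr 18), &h80 or ((cp shr 12) and &h3F), _
               &h80 or ((cp shr 6) and &h3F), &h80 or (cp and &h3F))
  end if
end function

' Replace every valid numeric character reference in src by its UTF-8 bytes.
function DecodeNCR(byref src as const string) as string
  dim as string result
  dim as integer i = 1
  while i <= len(src)
    dim as long cp = -1
    dim as integer semi = 0
    if mid(src, i, 2) = "&#" then
      semi = instr(i + 2, src, ";")
      if semi > 0 then cp = ParseReference(mid(src, i + 2, semi - i - 2))
    end if
    if cp >= 0 then
      result += CodePointToUTF8(cp)
      i = semi + 1
    else
      result += mid(src, i, 1)                  ' copy anything else unchanged
      i += 1
    end if
  wend
  return result
end function

' Shows "café €5" on a console that displays UTF-8.
print DecodeNCR("caf&#233; &#x20AC;5")
```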
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

The links in the first post have been updated.
Imortis
Moderator
Posts: 1923
Joined: Jun 02, 2005 15:10
Location: USA

Re: UTF-8 Variable Length String Library

Post by Imortis »

Moved out of Archive since the code is being updated again.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

Thanks Imortis!
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by marcov »

Munair wrote:
One routine that's missing is the UMid statement. Perhaps someone can help out there, but I doubt there is a real need for it. Meanwhile, you're welcome to test, use and improve the code. MLGPL License included. ;)
Is UMid the equivalent of Copy() in Pascal? That is a built-in, though; for strings it is probably implemented in astrings/ustrings.inc. For UTF-8, counting in characters is harder; perhaps lazutils has something?
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

marcov wrote: Is UMid the equivalent of Copy() in Pascal? That is a built-in, though; for strings it is probably implemented in astrings/ustrings.inc. For UTF-8, counting in characters is harder; perhaps lazutils has something?
(Free)BASIC has the functions Left(), Right() and Mid(), which are equivalent to Pascal's Copy(), but also a Mid statement for convenience to directly change a specific part of the string:

Code:

' function
s = mid(s, 2, 3)

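' statement: overwrites up to 5 characters of s, starting at position 4
mid(s, 4, 5) = text
' for a 5-character text (and s of at least 8 characters), equivalent to: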
s = left(s, 3) + text + mid(s, 9)
I don't know of a Pascal equivalent for the Mid statement.
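
To illustrate what a UTF-8 counterpart of the Mid statement would have to do, here is a minimal sketch that counts in code points instead of bytes. ULength, UByteOffset and UMidAssign are hypothetical names and not the library's routines, and the code assumes the strings hold valid UTF-8. Like the Mid statement, it never changes the number of characters, but the byte length can change because code points differ in width.

```freebasic
' Hypothetical helpers, not part of the library; they assume valid UTF-8.

' Count code points: every byte that is not a continuation byte
' (10xxxxxx) starts a new code point.
function ULength(byref s as const string) as integer
  dim as integer total = 0
  for i as integer = 0 to len(s) - 1
    if (s[i] and &hC0) <> &h80 then total += 1
  next
  return total
end function

' Return the 1-based byte position of code point number cpIndex (1-based),
' or len(s) + 1 when the string has fewer code points.
function UByteOffset(byref s as const string, byval cpIndex as integer) as integer
  dim as integer seen = 0
  for i as integer = 0 to len(s) - 1
    if (s[i] and &hC0) <> &h80 then
      seen += 1
      if seen = cpIndex then return i + 1
    end if
  next
  return len(s) + 1
end function

' Overwrite up to nCp code points of s, starting at code point startCp,
' with the leading code points of txt.
sub UMidAssign(byref s as string, byval startCp as integer, _
               byval nCp as integer, byref txt as const string)
  dim as integer total = ULength(s)
  if startCp < 1 orelse startCp > total then exit sub

  ' Never replace more than txt holds or than s has left.
  dim as integer n = nCp
  if n > ULength(txt) then n = ULength(txt)
  if n > total - startCp + 1 then n = total - startCp + 1
  if n <= 0 then exit sub

  dim as integer firstByte = UByteOffset(s, startCp)
  dim as integer nextByte = UByteOffset(s, startCp + n)
  dim as integer txtEnd = UByteOffset(txt, n + 1)
  s = left(s, firstByte - 1) + left(txt, txtEnd - 1) + mid(s, nextByte)
end sub

' "na" + U+00EF + "ve": five code points stored in six bytes.
dim as string s = "na" + chr(&hC3, &hAF) + "ve"
UMidAssign(s, 3, 1, "i")      ' replace the third code point
print s                       ' naive
```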
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by marcov »

Munair wrote:
I don't know of a Pascal equivalent for the Mid statement.
Ah yes, there is also the LHS version. No, Pascal has nothing in that form, nor the syntax for it (a function or built-in on the left-hand side of =).
3oheicrw
Posts: 25
Joined: Mar 21, 2022 6:41

Re: UTF-8 Variable Length String Library

Post by 3oheicrw »

utf8.bi(30) error 4: Duplicated definition in 'type PChar as zstring ptr'

Commenting out this line gets rid of the error.
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: UTF-8 Variable Length String Library

Post by Munair »

3oheicrw wrote: Apr 02, 2022 16:27 utf8.bi(30) error 4: Duplicated definition in 'type PChar as zstring ptr'

Commenting out this line gets rid of the error.
When compiling examples.bas or utf8.bas (both including utf8.bi) I'm not getting this error. As far as I can see there's no duplicate definition of PChar. Did you include utf8.bi in another file/project?

I recommend against commenting out definitions in the library, as it could break library functionality. Instead, try to find where else PChar is defined in your own file/project. You may also get this error if you include utf8.bi more than once (e.g. without the once keyword).
coderJeff
Site Admin
Posts: 4313
Joined: Nov 04, 2005 14:23
Location: Ontario, Canada

Re: UTF-8 Variable Length String Library

Post by coderJeff »

PCHAR is a Windows data type. If windows.bi is included, PCHAR will be defined, and because FreeBASIC identifiers are case-insensitive it clashes with PChar.