UTF-8 library

Berkeley
Posts: 110
Joined: Jun 08, 2024 15:03

UTF-8 library

Post by Berkeley »

My UTF-8 library is almost ready for release. You can contribute by filling in the ISO/IEC 8859 tables:

DIM SHARED iso8859tableentry(16, 96) AS ULONG = {
 ...
' This is one block:
{&H00A0, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000},_
ISO 8859-1, 8859-7 and 8859-15 are already finished.

Each value is the Unicode code point of the character, in hexadecimal. A block starts at code 160 (&HA0) and holds only the 96 characters from 160 to 255. Wikipedia has good documentation: https://en.wikipedia.org/wiki/ISO/IEC_8859

Set undefined chars to &H0000.
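
To illustrate how such a block could be read, here is a minimal sketch of a lookup. The names `Iso8859ToUnicode` and `block` are made up for this example and are not taken from utf8.bi. It assumes that codes below 160 map to themselves and that a result of &H0000 means "undefined".

```freebasic
' Sketch only: look up the Unicode code point for one ISO 8859 byte.
' Iso8859ToUnicode and block are hypothetical names, not part of utf8.bi.

Function Iso8859ToUnicode(block() As ULong, ByVal b As UByte) As ULong
    ' Codes below 160 are not stored in the block.
    ' Assumption: they map to the same Unicode value (ASCII and C1 controls).
    If b < 160 Then Return b
    ' Index 0 of the block corresponds to code 160 (&HA0).
    ' A stored value of &H0000 marks an undefined character.
    Return block(b - 160)
End Function

' Demo block filled as ISO 8859-1, where every code equals its Unicode value.
Dim block(0 To 95) As ULong
For i As Integer = 0 To 95
    block(i) = 160 + i
Next

Print Hex(Iso8859ToUnicode(block(), &HE4))   ' E4 (a with diaeresis)
Print Hex(Iso8859ToUnicode(block(), &H41))   ' 41 (plain ASCII "A")
```

A caller would check for a return value of 0 on bytes from 160 upward and substitute a replacement character of its choice.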

Some blocks are largely a variation of ISO 8859-1, so it might be easier to start from its block and change the entries that differ:

{&H00A0, &H00A1, &H00A2, &H00A3, &H00A4, &H00A5, &H00A6, &H00A7, &H00A8, &H00A9, &H00AA, &H00AB, &H00AC, &H00AD, &H00AE, &H00AF, _
&H00B0, &H00B1, &H00B2, &H00B3, &H00B4, &H00B5, &H00B6, &H00B7, &H00B8, &H00B9, &H00BA, &H00BB, &H00BC, &H00BD, &H00BE, &H00BF, _
&H00C0, &H00C1, &H00C2, &H00C3, &H00C4, &H00C5, &H00C6, &H00C7, &H00C8, &H00C9, &H00CA, &H00CB, &H00CC, &H00CD, &H00CE, &H00CF, _
&H00D0, &H00D1, &H00D2, &H00D3, &H00D4, &H00D5, &H00D6, &H00D7, &H00D8, &H00D9, &H00DA, &H00DB, &H00DC, &H00DD, &H00DE, &H00DF, _
&H00E0, &H00E1, &H00E2, &H00E3, &H00E4, &H00E5, &H00E6, &H00E7, &H00E8, &H00E9, &H00EA, &H00EB, &H00EC, &H00ED, &H00EE, &H00EF, _
&H00F0, &H00F1, &H00F2, &H00F3, &H00F4, &H00F5, &H00F6, &H00F7, &H00F8, &H00F9, &H00FA, &H00FB, &H00FC, &H00FD, &H00FE, &H00FF},_
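
As an example of the "start from ISO 8859-1 and patch it" approach, the sketch below builds an ISO 8859-9 block and prints it in the same 16-per-line layout as the table above. The array names are hypothetical, and the six patched positions should be double-checked against the Wikipedia table before the output is pasted into the library.

```freebasic
' Sketch only: derive an ISO 8859-9 block from the ISO 8859-1 block.
' latin1 and latin5 are hypothetical names, not part of utf8.bi.

Dim latin1(0 To 95) As ULong
Dim latin5(0 To 95) As ULong

For i As Integer = 0 To 95
    latin1(i) = &HA0 + i      ' ISO 8859-1: code equals Unicode value
    latin5(i) = latin1(i)     ' start from a copy of the ISO 8859-1 block
Next

' The six positions where ISO 8859-9 differs from ISO 8859-1.
latin5(&HD0 - &HA0) = &H011E  ' capital G with breve
latin5(&HDD - &HA0) = &H0130  ' capital I with dot above
latin5(&HDE - &HA0) = &H015E  ' capital S with cedilla
latin5(&HF0 - &HA0) = &H011F  ' small g with breve
latin5(&HFD - &HA0) = &H0131  ' small dotless i
latin5(&HFE - &HA0) = &H015F  ' small s with cedilla

' Print the block as 6 lines of 16 four-digit hex values.
For i As Integer = 0 To 95
    Print "&H" & Hex(latin5(i), 4);
    If i < 95 Then Print ", ";
    If (i Mod 16) = 15 Then Print
Next
```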
Thank you, even just for reading.

Re: UTF-8 library

Post by Berkeley »

The download is ready. It has not been tested much yet, and "UPPERCASE"/"LOWERCASE" may produce poor results.

https://www.freebasic-portal.de/dlfiles/836/utf8.bi
Löwenherz
Posts: 253
Joined: Aug 27, 2008 6:26
Location: Bad Sooden-Allendorf, Germany

Re: UTF-8 library

Post by Löwenherz »

Thanks for the info and your work, Berkeley. I have not tested your library yet. What about special characters in French, Spanish, Greek, Russian, Chinese, Japanese and so on? UTF-8 reserves one byte for the Latin alphabet, but Greek and other languages need 2 bytes, or even 4 bytes for Asian or African languages.

Re: UTF-8 library

Post by Berkeley »

Even Cyrillic letters need just 2 bytes in UTF-8, so UTF-16 won't save space. The library should support all UTF-8 sequences up to 4 bytes long. Unfortunately, FreeBASIC doesn't guarantee whether WSTRING characters are 16 or 32 bits wide (it depends on the operating system), and there is no dedicated "LSTRING", so at this point it's pointless to try to use UTF-16 or UTF-32 in FreeBASIC.
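
The byte counts are easy to check with a small encoder. This is only a sketch of the standard UTF-8 bit layout; `EncodeUtf8` is a hypothetical helper, not a function from utf8.bi, and it does not reject surrogate code points. The last line prints the size of one WSTRING character on the platform it is compiled for.

```freebasic
' Sketch only: encode one Unicode code point as a UTF-8 byte sequence.
' EncodeUtf8 is a hypothetical helper, not part of utf8.bi.

Function EncodeUtf8(ByVal cp As ULong) As String
    If cp < &H80 Then
        ' 1 byte: 0xxxxxxx
        Return Chr(cp)
    ElseIf cp < &H800 Then
        ' 2 bytes: 110xxxxx 10xxxxxx
        Return Chr(&HC0 Or (cp Shr 6), &H80 Or (cp And &H3F))
    ElseIf cp < &H10000 Then
        ' 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        Return Chr(&HE0 Or (cp Shr 12), &H80 Or ((cp Shr 6) And &H3F), _
                   &H80 Or (cp And &H3F))
    ElseIf cp <= &H10FFFF Then
        ' 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        Return Chr(&HF0 Or (cp Shr 18), &H80 Or ((cp Shr 12) And &H3F), _
                   &H80 Or ((cp Shr 6) And &H3F), &H80 Or (cp And &H3F))
    End If
    Return ""   ' beyond the Unicode range: nothing to encode
End Function

Print Len(EncodeUtf8(&H41))      ' Latin capital A: 1 byte
Print Len(EncodeUtf8(&H0416))    ' Cyrillic capital Zhe: 2 bytes
Print Len(EncodeUtf8(&H20AC))    ' Euro sign: 3 bytes
Print Len(EncodeUtf8(&H1F600))   ' code point above U+FFFF: 4 bytes

' Bytes per WSTRING character on this platform.
Print SizeOf(WString)
```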

Re: UTF-8 library

Post by Berkeley »

A new version is out: https://www.freebasic-portal.de/dlfiles/844/utf8.bi Some function names have changed. There are now MID/LEFT/RIGHT functions for UTF-8 strings, which can additionally address a string by characters rather than by bytes.
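
To show what addressing by characters rather than bytes means, here is a minimal sketch. `Utf8CharLen` and `Utf8LeftChars` are hypothetical names and are not the functions from utf8.bi, whose names and signatures may differ. The sketch assumes well-formed UTF-8 input.

```freebasic
' Sketch only: address a UTF-8 string by characters instead of bytes.
' Utf8CharLen and Utf8LeftChars are hypothetical, not part of utf8.bi.
' Assumes well-formed UTF-8 input.

Function Utf8CharLen(ByRef s As Const String) As Integer
    Dim n As Integer = 0
    For i As Integer = 0 To Len(s) - 1
        ' Continuation bytes look like 10xxxxxx; any other byte starts a character.
        If (s[i] And &HC0) <> &H80 Then n += 1
    Next
    Return n
End Function

Function Utf8LeftChars(ByRef s As Const String, ByVal nchars As Integer) As String
    If nchars <= 0 Then Return ""
    Dim seen As Integer = 0
    For i As Integer = 0 To Len(s) - 1
        If (s[i] And &HC0) <> &H80 Then
            ' Byte i starts a new character; cut here once nchars were seen.
            If seen = nchars Then Return Left(s, i)
            seen += 1
        End If
    Next
    Return s   ' the string has nchars characters or fewer
End Function

' "Gr", u with diaeresis, sharp s, "e", written as explicit bytes so the
' example does not depend on the encoding of the source file.
Dim s As String = "Gr" & Chr(&HC3, &HBC, &HC3, &H9F) & "e"

Print Len(s)                     ' 7 bytes
Print Utf8CharLen(s)             ' 5 characters
Print Len(Utf8LeftChars(s, 3))   ' 4 bytes: "Gr" plus one two-byte character
```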

You can still contribute by filling in the ISO/IEC 8859 table entries and posting them here.

Re: UTF-8 library

Post by Berkeley »

A new version is out: https://www.freebasic-portal.de/dlfiles/845/utf8.bi It contains bug fixes and the new function UTF8DIFF().