UTF-8 library

Post by **Berkeley** » Aug 19, 2024 17:37

My UTF-8 library is almost ready for release. You can contribute by filling in the ISO/IEC 8859 tables:

DIM SHARED iso8859tableentry(16, 96) AS ULONG = {
 ...
' This is one block:
{&H00A0, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, _
&H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000, &H0000},_

ISO 8859-1, 8859-7 and 8859-15 are already finished.

Each value is the Unicode code point of the character, in hexadecimal. A block starts at code 160 (&HA0) and covers only the 96 characters from 160 to 255. Wikipedia documents the code pages well: https://en.wikipedia.org/wiki/ISO/IEC_8859

Set undefined chars to &H0000.
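
As a rough illustration of how a filled block would be read, here is a minimal sketch. It is not taken from utf8.bi: `demoBlock` and `Iso8859ToCodepoint` are hypothetical names, the block is a one-dimensional array filled at run time, and the ISO 8859-1 values are used purely as sample data.

```freebasic
' Hypothetical sketch, not part of utf8.bi.
' One 96-entry block covering byte values &HA0..&HFF.
Dim Shared demoBlock(0 To 95) As ULong

' Sample data: in ISO 8859-1 every byte maps to the same code point.
For i As Integer = 0 To 95
    demoBlock(i) = &H00A0 + i
Next

' Pretend one position is undefined in this code page.
demoBlock(&HA5 - &HA0) = &H0000

' Returns the Unicode code point for a byte, or 0 if it is undefined.
' Assumption: bytes below &HA0 map to the same code point (ASCII plus
' the C0/C1 controls), as in the common ISO-8859-x charsets.
Function Iso8859ToCodepoint(ByVal b As UByte) As ULong
    If b < &HA0 Then Return b
    Return demoBlock(b - &HA0)
End Function

Print Hex(Iso8859ToCodepoint(&H41))   ' below &HA0: returned unchanged
Print Hex(Iso8859ToCodepoint(&HE4))   ' looked up at index &HE4 - &HA0
Print Hex(Iso8859ToCodepoint(&HA5))   ' undefined entry: 0
```

The index into the block is simply the byte value minus &HA0, which is why each block needs exactly 96 entries.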

Some blocks are largely a variation of ISO 8859-1, so it may be easier to start from its block and change only the entries that differ:

{&H00A0, &H00A1, &H00A2, &H00A3, &H00A4, &H00A5, &H00A6, &H00A7, &H00A8, &H00A9, &H00AA, &H00AB, &H00AC, &H00AD, &H00AE, &H00AF, _
&H00B0, &H00B1, &H00B2, &H00B3, &H00B4, &H00B5, &H00B6, &H00B7, &H00B8, &H00B9, &H00BA, &H00BB, &H00BC, &H00BD, &H00BE, &H00BF, _
&H00C0, &H00C1, &H00C2, &H00C3, &H00C4, &H00C5, &H00C6, &H00C7, &H00C8, &H00C9, &H00CA, &H00CB, &H00CC, &H00CD, &H00CE, &H00CF, _
&H00D0, &H00D1, &H00D2, &H00D3, &H00D4, &H00D5, &H00D6, &H00D7, &H00D8, &H00D9, &H00DA, &H00DB, &H00DC, &H00DD, &H00DE, &H00DF, _
&H00E0, &H00E1, &H00E2, &H00E3, &H00E4, &H00E5, &H00E6, &H00E7, &H00E8, &H00E9, &H00EA, &H00EB, &H00EC, &H00ED, &H00EE, &H00EF, _
&H00F0, &H00F1, &H00F2, &H00F3, &H00F4, &H00F5, &H00F6, &H00F7, &H00F8, &H00F9, &H00FA, &H00FB, &H00FC, &H00FD, &H00FE, &H00FF},_
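
To illustrate the "start from ISO 8859-1 and patch" idea, here is a minimal sketch using ISO 8859-15, which differs from ISO 8859-1 in eight positions. That block is already finished, so this is only an example of the workflow. The array name `block15` and the `Patch` helper are hypothetical and not part of the library, and the sketch fills a one-dimensional array at run time instead of using an initializer.

```freebasic
' Hypothetical sketch, not part of utf8.bi.
Dim Shared block15(0 To 95) As ULong

' Overwrite the entry for one byte value (&HA0..&HFF).
Sub Patch(ByVal b As UByte, ByVal codepoint As ULong)
    block15(b - &HA0) = codepoint
End Sub

' Start from ISO 8859-1, where byte value and code point are equal.
For i As Integer = 0 To 95
    block15(i) = &HA0 + i
Next

' The eight positions where ISO 8859-15 differs from ISO 8859-1.
Patch(&HA4, &H20AC)  ' euro sign
Patch(&HA6, &H0160)  ' S with caron
Patch(&HA8, &H0161)  ' s with caron
Patch(&HB4, &H017D)  ' Z with caron
Patch(&HB8, &H017E)  ' z with caron
Patch(&HBC, &H0152)  ' OE ligature
Patch(&HBD, &H0153)  ' oe ligature
Patch(&HBE, &H0178)  ' Y with diaeresis

' Print the block 16 entries per row, like the table layout.
For i As Integer = 0 To 95
    Print "&H"; Hex(block15(i), 4);
    If i < 95 Then Print ", ";
    If (i Mod 16) = 15 Then Print
Next
```

The printed rows can then be compared against the Wikipedia chart before pasting them into the table.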

Thank you, even just for reading.

Post by **Berkeley** » Aug 20, 2024 18:24

The download is ready. It has not been tested much yet; "UPPERCASE"/"LOWERCASE" may produce poor results.

https://www.freebasic-portal.de/dlfiles/836/utf8.bi

Post by **Löwenherz** » Aug 20, 2024 19:07

Thanks for the info and your work, Berkeley. I haven't tested your library yet. What about special characters in French, Spanish, Greek, Russian, Chinese, Japanese, and so on? UTF-8 reserves one byte for the basic Latin alphabet, but Greek and other languages need 2 bytes, or even up to 4 bytes for Asian or African languages.

Post by **Berkeley** » Aug 21, 2024 17:26

Even Cyrillic letters need just 2 bytes in UTF-8, so UTF-16 wouldn't save space. The library should support all UTF-8 sequences up to 4 bytes long. Unfortunately, FreeBASIC doesn't guarantee whether a WSTRING character is 16 or 32 bits wide, since that depends on the operating system, and there is no dedicated "LSTRING", so at this point it makes no sense to try using UTF-16 or UTF-32 in FreeBASIC.
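
The byte counts follow from how UTF-8 encodes a code point. A minimal sketch of an encoder is below. `EncodeUtf8` is a hypothetical name and not necessarily how utf8.bi does it, and the sketch does not reject surrogate values (U+D800 to U+DFFF).

```freebasic
' Hypothetical sketch, not part of utf8.bi.
' Encodes one Unicode code point as a UTF-8 byte sequence.
' Returns an empty string for values above U+10FFFF.
Function EncodeUtf8(ByVal cp As ULong) As String
    If cp < &H80 Then
        Return Chr(cp)                                  ' 1 byte
    ElseIf cp < &H800 Then
        Return Chr(&HC0 Or (cp Shr 6), _
                   &H80 Or (cp And &H3F))               ' 2 bytes
    ElseIf cp < &H10000 Then
        Return Chr(&HE0 Or (cp Shr 12), _
                   &H80 Or ((cp Shr 6) And &H3F), _
                   &H80 Or (cp And &H3F))               ' 3 bytes
    ElseIf cp <= &H10FFFF Then
        Return Chr(&HF0 Or (cp Shr 18), _
                   &H80 Or ((cp Shr 12) And &H3F), _
                   &H80 Or ((cp Shr 6) And &H3F), _
                   &H80 Or (cp And &H3F))               ' 4 bytes
    End If
    Return ""
End Function

Print Len(EncodeUtf8(&H0041))   ' Latin capital A: 1 byte
Print Len(EncodeUtf8(&H0416))   ' Cyrillic capital Zhe: 2 bytes
Print Len(EncodeUtf8(&H20AC))   ' euro sign: 3 bytes
Print Len(EncodeUtf8(&H1F600))  ' code point above U+FFFF: 4 bytes
```

Everything below U+0800, which includes Greek and Cyrillic, fits in 2 bytes, the same size as a UTF-16 code unit.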

Post by **Berkeley** » Nov 07, 2024 17:32

A new version is out: https://www.freebasic-portal.de/dlfiles/844/utf8.bi Some function names have changed. There are now MID/LEFT/RIGHT functions for UTF-8 strings, which can additionally address a string by character rather than by byte.
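
Addressing by character means skipping over continuation bytes (those of the form 10xxxxxx). A minimal sketch of that idea follows. `IsLeadByte`, `Utf8CharCount` and `Utf8Mid` are hypothetical names, not the functions in utf8.bi, and the sketch assumes well-formed UTF-8 input.

```freebasic
' Hypothetical sketch, not the utf8.bi implementation.
' Assumes the input is well-formed UTF-8.

' True if byte b starts a character (is not a 10xxxxxx continuation byte).
Function IsLeadByte(ByVal b As UByte) As Boolean
    Return (b And &HC0) <> &H80
End Function

' Number of characters (code points) in s.
Function Utf8CharCount(ByRef s As Const String) As Integer
    Dim n As Integer = 0
    For i As Integer = 0 To Len(s) - 1
        If IsLeadByte(s[i]) Then n += 1
    Next
    Return n
End Function

' Like Mid(), but firstChar and numChars count characters, not bytes.
Function Utf8Mid(ByRef s As Const String, ByVal firstChar As Integer, _
                 ByVal numChars As Integer) As String
    If firstChar < 1 Or numChars < 0 Then Return ""
    Dim charIndex As Integer = 0     ' 1-based index of the current character
    Dim firstByte As Integer = -1    ' 0-based byte offset where the result starts
    For i As Integer = 0 To Len(s) - 1
        If IsLeadByte(s[i]) Then
            charIndex += 1
            If charIndex = firstChar Then firstByte = i
            If charIndex = firstChar + numChars Then
                Return Mid(s, firstByte + 1, i - firstByte)
            End If
        End If
    Next
    If firstByte < 0 Then Return ""
    Return Mid(s, firstByte + 1)
End Function

' "a", Cyrillic Zhe (2 bytes), euro sign (3 bytes), "b"
Dim s As String = "a" + Chr(&HD0, &H96) + Chr(&HE2, &H82, &HAC) + "b"
Print Len(s)                    ' 7 bytes
Print Utf8CharCount(s)          ' 4 characters
Print Len(Utf8Mid(s, 2, 2))     ' characters 2 and 3 occupy 5 bytes
```

The test string is built with Chr() so the result does not depend on the encoding of the source file.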

You can still contribute by filling in the ISO/IEC 8859 table entries and posting them here.

Post by **Berkeley** » Nov 12, 2024 19:00

A new version is out: https://www.freebasic-portal.de/dlfiles/845/utf8.bi It contains bug fixes and the new function UTF8DIFF().
