UTF-8 encoded byte stream validation

AGS
Posts: 1284
Joined: Sep 25, 2007 0:26
Location: the Netherlands

Post by AGS »

A UTF-8 encoded byte stream can contain illegal byte sequences. The W3C has a nice page on how to check for illegal bytes in a UTF-8 encoded byte stream:
http://www.w3.org/International/questio ... orms-utf-8

UTF-8 is a variable-length, multi-byte character encoding for Unicode. A UTF-8 byte stream represents a stream of Unicode code points, and each code point is encoded as 1, 2, 3 or 4 bytes.

But not all combinations of bytes are valid. UTF-8 encoded streams have to adhere to a set of rules, and these rules turn out to be so strict that it is relatively painless to check a UTF-8 byte stream for illegal sequences (overlong encodings, surrogates, code points above U+10FFFF and stray continuation bytes).
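As a rough illustration of how strict the rules are, the lead byte alone already fixes how long a sequence must be. The helper below is only a sketch and is not part of the validator in this post; the name utf8_seq_len is made up for the example, and it deliberately ignores the extra second-byte restrictions that apply after &hE0, &hED, &hF0 and &hF4.

```freebasic
' Hypothetical helper (assumption, not from the original post):
' returns the number of bytes a UTF-8 sequence must have, judged by
' its lead byte alone, or 0 if the byte can never start a sequence.
function utf8_seq_len(byval lead as ubyte) as integer
  select case lead
  case &h00 to &h7F
    return 1          ' ASCII
  case &hC2 to &hDF
    return 2          ' &hC0 and &hC1 would only encode overlongs
  case &hE0 to &hEF
    return 3
  case &hF0 to &hF4
    return 4          ' &hF5 and up would exceed U+10FFFF
  case else
    return 0          ' continuation byte or illegal lead byte
  end select
end function

print utf8_seq_len(&h41)   ' 1
print utf8_seq_len(&hD5)   ' 2
print utf8_seq_len(&hE2)   ' 3
print utf8_seq_len(&hC0)   ' 0
```

A full validator still has to check every continuation byte, which is what the function below does.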

To check the validity of a UTF-8 encoded byte stream you could use the following function. There is an example at the end of the code that shows how to call the function.

check_utf8 returns 0 if utf8_string is a legal UTF-8 encoded byte stream and -1 if an error was found in the byte stream.

len_ must equal the length of utf8_string, measured in bytes.

Code: Select all

function check_utf8(byval utf8_string as ubyte ptr,byval len_ as integer) as integer

  if (utf8_string = 0) then
    return -1
  end if
   
  if (len_ <= 0) then
    return -1
  end if
  
  var ct = 0
  while (1)
again:
    select case as const utf8_string[ct]
    'ASCII range
    case &h00 to &h7F
      ct += 1
      if (ct = len_) then
        return 0
      end if
      goto again
    'non-overlong 2 byte
    case &hC2 to &hDF
      ct += 1
      if (ct = len_) then
        return -1
      end if
      select case as const utf8_string[ct]
      case &h80 to &hBF
        ct += 1
        if (ct = len_) then
          return 0
        else
          goto again
        end if
      'second byte is not a continuation byte
      case else
        return -1
      end select
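    'excluding overlongs (3 byte)
    case &hE0
      ct += 1
      if (ct = len_) then
        return -1
      end if
      select case as const utf8_string[ct]
      case &hA0 to &hBF
        ct += 1
        if (ct = len_) then
          return -1
        end if
        select case as const utf8_string[ct]
        case &h80 to &hBF
          ct += 1
          if (ct = len_) then
            return 0
          else
            goto again
          end if
        'third byte is not a continuation byte
        case else
          return -1
        end select
      'second byte would give an overlong or is not a continuation byte
      case else
        return -1
      end select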
    '3 byte
    case &hE1 to &hEC,&hEE,&hEF
      ct += 1
      if (ct = len_) then
        return -1
      end if
      select case as const utf8_string[ct]
      case &h80 to &hBF
        ct += 1
        if (ct = len_) then
          return -1
        end if
        select case as const utf8_string[ct]
        case &h80 to &hBF
          ct += 1
          if (ct = len_) then
            return 0
          else
            goto again
          end if
        'third byte is not a continuation byte
        case else
          return -1
        end select
      'second byte is not a continuation byte
      case else
        return -1
      end select
    'exclude surrogates (3 byte)
    case &hED
      ct += 1
      if (ct = len_) then
        return -1
      end if
      select case as const utf8_string[ct]
      'second byte &hA0 to &hBF would encode a surrogate (U+D800 to U+DFFF)
      case &h80 to &h9F
        ct += 1
        if (ct = len_) then
          return -1
        end if
        select case as const utf8_string[ct]
        'the third byte is an ordinary continuation byte
        case &h80 to &hBF
          ct += 1
          if (ct = len_) then
            return 0
          else
            goto again
          end if
        case else
          return -1
        end select
      case else
        return -1
      end select
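    'excluding overlongs (4 byte)
    case &hF0
      ct += 1
      if (ct = len_) then
        return -1
      end if
      select case as const utf8_string[ct]
      case &h90 to &hBF
        ct += 1
        if (ct = len_) then
          return -1
        end if
        select case as const utf8_string[ct]
        case &h80 to &hBF
          ct += 1
          if (ct = len_) then
            return -1
          end if
          'a sequence starting with &hF0 has a fourth byte
          select case as const utf8_string[ct]
          case &h80 to &hBF
            ct += 1
            if (ct = len_) then
              return 0
            else
              goto again
            end if
          case else
            return -1
          end select
        case else
          return -1
        end select
      case else
        return -1
      end select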
    '4 byte
    case &hF1 to &hF3
      ct += 1
      if (ct = len_) then
        return -1
      end if
      select case as const utf8_string[ct]
      case &h80 to &hBF
        ct += 1
        if (ct = len_) then
          return -1
        end if
        select case as const utf8_string[ct]
        case &h80 to &hBF
          ct += 1
          if (ct = len_) then
            return -1
          end if
          select case as const utf8_string[ct]
          case &h80 to &hBF
            ct += 1
            if (ct = len_) then
              return 0
            else
              goto again
            end if
          case else
            return -1
          end select
        case else
          return -1
        end select
      case else
        return -1
      end select
    'excluding code points above U+10FFFF (4 byte)
    case &hF4
      ct += 1
      if (ct = len_) then
        return -1
      end if
      select case as const utf8_string[ct]
      'second byte &h90 to &hBF would encode a value above U+10FFFF
      case &h80 to &h8F
        ct += 1
        if (ct = len_) then
          return -1
        end if
        select case as const utf8_string[ct]
        case &h80 to &hBF
          ct += 1
          if (ct = len_) then
            return -1
          end if
          select case as const utf8_string[ct]
          case &h80 to &hBF
            ct += 1
            if (ct = len_) then
              return 0
            else
              goto again
            end if
          case else
            return -1
          end select
        case else
          return -1
        end select
      case else
        return -1
      end select
    'illegal byte in utf8 string
    case else
      return -1      
    end select
  wend

  return -1

end function

dim tst(0 to ...) as ubyte => {&hC0,&hAE}
'output: "illegal utf8 content"
if (check_utf8(@tst(0),ubound(tst) + 1)) then
  print "illegal utf8 content"
else
  print "legal utf8 content"
end if
dim tst2(0 to ...) as ubyte => {&hd5,&ha2,&hd5,&ha1,&hd5,&hab,&hd5,&ha5,&hd6,&h82}
'output: "legal utf8 content" (tst2 contains "hello" in the Armenian language).
if (check_utf8(@tst2(0),ubound(tst2) + 1)) then
  print "illegal utf8 content"
else
  print "legal utf8 content"
end if

(edit: changed range (first case statement) into &h00 to &h7F)
Last edited by AGS on Apr 09, 2011 22:46, edited 1 time in total.
TJF
Posts: 3809
Joined: Dec 06, 2009 22:27
Location: N47°, E15°

Post by TJF »

Nice work!

IMO you should use this different parameter list and code

Code: Select all

function check_utf8(byval utf8_string as ZSTRING ptr, byval len_ as integer = 0) as integer

  IF len_ = 0 THEN len_ = LEN(*utf8_string)

...
That way you can use STRINGs as parameters, like VAR res = check_utf8("Hello"). Instead of handling the UBYTE arrays, just type your text when working with a UTF-8 capable IDE.


BTW: it's pretty similar to
GLIB wrote:gboolean g_utf8_validate(const gchar *str, gssize max_len, const gchar **end);

Validates UTF-8 encoded text. str is the text to validate; if str is nul-terminated, then max_len can be -1, otherwise max_len should be the number of bytes to validate. If end is non-NULL, then the end of the valid range will be stored there (i.e. the start of the first invalid character if some bytes were invalid, or the end of the text being validated otherwise).
kiyotewolf
Posts: 1009
Joined: Oct 11, 2008 7:42
Location: ABQ, NM

Post by kiyotewolf »

I remember trying to properly debug an ANSI command stream, validate it, as well as extend ANSI's stream to include new commands altogether.



~Kiyote!

What FB datatype should be used as a container for a UTF-8 encoded byte stream?
zstring?

Post by AGS »

kiyotewolf wrote:I remember trying to properly debug an ANSI command stream, validate it, as well as extend ANSI's stream to include new commands altogether.
What FB datatype should be used as a container for a UTF-8 encoded byte stream?
zstring?

I'd go with ubyte.

@TJF
Glib has some very nice UTF-8 support.

As for the interface: using string or zstring ptr will work as long as len_ is given. Calculating the length of the array/string within the routine is not possible as the UTF-8 byte sequence can contain 0's (the validator was not 100% correct).

Example

Code: Select all

var s = !"Das ist \&h0 a zero" 
print len(s)
'output: 8

Everything after the 0 does not get counted as a character.
Same thing happens with a zstring ptr. The user has to provide the length as it is needed to make sure no out-of-bounds access occurs.
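The same effect can be shown with a plain byte buffer. This is only a sketch to illustrate the point; the array contents and variable names are made up for the example.

```freebasic
' Five bytes with a zero in the middle (hypothetical test data).
dim buf(0 to ...) as ubyte => {&h44, &h61, &h73, &h00, &h21}

' View the same memory as a zero-terminated string.
dim z as zstring ptr = cast(zstring ptr, @buf(0))

print len(*z)           ' 3: counting stops at the zero byte
print ubound(buf) + 1   ' 5: the real number of bytes in the buffer
```

Only the caller knows that the buffer holds five bytes, which is why check_utf8 takes the length as a parameter.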

My validator was not correct and had a major flaw: it dismissed the ASCII range 0 to &h7F as illegal. My mistake. 0 to &h7F are legal UTF-8 characters (legal code points) and should be validated as legal.

Post by TJF »

AGS wrote:As for the interface: using string or zstring ptr will work as long as len_ is given. Calculating the length of the array/string within the routine is not possible as the UTF-8 byte sequence can contain 0's (the validator was not 100% correct).
The zero character is an exception. You won't find it often. But anyway, you can handle it with my interface. That's why I didn't remove the len_ parameter: you may use it if needed, but you don't have to.

ZSTRING works like a UBYTE array (exception: no UBOUND function). It may have a zero character inside (try check_utf8(!"Das ist \&h0 a zero", 16) -- it's seen as an array of two ZSTRINGs, with the \0 character as the terminator between them).


Anyway, IMO using a STRING inside the source code instead of an array of numbers is a major advantage!

Example ("hello" in the Armenian language)
  • "բաիեւ"
if you're familiar with these characters, it's easier to handle than
  • {&hd5,&ha2,&hd5,&ha1,&hd5,&hab,&hd5,&ha5,&hd6,&h82}
kiyotewolf wrote:What FB datatype should be used as a container for a UTF-8 encoded byte stream?
zstring?
As explained above, UBYTE is a pain. You can't use STRING functions such as FORMAT, UCASE, INSTR, ... Even joining two UTF-8 streams into one needs lots of code. I recommend using STRING or ZSTRING.

Never use WSTRING!!!

You're on the safe side using STRING (the above-mentioned problem with the LEN function doesn't occur).

On the other hand ZSTRING is faster in most cases (and a bit more flexible and compatible with C libraries but more risky).

Post by kiyotewolf »

{&hd5,&ha2,&hd5,&ha1,&hd5,&hab,&hd5,&ha5,&hd6,&h82}
How do I get a string of hex nums into a variable?
I've got a special character, ※, how would I get a list of values like the above for that character, and how easy is it to mix the UTF-8 with standard ASCII and then save it to a file?
I just wanna be able to write some standard text with a few Unicode characters.



~Kiyote!

[edit]

Looked again

dim tst2(0 to ...) as ubyte => {&hd5,&ha2,&hd5,&ha1,&hd5,&hab,&hd5,&ha5,&hd6,&h82}

Duh.
Anyways, how do I take that fancy ※ and add it here & there to a standard text file I may be writing?
Don't I have to use OPEN "..." for.. encoding.. something or other, to open a file willing to accept unicode chars?
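One way to get the list of byte values for a character such as ※ is to encode its code point (U+203B) by hand. The function below is a minimal sketch, not something posted in this thread; the name utf8_encode is hypothetical, and it returns an empty string for surrogates and for values above U+10FFFF.

```freebasic
' Hypothetical helper (assumption): encodes one Unicode code point
' as a string of UTF-8 bytes.
function utf8_encode(byval cp as uinteger) as string
  select case cp
  case is <= &h7F
    return chr(cp)
  case is <= &h7FF
    return chr(&hC0 or (cp shr 6), &h80 or (cp and &h3F))
  case &hD800 to &hDFFF
    return ""   ' surrogates are not legal in UTF-8
  case is <= &hFFFF
    return chr(&hE0 or (cp shr 12), _
               &h80 or ((cp shr 6) and &h3F), _
               &h80 or (cp and &h3F))
  case is <= &h10FFFF
    return chr(&hF0 or (cp shr 18), _
               &h80 or ((cp shr 12) and &h3F), _
               &h80 or ((cp shr 6) and &h3F), _
               &h80 or (cp and &h3F))
  case else
    return ""   ' beyond the Unicode range
  end select
end function

' Print the bytes of U+203B (the reference mark) as hex values.
dim mark as string = utf8_encode(&h203B)
for i as integer = 0 to len(mark) - 1
  print hex(mark[i], 2); " ";
next
print           ' prints: E2 80 BB
```

The resulting string can be concatenated with ordinary ASCII text like any other STRING.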

Post by TJF »

Once more: UBYTE is a pain!
kiyotewolf wrote:I've got a special character, ※, how would I get a list of values like the above for that character, and how easy is it to mix the UTF-8 with standard ASCII and then save it to a file?
Use [Z]STRING variables instead (the code tag doesn't handle your character, so I use the list tag here):

? "Here is my ※ characters"

VAR a = "First ", b = " last", c = "※"
VAR _all = a & c & b
? _all
? INSTR(_all, c)
? LEN(c)

!!! Use a professional IDE like Geany in UTF-8 mode !!!
!!! Use a professional terminal in UTF-8 mode to execute this example !!!


kiyotewolf wrote:Don't I have to use OPEN "..." for.. encoding.. something or other, to open a file willing to accept unicode chars?
Use OPEN, GET #, INPUT #, PRINT #, PUT # ... like you did before! Nothing to change.

The only difference is that INSTR returns a byte position and LEN returns a byte count, not a character position or a character count. When the text contains multi-byte characters, the character position may be less than the INSTR result and the number of characters may be less than what the LEN function returns.

The check_utf8 function from AGS can be extended to count the number of characters and to handle these issues. But I recommend using the GLib versions g_utf8_... (you'll find a well-proven and complete set of functions here).

Anyway, the AGS function here is a good tutorial to see what's going on in these GLib functions and to learn about UTF-8 encoding.
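The character count mentioned above can be sketched in a few lines. This is not the GLib implementation and not part of check_utf8; the name utf8_char_count is made up for the example, and the function assumes the buffer has already been validated.

```freebasic
' Hypothetical helper (assumption): counts the code points in a byte
' buffer that is already known to be valid UTF-8.
function utf8_char_count(byval p as ubyte ptr, byval nbytes as integer) as integer
  dim as integer count = 0
  for i as integer = 0 to nbytes - 1
    ' Continuation bytes look like 10xxxxxx; every other byte starts
    ' a new code point.
    if (p[i] and &hC0) <> &h80 then count += 1
  next
  return count
end function

' "First ※ last", with the ※ written as its three UTF-8 bytes so the
' example does not depend on the encoding of the source file.
dim s as string = "First " & chr(&hE2, &h80, &hBB) & " last"

print len(s)                                                  ' 14 bytes
print utf8_char_count(cast(ubyte ptr, strptr(s)), len(s))     ' 12 characters
```

The difference between the two numbers is the two continuation bytes of the ※ character.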