counting_pine wrote:it expects UTF-8 strings rather than multibyte strings. That's why it only accepts zstrings.
What is the difference between a UTF-8 and a multibyte string? (I've read the Spolsky article thoroughly.)
No german umlauts with libcurl
Re: No german umlauts with libcurl
Thinking about it, "multibyte string" might not be the right description.
But anyway, UTF-8 is an 8-bit encoding, which means a Unicode character can take up as little as one byte (8 bits) and as much as four bytes.
A WString, on the other hand, uses a 16- or 32-bit encoding, depending on the wide-character size used - which is platform-dependent. A Unicode character will take up at least one wide character, whatever size that is.
(So basically a WString is what I meant when I said "multibyte string". It's not really a good name, because UTF-8 also allows multi-byte characters in an 8-bit encoding.)
People who aren't too familiar with encoding issues and UTF-8 might be tempted to say that 8-bit strings, such as FB's [Z]String, are not Unicode-compatible. In reality, nothing stops an 8-bit string from containing Unicode; it just depends on the encoding used.
UTF-8-encoded strings can be concatenated, searched, and compared for equality without problems, but other FB string functions come with several caveats, including these:
- Len() will not return the number of Unicode characters, only the number of bytes
- Taking substrings or indexing will again work in bytes, not characters, and may split a Unicode character's byte sequence
- Case conversions (and case-insensitive comparisons) will not work - although the ASCII forms of UCASE/LCASE at least shouldn't corrupt characters, because the UTF-8 encoding of a non-ASCII character never contains ASCII bytes
Re: No german umlauts with libcurl
The terminology may be confusing, and dealing with UTF-8-encoded Unicode is certainly not trivial, but it isn't rocket science either. It just needs to be implemented in FB; this is assembler:

Code: Select all
Let my$="Добро Пожаловать" ; "Welcome" in Russian
PrintLine my$
PrintLine Lower$(my$)
PrintLine Upper$(my$)
Print Str$("The string 'Добро пожаловать' has %i chars", uLen('Добро пожаловать'))

Output:

Code: Select all
Добро Пожаловать
добро пожаловать
ДОБРО ПОЖАЛОВАТЬ
The string 'Добро пожаловать' has 16 chars

Btw, MultiByteToWideChar is the standard Windows function for doing what Microsoft recommends: "For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16" - and "multibyte" stands for UTF-8 here. Even here there is confusion: they write "Unicode, such as UTF-8", but in practice you can use only
- UTF-16 for the xxxW functions
- UTF-8 for printing to the console, provided the codepage is set to 65001 = UTF-8
Re: No german umlauts with libcurl
counting_pine wrote:I assume you're not content with just passing percent-encoded ASCII URLs to the function?
Your assumption is wrong. :-) It's all the same to me what I have to pass to the function, as long as it works.
The URL that I extracted from the (German) homepage of Wikipedia already is percent-encoded ("https://de.wikipedia.org/wiki/Spezial:Z ... lige_Seite"), but libcurl obviously can't handle it.
And alas, your snippet doesn't help either.
Re: No german umlauts with libcurl
grindstone wrote:The URL that I extracted from the (German) homepage of Wikipedia already is percent-encoded ("https://de.wikipedia.org/wiki/Spezial:Z ... lige_Seite"), but libcurl obviously can't handle it.
I was about to ask "are you sure?", because it works fine with the UTF-8 version of "ä" - but you are right, under the hood you actually find this:
Code: Select all
<li id="n-randompage"><a href="/wiki/Spezial:Zuf%C3%A4llige_Seite" title="Zufällige Seite aufrufen [x]" accesskey="x">Zufälliger Artikel</a></li>
Re: No german umlauts with libcurl
If I submit "ä" instead of "%C3%A4" I get a response saying "Spezialseite nicht vorhanden" (special site not present).
Last edited by grindstone on Jan 21, 2019 23:34, edited 1 time in total.
Re: No german umlauts with libcurl
grindstone wrote:If I submit "ä" instead of "%C3%A4" I get a response saying "Zufällige Seite nicht gefunden" (random site not found).
It works with my version, see Download URL with UTF8 (no guarantee that the OS allows the download, though). The point is that "ä" can be byte 132 in the Western charset, or &HC3A4 in UTF-8. I use the latter, i.e. two bytes = one UTF-8 character, no percent encoding.
Re: No german umlauts with libcurl
(Note that since this April's update, Windows 10 has an option to set the character set to UTF-8. I haven't tested it yet, but long term this could significantly increase Windows compatibility with *nix.)
Re: No german umlauts with libcurl
marcov wrote:(Note that since this April's update, Windows 10 has an option to set the character set to UTF-8. I haven't tested it yet, but long term this could significantly increase Windows compatibility with *nix.)
Microsoft finally coming to their senses.
Re: No german umlauts with libcurl
Almost unrelated to this thread, I've posted here an exercise in accessing websites with exotic URLs. Scroll down to see the little yellow window. To open the links, it uses ShellExecuteW - without any percent encoding, of course.
Re: No german umlauts with libcurl
marcov wrote:(Note that since this April's update, Windows 10 has an option to set the character set to UTF-8. I haven't tested it yet, but long term this could significantly increase Windows compatibility with *nix.)
Munair wrote:Microsoft finally coming to their senses.
IMHO the superiority of UTF-8 as an API encoding is severely overrated.
And, not really. .NET, COM and everything else is still two-byte. It is just that legacy Unix-like and console use in general (*) is finally getting some attention again. Non-legacy apps are not using 1-byte character types :-)
(*) The console APIs also had a big overhaul, both because of WSL, the Linux subsystem, and because there is a console-only Windows Server version again for use in enterprise containers.