No german umlauts with libcurl

jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: No german umlauts with libcurl

Post by jj2007 »

counting_pine wrote:it expects UTF-8 strings rather than multibyte strings. That's why it only accepts zstrings.
What is the difference between a UTF-8 and a multibyte string? (I've read the Spolsky article thoroughly)
counting_pine
Site Admin
Posts: 6323
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: No german umlauts with libcurl

Post by counting_pine »

Thinking about it, "multibyte string" might not be the right description.

But anyway, UTF-8 is built on 8-bit code units, which means a Unicode character can take up as little as one byte (8 bits) and as much as 4 bytes.
A WString, on the other hand, uses a 16-bit or 32-bit encoding, depending on the wide character size, which is platform-dependent. A Unicode character will take up at least one wide character, whatever its size.
(So basically a WString is what I meant when I said "multibyte string". It's not really a good name, because UTF-8 also encodes characters as multiple bytes in an 8-bit encoding.)
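
To make the size difference concrete, here is a minimal Python sketch (not FB, purely illustrative) that counts how many code units a few sample characters need in each encoding form. The helper name code_units is made up for this example.

```python
# Count the code units one character needs in UTF-8, UTF-16 and UTF-32.
# code_units is a hypothetical helper, not part of any library.
SAMPLES = ["A", "ä", "€", "😀"]

def code_units(ch: str) -> dict:
    return {
        "utf8_bytes": len(ch.encode("utf-8")),            # 8-bit units
        "utf16_units": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf32_units": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

for ch in SAMPLES:
    # Print the code point rather than the character to stay console-safe.
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The last sample lies outside the Basic Multilingual Plane, so it needs 4 bytes in UTF-8 and two 16-bit units (a surrogate pair) in UTF-16.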

People who aren't too familiar with encoding issues and UTF-8 might be tempted to say that 8-bit strings, such as FB's [Z]String, are not Unicode-compatible. In reality, nothing stops 8-bit strings from containing Unicode; it just depends on the encoding used.

UTF-8 encoded strings can be concatenated, searched, and compared for equality without problems, because these operations work byte-wise whatever the encoding, but other FB string functions come with several caveats, including these:
- Len() will not return the number of Unicode characters, only the number of bytes
- Taking substrings or indexing will again work in bytes, not characters, and may split a Unicode character
- Case conversions (and case-insensitive comparisons) will not work for non-ASCII letters, although the ASCII forms of UCASE/LCASE at least shouldn't corrupt characters, because the encoding of a non-ASCII character never contains an ASCII byte.
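
The caveats above can be reproduced with a short Python sketch, where a bytes object stands in for an 8-bit FB string. This is only an analogy under that assumption, not FB code; Python's bytes.upper() is used because it converts ASCII letters only.

```python
# A bytes object plays the role of an 8-bit string holding UTF-8.
text = "Zufällige Seite"
raw = text.encode("utf-8")

# Length: characters vs. bytes (the 'ä' takes two bytes, C3 A4).
print(len(text), len(raw))          # 15 16

# Searching works byte-wise without false matches.
found = "Seite".encode("utf-8") in raw

# Cutting after 4 bytes splits the two-byte sequence of 'ä'.
try:
    raw[:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# ASCII-only uppercasing leaves the bytes of 'ä' untouched, so the
# result is still valid UTF-8, but the 'ä' itself is not uppercased.
upper = raw.upper()
still_valid = upper.decode("utf-8")
```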
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: No german umlauts with libcurl

Post by jj2007 »

The terminology may be confusing, and dealing with UTF-8-encoded Unicode is certainly not trivial, but it isn't rocket science either. It just needs to be implemented in FB; this is assembler:

Code:

  Let my$="Добро Пожаловать"	; "Welcome" in Russian
  PrintLine my$
  PrintLine Lower$(my$)
  PrintLine Upper$(my$)
  Print Str$("The string 'Добро пожаловать' has %i chars", uLen('Добро пожаловать'))
Output:

Code:

Добро Пожаловать
добро пожаловать
ДОБРО ПОЖАЛОВАТЬ
The string 'Добро пожаловать' has 16 chars
Btw, MultiByteToWideChar is the standard function for doing what is recommended for Windows: "For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16" - and "multibyte" stands for Utf8 here. Even here there is confusion: they write "Unicode such as Utf8", but in practice you can use only
- Utf16 for the xxxW functions
- Utf8 for printing to the console, provided the codepage is set to 65001 = Utf8
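
As a rough illustration of the Utf8 to Utf16 step, the following Python sketch performs the same kind of conversion that MultiByteToWideChar carries out when it is given code page 65001. It does not call the Windows API; utf8_to_utf16le is a hypothetical helper written only for this example.

```python
# Convert UTF-8 bytes to UTF-16LE bytes, i.e. the form the xxxW
# functions expect. Hypothetical helper, no Windows API involved.
def utf8_to_utf16le(data: bytes) -> bytes:
    return data.decode("utf-8").encode("utf-16-le")

utf8 = "Добро Пожаловать".encode("utf-8")
wide = utf8_to_utf16le(utf8)

# 15 Cyrillic letters at 2 bytes each plus 1 space = 31 bytes in UTF-8;
# 16 characters at one 16-bit unit each = 32 bytes in UTF-16.
print(len(utf8), len(wide), len(wide) // 2)   # 31 32 16
```

The number of 16-bit units matches the 16 chars reported by uLen above, while the byte length of the Utf8 string does not.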
grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Re: No german umlauts with libcurl

Post by grindstone »

counting_pine wrote:I assume you're not content with just passing percent-encoded ASCII URLs to the function?
Your assumption is wrong. :-) It's all the same to me what I have to pass to the function, as long as it works.

The URL that I extracted from the (German) homepage of Wikipedia already is percent-encoded ("https://de.wikipedia.org/wiki/Spezial:Z ... lige_Seite"), but libcurl obviously can't handle it.

And alas, your snippet doesn't help either.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: No german umlauts with libcurl

Post by jj2007 »

grindstone wrote:The URL that I extracted from the (German) homepage of Wikipedia already is percent-encoded ("https://de.wikipedia.org/wiki/Spezial:Z ... lige_Seite"), but libcurl obviously can't handle it.
I was about to ask "are you sure?" because it works fine with the Utf8 version of "ä", but you are right, under the hood you actually find this:

Code:

<li id="n-randompage"><a href="/wiki/Spezial:Zuf%C3%A4llige_Seite" title="Zufällige Seite aufrufen [x]" accesskey="x">Zufälliger Artikel</a></li>
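
For reference, the relation between "ä" and "%C3%A4" in that href can be checked with Python's standard urllib.parse module. This is only a sketch of the percent-encoding step, not of what libcurl does internally; the path is the one from the HTML snippet above.

```python
from urllib.parse import quote, unquote

# Path as a human would write it, taken from the href above.
path = "/wiki/Spezial:Zufällige_Seite"

# Percent-encode the UTF-8 bytes of every non-ASCII character,
# keeping '/' and ':' as they are.
encoded = quote(path, safe="/:")
print(encoded)                      # /wiki/Spezial:Zuf%C3%A4llige_Seite

# Decoding restores the original text.
decoded = unquote(encoded)

# The same letter percent-encoded from Latin-1 gives a different URL,
# which is why the encoding behind the percent signs matters.
latin1 = quote("ä", encoding="latin-1")   # "%E4"
```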
grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Re: No german umlauts with libcurl

Post by grindstone »

If I submit "ä" instead of "%C3%A4" I get a response saying "Spezialseite nicht vorhanden" (special site not present).
Last edited by grindstone on Jan 21, 2019 23:34, edited 1 time in total.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: No german umlauts with libcurl

Post by jj2007 »

grindstone wrote:If I submit "ä" instead of "%C3%A4" I get a response saying "Zufällige Seite nicht gefunden" (random site not found).
It works with my version, see Download URL with UTF8 (no guarantee that the OS allows the download, though). The point is that "ä" can be byte 132 (&H84) in the OEM code pages 437/850, byte 228 (&HE4) in the Western Windows-1252 charset, or the two bytes &HC3 &HA4 as Utf8. I use the last one, i.e. two bytes = one Utf8 char, no percent encoding.
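
The different byte values for "ä" are easy to verify with a few lines of Python (an illustrative sketch; encodings_of is a made-up helper name, the codec names are Python's own).

```python
# Show the bytes that represent one character in several encodings.
# encodings_of is a hypothetical helper for this example.
def encodings_of(ch: str) -> dict:
    codecs = ("cp437", "cp850", "cp1252", "utf-8")
    return {name: ch.encode(name).hex() for name in codecs}

for name, hexbytes in encodings_of("ä").items():
    print(name, hexbytes)
# cp437 84
# cp850 84
# cp1252 e4
# utf-8 c3a4
```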
marcov
Posts: 3454
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: No german umlauts with libcurl

Post by marcov »

(Note that since this April's Windows 10 update, Windows 10 has an option to set the character set to utf8. I haven't tested yet, but long term this could significantly increase Windows compatibility to *nix)
Munair
Posts: 1286
Joined: Oct 19, 2017 15:00
Location: Netherlands

Re: No german umlauts with libcurl

Post by Munair »

marcov wrote:(Note that since this April's Windows 10 update, Windows 10 has an option to set the character set to utf8. I haven't tested yet, but long term this could significantly increase Windows compatibility to *nix)
Microsoft finally coming to their senses.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: No german umlauts with libcurl

Post by jj2007 »

Almost unrelated to this thread, I've posted here an exercise in accessing websites with exotic URLs. Scroll down to see the little yellow window. To open the links, it uses ShellExecuteW - without any percent encoding, of course.
marcov
Posts: 3454
Joined: Jun 16, 2005 9:45
Location: Netherlands

Re: No german umlauts with libcurl

Post by marcov »

Munair wrote:
marcov wrote:(Note that since this April's Windows 10 update, Windows 10 has an option to set the character set to utf8. I haven't tested yet, but long term this could significantly increase Windows compatibility to *nix)
Microsoft finally coming to their senses.
IMHO the superiority of UTF8 as an API encoding is severely overrated.

And, not really. .NET, COM and everything else are still two bytes per character. It is just that legacy Unix-like code and the console in general (*) finally get some attention again. Non-legacy apps are not using 1-byte character types :-)

(*) The console APIs also had a big overhaul, both because of WSL, the Linux subsystem, and because there is a console-only Windows Server version again for use in enterprise containers.