No german umlauts with libcurl

grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Post by grindstone »

libcurl generally works fine here, but if I request a URL containing German umlauts (e.g. https://de.wikipedia.org/wiki/Spezial:Z ... lige_Seite), the receive buffer remains empty (this seems to be a general problem with UTF-8-encoded URLs).

Does anyone know how to get it working? Or maybe another library without that issue?

Thanks in advance.

(WinXP 32bit / FB 1.05 here)
WQ1980
Posts: 48
Joined: Sep 25, 2015 12:04
Location: Russia

Post by WQ1980 »

Is your libcurl built with OpenSSL?
Is https supported?
grindstone

Post by grindstone »

libcurl natively supports https, that's the reason why I'm using it.

From what I've read, the same problem occurs with Cyrillic characters, too.

I read about a hack that solves the issue, but then I would have to compile my own version of libcurl.
WQ1980

Post by WQ1980 »

In my libcurl examples, URLs with https do not work; URLs with http work.

Maybe Punycode would help:
https://en.wikipedia.org/wiki/Punycode

A converter from URL to Punycode:
https://www.punycoder.com/
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Post by jj2007 »

I've made a quick test with another library, and it works perfectly, both with Umlauts as shown below and with the escapes (->Source & exe):

Code: Select all

Inkey NoTag$(FileRead$("https://de.wikipedia.org/wiki/Spezial:Zufällige_Seite"))
So it really seems to be a libcurl problem. Btw, clicking https://de.wikipedia.org/wiki/Spezial:Zufällige_Seite does not work with the SMF forum software at Masm32 ("Spezialseite nicht vorhanden"), but it works fine here with phpBB.
grindstone

Post by grindstone »

What library did you use?

When I run your DownloadPageWithUmlauts.exe I only get a console window showing "#D-U"
jj2007

Post by jj2007 »

#D-U means that InternetOpenUrlA failed (that's under the hood of FileRead$). It might be your firewall, for example. Here (Windows 7-64, Italian version) it works fine, my PC is not so well protected.

Now I checked on my Win10 notebook, here is what I get, directly from the zip archive:

Code: Select all

Pure minua (2006)
Alasin (2006)
Pirunkieli (2007)
Helvettiin Jäätynyt (2008)
Lihaa Vasten Lihaa (2008)
Ei Koskaan (2008) Musikvideos [ Bearbeiten | Quelltext bearbeiten ]
Kiroan (2002)
Epilogi (2002)
Tuonen viemää (2005)
Mies yli laidan (2006)
Alasin (2006)
Ei koskaan (2008) Weblinks [ Bearbeiten | Quelltext bearbeiten ]
Offizielle Website (finnisch)
Interview mit Frontmann Patrik Mennander
Bandvorstellung: Ruoska Einzelnachweise [ Bearbeiten | Quelltext bearbeiten ] ↑ a b c mindbreed.de RRZN, 1. September 2007 ↑ a b nordische-musik.de RRZN, 2004 ↑ RRZN ↑ RUOSKA in Finnish Charts finnishcharts.com; abgerufen 24. Oktober 2007 Abgerufen von „ https://de.wikipedia.org/w/index.php?title=Ruoska&oldid=179127834 “ Kategorien :
Metal-Band
Dark-Music-Musikgruppe
Finnische Band Navigationsmenü Meine Werkzeuge Nicht angemeldet Diskussionsseite Beiträge Benutzerkonto erstellen Anmelden Namensräume Artikel Diskussion Varianten Ansichten Lesen Bearbeiten Quelltext bearbeiten Versionsgeschichte Mehr Suche Navigation Hauptseite Themenportale Zufälliger Artikel Mitmachen Artikel verbessern Neuen Artikel anlegen Autorenportal Hilfe Letzte Änderungen Kontakt Spenden Werkzeuge Links auf diese Seite Änderungen an verlinkten Seiten Spezialseiten Permanenter Link Seiten­informationen Wikidata-Datenobjekt Artikel zitieren Drucken/­exportieren Buch erstellen Als PDF herunterladen Druckversion In anderen Projekten Commons In anderen Sprachen Български English Español Suomi Français Italiano Polski Português Русский Svenska Українська Links bearbeiten Diese Seite wurde zuletzt am 13. Juli 2018 um 16:28 Uhr bearbeitet. Abrufstatistik
Der Text ist unter der Lizenz „Creative Commons Attribution/Share Alike“ verfügbar; Informationen zu den Urhebern und zum Lizenzstatus eingebundener Mediendateien (etwa Bilder oder Videos) können im Regelfall durch Anklicken dieser abgerufen werden. Möglicherweise unterliegen die Inhalte jeweils zusätzlichen Bedingungen. Durch die Nutzung dieser Website erklären Sie sich mit den Nutzungsbedingungen und der Datenschutzrichtlinie einverstanden.
Wikipedia® ist eine eingetragene Marke der Wikimedia Foundation Inc. Datenschutz Über Wikipedia Impressum Entwickler Stellungnahme zu Cookies Mobile Ansicht
grindstone

Post by grindstone »

Same result with firewall off.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Post by caseih »

Definitely a bug in curl. The command-line curl also returns zero bytes. wget, on the other hand, retrieves the document just fine.
jj2007

Post by jj2007 »

grindstone wrote:Same result with firewall off.
Sorry, no idea why it doesn't work for you. Anybody else? The archive is here.
counting_pine
Site Admin
Posts: 6323
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Post by counting_pine »

I wouldn't expect the firewall to get involved, since without inspecting packets it can't see the URL requested, only the IP address of the host, and (based on the port) whether HTTP or HTTPS was used.

It occurs to me that we need to be careful when talking about this issue, because there could be a number of factors:
- whether the ä character or %C3%A4 is used in the string
- the encoding used/detected in the bas file
- whether the function is expecting a string or a wstring
- possibly also the OS used, and the charset or encoding in use
- and of course, potentially the version of curl and more significantly, whether it's the library or executable
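
To illustrate the first two factors: the same ä is a single byte (0xE4) in ISO-8859-1 (Latin-1) but two bytes (0xC3 0xA4) in UTF-8, so a source file saved in the "wrong" encoding ends up sending a different URL. Below is a minimal C sketch of the conversion, only as an illustration. The function name latin1_to_utf8 is made up for this example, and it assumes plain Latin-1 input (Windows-1252 differs in the range 0x80 to 0x9F).

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 (Latin-1) string
   to UTF-8. dst must have room for 2 * strlen(src) + 1 bytes.
   Returns the number of bytes written, excluding the terminating NUL. */
size_t latin1_to_utf8(const char *src, char *dst)
{
    const unsigned char *in = (const unsigned char *)src;
    unsigned char *out = (unsigned char *)dst;

    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            *out++ = *in;                                   /* ASCII is unchanged */
        } else {
            *out++ = (unsigned char)(0xC0 | (*in >> 6));    /* lead byte 110000xx */
            *out++ = (unsigned char)(0x80 | (*in & 0x3F));  /* continuation 10xxxxxx */
        }
    }
    *out = '\0';
    return (size_t)(out - (unsigned char *)dst);
}
```

Called with "Zuf\xE4llige" (Latin-1), this should fill the buffer with "Zuf\xC3\xA4llige" (UTF-8).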

I tried curl -v "http://de.wikipedia.org/wiki/Spezial:Zufällige_Seite" on Linux, and found that it would send the GET request in UTF-8. (I was slightly surprised to see that it would use percent-encoding only if that was passed, meaning it doesn't convert either way.)
I'm not sure what it would do on the Windows command line, where I don't think it uses UTF-8.

If wget works, that perhaps means it passes the URLs differently.
wget -d "http://de.wikipedia.org/wiki/Spezial:Zufällige_Seite" on Linux does seem to percent-encode the URL, so maybe that's why wget works.
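
For anyone who wants to try the percent-encoded route with libcurl, here is a rough C sketch that percent-encodes only the non-ASCII bytes of a UTF-8 URL before it is handed to the library. The function name percent_encode_non_ascii is made up for this example, and it assumes the input string is already UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: copy a URL, replacing every byte outside printable
   ASCII (space and control characters included) with %XX. Existing '%'
   characters are copied as-is, so an already-encoded URL is not encoded twice.
   Returns 0 on success, -1 if dst is too small. */
int percent_encode_non_ascii(const char *url, char *dst, size_t dst_size)
{
    static const char hex[] = "0123456789ABCDEF";
    const unsigned char *in = (const unsigned char *)url;
    size_t used = 0;

    for (; *in != '\0'; in++) {
        if (*in > 0x20 && *in < 0x7F) {
            if (used + 1 >= dst_size) return -1;   /* keep room for the NUL */
            dst[used++] = (char)*in;
        } else {
            if (used + 3 >= dst_size) return -1;   /* "%XX" plus the NUL */
            dst[used++] = '%';
            dst[used++] = hex[*in >> 4];
            dst[used++] = hex[*in & 0x0F];
        }
    }
    if (used >= dst_size) return -1;
    dst[used] = '\0';
    return 0;
}
```

The result can then be passed to curl_easy_setopt(curl, CURLOPT_URL, out). libcurl also has curl_easy_escape(), but that escapes "/" and ":" as well, so it is meant for a single path segment or query value rather than a whole URL.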
jj2007

Post by jj2007 »

counting_pine wrote:- whether the ä character or %C3%A4 is used in the string
My version

    pBuffer=FileRead$("https://de.wikipedia.org/wiki/Spezial:Zufällige_Seite")

definitely uses the ä character in its UTF-8 encoding (tested successfully on Win7-64 and Win10).

Under the hood, it's InternetOpenUrlA. There is also InternetOpenUrlW for UTF-16, but it's broken (Microsoft Social):
Do not use InternetOpenUrlW() with lpszHeaders set

You should not use the Unicode version of this function if you want to send additional headers.

If dwHeadersLength excludes the terminating null, the function crashes with error ERROR_HTTP_HEADER_NOT_FOUND.

If it does include the terminating null character, this null character is also sent to the server. This works on Apache servers, but various servers respond with HTTP 400 Bad Request errors.

You can solve the problem by calling WideCharToMultiByte() to convert the URL to ANSI and then calling InternetOpenUrlA(), which accepts dwHeadersLength not including the terminating null.
grindstone

Post by grindstone »

caseih wrote:wget, on the other hand, retrieves the document just fine.
I can confirm that, thank you for the hint to wget. Is there a way to use wget as a library?
grindstone

Post by grindstone »

As far as I can see, wget uses the libraries libeay32.dll and libssl32.dll. Does anyone know of any documentation or header files (.h / .bi) for them?
counting_pine
Site Admin
Posts: 6323
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: No german umlauts with libcurl

Post by counting_pine »

(I assume you're not content with just passing percent-encoded ASCII URLs to the function?)

I think the only problem with libcurl is that it expects UTF-8 strings rather than multibyte strings. That's why it only accepts zstrings.

You should be able to convert a WString to a UTF-8 String using this function:

Code: Select all

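#include once "utf_conv.bi"

' Convert a WString to a UTF-8 encoded String (e.g. to pass to libcurl)
function wstr_to_utf8(byref w as wstring) as string
	dim as byte ptr utfstr
	dim as integer bytes

	' len( w ) + 1 includes the null terminator in the conversion;
	' a destination of 0 is assumed to make WCharToUTF allocate the buffer
	utfstr = WCharToUTF( UTF_ENCOD_UTF8, w, len( w ) + 1, 0, @bytes )
	if utfstr = 0 then return ""	' conversion failed

	function = *cptr(zstring ptr, utfstr)

	deallocate( utfstr )
end function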
(I cobbled the above together quickly from https://github.com/freebasic/fbc/blob/m ... nv.bas#L23)

Also, before going further, just to make sure, everyone who hasn't read this Joel Spolsky article should do so:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)