Unicode and files


Post by Juergen Kuehlwein »

Reading the docs, it seems I cannot have Unicode file names, but I can read and write Unicode data. This is independent of the file encoding: I can read and write WSTRINGs from and to an ASCII-encoded file. So basically Unicode strings work with files, but real Unicode file names fail. If you change "s" in my previous code example from string to wstring, it will work with the same file name, but with Russian characters it fails (so there must be some conversion from wstring to string under the hood for the "open" statement).

Code: Select all

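dim s as string
dim w as wstring * 16

  s = "d:\asdf.txt"
  w = "asdf"

  ' write the raw wstring buffer to the file
  open s for binary as #1
    put #1,, w
  close #1

  w = ""

  ' read it back into the (now empty) wstring
  open s for binary as #1
    get #1,, w
  close #1

  ' convert back to a plain string for display
  s = ""
  s = w

  print s
  print len(s)
  sleep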
  
I ran my test on Windows, so for a Unicode file name I will have to resort to "_wfopen" from the C runtime or make use of the Windows API (CreateFile ...), I think.

How about Linux? I remember having read somewhere that "_wfopen" isn't available on Linux, but that "fopen" accepts UTF-8 encoded file names instead. Is this true?


JK

Post by dodicat »

The problem is the Open keyword, I believe: it is ASCII only.
As you say, the CRT can handle Unicode file names and Unicode content.
I have not tested this code on Linux, though; perhaps Linux fopen can do the job.
I shall try later.
For this I must use the Poseidon IDE or load directly into WordPad; all other IDEs fail with Unicode (or the Unicode setup is difficult).

Code: Select all

 

#include "crt.bi"

'extern "windows"
'declare function CopyFileW(byval lpExistingFileName as wstring ptr, byval lpNewFileName as wstring ptr, byval bFailIfExists as BOOLean) as BOOLean
'end extern

const size=5000

sub save(filename as wstring,content as wstring)
   dim as wstring * 20 w="wt+,ccs=utf-8"
   var fp = _wfopen(@filename,@w)
   ' bail out before touching a null file pointer
   if fp=0 then print "could not save ";filename:exit sub
   fputws(@content,fp)
   fclose(fp)
end sub


sub load(filename as wstring,content as wstring)
   dim as wstring * 20 r="rt+,ccs=utf-8"
   dim as wstring * size t
   dim as file ptr fp = _wfopen(@filename,@r)
   if fp=0 then print "Unable to load ";filename:sleep:exit sub
   ' read line by line until fgetws reports end of file
   while 1
      if (fgetws (@t,size,fp)= 0)then exit while
      content+=t
   wend
   fclose(fp)
end sub

dim as wstring * 50 w= wstr("СергeйСергeй.txt")

dim as wstring * 100 message=wstr("hello from СергeйСергeй.txt")

save(w,message)


'dim as boolean b =false 'creates a new file if false
'print "copied "; copyfileW(w,"a newfile.txt",b)
'print "Folder contents:"
'print
shell "dir /b"
print
dim as wstring * 1000 content

load(wstr("СергeйСергeй.txt"),content)
print content
sleep


  

Post by St_W »

Yes; Linux does not provide separate APIs for Unicode, like Windows does for UTF-16, but supports UTF-8 in existing APIs instead.
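
A minimal sketch of what that could look like from FreeBASIC on Linux, assuming the C runtime fopen hands the file name bytes on unchanged. The name is built from explicit UTF-8 bytes so the example does not depend on how the source file itself is saved; the file name and the text written are made up for illustration, and I have not run this on every distribution.

```freebasic
#include "crt.bi"

' "café.txt" with the é spelled out as its two UTF-8 bytes (C3 A9),
' so the encoding of this source file does not matter.
Dim As String fname = "caf" & Chr(&hC3, &hA9) & ".txt"

' fopen takes a zstring ptr; the name is passed as a plain byte sequence.
Dim As FILE Ptr fp = fopen(StrPtr(fname), "w")
If fp = 0 Then
    Print "could not create the file"
Else
    fputs("hello", fp)
    fclose(fp)
    Print "file written"
End If
```

Whether a terminal or file manager then shows the name as café.txt depends on the locale it runs under.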

Post by marcov »

AFAIK Linux has no encoding support at the kernel level and stores all file names as byte sequences.

In other words, a name with German or French accented characters stored as ISO 8859-x is not the same name as one stored as UTF-8.
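
To make the point concrete, here is a small sketch that builds the same visible character, é, in both encodings and compares the resulting byte strings. The variable names are only for illustration.

```freebasic
' The letter é encoded two ways
Dim As String latin1 = Chr(&hE9)          ' ISO 8859-1: one byte, E9
Dim As String utf8   = Chr(&hC3, &hA9)    ' UTF-8: two bytes, C3 A9

Print "ISO 8859-1 bytes:"; Len(latin1)
Print "UTF-8 bytes:     "; Len(utf8)

' A file system that only compares bytes sees two different names.
If latin1 = utf8 Then
    Print "same name"
Else
    Print "different names"
End If
```

The two strings differ in both length and content, so a file created under one encoding will not be found by a program that spells the name in the other.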

Post by jj2007 »

Under Windows, the trick is to convert the UTF-8 string to UTF-16 under the hood of the "Open" keyword (both encodings are "Unicode"). This is Assembly, and it works as expected:

Code: Select all

  Open "O", #1, "Временный файл.txt"
  Print #1, "Добро пожаловать, مرحبا بكم, 歡迎"

I don't think it would be excessively difficult to implement in FB. All you need is MultiByteToWideChar().
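
A rough sketch of that idea in FreeBASIC, for Windows only. OpenUtf8 is a hypothetical helper name, not something FB provides, and this is not how the built-in Open statement works; it only shows the UTF-8 to UTF-16 conversion followed by _wfopen, which the crt headers already declare.

```freebasic
#include "windows.bi"
#include "crt.bi"

' Hypothetical helper: open a file whose name is given as UTF-8 bytes.
Function OpenUtf8(ByRef utf8name As String, ByRef mode As WString) As FILE Ptr
    ' First call: ask how many UTF-16 units are needed.
    ' With cbMultiByte = -1 the count includes the terminating null.
    Dim As Long n = MultiByteToWideChar(CP_UTF8, 0, StrPtr(utf8name), -1, NULL, 0)
    If n = 0 Then Return NULL

    Dim As WString Ptr wname = CAllocate(n, SizeOf(WString))
    If wname = 0 Then Return NULL

    ' Second call: do the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, StrPtr(utf8name), -1, wname, n)

    Dim As FILE Ptr fp = _wfopen(wname, @mode)
    Deallocate(wname)
    Return fp
End Function

' Usage: "café.txt" with é written as its UTF-8 bytes C3 A9
Dim As String fname = "caf" & Chr(&hC3, &hA9) & ".txt"
Dim As FILE Ptr fp = OpenUtf8(fname, WStr("w"))
If fp Then
    fputs("hello", fp)
    fclose(fp)
Else
    Print "could not open the file"
End If
```

Calling MultiByteToWideChar twice (once for the size, once for the data) avoids guessing a buffer length for the wide name.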

Post by Munair »

On most Unix-like systems there is the convention that file names are interpreted as UTF-8, even though file system drivers just handle the byte sequence; it doesn't matter to them what the bytes mean. The only two special bytes are the slash and the null character. There are applications that interpret file names as characters, like FTP clients. Usually, when the LC_CTYPE environment variable is set to a UTF-8 locale, these applications will handle the file names correctly.
Post by Juergen Kuehlwein »

Thanks for the info!

JK

Post by jj2007 »

I am trying to teach Unicode to FreeBasic. This works so far:

Code: Select all

#Include "windows.bi"
#Include "LoadSaveFile.bi"	' works fine
' #Include "لغة البرمجة الأساسية مجانية.bi"	' no luck

Dim As Wstring * 100 ip = "لغة البرمجة الأساسية مجانية.bi"	' works fine
Dim As String Strg=LoadFile(ip)
Dim As Integer SavedBytes=SaveFile("Temp.asc", Strg) 
If SavedBytes>0 Then
	Print "OK, ";SavedBytes;" bytes saved, here is the file content:"; Chr(13, 10, 10); Strg
else
	Print "SaveFile failed"
endif
Sleep

The necessary files are here (no executables, just plain FB, but one file has an Arabic name).

Line 3 doesn't work: Gas and Gcc throw error 23, File not found, "لغة البرمجة الأساسية مجانية.bi". So apparently the compilers can't handle UTF-8 names for includes, but strings that are part of the code are fine.

Post by marcov »

On Linux this might work, but it might be hard on Windows. On Windows the system is UTF-16; the fact that more and more apps support UTF-8 as a document encoding there confuses many people. Most things (including the command line) work with the default encoding (usually windows-125x).

There are two remedies for this:
(1) The classic way: update everything that uses file names to call the UTF-16 API functions, use UTF-8 or something else internally, and convert accordingly. Chances are that the binutils have not been fixed for this (half of them barf on spaces in path names after 25 years of long file names), so this is not as easy as it sounds.

(2) Since the April 2018 update, Windows 10 has an option to enable UTF-8 as the default encoding. This might help, but I haven't heard from anybody who has experimented with it (no news, good or bad).
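
For anyone who wants to experiment with option (2), a small check like the following should tell whether the process is actually running with UTF-8 as its ANSI code page. This is only a sketch and assumes the option takes effect through the active code page that GetACP reports.

```freebasic
#include "windows.bi"

' GetACP returns the active ANSI code page of the process; 65001 is UTF-8.
Dim As UINT acp = GetACP()
If acp = CP_UTF8 Then
    Print "ANSI code page is UTF-8 (65001)"
Else
    Print "ANSI code page is"; acp
End If
```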

Post by jj2007 »

It's not a Windows problem. The build itself works just fine, and the compiler(s) can handle the Arabic filename ip = "لغة البرمجة الأساسية مجانية.bi" used for LoadFile. It is just the #include that doesn't work, which is an internal issue of the compiler(s).
Post by dodicat »

I have had a try, jj2007.
I usually manage fudges, but this time no luck.
I tried to change the code page with shell("chcp 65001") in a Sub Constructor, to try and activate it first, and also in the first include (LoadSaveFile.bi).

I also tried Unicode in a Pascal unit,
لغة البرمجة الأساسية مجانية.pas
but the Lazarus IDE and MSEide don't compile it, although the Pascal innards are sound. Those particular IDEs don't seem to like Unicode text anyway.

(Using the Poseidon IDE in Win 10 here for FB.)
I also gave up searching for "include" in the FB source code, for obvious reasons.
Anyway, I'll try again later.