Faster file loading

mrminecrafttnt · Post by **mrminecrafttnt** » Dec 17, 2017 9:10

dim as string Filename
dim as integer filenr = freefile
locate 2,2
Input "File to Load ";Filename
dim as integer errorcode = open (Filename for binary access read as #filenr)
    if errorcode = 0 then
        print "Loading.."
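        'create buffer (lof+1 elements; the spare last element is skipped on output)
        dim as ubyte buffer(lof(filenr))
        'load the whole file with a single read
        get #filenr,,buffer()
        'close the file using the number returned by freefile
        close #filenr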
    
        'output
        for i as integer = 0 to ubound(buffer)-1
            print chr(buffer(i));
        next
    
        
else
    locate ,2
    print "File Opening failed with errorcode :";errorcode
    select case errorcode
    case 2
        locate ,2
        print "File not found"
    case else
        locate ,2
        print "Unexpected error."
        end select
    sleep
    end
end if

MrSwiss · Post by **MrSwiss** » Dec 17, 2017 13:12

???
"faster" than what? (comparison missing)
A description of the pros and cons is also missing.

bcohio2001 · Post by **bcohio2001** » Dec 17, 2017 17:27

Sort of looks like the examples in the .chm file

Code: Select all

Dim Shared f As Integer

Sub get_integer()

    Dim buffer As Integer ' Integer variable

    ' Read an Integer (4 bytes) from the file into buffer, using file number "f".
    Get #f, , buffer

    ' print out result
    Print buffer
    Print

End Sub

Sub get_array()

    Dim an_array(0 To 10-1) As Integer ' array of Integers

    ' Read 10 Integers (10 * 4 = 40 bytes) from the file into an_array, using file number "f".
    Get #f, , an_array()

    ' print out result
    For i As Integer = 0 To 10-1
        Print an_array(i)
    Next
    Print

End Sub

Sub get_mem

    Dim pmem As Integer Ptr

    ' allocate memory for 5 Integers
    pmem = Allocate(5 * SizeOf(Integer))

    ' Read 5 integers (5 * 4 = 20 bytes) from the file into allocated memory
    Get #f, , *pmem, 5 ' Note pmem must be dereferenced (*pmem, or pmem[0])

    ' print out result using [] Pointer Indexing
    For i As Integer = 0 To 5-1
        Print pmem[i]
    Next
    Print

    ' free pointer memory to prevent memory leak
    Deallocate pmem

End Sub

' Find the first free file number.
f = FreeFile

' Open the file "file.ext" for binary usage, using the file number "f".
Open "file.ext" For Binary As #f

  get_integer()

  get_array()

  get_mem()

' Close the file.  
Close #f




' Load a small text file to a string

Function LoadFile(ByRef filename As String) As String
    
    Dim h As Integer
    Dim txt As String
    
    h = FreeFile
    
    If Open( filename For Binary Access Read As #h ) <> 0 Then Return ""
    
    If LOF(h) > 0 Then
        
        txt = String(LOF(h), 0)
        If Get( #h, ,txt ) <> 0 Then txt = ""
        
    End If
    
    Close #h
    
    Return txt
    
End Function

Dim ExampleStr As String
ExampleStr = LoadFile("smallfile.txt")
Print ExampleStr

There are various ways to load a file; which one to use depends on how you need to use the data in it.

St_W · Post by **St_W** » Dec 17, 2017 18:25

I guess faster than reading small pieces of data from a file.
I can also recommend reading small files into a buffer as a whole, or using a buffered reader for larger files that reads bigger blocks instead of just small chunks. The processing can then be done in memory.
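A minimal sketch of the block-wise reading suggested above, in plain FreeBASIC. The function name ReadInChunks and the 64KB default block size are assumptions made for illustration, not something from this thread; the processing step is left as a comment.

```freebasic
' Hypothetical helper: reads a file in fixed-size blocks and returns the
' total number of bytes seen, or -1 if the file could not be opened.
Function ReadInChunks(ByRef filename As String, ByVal chunkSize As Long = 65536) As LongInt
    Dim As Long h = FreeFile
    If Open(filename For Binary Access Read As #h) <> 0 Then Return -1

    Dim As UByte Ptr chunk = Allocate(chunkSize)
    If chunk = 0 Then
        Close #h
        Return -1
    End If

    Dim As LongInt total = 0
    Dim As UInteger got    ' bytes actually read by the last Get

    Do
        got = 0
        ' Read up to chunkSize bytes; the pointer must be dereferenced
        If Get(#h, , *chunk, chunkSize, got) <> 0 Then Exit Do
        If got = 0 Then Exit Do    ' nothing left to read
        ' process chunk[0] .. chunk[got - 1] here
        total += got
    Loop

    Deallocate(chunk)
    Close #h
    Return total
End Function

Print ReadInChunks("smallfile.txt")
```

Only one block is held in memory at a time, so the same loop should work for files that are too large to load in one piece.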

BasicScience · Post by **BasicScience** » Dec 23, 2017 17:21

I've used BLoad and BSave to move large chunks of data between disk and memory. It would be interesting to perform a quantitative test of the two methods.

deltarho[1859] · Post by **deltarho[1859]** » Dec 23, 2017 23:12

Some years ago I wrote some code to either encrypt or hash large files; with authenticated encryption it would both encrypt and hash. Originally, I used PowerBASIC's OPEN command. I then learned about the filecache.

When we load our browser, for example, for the first time after a boot or Restart a copy is made in the filecache, which is in RAM. On closing and opening again our browser comes in via the filecache. There is a little more to it than that but that is the basic idea. We can see this when encrypting or hashing a large file and then repeat the operation. Second and subsequent operations are completed in very much less time.

I then learned that we could turn the caching off. When encrypting or hashing a large file I did not need a copy kept in the filecache. I could not turn off caching via OPEN but I could if I used the Windows API CreateFile in conjunction with a flag called FILE_FLAG_NO_BUFFERING with a buffer on a 256 byte boundary. Large files then came in like greased lightning and any files in the filecache did not get 'kicked out' if the filecache could not be made larger. For very large files, too large to fit into RAM, I had to use a buffer.

The next innovation was read ahead. If I used a 256KB buffer, say, on a 500MB target file the system would present me with 256KB and would then start reading ahead. When I wanted the next buffer it was already in RAM. I would grab that and then the read ahead would 'kick in' again.

Of course, we defeat the read ahead if we request a full 500MB read. So, the secret is to use buffering whether we have bucket loads of RAM or not.

I also found that, and confirmed by a few members at PowerBASIC, that the 'sweet spot' for a buffer was 256KB.
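For readers who want to try the uncached approach from FreeBASIC, here is a rough, Windows-only sketch. The helper name UnbufferedRead is hypothetical and this is not the code described above. The Windows documentation asks for buffer addresses and read sizes aligned to the volume sector size when FILE_FLAG_NO_BUFFERING is used; the sketch assumes that page-aligned memory from VirtualAlloc and a 256KB read size satisfy that on typical drives.

```freebasic
#include once "windows.bi"

Const CHUNK_BYTES = 262144    ' 256KB, the buffer size suggested above

' Hypothetical helper: reads sFile through an uncached handle and returns
' the number of bytes read, or -1 on failure.
Function UnbufferedRead(ByRef sFile As String) As LongInt
    Dim As HANDLE h = CreateFile(StrPtr(sFile), GENERIC_READ, FILE_SHARE_READ, NULL, _
                                 OPEN_EXISTING, _
                                 FILE_FLAG_NO_BUFFERING Or FILE_FLAG_SEQUENTIAL_SCAN, NULL)
    If h = INVALID_HANDLE_VALUE Then Return -1

    ' VirtualAlloc hands back page-aligned memory (assumed to be sector-aligned too)
    Dim As Any Ptr buf = VirtualAlloc(NULL, CHUNK_BYTES, MEM_COMMIT Or MEM_RESERVE, PAGE_READWRITE)
    If buf = NULL Then
        CloseHandle(h)
        Return -1
    End If

    Dim As DWORD got
    Dim As LongInt total = 0
    Do
        If ReadFile(h, buf, CHUNK_BYTES, @got, NULL) = 0 Then Exit Do
        If got = 0 Then Exit Do    ' end of file
        ' process "got" bytes starting at buf here
        total += got
    Loop

    VirtualFree(buf, 0, MEM_RELEASE)
    CloseHandle(h)
    Return total
End Function

Print UnbufferedRead("500MB.dat")
Sleep
```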

For large files the opening post is the slowest way to read them. Reading a 500MB file, for example, via a 256KB buffer would leave the opening post standing.

Just as a point of interest, with authenticated encryption where the hash is computed on the encrypted data then encryption and hashing had to be done synchronously. With decryption the decryption and hashing could be done asynchronously so the hashing was done in a separate thread of execution. Decryption was faster than hashing so I had to wait for the hashing to complete before loading the buffer again for decryption. In effect then the time to complete a job was down to the hashing - the encryption was free. I was using SHA256. My next step, which I did not get around to trying, would be to use BLAKE2 for the hashing. I was thinking that I might have to wait for the decryption to complete. <smile>

It is a big subject, file handling, and, like many things in coding, it requires us to do a fair amount of reading, and that reading does not stop as operating systems improve.

deltarho[1859] · Post by **deltarho[1859]** » Dec 24, 2017 1:42

Here is a comparison with the opening post and buffering on Windows.

The first thing to do is to create a large file.

Code: Select all

Dim As Long i, f
Dim as String*262144 sData
 
f = Freefile
Open "500MB.dat" for Binary as #f
  For  i = 1 to 2000
    Put #f, , sData
  Next
Close #f
 
Print "Done"
 
sleep

We now have a 500MB file.

The following uses buffering and then the filecache is cleared of "500MB.dat" otherwise when we come to read "500MB.dat" again it will 'fly in'. You will have to take my word for it that ClearFileCache3 does what it says. ClearFileCache was written by me about seven years ago, swiftly followed by ClearFileCache2 - a more robust version. Many folk at PowerBASIC used it when developing code as it saved them restarting their machine to clear the filecache. Comparing hash functions, for example, requires 'flushing' the filecache between functions. A couple of years ago a guy, Paul Purvis, played around with my code and found that the file did not have to be read, it only needed to be opened and then closed again. The operative flag is, of course, FILE_FLAG_NO_BUFFERING. I am not even sure that Microsoft are aware of this trick.

OK, testing time.

Buffering with a HDD, on my Win10 machine, comes in at 0.105s. The opening post method comes in at 4.063s. That is pushing 40 times faster.

If we comment out ClearFileCache3( "500MB.dat" ) the opening post comes in at 0.197s. It is still slower than buffering and it is not getting data from the hard drive.

Needless to say, if we are only reading small files then the opening post is fine. With a 512KB file we get 0.000365s and 0.001938s giving buffering a 5 times edge. With the opening post coming in at 2ms do we still need to be clever? Probably not. <smile>

Code: Select all

#include once "windows.bi"
 
Dim buffer1 As String * 262144
Dim As Long f
Dim As Double t
 
Declare Function ClearFileCache3( As String ) As Long
 
f = Freefile
t = Timer
 
' Buffering method
Open "500mb.dat" For Binary As #f
  Do
    Get #f, ,buffer1
    'Do some stuff
  Loop Until Eof(f)
Close #f
t = Timer - t
Print t
 
ClearFileCache3( "500MB.dat" )
 
f = Freefile
t = Timer
 
' Opening post method
Open "500MB.dat" For Binary As #f
  Dim As Ubyte buffer2(Lof(f))
  Get #f, ,buffer2()
  ' Do some stuff
  Close #f
t = Timer - t
Print t
 
Sleep
 
Function ClearFileCache3( sFile As String ) As Long
  Dim As Long lRes
  Dim As HANDLE hSys
  Dim As Zstring*256 zFile
 
  zFile = sFile
  ' Open file without buffering - ie bypass the filecache
  hSys = CreateFile( zFile, GENERIC_READ, 0, Byval 0, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN Or FILE_FLAG_NO_BUFFERING, 0 )
  ' CreateFile signals failure by returning INVALID_HANDLE_VALUE
  If hSys = INVALID_HANDLE_VALUE Then
    lRes = GetLastError
    If lRes = ERROR_FILE_NOT_FOUND Then
      Print "Unable to find" + " " + sFile
    Else
      Print Str( lRes ) + " " + "Unknown error on opening file"
    End If
    Function = True
    Exit Function
  End If
 
  CloseHandle( hSys )
 
End Function

jj2007 · Post by **jj2007** » Dec 24, 2017 3:02

deltarho[1859] wrote:Here is a comparison with the opening post and buffering on Windows.

Really nice, thanks for sharing this! I allowed myself to quote you here:

Just learned a new trick from deltarho[1859] - clear the file cache for better testing

Results on my Win7-64 machine:

Code: Select all

with file caching, Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz:
26902 strings, 870 µs   977412 bytes, 654 µs
26902 strings, 840 µs   977412 bytes, 565 µs
26902 strings, 793 µs   977412 bytes, 547 µs
26902 strings, 808 µs   977412 bytes, 558 µs

no caching:
26902 strings, 5289 µs  977412 bytes, 4970 µs
26902 strings, 5223 µs  977412 bytes, 4981 µs
26902 strings, 5196 µs  977412 bytes, 5107 µs
26902 strings, 5274 µs  977412 bytes, 5000 µs

deltarho[1859] · Post by **deltarho[1859]** » Dec 24, 2017 7:45

Hi jj2007

Unfortunately you missed out the very first run of the file caching sequence.

With the 'Results for a 40MB file:' table at your MASM link the second and subsequent runs of the file caching sequence are clearly faster than the first run. With the runs of the no caching sequence we see the timings are of the same magnitude as the first run of the file caching sequence.

To coin a phrase, you threw the baby out with the bath water. <smile>

Off topic, but the filecache is one of the reasons I only ever put my machine to sleep - unless I have to Restart for some reason. I prefer to give my SSD an easy life by getting applications from RAM on second and subsequent openings. Giving my SSD an easy life is why I use a RAM disk for my internet cache. Yes, it is slightly faster but the main reason is that I don't want my SSD being bombarded with hundreds of little files. All my development work is on an internal HDD. I hadn't thought about it before but my SSD probably isn't used that much. <laugh>

BTW, Hutch knows me as David Roberts, but you didn't know that.

jj2007 · Post by **jj2007** » Dec 24, 2017 10:34

deltarho[1859] wrote:Unfortunately you missed out the very first run of the file caching sequence.

Hi David,
Right, since I used a Masm32 include file, even the first run uses the cache (since the file is used in the building). Here are results for an 800MB file that I haven't touched for months, and as you rightly claim, it clearly shows that the cache is not yet loaded in the first (and second!) run:

Code: Select all

with file caching, Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz:
6076400 strings, 12 s   815553800 bytes, 17 s
6076400 strings, 646 ms 815553800 bytes, 575 ms
6076400 strings, 602 ms 815553800 bytes, 451 ms
6076400 strings, 602 ms 815553800 bytes, 456 ms
6076400 strings, 601 ms 815553800 bytes, 453 ms
6076400 strings, 610 ms 815553800 bytes, 460 ms
6076400 strings, 608 ms 815553800 bytes, 453 ms
6076400 strings, 605 ms 815553800 bytes, 457 ms
6076400 strings, 610 ms 815553800 bytes, 457 ms
6076400 strings, 607 ms 815553800 bytes, 453 ms

no caching:
6076400 strings, 12 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s
6076400 strings, 11 s   815553800 bytes, 11 s

Btw the difference between the left and right columns is the processing time for turning a flat buffer into an array of strings with address and len for each of the 6076400 text lines. Small compared to a "fresh" disk I/O but significant if caching is active.

deltarho[1859] · Post by **deltarho[1859]** » Dec 24, 2017 16:10

@jj2007

I see.

deltarho[1859] · Post by **deltarho[1859]** » Dec 24, 2017 21:29

Just out of interest I ran the 500MB test on my internal SSD and got 0.107s and 1.262s.

You may question why the first figure is pretty much the same as the HDD figure. Well, that is because of the read ahead. The SSD shines with the second figure, giving 3.2 times the HDD speed. My SSD is a bit long in the tooth now - modern SSDs will give a better multiplier. The buffering method is still faster than the opening post, at nearly 12 times faster.

On the subject of SSDs I have an external SSD exclusively used for system backups. At 2100, every day, my SSD C: drive is backed up. I run a lean system, only 34GB, and the backup is done in less than four minutes. Three times in the last twelve months my Win10 system has gone 'down the plug hole'. I was back at the desktop using the previous evening's backup in about 13 minutes. I live dangerously and one of the failures was of my own doing. As for the other two, I am putting them squarely at Windows 10's doorstep. <smile>

deltarho[1859] · Post by **deltarho[1859]** » Dec 25, 2017 16:29

I have made a monumental mistake.

The reason I wrote the original ClearFileCache was in deciding which hash function to use in a particular application. If I timed a run using MD5, for example, I had to do a system Restart for timing a run using SHA256, for example, otherwise the SHA256 run would be faster because it was getting its data from the filecache, resulting in an erroneous conclusion. After a few days of testing I figured that there must be a better way of doing this and ClearFileCache was born. I finally opted to use SHA-1 for the application in question.

I had forgotten, it is a while since those days, that when we create a file it is also put into the filecache, not just on reading for the first time.

With the tests above I created the 500MB.dat first and then tested the two methods. What I should have done was to clear the filecache before the first test. If the file already existed and had not been read in the Windows session in which the test took place then it would not have been in the filecache. In the latter case, if we repeated the test then the file would be in the filecache after the second test.

By clearing the filecache before the first test there is now little difference between buffering and the opening post. Buffering is still faster but not to the extent that I wrote. Interestingly, if both methods got the file from the filecache then the buffering method is twice as fast. I think read ahead is the reason for that. I must qualify this by saying 'on Windows 10'. The original ClearFileCache was written on Windows XP and I cannot say that nothing has changed since then. ClearFileCache is still OK but file handling in general may not be the same.
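A sketch of the corrected test order, assuming Windows and a "500MB.dat" created as earlier in the thread. EvictFromCache is a hypothetical stand-in for ClearFileCache3 that relies on the same open-and-close trick; TimeChunked and TimeWhole are illustrative names. The point is that the cache is cleared before each timed run, including the first.

```freebasic
#include once "windows.bi"

Const CHUNK_BYTES = 262144    ' 256KB buffer, as used earlier in the thread

' Hypothetical stand-in for ClearFileCache3: open without buffering, then close.
Function EvictFromCache(ByRef sFile As String) As Boolean
    Dim As HANDLE h = CreateFile(StrPtr(sFile), GENERIC_READ, 0, NULL, OPEN_EXISTING, _
                                 FILE_FLAG_SEQUENTIAL_SCAN Or FILE_FLAG_NO_BUFFERING, NULL)
    If h = INVALID_HANDLE_VALUE Then Return False
    CloseHandle(h)
    Return True
End Function

' Buffering method: read the file 256KB at a time. Returns seconds, or -1 on failure.
Function TimeChunked(ByRef sFile As String) As Double
    ReDim As UByte chunk(0 To CHUNK_BYTES - 1)
    Dim As Long h = FreeFile
    Dim As Double t = Timer
    If Open(sFile For Binary Access Read As #h) <> 0 Then Return -1
    Do Until EOF(h)
        Get #h, , chunk()
    Loop
    Close #h
    Return Timer - t
End Function

' Opening post method: read the whole file in one Get. Returns seconds, or -1 on failure.
Function TimeWhole(ByRef sFile As String) As Double
    Dim As Long h = FreeFile
    Dim As Double t = Timer
    If Open(sFile For Binary Access Read As #h) <> 0 Then Return -1
    If LOF(h) > 0 Then
        ReDim As UByte whole(0 To LOF(h) - 1)
        Get #h, , whole()
    End If
    Close #h
    Return Timer - t
End Function

' Clear the cache before EVERY timed run so neither method gets a head start.
EvictFromCache("500MB.dat")
Print "chunked:", TimeChunked("500MB.dat")
EvictFromCache("500MB.dat")
Print "whole:  ", TimeWhole("500MB.dat")
Sleep
```

Swapping the order of the two timed runs is a cheap sanity check: if the cache is really cleared each time, the figures should not depend on which method goes first.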

There is another aspect to getting files from the filecache which some folks may not have considered. The filecache is in RAM so whether the file is fragmented or not is neither here nor there. This is, of course, another reason why SSDs are fast - it makes no difference whether a file is fragmented or not, making defragging an SSD a pointless exercise.

The lesson to be learnt here is that thinking we know what we are talking about is no guarantee that we will get things right; it just means that we are less likely to get things wrong. <Ha, ha>
