Need help with ASM

Re: Need help with ASM

Post by fzabkar »

Thanks.

On my old machine, the time per OR improves from about 7 ns (Gcc) to about 5 ns (asm) and then to about 1.3 ns (SIMD asm).

Code: Select all

 99999999 elements
asm:  638 ms for or'ing 99999999 elements
asm:  521 ms for or'ing 99999999 elements
asm:  442 ms for or'ing 99999999 elements
asm:  518 ms for or'ing 99999999 elements
asm:  466 ms for or'ing 99999999 elements
 4095
 15791
 8187
 11977
 12247
 14803
 9085
 6078
 6143
 10239

Code: Select all

 99999999 elements
Gcc:   702 ms for or'ing 99999999 elements
Gcc:   677 ms for or'ing 99999999 elements
Gcc:   714 ms for or'ing 99999999 elements
Gcc:   706 ms for or'ing 99999999 elements
Gcc:   635 ms for or'ing 99999999 elements

Parallel processing with SIMD or AVX:

Code: Select all

 99999999 elements
asm:  131 ms for or'ing 99999999 elements
asm:  149 ms for or'ing 99999999 elements
asm:  136 ms for or'ing 99999999 elements
asm:  127 ms for or'ing 99999999 elements
asm:  124 ms for or'ing 99999999 elements

Post by fzabkar »

Thanks for all the input.

I am well aware that the application is I/O bound. I have already optimised the disk I/O as best as I can by reading as much of each file into RAM as possible. My program checks the free space (FRE) and tries to use up to 90% of it, if the file size warrants it. (Are there any caveats to this approach?)

As I see it, there are two components to disc reading. One is the actual data transfer rate, which is independent of the chunk size (how does caching affect this?), and the other is the access time, i.e. the time taken to arrive at the target sector. The access time consists of seek time (the time to arrive at the target track) plus rotational latency (the time for the target sector to pass under the head). This access time could be 10 ms on average.

For example, if we assume that we have a contiguous 1GB file, and that it fits into RAM, then the access time penalty would be only 10ms. OTOH, if we read the file in 1MB chunks, then the total access time overhead could be 10 seconds, depending on disc caching and file fragmentation.
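
The arithmetic behind that estimate can be sketched as follows. This is a rough model that charges one full access time per chunk, which is the worst case (no read-ahead, every chunk needs a fresh seek). The function name `AccessOverheadSec` is made up for this illustration, and the 10 ms figure is the assumed average from above.

```freebasic
' Rough model: one access-time penalty (seek + rotational latency) per chunk.
' Assumption: every chunk pays the full penalty, i.e. the worst case.
Function AccessOverheadSec(ByVal fileBytes As ULongInt, _
                           ByVal chunkBytes As ULongInt, _
                           ByVal accessMs As Double) As Double
    ' Number of chunks, rounded up so a partial last chunk counts too
    Dim As ULongInt chunks = (fileBytes + chunkBytes - 1) \ chunkBytes
    Return chunks * accessMs / 1000.0
End Function

Const MB As ULongInt = 1024 * 1024
Const GB As ULongInt = 1024 * MB

Print "1 GB in one read    : "; AccessOverheadSec(GB, GB, 10); " s"
Print "1 GB in 1 MB chunks : "; AccessOverheadSec(GB, MB, 10); " s"
```

With 1024 chunks of 1 MB the model gives 1024 x 10 ms, i.e. roughly 10 seconds, against 10 ms for a single read. Real drives with read-ahead and a contiguous file should land well below the worst case.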

The reason I am investigating ASM code is that I am looking for additional optimisations.

Post by marcov »

fzabkar wrote:
As I see it, there are two components to disc reading. One is the actual data transfer rate, which is independent of the chunk size (how does caching affect this?), and the other is the access time, i.e. the time taken to arrive at the target sector. The access time consists of seek time (the time to arrive at the target track) plus rotational latency (the time for the target sector to pass under the head). This access time could be 10 ms on average.
Yeah, but if you read everything sequentially, the drive firmware will read whole cylinders into the drive's RAM, and you read from that. Also, SATA can have multiple request blocks outstanding, allowing the firmware to reorder requests up to a depth of typically 64 or 128 commands (a simplified form of SCSI tagged command queueing). 64 * 4k advanced format sector size might be your 256 KB sweet spot.

And of course, linearly reading all sectors is the most predictable pattern of all. This is why the sequential read rate is so much faster than random 4k reads.

Still, with SSE/AVX you can get near the memory read/write speed, which is around 30-50 GB/s (divide by 3 for the 3 accesses of A=B+C). Even your old Core2 can probably get well above 1 GB/s with only one core. That leaves the other core to handle I/O tasks, and there is no way an HDD can match that unless it is a very wide array of very expensive disks (e.g. 3 15000 RPM discs with an expensive controller can sustain 400 MB/s).
Last edited by marcov on Mar 17, 2021 8:38, edited 1 time in total.

Post by deltarho[1859] »

@fzabkar

Read ahead uses a secondary thread of execution, and you have two cores.

Experiment with different buffer sizes. If you use 90% of free memory, then virtual memory will come into play and your performance will nosedive.

Post by fzabkar »

I had wondered whether grabbing all the free memory would affect swapping, even if I could guarantee that no other tasks were running. Unfortunately I am unable to do any serious experimenting, as I'm writing this tool for someone else (for free), and I don't have any spare drives for testing purposes. Worse still, the user is in a different time zone. I suppose I could recompile the program with alternative free memory options, say 25%, 50%, 75% and 90%, and the user can then test them at his own convenience.

Post by jj2007 »

deltarho[1859] wrote:On my HDD the sweet spot tends to be 256KB
That might be difficult to handle if you have files with widely differing sizes, but it is worth a try. Even more important: instead of loading the file into RAM, you should use a memory-mapped file.

Post by badidea »

Without having done any tests and without being an expert on this, I would go with 50% memory usage. I would not be surprised if an OS starts swapping data to disc before 90% memory usage. Also, I can imagine that memory fragmentation can have a negative performance impact if you try to use every last bit of memory.
Also, on what OS will the program run? Be sure that no virus scanner or similar software is running. That will annihilate the gained nanoseconds completely, and worse.

Post by fzabkar »

The OS is Windows 10, 64 bit, and the machine is dedicated to data recovery, with no AV software. The reason I have been asked to write this tool, and others, is that there are limitations in the user's commercial tool (MRT Lab) which are not being addressed by the developers (in China).

The other major problem with MRT Lab is that it doesn't separate good and bad files into separate directories. The user then needs to check each file for bad sectors. However, this creates its own problems. If we use a recognisable fill pattern ("bad!") for bad sectors, then directly OR-ing two bad files is no longer possible. If we use a zero fill pattern, then OR-ing is OK, but a good file cannot then be distinguished from a bad one.
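
The fill-pattern trade-off can be seen in a small sketch. It assumes two recovered copies of the same 8-byte "file", each with a different half unreadable. The names `OrStrings` and `DamagedCopy` are invented for this illustration and are not part of the actual tool.

```freebasic
' Byte-wise OR of two equal-length buffers held in strings.
' Assumption: Len(a) = Len(b).
Function OrStrings(ByRef a As Const String, ByRef b As Const String) As String
    Dim As String r = a
    For i As Integer = 0 To Len(r) - 1
        r[i] Or= b[i]
    Next
    Return r
End Function

' Copy of "good" in which nBytes bytes from "offset" on are overwritten
' with a repeating fill pattern, imitating unreadable sectors.
Function DamagedCopy(ByRef good As Const String, ByRef pattern As Const String, _
                     ByVal offset As Integer, ByVal nBytes As Integer) As String
    Dim As String s = good
    For i As Integer = 0 To nBytes - 1
        s[offset + i] = pattern[i Mod Len(pattern)]
    Next
    Return s
End Function

Dim As String good = "ABCDEFGH"

' Zero fill: copy 1 lost its first half, copy 2 lost its second half
Dim As String z1 = DamagedCopy(good, Chr(0), 0, 4)
Dim As String z2 = DamagedCopy(good, Chr(0), 4, 4)
If OrStrings(z1, z2) = good Then Print "zero fill : OR recovers the data"

' "bad!" fill: the marker bits get OR-ed into the good bytes
Dim As String b1 = DamagedCopy(good, "bad!", 0, 4)
Dim As String b2 = DamagedCopy(good, "bad!", 4, 4)
If OrStrings(b1, b2) <> good Then Print "bad! fill : OR corrupts the data"
```

Where both copies are bad in the same place, the zero-filled result simply stays zero, which is exactly the case that cannot be told apart from genuine zero data.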

Post by caseih »

fzabkar wrote:I am well aware that the application is I/O bound. I have already optimised the disk I/O as best as I can by reading as much of each file into RAM as possible. My program checks the free space (FRE) and tries to use up to 90% of it, if the file size warrants it. (Are there any caveats to this approach?)
I don't see why you are reading it into RAM at all. That doesn't buy you anything. Obviously data has to move from the disk to the CPU still via the RAM, but there's no need to have an additional, explicit, copy step. There's no need to move it around twice, which is what you're doing. Besides that, there's every possibility the virtual memory manager will shuttle a lot of that back to swap anyway.

I've not done a lot of work with large files, but I think mapping a file into memory might be the best solution for this sort of access. Map the file into memory (which uses the virtual memory system), process it directly as if it were already in memory, then unload it and map a new file. No issues with memory allocation and swap use, since the file is the swap.

Post by deltarho[1859] »

jj2007 wrote:That might be difficult to handle if you have files with widely differing sizes
I was referring to buffer size - I wrote earlier "Experiment with different buffer sizes."

My Encrypternet application uses Const BufferSize = 256 * 1024.

Experiment with the 256 and do timings. We can encrypt 10MB, 100MB, 1GB, 2GB, 4GB or whatever. It is a pointless exercise to use a percentage of free RAM because once we get past a particular percentage the performance goes horizontal, and we knock seven bells out of the file cache, defeating its reason to exist.
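
A minimal sketch of such an experiment, assuming a plain sequential read through one fixed-size buffer. `ReadChunked` and the file name `test.bin` are hypothetical, and the buffer sizes tried are arbitrary starting points.

```freebasic
' Read a file front to back through a fixed-size buffer.
' Returns the number of bytes read, or -1 if the file cannot be opened.
Function ReadChunked(ByRef fileName As Const String, ByVal bufSize As Integer) As LongInt
    Dim As UByte Ptr buf = Allocate(bufSize)
    If buf = 0 Then Return -1

    Dim As Long f = FreeFile
    If Open(fileName For Binary Access Read As #f) <> 0 Then
        Deallocate(buf)
        Return -1
    End If

    Dim As LongInt total = 0
    Dim As UInteger got
    Do
        got = 0
        Get #f, , *buf, bufSize, got   ' got receives the bytes actually read
        total += got
        ' ... process buf[0] to buf[got - 1] here ...
    Loop While got = bufSize           ' a short read means end of file

    Close #f
    Deallocate(buf)
    Return total
End Function

Const TestFile = "test.bin"                            ' hypothetical file name
Dim As Integer sizesKB(0 To 3) = {64, 256, 1024, 4096} ' buffer sizes to try

For i As Integer = 0 To 3
    Dim As Double t = Timer
    Dim As LongInt n = ReadChunked(TestFile, sizesKB(i) * 1024)
    Print sizesKB(i); " KB buffer:"; n; " bytes in"; Timer - t; " s"
Next
```

Only the first pass over a file is likely to hit the disk, because later passes are served from the file cache. For meaningful timings, use a different file per run or one much larger than RAM.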

Post by fzabkar »

Please excuse my ignorance, but how would I "map a file into memory"?

Post by marcov »

fzabkar wrote:Please excuse my ignorance, but how would I "map a file into memory"?
Google "windows" + some terms (like "memory map file") and you usually get MSDN links. Sometimes you need to try a few times to find the right terms.

https://docs.microsoft.com/en-us/window ... le-mapping
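
For orientation, here is a minimal, Windows-only sketch of what the file mapping calls look like from FreeBASIC. It is an illustration under assumptions, not a tested recipe: the type `MappedFile`, the helpers `MapFileRO` and `UnmapFile`, and the file name `image1.bin` are made up here, and error handling is reduced to the bare minimum. The Win32 calls themselves (`CreateFileA`, `GetFileSize`, `CreateFileMappingA`, `MapViewOfFile`, `UnmapViewOfFile`, `CloseHandle`) come from `windows.bi`.

```freebasic
#include once "windows.bi"

Type MappedFile
    hFile  As HANDLE
    hMap   As HANDLE
    p      As UByte Ptr    ' first byte of the mapped view
    nBytes As ULongInt     ' file size in bytes
End Type

' Map a whole file read-only. Returns True on success.
Function MapFileRO(ByRef fileName As String, ByRef m As MappedFile) As Boolean
    m.hFile = CreateFileA(StrPtr(fileName), GENERIC_READ, FILE_SHARE_READ, NULL, _
                          OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)
    If m.hFile = INVALID_HANDLE_VALUE Then Return False

    Dim As DWORD sizeHigh = 0
    Dim As DWORD sizeLow = GetFileSize(m.hFile, @sizeHigh)
    m.nBytes = (CULngInt(sizeHigh) Shl 32) Or sizeLow

    ' Size 0/0 means "use the current file size"
    m.hMap = CreateFileMappingA(m.hFile, NULL, PAGE_READONLY, 0, 0, NULL)
    If m.hMap = NULL Then
        CloseHandle(m.hFile)
        Return False
    End If

    ' Offset 0/0 and length 0 map the whole file in one view
    m.p = MapViewOfFile(m.hMap, FILE_MAP_READ, 0, 0, 0)
    If m.p = NULL Then
        CloseHandle(m.hMap)
        CloseHandle(m.hFile)
        Return False
    End If
    Return True
End Function

Sub UnmapFile(ByRef m As MappedFile)
    UnmapViewOfFile(m.p)
    CloseHandle(m.hMap)
    CloseHandle(m.hFile)
End Sub

Dim As MappedFile m
If MapFileRO("image1.bin", m) Then        ' hypothetical file name
    Dim As ULongInt nonZero = 0
    For i As UInteger = 0 To m.nBytes - 1
        If m.p[i] <> 0 Then nonZero += 1  ' the file is read like an array
    Next
    Print nonZero; " non-zero bytes of"; m.nBytes
    UnmapFile(m)
End If
```

In a 32-bit build, a single view of a file of a gigabyte or more may not fit into the address space, in which case `MapViewOfFile` fails. The usual workaround is to map the file in windows, passing an offset that is a multiple of the allocation granularity (typically 64 KB) and a non-zero length.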

Post by jj2007 »

deltarho[1859] wrote:
jj2007 wrote:That might be difficult to handle if you have files with widely differing sizes
I was referring to buffer size - I wrote earlier "Experiment with different buffer sizes."

My Encrypternet application uses Const BufferSize = 256 * 1024.

Experiment with the 256 and do timings. We can encrypt 10MB, 100MB, 1GB, 2GB, 4GB or whatever. It is a pointless exercise to use a percentage of free RAM because once we get past a particular percentage the performance goes horizontal, and we knock seven bells out of the file cache, defeating its reason to exist.
That's why I suggested memory-mapped files:
jj2007 wrote:instead of loading the file into RAM, you should use a memory-mapped file.
AFAIK the mapping does not need the file cache. But there are other problems; see e.g. the discussion of page faults in the answers of Jack, Dietrich and Alexey in "Does mmap really copy data to the memory?". In case of doubt, time it ;-)

Post by fzabkar »

jj2007 wrote:In any case, parallel processing with SIMD or AVX would be a lot faster. Try this one (a factor 5 faster than Gcc with -O3):

Code: Select all

	Dim as integer ps, pd
	ps=@dwinptrS(0)
	pd=@dwinptrB(0)
  asm
	mov ecx, [dwNumLongs]
	lea ecx, [ecx-16]
	mov esi, [ps]
	mov edi, [pd]
L0:	movups xmm0, [esi+ecx]
	movups xmm1, [edi+ecx]
	por xmm0, xmm1
	movups [edi+ecx], xmm0
	sub ecx, 16
	jns L0
  end asm
Shouldn't ...

Code: Select all

mov ecx, [dwNumLongs]
... be ...

Code: Select all

mov ecx, [dwNumBytes]

Post by jj2007 »

Yep, you are right - and then the speed is only 35% faster than the compiler...

Code: Select all

	mov ecx, [dwNumLongs]
	shl ecx, 2	' longs to bytes
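
One more thing to check, as far as the quoted loop can be read: it starts at byte offset numBytes - 16 and steps down by 16, so when the byte count is not a multiple of 16 the lowest (numBytes Mod 16) bytes are never reached. For 99999999 longs that is 399999996 bytes, which leaves 12 bytes (3 longs) untouched. A plain FreeBASIC reference routine like the sketch below can be used to cross-check the asm result. `OrBytes` is a made-up name, and the routine is meant as a correctness check, not as a speed contender.

```freebasic
' OR n bytes of src into dst without assuming n is a multiple of 16.
' The first loop handles whole 8-byte words, the second the leftover bytes.
Sub OrBytes(ByVal dst As UByte Ptr, ByVal src As UByte Ptr, ByVal n As Integer)
    Dim As Integer words = n \ 8
    Dim As ULongInt Ptr d8 = Cast(ULongInt Ptr, dst)
    Dim As ULongInt Ptr s8 = Cast(ULongInt Ptr, src)

    For i As Integer = 0 To words - 1
        d8[i] Or= s8[i]
    Next
    For i As Integer = words * 8 To n - 1   ' tail: 0 to 7 bytes
        dst[i] Or= src[i]
    Next
End Sub

Dim As Integer numLongs = 7
Dim As ULong a(0 To 6) = {1, 2, 4, 8, 16, 32, 64}
Dim As ULong b(0 To 6) = {&h100, &h100, &h100, &h100, &h100, &h100, &h100}

' Longs to bytes: the same step as "shl ecx, 2" in the asm
Dim As Integer numBytes = numLongs * SizeOf(ULong)

OrBytes(Cast(UByte Ptr, @b(0)), Cast(UByte Ptr, @a(0)), numBytes)

Print numBytes Mod 16; " bytes would be skipped by a loop that only handles 16-byte blocks"
For i As Integer = 0 To numLongs - 1
    Print Hex(b(i)); " ";
Next
Print
```

In the asm version, the same effect can be had by running the 16-byte loop over the whole blocks and finishing the remaining bytes with a short scalar loop.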