A very slow prng

General FreeBASIC programming questions.
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

As expected, a little faster than QPC, but not by much.

I didn't expect PractRand tolerating 16GB. Your 'SumOf...' procedure seems to be doing a good job at injecting entropy into the proceedings.

Time to move on.
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

Yours truly wrote:Time to move on.
Except. :)

One of the problems with your method is getting data from the system either via Timer, QPC, or TSC all of which are, to some extent, expensive.

Another way is to use a cheap and cheerful algorithm instead to generate a 64-bit value. To this end, I used Knuth's 64-bit LCG. That PRNG will fail PractRand in the early KiBs. So, how will it fair using the 'SumOf...' approach?

The 'count/64/10^8' using the TSC was 0.49999128125. Using the LCG I get 0.49999466125.

The graphic test timing has dropped from 18.06s to 13.84s which is nearly 6.5 times faster than the Timer method instead of nearly five times faster using the TSC.

If this works with PractRand, then LCG is given a new lease of life as did with Melissa O'Neill's PCG32 where Knuth's 64-bit LCG is permuted.

The SumOfLCGValues procedure is now quite fast. The Parity64 is blindingly fast.

However, the Achilles heel of the method is going past Parity64(SumOfLCGValues(seed)) 64 times. It doesn't matter how fast the 'SumOf...' procedure is, going past it 64 times will be costly.

We can probably half the time by using a 32-bit LCG. Knuth's 32-bit LCG from Numerical Recipes uses 1664525 as the multiplier and 1013904223 as the increment. We now need a Parity32() and go past that 32 times.

Here is the LCG code: Note that SumOfLCGValues uses a Byref seed.

Code: Select all

Function SumOfLCGValues( Byref seed As UlongInt ) As UlongInt
Static As UlongInt sum
  seed = 6364136223846793005ull * seed + 1442695040888963407ull
  sum += seed
  Return sum
End Function
 
Function Parity64(Byval n As Ulongint) As Ulongint
   n = n Xor (n Shr 32)
   n = n Xor (n Shr 16)
   n = n Xor (n Shr 8)
   n = n Xor (n Shr 4)
   n = n Xor (n Shr 2)
   n = n Xor (n Shr 1)
   Return n And 1
End Function
 
Function NextNumber64 As Ulongint
   Dim As Ulongint y, seed = 987654
   Dim As Ubyte i
   For i = 0 To 63
      If Parity64(SumOfLCGValues( seed )) Then y = Bitset(y, i)
   Next i
   Return y
End Function
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

As a matter of interest, using a 32-bit LCG, for 'count/32/10^8', I get 0.4999995040625.

The graphic test timing has dropped from 13.84s to 2.87s which is just over 31 times faster than the Timer method instead of nearly 6.5 times faster using the 64-bit LCG.

I did not see that coming. :)

You will get a pat on the head from PractRand for that. :D

Of course, we now have a 32-bit PRNG, but most of them are.

Bearing in mind that LCGs get treated with the utmost contempt from PractRand it will be interesting to see how they fair using the 'SumOf...' approach.

Added: If the 32-bit LCG works, then the Mersenne Twister (MT) will be about 3.8 times faster. The hhr PRNG is still slow, but no longer 'very' slow. In fact, with some applications, there may be no apparent difference between MT and hhr.

Here is the 32-bit LCG code:

Code: Select all

Function SumOfLCG32Values( Byref seed as Ulong ) As Ulong
Static As Ulong sum
  seed = 1664525ul * seed + 1013904223ul
  sum += seed
  Return sum
End Function

Function Parity32(Byval n As Ulong) As Ulong
   n = n Xor (n Shr 16)
   n = n Xor (n Shr 8)
   n = n Xor (n Shr 4)
   n = n Xor (n Shr 2)
   n = n Xor (n Shr 1)
   Return n And 1
End Function 

Function NextNumber32 As Ulong
   Dim As Ulong y, seed = 987654
   Dim As Ubyte i
   For i = 0 To 31
      If Parity32(SumOfLCG32Values( seed )) Then y = Bitset(y, i)
   Next i
   Return y
End Function
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

@hhr

My PractRand is running faster than yours; for some reason. I have just got to 1GiB in 138 seconds with one 'unusual' for the 64-bit LCG.

With the 32-bit LCG I cannot get past 4MiB. The 32-bit LCG could be so bad that the 'SumOf...' is not helping. The 32-bit LCG would normally fail PractRand after only a couple of KiBs. I need to double-check that.

Added: I went back to the 64-bit LCG and that is now failing. I need some sleep. I check tomorrow. :)
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

Yours truly wrote:I have just got to 1GiB in 138 seconds with one 'unusual' for the 64-bit LCG.
I'm afraid that's not right. It is easy to get things wrong with PractRand. We need to keep our wits about us. :)

With the 32-bit LCG I still cannot get past 4MiB. With the 64-bit LCG I cannot past 2MiB.

They are both doing better than on their own, so 'SumOf...' is helping but, clearly, not enough.

So using LCGs and 'SumOf...' doesn't work.

It was worth a try, but we cannot win them all. If a PRNG is not good, PractRand will slaughter them.

The distribution uniformity is excellent, but no good on the randomness front; similar to FB's #4 generator.

It is definitely time to move on now.
hhr
Posts: 208
Joined: Nov 29, 2019 10:41

Re: A very slow prng

Post by hhr »

I looked at the 64bit generators in https://de.wikipedia.org/wiki/KISS_(Zuf ... generator) and tried the following program.
With Lcg64 and Xorshift64 I had no success. With Mwc64 it worked. I tested up to 256GB and planned to do a 1TB test.
That would have taken 5 to 6 hours. Then I noticed that the upper 4 bytes of SumOfTSCValues rarely change.
The addition carries don't go far enough to change the upper 4 bytes of Mwc64. So now I want to try a 32 bit generator.

Code: Select all

Function SumOfTSCValues As Ulongint
   Dim As Ulongint TimeNow
   Static As Ulongint sum
   Asm
      rdtsc
      Mov Dword Ptr [TimeNow], eax
      Mov Dword Ptr [TimeNow+4], edx
   End Asm
   sum += TimeNow
   Return sum
End Function

Function Mwc64 As Ulongint  ' Multiply-with-carry
   Static As Ulongint z = 1234567890987654321ULL
   Static As Ulongint c = 123456123456123456ULL
   Dim As Ulongint t
   t = (z Shl 58) + c
   c = z Shr 6
   z += t
   c += Iif(z < t, 1, 0)
   Return z
End Function

Function NextNumber64 As Ulongint
   Return Mwc64 + SumOfTSCValues
End Function

'===========================================

Function Test As Ulongint
   Dim As Ulongint a, b
   a = SumOfTSCValues
   b = Mwc64
   Print "TSC:"; Tab(10); Bin(a, 64)
   Print "Mwc:"; Tab(10); Bin(b, 64)
   Print "Mwc+TSC:"; Tab(10); Bin(b + a, 64)
   Print
   Return 0
End Function

Do
   Test
Loop Until Getkey = 27
Addendum, maybe this can be a solution:

Code: Select all

Function NextNumber64 As Ulongint
   Return SumOfTSCValues + Mwc64 + (SumOfTSCValues Shl 32)
End Function

'===========================================

Sub Test
   Dim As Ulongint a, b, c
   a = SumOfTSCValues
   b = Mwc64
   c = SumOfTSCValues Shl 32
   Print "a:"; Tab(10); Bin(a, 64)
   Print "b:"; Tab(10); Bin(b, 64)
   Print "c:"; Tab(10); Bin(c, 64)
   Print "a+b+c:"; Tab(10); Bin(a+b+c, 64)
   Print
End Sub

Do
   Test
Loop Until Getkey = 27
Translated with www.DeepL.com/Translator (free version)
Last edited by hhr on May 07, 2023 16:08, edited 1 time in total.
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

hhr wrote:Then I noticed that the upper 4 bytes of SumOfTSCValues rarely change.
That is why I thought that the TSC approach would fail PractRand.

We can always swap eax with edx and have the lower 4 bytes rarely change. I doubt that will help.

I hadn't realized how big a buffer you were using, so I changed my 2MiB to your 128MiB. The 32-bit LCG still went down at 4MiB.

With the 64-bit LCG I upped the buffer to 256MiB and it failed at 256MiB. It looks like it is failing at the buffer size. So, I tried a buffer of 512MiB and it failed at 512MiB.

So PractRand waited to fill the buffer initially, but wouldn't wait for the buffer to be filled again.

We now have an issue with PractRand.
So now I want to try a 32 bit generator.
I think that is where the answer is.

However, we should avoid using a 32-bit generator that is too clever; otherwise we may as well use that as a generator and be done with it.

A cheap and nasty 32-bit LCG plus the 'SumOf...' should do the trick. Perhaps I have miscoded the 32-bit LCG – I am prone to the odd silly mistake. :)

Anyway, you seem to be on top of this exercise, so I will leave you to it.

Keep posting about your 'trials and tribulations' – a lot of folk are viewing this thread. :wink:
hhr
Posts: 208
Joined: Nov 29, 2019 10:41

Re: A very slow prng

Post by hhr »

@deltarho
I tested your 64bit Lcg program from 1KB up to 1GB and had only one 'unusual'. Then I aborted the test.
I used my own routine because it is easy when switching from 32 bit to 64 bit.

Code: Select all

Chdir "Path to PractRand's RNG_test.exe" '' PractRand Version 0.94
Dim As String cs = "RNG_test stdin64"
cs &= " -tlmin 1KB"
'cs &= " -tlmaxonly"
cs &= " -multithreaded"
Open Pipe cs For Binary Access Write As #1
Do
   Put #1,,NextNumber64
Loop
Then I tested your 32bit Lcg progam. It failed at 4MB. With an additional sum it failed at 1GB.

Code: Select all

Function SumOfLCG32Values( Byref seed as Ulong ) As Ulong
Static As Ulong sum, sum2
  seed = 1664525ul * seed + 1013904223ul
  sum += seed
  sum2 += sum
  Return sum2
End Function
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

@hhr

I cannot figure it but when I use my 'piping' program with the 64-bit algorithm, PractRand fails at the size of the buffer chosen.

Anyway, it is good to see the 64-bit algorithm working. The problem I have with it is going past 'SumOf...' 64 times.

If you look at Melissa O'Neill's pcg32 you will see her using Knuth's 64-bit LCG's multiplier. She does not use his increment, but a variable increment to give different sequences. The Knuth output is then permuted using some clever bit shifting, the resulting PRNG passing PractRand to 16TB. The resulting PRNG is also fast. The Knuth algorithm is only passed once per random number generated.

Your 'SumOf...' does not do much 'massaging' on the LCG output. It is enough to elevate the 32-bit LCG from a KiB failure to a MiB failure, but more is needed to get us into the GiB domain; which you did with the double sum method. We are still going past 'SumOf...' 32 times per random number generated, so any extra 'massaging' is going to slow us down further.

I looked at the 32-bit LCG used in Turbo Pascal and that failed PractRand at 16MiB.
hhr
Posts: 208
Joined: Nov 29, 2019 10:41

Re: A very slow prng

Post by hhr »

@deltarho
I tested the programs again with your routine and got the same results as with my routine.

I commented on the things where I easily make mistakes:
For the 32 bit program:

Code: Select all

'' 32:
'' prng32 | rng_test stdin32 -tlmin 1KB -multithreaded
Dim Shared S As String * 2097152 ''= 524288 * 4
Dim As Ulong Ptr SPtr, BasePtr ''Ulong Ptr
Dim As Long j
 
SPtr = Cptr(Ulong Ptr, StrPtr( S )) ''Ulong Ptr
BasePtr = SPtr
 
Do
  For j = 1 to 524288 ''= 2097152 / 4
    *SPtr = NextNumber32 '' 32
    SPtr += 1
  Next
  Print S;
  SPtr = BasePtr
Loop
For the 64 bit program:

Code: Select all

'' 64:
'' prng64 | rng_test stdin64 -tlmin 1KB -multithreaded
Dim Shared S As String * 2097152 ''= 262144 * 8
Dim As Ulongint Ptr SPtr, BasePtr ''Ulongint Ptr
Dim As Long j
 
SPtr = Cptr(Ulongint Ptr, StrPtr( S )) ''Ulongint Ptr
BasePtr = SPtr
 
Do
  For j = 1 to 262144 ''= 2097152 / 8
    *SPtr = NextNumber64 '' 64
    SPtr += 1
  Next
  Print S;
  SPtr = BasePtr
Loop
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

@hhr

Thanks for that.

There were some slight differences to my 64-bit test. I need to look at them to see why I get a failure at the buffer size. I've made so many adjustments over the years. My 'master' version is in my PowerBASIC development folder.

Anyway, I have just gone sailing past 8GiB. I will let it rattle on for a while. It is a pity that the 32-bit LCG is poor. The issues with LCGs are less pronounced with 64-bit versions. What we gain on speed with 32-bit, we lose by needing more 'massaging'. Finding the 'right' massaging with 32-bit is the trick. Some bit shifting might do it.
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

I pulled out of the 64-bit run at 32GiB taking 70.2 minutes. That suggests over 37 hours for 1TiB which is a heck of a lot longer than my latest PRNG implementations. :)
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

I wrote my buffered piping program at PowerBASIC. It was in 32-bit mode because PowerBASIC is only a 32-bit compiler.

Most of my code at FreeBASIC is in 32-bit mode, and I compiled the piping program in 32-bit. I don't understand why, but it never occurred to me to use 64-bit even though I was using the 64-bit PractRand.

The 70.2 minutes for a 32GiB PractRand run dropped to 28.9 minutes. If anyone told me that I could halve a PractRand time by using a 64-bit piping program, I would not have believed them.

Having got to 32Gib much quicker, I let it run a bit longer and got to 128GiB in 115 minutes; just less than two hours.

There were only a few anomalies, and the last one was at 4GiB. It seems to me that a 1TiB pass is on the cards, but I will not run PractRand for 15 hours to find out.

With a 64-bit output, 128GiB will give over 16 million [0,1) random numbers; more than enough for most FB users. It is worth remembering that the Mersenne Twister is only good for 128GiB because it fails PractRand at 256GiB.

The only issue with the permuted 64-bit LCG and hhr's approach is speed – Sebastiano Vigna's 64-bit output xoroshiro128** leaves it standing. However, even though it is slow, it may not impact adversely on our applications, so it is worth looking at given that its distribution uniformity is top drawer.
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

I don't know why I am still playing around with this. I keep having ideas, I suppose.

When we permute a LCG the increment is not relevant; in our case 1442695040888963407ull. I mentioned earlier that Melissa O'Neill's PCG32 has a variable increment. I cannot remember if she explained why, but the increment should be odd.

When we run a new instance of hhr's PRNG it is possible that we may step on the toes of a previous instance's NextNumber sequence, albeit unlikely. However, if we use a random increment, giving a different sequence, in addition to a random seed, the odds against such an incursion increases from 2^64 to 2^127 (2^64 seeds x 2^63 increments) so that isn't going to happen. Probably we will get a NextNumber sequence never seen before and never to be seen again.

I did not expect any problems looking at the average of 10^8 [0,1) using a random seed and a random increment. I didn't get one and got 0.499968357224125.

I didn't expect a issue with PractRand either since we are looking at the parity of the 'SumOf...'.

A test was done anywaay. The best time, for me, to run PractRand is just before going out somewhere or just before going to bed. I hate being at my desk with Practrand running – it is like standing over a kettle waiting for it to boil, even when minimized. We all have our little foibles. :) I was going out so PractRand time. On this occasion, I used PractRand 0.95.

When I got back, I saw 256GiB had been passed not long before. There were four 'unusual' and the time taken was 13826 seconds; just under four hours. Of course, I could have been lucky with seed and increment, but I don't think luck has anything to do with it.

Here is the code for a random seed and random increment.

Code: Select all

Randomize , 5 ' Cryotographic
 
Function Get64Bit() As UlongInt
  Return (Cast( UlongInt, Rnd*(2^32) ) Shl 32) Or Cast( UlongInt, Rnd*(2^32) )
End Function
 
Dim Shared As UlongInt seed, increment
seed = Get64Bit
increment = Get64Bit Or 1 ' Make increment odd
 
Function SumOfLCGValues( Byref seed As UlongInt ) As UlongInt
Static As UlongInt sum
  seed = 6364136223846793005ull * seed + increment
  sum += seed
  Return sum
End Function
 
Function Parity64( Byval n As Ulongint ) As Ulongint
   n = n Xor (n Shr 32)
   n = n Xor (n Shr 16)
   n = n Xor (n Shr 8)
   n = n Xor (n Shr 4)
   n = n Xor (n Shr 2)
   n = n Xor (n Shr 1)
   Return n And 1
End Function
 
Function NextNumber64 As Ulongint
   Dim As Ulongint y
   Dim As Ubyte i
   For i = 0 To 63
      If Parity64( SumOfLCGValues( seed ) ) Then y = Bitset(y, i)
   Next i
   Return y
End Function
deltarho[1859]
Posts: 4305
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A very slow prng

Post by deltarho[1859] »

To put things into perspective, the graphics testing program was used with Mersenne Twister (MT) as a baseline. This tends to compress the relationship between MHz comparisons, especially when a random number only plays a small part timingwise of a statement.

Code: Select all

PRNG           Rank  Output  Period
 
MsWsII(32)     73.90   32      64
PCG32II        75.27   32      64
MsWsII(64)     87.50   64      64
xoroshiro128** 87.82   64     128
xoshiro256**   99.08   64     256
MT            100      32    19937
CryptoRndIV   114.52   32     N/A
LCG32         478      32      32
LCG64        2298      64      64
The fastest 32-bit and 64-bit outputs were with MsWsII which has two engines and the greatest functionality; with a Help file. PCG32II was my long-term favourite. The xoroshiro128** and xoshiro256** would satisfy the scientific community, who tend to require periods exceeding 2^64.

CryptoRndIV ranks a little above MT, but has the best randomness using Intel's RdRand. It is not designed for cryptographic work. If randomness is an important issue and a 32-bit output is OK then this is the one to go for. However, there is no facility for repeating sequences as with MsWsII.

LCG32 is a permuted LCG used above but fails PractRand rapidly and has a comparatively small period – not recommended.

LCG64 is a permuted LCG used above, passing PractRand to 256GiB, aborted at that point, with excellent distribution uniformity. Its usefulness is debatable considering its ranking – 2298 is pitifully slow, being only 3.8% of the MsWsII(64) ranking. I find it hard to recommend, and should be tested in code to see if it impacts upon an application.

I could have included other PRNG implementations, but none of them shined enough to include in my toolbox.
Post Reply