Single and Range wrappers for minimal PCG PRNG

General FreeBASIC programming questions.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

The following code simply provides single precision and range wrappers for the minimal PCG32 code published in the thread Has anyone ported the PCG PRNG to FreeBASIC [1].

Firstly, credit to cbruce for bringing this PRNG to our attention and, secondly, credit to counting_pine for porting the minimal C implementation to FreeBASIC.

The wrapper overhead was expensive so I have used the age old method of code repetition by using the same pcg code in the head of each of the wrappers.

If we employ no seeding the first two random 32 bit values will be zero. All random number generators should be 'warmed up'. In An experimental exploration of Marsaglia’s xorshift generators,scrambled by Sebastiano Vigna he produces a graph showing the number of samples which are required to warm up some PRNGs, including his own. It is quite clear that the the larger the period of a generator the greater the number of samples should be executed before using a generator. Not surprisingly the Mersenne Twister needs a lot of warming up. Since PCG shares some similarities with Vigna's PRNGs and we have a period of 2^64 with this minimal PCG I have opted for a warm up of 200 samples. This should prove more than adequate and only takes one micro-second to complete, so you will not have time to put the kettle on. <smile>

The warm up is done in the function RandomizePCG32( Seed )

Seed is optional. If no seed is given a cryptographic random Ulongint is used from FB's PRNG option 5. To repeat a sequence a seed is required. Quite frankly, anything will do: 123456, say, is fine.

For single precision we simply execute PCG32S. For range, 0 to 100 for example, we simply execute PCG32R(0,100).

The code at the end of this post has the code to add to your apps and an example usage.

I now use these compiler option switches at all times: -gen gcc -Wc -O1

With no compiler optimization the throughput collapses.

This is a typical output using RandomizePCG32 - that is an unknown seed.

Code: Select all

 0.3077029
 0.5957927
 0.65415
 0.7683138
 0.6042935
 0.1181773
 0.07335913
 0.6330329
 0.7821538
 0.4110711
 
 116
 204
 86
 7
 211
 37
 128
 147
 240
 157
 
Averge of 10^8 singles:  0.5000328068581712
 
Average of 10^8 range (0,100):  49.99446299
 
Estimate of 'raw' throughput (Millions per second)
 
Singles: 219.113
Range: 180.463
That 219.113 is twice as fast as FB's PRNG option 2.

In 64 bit mode I get 222.410 and 206.462 for singles and range respectively.

A period of 2^64 may be regarded as small but just to put that into perspective it would take over 3,000 years on my machine before a sequence repetition and that is just executing random numbers and doing nothing with them. Also, if we used 2^32 random numbers then we would be using 2^32/2^64 = 1/2^32 of those available, that is 23*10^-9% so the chance of 'landing' on numbers previously used is very small. Of course,the chance decreases as the number used decreases.

srvaldez has published the full implementation of PCG in [1] but I have not had a chance to look at it yet.

Some folk may find the following code adequate for their purposes.

Code: Select all

'' *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
'' Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
'' FreeBASIC translation by Matthew Fearnley 2017-06-04
 
' Include the code between ' ******/' ****** in your apps
 
' **********************************************
 
Type pcg32_random_t
  As Ulongint state, inc
End Type
 
Dim Shared pcg32 As pcg32_random_t
 
Function pcg32_random_r(Byval rng As pcg32_random_t Ptr) As Ulong
  Dim As Ulongint oldstate = rng->state
  '' Advance internal state
  rng->state = oldstate * 6364136223846793005ULL + (rng->inc Or 1)
  '' Calculate output function (XSH RR), uses old state for max ILP
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  Return (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
End Function
 
Sub RandomizePCG32( Seed As Ulongint = 0 )
  Dim i As Integer
  If seed = 0 Then
    Randomize , 5
    pcg32.state = Rnd*(2^64-1)
  Else
    pcg32.state = seed
  End If
  For i As Integer = 1 To 200
    pcg32_random_r(@pcg32)
  Next i
End Sub
 
Function PCG32S() As Single
  Dim tempvar As Ulong
  Dim rng As pcg32_random_t Ptr = @pcg32
  Dim As Ulongint oldstate = rng->state
  '' Advance internal state
  rng->state = oldstate * 6364136223846793005ULL + (rng->inc Or 1)
  '' Calculate output function (XSH RR), uses old state for max ILP
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  Tempvar =  (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
  Asm
    mov eax, dword Ptr [TempVar]
    movd xmm0, eax
    psrlq xmm0, 9
    mov eax, 1
    cvtsi2ss xmm1, eax
    por xmm0, xmm1
    subss xmm0, xmm1
    movd [Function], xmm0
  End Asm
End Function
 
Function PCG32R( Byval One As Long, Byval Two As Long ) As Long
  Dim tempvar As Ulong
  Dim rng As pcg32_random_t Ptr = @pcg32
  Dim As Ulongint oldstate = rng->state
  '' Advance internal state
  rng->state = oldstate * 6364136223846793005ULL + (rng->inc Or 1)
  '' Calculate output function (XSH RR), uses old state for max ILP
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  Tempvar =  (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
  Asm
    mov edx, dword Ptr [TempVar]
    mov ecx, [One]
    mov eax, [Two]
    cmp ecx, eax
    jl Now1LT2
    xchg eax, ecx
  Now1LT2:
    Sub eax, ecx
    inc eax
    jz doTheRnd
    mul edx
    Add edx, ecx
  doTheRnd:
    mov [Function], edx
  End Asm
End Function
 
' *****************************************************
 
RandomizePCG32
 
Dim As Ulong I, N
 
' Knock out 10 singles
For i = 1 To 10
  Print PCG32S
Next
Print
' Knock out 10 range
For i = 1 To 10
  Print PCG32R(0,255)
Next
Print
 
Dim As Double tot
 
' Knock out average of 10^8 singles
For i = 1 To 10^8
  tot += PCG32S
Next
tot = tot/10^8
Print "Averge of 10^8 singles: ";tot
 
Print
tot = 0
' Knock out average of 10^8 range (0,100)
For i = 1 To 10^8
  tot += PCG32R(0,100)
Next
tot = tot/10^8
Print "Average of 10^8 range (0,100): ";tot
Print
Dim As Double t
Print "Estimate of 'raw' throughput (Millions per second)"
Print
t = Timer
For i = 1 To 10^8
  PCG32S
Next
t = Timer - t
Print "Singles: ";Using "###.###";10^8/(t*1000000)
t = Timer
For i = 1 To 10^8
  PCG32R(0,100)
Next
t = Timer - t
Print "Range: ";Using "###.###";10^8/(t*1000000)
 
Sleep
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

OK, as is we are seeding via pcg32.state.

pcg32.inc is zero and never changes. I think inc is the entry point into the sequence and state is the state vector at inc.

This post has been edited in view of the next post.

So, if we use a random state and a random inc then we are effectively getting 2^63 possible sequences to enter and not just one sequence to enter.

Replace RandomizePCG32 with the following:

Code: Select all

Sub RandomizePCG32( Seed As Ulongint = 0 )
  Dim i As Integer
  If seed = 0 Then
    Randomize , 5
    pcg32.state = Rnd*(2^64)
    pcg32.inc = Rnd*(2^63)
  Else
    pcg32.state = Seed
  End If
  For i As Integer = 1 To 200
    pcg32_random_r(@pcg32)
  Next i
End Sub
The only difference between this version and the original version is the statement: pcg32.inc = Rnd*(2^63)

The point I made in the first post about taking over 3,000 years on my machine for a sequence repetition is still true but the chance of 'landing' on numbers previously used is now almost impossible as we are no longer using one sequence but one of 2^63 possible sequences.

If we supply a Seed, pcg32.inc remains at zero. To repeat a sequence, for whatever reason, all that we are interested in is repeating a sequence.<smile>

There is no change in usage with this version: RandomizePCG32 will give random seeding and RandomizePCG32 Seed, where Seed is any non-zero value, will give a repetitive sequence.
Last edited by deltarho[1859] on Jun 07, 2017 5:24, edited 1 time in total.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

I was guessing about what inc does but I have found confirmation at pcg-random.org.

I have edited the previous post to this into account.

Yes, I am aware that 0 <= Rnd < 1 so the above is not fully exploitative, but they are close enough.<smile>
The minimal C library provides just one member of the PCG family, a RNG with 32-bits of output.

Internally, it uses two 64-bit integers for its internal state, consisting of:

the current state — the RNG iterates through all 2^64 possible internal states.
the RNG-sequence constant — a value that defines which of 2^63 possible random sequences the current state is iterating through; it holds the same value over the lifetime of the RNG.

Different values for sequence constant cause the generator to produce a different (and unique) sequence of random numbers (sometimes called the stream). In other words, it's as if you have 2^63 different RNGs available, and this constant lets you choose which one you're using.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

Yes, I am aware that 0 <= Rnd < 1 so the above is not fully exploitative, but they are close enough.<smile>
I wish I wasn't so pedantic - I cannot accept that.

I have changed the random state to: pcg32.state = Rnd*(2^64)

We will actually get just shy of 2^64 at the top end of Rnd but since state is a Ulongint it will get clipped to 2^64 - 1.
Similarly, with inc, it will be clipped to 2^63 - 1. I am assuming that "2^63 possible random sequences" means 0 to 2^63 - 1 since inc is zero by default.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

After further reading it seems that we are looking at the fastest single precision generator. Other generators give either a greater output in bits for the same period or a greater output in bits for a greater period. These would be OK if we needed double precision but I don't think that there is much call for double precision. I cannot find a period of 2^128 which outputs 32 bits. That would be handy for a jump forward but I don't think there is much call for that either.

I can live with a period of 2^64 outputting 32 bits especially since we have access to 2^63 possible sequences.

All the FB PRNGs are limited to a single sequence.

Given that this minimal PCG is faster than any of the FB PRNGs, has an excellent statistical quality and access to 2^63 possible sequences then it is a viable alternative to the FB PRNGs when random seeding. For fixed seeding it is also a viable alternative.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

I have seen on more than one occasion a comment against using assembler for portability reasons. I can understand that but I have many applications freeware, paid for and so on which are unashamedly Windows' applications and that is the end of the matter. Most of the stuff that I write is for yours truly so portability just does not come into it.

Having said that here are asm free versions of Single and Range:

Code: Select all

Function FBPCG32S() As Single
  Dim TempVar As Ulong
  Dim rng As pcg32_random_t Ptr = @pcg32
  Dim As Ulongint oldstate = rng->state
  '' Advance internal state
  rng->state = oldstate * 6364136223846793005ULL + (rng->inc Or 1)
  '' Calculate output function (XSH RR), uses old state for max ILP
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  Tempvar = (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
  Return TempVar/4294967296.0
End Function
 
Function FBPCG32R( Byval One As Long, Byval Two As Long ) As Long
  Dim TempVar As Ulongint
  Dim rng As pcg32_random_t Ptr = @pcg32
  Dim As Ulongint oldstate = rng->state
  '' Advance internal state
  rng->state = oldstate * 6364136223846793005ULL + (rng->inc Or 1)
  '' Calculate output function (XSH RR), uses old state for max ILP
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  Tempvar =  (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
  Return TempVar * (Two - One)/4294967296.0 + One
End Function
The speeds are now 88.603/58.192 compared with 219.33/180.463.

You can see why I am a fan of asm.<smile>

However, these reduced speeds are by no means rubbish. I have just done a run using Mersenne Twister and got 87.600 so asm free PCG32 and MT are 'neck and neck' speed-wise. Of the two I would go for PCG32. PCG32 is stable up to at least one terabyte with PractRand whereas MT fails at 256GB; indicating an inherent weakness in the algorithm but it takes 256GB of samples to manifest itself. With random seeding we use a 64 bit seed and a random selection from 2^63 possible sequences for PCG32. With MT we can use a random 32 bit seed but just have the one sequence. With my RndMT we can have a random state vector, that is 624 random 32 bit seeds, but will still only have the one sequence. RndMT uses that dreadful asm stuff and wipes the floor with FB's MT. Of course, I could write an asm free RndMT but then RndMT uses those dreadful Windows APIs. Oh dear. <laugh>
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

In a recent thread there is a discussion of thread safety with regard random number generators. On my machine none of the FreeBASIC random number options are thread safe. PCG32 is not thread safe either.

My solution was code duplication. The duplicated code is edited in such a way that we have two sets of code. One thread could use the original code and another thread could use the edited duplicate. Thread safety is then guaranteed - the two threads are using different code. This method is not elegant - it is best described as brutish.

St_W came up with a solution which most certainly can be described as elegant. Essentially we create an object template X, say. We then copy that a couple of times using A and B, say, by A = X and B = X. We then have one thread using object A and another thread using object B. So, instead of code duplication we have object duplication.

This is the object template:

Code: Select all

Type pcg32
  Public:
  Declare Sub MyRandomize( seed As Ulongint = 0, seq As Ulongint = 0 )
  Declare Function rand() As Ulong
  Declare Function rands() As Single
  Declare Function range( Byval One As Long, Byval Two As Long ) As Long
  Private:
  state As Ulongint
  seq As Ulongint
End Type
With code duplication each of the Subs/Functions used are duplicated for each thread which requires a random number generator. With the object template the declarations do not use any space so its size is just 16 bytes. Each additional copy is then only using an additional 16 bytes. If that is not elegant then I would like to know what is.<smile>

I only mentioned enough Sub/Functions for the discussion mentioned and have taken St_W's lead to include the range function.

The original PCG32 code used 'Sub RandomizePCG32' and St_W uses 'Sub init'. Init, obviously short for initialization, is fine but may be understood to be a requirement, which it is not. I have changed it to MyRandomize. I am not being awkward. I think that folk will relate to MyRandomize by regarding it in the same sense as FreeBASIC's Randomize and they should because it is written in the same sense.

MyRandomize is an extended RandomizePCG32.

If MyRandomize is not executed then the generator will use zero for both the seed value and the seq value. Seed and state are one in the same. When we supply a seed were are actually supplying an initial state. The state is advanced on each random number request. Seq, on the the hand, does not change from its initial value - it persists throughout an application session; unless, of course, we re-sequence via another execution of MyRandomize with a different value of seq. A different value of seq simply generates a different sequence. With FreeBASIC's generators we are limited to just one sequence and different seeds simply give a different entry point to that sequence.

Seed and Sequence.

We have four possible permutations

Code: Select all

   Seed      Sequence
None given  None given [1]
None given    Given    [2]
  Given     None given [3]
  Given       Given    [4]
So, we would use something like: .MyRandomize, .MyRandonize(,99), .MyRandomize(5,) or .MyRandomize(5,99)

[1] has an obvious use - a completely unknown initialisation. Here we have both a random seed and a random sequence giving then 128 bits of unknown input. The random numbers are got from FreeBASIC's generator algorithm 5 which should be OK for both Windows and Linux.

[4] also has an obvious use. Here we get a sequence repeat analogous to 'Randomize seed' where seed is used more than once.

I cannot imagine why anyone would want [2] or [3] but I don't want my code limited by my imagination.<smile>

[2] gives us random entry points to a given sequence. I am going to let my imagination play a bit: Random times in our universe.

[3] gives us a fixed entry point to a random sequence. I find this one a little weird unless I let my imagination play again and think of the same time in multiple universes.

OK, perhaps I should rein in my imagination but I would interested in why anyone would use [2] or [3].

After reading a fair amount on the internet about the optimization levels available with gcc the consensus of opinion recommends -O2. One said of -O3 "Your binary will almost certainly be larger. Your program may be faster.". I had not noticed any speed increases with any of my code so far. In particular, the original PCG32 did not seem to benefit from using -O3.

St_W's code took a performance hit compared with my code duplication method - about one third down. However, -O3 lifted it to the same level as my duplication code.

We now find ourselves with a thread safe generator which is as fast as the non thread safe generator. I have called the new code PCG32II. PCG32II is then not to be used only if we have multi-threaded random number generation but may be used as a replacement for PCG32.

I had already decided that PCG32 would be my default seeding generator - I now have another reason for having PCG32II as my default seeding generator. For non-seeding I would use CryptoRndII but not in a multi-threaded environment where more than one thread required random numbers because CryptoRndII is not thread safe, yet. <smile>

Here is PCG32II.bas ( 13 July 2017 - latest version )

Code: Select all

' Generator PCG32II
' *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
' Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
' pcg32.rand is an adaptation of a FreeBASIC translation by Matthew Fearnley
' (aka FreeBASIC's counting_pine) 2017-06-04
' Object duplication method by FreeBASIC's St_W

Type pcg32
  Public:
  Declare Sub MyRandomize( seed As Ulongint = 0, seq As Ulongint = 0 )
  Declare Function rand() As Ulong
	Declare Function randse() As Double
  Declare Function range( Byval One As Long, Byval Two As Long ) As Long
	Declare Function GetSeed() As Ulongint
	Declare Function GetSeq() As Ulongint
  Private:
  state As Ulongint
  seq   As Ulongint
	seed  As Ulongint
End Type

Function Get64Bit() As UlongInt
	Return (Cast( Ulongint, Rnd*(2^32) ) Shl 32) Or Cast( Ulongint, Rnd*(2^32) )
End Function

Function pcg32.rand() As Ulong
  Dim As Ulongint oldstate = this.state
  ' Advance internal state
  this.state = oldstate * 6364136223846793005ULL + (this.seq Or 1)
  ' rotate32((state ^ (state >> 18)) >> 27, state >> 59)
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  Return (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
End Function

Sub pcg32.MyRandomize( seed As Ulongint = 0, seq As Ulongint = 0 )
  Dim i As Integer
  
  Randomize , 5
  If seed = 0 Then
    If seq = 0 Then
      this.state = Get64Bit
      this.seq = Get64Bit\2
    Else
      this.state = Get64Bit
      this.seq = seq
    End If
  Else
    If seq = 0 Then
      this.state = seed
      this.seq = Get64Bit\2
    Else
      this.state = seed
      this.seq = seq
    End If
  End If
  This.Seed = This.state ' Save initial state
  ' Warm up generator - essential
  ' We probably don't need 200 samples but it only takes one micro-second on my machine
  For i As Integer = 1 To 200
    this.rand()
  Next i
End Sub

Function pcg32.randse() As Double
  Dim TempVar As Ulong
  Dim As Ulongint oldstate = this.state
  ' Advance internal state
  this.state = oldstate * 6364136223846793005ULL + (this.seq Or 1)
  ' rotate32((state ^ (state >> 18)) >> 27, state >> 59)
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  TempVar =  (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
 	Return TempVar/4294967296.0
End Function

Function pcg32.range( Byval One As Long, Byval Two As Long ) As Long
  Dim TempVar As Ulong
  Dim As Ulongint oldstate = this.state
  ' Advance internal state
  this.state = oldstate * 6364136223846793005ULL + (this.seq Or 1)
  ' rotate32((state ^ (state >> 18)) >> 27, state >> 59)
  Dim As Ulong xorshifted = ((oldstate Shr 18u) xor oldstate) Shr 27u
  Dim As Ulong rot = oldstate Shr 59u
  Tempvar =  (xorshifted Shr rot) Or (xorshifted Shl ((-rot) And 31))
  
	' ASM by John Gleason @ PowerBASIC forums 
	Asm
		mov edx, dword Ptr [TempVar]
    mov ecx, [One]
    mov eax, [Two]
    cmp ecx, eax
    jl 0f
    xchg eax, ecx
    0:
    Sub eax, ecx
    inc eax
    jz 1f
    mul edx
    Add edx, ecx
    1:
    mov [Function], edx
  End Asm
End Function

Function pcg32.GetSeed() As Ulongint
	Return This.Seed
End Function

Function pcg32.GetSeq() As Ulongint
	Return This.Seq
End Function
Here is an example using two threads with their own random number generator.

As is the code uses just a primary thread when we use '#define Twin False'. If we then use '#define Twin True' then both the primary thread and a secondary thread of execution use their own random number generators. Because PCG32II is thread safe there is no loss of throughput. The second run may have marginally different throughput but then the generators are running in different environments.

Code: Select all

#Include "PCG32II.bas"
 
#define Twin False
 
Dim Shared pcg32A As pcg32
Dim Shared pcg32B As pcg32
 
pcg32A.MyRandomize
pcg32B.MyRandomize
 
Dim As Ulong i
Dim As Double t1, y
Dim Shared As Double t2
Dim shared As Ulong counter
Dim As Any Ptr hThread
 
Sub SecondThread( dummy as any ptr)
  Dim As Ulong i
  Dim As Double y
 
  t2 = Timer
  For i = 1 To counter
    y = pcg32A.rands()
  Next
  t2 = Timer - t2
 
End Sub
 
counter = 10^8 ' 100 million
 
#If Twin
  hThread = Threadcreate( @SecondThread, 0 )
#endif
 
t1 = Timer
For i = 1 To counter
  y = pcg32B.rands()
Next
t1 = Timer - t1
 
#if Twin
  Threadwait( hThread )
#endif
 
#if Twin
  Print Int(100/t1);" miil/sec", Int(100/t2);" mill/sec"
#else
  Print Int(100/t1);" mill/sec"
#endif
 
Sleep
Last edited by deltarho[1859] on Jul 12, 2017 23:57, edited 4 times in total.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

Here is my latest comparison chart

Code: Select all

Random number generation
Millions per second
 
                 32 bit   64 bit
Default   0      87.502   200.707
CRT       1      69.835    74.571
Fast      2     107.518   347.532
Mersenne  3      87.598   199.167
Qbasic    4      99.215   318.980
CryptoRndII     512.761   518.081
PCG32II         206.456   348.571

I have noticed that the more 32 bit asm in my code the less of an increase when compiling in 64 bit mode. I suppose that makes sense - what does a 64 bt compiler do with 32 bit asm - probably nothing. The CryptoRndII single precision function is almost 100% asm. PCG32II's single precision function is a mix of FreeBASIC and asm. This makes me think that FreeBASIC's option 1 is heavy on asm resulting in a poor response to being compiled in 64 bit mode.

PCG32II is doing very well in 32 bit mode and is 'neck and neck' with option 2 in 64 bit mode. However, option 2 is not thread safe.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

OK, so we have some FreeBASIC code bereft of any asm. We can then compile that into either 32 bit native code or 64 bit native code. So, far so good. Now suppose we have a function which is clearly a bottleneck. It may be worthwhile to rewrite that in asm. So, with a 32 bit compiler we have the FreeBASIC part being compiled into 32 bit native code and the asm only needs to be converted from assembler to native code.

Now, suppose we now compile with a 64 bit compiler. The FreeBASIC part will be compiled into 64 bit native code but how is the 32 bit asm treated?

As mentioned above "PCG32II's single precision function is a mix of FreeBASIC and asm.". The asm can be replaced by a single line of FreeBASIC. Now this single line I would normally regard as being expensive but maybe not in 64 bit and goodness knows what an optimised gcc is up to.

Well, there is only one way to find out.

That 348.571 figure can now be replaced with 949. That was using FBEdit. I sneaked up on poseidonFB a few minutes later before it had chance to bite me and it came back with 974.

I have decided to write a check list of a testing protocol and stick it on my desk. I have been having so much trouble with compilers lately turning gcc and optimizing off and so on that I may have ended up with results not played on the same field.

A lot of checking needs to be done because I am having trouble believing a near one billion single precision output. I will go to the pcg-random.org site to see what figures they are getting.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

OK, pcg-random are getting about 38Gb/s with -O2 optimization.

That converts to 1.28 G Ulongs/s.

My 949 million singles translates to 0.88 G singles/s.

Taking into account the conversion from Ulong to single then it looks like our results are not giving a contradiction.

I have a lot crunching to do and I need to look at the range aspect and not just the single aspect.

Applying the same logic to CryptoRndII, ie replacing the asm with FB, I am getting 898 million/s in 64 bit mode. So, in 64 bit mode CryptoRndII has lost its title. It is still ahead of PCG32II in 32 bit mode with them coming in at 617 million/s and 489 million/s respectively. This is quite remarkable considering FB's option 2 coming in at 108 million/s in 32 bit mode.

I have had a word with an asm guru, in Sydney Australia, who has his own asm website and forum and he is of the opinion that getting a 64 bit compiler to compile 32 bit asm is not a particularly good idea. Two paths should be pursued: Compile BASIC to 64 bit native code or write 64 bit code. So, if we have to use inline assembler then for 64 bit compilation it is better if the inline is 64 bit. 'Kinda' makes sense.
srvaldez
Posts: 3373
Joined: Sep 25, 2005 21:54

Re: Single and Range wrappers for minimal PCG PRNG

Post by srvaldez »

deltarho[1859] wrote:Now, suppose we now compile with a 64 bit compiler. The FreeBASIC part will be compiled into 64 bit native code but how is the 32 bit asm treated?
you can use conditional compilation

Code: Select all

function add (byval operand1 as Integer, byval operand2 as Integer) as Integer
	#ifdef __FB_WIN32__ 'if compiling on Windows
		#ifdef __FB_64BIT__ 'if 64-bit
			asm
				mov rax, qword ptr [operand1]
				add rax, qword ptr [operand2]
				mov qword ptr [function], rax
			end asm
		#else 'compiling for Windows 32-bit
			asm
				mov eax, dword ptr [operand1]
				add eax, dword ptr [operand2]
				mov dword ptr [function], eax
			end asm
		#endif
	#endif
end function
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

Hi srvaldez

I use conditional compilation in CryptoRndII - one condition is compiler bit mode and the other is algorithm based.

What I meant by "how is the 32 bit asm treated?" is analogous to when all Windows where 32 bit. A 'For' loop executed faster when the counter was 32 bit as opposed to 16 bit or 64 bit ie the same size as the CPU registers.
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

I had to take a break from this because new test bench code was not giving me the massive increase in speeds I reported above. I just could not figure out what was going wrong.

It is amazing how useful it is to walk away and do sometime else. When I came back I found the problem straight away. I was not using PCG32II but PCG32.

So, what we have is this: PCG32 will give earth shattering results when tweaked in a certain way. PCG32II does not benefit from the same tweak - in fact, it behaves as if not tweaked at all. So, we could end up with two versions: One for single thread applications and on for multiple thread applications where more than one thread requires random numbers. However, I need to fathom out why PCG32II does not benefit from the tweaking. Not being able to see the wood from the trees springs to mind.<flaming heck>
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

Right, PCG32II now benefits from tweaking.

However, the range function was very much slower when the asm was replaced with FreeBASIC. It uses a bit more arithmetic than singles giving asm the edge. The range function uses labels and sometimes the gcc optimization level -O3 'fell over'. So, we had to use -O2. When we did that the single speed took a 30% speed hit. So, we are damned if we do, and damned if we don't.

I have good news. We can avoid the -O3 problem by using temporary labels in the asm.

So, we can now use FreeBASIC for the singles, asm for the range and compile using -O3. Phew!

Work is now required to update CryptoRndII and PCG32II.

RndMT has been updated because that would not compile with -O3. It does now.<smile>
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Single and Range wrappers for minimal PCG PRNG

Post by deltarho[1859] »

The code PCG32II.bas above has been replaced at 'Here is PCG32II.bas ( 2 July 2017 )'

Here is the latest comparison table: ( -gen gcc -asm intel -Wc -O3 )

Code: Select all

Random number generation
Millions per second
 
               32 bit   64 bit
Default   0    84.420   185.919
CRT       1    69.564    72.295
Fast      2    106.741  331.956
Mersenne  3    86.903   192.040
QBasic    4    99.874   323.384
CryptoRndII   588.233   972.532
PCG32II       493.647   974.215
CryptoRndII, which is now faster than it was, is still the fastest generator on my machine in 32 bit mode. In reality the speed difference between CryptoRndII and PCG32II may not be noticeable. In 64 bit mode CryptoRndII and PCG32II are neck and neck, speedwise.

I did a PractRand one terabyte test on PCG32II to make sure that nothing untoward had crept in and all was well on that front. I also did a 10^8 average single and range test to make sure they were OK and got 0.5000419348314485 and 50.00090511 ( 0 to 100 ) respectively.

PCG32II is the only thread safe generator above.

It is worth remembering the importance of using .myRandomize whether for a random seeding or fixed seeding. Early samples of both xorshift128 and xoroshiro128 exhibit a bias in some bits until a sufficient number of samples have been generated. I have not seen proof but I suspect that all generators which use shl and shr need a warm-up. This includes PCG and Mersenne Twister. Marsaglia's Multiply-With-Carry probably does not need a warm-up and CryytoRndII certainly doesn't.
Post Reply