Thread status polling - safe?

General FreeBASIC programming questions.
badidea
Posts: 2628
Joined: May 24, 2007 22:10
Location: The Netherlands

Thread status polling - safe?

Post by badidea »

Hello all, I am still working on a crossword generator and I am out of ideas to optimize the algorithm so I want to try multi-threading to speed up the process. The plan is to let each thread work on a different data (sub)set and when all done select the best solution in the main thread. While the treads are running, I would like some feedback on the progress, so I come up with the code below. It this safe, can each thread (including main) access this shared threadStatus array without trouble?

Code: Select all

dim shared as integer threadStatus(0 to 2)

sub mythread(param as any ptr)
	'warning pointer param does not point to a valid memory adress!
	dim as integer userData = cast(integer, param) 'unpack
	dim as integer threadId = userData and &h0f 'last nibble = id
	dim as integer delay = (userData shr 4) \ 100
	'start work and update status
	for i as integer = 1 to 100
		sleep delay, 1 'sleeping is hard work
		threadStatus(threadId) = i
	next
end sub

'launch threads. sleep time & threadId in packed 1 value
dim as any ptr thread1 = threadCreate(@mythread, cptr(any ptr, (2000 shl 4) + 0))
dim as any ptr thread2 = threadCreate(@mythread, cptr(any ptr, (5000 shl 4) + 1))
dim as any ptr thread3 = threadCreate(@mythread, cptr(any ptr, (8000 shl 4) + 2))

'poll thread status and wait for completion
dim as boolean allDone
do
	allDone = true
	locate 1,1
	for i as integer = 0 to 2
		print "thread(" & i & "): " & threadStatus(i) & "%"
		if threadStatus(i) < 100 then allDone = false
	next
	sleep 100, 1
loop until allDone

print "extra wait"
threadWait(thread1) 'or threadDetach
threadWait(thread2)
threadWait(thread3)
print "end!"
sleep 1000
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

Code analysis
- In your code, each sub-thread among the three has write access to its own shared array element.
- The main thread has read access to all elements of the shared array.
- So there is a potential access conflict only between the main thread and each sub-thread.

When accessing variables shared between multiple threads, all their accesses should generally be placed inside Mutexlock...Mutexunlock blocks, and that in all threads (including the main thread).
But when the shared variable is just a simple predefined numeric type of size <= Sizeof(INTEGER) (a single assembler instruction for access), using the mutex may not be mandatory.

In conclusion, for this particular case (accessing a single shared INTEGER), your code is safe.
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

Since the progress of each sub-thread is continously monitored by the main thread, the threadWait statement can become non-mandatory and a threadDetach can be coded as follows:

Code: Select all

#include "fbthread.bi"

dim shared as integer threadStatus(0 to 2)

sub mythread(param as any ptr)
	'warning pointer param does not point to a valid memory adress!
	dim as integer userData = cast(integer, param) 'unpack
	dim as integer threadId = userData and &h0f 'last nibble = id
	dim as integer delay = (userData shr 4) \ 100
	'start work and update status
	for i as integer = 1 to 100
		sleep delay, 1 'sleeping is hard work
		threadStatus(threadId) = i
	next
end sub

'launch threads. sleep time & threadId in packed 1 value
threadDetach(threadCreate(@mythread, cptr(any ptr, (2000 shl 4) + 0)))
threadDetach(threadCreate(@mythread, cptr(any ptr, (5000 shl 4) + 1)))
threadDetach(threadCreate(@mythread, cptr(any ptr, (8000 shl 4) + 2)))

'poll thread status and wait for completion
dim as boolean allDone
do
	allDone = true
	locate 1,1
	for i as integer = 0 to 2
		print "thread(" & i & "): " & threadStatus(i) & "%"
		if threadStatus(i) < 100 then allDone = false
	next
	sleep 100, 1
loop until allDone

print "end!"
sleep 1000
(once detached, a thread can continue to access shared variables, lock or unlock mutexes, wait on or signal conditional variables)

But be careful, because threadWait does not consume any resources while a waiting loop does (resource consumption that can be tempered by a sleep instruction in the waiting loop).
badidea
Posts: 2628
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Thread status polling - safe?

Post by badidea »

Thanks for the clarification and the threadDetach suggestion.
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

Otherwise, I am not a big fan of sharing global variables even across threads or using the thread parameter (any pointer) as just a hacked container for value(s).
I prefer to pass instead the address of a structured object that contains all the useful data to the threads functioning (we could add here the handles for mutexes and conditional variables...).
Example:

Code: Select all

type thread
    dim as integer delay
    dim as integer status
    dim as any ptr handle
    declare static sub proc(byval pthread as thread ptr)
end type

sub thread.proc(byval pthread as thread ptr)
    'start work and update status
    for i as integer = 1 to 100
        sleep pthread->delay, 1
        pthread->status = i
    next
end sub

dim as thread mythread(1 to 3) = {type(20), type(50), type(80)}
'launch threads. all data in an object
for i as integer = 1 to 3
    mythread(i).handle = threadCreate(cptr(any ptr, @thread.proc), @mythread(i))
next

'poll thread status and wait for completion
dim as boolean allDone
do
    allDone = true
    locate 1,1
    for i as integer = 1 to 3
        print "thread(" & i & "): " & mythread(i).status & "%"
        if mythread(i).status < 100 then allDone = false
    next
    sleep 100, 1
loop until allDone

print "extra wait"
for i as integer = 1 to 3
    threadWait(mythread(i).handle)
next
print "end!"
sleep 1000
badidea
Posts: 2628
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Thread status polling - safe?

Post by badidea »

I agree, I took the 'userData pointer hack' from another example and was confused for a moment. Your implementation is much cleaner.
badidea
Posts: 2628
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Thread status polling - safe?

Post by badidea »

I have been doing some performance testing with this code:

Code: Select all

const NUM_THREAD = 24

type thread
	dim as integer status
	dim as integer dummy
	dim as any ptr handle
	declare static sub proc(byval pthread as thread ptr)
end type

sub thread.proc(byval pthread as thread ptr)
	'start work and update status
	for i as integer = 1 to 100
		for j as integer = 1 to 1000
			for k as integer = 1 to 30000
				pthread->dummy = j+k
			next
		next
		sleep 1, 1
		pthread->status = i
	next
end sub

dim as thread mythread(1 to NUM_THREAD)

dim as double t0 = timer

'launch threads. all data in an object
for i as integer = 1 to NUM_THREAD
	mythread(i).handle = threadCreate(cptr(any ptr, @thread.proc), @mythread(i))
next

'poll thread status and wait for completion
dim as boolean allDone
do
	allDone = true
	locate 1,1
	for i as integer = 1 to NUM_THREAD
		print "thread(" & i & "): " & mythread(i).status & "% ",
		if mythread(i).status < 100 then allDone = false
		if i mod 4 = 0 then print
	next
	sleep 100, 1
loop until allDone

dim as double t1 = timer
print "dt: " & t1 - t0

print "wait"
for i as integer = 1 to NUM_THREAD
	threadWait(mythread(i).handle)
	mythread(i).handle = 0
next
print "end!"
sleep 1000
The performance with the 32-bit compiler seems a bit disappointing and erratic. With 64-bit, multi-threaded code seems to work quit well (although a bit weird; same times for 16, 24 and 32 threads). OS: Ubuntu 22.04.5 LTS. CPU: AMD Ryzen 7 7745HX (8 Core, 16 Threads).
These are my results:

Code: Select all

'results in seconds with fbc32 -mt -w all -exx
'NUM_THREAD = 1: 4.4
'NUM_THREAD = 2: 8.7
'NUM_THREAD = 4: 9.1 to 16.2
'NUM_THREAD = 8: 23.1 to 28.0
'NUM_THREAD = 16: 47.3 to 48.2
'NUM_THREAD = 24: 62.3

'results in seconds with fbc32 -mt -w all
'NUM_THREAD = 1: 2.2
'NUM_THREAD = 2: 6.2
'NUM_THREAD = 4: 12.7
'NUM_THREAD = 8: 12.9
'NUM_THREAD = 16: 35.9
'NUM_THREAD = 24: 36.0

'results in seconds with fbc32 -gen gcc -mt -w all
'NUM_THREAD = 1: 2.3
'NUM_THREAD = 2: 2.3
'NUM_THREAD = 4: 9.5 to 13.2
'NUM_THREAD = 8: 19.5
'NUM_THREAD = 16: 31.3
'NUM_THREAD = 24: 35.7

'results in seconds with fbc32 -gen gas -mt -w all
'NUM_THREAD = 24: 34.9

'results in seconds with fbc64 -mt -w all
'NUM_THREAD = 1: 2.4
'NUM_THREAD = 2: 2.3
'NUM_THREAD = 4: 6.5
'NUM_THREAD = 8: 10.7
'NUM_THREAD = 16: 16.8
'NUM_THREAD = 24: 17.3
'NUM_THREAD = 32: 16.0
'NUM_THREAD = 64: 20.3
'NUM_THREAD = 128: 35.0
'NUM_THREAD = 256: 66.9

'results in seconds with fbc64 -gen gcc -mt -w all
'NUM_THREAD = 24: 17.1

'results in seconds with fbc64 -gen gas64 -mt -w all
'NUM_THREAD = 24: 17.2
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

Hardware memory accesses are not parallelizable..
So, to clearly show the contribution of multi-threading, the thread should not spend a large part of its time accessing memory.

On the other hand, to easily adjust the duration of the thread, it is better to avoid the possibilities of code optimization by the compiler.

Proposed modification to improve the rendering:

Code: Select all

const NUM_THREAD = 4

type thread
	dim as integer status
	redim as integer dummy(1 to 300000)
	dim as any ptr handle
	declare static sub proc(byval pthread as thread ptr)
end type

sub thread.proc(byval pthread as thread ptr)
	'start work and update status
	for i as integer = 1 to 100
		for j as integer = 1 to 300000
			pthread->dummy(j) = sin(i+j) + cos(i+j) + tan(i+j) + acos(i+j) + asin(i+j) + atn(i+j) + exp(i+j) + log(i+j) + sqr(i+j)
		next
		sleep 1, 1
		pthread->status = i
	next
end sub

dim as thread mythread(1 to NUM_THREAD)

dim as double t0 = timer

'launch threads. all data in an object
for i as integer = 1 to NUM_THREAD
	mythread(i).handle = threadCreate(cptr(any ptr, @thread.proc), @mythread(i))
next

'poll thread status and wait for completion
dim as boolean allDone
do
	allDone = true
	locate 1,1
	for i as integer = 1 to NUM_THREAD
		print "thread(" & i & "): " & mythread(i).status & "% ",
		if mythread(i).status < 100 then allDone = false
		if i mod 4 = 0 then print
	next
	sleep 100, 1
loop until allDone

dim as double t1 = timer
print "dt: " & t1 - t0

print "wait"
for i as integer = 1 to NUM_THREAD
	threadWait(mythread(i).handle)
	mythread(i).handle = 0
next
print "end!"
sleep 1000
If 'NbT' is the number of threads available on the CPU, as long as 'NUM_THREAD <= NbT', the execution time is almost constant (increasing very little depending on the number of threads used, because a certain quota of memory access remains).
It starts to really increase linearly as soon as the number of threads available on the CPU is exceeded.
In general, it is very tricky to find the theoretical maximum multi-threading performance on a test code (use of Sleep statement also impacts the results).

Note: the compile option '-mt' is useless.

Results with my AMD Ryzen 5 3500U (4 cores, 8 threads) and GAS 32-bit:
1 thread : 7.6 s
2 threads : 7.7 s
4 threads : 8.2 s
8 threads : 9.0 s
16 threads : 16.5 s
32 threads : 32.3 s
64 threads : 64.5 s
badidea
Posts: 2628
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Thread status polling - safe?

Post by badidea »

Unfortunately, my intended use, generating crossword puzzles, involves a lot of memory access. I was hoping that the per core L1-cache would help here. The amount of memory used is not so much. I am still working on the actual code, which I will post once in a more presentable state.

My results with your test code:

Code: Select all

'32 bit gas -exx
'NUM_THREAD = 1: 5.0
'NUM_THREAD = 4: 5.0
'NUM_THREAD = 8: 5.2
'NUM_THREAD = 16: 6.1
'NUM_THREAD = 32: 12.1
'NUM_THREAD = 64: 23.9

'64 bit gcc
'NUM_THREAD = 1: 1.6
'NUM_THREAD = 4: 1.7
'NUM_THREAD = 8: 1.8
'NUM_THREAD = 16: 3.0
'NUM_THREAD = 32: 5.7
'NUM_THREAD = 64: 11.1
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

In fact, until we actually write the real code for the different tasks and do some measurements, we can not predict how much execution time we would gain by running them in parallel instead of sequentially.

This is how I practiced for Dynamic ArrayList (with circular doubly linked list under the hood) and got for the 2 threads case (main + child) a gain of 25% instead of the maximum achievable 50%.
badidea
Posts: 2628
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Thread status polling - safe?

Post by badidea »

The crossword generator code can be found here: https://github.com/verybadidea/crossword_generator
In crossword_generator.bas the number of threads can be set with N_WORDMAP
It calls the function recursiveSmart0 in wordmap_v1.bas

The results here:
1 threads: 6.2 seconds
2 threads parallel: 21 seconds !
2 threads consecutive: 11.4 seconds

It seems like the both threads are accessing some common resource, but I have no idea what. Al previously shared variables are eliminated and I don't see any freebasic function calls that could be a problem. So far my detour into multi-threading is not very successful.
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

I do not find the same proportion as you between 1 thread and 2 threads in parallel.
My results (average over 3 measurements):
N_WORDMAP = 1 : 10.8 seconds
N_WORDMAP = 2 : 16.6 seconds

Given that the threads always do the same work, we must therefore compare 2*10.8=21.6s and 16.6s, which is an improvement of 23.1% (compared to the theoretical maximum 50% for 2 threads).

Considering our respective CPUs, I consider that it is your result for 2 threads (in parallel) which is weird.

Note:
For safer operation and in line with the thread specifications, I would prefer that you simply transform the member function 'function recursiveSmart0() as isz' (return not used) into a member sub 'sub recursiveSmart0()'
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

fxm wrote: Jan 22, 2025 7:55 I do not find the same proportion as you between 1 thread and 2 threads in parallel.
My results (average over 3 measurements):
N_WORDMAP = 1 : 10.8 seconds
N_WORDMAP = 2 : 16.6 seconds

Given that the threads always do the same work, we must therefore compare 2*10.8=21.6s and 16.6s, which is an improvement of 23.1% (compared to the theoretical maximum 50% for 2 threads).
  • On the other hand, I find that increasing the number of threads above 3 degrades the result compared to 2:
    N_WORDMAP = 3 : 25.4 seconds (improvement +21.6% for +66.6% max theoretical)
    N_WORDMAP = 4 : 36.8 seconds (improvement +14.8% for +75.0% max theoretical)
    N_WORDMAP = 5 : 54.0 seconds (improvement 0.0% for +80.0% max theoretical)
    N_WORDMAP = 6 : 71.1 seconds (improvement -8.9% for +83.3% max theoretical)
    N_WORDMAP = 7 : 92.3 seconds (improvement -22.1% for +85.7% max theoretical)

Each core has its own cache memory that allows to buffer the useful data (in read and write) of the shared memory.
Consequently, in case of writing in the cache, a cache coherence algorithm between cores is executed if necessary to keep for the common memory areas between caches the most recent values ​​among all the caches concerned.
I think that it is this algorithm that penalizes the most the performances of multi-threading in case of multi-access (in write) in shared memory.

It is therefore necessary to limit as much as possible the access of threads (in particular writing) to shared memory.
For example, all intermediate results of threads could be performed in local memory, and only the final useful ones in shared memory.


@badidea, example issued from your previous code:

Code: Select all

const NUM_THREAD = 4

type thread
	dim as integer status
	dim as integer dummy
	dim as any ptr handle
	declare static sub proc(byval pthread as thread ptr)
end type

sub thread.proc(byval pthread as thread ptr)
	'start work and update status
	for i as integer = 1 to 100
		for j as integer = 1 to 1000
			for k as integer = 1 to 30000
				pthread->dummy = j+k
'				dim as integer N = pthread->dummy+k
'				dim as integer N = j+k
			next
		next
		sleep 1, 1
		pthread->status = i
	next
end sub

dim as thread mythread(1 to NUM_THREAD)

dim as double t0 = timer

'launch threads. all data in an object
for i as integer = 1 to NUM_THREAD
	mythread(i).handle = threadCreate(cptr(any ptr, @thread.proc), @mythread(i))
next

'poll thread status and wait for completion
dim as boolean allDone
do
	allDone = true
	locate 1,1
	for i as integer = 1 to NUM_THREAD
		print "thread(" & i & "): " & mythread(i).status & "% ",
		if mythread(i).status < 100 then allDone = false
		if i mod 4 = 0 then print
	next
	sleep 100, 1
loop until allDone

dim as double t1 = timer
print "dt: " & t1 - t0

print "wait"
for i as integer = 1 to NUM_THREAD
	threadWait(mythread(i).handle)
	mythread(i).handle = 0
next
print "end!"
sleep 1000
My results:
- executing only 'pthread->dummy = j+k' in the thread: 28.8 s
- executing only 'dim as integer N = pthread->dummy+k' in the thread: 11.5 s
- executing only 'dim as integer N = j+k' in the thread: 8.7 s
Last edited by fxm on Jan 22, 2025 17:27, edited 3 times in total.
Reason: Post complemented.
badidea
Posts: 2628
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Thread status polling - safe?

Post by badidea »

fxm wrote: Jan 22, 2025 12:49 It is therefore necessary to limit as much as possible the access of threads (in particular writing) to shared memory.
For example, all intermediate results of threads could be performed in local memory, and only the final useful ones in shared memory.
That's sounds difficult for my recursive crossword generator, but I will try a few things.

I did a test (crossword generator) on Windows11 (same system) and things look less bad, but still not optimal:

Code: Select all

32 bit gas -exx
Threads	Time	Time/Threads
1	8.9	8.9
2	12.1	6.1
3	15.5	5.2
4	21.2	5.3
5	28.2	5.6
6	37	6.2

Code: Select all

64 bit gcc
Threads	Time	Time/Threads
1	6.4	6.4
2	9.1	4.6
3	12.5	4.2
4	15.7	3.9
5	22.2	4.4
6	28.2	4.7
To be investigated further...

I brought my work laptop with me today with Windows11 and a 13th Gen Intel(R) Core(TM) i5-1345U CPU
I ran the same code (twice because the virus/malware-scanner was having a hard time), results:

Code: Select all

32 bit -exx
Threads		Test1	Test2
1		11.4	12.8
2		18.9	18.2
3		38.1	36.9
4		64.6	58.9
4 consecutive	40.0	39.3

Code: Select all

64 bit
Threads		Test1	Test2
1		7.5	7.5
2		16.3	17.6
3		31.9	29.9
4		51.2	64.3
4 consecutive	28.1	28.9
This laptop has limited cooling so clock speed may have dropped during the longer tests.
But also here, parallel threads don't seem to work well with this code.
fxm
Moderator
Posts: 12467
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Thread status polling - safe?

Post by fxm »

I modified the thread 'wordmap_type.recursiveSmart0()' so that it works internally on a copy of the instance passed (copy at the beginning, and copy back at the end) like this (so we lose the monitoring of the progress):

Code: Select all

function wordmap_type.recursiveSmart0() as isz
	dim as wordmap_type wt = this
	dim as isz depth = 0
	dim as word_placement wp '= type(0, 0, 0)
	dim as string word = wt.entries.entry(depth).word
	dim as string erasedStr
	wp.ori = 0 'horizontal
	for row as isz = 0 to wt.ubr
		wp.row = row
		for col as isz = 0 to wt.ubc - (len(word) - 1)
			wp.col = col
			'sleep 1 : print ".";
			sleep 1,1 : wt.loopCount += 1
			wt.copyToMapSave(wp, word, erasedStr)
			wt.wpList(depth) = wp
			wt.recursiveSmart1(depth + 1)
			wt.copyToMap(wp, erasedStr) 'undo
		next
	next
	wp.ori = 1 'vertical
	for col as isz = 0 to wt.ubc
		wp.col = col
		for row as isz = 0 to wt.ubr - (len(word) - 1)
			wp.row = row
			'sleep 1 : print ".";
			sleep 1,1 : wt.loopCount += 1
			wt.copyToMapSave(wp, word, erasedStr)
			wt.wpList(depth) = wp
			wt.recursiveSmart1(depth + 1)
			wt.copyToMap(wp, erasedStr) 'undo
		next
	next
	wt.done = 1
	this = wt
	return 0 'no more positions to try
end function
But I did not see any improvement.

This is explained because simple numeric variables are well in local memory, but for all the pseudo-objects like var-len strings and dynamic arrays, only the descriptors are in local memory but not the data itself which is always on the heap (only fixed length data can be put in local memory).
Post Reply