Ramblings of crazed multi-threaders

General discussion for topics related to the FreeBASIC project or its community.
1000101
Posts: 2556
Joined: Jun 13, 2005 23:14
Location: SK, Canada

Ramblings of crazed multi-threaders

Post by 1000101 »

This forum thread (pun intended) came about from a thread in the Tips and Tricks board.

Original thread topic: Using Framebased movement with Delta-timing by relsoft

1000101 » Nov 16, 2012 8:04 wrote:Re: Using Framebased movement with Delta-timing

The method I use is aggressive multi-threading: one thread per sub-system (renderer, logic (physics, input processing), input, network, sound, etc). Each sub-system can then run at its own resolution, and the sub-systems use a tagged FIFO queue to communicate with each other. This allows each sub-system to not only be more responsive, but also to be minimally affected by latency in any other sub-system.

There are some drawbacks, however: this method requires greater knowledge and more careful design to implement successfully.

This is, imo, preferable to a single-threaded program which relies on the computing power of a single resource (i.e. core frequency and logic). Since each sub-system only takes a fraction of the total frequency to run, the operating system can load-balance the threads for maximum efficiency across all available cores. Since the number of cores is now increasing faster than the speed and ability of any single core, the effect is roughly a cores * frequency - overhead performance gain. The program overhead is minimal (every sub-system must read its queue each "loop cycle" and add to a queue to communicate) and the OS overhead (thread switching) is also small, so the benefits far outweigh the additional complexity of the system. Further, as the number of processor cores increases, the responsiveness of each sub-system potentially increases with it.

Several months ago I posted libraries which are designed for exactly this. Check this forum thread for my code.
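
To give the idea some shape, here is a minimal sketch of such a queue. This is not the library code linked above, just a mutex-guarded illustration with made-up names (message_t, MSG_TICK, etc); the real thing is tagged per sub-system and lockless.

Code: Select all

'' Minimal sketch of a FIFO message queue between two sub-systems.
'' Simplifications: fixed 64-slot ring, no overflow handling, one queue only.

Type message_t
    tag   As Integer                        '' what kind of message this is
    value As Integer                        '' trivial payload for the example
End Type

Const MSG_TICK = 1, MSG_QUIT = 2

Dim Shared As message_t msgqueue( 0 To 63 ) '' ring buffer
Dim Shared As Integer   qhead, qtail
Dim Shared As Any Ptr   qlock

Sub queue_push( ByVal tag As Integer, ByVal value As Integer )
    MutexLock( qlock )
    msgqueue( qtail And 63 ).tag   = tag
    msgqueue( qtail And 63 ).value = value
    qtail += 1
    MutexUnlock( qlock )
End Sub

Function queue_pop( ByRef msg As message_t ) As Integer
    Dim As Integer got = 0
    MutexLock( qlock )
    If qhead < qtail Then
        msg = msgqueue( qhead And 63 )
        qhead += 1
        got = 1
    End If
    MutexUnlock( qlock )
    Return got
End Function

'' Consumer sub-system: drains its queue once per loop cycle, at its own rate.
Sub logic_thread( ByVal unused As Any Ptr )
    Dim As message_t msg
    Do
        While queue_pop( msg )
            If msg.tag = MSG_QUIT Then Exit Sub
            Print "logic: got tick"; msg.value
        Wend
        Sleep 5, 1
    Loop
End Sub

qlock = MutexCreate()
Dim As Any Ptr logic = ThreadCreate( @logic_thread )

'' Producer side (imagine this is the input sub-system).
For i As Integer = 1 To 10
    queue_push( MSG_TICK, i )
    Sleep 10, 1
Next
queue_push( MSG_QUIT, 0 )

ThreadWait( logic )
MutexDestroy( qlock )
Sleep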
Gonzo » Nov 16, 2012 12:13 wrote:Re: Using Framebased movement with Delta-timing

personally i use the single responsibility principle
if you do that, you can use round-robins on the "other end" without any locks or queues
the drawback (there is always one) is that you have to iterate the entire array on the "other end"
if your queues are small this is a non-issue.. given that the queue has an .alive member, it's a fast loop
you'd have to have many checks in place to avoid unfavourable conditions, but if your objects have well-defined states,
your engine can handle anything that comes its way accordingly
i imagine the last part goes for any type of multithreaded system anyway
the main point is to always avoid locks, not because they give you less of a headache, but because they force the OS to not release your thread until the lock is freed

and yes, single-threaded is not the way to go :P unless you are making a demo...
i reckon most people have at least 4 cores now.. with intel that's 8 threads (because of "hyperthreading")

note: i haven't made a lockfree solution for work-threads yet.. that is going to be a major headache
i have only made one for transferring data between the physics and render threads, in various ways
1000101 » Nov 16, 2012 19:36 wrote:Re: Using Framebased movement with Delta-timing

There are lockless solutions to many problems, but as gonzo stated, they are a major headache to design and implement. Anywhere you have a write-once, read-many situation (for example) is a good place for lockless algorithms (just create the data before the readers start). Often, however, data cannot be accessed that freely and cannot be lockless, or it requires some form of guard to validate it. My tagged queue is lockless but uses the cmpxchg8b instruction when adding to and removing from the queue to validate the data. This may result in a small "spin" to re-apply the update, but it is still faster than a lock, since the only way the function will spin is if multiple cores modify the exact same memory address (the head/tail of the queue) at the exact same time, which is highly unlikely. To protect against that case, the code uses hardware-level verification of the data (cmpxchg8b) when updating the queue.
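
For illustration only, here is a sketch of that retry pattern. It is not the tagged-queue code: a real implementation uses the hardware instruction (lock cmpxchg / cmpxchg8b), while the compare_and_swap() below is a mutex-based stand-in so the example stays plain, portable FreeBASIC; only the shape of the retry loop is the point.

Code: Select all

'' Sketch of a compare-and-swap retry loop.  compare_and_swap() is a mutex
'' stand-in for the real atomic instruction; everything else is the pattern:
'' snapshot, compute, commit only if nothing changed, otherwise spin.

Dim Shared As Integer shared_tail = 0
Dim Shared As Any Ptr cas_lock

'' If *dest still holds expected, write newval and report success.
Function compare_and_swap( ByVal dest As Integer Ptr, _
                           ByVal expected As Integer, _
                           ByVal newval As Integer ) As Integer
    Dim As Integer ok = 0
    MutexLock( cas_lock )
    If *dest = expected Then
        *dest = newval
        ok = 1
    End If
    MutexUnlock( cas_lock )
    Return ok
End Function

'' Claim the next queue slot: only commit the new tail if nobody else moved
'' it in the meantime, otherwise retry (the rare "small spin").
Function claim_slot( ) As Integer
    Dim As Integer oldtail, newtail
    Do
        oldtail = shared_tail
        newtail = oldtail + 1
    Loop Until compare_and_swap( @shared_tail, oldtail, newtail )
    Return oldtail
End Function

Sub producer( ByVal unused As Any Ptr )
    For i As Integer = 1 To 1000
        claim_slot()
    Next
End Sub

cas_lock = MutexCreate()
Dim As Any Ptr p1 = ThreadCreate( @producer )
Dim As Any Ptr p2 = ThreadCreate( @producer )
ThreadWait( p1 )
ThreadWait( p2 )
Print "slots claimed: "; shared_tail        '' always 2000, no update is lost
MutexDestroy( cas_lock )
Sleep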

No matter how you handle it, the real solution to modern computing problems lies in taking advantage of modern technology. Trying to "carry forward" older, inefficient paradigms is a lost cause. Just as none of us use q[uick]basic (note the two are different) anymore, as we move forward we need to stop reaching for solutions that no longer fit the new paradigm. Some algorithms and techniques are timeless or independent of the technology used, but most can be modernized or redesigned to face new challenges and take advantage of new methods and technologies.

I think gonzo won't argue with me that multi-threading can greatly help complex systems, which are really sets of smaller systems that need to work together. Decoupling those systems and creating communication pathways between them (regardless of implementation) is a more efficient approach overall on modern hardware than single-threaded brute force and workarounds.

I do have a nitpick, though. Gonzo stated that the OS will not release a thread while there is an active lock. This is categorically not true. Locks are a function of program logic. The OS only knows you have a lock established, not its purpose and intent. The OS will task-switch no matter how many locks are active in a thread. The task scheduler will not run other threads which are waiting to acquire an active lock, but that does not prevent it from switching away from a thread that holds locks. This can be easily demonstrated by the following code:

Code: Select all

Dim Shared As Any Ptr   mutex
Dim As Any Ptr      threads( 0 To 2 )


'' Grabs the mutex and holds it while sleeping for a long time.
Sub thread_0( ByVal foo As Any Ptr )
   Print "Thread 0: Acquiring lock"
   MutexLock( mutex )
   Print "Thread 0: Sleeping for a long time"
   Sleep 10000, 1
   Print "Thread 0: Releasing lock"
   MutexUnLock( mutex )
   Print "Thread 0: Terminating"
End Sub

'' Blocks in MutexLock until thread 0 releases the mutex, then does the same.
Sub thread_1( ByVal foo As Any Ptr )
   Print "Thread 1: Acquiring lock"
   MutexLock( mutex )
   Print "Thread 1: Sleeping for a long time"
   Sleep 10000, 1
   Print "Thread 1: Releasing lock"
   MutexUnLock( mutex )
   Print "Thread 1: Terminating"
End Sub

'' Takes no lock at all; it keeps getting scheduled and printing even while
'' thread 0 holds the mutex, which is the point of the demonstration.
Sub thread_2( ByVal foo As Any Ptr )
   Print "Thread 2: No lock, now a silly loop"
   For i As Integer = 0 To 100000
      If( ( i Mod 123 ) = 0 )Then
         Print "Thread 2: I'm going hard!"
      EndIf
   Next
   Print "Thread 2: Terminating" 
End Sub


Print "Main: Creating mutex"
mutex = MutexCreate()

Print "Main: Spawning child theads"
threads( 0 ) = ThreadCreate( @thread_0 )
threads( 1 ) = ThreadCreate( @thread_1 )
threads( 2 ) = ThreadCreate( @thread_2 )

ThreadWait( threads( 0 ) )
ThreadWait( threads( 1 ) )
ThreadWait( threads( 2 ) )

Print "Main: Destroying mutex"
MutexDestroy( mutex )

Print "Main: Terminating"
Sleep
End
Gonzo » Nov 16, 2012 20:03 wrote:Re: Using Framebased movement with Delta-timing

yes, you may be right.. it doesn't make sense for the OS not to give time to other processes
still, as a programmer your concern should be within your own program (and that is what i meant)

if it seemed like i wasn't using multithreading - i am.. way too many threads =)
1 for render, 1 for everything else (physics etc.), up to 8 work threads, threads for networking and sound.. and i'm sure the graphics driver is using quite a few as well
my main reason for using up to 8 work threads is that the work i'm doing takes an extremely long time
but not enough to make the physics etc. thread slow - it's enough for, let's say, 1 round
there are however at least 4 cores that aren't doing anything, or enough time on average to do work at least 4-8 times in parallel
if i had any concern about my setup - it's that i'm not allowed to say which thread runs on which core
i must ensure that the rendering thread is all alone on one core, but that's impossible, isn't it?

the problem really just boils down to the fact that i'm rendering way too heavily (something i've tried for 2 years to rectify, but it doesn't get any more likely with each day/month that passes)
and the other is the work threads doing too much and consuming too much time
this is something i can fix by splitting the work into passes, something i'm confident i can do in time
in any case, the most effective work thread scheduling i have used yet is simply using conditionals (signals)
the control part of my thread scheduler is lockless, which is a pain in the ass, because the control part is not the receiving end
it's a convoluted system because it's parallelized 3 ways :) physics -> work -> render
the render thread can't wait for anything - ever, so there's some hardcore stuff in there that no coder should have to put up with :P
it basically boils down to taking over memory..
once the controller has scheduled a thread, and that thread is finished, the rendering thread immediately takes over the pointers to the memory in question, and that's it.. the rendering thread may or may not choose to do anything with it, depending on how stressed it is
switching pointers is something i've had to come up with as a necessity for survival :P more or less
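
for illustration, a minimal sketch of that hand-over (made-up names, and a short mutex where the real control path is lockless - it only shows the pointer-swap idea, not the actual implementation):

Code: Select all

'' The work thread publishes a freshly built buffer; the render thread takes
'' over the pointer whenever it feels like it and never waits for the worker.

Type mesh_data_t
    vertex_count As Integer
    '' ... vertex arrays etc. would live here ...
End Type

Dim Shared As mesh_data_t Ptr work_buffer       '' filled by the work thread
Dim Shared As mesh_data_t Ptr render_buffer     '' owned by the render thread
Dim Shared As Integer     work_ready
Dim Shared As Any Ptr     swap_lock

'' Work thread side: build a fresh buffer, then publish it.
Sub work_pass( )
    Dim As mesh_data_t Ptr fresh = New mesh_data_t
    fresh->vertex_count = 1234                  '' pretend we generated geometry

    MutexLock( swap_lock )
    If work_buffer <> 0 Then Delete work_buffer '' overwrite an unclaimed result
    work_buffer = fresh
    work_ready  = -1
    MutexUnlock( swap_lock )
End Sub

'' Render thread side: take over the pointers if new data is ready, otherwise
'' just keep rendering whatever it already owns.
Sub render_pass( )
    MutexLock( swap_lock )
    If work_ready Then
        If render_buffer <> 0 Then Delete render_buffer
        render_buffer = work_buffer             '' take over the memory
        work_buffer   = 0
        work_ready    = 0
    End If
    MutexUnlock( swap_lock )

    If render_buffer <> 0 Then
        Print "rendering"; render_buffer->vertex_count; " vertices"
    End If
End Sub

swap_lock = MutexCreate()
work_pass()         '' normally called from the work thread
render_pass()       '' normally called from the render thread
render_pass()       '' no new data this time; just re-renders the old buffer
MutexDestroy( swap_lock )
Sleep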

in closing.. i wonder, if i create the threads in a very specific order, do you think the OS will round-robin them over the available cores?
i hope so, it's something i would consider trying out
Please keep your posts relevant to the topic (which is intentionally vague).

Edit: Fixed url link in quote
Last edited by 1000101 on Nov 16, 2012 22:15, edited 1 time in total.
1000101
Posts: 2556
Joined: Jun 13, 2005 23:14
Location: SK, Canada

Re: Ramblings of crazed multi-threaders

Post by 1000101 »

@gonzo:

I wasn't trying to suggest that you don't use multi-threading; quite the opposite - I know full well the trials and tribulations of learning a whole new set of programming logic and data interactions. :)

Long-running worker threads are exactly the reason I created my thread pool library, so long as the function to be performed can be processed asynchronously.

Your concern about CPU affinity (which thread runs on which core) can actually be answered positively. :)

All heavily multi-threading operating systems (Windows, Linux, BSD, UNIX, Solaris, etc) have functions that allow a program to modify its process and thread priorities and core affinity. Since I mainly use and develop for Windows I can give you information on that, but equivalent information is available in the other OSes' SDKs.

MSDN article on thread priorities: http://msdn.microsoft.com/en-us/library ... p/ms685100(v=vs.85%29.aspx
MSDN SetProcessPriority: http://msdn.microsoft.com/en-us/library ... p/ms686219(v=vs.85%29.aspx

MSDN article on thread affinity: http://msdn.microsoft.com/en-us/library ... p/ms684251(v=vs.85%29.aspx


Using priorities and affinities you can influence which thread gets more or less time and which processor (core) it should run on. By default all processes and threads are set to normal priority and have a complete core mask (they can run on any core).
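
For example, here is a small Windows-only sketch of those calls (assuming the declarations that ship in windows.bi). Note that the Any Ptr returned by FreeBASIC's ThreadCreate is not a Win32 HANDLE, so the simplest approach is for each thread to configure itself via GetCurrentThread(); also remember the OS treats both priority and affinity as requests, not guarantees.

Code: Select all

'' Windows-only sketch: each thread requests its own priority/affinity.
#include "windows.bi"

Sub render_thread( ByVal unused As Any Ptr )
    '' ask to run only on core 0, with a slightly raised priority
    SetThreadAffinityMask( GetCurrentThread(), 1 Shl 0 )
    SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL )

    Print "render thread: requested core 0, above-normal priority"
    Sleep 500, 1
End Sub

Sub worker_thread( ByVal core As Any Ptr )
    '' each worker requests a different core: mask = 1 shl core_index
    Dim As Integer core_index = Cast( Integer, core )
    SetThreadAffinityMask( GetCurrentThread(), 1 Shl core_index )

    Print "worker thread: requested core"; core_index
    Sleep 500, 1
End Sub

Dim As Any Ptr r  = ThreadCreate( @render_thread )
Dim As Any Ptr w1 = ThreadCreate( @worker_thread, Cast( Any Ptr, 1 ) )
Dim As Any Ptr w2 = ThreadCreate( @worker_thread, Cast( Any Ptr, 2 ) )

ThreadWait( r )
ThreadWait( w1 )
ThreadWait( w2 )
Sleep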
Gonzo
Posts: 722
Joined: Dec 11, 2005 22:46

Re: Ramblings of crazed multi-threaders

Post by Gonzo »

i see, i've been tampering with process affinity back in the day, but i wasn't aware of any thread-specific ones
i'll look into making some platform-independent thing to try this out with
with 8 work threads there's not much point in doing this, i'm assuming.. unless the work threads aren't distributed well :S
man, i wish multithreading was more developed, considering how hard it can be
personally i use 6 work threads to complete the circle - render + physics + 6x work
so, i could benefit from having the threads well distributed

just today i was finishing up some joystick code, which had some unusual side-effects for the crouching mechanic
apparently on some random frame every few seconds or more, it would act as if the player wasn't crouching
which immediately led me to believe i had some state-machine issue with threading.. i didn't
it turns out my window library (GLFW) is CLEARING the joystick-button array results before filling them up again!
glfwGetJoystickPos(GLFW_JOYSTICK_1, @keyconf.axis(0), 5) '' requesting 5 button states
not a big deal, but very surprising :)
multithreading is a wonderful thing, but sometimes i'm just stumped on how many ways it can go wrong
1000101
Posts: 2556
Joined: Jun 13, 2005 23:14
Location: SK, Canada

Re: Ramblings of crazed multi-threaders

Post by 1000101 »

heh, I understand your problems very well. I'm developing some commercial software and I'm having some TLS (thread-local storage) issues. Something is corrupting my TLS indexes with a bad memory write, and I can't find out where because the memory write is within the TLS itself (a chicken-and-egg problem).
Gonzo
Posts: 722
Joined: Dec 11, 2005 22:46

Re: Ramblings of crazed multi-threaders

Post by Gonzo »

yes, some problems have taken me a very long time to find.. i'm extremely glad i don't have any such issues at present
but it's hard to be 100% certain
at the very least, my client is running well for extended amounts of time
but there was a time when i would spend days debugging things that "shouldn't be possible" =)
in the end, they were all resolved, and turned out to be very logical
those cases were only related to sharing data between 2 threads though, but now i'm nearing a point where i might have to start making some thread scheduling for other tasks.. i'm sure i will be on the receiving end for quite a while, in gdb =)
i'm confident i can make it work, but my biggest problem has always been keeping the data separated from other aspects of the program
read-only situations would have been a godsend, but i don't have any of those =)

also, for anyone reading this and contemplating threading:
don't create and destroy threads! it's extremely expensive and time-consuming for the OS (at least on windows)!
if you've looked at other programs with Process Explorer, you'll notice they always have sleeping threads, and that's the proper way to do it! (even though in many cases, some apps are just loitering in the os scheduler)
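
for illustration, a minimal sketch of such a sleeping worker: it is created once and then waits on a condition variable between jobs instead of being created and destroyed each time (names are made up, this isn't engine code):

Code: Select all

'' One persistent worker that sleeps on a condition variable until the main
'' thread signals that there is work (or that it is time to quit).

Dim Shared As Any Ptr  job_lock, job_cond
Dim Shared As Integer  job_pending          '' non-zero when there is work to do
Dim Shared As Integer  quitting

Sub worker( ByVal unused As Any Ptr )
    MutexLock( job_lock )
    Do
        '' CondWait atomically releases the mutex while sleeping and
        '' re-acquires it on wake-up
        While ( job_pending = 0 ) And ( quitting = 0 )
            CondWait( job_cond, job_lock )
        Wend
        If quitting Then Exit Do

        job_pending = 0
        MutexUnlock( job_lock )
        Print "worker: doing one job"       '' the actual work happens unlocked
        Sleep 100, 1
        MutexLock( job_lock )
    Loop
    MutexUnlock( job_lock )
End Sub

job_lock = MutexCreate()
job_cond = CondCreate()
Dim As Any Ptr w = ThreadCreate( @worker )  '' created once, reused many times

For i As Integer = 1 To 5
    MutexLock( job_lock )
    job_pending = 1
    CondSignal( job_cond )                  '' wake the sleeping worker
    MutexUnlock( job_lock )
    Sleep 300, 1
Next

MutexLock( job_lock )
quitting = 1
CondSignal( job_cond )
MutexUnlock( job_lock )

ThreadWait( w )
CondDestroy( job_cond )
MutexDestroy( job_lock )
Sleep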
Gonzo
Posts: 722
Joined: Dec 11, 2005 22:46

Re: Ramblings of crazed multi-threaders

Post by Gonzo »

the joy of being stuck in gdb :)
trying to avoid having to finish the work queue, so that the program can schedule the work threads and just continue as normal
there are so many dependencies here that i would be lying if i said this thing isn't a compile-and-pray situation :)

edit: which is now resolved, so to speak
it turns out i just can't do it this way, for now.. some other features of my engine move things around when the player moves too far (and the world has to transition, aka do a seamless roll-over of terrain)
but, just like with anything else, there's always a solution, it just has to happen more gradually as i refactor code

do you have any wisdom to share regarding conditionals, and how many cpu cycles it takes before memory is visible to other threads?
i'm trying to make the thread scheduler more effective, but i think i'm missing cycles because it takes time to make shared memory visible to other threads (and for all i know, that's just the way it has to be)
1000101
Posts: 2556
Joined: Jun 13, 2005 23:14
Location: SK, Canada

Re: Ramblings of crazed multi-threaders

Post by 1000101 »

Personally I wouldn't bother trying to interfere with the scheduler, but I might give it hints (make the renderer lower priority than the physics, etc).

As to how long it takes a memory write to propagate between the issuing instruction and the physical memory update: as far as you are concerned, it is visible as soon as it is written. SMP systems handle cache coherency behind your back for you. Ideally the scheduler will not schedule reader threads which access the same memory as a writer thread at the same time, but will stagger them. Of course, on a single-core system this is all irrelevant since only one thread executes at a time, but with lockless code on SMP there may be cache delays while a write to a memory address made by one core is propagated into the caches of the other cores which are using the same memory. This cache thrash is minimized by the shared L2 cache, but there is a real-world delay. That's ok, because the CCC (cache coherency controller) will force cores into an idle state until the cache update is complete. The speed of this depends on the architecture used for the memory, cache, CPU and communication busses.

As to handling transitions of the game environment, I tend to use a multi-window approach (not to be confused with a GUI window, but a memory window into the larger data). I maintain 9 windows (-1 to 1, -1 to 1) of world information. The main window (0,0) is based on the player's world position. The surrounding windows are there as a fast way to transition through the world. The assumption is that once the player leaves window (0,0), the window they move into becomes (0,0) and all loaded windows are shifted accordingly. Then a request is made to the data loader to load the 3 to 5 new windows required to complete the surrounding buffer, while the old windows which have been moved away from are unloaded. The balancing act then becomes "how large do I make a given window so the player doesn't out-run the data load thread?" This is based on nothing more than a "best guess" constant, and I will increase the number of windows to increase the buffer zone.
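
A minimal sketch of that window shifting is below (made-up names; in the real system the load requests go to the data-loader thread, here the "load" is just a Print).

Code: Select all

'' 3x3 grid of world windows; worldwin(0,0) always follows the player.

Type world_window_t
    wx     As Integer           '' which world cell this window holds
    wy     As Integer
    loaded As Integer
End Type

Dim Shared As world_window_t worldwin( -1 To 1, -1 To 1 )

Sub request_load( ByRef w As world_window_t, ByVal wx As Integer, ByVal wy As Integer )
    '' the real engine queues this to the data-loader thread instead
    w.wx = wx : w.wy = wy : w.loaded = -1
    Print "loading window"; wx; ","; wy
End Sub

'' Call when the player crosses into the cell at (dx, dy) relative to (0,0):
'' shift every window against the movement and request the missing edge.
Sub shift_windows( ByVal dx As Integer, ByVal dy As Integer )
    '' the new centre cell is the old centre plus the movement
    Dim As Integer cx = worldwin( 0, 0 ).wx + dx
    Dim As Integer cy = worldwin( 0, 0 ).wy + dy

    Dim As world_window_t shifted( -1 To 1, -1 To 1 )
    For y As Integer = -1 To 1
        For x As Integer = -1 To 1
            Dim As Integer sx = x + dx, sy = y + dy
            If ( Abs( sx ) <= 1 ) And ( Abs( sy ) <= 1 ) Then
                shifted( x, y ) = worldwin( sx, sy )    '' reuse what we have
            Else
                shifted( x, y ).loaded = 0              '' fell off the edge
            End If
        Next
    Next

    For y As Integer = -1 To 1
        For x As Integer = -1 To 1
            worldwin( x, y ) = shifted( x, y )
            If worldwin( x, y ).loaded = 0 Then
                request_load( worldwin( x, y ), cx + x, cy + y )
            End If
        Next
    Next
End Sub

'' Initial fill around world cell (10,10), then the player steps one cell east.
For y As Integer = -1 To 1
    For x As Integer = -1 To 1
        request_load( worldwin( x, y ), 10 + x, 10 + y )
    Next
Next
shift_windows( 1, 0 )       '' requests only the 3 windows of the new column
Sleep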
Gonzo
Posts: 722
Joined: Dec 11, 2005 22:46

Re: Ramblings of crazed multi-threaders

Post by Gonzo »

thanks for the clarifications.. i must have been reading about something that was only relevant to physically separate CPUs
yes, my world is split into sectors in 3D (because there's a huge amount of data)
i transition just like you would imagine i do, it all just takes more time, simply due to data size (which i can't lower)
this was the reason i started with multithreading 2 years ago.. it was necessary to keep things smooth
it was like standing in fire for a few months =)

my only remaining issue now is dealing with rendering position offsets
but i'm working that out now, even though it's an ultra-rare condition when transitioning..
my setup should be watertight.. i have rendering offsets for the mesh on the physics side
i have a copy of these values on the renderer side, and i compare it to the renderer's own world offsets
that way, the offsets should be disconnected from any transition that happens on the physics side while rendering, but sometimes (ultra-rarely) they're not
to solve it i'd have to completely understand the reason behind it happening, because i can't confirm it without moving around for extended amounts of time (and even then...) :P

edit: managed to increase work thread efficiency by increasing the number of work buffers and looking for finished threads without locking mutexes (which would have forced them to finish)
that way, early-finishing threads get new payloads without interfering with the rest.. works well so far
the biggest hurdle was that the work buffers must be consumed in order, but that solved itself by forcing the receiving end to never iterate past something it knows is inbound but not finished yet
i'm sure it can be more efficient, but this is a lot better than it used to be.. i still have really low cpu usage on average, for various reasons
i'm not managing to get the work threads to max out each core, and i'm not really sure why
i just profiled, and it's 30% average CPU usage! that's.. not very effective
everything is lightning fast, so i shouldn't complain.. hehe
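
a minimal sketch of that in-order rule, in case it helps anyone (made-up names; i do it without locking the workers' mutexes, the mutex here just keeps the sketch short - the point is only that the receiver stops at the first slot that is inbound but not finished):

Code: Select all

'' Work slots can be finished out of order, but the receiving end consumes
'' them strictly in order and stops at the first slot still in flight.

Const SLOT_EMPTY = 0, SLOT_PENDING = 1, SLOT_DONE = 2

Type work_slot_t
    state  As Integer
    result As Integer
End Type

Dim Shared As work_slot_t slots( 0 To 15 )
Dim Shared As Any Ptr     slot_lock
Dim Shared As Integer     read_pos          '' next slot the receiver will take

Sub drain_finished( )
    MutexLock( slot_lock )
    While slots( read_pos And 15 ).state = SLOT_DONE
        Print "consumed result"; slots( read_pos And 15 ).result
        slots( read_pos And 15 ).state = SLOT_EMPTY
        read_pos += 1
    Wend
    MutexUnlock( slot_lock )
End Sub

slot_lock = MutexCreate()

'' pretend the workers finished slots 0 and 2 while slot 1 is still in flight
slots( 0 ).state = SLOT_DONE    : slots( 0 ).result = 100
slots( 1 ).state = SLOT_PENDING
slots( 2 ).state = SLOT_DONE    : slots( 2 ).result = 300

drain_finished()    '' consumes slot 0 only, then stops at the pending slot 1

slots( 1 ).state = SLOT_DONE    : slots( 1 ).result = 200
drain_finished()    '' now consumes slots 1 and 2, in order

MutexDestroy( slot_lock )
Sleep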
h4tt3n
Posts: 698
Joined: Oct 22, 2005 21:12
Location: Denmark

Re: Ramblings of crazed multi-threaders

Post by h4tt3n »

Hello folks,

I find this very interesting. My programs are getting so big and CPU-demanding that I cannot ignore the benefits of multithreading any more. However, 1000101, I'm not sure what to make of your code samples. The test_thread_pool.exe becomes unresponsive and has to be stopped with the task manager? Also, the MSDN articles are 404'ed. What's the next step from here?

Cheers,
Mike
Gonzo
Posts: 722
Joined: Dec 11, 2005 22:46

Re: Ramblings of crazed multi-threaders

Post by Gonzo »

then help us by lobbying for freebasic to become a more parallelizable language
there is absolutely no future in the single-threaded line of thought

anyways, if you have any questions feel free to ask.. i'm sure both of us can accommodate you
if nothing else, i can help you find mistakes and improvements (i'm not an expert on the theory)
h4tt3n
Posts: 698
Joined: Oct 22, 2005 21:12
Location: Denmark

Re: Ramblings of crazed multi-threaders

Post by h4tt3n »

Cool, I'm doing my first experiments with the built-in thread and mutex commands in FB. I guess it's hard to ask meaningful questions before you have at least a fundamental understanding of what's going on. So, I'll mess around a bit and then come crying back :-)

Cheers,
Mike
integer
Posts: 408
Joined: Feb 01, 2007 16:54
Location: usa

Re: Ramblings of crazed multi-threaders

Post by integer »

A couple of weeks ago Intel estimated that a 48-core processor would be available in the 2016-to-2018 time frame.
When Intel cancelled the 80-core project, I presumed that 6-to-8 cores would be the limit for microprocessors.

With 48 cores available on future iPhones, learning/tinkering time has arrived.
Gonzo
Posts: 722
Joined: Dec 11, 2005 22:46

Re: Ramblings of crazed multi-threaders

Post by Gonzo »

h4tt3n wrote:Cool, I'm doing my first experiments with the built-in thread and mutex commands in FB. I guess it's hard to ask meaningful questions before you have at least a fundamental understanding of what's going on. So, I'll mess around a bit and then come crying back :-)

Cheers,
Mike
ok, it's good to learn the basics
but to make multi-threaded programs, you need to think in parallel code,
so your effort should be:
not solving a problem by reusing and modifying your existing code
instead you should solve the problem by making the task _inherently_ parallel
this is a very complicated issue, even with basic tasks

thankfully, there's always a failsafe option:
investigate and find all the data you need, make it unique, and use it to schedule a parallel running task
making the data unique means making it impossible for any other part of your program to modify it as long as a parallel task is running (and using the data)
for example, by making a type working_dataset_t with all the data separated from other parts of the program
or by running a huge number of parallel threads on the dataset, and only continuing with the program when it's all done
there is no best choice or final solution to anything

for example, a GPU shader will execute on 16x16 samples (sub-pixels) at one time,
so if your fragment (sample) shader is very complicated and time-intensive, it will drag down your fps greatly
however, there's a stage before that called the vertex shader, in which all the points that define your primitive are set up
the more calculations you can do in that stage for use in the sampling stage, the more it will help performance overall
note that this is only an example with made-up values (but i think they're accurate for older GPUs)

the best practices are always available by reading about how GPUs work, and especially about how to parallelize problems in OpenCL
i do this to get ideas on how to improve my programs

there are of course many factors at play here, most of all the cost (overhead) of running threads
if your problem is small, the gain from running it in parallel might not make up for the cost of running worker threads
but the rule of thumb is, if you have a high number of iterations on something and you would like to divide and conquer it,
you can always make it parallelizable to some degree, and run the problem on smaller sets of data, which can later be merged into the final solution
it's not always the case, but i haven't encountered any such unparallelizable problems yet myself
h4tt3n
Posts: 698
Joined: Oct 22, 2005 21:12
Location: Denmark

Re: Ramblings of crazed multi-threaders

Post by h4tt3n »

Gonzo wrote:there are of course many factors at play here, most of all the cost (overhead) of running threads
if your problem is small, the gain from running it in parallel might not make up for the cost of running worker threads
but the rule of thumb is, if you have a high number of iterations on something and you would like to divide and conquer it,
you can always make it parallelizable to some degree, and run the problem on smaller sets of data, which can later be merged into the final solution
it's not always the case, but i haven't encountered any such unparallelizable problems yet myself
Right now my goal is to improve speed on tasks that involve huge numbers of objects that somehow interact with each other: polygon collision detection, smoothed particle hydrodynamics, n-body gravity simulations, that sort of thing.
Am I understanding this right, that if you have a truckload of interacting particles, you can split them up into - say - 4 chunks, one for each processor, and let 4 identical threads eat their way through each chunk in parallel?
Gonzo
Posts: 722
Joined: Dec 11, 2005 22:46

Re: Ramblings of crazed multi-threaders

Post by Gonzo »

yes, you can see an overview of the options you have here:
http://www.cse.illinois.edu/courses/cs5 ... rticle.pdf

basically, your best parallel algorithm is O(n) with a substantial constant
far-away particles in a block/chunk will exert a small amount of force that is almost monodirectional
because of this, far-away blocks of particles can exert 1 combined force against each individual local particle

if you have an intel quad core (4 cores), you'll have 8 "hyperthreads"
the effective number of threads may be somewhere between 4 and 8, most likely 4
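
for illustration, a minimal sketch of the chunk splitting h4tt3n asked about (made-up names): one particle array is split into 4 ranges, identical threads step through their own range in parallel, and the main thread joins them afterwards. for an interacting system each chunk would still read the whole array but only ever write to its own range:

Code: Select all

'' Split one particle array into NUM_CHUNKS ranges and let identical worker
'' threads process each range in parallel, then join before using the result.

Const NUM_PARTICLES = 100000
Const NUM_CHUNKS    = 4

Type particle_t
    x  As Single
    vx As Single
End Type

Type chunk_t
    first As Integer
    last  As Integer
End Type

Dim Shared As particle_t particles( 0 To NUM_PARTICLES - 1 )

'' Each worker only writes to its own index range, so no locking is needed
'' during the pass; the only synchronization is the ThreadWait join.
Sub integrate_chunk( ByVal arg As Any Ptr )
    Dim As chunk_t Ptr c = arg
    For i As Integer = c->first To c->last
        particles( i ).vx += 0.1            '' stand-in for the real force step
        particles( i ).x  += particles( i ).vx
    Next
End Sub

Dim As chunk_t chunks( 0 To NUM_CHUNKS - 1 )
Dim As Any Ptr workers( 0 To NUM_CHUNKS - 1 )
Dim As Integer per_chunk = NUM_PARTICLES \ NUM_CHUNKS

For c As Integer = 0 To NUM_CHUNKS - 1
    chunks( c ).first = c * per_chunk
    chunks( c ).last  = ( c + 1 ) * per_chunk - 1
    If c = NUM_CHUNKS - 1 Then chunks( c ).last = NUM_PARTICLES - 1  '' remainder
    workers( c ) = ThreadCreate( @integrate_chunk, @chunks( c ) )
Next

For c As Integer = 0 To NUM_CHUNKS - 1
    ThreadWait( workers( c ) )              '' join: all chunks are finished
Next

Print "particle 0 is now at x ="; particles( 0 ).x
Sleep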