Internal Multithreading
Easily harness multicore processors to speed up your simulation.
1 | Getting Started
BEPUphysics supports the usage of an arbitrary number of threads. Modern PCs with multicore processors and the Xbox360 get a significant performance boost in over the sequential alternative.
As far as the engine is concerned, getting multithreading running is a very easy one-step process: pass an IParallelLooper into the Space constructor and it works.
1.A | The Parallel Looper
BEPUphysics uses threads through the IParallelLooper passed into the Space constructor. Every threaded job in the engine boils down to a simple for loop, so any implementation which can support threaded loops can be plugged in easily. The most common implementation of this interface is the ParallelLooper.
The ParallelLooper, specifically, offers an easy “AddThread” method that can be called to tell the engine to use another thread. As a user, no management of a Thread object is necessary. The thread is created and handled internally. When adding a thread, an initializing delegate can be specified. Usually, this is used on the Xbox360 to call Thread.CurrentThread.SetProcessorAffinity; more information about this can be found in section 1.D.
1.B | Work Distribution
Each stage of the engine must wait for all threads to complete their task before continuing, so spreading and balancing the load is beneficial. Ideally, the ThreadManager will be given the same number of threads as the computer has free processors.
A Space will only use the threads of the ThreadManager while inside its update method. This means, after the update method, all of the processor power is free to do other things. In an otherwise sequential game, the engine can afford to use every processor. If the space is being updated asynchronously while some other components need processor power, care must be taken to ensure that bottlenecks are not being created.
1.C | Windows Threading
The thread scheduler on Windows can effectively determine which processor should work on a thread. The user can check how many processors/hardware threads are available by using the .NET framework's System.Environment.ProcessorCount property. Adding as many threads as processors exist generally works well, assuming there is more than one processor.
The PC implementation of the IThreadManager is also capable of automatically balancing parallel for loops, helping it play better with other processes and CPU loads.
1.D | Xbox360 Threading
On the Xbox360, threads should be given a particular location to run. Out of the Xbox's six available hardware threads, 0 and 2 are reserved by the XNA framework, leaving hardware threads 1, 3, 4 and 5 for other work. The thread affinity of a thread being added to the ThreadManager can be specified by giving the ThreadManager's AddThread method a delegate which sets the affinity using Thread.CurrentThread.SetProcessorAffinity.
There are two hardware threads per core on the Xbox, and both hardware threads 4 and 5 are on the same core. As of v0.16.0, the Xbox360 uses the same thread manager as the PC, so it balances loads decently.
Using four hardware threads puts two threads on the third physical core, requiring additional load balancing. However, it can still benefit some simulations. As of v1.2.0, the BEPUphysicsDemos default to using 4 hardware threads.
2 | Extensible Types and Threading
Updateables and SolverUpdateables both interact with threading to provide options for speeding up custom types.
2.A | Updateable Threading
Updateables are a method for extending BEPUphysics. Each updateable has multiple stages which are called automatically at various points in the execution of the space's update. Updateables can work in three different ways with respect to threading:
Updateable does make use of worker threads and executes sequentially
Updateable does not make use of worker threads and executes sequentially
Updateable does not make use of worker threads and executes in parallel with other Updateables
The default style is to use no threads and execute sequentially for simplicity. The style can be changed by setting an Updateable's IsUpdatedSequentially property. When an Updateable is updated sequentially, no other Updateables or BEPUphysics-related tasks are running at the same time and all of the threads of the ThreadManager are usable. This is helpful for speeding up a large, performance-critical Updateable.
If IsUpdatedSequentially is false, the Updateable will run side-by-side with other Updateables that also have IsUpdatedSequentially set to false. The ThreadManager should not be used by these Updateables as the ThreadManager's threads are executing the Updateables themselves. Since these Updateables are running in parallel, they must operate in a thread-safe way.
2.B | SolverUpdateable Threading
SolverUpdateables are used in the velocity solver to manage velocity-based interactions, like collisions. They have three important methods: Update, ExclusiveUpdate, and SolveIteration. The Update method performs any per-frame configuration calculations that do not need exclusive access to involved objects. The ExclusiveUpdate method does any required pre-iterations work, like warm starting, that needs exclusive access to the involved objects. The SolveIteration method converges to the desired solution by being called repeatedly, each time with exclusive access to the involved objects.
To obtain exclusive access to the involved objects, a lock is acquired on each involved object's locker object before entering either the ExclusiveUpdate or SolveIteration method. Upon completing the method, the locks are released.
3 | Miscellaneous Details
There are a variety of other details to take into account when delving deeply into the multithreading system.
3.A | Determinism
The default multithreading system is not deterministic. If the same simulation is run multiple times with exactly the same starting conditions, the result can be different. The longer the simulations run, the more they will diverge.
For example, when creating a networking system that handles physics, the engine should not be relied upon to generate consistent results on the various clients even if the input parameters are all identical. Periodic synchronizations are required to keep everything running consistently.
The sources of this nondeterminism are the multithreaded DynamicHierarchy broad phase, the non-batched multithreaded solver, and the multithreaded narrow phase. The order of pairs generated by the DynamicHiearchy is unpredictable, leading to different solving orders and slightly different results. The narrow phase creates constraints to be added to the solver based on the order that the pair updates, which is not deterministic with multithreading on. Similarly, the solving order of the non-batched solver is based on how fast threads can work through the list and will produce slightly different results.
Every multithreadable processing stage like the BroadPhase, NarrowPhase, and Solver has an AllowMultithreading field. Setting this to false will force the stage to use the sequential (deterministic) version.
v0.14.3 and before had a batched mode solver which could be enabled by setting the space's UseBatchedMultithreadingSolver to true. Using the batched solver incurred a significant overhead on some simulations, though it can technically scale to manycore systems better than the default solver in some large simulations. The batched solver may make a return in a future version if it becomes beneficial.
In general, it is recommended that the normal, fully multithreaded systems are used and that determinism should not be designed to be a requirement for a simulation.
3.B | IThreadManager Types
The engine provides SimpleThreadManager, ThreadTaskManager,, and SpecializedThreadManager. As of v0.16.0, all platforms default to using the SpecializedThreadManager. The SimpleThreadManager is a reference implementation from v0.10.0.
Custom IThreadManager implementations can be created and used. The Windows version of the library provides a TPL-based thread manager, showing how a custom thread manager can be constructed. (The ThreadManagerTPL typically performs slightly slower
than the SpecializedThreadManager.)