Difference between revisions of "Proposal Threading and Acceleration"
Revision as of 19:11, 14 October 2014
|Proposal pages help us get from feature requests into actual plans. This proposal page is about Threading and Acceleration.|
Proposal pages are used on an ongoing basis by the Audacity development team and are open to edits from visitors to the wiki. They are a good way to get community feedback on a proposal.
- Note: Proposals for Google Summer of Code projects are significantly different in structure, are submitted via Google's web app and may or may not have a corresponding proposal page.
- 1 Proposed Feature
- 2 Developer/QA/Programmer backing
- 3 Common Terms and Definitions
- 4 Approaches
- 4.1 Simple Self Contained
- 4.2 Library for Self Contained (Globally Supported Locally Implemented)
- 4.3 Central Co-ordination (Fully Threaded Operator Aware)
- 4.4 DSL based Multi-Target
Proposed Feature
Most if not all modern computers have multiple cores, GPUs, and other accelerators, all capable of executing code. To make use of this processing power, programs must implement varying levels of gates and sandboxes. Depending on the algorithm, splitting the work into pieces and controlling their flow can range from trivial to Rube Goldberg-esque. Accomplishing this may require both support libraries and strict rules.
This proposal is about laying out these operations and their placement. Four options are under discussion. Audacity already has special cases of the 'simple self contained' model; the proposal is to develop more complex models.
Developer/QA/Programmer backing
- Andrew Hallendorff
- James Crook
Common Terms and Definitions
Definitions to help understanding. Edits welcome.
|Thread||An independent sequence of execution within a program, together with the environment (stack and register state) it needs to run.|
|Thread Safe||A way of arranging resources and controls so that multiple threads can operate in the same environment without corrupting shared state.|
|Semaphore||A gate that controls how many threads may pass through at the same time.|
|Mutex||(aka: Mutual Exclusion) Maintains coherency between threads by preventing parallel operations on the same resource.|
|Producer Consumer||The relationship where one part of a program sets up work for another part to execute. Often one to many or many to many.|
|Race Condition||When the result depends on the relative timing of more than one part of a program, for example two threads updating the same data without coordination.|
|Deadlock||When two or more threads each own a resource that another is waiting for. This usually stops execution.|
|DSL||Domain-specific language. The counterpart of a general-purpose language: a language designed for one kind of problem. When targeting particular platforms it is appropriate to write code that complements the target platform, and a DSL allows one description of an algorithm to be translated for each target.|
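To make the Mutex and Race Condition entries concrete, here is a minimal C++ sketch. The function name `countWithMutex` is hypothetical and is not part of Audacity. Without the lock, the increments from different threads could interleave and lose updates (a race condition); with the lock, every increment is exclusive.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threads` workers adds 1 to a shared counter
// `perThread` times. The mutex makes each increment exclusive, so no update
// is lost.
long countWithMutex(int threads, int perThread) {
    long counter = 0;
    std::mutex gate;  // guards `counter`
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < perThread; ++i) {
                std::lock_guard<std::mutex> lock(gate);  // one thread at a time
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();  // wait for every worker to finish
    return counter;
}
```

Calling `countWithMutex(4, 10000)` should always return 40000; removing the `lock_guard` line would make the result unpredictable.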
Approaches
- Simple Self Contained
- Library for Self Contained
- Central co-ordination
- DSL based Multi-Target
Simple Self Contained
One part of the program creates its own threads, breaks up the job, and controls execution.
- Has little effect on other parts of the code.
- Local control and containment allows for fine tuning.
- Must implement all levels of code.
- Has no thread pool and must create threads on the fly.
- Is unaware of other processes, so it could compete with other parts of the program.
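As a rough illustration of this model, the sketch below has one routine create its own threads, break the job into chunks, and wait for them to finish. The function `applyGainThreaded` and the gain job itself are assumptions chosen for brevity, not existing Audacity code.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example job: scale every sample by `gain`, split across
// `threadCount` threads that this routine creates and joins itself.
void applyGainThreaded(std::vector<float>& samples, float gain,
                       unsigned threadCount) {
    threadCount = std::max(1u, threadCount);
    // Round up so that every sample falls into some chunk.
    const std::size_t chunk = (samples.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;  // created on the fly: no thread pool
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(samples.size(), t * chunk);
        const std::size_t end = std::min(samples.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&samples, gain, begin, end] {
            for (std::size_t i = begin; i < end; ++i) samples[i] *= gain;
        });
    }
    for (auto& w : workers) w.join();  // local control of completion
}
```

Because the chunks do not overlap, no mutex is needed here. The cost listed above is visible too: threads are created and destroyed on every call, and nothing stops another module from doing the same at the same time.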
Case Study (Andrew): The EQ Effect with SSE and Threading
Towards the end of 2013 I jumped into Audacity to learn about audio programming. Studying the large and computationally intense EQ routine, I ended up building a parallel code path that used SSE instructions. While this is not the level of parallelism that threading provides, the way I implemented it had the effect of arranging the data in discrete intervals. Once I worked out the bugs in out-of-order processing of FFT convolution, it seemed natural to extend this relationship to more than 4 parallel intervals. (A 128-bit SSE float instruction operates on 4 operands at once.) Speed improvements were maybe 1.1-1.4x at most, and much less against a vectorizing compiler.
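The "4 operands at once" point can be sketched as follows. This is not the code in Equalization48x.cpp; it is a minimal element-wise multiply, with the hypothetical name `multiplyBuffers`, that uses SSE intrinsics when the compiler defines `__SSE__` and otherwise falls back to the scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps, ...
#endif

// Hypothetical helper: out[i] = a[i] * b[i] for i in [0, n).
void multiplyBuffers(const float* a, const float* b, float* out,
                     std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    // Each iteration multiplies 4 floats with one 128-bit instruction.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
#endif
    // Scalar tail (or the whole buffer when SSE is unavailable).
    for (; i < n; ++i) out[i] = a[i] * b[i];
}
```

A vectorizing compiler may well generate similar instructions from the scalar loop alone, which is consistent with the small gain reported above.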
I ended up setting up a producer-consumer relationship. The main thread spawns a series of child threads, each with its own scratch buffers. It then creates and fills a series of containers that hold all the information necessary for the child threads to execute one interval. All flow control was done with a single mutex gating access to this list of containers. The main thread then continues to fill and retire these containers while the child threads process them. On a typical multi-core processor improvements were on the order of 1.5-5x.
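The arrangement described above might look roughly like the following sketch. All names (`Interval`, `WorkList`, `processInterval`, `processAll`) are hypothetical, the per-interval work is a stand-in that doubles each sample, and the retiring of finished containers by the main thread is left out for brevity. A single mutex gates the list, as in the description; idle workers simply yield and retry.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical container: everything a worker needs to process one interval.
struct Interval {
    float* data;
    std::size_t length;
};

struct WorkList {
    std::mutex gate;             // the single mutex gating the list
    std::deque<Interval> ready;  // filled by the main thread (producer)
    bool finished = false;       // set once no more intervals will be added
};

// Stand-in for the real per-interval processing: doubles each sample.
void processInterval(const Interval& job, std::vector<float>& scratch) {
    scratch.assign(job.data, job.data + job.length);
    for (std::size_t i = 0; i < job.length; ++i) job.data[i] = scratch[i] * 2.0f;
}

void worker(WorkList& list) {
    std::vector<float> scratch;  // each child thread owns its scratch buffer
    for (;;) {
        Interval job{nullptr, 0};
        bool haveJob = false;
        {
            std::lock_guard<std::mutex> lock(list.gate);
            if (!list.ready.empty()) {
                job = list.ready.front();
                list.ready.pop_front();
                haveJob = true;
            } else if (list.finished) {
                return;  // list drained and the producer is done
            }
        }
        if (haveJob) processInterval(job, scratch);  // work outside the lock
        else std::this_thread::yield();              // nothing ready yet
    }
}

void processAll(std::vector<float>& samples, std::size_t intervalLength,
                unsigned workers) {
    if (intervalLength == 0) return;
    workers = std::max(1u, workers);
    WorkList list;
    std::vector<std::thread> children;
    for (unsigned t = 0; t < workers; ++t)
        children.emplace_back(worker, std::ref(list));
    // The main thread fills containers while the children consume them.
    for (std::size_t pos = 0; pos < samples.size(); pos += intervalLength) {
        const std::size_t len = std::min(intervalLength, samples.size() - pos);
        std::lock_guard<std::mutex> lock(list.gate);
        list.ready.push_back(Interval{samples.data() + pos, len});
    }
    {
        std::lock_guard<std::mutex> lock(list.gate);
        list.finished = true;
    }
    for (auto& c : children) c.join();
}
```

Yielding in a loop keeps the sketch to one mutex but wastes CPU while the list is empty; a condition variable would be the usual refinement.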
Files: Equalization48x.cpp, Equalization48x.h, RealFFTf48x.cpp, RealFFTf48x.h, plus a small amount of code in Equalization.cpp and Equalization.h.
Library for Self Contained (Globally Supported Locally Implemented)
A series of routines that provide tools and thread safe operators for modules to implement simple threading. Having a central location would allow for inter-process awareness but flow control would still mostly be the responsibility of the module.
- Shared routines simplify deployment.
- Central location could be used to limit/grant access to resources.
- Only alleviates simple operations.
- Limited flow control.
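One way to picture such a library is a shared helper plus a central thread budget, sketched below. The names `ThreadBudget` and `parallelFor` are hypothetical. The budget is the "central location" that limits access to threads; the module still decides how to split its own work.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared library piece: a central budget of helper threads.
class ThreadBudget {
public:
    explicit ThreadBudget(unsigned total) : available_(total) {}
    // Grants up to `wanted` threads; may grant 0 if the budget is used up.
    unsigned acquire(unsigned wanted) {
        std::lock_guard<std::mutex> lock(gate_);
        const unsigned granted = std::min(wanted, available_);
        available_ -= granted;
        return granted;
    }
    void release(unsigned count) {
        std::lock_guard<std::mutex> lock(gate_);
        available_ += count;
    }
private:
    std::mutex gate_;
    unsigned available_;
};

// Runs body(begin, end) over [0, count), using the calling thread plus
// however many helper threads the budget grants.
void parallelFor(ThreadBudget& budget, std::size_t count, unsigned wanted,
                 const std::function<void(std::size_t, std::size_t)>& body) {
    const unsigned helpers = budget.acquire(wanted);
    const unsigned parts = helpers + 1;  // the caller also does one part
    const std::size_t chunk = (count + parts - 1) / parts;
    std::vector<std::thread> threads;
    for (unsigned p = 1; p < parts; ++p) {
        const std::size_t begin = std::min(count, p * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin < end) threads.emplace_back(body, begin, end);
    }
    if (count > 0) body(0, std::min(count, chunk));  // caller's own part
    for (auto& t : threads) t.join();
    budget.release(helpers);  // hand the threads back to the budget
}
```

A module would call `parallelFor(budget, n, 4, ...)` and get anywhere from zero to four helpers depending on what other modules are using. This matches the limits listed above: the helper only covers simple loops, and there is no coordination between jobs beyond the thread count.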
Central Co-ordination (Fully Threaded Operator Aware)
Operations are organized at a central location and partitioned out to modules that comply with thread safe rules.
- Centralized flow control.
- Simplified deployment to modules.
- Pipelined operations.
- Non-threaded routines can still execute concurrently (to some extent).
- Large footprint.
- Design time.
DSL based Multi-Target
The algorithm is written in a custom DSL with rules for generation. In particular, the same algorithm can target C, SSE, or GPU, single-threaded or multi-threaded. Depending on the support implemented, it can hook into a central scheduling routine or be completely standalone.
- Can test algorithm in isolation from threading method.
- Can much more readily experiment with variations on the algorithm.
- Can much more readily experiment with variations on deployment.
- Very challenging to create this.
- Still needs code for handling different cases, so possibly no saving in effort.
- Generated code, if done simply, may not have helpful comments and may be verbose and hard to follow.
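To give a feel for the idea, the toy sketch below describes an algorithm once as an expression tree and then uses it in two ways: evaluated directly, and emitted as C source text. Everything here (`Expr`, `evaluate`, `emitC`, the builder functions) is hypothetical and far smaller than a real DSL; SSE or GPU targets would be further emitters over the same tree.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical mini-DSL: an expression over one input sample `x`.
struct Expr {
    enum Kind { Input, Constant, Add, Mul };
    Kind kind = Input;
    float value = 0.0f;              // used by Constant
    std::shared_ptr<Expr> lhs, rhs;  // used by Add and Mul
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr make(Expr::Kind kind, float value = 0.0f, ExprPtr lhs = nullptr,
             ExprPtr rhs = nullptr) {
    auto e = std::make_shared<Expr>();
    e->kind = kind;
    e->value = value;
    e->lhs = std::move(lhs);
    e->rhs = std::move(rhs);
    return e;
}
ExprPtr input() { return make(Expr::Input); }
ExprPtr constant(float v) { return make(Expr::Constant, v); }
ExprPtr add(ExprPtr a, ExprPtr b) { return make(Expr::Add, 0.0f, std::move(a), std::move(b)); }
ExprPtr mul(ExprPtr a, ExprPtr b) { return make(Expr::Mul, 0.0f, std::move(a), std::move(b)); }

// Target 1: evaluate the expression directly for one sample.
float evaluate(const ExprPtr& e, float x) {
    switch (e->kind) {
        case Expr::Input:    return x;
        case Expr::Constant: return e->value;
        case Expr::Add:      return evaluate(e->lhs, x) + evaluate(e->rhs, x);
        case Expr::Mul:      return evaluate(e->lhs, x) * evaluate(e->rhs, x);
    }
    return 0.0f;
}

// Target 2: emit the same expression as C source text.
std::string emitC(const ExprPtr& e) {
    switch (e->kind) {
        case Expr::Input:    return "x";
        case Expr::Constant: return std::to_string(e->value) + "f";
        case Expr::Add:      return "(" + emitC(e->lhs) + " + " + emitC(e->rhs) + ")";
        case Expr::Mul:      return "(" + emitC(e->lhs) + " * " + emitC(e->rhs) + ")";
    }
    return "";
}
```

For example, `add(mul(input(), constant(0.5f)), constant(1.0f))` evaluates to 3 for `x = 4` and emits `((x * 0.500000f) + 1.000000f)`. The emitted text also hints at the last drawback above: generated code is correct but not written for human readers.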