Proposal Threading and Acceleration
|Proposal pages help us get from feature requests into actual plans. This proposal page is about Threading and Acceleration.|
Proposal pages are used on an ongoing basis by the Audacity development team and are open to edits from visitors to the wiki. They are a good way to get community feedback on a proposal.
- Note: Proposals for Google Summer of Code projects are significantly different in structure, are submitted via Google's web app and may or may not have a corresponding proposal page.
- 1 Proposed Feature
- 2 Developer/QA/Guest Programmers backing
- 3 Common Terms and Definitions
- 4 Approaches
Most if not all modern computers have multiple cores, GPUs, and other accelerators all capable of executing code. To make use of all this processing power, programs must implement varying levels of gates and sandboxes. Depending on the algorithm, splitting up these pieces and controlling their flow can range from trivial to Rube Goldberg esque. Accomplishing this may require both support libraries and strict rules. This proposal is about laying out these operations and their placement.
Developer/QA/Guest Programmers backing
Common Terms and Definitions
(Note: These are just a few definitions to help understanding. Feel free to better my definitions)
- The defined environment necessary for one program entry to execute.
- A way of arranging the resources and controls such that multiple threads can exist in the same environment.
- A gate that controls how many threads pass through.
- (aka: Mutual Exclusion) Maintains coherency between threads preventing parallel operations on same resource.
- The relationship where one part of a program sets up the execution of another. Often 1 to many or many to many.
- When more than one part of a program depends on the time of execution.
- When two or more consumers create a situation where they own the resource the others want. This usually stops execution.
Simple Self Contained
One part of the program creates its own threads, breaks up the job, and controls execution.
- Has little affect on other parts of the code.
- Local control and containment allows for fine tuning.
- Must implement all levels of code.
- Has no thread pool and must create threads on the fly.
- Is unaware of other processes thus it could compete with from other parts of the program.
Case Study: The EQ Effect with SSE and Threading
Last fall I jumped into Audacity without looking. Studying the large and computationally intense EQ routine, I ended up building parallel code path that implemented SSE instructions. While not the level of parallelism of threading, the way I implemented it had the effect of arranging the data in discrete intervals. Once I worked out the bugs in out of order processing of FFT convolution, it seemed natural to extended this relationship to more than 4 parallel intervals. (An 128bit SSE float instruction operates on 4 operands at once) Speed improvements were maybe 1.1-1.4x max. Much less against a vectoring compiler.
I ended up setting up a producer consumer relationship. The main thread spawns a series of child threads each with their own scratch buffers. It then creates and fills a series of containers that hold all the information necessary for the child threads to execute one interval. All flow control was done with a single Mutex gating access to this list of containers. The main thread then continues to fill and retire these containers while the child threads process them. On a typical multi-core processor improvements were on the order of 1.5-5x.
Files: Equalization48x.cpp.h RealFFTf48x.cpp.h small amount code Equalization.cpp.h
Globally Supported Locally Implemented
A series of routines that provide tools and thread safe operators for modules to implement simple threading. Having a central location would allow for inter-process awareness but flow control would still mostly be the responsibility of the module.
- Shared routines simplify deployment.
- Central location could be used to limit/grant access to resources.
- Only alleviates simple operations.
- Limited flow control.
Fully Threaded Operator Aware
Operations are organized at a central location and partitioned out to modules that comply with thread safe rules.
- Centralized flow control.
- Simplified deployment to modules.
- Pipelined operations.
- Non-threaded routines can still execute concurrently. (to some extent)
- Large footprint
- Design time