Proposal Threading and Acceleration

From Audacity Wiki
Revision as of 23:55, 13 October 2014

Proposal pages help us get from feature requests into actual plans. This proposal page is about Threading and Acceleration.
Proposal pages are used on an ongoing basis by the Audacity development team and are open to edits from visitors to the wiki. They are a good way to get community feedback on a proposal.


  • Note: Proposals for Google Summer of Code projects are significantly different in structure, are submitted via Google's web app and may or may not have a corresponding proposal page.

Proposed Feature

Most if not all modern computers have multiple cores, GPUs, and other accelerators, all capable of executing code. To make use of all this processing power, programs must implement varying levels of gates and sandboxes. Depending on the algorithm, splitting up the work and controlling its flow can range from trivial to Rube Goldberg-esque, and may require both support libraries and strict rules. This proposal is about laying out these operations and their placement.

Developer/QA/Guest Programmers backing

Andrew Hallendorff

Common Terms and Definitions

(Note: These are just a few definitions to aid understanding. Feel free to improve them.)

Thread

The execution context (stack, registers, scheduling state) needed for one path through a program to run independently.

Thread Safe

A way of arranging resources and controls so that multiple threads can operate in the same environment without corrupting shared state.

Semaphore

A gate that controls how many threads may pass through at once.

Mutex

(aka Mutual Exclusion) A lock that maintains coherency between threads by preventing parallel operations from executing on the same resource.

Producer Consumer

The relationship where one part of a program sets up the execution of another. Often 1 to many or many to many.

Race Condition

When the outcome of a program depends on the relative timing of two or more parts executing, making results unpredictable.

Deadlock

When two or more consumers each hold a resource the others want, so none can proceed. This usually stops execution.

Approaches

Simple Self Contained

One part of the program creates its own threads, breaks up the job, and controls execution.

Advantages

  • Has little effect on other parts of the code.
  • Full control and containment allows for fine tuning.

Disadvantages

  • Must implement all levels of code.
  • Has no thread pool and must create threads on the fly.
  • Is unaware of competing work from other parts of the program.

Case Study: The EQ Effect with SSE and Threading

Last fall I jumped into Audacity without looking. Studying the large and computationally intense EQ routine, I ended up building a parallel code path that implemented SSE instructions. While not at the level of parallelism that threading offers, the way I implemented it had the effect of arranging the data into discrete intervals. Once I worked out the bugs in out-of-order processing of FFT convolution, it seemed natural to extend this relationship to more than four parallel intervals. (A 128-bit SSE float instruction operates on four operands at once.) Speed improvements were maybe 1.1-1.4x at most, and much less when compared against a vectorizing compiler.

I ended up setting up a producer-consumer relationship. The main thread spawns a series of child threads, each with its own scratch buffers. It then creates and fills a series of containers that hold all the information necessary for a child thread to execute one interval. All flow control is done with a single mutex gating access to this list of containers. The main thread then continues to fill and retire these containers while the child threads process them. On a typical multi-core processor, improvements were on the order of 3-6x.

Files: Equalization48x.cpp/.h, RealFFTf48x.cpp/.h, and a small amount of code in Equalization.cpp/.h.

Globally Supported Locally Implemented

A series of routines