Proposal Threading and Acceleration
Revision as of 09:27, 14 October 2014

Proposal pages help us get from feature requests to actual plans. This proposal page is about Threading and Acceleration.
Proposal pages are used on an ongoing basis by the Audacity development team and are open to edits from visitors to the wiki. They are a good way to get community feedback on a proposal.


  • Note: Proposals for Google Summer of Code projects are significantly different in structure, are submitted via Google's web app and may or may not have a corresponding proposal page.

Proposed Feature

Most if not all modern computers have multiple cores, GPUs, and other accelerators, all capable of executing code. To make use of all this processing power, programs must implement varying levels of gates and sandboxes. Depending on the algorithm, splitting up the work and controlling its flow can range from trivial to Rube Goldberg-esque. Accomplishing this may require both support libraries and strict rules. This proposal is about laying out these operations and their placement.

Developer/QA/Guest Programmers backing

Andrew Hallendorff

Common Terms and Definitions

(Note: These are just a few definitions to aid understanding. Feel free to improve them.)

Thread

The execution environment (entry point, stack, and register state) necessary for one independent path of a program to run.

Thread Safe

A way of arranging resources and controls so that multiple threads can operate in the same environment without corrupting shared state.

Semaphore

A counting gate that controls how many threads may pass through at once.

Mutex

(short for Mutual Exclusion) A lock that maintains coherency between threads by preventing parallel operations on the same resource.

Producer Consumer

A relationship in which one part of a program (the producer) prepares work that another part (the consumer) executes. Often one-to-many or many-to-many.

Race Condition

When a program's outcome depends on the relative timing of its threads rather than on program logic alone.

Deadlock

When two or more threads each hold a resource that another needs, so none can proceed. This usually halts execution.

Approaches

Simple Self Contained

One part of the program creates its own threads, breaks up the job, and controls execution.

Advantages

  • Has little effect on other parts of the code.
  • Local control and containment allow for fine-tuning.

Disadvantages

  • Must implement all levels of code.
  • Has no thread pool and must create threads on the fly.
  • Is unaware of other threads in the program, so it can compete for resources with other parts of the code.

Case Study: The EQ Effect with SSE and Threading

Towards the end of 2013 I jumped into Audacity to learn about audio programming. Studying the large and computationally intense EQ routine, I ended up building a parallel code path that used SSE instructions. While not the level of parallelism that threading offers, the way I implemented it had the effect of arranging the data into discrete intervals. Once I worked out the bugs in out-of-order processing of FFT convolution, it seemed natural to extend this relationship to more than four parallel intervals. (A 128-bit SSE float instruction operates on four operands at once.) Speed improvements were at most about 1.1–1.4x, and much less when measured against a vectorizing compiler.

I ended up setting up a producer-consumer relationship. The main thread spawns a series of child threads, each with its own scratch buffers. It then creates and fills a series of containers that hold all the information necessary for a child thread to execute one interval. All flow control is done with a single mutex gating access to this list of containers. The main thread then continues to fill and retire these containers while the child threads process them. On a typical multi-core processor, improvements were on the order of 1.5–5x.

Files: Equalization48x.cpp, Equalization48x.h, RealFFTf48x.cpp, RealFFTf48x.h, plus a small amount of code in Equalization.cpp and Equalization.h.

Globally Supported Locally Implemented

A set of routines providing tools and thread-safe operators that modules can use to implement simple threading. Having a central location would allow for awareness between modules, but flow control would still be mostly the responsibility of each module.

Advantages

  • Shared routines simplify deployment.
  • Central location could be used to limit/grant access to resources.

Disadvantages

  • Only helps with simple operations.
  • Limited flow control.

Fully Threaded Operator Aware

Operations are organized at a central location and partitioned out to modules that comply with thread safe rules.

Advantages

  • Centralized flow control.
  • Simplified deployment to modules.
  • Pipelined operations.
  • Non-threaded routines can still execute concurrently (to some extent).

Disadvantages

  • Large footprint.
  • Design time.