Proposal Threading and Acceleration

From Audacity Wiki

Revision as of 19:11, 14 October 2014

Proposal pages help us get from feature requests into actual plans. This proposal page is about Threading and Acceleration.
Proposal pages are used on an ongoing basis by the Audacity development team and are open to edits from visitors to the wiki. They are a good way to get community feedback on a proposal.

  • Note: Proposals for Google Summer of Code projects are significantly different in structure, are submitted via Google's web app and may or may not have a corresponding proposal page.

Proposed Feature

Most, if not all, modern computers have multiple cores, GPUs, and other accelerators, all capable of executing code. To make use of all this processing power, programs must implement varying levels of gates and sandboxes. Depending on the algorithm, splitting up the work and controlling its flow can range from trivial to Rube Goldberg-esque. Accomplishing this may require both support libraries and strict rules.

This proposal is about laying out these operations and their placement. Four options are under discussion. In practice, we already have special cases of the 'simple self-contained' model and are proposing to develop more complex models.

Developer/QA/Programmer backing

  • Andrew Hallendorff
  • James Crook

Common Terms and Definitions

Definitions to help understanding. Edits welcome.

Term: Definition
Thread: The execution context (stack, registers, program counter) needed for one entry point of a program to run independently.
Thread Safe: A way of arranging resources and controls so that multiple threads can operate in the same environment without corrupting shared state.
Semaphore: A gate that limits how many threads may pass through at once.
Mutex (aka Mutual Exclusion): Maintains coherency between threads by preventing parallel operations on the same resource.
Producer-Consumer: The relationship in which one part of a program sets up work for another part to execute. Often one-to-many or many-to-many.
Race Condition: When the outcome depends on the relative timing of two or more parts of a program that access the same resource.
Deadlock: When two or more threads each hold a resource that another is waiting for. This usually stops execution.
DSL: Domain-specific language, the counterpart of a general-purpose language. When targeting particular platforms, it can be appropriate to write code in a language that complements the target platform.


The four options under discussion:

  • Simple Self Contained
  • Library for Self Contained
  • Central Co-ordination
  • DSL based Multi-Target

Simple Self Contained

One part of the program creates its own threads, breaks up the job, and controls execution.


Advantages:

  • Has little effect on other parts of the code.
  • Local control and containment allow for fine tuning.

Disadvantages:

  • Must implement all levels of code.
  • Has no thread pool and must create threads on the fly.
  • Is unaware of other processes, so it could compete with other parts of the program for resources.
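
As a rough illustration of this model, the sketch below shows one routine that creates its own threads, splits a buffer into contiguous chunks, and joins the threads before returning. It is not Audacity code: ApplyGainRange and ApplyGainThreaded are hypothetical names, and a simple gain stands in for a real effect's inner loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for an effect's inner loop: scale samples in place.
void ApplyGainRange(std::vector<float>& samples, std::size_t begin,
                    std::size_t end, float gain) {
    for (std::size_t i = begin; i < end; ++i) samples[i] *= gain;
}

// The module creates its own threads, breaks up the job, and controls
// execution by joining every thread before it returns.
void ApplyGainThreaded(std::vector<float>& samples, float gain,
                       unsigned threadCount) {
    assert(threadCount > 0);
    // Round up so that every sample belongs to exactly one chunk.
    const std::size_t chunk = (samples.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;  // no pool: threads are created per call
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(samples.size(), t * chunk);
        const std::size_t end = std::min(samples.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back(ApplyGainRange, std::ref(samples), begin, end, gain);
    }
    for (auto& w : workers) w.join();
}
```

A caller would write something like ApplyGainThreaded(buffer, 0.5f, 4). Note that the thread count is chosen locally, which is exactly why this model cannot tell whether another part of the program is already using every core.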

Case Study (Andrew): The EQ Effect with SSE and Threading

Towards the end of 2013 I jumped into Audacity to learn about audio programming. Studying the large and computationally intense EQ routine, I ended up building a parallel code path that used SSE instructions. While SSE does not offer the level of parallelism that threading does, the way I implemented it had the effect of arranging the data in discrete intervals. Once I worked out the bugs in out-of-order processing of FFT convolution, it seemed natural to extend this relationship to more than 4 parallel intervals. (A 128-bit SSE float instruction operates on 4 operands at once.) Speed improvements were maybe 1.1-1.4x at most, and much less against a vectorizing compiler.

I ended up setting up a producer-consumer relationship. The main thread spawns a series of child threads, each with its own scratch buffers. It then creates and fills a series of containers that hold all the information necessary for the child threads to execute one interval. All flow control was done with a single mutex gating access to this list of containers. The main thread then continues to fill and retire these containers while the child threads process them. On a typical multi-core processor, improvements were on the order of 1.5-5x.
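
The arrangement described above can be sketched roughly as follows. This is not the code in Equalization48x.cpp: the Interval, WorkList, Worker, and ProcessInIntervals names are hypothetical, a gain multiply stands in for the FFT convolution, and idle workers simply yield and retry, which is an assumption made here to keep all flow control on one mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Interval {              // hypothetical container for one unit of work
    std::size_t offset = 0;
    std::size_t length = 0;
};

struct WorkList {
    std::mutex gate;               // the single mutex gating the list
    std::deque<Interval> pending;  // filled by the main thread
    std::size_t retired = 0;       // intervals finished by the workers
    bool closed = false;           // set once the producer has no more work
};

void Worker(WorkList& list, std::vector<float>& data, float gain) {
    std::vector<float> scratch;    // each child thread owns its scratch buffer
    for (;;) {
        Interval job;
        {
            std::lock_guard<std::mutex> lock(list.gate);
            if (!list.pending.empty()) {
                job = list.pending.front();
                list.pending.pop_front();
            } else if (list.closed) {
                return;            // list drained and producer finished
            }
        }
        if (job.length == 0) { std::this_thread::yield(); continue; }
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(job.offset);
        const auto last = first + static_cast<std::ptrdiff_t>(job.length);
        scratch.assign(first, last);
        for (float& v : scratch) v *= gain;  // stand-in for the real DSP
        std::copy(scratch.begin(), scratch.end(), first);
        std::lock_guard<std::mutex> lock(list.gate);
        ++list.retired;
    }
}

// Main thread: spawn workers, keep filling the list, then wait for them.
std::size_t ProcessInIntervals(std::vector<float>& data, float gain,
                               std::size_t intervalLength, unsigned workerCount) {
    assert(intervalLength > 0 && workerCount > 0);
    WorkList list;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i)
        workers.emplace_back(Worker, std::ref(list), std::ref(data), gain);
    for (std::size_t offset = 0; offset < data.size(); offset += intervalLength) {
        Interval job;
        job.offset = offset;
        job.length = std::min(intervalLength, data.size() - offset);
        std::lock_guard<std::mutex> lock(list.gate);
        list.pending.push_back(job);
    }
    {
        std::lock_guard<std::mutex> lock(list.gate);
        list.closed = true;
    }
    for (auto& w : workers) w.join();
    return list.retired;
}
```

Because the intervals never overlap, the workers can write their results back without further locking; only the list itself needs the mutex.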

Files: Equalization48x.cpp, Equalization48x.h, RealFFTf48x.cpp, RealFFTf48x.h, plus a small amount of code in Equalization.cpp and Equalization.h.

Library for Self Contained (Globally Supported Locally Implemented)

A series of routines that provide tools and thread-safe operators for modules to implement simple threading. Having a central location would allow for inter-process awareness, but flow control would still mostly be the responsibility of the module.


Advantages:

  • Shared routines simplify deployment.
  • Central location could be used to limit/grant access to resources.

Disadvantages:

  • Only alleviates simple operations.
  • Limited flow control.
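
One way such a library might look is sketched below, purely as an illustration. ThreadBudget and ParallelFor are hypothetical names, not existing Audacity routines. The shared budget is the central location that grants or limits threads, while the calling module still decides what each range of work does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared facility: tracks how many worker threads are free.
class ThreadBudget {
public:
    explicit ThreadBudget(unsigned total) : mAvailable(total) {}

    // Grants up to 'wanted' threads; may grant fewer, or none.
    unsigned Acquire(unsigned wanted) {
        std::lock_guard<std::mutex> lock(mMutex);
        const unsigned granted = std::min(wanted, mAvailable);
        mAvailable -= granted;
        return granted;
    }

    void Release(unsigned count) {
        std::lock_guard<std::mutex> lock(mMutex);
        mAvailable += count;
    }

private:
    std::mutex mMutex;
    unsigned mAvailable;
};

// Shared routine: runs body(begin, end) over [0, count) on granted threads.
void ParallelFor(ThreadBudget& budget, std::size_t count, unsigned wanted,
                 const std::function<void(std::size_t, std::size_t)>& body) {
    assert(static_cast<bool>(body));
    const unsigned granted = budget.Acquire(wanted);
    if (granted == 0) {  // budget exhausted: run on the caller's thread
        body(0, count);
        return;
    }
    const std::size_t chunk = (count + granted - 1) / granted;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < granted; ++t) {
        const std::size_t begin = std::min(count, t * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break;
        workers.emplace_back(body, begin, end);
    }
    for (auto& w : workers) w.join();
    budget.Release(granted);
}
```

A module would pass a lambda that processes samples in [begin, end). The library only hands out threads and ranges, which matches the limitation listed above: anything beyond a simple split, such as pipelining, remains the module's problem.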

Central Co-ordination (Fully Threaded Operator Aware)

Operations are organized at a central location and partitioned out to modules that comply with thread-safe rules.


Advantages:

  • Centralized flow control.
  • Simplified deployment to modules.
  • Pipelined operations.
  • Non-threaded routines can still execute concurrently (to some extent).

Disadvantages:

  • Large footprint.
  • Design time.
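
A minimal sketch of what a central coordinator could look like is given below, under the assumption that it is built as a fixed pool of threads fed from one task queue. CentralScheduler, Submit, and WaitIdle are hypothetical names; the proposal does not specify an interface, and a real design would also need priorities, dependencies between tasks, and pipelining.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical central coordinator: owns all worker threads and the queue.
class CentralScheduler {
public:
    explicit CentralScheduler(unsigned threadCount) {
        assert(threadCount > 0);
        for (unsigned i = 0; i < threadCount; ++i)
            mWorkers.emplace_back([this] { Run(); });
    }

    ~CentralScheduler() {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStopping = true;
        }
        mWake.notify_all();
        for (auto& w : mWorkers) w.join();
    }

    // Modules hand over thread-safe units of work instead of creating threads.
    void Submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mTasks.push(std::move(task));
            ++mOutstanding;
        }
        mWake.notify_one();
    }

    // Blocks until every submitted task has finished.
    void WaitIdle() {
        std::unique_lock<std::mutex> lock(mMutex);
        mIdle.wait(lock, [this] { return mOutstanding == 0; });
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mWake.wait(lock, [this] { return mStopping || !mTasks.empty(); });
                if (mTasks.empty()) return;  // stopping and queue drained
                task = std::move(mTasks.front());
                mTasks.pop();
            }
            task();  // run outside the lock so other workers can proceed
            std::lock_guard<std::mutex> lock(mMutex);
            if (--mOutstanding == 0) mIdle.notify_all();
        }
    }

    std::mutex mMutex;
    std::condition_variable mWake, mIdle;
    std::queue<std::function<void()>> mTasks;
    std::size_t mOutstanding = 0;
    bool mStopping = false;
    std::vector<std::thread> mWorkers;  // declared last: started after the rest
};
```

Here the thread count is decided once, centrally, and modules only describe work. That is where the simplified deployment comes from, and also the design cost: every submitted task must follow the thread-safe rules for this to be sound.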

DSL based Multi-Target

The algorithm is written in a custom DSL with rules for code generation. In particular, the same algorithm can target C, SSE, or GPU, single-threaded or multi-threaded. Depending on the support implemented, it can hook into a central scheduling routine or be completely standalone.


Advantages:

  • Can test the algorithm in isolation from the threading method.
  • Can much more readily experiment with variations on the algorithm.
  • Can much more readily experiment with variations on deployment.

Disadvantages:

  • Very challenging to create.
  • Still needs code for handling different cases, so possibly no saving in effort.
  • Generated code, if done simply, may not have helpful comments and may be verbose and hard to follow.