Proposal Audio Diff

From Audacity Wiki
Revision as of 14:54, 10 September 2008 by James (talk | contribs) (Added Category:For Developers)
Jump to: navigation, search

What Diff Does

  • If you can compare short moments of sound, say 100ms fragments, then you can extend it to a method for comparing and aligning audio of any length, allowing for changes in the length. That's comparison of sequences allowing for insertions or deletions (indels).
  • Diff extends a method for comparing small fixed size elements to become a method for aligning sequences of those fixed size elements.

Some links

  • ALE An alignment editor written in emacs.
  • Simon Dixon's MATCH algorithm. It is also available as a VAMP plug in.

Is Diff Difficult?

The difficulty in 'diff' is not in the essence of it, which is a simple but powerful idea. As the Wikipedia article shows, diff can be presented in 20 lines or so of code. The difficulties are in specialising it to particular applications. It's an O(mn) algorithm, so speed matters. It invites numerous variations, many of which are ideal for certain uses. The parameters in its tables can be tweaked in innumerable ways. If there are variations possible in the underlying fragment comparison algorithm, it's 10 times more so with the algorithm extended to sequences. Interface design for diff is challenging too.

On the plus side, diff is very powerful. Comparison underlies so much of understanding of data of all kinds. It's possible to be selective about what areas to explore. You don't have to go into all the variations, or just work on one variant that is good enough. One can work on two paths, one with a simple to write but flexible version of diff, where experimentation on variations is easy, another with an extreme optimised but very specialised variant for an application like database data mining where the optimisation is central - and the main research is on how to use those results. One can also work just on the interface, with a minimal basic diff (fast, simple, basic) to drive it.

GSoC Project

Diff has been proposed as a Google Summer of Code project. The project needs:

  • GUI, this is the most substantial piece of work. It's about being able to use the results of the diff, e.g. to merge pairs of tracks intelligently. We're expecting some smart ways of using the clips somewhat like Araxis Merge (see below). Using clips introduces gaps (silence) in the audio. We think that's fine for speech, as the gaps will be between words. We also expect a dotplot display. It will be needed during development anyway, so might as well productise it and make it available to the user.
  • The fragment comparison algorithm. Keep it as simple as possible for this project. It's a research area in itself if you do otherwise. Do make it a VAMP plug-in, because then we can switch in different versions.
  • The diff algorithm itself. What parameters will you present to the user? How will you deal with very large sequences? What do you see as the use cases to cover? These determine how much flexibility you build into the underlying diff.


  • The commercial program Araxis Merge has a good interface for working with textual differences between two files. I particularly like the way that once aligned you can replace a block in one file with the corresponding block from the other with a single click. You can shuttle blocks around easily.
  • If we were to line up two audio tracks in this way, we could mix-and-match between the two tracks. This would be extraordinarily cool. I don't know of any commercial audio software that can do this - perhaps someone else can look.
  • In developing a diff algorithm it can be helpful to present the match-matrix graphically. In molecular biology the 'dotplot' is used to do this, (though it's not usually recognised how closely related to alignment this is!). There is no reason not to do something similar for sound. It is useful for the bench scientist, and in sound it will probably be useful to the end user too.

Longest Common Subsequence Algorithm

It's explained well on Wikipedia. For the moment we're only interested in comparing two sequences, so the preamble about NP hardness of aligning more than two sequences is of no relevance to us. For comparing two sequences the algorithm has complexity O(nm) where n and m are the sizes of the two sequences.

The vital insight is that there is a correspondence between an 'alignment' of two text sequences and a path through the grid shown in the example. The path goes from the top left corner to the bottom right hand corner.

There's more information here. This brings in the idea that there is a 'cost' for an indel (= insertion or deletion), i.e. a penalty whenever a letter is aligned against a space.

The section of this page on sequence alignment covers related material. At the time it was written there was a bit more mystique around 'dynamic programming' and I wanted to dispel that. Nowadays the idea that dynamic programming is just recursion with memoisation (in my terms recursion with a cache) is much more mainstream. Although we don't yet have tools that can automatically convert a recursive formulation of an algorithm to an efficient memoised/dynamic-programming equivalent, it's clear that the change is not hugely challenging in itself. FFT can in fact be formulated as a recursive algorithm with memoisation - it's just that it's better to memoise it by hand.


  • The LCS algorithm is scoring '1' for a match, and '0' for a mismatch. An obvious and very useful extension is to score an alignment/path-through-the-matrix where the component scores depend on the 'letters' matched - for example S and T might be considered similar, near to 1 in score, whilst A and W might be near to 0 in score. We can plug in any function we like for pairwise similarity for short segments.
  • Instead of a fixed penalty for an indel, we can make it a function of the length so that once a 'gap' is above a certain length increasing it further does not cost us much.
  • A modified algorithm need not align the entirety of either sequence. Perhaps surprisingly it does not cost us any more in algorithm complexity to find the best scoring local subsequence. This variant is valuable for database searching, for finding whether part of one sequence is present in part of some other sequence in a database.
  • Multiple sequences, not just two. It's NP hard, if you read the literature, but you don't have to do it that way. In practice a perfectly workable way of aligning real world multiple sequences builds them up from 'best-pairwise alignments', and only does k(k-1)/2 of these for k sequences. You might want to leave your workstation running overnight all the same.


  • For DNA database searching a lot of attention to detail in the implementation can boost the performance significantly. It doesn't change the O(mn)ness of the algorithm, but for my use it did make the algorithm 50 times faster. See this page for details.

Limitations / Further Research

The ability to mix and match would have some limitations. It only works with the audio fragments it's supplied with. The alignment gives you an opportunity to select the best fragments for each time slice. In itself it won't give you a better fragment than one or other of the two corresponding fragments in the two takes. In its first incarnation it would probably be most useful for speech or for an unaccompanied solo singer. It is easier to make clean transitions back and forth between two tracks in the 'silences/voicing' between words than it is to slide between two tracks with multiple voices and/or instruments that differ slightly. Even in speech/song, the speaker/singer would need to have been careful to speak/sing with the same kind of intonation and emphasis or the switching back and forth would sound strange. The feature would be suitable for a trained speech talent. It would not be as suitable for someone new to vocal work - though using it a great deal would probably help train them in consistency!

Use Cases

See Fixing a Jam Session in Use Cases for more context in which these algorithms are useful.

  • (The big one) - Multiple takes of the same 'session'. Allow the user to mix and match.
  • Spot every cough from the audience in a recording (irregular spacing)
  • Graph the speed of an engine from its sound (regular, but unknown, spacing)
  • Gather evidence of the noise pollution (pneumatic drill, heavy lorries) from building works nearby over a 10 day long recording.... He's planning to take them to court.

Any one of these would be a significant advance for Audacity.