We're intending to make Audacity compatible with biological sequence data, in fact sequence data of many kinds.
We'd expect scientists to be using their existing favourite specialised biological sequence browsers in their labs. Our reason for handling biological sequence data at all is to act as an ideas incubator. We believe that the approaches we have - often motivated by the needs of audio sequence analysis - will inspire ideas and improvements in the user interfaces of the lab tools. Our code is open source, so researchers are welcome to build on and adapt our code if that looks useful to them.
Here's a screenshot of a sequence browser for biological sequence data, from jbrowse.org:
It has affordances for zooming, panning and selecting, and for multiple tracks of different types. It should be clear that Audacity is already an audio sequence browser (and editor). We can add new track types, and handle more general sequences.
What can we offer?
Audacity has a single kind of label that can mark:
- Individual positions.
- A range of positions.
- Boundaries - several consecutive ranges which share an edge.
Editing in these different cases should be easier than with three different kinds of labels. Our display of labels is also designed to keep the descriptive text of the label on screen, even as you zoom and pan. Without that, especially when zoomed in, it's very easy to lose track of what each label is for. We think these features of labels are useful in a biological sequence browser.
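The idea of one label type covering points, ranges and shared boundaries can be sketched as a small data structure. This is a minimal illustration, not Audacity's actual label code: the `Label` class and the `shared_boundaries` and `move_boundary` helpers are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    """One label type for both cases: a point is a range with start == end."""
    start: float
    end: float
    text: str = ""

    def is_point(self) -> bool:
        return self.start == self.end


def shared_boundaries(labels: List[Label]) -> List[float]:
    """Positions where one range label ends exactly where the next begins."""
    ordered = sorted(labels, key=lambda lab: (lab.start, lab.end))
    return [a.end for a, b in zip(ordered, ordered[1:])
            if not a.is_point() and not b.is_point() and a.end == b.start]


def move_boundary(labels: List[Label], old: float, new: float) -> None:
    """Drag a shared edge: every range edge at `old` moves to `new` together."""
    for lab in labels:
        if lab.is_point():
            continue
        if lab.end == old:
            lab.end = new
        elif lab.start == old:
            lab.start = new


# Illustrative data: a point feature and two ranges that share an edge.
labels = [Label(5, 5, "SNP"), Label(10, 20, "exon 1"), Label(20, 35, "intron 1")]
move_boundary(labels, 20, 22)
print(shared_boundaries(labels))
```

Because a point is simply a zero-length range, the same editing code serves all three cases, which is the simplification the text above argues for.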
We plan to improve on our labels, especially to get better at handling large numbers of labels, labels for different kinds of features, and connections between labels. The needs of biological sequence analysis will probably help us do labels for audio better.
We have a prototype for continuous zooming of the timeline that we think is visually pleasing, clear, and uses little space. It makes it easy to precisely select a range. This is worth trying with non-audio sequences too.
Sequence alignment is an area currently almost entirely missing in Audacity. We have plans here for a user interface for showing alignments of audio tracks, including mixed alignment where we align different types of track, e.g. MIDI to audio.
Our current plan is to design and implement user interface both for audio and biological sequence alignment at the same time. As we work on MIDI to audio WAV alignment we'll also be working on DNA to protein sequence alignment (algorithms similar to tFASTY). We think that some of the pressures of aligning audio tracks will suggest user interface ideas for biological sequence tracks, and even underlying alignment algorithm tweaks, that are valuable and would not have been thought of if only working with DNA and protein.
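To make the overlap between the two problems concrete, here is a minimal sketch of a textbook global alignment score (Needleman-Wunsch style dynamic programming) written over generic sequences. It is not tFASTY, which also handles translation frames and frameshifts, and it is not code from Audacity. The function name, the scoring callback and the gap penalty are assumptions chosen for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def global_alignment_score(
    a: Sequence[T],
    b: Sequence[U],
    match_score: Callable[[T, U], float],
    gap: float = -1.0,
) -> float:
    """Best global alignment score of a against b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the items of `a` seen so far with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            curr.append(max(
                prev[j - 1] + match_score(x, y),  # align x with y
                prev[j] + gap,                    # x aligned to a gap
                curr[j - 1] + gap,                # y aligned to a gap
            ))
        prev = curr
    return prev[-1]


def same(x, y) -> float:
    """Toy scoring: +1 for identical items, -1 otherwise."""
    return 1.0 if x == y else -1.0


# The same routine scores residues or note pitches alike.
print(global_alignment_score("GATTACA", "GATACA", same))
print(global_alignment_score([60, 62, 64], [60, 62, 63, 64], same))
```

Only `match_score` knows what the items are, so a scoring function that compares a MIDI note with an estimated audio pitch, or a codon with an amino acid, could be swapped in without touching the recurrence.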
Pathways Browser / Code Browser
We've been working on improving the doxygen documentation of Audacity. It needs it, because Audacity is large, complex and hard to understand. We've decided to use wiki for explaining algorithms and structure. To do that, we're building a toolbox of interactive diagram-making code as we progress in documenting Audacity.
Many of the needs for documenting biological sequence information coincide with the needs for documentation of a large and complex program. Some parallels are shown in Progress Report 2.
Below is a screenshot from reactome.org, which is a kind of Google Maps for biochemistry. It packs a tremendous amount of information into a beautifully done interface.
We also like the approach at wikipathways.org where the pathways can be organised and collaboratively edited.
We're working on a hybrid approach. We're using wiki as glue and our own new code for interactive diagrams. We aim for diagrams that combine elements of hand curation with automatic layout. Graphviz, gnuplot and d3.js show how data-driven diagrams can be produced. We're adding layout hints, landmarks and rich tooltips to the mix.
We aim to have a reactome-style browser, built in wiki, for Audacity source code. Audacity has hundreds of classes rather than the thousands of proteins in reactome, so the challenges in documenting the code are much more manageable.
We like the way that wiki does not straitjacket content. If a topic connects two systems, say protein folding and immunity, the topic can be added and can be found from either place. We also like the hierarchical and diagram-based navigation of reactome. We're getting that into wiki via diagram types that have tips with hyperlinks.
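As a rough illustration of a data-driven diagram with tooltips and a layout hint, the sketch below writes Graphviz DOT text from a small dictionary. It uses the standard Graphviz node attributes `tooltip`, `URL` and `pos`. The function names, the node names and the example.org links are placeholders made up for the example, and this is not the project's own diagram code.

```python
from typing import Dict, Iterable, Tuple


def dot_quote(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and quotes for DOT."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(nodes: Dict[str, dict], edges: Iterable[Tuple[str, str]]) -> str:
    """Emit a DOT digraph whose nodes carry tooltip and URL attributes."""
    lines = ["digraph code_map {"]
    for name, info in nodes.items():
        attrs = ["tooltip=" + dot_quote(info["tip"]),
                 "URL=" + dot_quote(info["url"])]
        if "pos" in info:
            # Layout hint: a trailing "!" asks neato to pin the node here.
            x, y = info["pos"]
            attrs.append("pos=" + dot_quote(f"{x},{y}!"))
        lines.append(f"  {dot_quote(name)} [{', '.join(attrs)}];")
    for src, dst in edges:
        lines.append(f"  {dot_quote(src)} -> {dot_quote(dst)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical hand-curated entries; only the first node is pinned by hand.
nodes = {
    "Project": {"tip": "Top-level document object",
                "url": "https://example.org/wiki/Project", "pos": (0, 0)},
    "LabelTrack": {"tip": "Holds point and range labels",
                   "url": "https://example.org/wiki/LabelTrack"},
}
print(to_dot(nodes, [("Project", "LabelTrack")]))
```

Rendering the output to SVG with Graphviz gives hover tips and clickable nodes. Pinning a few landmark nodes by hand while leaving the rest to automatic layout is one way to mix curation with automation.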
Wikipedia has a long tradition of using bots to do many mundane and mechanical tasks. We plan to use bots on our wiki to keep information up to date. If comments in our source code change, or naming of functions changes, we're expecting that a bot that reads source code comments the way doxygen does can keep text and diagrams up to date.
We've already found in early work on the diagram tools that a visual indication of how much documentation exists for a subsystem is valuable on a diagram. It helps guide exploration, and it shows what still needs documenting. In our class descriptions we use spot sizes that grow logarithmically with the quantity of documentation to indicate how much information is 'behind' each spot. This kind of cue is also valuable for biochemical information, where it is worth distinguishing visually between function inferred algorithmically and function validated by experiment.
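A logarithmic spot size can be sketched in a few lines. The function name, the use of character count as the measure of documentation, and the constants below are all assumptions for illustration. The source only says that spot size grows logarithmically with the quantity of documentation.

```python
import math


def spot_radius(doc_chars: int, min_radius: float = 3.0, scale: float = 2.0) -> float:
    """Radius that grows with the log of documentation size (in characters).

    Undocumented items still get `min_radius`, so they remain visible.
    """
    return min_radius + scale * math.log10(1 + max(doc_chars, 0))


# Each tenfold increase in documentation adds roughly `scale` to the radius.
for chars in (0, 100, 10_000, 1_000_000):
    print(chars, round(spot_radius(chars), 2))
```

The log scale keeps a heavily documented class from swamping the diagram while still making undocumented classes easy to spot.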
Biology Quests / Investigations
We're not planning to replicate complete and comprehensive charts for known biochemistry. We're an ideas incubator, not a curator of that information. We will be doing some smaller investigations that can be done using readily available information. We're particularly interested in:
- Pathogenesis vs symbiosis genes. There are genes involved in symbiotic nitrogen fixation that are allied to genes involved in pathogenesis. Also, systematically exploring differences between malarial parasites and corallicolid symbionts may shed some light on pathogenesis. Even if it doesn't, tools to help with such investigation will be interesting.
- Repetitive sequences. These are a problem for conventional search algorithms, which may simply discard them. Plasmodium falciparum has a repetitive protein with an intriguing similarity to a protein in Staphylococcus aureus that warrants further exploration. Kelch proteins also show short local repeated regions, and may be under-reported in database searches.
- Single linkage cluster analysis, especially to follow up on weak similarities between widely diverse species. Commonalities will generally be in fundamental mechanisms. There's a link between CRISP (Cysteine-Rich Secretory Protein, not CRISPR) and P53 (cellular tumour antigen) we want to follow up.
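Single linkage clustering at a fixed threshold amounts to finding connected components in a graph whose edges join any pair scoring at or above the threshold. The sketch below does this with a small union-find. The function names, the sequence names and the similarity scores are invented toy values, not results from any database search.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def single_linkage(
    items: Sequence[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> List[list]:
    """Group items so that any pair scoring >= threshold shares a cluster."""
    parent: Dict[Hashable, Hashable] = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters: Dict[Hashable, list] = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return list(clusters.values())


# Toy scores (made up): seqA and seqC are only weakly similar to each other,
# but both are similar to seqB, so single linkage chains them together.
scores = {("seqA", "seqB"): 0.9, ("seqB", "seqC"): 0.8,
          ("seqA", "seqC"): 0.1, ("seqD", "seqE"): 0.7}


def lookup(a, b):
    return scores.get((a, b), scores.get((b, a), 0.0))


print(single_linkage(["seqA", "seqB", "seqC", "seqD", "seqE"], lookup, 0.5))
```

The chaining through an intermediate sequence is what makes single linkage suited to following weak similarities across distant species. It is also the method's known weakness, since one spurious link can merge unrelated clusters.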
Are we worried about being scooped? Not at all. We're more interested in generating tools than in getting credit for finding some unseen pattern or connection. Having actual biological 'quests' will help us with focus, and will help us design tools that support them.
Our use of wiki and of interactive diagrams is intended to make the tools easy to adopt. As we progress with the tools, we hope to have conversations with WikiPathways and with the Wikimedia Foundation about what is needed to make these tools suitable for use beyond our own project.
The additions we make may also be worth adapting by gnuplot and Graphviz. These may help scientific notebooks such as Jupyter, which use those tools to produce static diagrams. With modification, the notebooks will be able to produce richer interactive diagrams.
Apart from the actual biology questions, none of this bio-audacity work is 'fundamental research' that in itself makes discoveries in molecular biology. Work on tools is currently not getting enough attention. Such work can be done by people who are programmers rather than molecular biologists. Done well, these changes make working with the data easier. As an open source team that works well in the open source way, we're well positioned to do and coordinate some exploration of innovation in the tools.