We intend to make Audacity work with biological sequence data and, more generally, with sequence data of many kinds.
We'd expect scientists to be using their existing favourite specialised biological sequence browsers in their labs. Our reason for handling biological sequence data at all is to act as an ideas incubator. We believe that the approaches we have - often motivated by the needs of audio sequence analysis - will inspire ideas and improvements in the user interfaces of the lab tools. Our code is open source, so researchers are welcome to build on the ideas of our code and adapt our actual code if that looks useful to them.
Here's a screenshot of a sequence browser for biological sequence data at jbrowse.org
It has affordances for zooming, panning, selecting and multiple tracks of different types. So does Audacity. It should be clear that Audacity already is an audio sequence browser (and editor). We can add new track types, and handle more general sequences.
What can we offer?
Audacity has a single kind of label that can mark:
- Individual positions.
- A range of positions.
- Boundaries - several consecutive ranges which share an edge.
Editing in these different cases should be easier than with three different kinds of labels. Our display of labels is also designed to keep the descriptive text of the label on screen, even as you zoom and pan. Without that, especially when zoomed in, it's very easy to lose track of what each label is for. We think these features of labels are useful in a biological sequence browser.
- Are these features of label editing vital to a DNA/protein sequence editor? No. Could they make a worthwhile difference? Yes. Annotation is one of the core ways of documenting understanding of biological sequences. Even small improvements are worth having.
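As a rough illustration of the idea, and not Audacity's actual label code, a single label type covering all three cases could be sketched in Python as below. The names `Label`, `shared_boundaries` and `move_boundary` are invented for this example, and the positions and label texts are made up.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """One label type: a point when start == end, otherwise a range."""
    start: float
    end: float
    text: str

    @property
    def is_point(self) -> bool:
        return self.start == self.end


def shared_boundaries(labels):
    """Positions where one range label ends exactly where another starts."""
    starts = {lab.start for lab in labels if not lab.is_point}
    ends = {lab.end for lab in labels if not lab.is_point}
    return sorted(starts & ends)


def move_boundary(labels, old, new):
    """Move a shared edge: every range-label edge at `old` moves to `new`."""
    for lab in labels:
        if lab.is_point:
            continue
        if lab.start == old:
            lab.start = new
        if lab.end == old:
            lab.end = new


# Hypothetical annotations on a sequence (positions are invented).
labels = [
    Label(100, 100, "start codon"),  # an individual position
    Label(100, 250, "exon 1"),       # a range of positions
    Label(250, 400, "intron 1"),     # a range sharing an edge with exon 1
]
move_boundary(labels, 250, 260)      # both ranges follow the shared edge
```

Because a boundary is just two ranges with a common edge, dragging that edge is one operation on one kind of object, rather than a coordinated edit across separate label types.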
We plan to improve on our labels, especially to get better at handling large numbers of labels, labels for different kinds of features, and connections between labels. The needs of biological sequence analysis will probably help us do labels for audio better.
We have a prototype for continuous zooming of the timeline that we think is visually pleasing, clear, and uses little space. It makes it easy to precisely select a range. This is worth trying with non-audio sequences too.
Alignment is an area currently almost entirely missing in Audacity. We have plans here for a user interface for showing alignments of audio tracks, including mixed alignment where we align different types of track, e.g. MIDI to audio.
Our current plan is to design and implement the user interface for audio alignment and for biological sequence alignment at the same time. As we work on MIDI to audio WAV alignment we'll also be working on DNA to protein sequence alignment (algorithms similar to tFASTY). We think that some of the pressures of aligning audio tracks will suggest user interface ideas for biological sequence tracks, and even tweaks to the underlying alignment algorithms, that are valuable and would not have been thought of if only working with DNA and protein.
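To give a feel for what DNA to protein alignment involves, here is a deliberately simplified Python sketch. It translates the DNA in each of the three forward reading frames using the standard genetic code and scores each translation against the protein with a Smith-Waterman local alignment. This is not tFASTY: it does not allow frameshifts inside an alignment, it ignores the reverse strand, and it uses a plain match/mismatch score rather than a substitution matrix. The function names and scoring values are our own assumptions for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate DNA from offset `frame` (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (score only, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def best_frame(dna, protein):
    """Return (frame, score) for the forward frame that aligns best to the protein."""
    scores = [(local_score(translate(dna, f), protein), f) for f in range(3)]
    score, frame = max(scores)
    return frame, score


# An extra leading base pushes the coding sequence into frame 1.
frame, score = best_frame("AATGGCCAAGTGG", "MAKW")
```

The user interface question is then how to show such a mixed alignment, where three DNA positions correspond to one protein position, in the same way that many audio samples may correspond to one MIDI note.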
Pathways Browser / Code Browser
We've also been working on improving the doxygen documentation of Audacity. It needs it, because Audacity is large and complex and hard to understand. We've decided to leverage wiki - use wiki for explaining algorithms and structure. To do that, we're making a toolbox of interactive diagram making code, as we progress in documenting Audacity.
- The wiki strategy is important. Wiki is built for collaboration. In our view diagrams are a pain point in wiki. Producing them collaboratively is much harder than producing text collaboratively.
Many of the needs for diagrams for documenting biological sequence information coincide with the needs for documentation of a large and complex program. Some parallels in work already done are shown in Progress Report 2. Later diagrams will animate the steps in an algorithm, or the steps in a pathway, to show how it works.
Below is a screenshot from reactome.org, which is a kind of Google Maps for biochemistry. It packs in a tremendous amount of information, in a beautifully done interface.
We also like the approach at wikipathways.org where the pathways can be organised and collaboratively edited.
We're working on a hybrid approach. We're using wiki as glue as Wikipathways does, and our own new code for interactive diagrams. Wiki works well for collaboration on explaining different topics. We want to augment it. Rather than building an overarching navigator of the kind Reactome has, we'll build diagrams to facilitate navigation that we can place on wiki pages as needed.
We aim for diagrams that combine elements of hand curation with automatic layout. Graphviz, GnuPlot and d3.js show how data driven diagrams can be produced. We're adding layout hints, landmarks, integration into mediawiki and rich tooltips to the mix and making it so that data driven parts can be updated, with hand crafted parts still arranged nicely.
We also aim to have a Reactome-style browser, built in wiki, for Audacity source code. Audacity has hundreds of classes rather than the thousands of proteins in Reactome, so the challenges in documenting the code are much more manageable.
We like the way that wiki does not straitjacket content. If a topic connects two systems, say protein folding and immunity, the topic can be added and can be found from either place. We also like the hierarchical and diagram based navigation of Reactome.
We're getting more navigation features into wiki via diagram types that have tips with hyperlinks. The Keyword-In-Context index will add to this, as will new diagrams for exploring trees of information.
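For readers who have not met a Keyword-In-Context index, the following Python sketch shows the basic technique. It is an illustration only and not our wiki code: `kwic_index`, the stopword list and the sample titles are all invented for the example.

```python
def kwic_index(titles, stopwords=frozenset(
        {"a", "an", "and", "for", "in", "of", "the", "to"})):
    """Return (keyword, left context, right context, title index) rows,
    sorted by keyword, with one row per non-stopword in each title."""
    rows = []
    for idx, title in enumerate(titles):
        words = title.split()
        for i, word in enumerate(words):
            key = word.strip(".,;:()").lower()
            if not key or key in stopwords:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            rows.append((key, left, right, idx))
    return sorted(rows)


# Hypothetical page titles.
titles = ["Label track editing", "Editing of audio tracks"]
for key, left, right, idx in kwic_index(titles):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{key}] {right}")
```

Each page appears once per significant word in its title, so a reader can find it from any of those words. In a wiki, each row would carry a hyperlink to the page it came from.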
Wikipedia has a long tradition of using bots to do many mundane and mechanical tasks. We plan to use bots on our wiki to keep information up to date. If comments in our source code change, or naming of functions changes, we're expecting that a bot that reads source code comments the way Doxygen does can keep text and diagrams up to date.
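A minimal sketch of the comparison step such a bot might perform is given below, in Python. It is an assumption about how the bot could work and not existing code: it only handles `///` comment lines placed directly above a `class` or `struct` declaration, whereas Doxygen recognises several other comment styles. The names `extract_class_docs` and `stale_pages`, the sample C++ source and the wiki text are all invented.

```python
import re

# One or more consecutive '///' lines directly followed by a class/struct name.
DOC_BLOCK = re.compile(
    r"((?:^[ \t]*///.*\n)+)"
    r"[ \t]*(?:class|struct)\s+(\w+)",
    re.MULTILINE,
)


def extract_class_docs(source):
    """Map class name -> text of the '///' comment block above it."""
    docs = {}
    for block, name in DOC_BLOCK.findall(source):
        lines = [line.strip()[3:].strip() for line in block.splitlines()]
        docs[name] = " ".join(text for text in lines if text)
    return docs


def stale_pages(source, wiki_pages):
    """Class names whose wiki summary differs from the source comment."""
    docs = extract_class_docs(source)
    return sorted(n for n, text in docs.items() if wiki_pages.get(n) != text)


# Invented sample source and wiki content.
sample_source = """\
/// Holds annotations for one track.
/// Annotations may be points or ranges.
class ExampleTrack : public Track {
};

/// Draws the ruler above the tracks.
class ExampleRuler {
};
"""
wiki = {"ExampleTrack": "Holds annotations for one track.",
        "ExampleRuler": "Draws the ruler above the tracks."}
to_update = stale_pages(sample_source, wiki)  # pages the bot would rewrite
```

A real bot would then write the fresh text back to the wiki and regenerate any diagram that uses the changed names.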
We've found in the early work on the diagram tools that visuals indicating how much detailed documentation a subsystem has are valuable. In our class descriptions we use spot sizes that grow logarithmically with the quantity of documentation to indicate how much information is 'behind' each spot. This acts as a guide when exploring, showing which are the well trodden paths and which are more off the beaten track. The same kind of cue is valuable for biochemical information, where it is essential to distinguish visually between function inferred algorithmically and function validated by experiment.
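The logarithmic scaling can be sketched in a few lines of Python. The exact mapping here is an assumption for illustration: the choice of radius rather than area, the base-10 logarithm, and the `min_radius` and `scale` values are ours, and `spot_radius` is a hypothetical name.

```python
import math


def spot_radius(doc_chars, min_radius=3.0, scale=2.0):
    """Spot radius that grows with the logarithm of the amount of documentation.

    doc_chars is the size of the documentation, e.g. in characters.
    Using log10(1 + n) keeps undocumented items visible at min_radius.
    """
    return min_radius + scale * math.log10(1 + max(0, doc_chars))


# Ten times more documentation adds a fixed amount to the radius.
radii = [spot_radius(n) for n in (0, 99, 999, 9999)]
```

With a linear scale, one heavily documented class would dwarf everything else on the diagram. The logarithm keeps large and small spots readable side by side.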
Biology Quests / Investigations
We're not planning to replicate complete and comprehensive charts for known biochemistry. We're an ideas incubator, not a curator of that information. We'll make tools which can be adopted by others. We will be doing some smaller investigations that can be done using readily available information. We're particularly interested in:
- Pathogenesis vs symbiosis genes. There are genes involved in symbiotic nitrogen fixation that are allied to genes involved in pathogenesis. Systematically exploring differences between malarial parasites and corallicolid symbionts may also shed some light on pathogenesis. Even if it doesn't, tools to help with such investigation will be interesting.
- Repetitive sequences. These are a problem for conventional search algorithms, which may simply discard them. Plasmodium falciparum has a repetitive protein with an intriguing similarity to a protein in Staphylococcus aureus that warrants further exploration. Kelch proteins also show short local repeated regions, and may be under-reported in database searches.
- Single linkage cluster analysis, especially to follow up on weak similarities between widely diverse species. Commonalities will generally be in fundamental mechanisms. There's a link between CRISP (Cysteine-Rich Secretory Protein, not CRISPR) and p53 (cellular tumour antigen) we want to follow up.
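To make the single linkage idea concrete, here is a small Python sketch. It clusters items by joining any pair whose similarity reaches a threshold, which is single linkage clustering cut at that threshold. The function name, the protein names and the scores are all invented for the example, and a real analysis would take its scores from a sequence search tool.

```python
def single_linkage_clusters(items, similarities, threshold):
    """Group items so that any pair scoring >= threshold shares a cluster.

    similarities is an iterable of (item_a, item_b, score) tuples.
    Clusters are the connected components of the thresholded graph,
    found here with a union-find structure.
    """
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in similarities:
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


# Invented proteins and similarity scores.
proteins = ["protA", "protB", "protC", "protD", "protE"]
scores = [("protA", "protB", 0.40), ("protB", "protC", 0.35),
          ("protA", "protC", 0.10), ("protD", "protE", 0.20),
          ("protC", "protD", 0.05)]
clusters = single_linkage_clusters(proteins, scores, threshold=0.30)
```

Here protA and protC land in the same cluster even though their direct similarity is weak, because each is similar to protB. That chaining is what makes single linkage useful for following weak similarities across diverse species, and also why its results need checking by hand.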
Are we worried about being scooped? Not at all. We're more interested in generating tools than in getting credit for finding some unseen pattern or connection. Having actual biological 'quests' will help us with focus, and will help us design tools that support them. It's also a way to connect better with professional biologists. We may get things badly wrong. Their comments and feedback - provided the quests are not completely off the wall - will help us make the tools better informed by biological data.
Our use of wiki and of interactive diagrams is intended to make the tools easy to adopt. As we progress with the tools, we hope to have conversations with Wikipathways and with the Wikimedia Foundation about what is needed to make our tools suitable for use beyond our own project.
The additions we make may also be worth adopting by GnuPlot and GraphViz. These may help scientific notebooks such as Jupyter, which often use the GnuPlot and GraphViz tools. With modification, the notebooks will be able to produce richer interactive diagrams.
This isn't a quick fix project. It's for the long haul. The interactive diagram software is currently only about 4 months old. For a long time yet, hand crafted static diagrams created in Inkscape are going to work better for Wikipedia than automatically produced diagrams do.
For simple diagrams, the diagram software is already showing promise. The navigation control of the keyword in context index may already be of interest to Wikipathways.
Turning Audacity into a general sequence editor and adding sequence comparison functions is also a slow process. A lot has already been done to make the code more general, but implementation takes a lot longer than formulating the outline of ideas and a plan.
The project will likely work or not depending on whether more people like the plan and join in with coding, with graphics/UI design and with the biology quests. Without that it will progress, but more slowly than we would like.
Apart from the actual biology questions, none of this bio-audacity work is 'fundamental research' that in itself directly makes discoveries in molecular biology. Work on tools is currently not getting enough attention. Such work can be done by people who are programmers rather than molecular biologists. Done well, these changes make working with the data easier. As an open source team that works well in the open source way, we're well positioned to do and coordinate some exploration of innovation in the tools.