Toward building an automated bioinformatician: parameter advising for improved scientific discovery

Modern scientific software has a large number of tunable parameters that must be adjusted to ensure computational performance and accuracy of the results. When these parameter choices are made incorrectly, we may overlook significant results or falsely report insignificant ones. Parameter choices optimized for one input may not work well for another, so the optimization process typically needs to be repeated for each new piece of data. Standard machine learning methods for solving this problem need to run the software repeatedly, which may not be feasible in practice. Because of the time required to optimize parameters, and the possible loss of accuracy when they are chosen incorrectly, the default parameter vector provided by the tool developer is often used. These defaults are designed to work well on average, but the most interesting cases are rarely “average”.

In this talk, I describe my first steps in automatically learning the correct program configuration for biological applications using a framework we call “Parameter Advising”. To apply this framework to the problem of multiple sequence alignment, we developed an accuracy estimator, called Facet, to help choose among alignments, since no ground truth is available in practice. When we use Facet for advising on the Opal aligner, we boost accuracy by 14.6% on the hardest-to-align benchmarks. For the reference-based transcript assembly problem, applying parameter advising to the Scallop assembler increases accuracy by 28.9%. The framework is general and can be extended to other problems in computational biology and beyond. I will discuss possible areas where parameter advising could be used to automatically learn to run complex analysis software.
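
As a rough illustration of the idea, the sketch below assumes that advising means running the tool once per candidate parameter choice and keeping the result that an accuracy estimator scores highest. This is one plausible reading of the abstract, not the actual Facet, Opal, or Scallop code. The names `advise`, `advisor_set`, `run_tool`, and `estimate_accuracy` are hypothetical, and the toy tool and estimator are stand-ins chosen only to make the example runnable.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple, TypeVar

InputT = TypeVar("InputT")
ResultT = TypeVar("ResultT")
Params = Dict[str, float]


def advise(
    input_data: InputT,
    advisor_set: Iterable[Params],
    run_tool: Callable[[InputT, Params], ResultT],
    estimate_accuracy: Callable[[ResultT], float],
) -> Tuple[Params, ResultT, float]:
    """Return the (params, result, score) triple with the highest estimated accuracy.

    Assumption: the estimator needs only the tool's output, not a ground truth.
    """
    best: Optional[Tuple[Params, ResultT, float]] = None
    for params in advisor_set:
        result = run_tool(input_data, params)   # one tool run per candidate
        score = estimate_accuracy(result)       # score the output without a reference
        if best is None or score > best[2]:
            best = (params, result, score)
    if best is None:
        raise ValueError("advisor_set must contain at least one parameter choice")
    return best


# Toy stand-ins (hypothetical): the "tool" truncates a string to max_len,
# and the "estimator" prefers outputs close to five characters long.
def toy_tool(text: str, params: Params) -> str:
    return text[: int(params["max_len"])]


def toy_estimator(result: str) -> float:
    return -abs(len(result) - 5)


candidates = [{"max_len": 2}, {"max_len": 5}, {"max_len": 9}]
chosen_params, chosen_result, chosen_score = advise(
    "ACGTACGTAC", candidates, toy_tool, toy_estimator
)
print(chosen_params, chosen_result, chosen_score)
```

In this sketch the cost per input is one tool run for each candidate in the advisor set, so keeping that set small is what keeps the approach cheaper than a full parameter search for every new input.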
