My research interests are broadly in algorithm design and analysis, and I take inspiration from biological problems. This often leads not only to an interesting algorithmic result but also to a useful biological tool (see Software).
In the past, my work focused mainly on multiple sequence alignment problems. Most recently, I worked on improving the accuracy of protein multiple sequence alignments. Multiple sequence alignment is a fundamental step in bioinformatics, but the problem is NP-complete. Because of the importance of the result and the complexity of the problem, many algorithms exist to find high-quality alignments in practice. Each of these algorithms has a large number of tunable parameters that can greatly affect the quality of the computed alignment. Most users rely on the default parameter choices, which produce the best alignments on average but can produce poor alignments for some inputs. We developed a process called parameter advising, which selects a parameter choice that produces a high-quality alignment for the given input. To accomplish this, a candidate alignment is produced using each of the parameter choices in an advising set, the accuracy of each candidate alignment is estimated using an advising estimator, and the candidate with the highest estimated accuracy is returned to the user. To estimate alignment accuracy, we developed Facet (Feature-based accuracy estimator), which is a linear combination of efficiently computable feature functions. We have found that learning an optimal advisor (selecting both the estimator coefficients and the set of parameter choices) is NP-complete. We extended this result to show that finding the estimator coefficients or the advising set independently is also NP-complete. In practice, we have methods to find close-to-optimal advisors. We are working on ways to improve the accuracy of these parameter advisors.
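The advising loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Facet implementation: the names `linear_estimator`, `advise`, and `toy_align`, the two feature functions, and the coefficients are all hypothetical stand-ins chosen to keep the example self-contained. A real advisor would call an actual aligner and use learned coefficients over real feature functions.

```python
from typing import Callable, List, Sequence, Tuple

Alignment = List[str]  # rows of equal length, with '-' marking gaps


def linear_estimator(
    features: Sequence[Callable[[Alignment], float]],
    coefficients: Sequence[float],
) -> Callable[[Alignment], float]:
    """Build an estimator that is a linear combination of feature functions."""
    if len(features) != len(coefficients):
        raise ValueError("need exactly one coefficient per feature function")

    def estimate(candidate: Alignment) -> float:
        return sum(c * f(candidate) for c, f in zip(coefficients, features))

    return estimate


def advise(
    sequences: Sequence[str],
    advising_set: Sequence[str],
    align: Callable[[Sequence[str], str], Alignment],
    estimate: Callable[[Alignment], float],
) -> Tuple[str, Alignment, float]:
    """Align once per parameter choice and keep the best-estimated candidate."""
    best = None
    for params in advising_set:
        candidate = align(sequences, params)
        score = estimate(candidate)
        # Strict comparison: on ties, the earlier parameter choice wins.
        if best is None or score > best[2]:
            best = (params, candidate, score)
    if best is None:
        raise ValueError("advising set is empty")
    return best


# --- Hypothetical stand-ins so the sketch runs end to end ---

def toy_align(sequences: Sequence[str], pad_side: str) -> Alignment:
    """Toy 'aligner': pad every sequence with gaps on the left or the right."""
    width = max(len(s) for s in sequences)
    if pad_side == "left":
        return [s.rjust(width, "-") for s in sequences]
    return [s.ljust(width, "-") for s in sequences]


def fraction_identical_columns(alignment: Alignment) -> float:
    """Toy feature: share of columns holding one repeated non-gap character."""
    columns = list(zip(*alignment))
    same = sum(1 for col in columns if len(set(col)) == 1 and col[0] != "-")
    return same / len(columns)


def fraction_gap_free_columns(alignment: Alignment) -> float:
    """Toy feature: share of columns containing no gap character."""
    columns = list(zip(*alignment))
    return sum(1 for col in columns if "-" not in col) / len(columns)


# Made-up coefficients; in practice these would be learned.
estimator = linear_estimator(
    [fraction_identical_columns, fraction_gap_free_columns], [0.7, 0.3]
)
choice, alignment, score = advise(
    ["ACGT", "CGT"], ["right", "left"], toy_align, estimator
)
print(choice, alignment, round(score, 3))
```

In this toy setting the advising set has two parameter choices, and the advisor returns whichever one yields the candidate with the higher estimated accuracy. The cost is one alignment per parameter choice, which is why the estimator's feature functions need to be cheap to compute.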
I have also worked on reducing the memory consumption of secondary-structure-conscious RNA multiple sequence alignment (see PMFastR) and on high-throughput phylogeny filtering (see SiClE).
Kwanho Kim, with whom I have been working since 2017, graduated today with his Master of Science in Computational Biology. He successfully defended his thesis, titled “Analyzing the influence of assessment metrics on automated transcript assembly parameter selection”, on April 30 and will be starting a position at The Broad Institute this summer. Congratulations, Kwanho!
Building on the work we presented at ISMB last year, our new work on a hashing scheme that improves on Jaccard and Hamming distances for searching large sequences was recently accepted to ISMB/ECCB 2019 in Basel, Switzerland. This is work with Guillaume Marçais, Carl Kingsford, and Prashant Pandey. A preprint of the manuscript is on bioRxiv (see Publications).
Our editorial describing the symposia hosted by the ISCB Student Council over the 2018 calendar year was recently published in F1000Research. I am proud to say this is the first year I attended two of the three. While I was not a primary organizer for any of the events, all three were well received, and the symposia chairs, the committee members, and all of the attendees are to thank.
I have been invited to give a talk at the Cold Spring Harbor Laboratory Biological Data Science meeting (#biodata18) November 7-10. My talk is preliminarily titled “Building an automated bioinformatician—More accurate, large-scale genomic discovery using parameter advising”.