Marriott City Center, Pittsburgh, PA | September 20 - 24, 2009

An algorithm to guide the selection of specific biomolecules for future wet-lab experiments

Madhavi Ganapathiraju, Assistant Professor

Content:

Large amounts of biomedical data are available today, but the few labeled instances are insufficient to characterize the remaining unlabeled data with confidence. At the same time, characterizing data through wet-lab experiments is expensive in terms of expert manpower, time and resources. When a specific biospecimen or molecule is chosen for wet-lab study, its redundancy with previously annotated (labeled) data is usually not taken into account. It is therefore desirable to have algorithms that guide the selection of the data best suited to improving the accuracy and confidence of labeling the remaining data. We designed such an algorithm using active learning and applied it to a problem relevant to structure-based drug design. Membrane proteins (MPs) comprise 60% of drug targets. Prediction of transmembrane helix locations in MPs serves as the first step in computational modeling of their structure, and thereby in structure-based drug design. However, experimentally determined structures are available for less than 1% of MPs. With standard modeling techniques, it is not possible to predict the structures of novel MPs that lack a representative structure.

Technology:

The algorithm is implemented in Matlab using the Matlab Neural Network Toolbox.

Design:

Feature vectors of MP sequences are derived as in TMpro. The data are clustered using a neural-network-based Self-Organizing Map of 5x8 nodes. The active learning algorithm is implemented for two scenarios: one in which localized data points within a molecule are selected to maximize disambiguation while minimizing redundancy, and a second in which selection is permitted only at the whole-molecule level, in which case coverage of the unlabeled data is maximized.
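To make the clustering and selection idea concrete, the following is a minimal sketch in Python rather than Matlab, and it is not the implementation described here. It assumes generic numeric feature vectors in place of TMpro features, a plain rectangular Self-Organizing Map of 5x8 nodes with hypothetical training parameters (epochs, lr0, sigma0), and a simple coverage rule that picks one unlabeled point from each map node that holds no labeled point yet. All function names are illustrative.

```python
import math
import random

ROWS, COLS = 5, 8  # map size stated in the abstract


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: sq_dist(weights[node], x))


def train_som(data, rows=ROWS, cols=COLS, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}.

    The learning rate and neighborhood width decay linearly; these
    schedules are assumptions, not taken from the abstract.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            br, bc = best_matching_unit(weights, data[i])
            # Pull every node toward the sample, weighted by grid distance
            # to the best matching unit.
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (data[i][k] - w[k])
            step += 1
    return weights


def select_for_labeling(weights, data, labeled_idx):
    """Pick one unlabeled point per node that contains no labeled point."""
    by_node = {}
    for i, x in enumerate(data):
        by_node.setdefault(best_matching_unit(weights, x), []).append(i)
    labeled = set(labeled_idx)
    picks = []
    for node, members in sorted(by_node.items()):
        if any(i in labeled for i in members):
            continue  # node already covered; another label here adds redundancy
        # Choose the member closest to the node's weight vector.
        picks.append(min(members, key=lambda i: sq_dist(weights[node], data[i])))
    return picks
```

Calling train_som on a list of feature vectors and then select_for_labeling with the indices of already labeled points returns the indices proposed for the next round of annotation. Selecting at most one point per uncovered node is one simple way to trade coverage against redundancy; the disambiguation criterion of the first scenario is not modeled in this sketch.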

Results:

A novel active learning algorithm for non-redundant data selection has been successfully designed for this domain. It yields a gain in F-score and precision compared to unguided selection of training data (see Figure); using only 1% of the available labels, an F-score of 80% has been achieved.
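For readers unfamiliar with the reported metrics, the sketch below shows how precision, recall and F-score are commonly computed for binary labels. It assumes per-position labels (1 for transmembrane, 0 otherwise) and the balanced F1 form of the F-score; the abstract does not state the evaluation granularity or the exact F-score variant, so both are assumptions, and the function name is illustrative.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Return (precision, recall, F1) for binary labels, where 1 is the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Comparing these values for a model trained on guided selections against one trained on randomly chosen labels of the same size is one way to quantify the kind of gain described above.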

Conclusion:

The algorithm can be translated to other domains, such as medical informatics and bio-image informatics, to choose data selectively for manual or wet-lab annotation, so as to characterize the complete data accurately with minimal redundancy.
