An algorithm to guide the selection of specific biomolecules to be studied by wet-lab experiments
Madhavi Ganapathiraju, University of Pittsburgh
Content:
Despite the large amounts of biomedical data available today, the few instances of manually annotated or experimentally determined information are insufficient to confidently characterize the remaining unlabeled data. At the same time, characterizing data with wet-lab experimental methods is expensive in terms of expert manpower, time, and resources. This makes it desirable to have algorithms that select biomolecules for experimental annotation such that the outcomes are non-redundant and can accurately label the remaining data. Here, we present one such algorithm, which falls into the emerging area of active learning, applied to a problem relevant to structure-based drug design. Membrane proteins (MPs) comprise 60% of drug targets. Predicting transmembrane (TM) helix locations in MPs is the first step in computational modeling of their structure, and thereby in structure-based drug design. However, experimentally determined structures are available for less than 1% of MPs, and standard modeling techniques cannot predict the structures of novel MPs that lack a representative structure.
Technology:
The algorithm is developed in Matlab.
Design:
Feature vectors of MP primary sequences are derived as in TMpro. A neural network is used to construct a self-organizing map (SOM) of 5x8 nodes from a random 1% subset of the data. All data points are then assigned to nodes by simulating the trained network on the full data set (see Figure). The active learning algorithm is implemented for two scenarios: (A) data points are chosen based on cluster density and on the need to disambiguate the labels of a given cluster; (B) the data points of an entire protein are chosen to maximize coverage of the unlabeled data. Recall (the percentage of observed TM segments that are predicted correctly) and precision (the percentage of predicted TM segments that are correct) are used to evaluate the algorithm's performance.
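Scenario (A) can be illustrated with a minimal Python sketch. This is not the Matlab implementation described above. It assumes that SOM node assignments are already available, that cluster density is measured as the number of unlabeled points in a node, and that the need for disambiguation is measured as the Shannon entropy of the labels already known in that node. The names `label_entropy` and `select_queries`, the scoring formula, and the one-query-per-node rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of labels; 0.0 for an empty list."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_queries(node_of, known_labels, budget):
    """Return up to `budget` unlabeled point indices to annotate next.

    node_of:      list mapping each point index to its SOM node id
    known_labels: dict {point index: label} for already annotated points
    """
    unlabeled = defaultdict(list)  # node id -> indices still lacking a label
    labeled = defaultdict(list)    # node id -> labels already known there
    for idx, node in enumerate(node_of):
        if idx in known_labels:
            labeled[node].append(known_labels[idx])
        else:
            unlabeled[node].append(idx)

    def score(node):
        density = len(unlabeled[node])
        # Assumption: a node with no labels yet is treated as maximally
        # ambiguous for a two-class problem (1 bit).
        ambiguity = label_entropy(labeled[node]) if labeled[node] else 1.0
        return density * (1.0 + ambiguity)

    # Highest score first; ties are broken by node id for determinism.
    ranked = sorted(unlabeled, key=lambda n: (-score(n), n))
    # Assumption: query one point (the first unlabeled one) per selected node.
    return [unlabeled[n][0] for n in ranked[:budget]]


if __name__ == "__main__":
    # Toy data: 7 points spread over 3 nodes, 3 of them already labeled.
    node_of = [0, 0, 0, 0, 1, 1, 2]
    known = {0: "TM", 1: "nonTM", 4: "TM"}
    print(select_queries(node_of, known, budget=2))
```

In this toy example, node 0 is both dense and label-conflicted, so it is queried first; node 2 has no labels at all and is queried before node 1, whose single known label is unambiguous.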
Results:
A novel active learning algorithm has been successfully designed for this domain. It selects non-redundant data without loss of accuracy (81% F-score) compared with using all of the available training data.
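The segment-level metrics can be computed as in the following Python sketch. The abstract does not define the F-score, so the balanced form (the harmonic mean of precision and recall) is assumed here. The function name `segment_scores` and the counts in the usage example are hypothetical and are not results from this work.

```python
def segment_scores(n_observed, n_predicted, n_correct):
    """Return (recall, precision, f_score) for TM segment prediction.

    n_observed:  number of observed (true) TM segments
    n_predicted: number of predicted TM segments
    n_correct:   number of predicted segments that match an observed one
    """
    recall = n_correct / n_observed if n_observed else 0.0
    precision = n_correct / n_predicted if n_predicted else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    # Assumption: balanced F-score (F1), the harmonic mean of the two.
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


if __name__ == "__main__":
    # Illustrative counts only.
    print(segment_scores(n_observed=4, n_predicted=4, n_correct=2))
```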
Conclusion:
The algorithm can be translated to other domains in which the few manually annotated instances fail to accurately characterize the unlabeled data. There, it can be used to selectively choose data points for manual annotation.
