Use of Active Learning for Selective Annotation of Training Data in a Supervised Classification System for Digitized Histology
Scott Doyle; Rutgers University
Content:
Annotation of ground truth in a data set is necessary for proper training of a supervised classification system. However, in the field of digitized histopathology, obtaining labeled ground truth is costly and time-consuming. Active learning is a method that intelligently chooses informative samples (rather than random samples) from a database for annotation. In this work, we present an active learning paradigm for the automated classification of prostate cancer from images of digitized histopathology. By selectively annotating training samples based on their uniqueness, the number of training samples necessary for accurate classification is reduced when compared with traditional randomized techniques.
Technology:
A set of hematoxylin and eosin (H&E) stained prostate tissue slides is scanned into a computer using a whole-slide digital scanner. The resulting images are analyzed using a set of routines implemented in the MATLAB software package. Ground truth is labeled by an expert pathologist using standard image-editing software (Aperio ImageScope).
Design:
Our dataset consists of digitized high-resolution images of prostate histology. The goal of the supervised classification system is to detect which pixels in each image correspond to cancerous growths. From each digital image, a set of features is extracted, including Haralick co-occurrence features, Gabor filter features, and statistical greylevel features. Ground truth is annotated on each image by an expert pathologist, and from this pool of annotated data we construct a training set. In this study, we compare the accuracy of three experimental setups: (1) the Randomized training set; (2) the Active Learning set; and (3) the Control set. To construct (1), we randomly sample the annotated data to generate ten sets of training data, each of which is used to train a decision tree classifier. Each of the ten decision trees casts a vote on every testing sample, so that each sample receives between 0 and 10 votes for the cancer class. To construct (2), we augment (1) with the testing pixels that received an intermediate number of votes (between 2 and 6). In this way, we choose to annotate informative, difficult-to-classify samples in order to improve classification accuracy. The augmented set is used to retrain the ten decision tree classifiers and re-classify the images. Finally, we construct (3) by selecting samples that received either very few (0 or 1) or very many (7 through 10) votes for inclusion in the training set. By selecting these uninformative samples, we can ensure that any increase in accuracy is due to the inclusion of informative samples rather than to the increase that would be expected when any data (informative or otherwise) is added to the training set.
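The vote-based selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MATLAB routines used in the study: the names `count_votes`, `split_by_votes`, and `make_stump` are hypothetical, and simple threshold rules on a single feature stand in for the ten trained decision trees. Only the vote bands (2 to 6 for informative samples; 0 to 1 and 7 to 10 for uninformative samples) come from the text.

```python
from typing import Callable, List, Sequence, Tuple

# A classifier maps a feature vector to 1 (cancer) or 0 (non-cancer).
Classifier = Callable[[Sequence[float]], int]


def count_votes(committee: Sequence[Classifier],
                samples: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each sample, how many committee members vote 'cancer'."""
    return [sum(clf(x) for clf in committee) for x in samples]


def split_by_votes(votes: Sequence[int], low: int = 2, high: int = 6
                   ) -> Tuple[List[int], List[int]]:
    """Split sample indices into informative and uninformative groups.

    Informative: vote count in [low, high] (the committee disagrees).
    Uninformative: vote count below low or above high (near-unanimous).
    """
    informative = [i for i, v in enumerate(votes) if low <= v <= high]
    uninformative = [i for i, v in enumerate(votes) if v < low or v > high]
    return informative, uninformative


def make_stump(threshold: float) -> Classifier:
    """Hypothetical stand-in for one trained decision tree (assumption)."""
    return lambda x: int(x[0] > threshold)


# Ten stand-in classifiers with thresholds 0.05, 0.15, ..., 0.95.
committee = [make_stump((k + 0.5) / 10) for k in range(10)]
samples = [[0.0], [0.3], [0.5], [0.8], [1.0]]  # toy one-feature samples

votes = count_votes(committee, samples)          # [0, 3, 5, 8, 10]
informative, uninformative = split_by_votes(votes)
print(informative)    # [1, 2]    -> candidates for the Active Learning set
print(uninformative)  # [0, 3, 4] -> candidates for the Control set
```

In this toy run, the samples with 3 and 5 votes fall into the informative band, while those with 0, 8, and 10 votes are treated as uninformative.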
Results:
The results of the analysis of three test images are shown below. We find that classification accuracy increases when Active Learning is used to select additional training samples, and that the increase is due to the informative nature of the samples themselves, since non-informative samples (the Control group) yield a smaller increase in accuracy. Shown in the table is the number of training samples used in each of the three setups, along with the accuracy obtained using that training set, for each of the three images. Note that our goal here is to show trends in accuracy rather than perfect classification; clearly, a more robust classifier could be built with a larger dataset to annotate. In each case, classification accuracy goes up when Active Learning is employed to annotate informative samples for training. Further, the Control group uses the same number of training samples as the Active Learning group but achieves lower classification accuracy. This indicates that annotation of informative samples, rather than uninformative samples, is necessary to maximize classification accuracy.
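The comparison above depends on two simple operations: keeping the Control set the same size as the Active Learning set, and scoring each setup by pixel-level accuracy. A minimal Python sketch follows. The text does not say how the sizes were matched, so random subsampling is an assumption here, and the helper names `size_matched_control` and `pixel_accuracy` are hypothetical.

```python
import random
from typing import List, Sequence


def size_matched_control(uninformative: Sequence[int],
                         n_informative: int,
                         seed: int = 0) -> List[int]:
    """Draw as many uninformative indices as there are informative ones.

    Assumption: sizes are matched by random subsampling without replacement.
    If fewer uninformative samples exist, all of them are returned.
    """
    rng = random.Random(seed)
    k = min(n_informative, len(uninformative))
    return sorted(rng.sample(list(uninformative), k))


def pixel_accuracy(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if len(truth) == 0:
        raise ValueError("cannot compute accuracy on an empty image")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Toy labels only; these are not results from the study.
print(pixel_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
control = size_matched_control(range(10), n_informative=3)
print(len(control))  # 3
```

Fixing the seed keeps the Control draw reproducible, so repeated runs compare the same Control set against the Active Learning set.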
Conclusion:
In this study, we have presented an active learning paradigm for the training of a computer-aided diagnosis (CAD) system. By selectively annotating difficult-to-classify samples, we can increase the accuracy of the system using fewer training samples than would be necessary in a traditional training paradigm. Since labeling digitized histopathology is costly and time-consuming, we must choose to annotate only informative samples in order to maximize the accuracy gained per annotated training sample. Implementing active learning reduces the cost of obtaining labeled ground truth samples from a pathologist by reducing the overall number of training samples necessary to obtain high classification accuracy.
