James Duin, Hierarchical Active Learning Application to Mitochondrial Disease Protein Dataset, MS thesis, University of Nebraska-Lincoln, May 2017.
This study investigates an application of active machine learning to a protein dataset developed to identify the source of mutations that give rise to mitochondrial disease. Each protein in the dataset is labeled according to its location of origin in the cell: whether it is in the mitochondria or not and, if it is, its specific target location in the mitochondrion's outer or inner membrane, its matrix, or its ribosomes. These labels form a hierarchy. A new machine learning approach is investigated that learns the high-level classifier, i.e., whether the protein is mitochondrial, by separately learning the finer-grained target compartment concepts and combining the results. This approach is termed active over-labeling. Experiments on the protein dataset show that active over-labeling improves the area under the precision-recall curve compared to standard passive or active learning. Because finer-grained labels are more costly to obtain, we also compare strategies that spend fixed proportions of a given budget on fine versus coarse labels at various label costs. Finally, we present a cost-sensitive active learner that uses a multi-armed bandit approach to dynamically choose the label granularity to purchase, and show that the bandit-based learner is robust to variations in both labeling cost and budget.
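The idea of choosing label granularity with a bandit can be illustrated with a minimal sketch. The abstract does not say which bandit algorithm or reward the thesis uses, so everything here is an assumption made for illustration: an epsilon-greedy policy stands in for the bandit, the two arms are named `coarse` and `fine`, the function name `run_bandit`, the costs, and the toy reward values are hypothetical, and the reward is taken to be some observed gain (for example, a change in validation area under the precision-recall curve) divided by the label's cost.

```python
import random


def run_bandit(budget, costs, reward_fn, epsilon=0.1, seed=0):
    """Epsilon-greedy choice of label granularity under a fixed budget.

    Assumptions (not from the thesis): epsilon-greedy policy, and a reward
    defined as observed gain per unit of labeling cost.

    costs:     dict mapping arm name -> cost of one label (hypothetical).
    reward_fn: callable(arm) -> observed gain from buying one label.
    Returns the sequence of arms purchased and the total amount spent.
    """
    rng = random.Random(seed)
    arms = list(costs)
    pulls = {arm: 0 for arm in arms}
    mean_gain_per_cost = {arm: 0.0 for arm in arms}
    spent = 0.0
    history = []

    while True:
        # Only consider granularities we can still pay for.
        affordable = [a for a in arms if spent + costs[a] <= budget]
        if not affordable:
            break

        untried = [a for a in affordable if pulls[a] == 0]
        if untried:
            arm = untried[0]                      # try each arm once
        elif rng.random() < epsilon:
            arm = rng.choice(affordable)          # explore
        else:
            arm = max(affordable, key=lambda a: mean_gain_per_cost[a])  # exploit

        gain = reward_fn(arm) / costs[arm]
        pulls[arm] += 1
        # Incremental update of the running mean reward for this arm.
        mean_gain_per_cost[arm] += (gain - mean_gain_per_cost[arm]) / pulls[arm]
        spent += costs[arm]
        history.append(arm)

    return history, spent


# Toy usage: fine labels cost 4x as much but are assumed to help 8x as much.
toy_costs = {"coarse": 1.0, "fine": 4.0}
toy_gain = {"coarse": 0.01, "fine": 0.08}
purchases, total = run_bandit(50.0, toy_costs, lambda arm: toy_gain[arm])
print(purchases.count("coarse"), purchases.count("fine"), total)
```

In this toy setting the fine arm has the higher gain per unit cost, so the learner buys mostly fine labels until the remaining budget only covers coarse ones. With a real learner, `reward_fn` would retrain the model after each purchase and measure the change in performance, which makes the rewards noisy and non-stationary.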