Making efficient learning algorithms with exponentially many features
Abstract
Expanding a learning problem's input space into a high-dimensional feature space can increase the expressiveness of the hypothesis class and thus may improve the performance of linear threshold-based learning algorithms such as Perceptron and Winnow. However, because the number of features grows dramatically, these algorithms will not run efficiently unless special techniques are used. Such techniques include Monte Carlo approaches, grouping strategies and kernels. We investigated these techniques and applied them to two problems: DNF learning and generalized multiple-instance learning (GMIL). For DNF learning, we used a new approach to learning generalized (non-Boolean) DNF formulas, which uses all (exponentially many) possible terms over n attributes as inputs to Winnow. The weighted sums that Winnow requires are then approximated using a Markov chain Monte Carlo (MCMC) method. We proposed an optimized version of this basic algorithm, which produces exactly the same classifications while often using fewer Markov chain simulations. We also empirically evaluated four MCMC sampling techniques in terms of the accuracy of their weighted sum estimates. For the GMIL problem, our work is based on Scott et al.'s algorithm GMIL-1 for their GMIL model, which has exponential time and space complexity because it uses all possible boxes in a discretized space as features. We proposed a Winnow-based algorithm, GMIL-2, with a new grouping strategy, which has the same generalization ability as GMIL-1 and saves more than 98% of the time and memory in practice. We then proposed a kernel that exactly corresponds to the feature mapping used by GMIL-1. We showed that this kernel is #P-complete to compute, and then gave a fully polynomial randomized approximation scheme for it. Our kernel showed improvements in both generalization error and time complexity over GMIL-1 and GMIL-2, and outperformed other MIL algorithms on benchmark data sets. We also proposed two extensions to further improve the kernel's generalization ability. Both GMIL-2 and the kernels have been successfully applied in several domains: drug discovery, content-based image retrieval and biological sequence analysis.
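To make the scale of the DNF setting concrete, the following is a minimal sketch of Winnow run over every possible term, under simplifying assumptions that are not taken from the dissertation: attributes are Boolean (the dissertation treats generalized, non-Boolean DNF), the terms are enumerated explicitly (3^n of them, which is exactly the cost the MCMC approximation is meant to avoid), and the threshold and update factor are illustrative choices. The names all_terms, satisfies, winnow_train and winnow_predict are hypothetical.

```python
from itertools import product


def all_terms(n):
    """Enumerate every conjunction over n Boolean attributes.

    Each attribute is required true (1), required false (0), or
    absent (None), giving 3**n terms in total.
    """
    return list(product((1, 0, None), repeat=n))


def satisfies(term, x):
    """Return True if input x satisfies every literal in the term."""
    return all(lit is None or lit == xi for lit, xi in zip(term, x))


def winnow_train(examples, n, alpha=2.0, epochs=50):
    """Train Winnow with one weight per term (illustrative parameters).

    Assumption: threshold = number of terms, multiplicative factor alpha.
    Training stops early after a full pass with no mistakes.
    """
    terms = all_terms(n)
    weights = [1.0] * len(terms)
    theta = float(len(terms))
    for _ in range(epochs):
        mistakes = 0
        for x, label in examples:
            active = [i for i, t in enumerate(terms) if satisfies(t, x)]
            # This explicit weighted sum is the quantity that becomes
            # intractable for large n.
            total = sum(weights[i] for i in active)
            pred = 1 if total >= theta else 0
            if pred != label:
                mistakes += 1
                factor = alpha if label == 1 else 1.0 / alpha
                for i in active:
                    weights[i] *= factor  # promote or demote active terms
        if mistakes == 0:
            break
    return terms, weights, theta


def winnow_predict(model, x):
    """Classify x with a trained (terms, weights, theta) model."""
    terms, weights, theta = model
    total = sum(w for t, w in zip(terms, weights) if satisfies(t, x))
    return 1 if total >= theta else 0


if __name__ == "__main__":
    # Hypothetical target DNF: (x0 AND x1) OR (NOT x2)
    target = lambda x: int((x[0] and x[1]) or not x[2])
    data = [(x, target(x)) for x in product((0, 1), repeat=3)]
    model = winnow_train(data, n=3)
    print(len(model[0]), "terms;", [winnow_predict(model, x) for x, _ in data])
```

Even at n = 3 the learner maintains 27 weights, and the count triples with each added attribute, which is why explicit enumeration is only viable for toy inputs like this one.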
Subject Area
Computer science|Artificial intelligence
Recommended Citation
Tao, Qingping, "Making efficient learning algorithms with exponentially many features" (2004). ETD collection for University of Nebraska-Lincoln. AAI3159564.
https://digitalcommons.unl.edu/dissertations/AAI3159564