Date of this Version
Srikanth Maturu, Application of Cosine Similarity in Bioinformatics, MS Thesis, University of Nebraska, Lincoln, 2018.
Finding similar sequences to an input query sequence (DNA or proteins) from a sequence data set is an important problem in bioinformatics. It provides researchers an intuition of what could be related or how the search space can be reduced for further tasks. An exact brute-force nearest-neighbor algorithm used for this task has complexity O(m * n) where n is the database size and m is the query size. Such an algorithm faces time-complexity issues as the database and query sizes increase. Furthermore, the use of alignment-based similarity measures such as minimum edit distance adds an additional complexity to the exact algorithm. In this thesis, an alignment-free method based similarity measures such as cosine similarity and squared euclidean distance by representing sequences as vectors was investigated. The cosine-similarity based locality-sensitive hashing technique was used to reduce the number of pairwise comparisons while finding similar sequences to an input query. We evaluated our algorithm on a proteins dataset of size 100,000 sequences and found that our cosine-similarity based algorithm is 28 times faster than the exact algorithm and 13 times faster than the BLASTP algorithm for finding similar sequences with percent identity greater than 90%. It also has 99.5% accuracy. We also developed a greedy incremental clustering algorithm based on our cosine-similarity nearest neighbor algorithm for removing redundant sequences in a protein dataset. We compared our clustering algorithm with a popular clustering algorithm CD-HIT. The clustering results on protein dataset of size 100000 show that our clustering algorithm generated clusters with accuracy almost equal to the CD-HIT algorithm accuracy. We further demonstrated two bioinformatics application where our cosine-similarity based algorithm can be used: an analysis of assembly data of various assemblers and a clustering of a protein dataset. Using our algorithm, we successfully compared the quality of assembly data of multiple de novo and genome-guided assemblers.
Adviser: Jitender Deogun