Computer Science and Engineering, Department of


Date of this Version



P. Z. Revesz, C. Assi, Data mining of pancreatic cancer protein databases, In: Advances in Environment, Computational Chemistry and Bioscience (includes Proc. 3rd International Conference on Bioscience and Bioinformatics), S. Oprisan et al., eds., WSEAS Press, pp. 320-325, 2012.



Christopher Assi, M.S. in Computer Science, August 2012.


Data mining of protein databases poses special challenges because many protein databases are non-relational, whereas most data mining and machine learning algorithms assume that the input is a relational table that can also be represented as an ARFF file. We developed a method to restructure protein databases so that they become amenable to various data mining and machine learning tools. Our restructuring method enabled us to apply both decision tree and support vector machine (SVM) classifiers to a pancreatic protein database. The SVM classifier that used both GO terms and PFAM families to characterize proteins achieved over 73% accuracy in predicting whether a protein is involved in pancreatic cancer.
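
To make the restructuring idea concrete, the following is a minimal sketch and not the authors' actual method: it assumes each protein carries a set-valued list of annotations (GO terms and PFAM families) and flattens those sets into one binary attribute per annotation, written out in ARFF syntax. The function name `to_arff`, the protein identifiers, and the class labels are hypothetical and chosen only for illustration.

```python
def to_arff(relation, proteins):
    """Flatten set-valued protein annotations into a binary ARFF table.

    proteins: list of (protein_id, set_of_annotations, label) tuples.
    This layout is an assumption made for the sketch.
    """
    # One binary attribute per distinct annotation, in sorted order.
    attributes = sorted({a for _, annots, _ in proteins for a in annots})
    labels = sorted({label for _, _, label in proteins})

    lines = [f"@RELATION {relation}", ""]
    for attr in attributes:
        # Quote the name because identifiers such as GO terms contain ':'.
        lines.append(f"@ATTRIBUTE '{attr}' {{0,1}}")
    lines.append(f"@ATTRIBUTE class {{{','.join(labels)}}}")
    lines += ["", "@DATA"]

    # One row per protein: 1 if the protein has the annotation, else 0.
    for _, annots, label in proteins:
        row = ["1" if attr in annots else "0" for attr in attributes]
        lines.append(",".join(row + [label]))
    return "\n".join(lines)


# Hypothetical toy records; the labels are illustrative, not real findings.
proteins = [
    ("P1", {"GO:0005515", "PF00069"}, "cancer"),
    ("P2", {"GO:0005515"}, "normal"),
    ("P3", {"PF00069", "GO:0006468"}, "cancer"),
]
arff_text = to_arff("pancreatic_proteins", proteins)
print(arff_text)
```

A file in this form has a fixed set of columns per protein, which is the kind of input that decision tree and SVM implementations typically expect.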