Recognition of protein binding sites using support vector machines

Detection of functional sites in proteins is an important problem in computational biology, with wide implications for computational drug discovery. FEATURE, developed by the Stanford Helix Group, is a system for predicting protein function that describes functional microenvironments by a set of physicochemical properties and uses Naive Bayesian classification to recognize functional sites. In this work we explore the application of a classification method that is new to the FEATURE system, with the goal of increasing its prediction accuracy. We address the challenge of learning in a small-sample, highly imbalanced, high-dimensional setting. We employ the Support Vector Machine (SVM) classification algorithm, which is known to tolerate high-dimensional, low-sample-size problems. We analyze the performance of the SVM learning algorithm and compare it with the Naive Bayesian algorithm currently used in FEATURE, using identical accuracy measures, data, and experimental methodology. We establish that SVM classification is advantageous for predicting functional sites in the functional families examined. We show that the SVM approach can identify functional sites that both Naive Bayesian classification and sequence-based methods misclassify. We also demonstrate that, for Naive Bayesian classification, neither homology filtering nor the choice of prior has an effect on classification accuracy. The improved classification mechanism allows higher confidence in pre-screening functional microenvironment candidates. The improved software will be made available to the public to benefit the broader research community. This work was done in collaboration with the Stanford Helix Group.
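
To illustrate why the choice of accuracy measure matters when functional sites are rare among candidate microenvironments, the following minimal Python sketch scores two sets of predictions against the same labels. The abstract does not name the measures used in the study; sensitivity, specificity, and balanced accuracy are assumed here for illustration only, and the function name and toy labels are hypothetical rather than taken from FEATURE.

```python
from typing import Dict, Sequence


def imbalance_aware_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Score binary predictions (1 = functional site, 0 = non-site).

    Hypothetical helper: the measures below are an assumed choice,
    not the ones confirmed by the study.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sites found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sites missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-sites rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy data (invented): 2 true sites among 10 candidate microenvironments.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pred_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses one site, no false alarms
pred_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # finds both sites, one false alarm

scores_a = imbalance_aware_scores(y_true, pred_a)
scores_b = imbalance_aware_scores(y_true, pred_b)
print(scores_a)
print(scores_b)
```

On this toy data both predictors reach the same raw accuracy, yet they differ in sensitivity and balanced accuracy, which is why comparing classifiers on identical, imbalance-aware measures is important in this setting.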