41 Using Noise Perturbation along with GA-SVM to Overcome Overfitting and Identify Biomarker Sets for Colorectal Cancer
Published: 2010
This paper describes an ongoing research effort to identify gene sets that predict the survival of colorectal cancer patients from gene expression data. Because the dataset includes 395 genes (after initial feature reduction) but only 122 patients, overfitting must be addressed. A genetic algorithm (GA) designed specifically for feature set selection is used in combination with a support vector machine (SVM). Evaluating groups of genes rather than individual genes yields complementary sets. To combat overfitting, the original measurements are perturbed by noise, using variances appropriate to each measurement and an overall gain that is adjusted until only a “modest” number of gene sets is repeatedly discovered. Through these adjustments we seek the strongest signal in the dataset. The goal is either to discover clinically useful diagnostic patterns or to reject a dataset if its strongest signal is not biologically relevant. Initial simulations have shown signs of reproducibility, consistency, and relevance of the identified (individual) genes.
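The perturbation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`perturb`, `count_recurring_sets`, `top_k_by_mean_gap`), the toy data, and the number of runs are all assumptions made for illustration, and the GA-SVM search is replaced by a deliberately simple stand-in selector that ranks genes by the gap between class means. The sketch only shows the general idea of adding per-gene Gaussian noise scaled by an overall gain and counting how often the same gene set is rediscovered.

```python
import random
from collections import Counter


def perturb(data, sigmas, gain, rng):
    """Return a copy of `data` with zero-mean Gaussian noise added.

    data   : list of patient rows, each a list of expression values
    sigmas : per-gene noise standard deviations (assumed to be known)
    gain   : non-negative overall multiplier applied to every sigma
    rng    : random.Random instance, so runs are reproducible
    """
    return [[x + rng.gauss(0.0, gain * s) for x, s in zip(row, sigmas)]
            for row in data]


def top_k_by_mean_gap(data, labels, k=2):
    """Toy stand-in for the GA-SVM search (an assumption, not the paper's method).

    Ranks genes by the absolute gap between the two class means and
    returns the indices of the k highest-ranked genes.
    """
    pos = [row for row, y in zip(data, labels) if y == 1]
    neg = [row for row, y in zip(data, labels) if y == 0]

    def gap(g):
        mean_pos = sum(row[g] for row in pos) / len(pos)
        mean_neg = sum(row[g] for row in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(data[0])), key=gap, reverse=True)[:k]


def count_recurring_sets(data, labels, sigmas, gain, select, runs=20, seed=0):
    """Run `select` on `runs` perturbed copies and count each gene set found."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        noisy = perturb(data, sigmas, gain, rng)
        counts[frozenset(select(noisy, labels))] += 1
    return counts


# Hypothetical toy data: 4 patients, 3 genes; labels are 1 or 0.
toy_data = [[10.0, 5.0, 1.0], [10.0, 5.0, 1.1], [0.0, 0.0, 1.0], [0.0, 0.0, 1.1]]
toy_labels = [1, 1, 0, 0]
toy_sigmas = [1.0, 1.0, 1.0]

# Sweep the gain and report how many distinct gene sets recur at each level.
for gain in (0.0, 1.0, 5.0):
    found = count_recurring_sets(toy_data, toy_labels, toy_sigmas, gain,
                                 top_k_by_mean_gap)
    print(gain, len(found))
```

In a fuller experiment the stand-in selector would be replaced by the actual feature-set search, and the gain would be tuned until the number of distinct recurring sets is acceptably small.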