Thesis

A machine learning approach for identifying subtypes of autism

Applying data mining techniques to categorize Autism Spectrum Disorder (ASD) has
 great potential for improving our understanding of this widespread and complex
 medical condition. Since the 1960s, the Autism Research Institute (ARI) has been
 collecting data on the symptoms of children with ASD. The raw ARI database has over 120
 factors and over 40,000 unique entries. In this thesis, we define a novel four-step approach
 to identify subtypes in this Autism data set. While our research is applied to the ARI data
 set, the approach can be applied to other data sets and can use various machine learning
 algorithms.
 The four steps of our approach are 1) pre-process the data, 2) apply clustering algorithms, 3)
 apply classification algorithms, and 4) evaluate the results. We used the Weka data mining
 tool to implement the four steps. From Weka, we used the Expectation Maximization (EM)
 clustering algorithm and the JRip classification algorithm. In addition, we developed our
 own Weka clustering algorithm adapted from Minimum Message Length (MML) communication
 theory. Using our approach and these machine learning algorithms, we were able to identify
 twenty-five unique subtypes and nine major overarching types.
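
 The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the thesis's Weka implementation: the toy `records`, all function names, and the single-threshold rule standing in for JRip are hypothetical, the EM step is a generic diagonal Gaussian mixture, and the MML-based clusterer is not reproduced.

```python
import math
from collections import Counter

def preprocess(records):
    """Step 1: drop records with missing values, rescale each factor to [0, 1]."""
    complete = [r for r in records if None not in r]
    columns = list(zip(*complete))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(r, lows, highs)] for r in complete]

def em_cluster(rows, k, iterations=50, min_var=1e-3):
    """Step 2: EM for a mixture of k diagonal Gaussians; returns hard labels."""
    n, d = len(rows), len(rows[0])
    ordered = sorted(rows, key=sum)
    # Deterministic start (an assumption): means taken from evenly spaced records.
    means = [list(ordered[(2 * j + 1) * n // (2 * k)]) for j in range(k)]
    variances = [[0.1] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for x in rows:
            logs = []
            for j in range(k):
                lp = math.log(weights[j])
                for a in range(d):
                    var = variances[j][a]
                    lp -= 0.5 * (math.log(2 * math.pi * var)
                                 + (x[a] - means[j][a]) ** 2 / var)
                logs.append(lp)
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = max(nj / n, 1e-12)
            for a in range(d):
                means[j][a] = sum(r[j] * x[a] for r, x in zip(resp, rows)) / nj
                spread = sum(r[j] * (x[a] - means[j][a]) ** 2
                             for r, x in zip(resp, rows)) / nj
                variances[j][a] = max(min_var, spread)
    return [max(range(k), key=lambda j: r[j]) for r in resp]

def learn_rule(rows, labels):
    """Step 3: learn one threshold rule (a simple stand-in for JRip)."""
    best = None
    for a in range(len(rows[0])):
        values = sorted(set(x[a] for x in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = Counter(c for x, c in zip(rows, labels) if x[a] <= t)
            right = Counter(c for x, c in zip(rows, labels) if x[a] > t)
            (low_label, low_hits), = left.most_common(1)
            (high_label, high_hits), = right.most_common(1)
            if best is None or low_hits + high_hits > best[0]:
                best = (low_hits + high_hits, a, t, low_label, high_label)
    _, a, t, low_label, high_label = best
    return lambda x: low_label if x[a] <= t else high_label

def evaluate(rule, rows, labels):
    """Step 4: fraction of records where the rule reproduces the cluster label."""
    return sum(rule(x) == c for x, c in zip(rows, labels)) / len(rows)

# Hypothetical symptom scores: two factors per record, one record incomplete.
records = [[1, 2], [2, 1], [1, 1], [2, 2], [None, 3],
           [8, 9], [9, 8], [9, 9], [8, 8]]
rows = preprocess(records)
labels = em_cluster(rows, k=2)
rule = learn_rule(rows, labels)
accuracy = evaluate(rule, rows, labels)
print(labels, accuracy)
```

 The design mirrors the pipeline in the text: clustering proposes candidate subtypes without labels, and the classifier then checks whether those subtypes can be described by readable rules. Here the rule is scored on the same records it was learned from, which is enough for a sketch; a real evaluation would use held-out data or cross-validation.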
 These results show how useful data mining can be in many fields, especially in the
 field of medical taxonomy. Through the evaluation process in our approach, we are able to
 show which machine learning algorithms are more efficient with the ARI database and other
 similar data sets. In addition, our approach lays the groundwork for future research in the
 emerging fields of data mining and Autism.
 The research in this thesis makes a twofold contribution to the sciences. The first is to the
 Computer Science community: we define a robust data mining approach that can be
 applied to various data sets, and we have created a novel adaptation of the
 MML algorithms and introduced it into the Weka palette of algorithms. The second
 contribution is to the Autism community. The recent rise in
 occurrences of this disorder has sparked a major need to understand its many aspects. Our
 research is a start toward segmenting the Autism Spectrum Disorder into smaller,
 comprehensible facets.
 Keywords: Data Mining, Machine Learning, Autism, Clustering, Classification, Expectation
 Maximization algorithm, Minimum Message Length algorithm, JRip algorithm.

