Thesis

A hybrid approach to biomedical assertions classification

Thesis (M.S.)--California State University San Marcos, 2011

Committee members: Rocio Guillen (chair), Rika Yoshii, Jose Mendoza

Key words: Natural Language Processing, Machine Learning, Classification, Text Mining, Feature Selection, Support Vector Machine, Naive Bayes Algorithm

Medical records processing for clinical purposes is a topic of increasing importance and interest. Prior work has broken the problem down into a number of smaller problems; these include the identification of medical problems, treatments, and tests, and how they relate to the patient. Utilizing text mining and classification techniques to address these issues has great potential for aiding patient treatment and reducing costs to patient care providers. Solutions often approach this problem either from a classical natural language processing perspective or with statistical machine-learning-based techniques. In this thesis, we define a novel hybrid approach to classify known problems found in patient medical records. We propose that a hybrid system incorporating techniques from both approaches can leverage the value of each while mitigating some of their weaknesses. While our research is applied to medical records, specifically records provided in the 2010 i2b2/VA Shared Task, this approach can be applied to classifying other data sets and can utilize various machine-learning algorithms. The major steps of our approach are 1) normalizing data, 2) generalization through application of context-independent rules, 3) shallow parsing or chunking of data, 4) application of context-dependent rules across chunks, and 5) generation of features for a machine-learning-based classifier. We incorporated the relational database PostgreSQL, the scripting language Perl, the Weka data mining toolkit, and the libSVM classifier library to implement our approach. From Weka, Naive Bayes and Support Vector Machine classifiers were used. We developed our own feature generation algorithm adapted from regular expressions. Using our approach we were able to classify data with a top overall F-Measure of 88.04%. The research in this thesis makes a twofold contribution to the sciences. The first is to the Computer Science community: we define a text-mining approach using shallow parsing and regular expressions that can be applied to other processes and datasets. The second is to the domain of Medicine: we provide an alternative approach to classifying medical records that does not require large resources to process.
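The five steps listed in the abstract can be pictured with a minimal sketch. This is an illustration only, not the thesis's implementation: the thesis used Perl, PostgreSQL, Weka, and libSVM, while the sketch below uses Python, and every rule, token name (`<num>`, `<neg>`), function name, and feature name in it is a hypothetical stand-in chosen for the example. The chunker in particular is a naive substitute for real shallow parsing.

```python
import re

# Hypothetical context-independent rules (assumptions for illustration;
# the thesis's actual rule set is not given in this record).
CONTEXT_INDEPENDENT_RULES = [
    (re.compile(r"\b\d+(\.\d+)?\b"), "<num>"),          # any number -> <num>
    (re.compile(r"\b(no|denies|without)\b"), "<neg>"),  # negation cue -> <neg>
]


def normalize(text):
    """Step 1: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def generalize(text):
    """Step 2: replace surface forms with generic tokens, ignoring context."""
    for pattern, token in CONTEXT_INDEPENDENT_RULES:
        text = pattern.sub(token, text)
    return text


def chunk(text):
    """Step 3: naive stand-in for shallow parsing.

    Splits on punctuation and on the conjunctions 'and' / 'but'.
    """
    parts = re.split(r"[,;.]|\band\b|\bbut\b", text)
    return [p.strip() for p in parts if p.strip()]


def apply_context_rules(chunks, problem):
    """Step 4: a single context-dependent rule, applied across chunks.

    A problem counts as negated only if a <neg> token occurs in the
    same chunk as the problem mention.
    """
    labeled = []
    for c in chunks:
        has_problem = problem in c
        labeled.append({
            "text": c,
            "has_problem": has_problem,
            "negated": has_problem and "<neg>" in c,
        })
    return labeled


def features(labeled):
    """Step 5: turn the labeled chunks into a feature dictionary."""
    target = [c for c in labeled if c["has_problem"]]
    return {
        "problem_found": bool(target),
        "negation_in_chunk": any(c["negated"] for c in target),
        "num_chunks": len(labeled),
    }


def extract_features(sentence, problem):
    """Run steps 1 to 5 for one sentence and one known problem mention."""
    text = generalize(normalize(sentence))
    return features(apply_context_rules(chunk(text), normalize(problem)))


if __name__ == "__main__":
    sentence = "Patient denies chest pain, but reports a fever of 38.5."
    print(extract_features(sentence, "chest pain"))
    print(extract_features(sentence, "fever"))
```

In a full system, feature dictionaries like these would be passed to a statistical classifier such as Naive Bayes or a Support Vector Machine; that stage is omitted here.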

Includes bibliographical references (p. 117-120)

Title from first page of PDF file (viewed July 28, 2011)
