Automation of Medical Record Risk Factor Tagging Using Machine Learning and Natural Language Processing Methods

This paper describes a system that automatically tags unstructured medical records for a variety of risk factors and patient medical history indicators using the Naïve Bayes and Decision Tree algorithms. Additional natural language processing techniques, such as chunking, are applied to reduce the feature set and improve the results. We ran experiments on de-identified medical records provided by the 2014 i2b2 NLP challenge. On the final test set, the Naïve Bayes classifier achieved 10.89% precision, 60.82% recall, and an F1 measure of 18.47%, while the Decision Tree classifier achieved 84.7% precision, 40.5% recall, and an F1 measure of 54.8%. These results suggest that, for this particular task, the Decision Tree classifier outperforms Naïve Bayes on precision and F1, though not on recall, and is a useful tool for automating the medical record tagging process.
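
The F1 figures quoted above follow from the reported precision and recall if F1 is taken as their harmonic mean. The short sketch below recomputes them under that assumption; the function name `f1_score` and the dictionary layout are illustrative choices and are not taken from the system described in the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # conventional value when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, converted from
# percentages to fractions. The model labels are illustrative keys only.
reported = {
    "Naive Bayes": (0.1089, 0.6082),
    "Decision Tree": (0.847, 0.405),
}

f1_by_model = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {100 * f1:.2f}%")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the Naïve Bayes score stays close to its low precision despite the higher recall, which is why the Decision Tree classifier comes out ahead on F1.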