Abstract: For clinical prediction problems, short free-text fields often hold valuable information. However, engineering features from non-standardized fields can be difficult without manual curation. Word embedding approaches such as word2vec (Mikolov et al. 2013) or GloVe (Pennington et al. 2014) offer a mechanism for unsupervised, data-driven feature engineering from free text, but they lack the interpretability necessary for applications in the clinical domain. Previous feature engineering approaches for short clinical text have relied on bag-of-words techniques or on mapping text to concept unique identifiers from the Unified Medical Language System (UMLS) (Bodenreider 2004) to create features, while other studies have used raw word embeddings. Combining information from pre-existing clinical ontologies in the UMLS with data-driven word embeddings to create interpretable features from short free text could improve performance for clinical prediction problems. We combined word embeddings generated by the Global Vectors (GloVe) method (Pennington et al. 2014) with clinical ontologies, using category word lists and the Bhattacharyya distance to map embedding dimensions to interpretable categories (Senel et al. 2017). We applied the approach to generate features from emergency department chief complaints, the principal reason for a visit, and used these features to predict clinical orders placed during the visit. We compared functions for combining multiple words in a single chief complaint, variations on word lists and categories generated from distinct UMLS vocabularies, and the use of interpretable features versus raw concept identifiers and raw word embeddings. We provide an automated and unsupervised framework for combining a priori knowledge and data-driven approaches for feature engineering from short free text. This approach can be generalized to other clinical free text and to prediction problems beyond clinical orders.
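The mapping from embedding dimensions to categories can be illustrated with a minimal sketch. It assumes that, for each embedding dimension, the values of words inside a category word list and the values of all other words are each modeled as a univariate Gaussian, and that the Bhattacharyya distance between the two, signed by the direction of the mean difference, serves as that dimension's weight for the category. The function names, the toy vocabulary, and the signing convention are illustrative assumptions, not details taken from the talk.

```python
import numpy as np


def bhattacharyya_gaussian(p, q, eps=1e-12):
    """Bhattacharyya distance between two samples, each modeled as a univariate Gaussian."""
    mu_p, mu_q = np.mean(p), np.mean(q)
    # eps guards against zero variance in very small word lists (an assumption of this sketch)
    var_p, var_q = np.var(p) + eps, np.var(q) + eps
    shape_term = 0.25 * np.log(0.25 * (var_p / var_q + var_q / var_p + 2.0))
    mean_term = 0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q)
    return shape_term + mean_term


def category_weights(embeddings, vocab, categories):
    """Return category names and an (n_dims, n_categories) matrix of signed distances.

    embeddings: (n_words, n_dims) array, one row per word in vocab
    categories: dict mapping a category name to its word list (hypothetical input format)
    """
    index = {word: i for i, word in enumerate(vocab)}
    n_words, n_dims = embeddings.shape
    names = sorted(categories)
    weights = np.zeros((n_dims, len(names)))
    for j, name in enumerate(names):
        mask = np.zeros(n_words, dtype=bool)
        mask[[index[w] for w in categories[name] if w in index]] = True
        if not mask.any() or mask.all():
            continue  # no in-category or no out-of-category words: leave weights at zero
        for d in range(n_dims):
            inside, outside = embeddings[mask, d], embeddings[~mask, d]
            sign = np.sign(inside.mean() - outside.mean())
            weights[d, j] = sign * bhattacharyya_gaussian(inside, outside)
    return names, weights


# Toy example: four words, one embedding dimension, one hypothetical category
vocab = ["chest", "pain", "cough", "fever"]
embeddings = np.array([[-1.0], [1.0], [1.0], [3.0]])
names, weights = category_weights(embeddings, vocab, {"cardiac": ["chest", "pain"]})
```

Under these assumptions, multiplying a word's embedding by `weights` gives one score per category, and the scores for the words in a chief complaint could then be combined with a function such as a mean or a maximum.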
Bio: Haley Hunter-Zinck is a health science specialist at the VA Boston Healthcare System. She has a Ph.D. in computational biology from Cornell University and transitioned to medical informatics during a postdoc in Porto Alegre, Brazil, working with Brazilian public hospitals, and a fellowship at VA Boston. She applies and develops machine learning techniques and visualization tools to improve hospital patient flow, with a focus on the emergency department.