OBJECTIVE: This is the first Malaysian machine learning model to detect and disambiguate abbreviations in clinical notes. The model has been designed to be incorporated into MyHarmony, a Natural Language Processing system, that extracts clinical information for healthcare management. The model utilizes word embedding to ensure feasibility of use, not in real-time but for secondary analysis, within the constraints of low-resource settings.
METHODS: A Malaysian clinical embedding, based on Word2Vec model, was developed using 29,895 electronic discharge summaries. The embedding was compared against conventional rule-based and FastText embedding on two tasks: abbreviation detection and abbreviation disambiguation. Machine Learning classifiers were applied to assess performance.
RESULTS: The Malaysian clinical word embedding contained 7 million word-tokens, 24,352 unique vocabularies, and 100 dimensions. For abbreviation detection, the Decision Tree classifier augmented with the Malaysian clinical embedding showed the best performance (F-score of 0.9519). For abbreviation disambiguation, the classifier with the Malaysian clinical embedding had the best performance for most of the abbreviations (F-score of 0.9903).
CONCLUSION: Despite having a smaller vocabulary and dimension, our local clinical word embedding performed better than the larger non-clinical FastText embedding. Word embedding with simple machine learning algorithms can decipher abbreviations well. It also requires lower computational resources and is suitable for implementation in low-resource settings such as Malaysia. The integration of this model into MyHarmony will improve recognition of clinical terms, thus improving the information generated for monitoring Malaysian healthcare services and policymaking.
* Title and MeSH Headings from MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine.