Leveraging Language Models and Machine Learning in Verbal Autopsy Analysis
Published in arXiv, 2025
Abstract
This thesis advances the growing body of knowledge at the intersection of advanced natural language processing (NLP), epidemiology, and global health. It is one of the first to comprehensively investigate domain-adapted, pre-trained language models (PLMs) for cause of death (COD) classification in the verbal autopsy (VA) context, and to explore, with empirical evidence, multimodal fusion strategies that explicitly leverage the inherently multimodal nature of VA data. It is also among the first to quantitatively characterize the landscape of information sufficiency in VA.
Timely and accurate COD estimates are critical to inform policy and program priorities and to track progress toward various targets defined in the Sustainable Development Goals framework, particularly in resource-constrained settings. In countries without comprehensive civil registration and vital statistics systems, VA is a critical tool for estimating COD and quantifying the burden of disease. In VA, trained interviewers ask proximal informants for details on the signs, symptoms, and circumstances preceding a death. The resulting data are multimodal, comprising both unstructured narratives and structured questions. Physicians primarily use narratives to identify a COD. In contrast, existing automated VA cause classification algorithms use only the questions, ignoring the additional information available in the narratives. Recently, automated algorithms have become increasingly important in routine (non-research) mortality surveillance applications because they are cheap, quick, and generate reproducible results.
In this thesis, we investigate how the VA narrative can be used for automated COD classification using PLMs and machine learning (ML) techniques. Using empirical data from South Africa, we demonstrate that with the narrative alone, transformer-based PLMs with task-specific fine-tuning outperform leading question-only algorithms (such as InSilicoVA) at both the individual and population levels. The narrative-only approach performs particularly well in identifying non-communicable diseases compared to the existing question-only approach.
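The distinction between individual-level and population-level performance can be illustrated with a minimal sketch. The abstract does not name the metrics used, so this example assumes two that are common in the VA literature: plain accuracy at the individual level, and cause-specific mortality fraction (CSMF) accuracy at the population level. The function names and the toy cause labels are hypothetical.

```python
from collections import Counter


def individual_accuracy(true_causes, predicted_causes):
    """Share of deaths whose predicted cause matches the reference cause."""
    matches = sum(t == p for t, p in zip(true_causes, predicted_causes))
    return matches / len(true_causes)


def csmf(causes, cause_list):
    """Cause-specific mortality fractions: share of deaths assigned to each cause."""
    counts = Counter(causes)
    return {c: counts[c] / len(causes) for c in cause_list}


def csmf_accuracy(true_causes, predicted_causes):
    """CSMF accuracy: 1 - sum|true - pred| / (2 * (1 - min(true fraction)))."""
    cause_list = sorted(set(true_causes) | set(predicted_causes))
    if len(cause_list) < 2:
        raise ValueError("CSMF accuracy needs at least two causes")
    true_f = csmf(true_causes, cause_list)
    pred_f = csmf(predicted_causes, cause_list)
    abs_error = sum(abs(true_f[c] - pred_f[c]) for c in cause_list)
    worst_case = 2 * (1 - min(true_f.values()))
    return 1 - abs_error / worst_case


# Toy data (illustrative only, not from the thesis).
true_causes = ["HIV", "HIV", "TB", "Stroke"]
predicted_causes = ["HIV", "TB", "HIV", "Stroke"]

print(individual_accuracy(true_causes, predicted_causes))  # 0.5
print(csmf_accuracy(true_causes, predicted_causes))        # 1.0
```

In this toy case two individual misclassifications cancel out in the aggregate, so the population-level cause distribution is recovered exactly even though only half of the individual assignments are correct. This is why the two levels are reported separately.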
Building on the unimodal findings, we explore various multimodal fusion strategies combining narratives and questions in unified frameworks. Multimodal approaches further improve performance in COD classification, confirming that each modality has unique contributions and may capture valuable information that is not present in the other modality.
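As one illustration of what a fusion strategy can look like, the sketch below shows simple late fusion: a weighted average of the per-cause probabilities produced separately by a narrative model and a question model. This is an assumed example of the general technique, not necessarily one of the strategies evaluated in the thesis; the function name, the weight parameter, and the probabilities are hypothetical.

```python
def late_fusion(narrative_probs, question_probs, narrative_weight=0.5):
    """Combine two per-cause probability dicts by weighted averaging.

    A cause missing from one modality is treated as probability 0 there.
    The result is renormalized so the fused probabilities sum to 1.
    """
    causes = sorted(set(narrative_probs) | set(question_probs))
    fused = {
        c: narrative_weight * narrative_probs.get(c, 0.0)
        + (1 - narrative_weight) * question_probs.get(c, 0.0)
        for c in causes
    }
    total = sum(fused.values())
    return {c: p / total for c, p in fused.items()}


# Hypothetical model outputs for a single death.
narrative_probs = {"HIV": 0.6, "TB": 0.4}
question_probs = {"HIV": 0.3, "TB": 0.7}

fused = late_fusion(narrative_probs, question_probs)
predicted_cause = max(fused, key=fused.get)
print(fused, predicted_cause)
```

Here the two modalities disagree, and the fused prediction follows whichever signal is stronger after weighting. Other designs, such as concatenating learned representations before a shared classifier, combine the modalities earlier in the pipeline.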
Using empirical evidence, we characterize physician-perceived information sufficiency in VA data. We describe variations in sufficiency levels by age and COD and demonstrate that classification accuracy is affected by sufficiency for both physicians and automated methods. Finally, we investigate the potential for ML methods to predict and explain physician-perceived sufficiency. These findings reveal where VA data, coming from the VA instrument and interview, need to be improved to further refine cause classification.
Overall, this thesis demonstrates the value of the VA narrative in enhancing COD classification. Our findings underscore the need for more high-quality data from more diverse settings for training and fine-tuning PLM/ML methods, and they offer valuable insights to guide the rethinking and redesign of the VA instrument and interview. This thesis provides a clear example of how PLMs and ML can greatly improve existing approaches in epidemiology, population health, and social science.
Recommended citation: Chu, Yue. "Leveraging Language Models and Machine Learning in Verbal Autopsy Analysis." arXiv preprint arXiv:2508.19274, 2025.
