TY  - JOUR
AU  - Lee, Christine
AU  - Mohebbi, Matthew
AU  - O'Callaghan, Erin
AU  - Winsberg, Mirène
PY  - 2024
DA  - 2024/8/2
TI  - Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study
JO  - JMIR Ment Health
SP  - e58129
VL  - 11
KW  - mental health
KW  - telehealth
KW  - PHQ-9
KW  - Patient Health Questionnaire-9
KW  - suicidal ideation
KW  - AI
KW  - LLM
KW  - OpenAI
KW  - GPT-4
KW  - generative pretrained transformer 4
KW  - tele-mental health
KW  - large language model
KW  - clinician
KW  - clinicians
KW  - artificial intelligence
KW  - patient information
KW  - suicide
KW  - suicidal
KW  - mental disorder
KW  - suicide attempt
KW  - psychologist
KW  - psychologists
KW  - psychiatrist
KW  - psychiatrists
KW  - psychiatry
KW  - clinical setting
KW  - self-reported
KW  - treatment
KW  - medication
KW  - digital mental health
KW  - machine learning
KW  - language model
KW  - crisis
KW  - telemental health
KW  - tele health
KW  - e-health
KW  - digital health
AB  - Background: Due to recent advances in artificial intelligence, large language models (LLMs) have emerged as powerful tools for a variety of language-related tasks, including sentiment analysis and summarization of provider-patient interactions. However, there is limited research on these models in the area of crisis prediction. Objective: This study aimed to evaluate the performance of LLMs, specifically OpenAI's generative pretrained transformer 4 (GPT-4), in predicting current and future mental health crisis episodes using patient-provided information at intake among users of a national telemental health platform. Methods: Deidentified patient-provided data were pulled from specific intake questions of the Brightside telehealth platform, including the chief complaint, for 140 patients who indicated suicidal ideation (SI) with a plan at intake and another 120 patients who later indicated SI with a plan during the course of treatment. Similar data were pulled for 200 randomly selected patients, treated during the same time period, who never endorsed SI. In total, 6 senior Brightside clinicians (3 psychologists and 3 psychiatrists) were shown patients' self-reported chief complaint and self-reported suicide attempt history but were blinded to the future course of treatment and other reported symptoms, including SI. They were asked a simple yes or no question regarding their prediction of endorsement of SI with a plan, along with their confidence level about the prediction. GPT-4 was provided with similar information and asked to answer the same questions, enabling us to directly compare the performance of artificial intelligence and clinicians. Results: Overall, the clinicians' average precision (0.7) was higher than that of GPT-4 (0.6) in identifying SI with a plan at intake (n=140) versus no SI (n=200) when using the chief complaint alone, while sensitivity was higher for GPT-4 (0.62) than the clinicians' average (0.53). The addition of suicide attempt history increased the clinicians' average sensitivity (0.59) and precision (0.77) while slightly decreasing the GPT-4 sensitivity (0.59) and decreasing the GPT-4 precision (0.54). Performance was comparatively lower when predicting future SI with a plan (n=120) versus no SI (n=200) using the chief complaint alone, for both the clinicians (average sensitivity=0.4; average precision=0.59) and GPT-4 (sensitivity=0.46; precision=0.48). The addition of suicide attempt history increased performance for both the clinicians (average sensitivity=0.46; average precision=0.69) and GPT-4 (sensitivity=0.74; precision=0.48). Conclusions: GPT-4, with a simple prompt design, produced results on some metrics that approached those of a trained clinician. Additional work must be done before such a model can be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the underlying data on which they are trained. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients.
SN  - 2368-7959
UR  - https://mental.jmir.org/2024/1/e58129
UR  - https://doi.org/10.2196/58129
UR  - http://www.ncbi.nlm.nih.gov/pubmed/38876484
DO  - 10.2196/58129
ID  - info:doi/10.2196/58129
ER  - 
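Note: the following is a minimal, hypothetical sketch, not the authors' code or prompt, of the workflow described in the abstract: passing a patient's self-reported chief complaint and suicide attempt history to GPT-4 with a simple yes/no question about SI with a plan, and computing the sensitivity and precision metrics reported above. The function names, prompt wording, and model identifier string are assumptions; the paper only states that GPT-4 received the same intake information and questions as the clinicians.

```python
# Hypothetical sketch of the study's prompting and scoring steps (assumed details,
# not taken from the paper beyond what the abstract describes).
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_si_with_plan(chief_complaint: str, attempt_history: str) -> str:
    """Ask GPT-4 for a yes/no prediction of SI with a plan, plus a confidence level.

    The prompt text below is illustrative only; the published study does not
    disclose its exact prompt wording in the abstract.
    """
    prompt = (
        "You are assisting with mental health intake triage.\n"
        f"Chief complaint: {chief_complaint}\n"
        f"Self-reported suicide attempt history: {attempt_history}\n"
        "Will this patient endorse suicidal ideation with a plan? "
        "Answer 'yes' or 'no', then state your confidence (low/medium/high)."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a fixed yes/no task
    )
    return response.choices[0].message.content


def sensitivity_and_precision(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP).

    Assumes at least one true positive case and one positive prediction,
    as in the abstract's comparisons (e.g., n=140 SI with plan vs n=200 no SI).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tp / (tp + fp)
```

A usage example under these assumptions: parse each model response into a binary prediction (1 for "yes", 0 for "no"), pair it with the known label (1 for SI with a plan, 0 for never endorsed SI), and pass the two lists to sensitivity_and_precision to reproduce metrics of the form reported in the Results.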