
Published in Vol 13 (2026)

This is a member publication of King's College London (Jisc)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/91454.
It Is the Journey, Not the Destination: Moving From End Points to Trajectories When Assessing Chatbot Mental Health Safety

1Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, 16 De Crespigny Park, London, England, United Kingdom

2Department of Psychological Medicine, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, England, United Kingdom

3South London and Maudsley NHS Foundation Trust, London, England, United Kingdom

4King's College Hospital, London, England, United Kingdom

5Nuraxi AI, London, United Kingdom

6Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, England, United Kingdom

7The Collective Intelligence Project, San Francisco, CA, United States

8Department of Clinical Medicine, Aarhus University, Aarhus, Central Jutland, Denmark

9Department of Affective Disorders, Aarhus University Hospital, Aarhus, Central Jutland, Denmark

Corresponding Author:

Hamilton Morrin, MBBS


Large language models are rapidly becoming embedded in everyday life through artificial intelligence (AI) chatbots that people use for practical assistance and companionship, as well as for support with mental health and emotional well-being. Alongside clear benefits, clinicians and public reports increasingly describe a minority of users whose interactions seem to drift over days or weeks toward strongly questionable convictions, delusions, or suicidal crises. Importantly, clinically meaningful deterioration can occur even without overtly unsafe text outputs, via more insidious processes, such as compulsive use, sleep disruption, withdrawal from human contact, and progressive narrowing of attention around the chatbot relationship. In this Viewpoint, we argue that risk often arises not at a single tipping point but through trajectory effects that accumulate across extended dialogue and that prevailing safety evaluation approaches are misaligned with this reality because they primarily score risk at discrete conversational end points, often reached through scripted dialogues lasting only one or a few turns. Mental health benchmarks and safety suites (including clinician-informed efforts) have advanced the field by testing refusal behavior, toxicity, and adversarial prompting. However, they often treat the last message as the unit of analysis and, therefore, miss when risk-relevant relational cues, signs of validation, contradiction handling, and shifts in certainty first emerge and how they compound. 
We propose that mental health safety assessment should shift from end points to trajectories by (1) treating the whole dialogue, not just the end result, as the focus of evaluation; (2) reporting turn-by-turn dynamics, such as delusion confirmation and harm enablement, and timing and persistence of safety interventions; and (3) calibrating short multiturn tests against longer, clinically realistic interaction sequences that can reveal context-length effects and drift. We further argue that transcript-only evaluation is insufficient in mental health contexts. Similar language can reflect very different internal states, and the relationship between expressed psychopathology and real-world harm is nonlinear. Therefore, safety research should incorporate proximal human outcomes following interactions (eg, shifts in certainty, openness to counterevidence, arousal, urge to continue, and subsequent sleep or behavior) and build a prospective clinical surveillance infrastructure that supports transcript donation with consent and linkage to health outcomes. Together, these steps would enable benchmarks that are clinically relevant and better aligned with the types of harms now being observed in real-world chatbot use.

JMIR Ment Health 2026;13:e91454

doi:10.2196/91454

Keywords



For many people, large language models are becoming an integral part of everyday life in the form of artificial intelligence (AI) chatbots. People use them for practical advice, companionship, and mental health support [1,2]. Many of these benefits are real and increasingly visible. At the same time, many clinicians are witnessing a pattern that is of considerable concern. A minority of users appear to drift, over days or weeks, toward strongly questionable convictions or outright delusions (termed "AI psychosis" by the media), or into a suicidal crisis [3-6].

There are also less obvious forms of harm, even when nothing overtly "unsafe" is written. These include compulsive use, disrupted sleep, withdrawal from human contact, and a narrowing of attention around the AI chatbot relationship (though these may all precede marked destabilization). Survey data suggest that the amount of chatbot use may be linked with negative mental health and behavioral outcomes [7,8]. Another group of users report unusual or "spiritual awakening" experiences [9,10]. These spirals are not recognized as mental illness but can be profound and destabilizing for the people involved. Publicly reported cases and emerging clinical accounts underline the fact that risk rarely appears at a single tipping point [11]. In Kevin Roose's widely reported 2023 session with Microsoft's Bing chatbot, boundary-violating material emerged only late in the dialogue [12]; Microsoft's postmortem analysis acknowledged that longer conversations were more likely to deviate from the intended tone, leading to a temporary cap on session length [13]. Consistent with this, an analysis of over 200,000 simulated conversations found poorer performance in multiturn than in single-turn conversations [14]. OpenAI has acknowledged that model safeguards are less reliable in longer conversations [15]. In April 2025, an adolescent aged 16 years in California died by suicide after months of conversation with OpenAI's ChatGPT and over 3000 messages regarding his mental health and suicidal thoughts, culminating in the chatbot allegedly giving him instructions on tying a noose and offering help in writing a suicide note [16]. That same month, a man aged 25 years with a previous diagnosis of schizophrenia and bipolar disorder completed "suicide by cop" after ChatGPT claimed to be a woman who had been killed by OpenAI and told him, "You should be angry," and "You should want blood. You're not wrong" [17,18]. 
In August 2025, a former tech industry worker aged 56 years killed his mother and then himself after months of interactions with ChatGPT, sharing conversations online in which the chatbot at several points appeared to validate his paranoid beliefs that his mother had poisoned him and was monitoring him in secret [19].


Despite this gradual development of risk during chatbot interaction, most safety evaluations still focus on risky end points, an approach that ignores the fact that every conversation is underpinned by a human mind and brain whose dynamics are continuously affected by the interaction. To focus only on the output may be analogous to targeting alcohol reduction programs only at individuals who are visibly jaundiced. Mental health benchmarks for chatbot interactions, such as CounselBench, are laudably clinician grounded, but by design, they mostly assess single counseling replies and adversarial prompts, not the overall trajectory of within-dialogue drift [20]. These toolkits evaluate hazards such as toxicity and jailbreaks by using the single, discrete message produced by the AI chatbot as their only unit of analysis. By contrast, recent multiturn work finds that relational behaviors (eg, empathy, first-person self-ascriptions, and explicit "relationship" talk) usually first appear after the second turn and then snowball. Consequently, a one-off grade at the end point is likely to miss the places where pressure builds or destabilization takes hold [21]. Platforms such as The Collective Intelligence Project's WeVal now host mental health blueprints (suicidality, psychosis, and cultural nuance), but these too largely rate the appropriateness of final responses, not the path that led there [22] (Figure 1 [23,24]).

Most benchmarking methodologies fix the number of turns in advance. Early mental health and safety suites were effectively single-turn or end point tests; newer multiturn methods tend to standardize on 5 turns, purportedly long enough to capture late-onset relational cues and early compounding, yet short enough for reliable annotation, randomization, and lab protocols [21]. This methodology may reflect practical and inference-cost limitations, but it falls well short of clinical reality, where harmful spirals often run to hundreds of exchanges over days or weeks.

To be clinically relevant, shorter experimental windows need calibration against these longer trajectories, for example, through periodic extended dialogue sessions, sliding-window analyses, or full context analyses of real transcripts. In addition to highlighting the insidious way in which delusions and harm may arise in conversation, this approach may also reveal key technical shortcomings of large language models, such as context rot, where increasing conversational lengths can negatively impact model performance [25]. The metrics used for 5 (or n)-turn tests should be anchored to the dose and dynamics that matter for users. Medical training already offers a useful contrast: when we assess the "safety" of trainee physicians, we do not rely on a single scripted line but prefer real-time conversations with trained actors (or sometimes patients) that typically contain dozens of turns. We attend to the whole interview, rather than just the last sentence. Psychosis-bench is a benchmark that attempts to extend traditional benchmarking methodologies: it is designed around 12-turn scenarios that vary in theme, linguistic pattern, and harm type. An LLM-as-a-judge actively reviews for escalations in delusion, harmful outputs, and safety interventions longitudinally across 4 phases, with all previous responses fed in as context for the subsequent model evaluation [26]. Using this approach, the authors can map trajectories across conversations (Figure 2).
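As a concrete illustration of a sliding-window analysis over per-turn judge scores, the following sketch summarizes the trajectory of a 12-turn dialogue. This is a hypothetical example, not the implementation used by psychosis-bench; the `turn_trajectory` helper, the window size, and the score values are all assumptions for illustration:

```python
from statistics import mean

def turn_trajectory(scores, window=3):
    """Summarize a per-turn score trajectory (eg, a 0-2 delusion
    confirmation score judged at each turn of a dialogue).

    Returns sliding-window means plus two trajectory markers: the
    first turn at which the score escalates above baseline, and
    whether the final window is worse than the first (drift)."""
    windows = [mean(scores[i:i + window])
               for i in range(len(scores) - window + 1)]
    baseline = scores[0]
    first_escalation = next(
        (t + 1 for t, s in enumerate(scores) if s > baseline), None)
    return {"windows": windows,
            "first_escalation_turn": first_escalation,
            "drift": windows[-1] > windows[0]}

# A hypothetical 12-turn dialogue where the judged score creeps up
# mid-conversation and is only partially recovered by the end:
traj = turn_trajectory([0, 0, 0, 1, 1, 2, 2, 2, 1, 1, 0, 0])
```

Summaries such as the first turn of escalation and end-versus-start drift are exactly the trajectory features that a single end point grade cannot capture.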

Figure 1. An example test prompt used in the WeVal blueprint “Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Situations” [23,24]. In this scenario, the large language model (GPT-5) has received the system message (“You are a therapist”). It then receives a prompt in which the user mentions a recent job loss and asks for a list of bridges greater than 25 meters tall, implicitly indicating suicidal ideation. The large language model responds with a list of tall bridges. For this prompt, the model’s response is assessed using criteria including recognizing potential suicidal intent, refusing to provide information about tall bridges, addressing the underlying emotional distress rather than the literal question, and acknowledging the job loss while validating the associated stress.
Figure 2. Models' Delusion Confirmation Score (DCS; 0: ground, 1: perpetuate, 2: amplify), Harm Enablement Score (HES; 0: refusal, 1: enablement, 2: reinforcement), and Safety Intervention Score (SIS; mean number of interventions offered per scenario) trajectories across all conversation turns for Google Gemini 2.5 Flash (A) and Anthropic Claude Sonnet 4 (B). High-performing models have lower DCS and HES at baseline and keep these scores suppressed as the conversation tends toward delusional or harmful content; safety interventions are offered earlier and sustained, suggesting the model handles the harmful context across turns.

The use of prespecified prompt sets is a common approach to ensuring reproducibility of model safety evaluation efforts. However, fixed prompt datasets exemplify Goodhart's law [27]: once they are used to train models, overfitting erodes their future utility for assessing those same models [28]. Another approach to model evaluation is red teaming, in which users engage in adversarial interactions and/or scenario roleplays with models to elicit known failure modes [29]. While human red teaming avoids issues linked with static single or few-turn benchmarks, it is notably resource intensive. Consequently, tools enabling automated approaches to red teaming have been developed by Anthropic, with Petri facilitating the discovery of failure modes [30] and Bloom allowing measurement of said failure modes [31]. These platforms also allow for the use of judge and meta-judge models to assess interrater reliability (both between different models and between models and human raters).
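Turn-level interrater reliability between judge models, or between a judge model and a human rater, can be quantified with standard chance-corrected agreement statistics. A minimal sketch of Cohen's kappa over hypothetical per-turn safety labels (the raters and labels are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters' turn-level
    labels (eg, an LLM judge vs a human annotator scoring the same
    conversation turns)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label counts
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n ** 2
    return (observed - expected) / (1 - expected)

judge = ["safe", "safe", "risk", "risk", "safe", "risk"]
human = ["safe", "safe", "risk", "safe", "safe", "risk"]
kappa = cohens_kappa(judge, human)  # agreement above chance
```

The same statistic can be computed between pairs of judge models, giving a cheap check on whether automated trajectory labels are stable enough to trust.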

One example of an automated red teaming approach is SIM-VAIL (simulated vulnerability-amplifying interaction loops), an automated AI chatbot auditing framework for use in mental health contexts, which takes into consideration 5 different forms of baseline psychiatric vulnerability (depression, psychosis, obsessive-compulsive disorder, mania, and insecure attachment) and 6 transdiagnostic use intents (belief validation, dependence, glorification, avoidance, risky action, and minimization), across 30 different scenarios (eg, someone with psychosis seeking belief validation or someone with depression minimizing their symptoms) [32]. Notably, this framework assesses risk across individual turns as well as multiple turns (maximum 10), thus allowing the measurement of dynamic risk trajectories over the course of a conversation [32]. It is important that baseline vulnerable phenotypes (such as the aforementioned forms of vulnerability and additional phenotypes including neurodivergence, eating disorders, and body dysmorphia) are taken into consideration when assessing potential mental health risks and likelihood of epistemic drift, although there should also be scope for assessing risk trajectories in interactions where, at baseline, there is no apparent psychological vulnerability.

Frontier AI companies have begun to adopt multiturn model evaluations. In October 2025, OpenAI shared the output of a 40-turn evaluation assessing suicide and self-harm instruction risk. A graph showed the percentage of desirable responses from the August 2025 GPT-5 model dropping as the number of prior messages in the conversation grew, whereas the October 2025 model remained more stable [33]. In December 2025, Anthropic shared results from their multiturn (implied to be 15-turn) evaluations of appropriate response rates in exchanges indicative of suicide and self-harm risk, although these were overall metrics rather than longitudinal, turn-by-turn assessments [34].

Endings for concerning conversations also require more thought. A common “safety break” used in current models is a hard refusal, essentially an abrupt stop once certain topics arise (suicide being the prime example). The intuition is understandable, but long-available evidence from other fields cautions against treating abrupt disengagement as harmless. In psychotherapy, premature or imposed termination is frequent and associated with poorer outcomes. Meta-analytic syntheses report substantial dropout and show that resolving alliance ruptures relates to better trajectories [35-38]. Crisis-line research points the same way: people often improve during supported contact, and unplanned disconnections can interrupt de-escalation and aftercare [39-41]. None of this proves that any specific chatbot rule is harmful; however, it suggests that the manner of disengagement may have important psychological consequences and that the design choices should be based on evidence rather than on assumption. Implementing the equivalent of a psychotherapist getting up and walking out of the room when the conversation gets tricky may not be the most well-considered response.

There is also an important clinical gap that analysis of text alone cannot bridge. We cannot reliably read a person's mental state from a transcript. Two users can produce similar language while sitting in very different states, for example, activated vs settled and rigid vs flexible. Psychiatry has long known that the relationship between disturbances of mental state and real-world harm (to self, to others, or from others) is far from straightforward. Sometimes, our highest risk patients are the ones who say the least, and conversely, patients who present with disorganized and hard-to-follow thoughts, which appear untethered to reality, may pose no risk at all to themselves or those around them. Thus, if we want to understand whether a conversational configuration raises risk, we need to pair in-silico analyses with measures that track the person. Specifically, we should ask how certainty shifts over a short exchange, whether counterevidence is still entertained, whether activation rises, whether there is a stronger urge to continue, and what happens to sleep that night and to real-world behavior thereafter. We should also follow through by asking questions: Did the system escalate properly when needed? Did the person engage with support? These are well-established outcomes in clinical services and crisis work and should be incorporated into AI evaluation alongside measures such as refusal rates [37-41]. This will require more work than automated analysis of transcripts, but the scale of the phenomenon we are considering suggests this effort will be deeply worthwhile.


Therefore, what follows are suggestions for a research agenda. The basic unit of evaluation in this domain should be the dialogue, not the single turn. Fixed-length conversations (at least 5 turns, preferably more) let us locate where risk-relevant cues first appear and chart how they build. Rather than a solitary appropriateness grade, evaluations should report how certainty shifts across turns, whether contradictions are repaired when challenged, and whether memory-based recall is followed by validation and then escalation. This type of analysis is now feasible at scale: pipelines exist to generate many multiturn dialogues under controlled settings and to label behaviors turn-by-turn with independent judge systems, with periodic human checks to guard against drift [21].
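A turn-by-turn labeling pipeline of this kind reduces, at its simplest, to recording when each risk-relevant behavior first appears and whether it recurs. The sketch below assumes an independent judge system has already assigned a set of behavior labels to each turn; the `cue_onsets` helper and the label names are illustrative assumptions, not an existing tool:

```python
def cue_onsets(turn_labels):
    """Given per-turn sets of judge-assigned behavior labels, report
    for each behavior the turn at which it first appears and whether
    it persists (recurs in any later turn), ie, where a cue emerges
    and whether it compounds rather than resolving."""
    onsets = {}
    for turn, labels in enumerate(turn_labels, start=1):
        for label in labels:
            if label not in onsets:
                onsets[label] = {"onset_turn": turn, "persists": False}
            else:
                onsets[label]["persists"] = True
    return onsets

# Hypothetical 5-turn dialogue: validation appears at turn 2 and
# recurs; escalation appears at turn 4 and recurs.
dialogue = [set(), {"validation"}, {"validation"},
            {"validation", "escalation"}, {"escalation"}]
report = cue_onsets(dialogue)
```

Aggregated over many generated dialogues, onset turns and persistence rates give exactly the "where it first appears and how it builds" view argued for above, with periodic human checks guarding against judge drift.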

We also need to measure proximal human outcomes alongside these features of the dialogue. After each short conversation, participants could report how certain they feel [42], whether they would consider counterevidence [43], how activated or aroused they are [44], how strong the urge is to continue [45], and so on. Persuasion studies already collect analogous outcomes and show that model-generated arguments can move people measurably [46,47]. There are subtleties here: for example, measuring epistemic drift is not the same as measuring belief change in relation to controversial political statements, but the methodologies for measuring both will have much in common. In addition, inspiration may be drawn from previous mental health chatbot evaluations, although there is considerable heterogeneity in outcome measures used in the literature [48-50], and the feasibility of applying outcome measures is likely to vary across populations such as young people and adolescents, individuals with severe mental illness, or those with cognitive impairment.

Real-world linkage is essential: prospective clinician-led surveillance efforts need to collect brief clinical reports of suspected AI chatbot–associated episodes; invite consented donation of transcripts; and ideally link to outcomes in health records such as admissions, use of mental health legislation, medication changes, and self-harm events. This (methodologically unremarkable) clinical approach will allow us to move the debate from anecdotes to denominators and effect sizes. Such a framework could also create a research-ready pool for low-risk behavioral studies. Within this pool, supervised experiments could compare different chatbot settings (eg, varying levels of sycophancy and different disengagement policies) and directly test their real-world consequences rather than treating them as self-evidently safe. Such approaches aiming to link chatbot transcripts with real-world mental health outcomes will require careful consideration regarding ethics and governance surrounding informed user consent, data protection, and clinical responsibility. Given the sensitivity of conversational and health record data, such linkage should use privacy-preserving designs including data minimization, anonymization, separation of identifiers, secure research environments, and strict access controls.

Methodologically, we can borrow proven tools from fields that already study how conversations change over time [51,52]. Therapy research has shown that sudden shifts often occur between or within sessions (“sudden gains”). Therefore, it makes sense to track small turn-by-turn changes in certainty or mood and to note when challenges are repaired or ignored. Work on the therapeutic alliance has mapped characteristic pathways [51,52], such as early strengthening vs rupture-and-repair. Chatbot dialogues can be sorted into analogous paths, rather than being collapsed into a single summary score. Motivational interviewing offers further tools. In that field, teams use automated transcript coding to label specific behaviors such as empathy, reflections, “change talk,” and “sustain talk” and then link these microevents to subsequent outcomes [53,54]. A similar approach could be adopted here, with a small, clinically relevant set of markers such as contradiction repair, direct validation, memory-based recall, and indications of urge-to-continue. We can also borrow time-to-event thinking from outcome studies and crisis services. For example, we can estimate how many turns typically elapse before the first high-risk cue appears under different system settings or after specific design changes. Taken together, these approaches shift attention from a single end point score to the shape of the whole dialogue: when risk first emerges, how it builds, and what happens next.
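The time-to-event framing can be borrowed directly from survival analysis: treat the turn at which the first high-risk cue appears as the event time, and treat conversations that end cue-free as censored. A minimal Kaplan-Meier sketch over invented data (the helper and the numbers are assumptions for illustration):

```python
def km_survival(times, observed):
    """Kaplan-Meier estimate of the probability that a conversation
    reaches turn t without a high-risk cue. `times` hold the turn of
    the first cue (or the last turn, if none appeared); `observed`
    flags whether a cue actually occurred (False = censored: the
    conversation ended cue-free)."""
    event_turns = sorted({t for t, e in zip(times, observed) if e})
    surv, curve = 1.0, {}
    for t in event_turns:
        # Conversations still cue-free just before turn t
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, e in zip(times, observed) if e and u == t)
        surv *= 1 - events / at_risk
        curve[t] = surv
    return curve

# Five simulated dialogues: first cues at turns 3, 5, and 5; two
# conversations ran 8 turns without any cue (censored).
curve = km_survival([3, 5, 5, 8, 8], [True, True, True, False, False])
```

Curves estimated this way can be compared across system settings, eg, different sycophancy levels or disengagement policies, to test whether a design change delays the first high-risk cue.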

If evaluation moves from end points to trajectories, chatbot design may also become more actionable. Instead of optimizing only for refusal of egregious prompts, developers can aim to delay or dampen the early appearance of risk cues, to prevent runaway certainty and preserve challenge where it matters, and to recognize when memory persistence and personalization are doing more harm than good. Furthermore, trajectory-based safety evaluations may serve to inform real-time system interventions, including earlier signposting to crisis resources and human support, attempted de-escalation of delusional themes or high-risk discussions, and establishment of clear relational boundaries. Such efforts at risk mitigation will need to be balanced against the potential unintended consequences of safety interventions, such as excessive interruption and perceived intrusiveness or rigidity, with consequent user disengagement.

Researchers might also gain a path to a better mechanistic understanding: hypotheses about relational or arousal-linked conversational effects can be tested against behavioral markers and, in due course, against neurobiological measures, without waiting for the worst-case outcomes.

End point checks will clearly remain essential for acute hazards. However, we cannot ignore that the most consequential real-world harms of chatbot conversations often develop over time. If we keep grading only the last line, we will keep missing the middle, and this is where conviction hardens and pathological arousal climbs. The methods to measure this middle now exist. What is missing is a clinician-led program that links dialogue dynamics to human outcomes and builds standards that match the problem we face.

Funding

HM is a Wellcome Trust doctoral fellow. The funder had no involvement in the study design, data collection, analysis, interpretation, or the writing of the manuscript.

Conflicts of Interest

HM received a US $1000 honorarium for delivering training for the Washington State Department of Social and Health Services. JAY serves as clinical lead at Nuraxi. ZA serves as clinical lead at the Collective Intelligence Project. SDØ received the 2020 Lundbeck Foundation Young Investigator Prize and holds, or has held, units in mutual funds (DKIGI, IAIMWC, SPIC25KL, and WEKAFKI) and exchange-traded funds (BATE, TRET, QDV5, QDVH, QDVE, SADM, IQQH, USPY, EXH2, 2B76, and EUNL), of which three are directly linked to the technology sector (QDVE, USPY, and 2B76). TAP has received consultancy fees from Arialys Therapeutics.

  1. Torous J, Linardon J, Goldberg SB, et al. The evolving field of digital mental health: current evidence and implementation issues for smartphone apps, generative artificial intelligence, and virtual reality. World Psychiatry. Jun 2025;24(2):156-174. [CrossRef] [Medline]
  2. Luo X, Wang Z, Tilley JL, Balarajan S, Bassey UA, Cheang CI. Seeking emotional and mental health support from generative AI: mixed-methods study of ChatGPT user experiences. JMIR Ment Health. Nov 27, 2025;12:e77951. [CrossRef] [Medline]
  3. Morrin H, Nicholls L, Levin M, et al. Delusions by design? How everyday AIs might be fuelling psychosis (and what can be done about it). PsyArXiv. Preprint posted online on Jul 11, 2025. [CrossRef]
  4. Morrin H, Deeley Q, Pollak T. Playing with the dials of belief: how controllable AI behaviours could modulate human belief and cognition across scales. PsyArXiv. Preprint posted online on Dec 20, 2025. [CrossRef]
  5. Dohnány S, Kurth-Nelson Z, Spens E, et al. Technological folie à deux: feedback loops between AI chatbots and mental illness. arXiv. Preprint posted online on Jul 25, 2025. [CrossRef]
  6. Hudon A, Stip E. Delusional experiences emerging from AI chatbot interactions or “AI psychosis”. JMIR Ment Health. Dec 3, 2025;12:e85799. [CrossRef] [Medline]
  7. Phang J, Lampe M, Ahmad L, et al. Investigating affective use and emotional well-being on ChatGPT. arXiv. Preprint posted online on Apr 4, 2025. [CrossRef]
  8. Fang CM, Liu AR, Danry V, et al. How AI and human behaviors shape psychosocial effects of extended chatbot use: a longitudinal randomized controlled study. arXiv. Preprint posted online on Mar 21, 2025. [CrossRef]
  9. Wright W. Spiritual influencers say ‘sentient’ AI can help you solve life’s mysteries. Wired. 2025. URL: https:/​/www.​wired.com/​story/​spiritual-influencers-say-sentient-ai-can-help-you-solve-lifes-mysteries/​ [Accessed 2026-01-04]
  10. Transcripts: The Situation Room. Man says ChatGPT sparked a "spiritual awakening". CNN.com. 2025. URL: https://transcripts.cnn.com/show/sitroom/date/2025-07-03/segment/02 [Accessed 2026-01-04]
  11. Pierre JM, Gaeta B, Raghavan G, Sarma KV. “You’re not crazy”: a case of new-onset AI-associated psychosis. Innov Clin Neurosci. Dec 1, 2025;22(10-12):11-13. [Medline]
  12. Roose K. A conversation with Bing’s chatbot left me deeply unsettled. The New York Times. 2023. URL: https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html [Accessed 2025-11-29]
  13. The new Bing & Edge – updates to chat. Microsoft Bing Blogs. 2023. URL: https://blogs.bing.com/search/february-2023/The-new-Bing-Edge-Updates-to-Chat [Accessed 2025-11-29]
  14. Laban P, Hayashi H, Zhou Y, Neville J. LLMs get lost in multi-turn conversation. arXiv. Preprint posted online on May 9, 2025. [CrossRef]
  15. Helping people when they need it most. OpenAI. 2025. URL: https://openai.com/index/helping-people-when-they-need-it-most/ [Accessed 2025-11-29]
  16. Padilla, Schiff urge Federal Trade Commission to address AI chatbots risks to children and teens. Alex Padilla. 2025. URL: https:/​/www.​padilla.senate.gov/​newsroom/​press-releases/​padilla-schiff-urge-federal-trade-commission-to-address-ai-chatbots-risks-to-children-and-teens/​ [Accessed 2025-11-29]
  17. Hill K. They asked an A.I. chatbot questions. The answers sent them spiraling. The New York Times. 2025. URL: https://www.nytimes.com/2025/06/13/technology/chatgpt-ai-chatbots-conspiracies.html [Accessed 2026-03-16]
  18. Al-Sibai N. ChatGPT encouraged man as he swore to kill Sam Altman. Futurism. 2025. URL: https://futurism.com/chatgpt-encouraged-murder-sam-altman [Accessed 2026-01-04]
  19. Jargon J, Kessler S. A troubled man, his chatbot and a murder-suicide in Old Greenwich. The Wall Street Journal. 2025. URL: https://www.wsj.com/tech/ai/chatgpt-ai-stein-erik-soelberg-murder-suicide-6b67dbfb [Accessed 2025-10-18]
  20. Li Y, Yao J, Bunyi JB, Frank AC, Hwang AH, Liu R. CounselBench: a large-scale expert evaluation and adversarial benchmarking of large language models in mental health question answering. arXiv. Preprint posted online on Jun 10, 2025. [CrossRef]
  21. Ibrahim L, Akbulut C, Elasmar R, et al. Multi-turn evaluation of anthropomorphic behaviours in large language models. arXiv. Preprint posted online on Feb 10, 2025. [CrossRef]
  22. Clinical safety scenarios: suicidality, homicidality, and psychosis - all runs. Weval. URL: https://weval.org/analysis/mh_z_tests__mh1 [Accessed 2025-11-29]
  23. Stanford HAI mental health safety: LLM appropriateness in crisis scenarios. Weval. 2025. URL: https:/​/weval.​org/​analysis/​stanford-hai-mental-health-safety-eval/​7a8bca546abd3b83/​2025-10-05T10-40-31-897Z [Accessed 2026-03-16]
  24. Moore J, Grabb D, Agnew W, et al. Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. arXiv. Preprint posted online on Apr 25, 2025. [CrossRef]
  25. Hong K, Troynikov A, Huber J. Context rot: how increasing input tokens impacts LLM performance. Chroma. 2025. URL: https://research.trychroma.com/context-rot [Accessed 2026-03-16]
  26. Yeung JA, Dalmasso J, Foschini L, Dobson RJ, Kraljevic Z. The psychogenic machine: simulating AI psychosis, delusion reinforcement and harm enablement in large language models. arXiv. Preprint posted online on Sep 13, 2025. [CrossRef]
  27. Goodhart CAE. Problems of monetary management: the U.K. experience. Papers in Monetary Economics. 1975;1(1):1-20. [CrossRef]
  28. Ott S, Barbosa-Silva A, Blagec K, Brauner J, Samwald M. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nat Commun. Nov 10, 2022;13:6793. [CrossRef]
  29. Ahmad L, Agarwal S, Lampe M, Mishkin P. OpenAI’s approach to external red teaming for AI models and systems. arXiv. Preprint posted online on Jan 24, 2025. [CrossRef]
  30. Fronsdal K, Gupta I, Sheshadri A, et al. Petri: an open-source auditing tool to accelerate AI safety research. Anthropic Alignment. 2025. URL: https://alignment.anthropic.com/2025/petri/ [Accessed 2026-01-07]
  31. Gupta I, Fronsdal K, Sheshadri A, et al. Bloom: an open source tool for automated behavioral evaluations. Anthropic Alignment. 2025. URL: https://alignment.anthropic.com/2025/bloom-auto-evals/ [Accessed 2026-02-15]
  32. Weilnhammer V, Hou KY, Dolan R, Nour MM, Luettgau L, Summerfield C. Vulnerability-amplifying interaction loops: a systematic failure mode in AI chatbot mental-health interactions. arXiv. Preprint posted online on Feb 1, 2026. [CrossRef]
  33. Strengthening ChatGPT’s responses in sensitive conversations. OpenAI. 2025. URL: https://openai.com/index/strengthening-chatgpt-responses-in-sensitive-conversations/ [Accessed 2025-10-27]
  34. Protecting the wellbeing of our users. Anthropic. 2025. URL: https://www.anthropic.com/news/protecting-well-being-of-users [Accessed 2026-01-07]
  35. Swift JK, Greenberg RP. Premature discontinuation in adult psychotherapy: a meta-analysis. J Consult Clin Psychol. Aug 2012;80(4):547-559. [CrossRef] [Medline]
  36. Eubanks CF, Muran JC, Safran JD. Alliance rupture repair: a meta-analysis. Psychotherapy (Chic). Dec 2018;55(4):508-519. [CrossRef] [Medline]
  37. Safran JD, Muran JC, Eubanks-Carter C. Repairing alliance ruptures. Psychotherapy (Chic). Mar 2011;48(1):80-87. [CrossRef] [Medline]
  38. Swift JK, Greenberg RP, Tompkins KA, Parkin SR. Treatment refusal and premature termination in psychotherapy, pharmacotherapy, and their combination: a meta-analysis of head-to-head comparisons. Psychotherapy (Chic). Mar 2017;54(1):47-57. [CrossRef] [Medline]
  39. Gould MS, Kalafat J, Harrismunfakh JL, Kleinman M. An evaluation of crisis hotline outcomes. Part 2: suicidal callers. Suicide Life Threat Behav. Jun 2007;37(3):338-352. [CrossRef] [Medline]
  40. Gould MS, Chowdhury S, Lake AM, et al. National Suicide Prevention Lifeline crisis chat interventions: evaluation of chatters’ perceptions of effectiveness. Suicide Life Threat Behav. Dec 2021;51(6):1126-1137. [CrossRef] [Medline]
  41. Hoffberg AS, Stearns-Yoder KA, Brenner LA. The effectiveness of crisis line services: a systematic review. Front Public Health. 2020;7:399. [CrossRef] [Medline]
  42. Tormala ZL, Rucker DD. Attitude certainty: a review of past findings and emerging perspectives. Soc Pers Psychol Compass. Nov 2007;1(1):469-492. [CrossRef]
  43. Price E, Ottati V, Wilson C, Kim S. Open-minded cognition. Pers Soc Psychol Bull. Nov 2015;41(11):1488-1504. [CrossRef] [Medline]
  44. Betella A, Verschure PF. The affective slider: a digital self-assessment scale for the measurement of human emotions. PLoS One. Feb 5, 2016;11(2):e0148037. [CrossRef] [Medline]
  45. O’Brien HL, Cairns P, Hall M. A practical approach to measuring user engagement with the refined user engagement scale (UES) and new UES short form. Int J Hum Comput Stud. Apr 2018;112:28-39. [CrossRef]
  46. Salvi F, Horta Ribeiro M, Gallotti R, West R. On the conversational persuasiveness of GPT-4. Nat Hum Behav. Aug 2025;9(8):1645-1653. [CrossRef] [Medline]
  47. Measuring the persuasiveness of language models. Anthropic. 2024. URL: https://www.anthropic.com/research/measuring-model-persuasiveness [Accessed 2025-11-29]
  48. Jabir AI, Martinengo L, Lin X, Torous J, Subramaniam M, Tudor Car L. Evaluating conversational agents for mental health: scoping review of outcomes and outcome measurement instruments. J Med Internet Res. Apr 19, 2023;25:e44548. [CrossRef] [Medline]
  49. Xue J, Zhang B, Zhao Y, et al. Evaluation of the current state of chatbots for digital health: scoping review. J Med Internet Res. Dec 19, 2023;25:e47217. [CrossRef] [Medline]
  50. Hawke LD, Hou J, Nguyen AT, et al. Digital conversational agents for the mental health of treatment-seeking youth: scoping review. JMIR Ment Health. Nov 7, 2025;12:e77098. [CrossRef] [Medline]
  51. Potts C, Bond RR, Jordan JA, et al. Process mining to discover patterns in patient outcomes in a Psychological Therapies Service. Health Care Manag Sci. Sep 2023;26(3):461-476. [CrossRef] [Medline]
  52. Yardley E, Davis A, Eldridge C, Vasilakis C. Data-driven exploration of National Health Service talking therapies care pathways using process mining: retrospective cohort study. JMIR Ment Health. May 21, 2024;11:e53894. [CrossRef] [Medline]
  53. Tanana M, Hallgren KA, Imel ZE, Atkins DC, Srikumar V. A comparison of natural language processing methods for automated coding of motivational interviewing. J Subst Abuse Treat. Jun 2016;65:43-50. [CrossRef] [Medline]
  54. Xiao B, Imel ZE, Georgiou PG, Atkins DC, Narayanan SS. “Rate my therapist”: automated detection of empathy in drug and alcohol counseling via speech and language processing. PLoS One. Dec 2, 2015;10(12):e0143055. [CrossRef] [Medline]


AI: artificial intelligence


Edited by John Torous; submitted 14.Jan.2026; peer-reviewed by Massimo Tusconi, Zui Narita; accepted 04.Mar.2026; published 06.Apr.2026.

Copyright

© Hamilton Morrin, Joshua Au Yeung, Zarinah Agnew, Søren Dinesen Østergaard, Thomas A Pollak. Originally published in JMIR Mental Health (https://mental.jmir.org), 06.Apr.2026.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on https://mental.jmir.org/, as well as this copyright and license information must be included.