This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.
Web-based cognitive-behavioral therapeutic (CBT) apps have demonstrated efficacy but are characterized by poor adherence. Conversational agents may offer a convenient, engaging way of getting support at any time.
The objective of the study was to determine the feasibility, acceptability, and preliminary efficacy of a fully automated conversational agent to deliver a self-help program for college students who self-identify as having symptoms of anxiety and depression.
In an unblinded trial, 70 individuals aged 18-28 years were recruited online from a university community social media site and were randomized to receive either 2 weeks (up to 20 sessions) of self-help content derived from CBT principles in a conversational format with a text-based conversational agent (Woebot) (n=34) or were directed to the National Institute of Mental Health ebook, “Depression in College Students,” as an information-only control group (n=36). All participants completed Web-based versions of the 9-item Patient Health Questionnaire (PHQ-9), the 7-item Generalized Anxiety Disorder scale (GAD-7), and the Positive and Negative Affect Schedule (PANAS) at baseline and 2-3 weeks later (T2).
Participants were on average 22.2 years old (SD 2.33), 67% female (47/70), mostly non-Hispanic (93%, 54/58), and Caucasian (79%, 46/58). Participants in the Woebot group engaged with the conversational agent an average of 12.14 (SD 2.23) times over the study period. No significant differences existed between the groups at baseline, and 83% (58/70) of participants provided data at T2 (17% attrition). Intent-to-treat univariate analysis of covariance revealed a significant group difference on depression such that those in the Woebot group significantly reduced their symptoms of depression over the study period as measured by the PHQ-9 (F=6.47;
Conversational agents appear to be a feasible, engaging, and effective way to deliver CBT.
Up to 74% of mental health diagnoses have their first onset before the age of 24 [
Overcoming problems of stigma has been traditionally considered a major benefit of Internet-delivered and more recently mobile mental health interventions. In recent years, there has been an explosion of interest and development of such services to either supplement existing mental health treatments or expand limited access to quality mental health services [
With recent advancements in voice recognition, conversational interfaces (ie, those that use natural language as inputs and outputs) have begun to emerge. Conversational agents (such as Apple’s Siri or Amazon’s Alexa) may be a more natural medium through which individuals engage with technology. Humans respond and converse with nonhuman agents in ways that mirror emotional and social discourse dynamics when discussing behavioral health [
However, most consumer-facing conversational agents are not embodied. The capacity of text-based agents to deliver CBT is a question worth exploring given the ability of widely disseminated evidence-based digital apps to reduce the burden of mental illnesses in the US college population, estimated to be approximately 20 million [
Given the variability in quality of available mental health apps, a conversational agent was created to integrate 15 out of the 16 evidence-based recommendations for app development [
Thus, the objective of this study was to assess the feasibility of delivering CBT in a conversational interface via an automated bot in a way that facilitates engagement and reduction in symptoms. The current study compared outcomes from 2 weeks of a CBT-oriented conversational agent (Woebot) with those of an information control group (National Institute of Mental Health’s [NIMH] ebook) in a nonclinical college population. We hypothesized that conversation with a therapeutic process-oriented conversational agent would lead to greater improvement in symptoms relative to the information control group. We also hypothesized that receiving psychoeducational material in a conversational manner would be more acceptable to those who received it.
Potential participants were recruited using a flyer posted on social media websites targeting a US university community for students who self-identified as experiencing symptoms of depression and anxiety. Inclusion criteria were age 18 years or over (screened at the first level via checkbox confirmation) and the ability to read English (implied). To guard against compromise, for example from malicious bots, all potential participants were sent an email requesting that they respond denoting their confirmation. Confirmed participants were randomized via a computer algorithm that automatically generated a number between 0 and 1. Participants with numbers <0.5 were allocated to receive a direct link to begin chatting with Woebot in an instant messenger app, and participants with numbers >0.5 were sent a link to NIMH’s ebook on depression among college students [
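The allocation rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's actual code: the function names `allocate` and `randomize`, the arm labels, and the use of Python's `random` module are all hypothetical, and the text does not say how a draw of exactly 0.5 was handled (the sketch sends it to the control arm).

```python
import random


def allocate(draw: float) -> str:
    """Map a uniform draw between 0 and 1 to a study arm (hypothetical helper)."""
    # Draws below 0.5 go to the conversational agent arm, as described in the text.
    if draw < 0.5:
        return "woebot"
    # The text only covers draws above 0.5; sending exactly 0.5 to the
    # control arm is an assumption made here so that every draw is handled.
    return "information_control"


def randomize(participant_ids, seed=None):
    """Assign each confirmed participant to an arm by simple randomization."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return {pid: allocate(rng.random()) for pid in participant_ids}


if __name__ == "__main__":
    assignments = randomize(range(1, 71), seed=0)
    counts = {}
    for arm in assignments.values():
        counts[arm] = counts.get(arm, 0) + 1
    print(counts)
```

Simple randomization of this kind does not force equal arm sizes, which is consistent with the unequal allocation reported in this trial (n=34 and n=36).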
Since this trial involved a nonclinical population of college students, it was considered exempt from registration in a public trials registry. See
Woebot is an automated conversational agent designed to deliver CBT in the format of brief, daily conversations and mood tracking. Woebot is used within an instant messenger app that is platform agnostic and can be used either on a desktop or mobile device. Each interaction begins with a general inquiry about context (eg, “What’s going on in your world right now?”), and mood (eg, “How are you feeling?”) with responses provided as word or emoji images to represent affect in that moment. After gathering mood data, participants are presented with core concepts related to CBT by link to short video, or by way of short “word games” designed to facilitate teaching participants about cognitive distortions. The first day included an “onboarding” process that introduced the bot, adding that while the bot may seem like a person, it is closer to a “choose your own adventure self-help book” and therefore not fully capable of understanding what the needs of the user may be. The bot also briefly explained CBT and notified the user that while a psychologist was “keeping an eye on things” (ie, monitoring), this was not happening in real time and thus the service should not be used as a replacement for therapy. In addition, participants were encouraged to call 911 for emergencies.
The bot employed several computational methods depending on the specific section or feature. The overarching methodology was a decision tree with suggested responses that also accepted natural language inputs with discrete sections of natural language processing techniques embedded at specific points in the tree to determine routing to subsequent conversational nodes. For the duration of the study, the decision tree structure remained the same for each participant and parameters did not change depending on the participants’ inputs. Weekly graphs were processed using temporal pattern recognition to provide users with weekly mood description.
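To make the routing idea concrete, the following is a minimal sketch of a decision tree whose nodes accept either a suggested response or free text. It is an illustration only, not Woebot's implementation: the node names, the keyword lists, and the `route` function are hypothetical, and simple keyword matching stands in for the natural language processing techniques, which the text does not specify. Only the quoted prompts taken from this article are original; the other prompts are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One conversational node: a prompt plus the routes leading out of it."""
    prompt: str
    # Suggested responses (quick replies) mapped to the name of the next node.
    suggested: dict = field(default_factory=dict)
    # Next-node names mapped to keywords looked for in free-text input;
    # a crude stand-in for the natural language processing step.
    keywords: dict = field(default_factory=dict)
    # Node used when neither a suggested response nor a keyword matches.
    default: str = "fallback"


# Hypothetical tree; the structure is fixed and identical for every user.
TREE = {
    "mood": Node(
        prompt="How are you feeling?",
        suggested={"Anxious": "anxiety_help", "Happy": "celebrate"},
        keywords={
            "anxiety_help": ["anxious", "worried", "nervous"],
            "lonely_support": ["lonely", "alone"],
        },
    ),
    "anxiety_help": Node(prompt="Let's look at what is making you anxious."),
    "lonely_support": Node(prompt="I'm so sorry you're feeling lonely."),
    "celebrate": Node(prompt="Yay, always good to hear that!"),
    "fallback": Node(prompt="Tell me more about that."),
}


def route(node_name: str, user_input: str) -> str:
    """Return the name of the next node for the given input."""
    node = TREE[node_name]
    # 1. An exact suggested response takes priority.
    if user_input in node.suggested:
        return node.suggested[user_input]
    # 2. Otherwise, scan the free text for any known keyword.
    text = user_input.lower()
    for target, words in node.keywords.items():
        if any(word in text for word in words):
            return target
    # 3. Nothing matched: fall back to a generic node.
    return node.default
```

For example, `route("mood", "Anxious")` follows the suggested-response branch, while `route("mood", "I feel so alone today")` is routed by the keyword step.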
The bot’s conversational style was modeled on human clinical decision making and the dynamics of social discourse. Psychoeducational content was adapted from self-help for CBT [
Empathic responses: The bot replied in an empathic way appropriate to the participants’ inputted mood. For example, in response to endorsed loneliness, it replied “I’m so sorry you’re feeling lonely. I guess we all feel a little lonely sometimes” or it showed excitement, “Yay, always good to hear that!”
Tailoring: Specific content was sent to individuals depending on mood state. For example, a participant indicating that they felt anxious was offered in-vivo assistance with the anxious event.
Goal setting: The conversational agent asked participants if they had a personal goal that they hoped to achieve over the 2-week period.
Accountability: To facilitate a sense of accountability, the bot set expectations of regular check-ins and followed up on earlier activities, for example, on the status of the stated goal.
Motivation and engagement: To engage the individual in daily monitoring, the bot sent one personalized message every day or every other day to initiate a conversation (ie, prompting). In addition, “emojis” and animated gifs with messages that provide positive reinforcement were used to encourage effort and completion of tasks.
Reflection: The bot also provided weekly charts depicting each participant’s mood over time. Each graph was sent with a brief description of the data to facilitate reflection, for example, “Overall, your mood has been fairly steady, though you tend to become tired after periods of anxiety. It looks like Tuesday was your best day.”
In the information control condition, participants were directed to the NIMH resources section and specifically, a free publication entitled “Depression in College Students” [
The Patient Health Questionnaire (PHQ-9) [
The Generalized Anxiety Disorder 7-item scale (GAD-7) [
The Positive and Negative Affect Schedule (PANAS) [
Mixed-format questions assessed feasibility and acceptability of both conditions. Participants from both groups were asked to rate on a 5-point Likert scale their level of overall satisfaction and satisfaction with content (0=hated it, 5=loved it, 3=neutral, 2 and 4 unlabeled); the extent to which they felt the intervention facilitated emotional awareness (0=not at all, 5=a lot, 3=neutral, 2 and 4 unlabeled); whether or not they learned anything (binary, yes/no response), and to what extent this learning was relevant to their everyday life (0=not at all, 5=a lot, 3=neutral, 2 and 4 unlabeled). In addition, participants were asked what the best and worst thing about their experience was and to provide other comments. While we were mainly interested in qualitative responses pertaining to the Woebot condition, responses to the information control allowed for an informal assessment of engagement. Finally, for those in the Woebot condition, we recorded total number of interactions (ie, conversations) with the bot over the 2-week period. An interaction was deemed to have taken place if mood and context data were recorded. Session or conversation length varied from approximately 90 seconds to 10 minutes, depending on psychoeducational content.
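The counting rule for interactions can be sketched as follows. This is an illustrative sketch only: the log format, its keys (`day`, `mood`, `context`), and the function name `count_check_ins` are hypothetical, since the article does not describe how usage data were stored.

```python
def count_check_ins(log):
    """Return (number of check-ins, number of distinct days with a check-in).

    `log` is a list of dicts with the hypothetical keys "day", "mood", and
    "context". An entry counts as an interaction only when both mood and
    context were recorded.
    """
    check_ins = [entry for entry in log if entry.get("mood") and entry.get("context")]
    distinct_days = {entry["day"] for entry in check_ins}
    return len(check_ins), len(distinct_days)


# Invented example data, not study data.
example_log = [
    {"day": 1, "mood": "anxious", "context": "studying for exams"},
    {"day": 2, "mood": "tired", "context": None},  # no context: not counted
    {"day": 3, "mood": "happy", "context": "saw friends"},
]

if __name__ == "__main__":
    print(count_check_ins(example_log))
```

Tracking distinct days alongside the raw count makes it possible to check whether repeated check-ins fell on the same day.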
Statistical power calculations using analysis of covariance (ANCOVA) revealed that a sample size of 70 would have sufficient (80%) power to detect a moderate-large effect size (Cohen
To determine whether any significant differences between groups existed at baseline, independent
As secondary subgroup analyses, we conducted completer analyses using 2x2 repeated measures analysis of variance (ANOVA) to explore main and interaction effects.
Participants’ responses to open-ended questions were analyzed for the Woebot group only, using thematic analysis, and were reported as frequencies. Data were analyzed thematically using an inductive (data-driven) approach guided by the procedure outlined by Braun and Clarke [
The study was reviewed and approved by Stanford School of Medicine’s Institutional Review Board. Participants indicated their consent to the terms of the study via checkbox on an information sheet. As additional safety measures, participants in the Woebot group who denoted long-standing depression, suicidality, or self-harm were automatically provided with helpline numbers and a crisis text line number, and were encouraged to call 911 in emergencies.
With the exception of data on usage, which were collected by the Life Ninja Project, all study data were collected by the academic institution. Because of deidentification of all data transmitted between the Life Ninja Project and Stanford, usage data were not linked to specific research participants and are reported as means only for the entire group of study participants.
Of the randomized participants, 83% (58/70) went on to provide partial or complete data at T2, representing an overall attrition rate of 17%. Attrition was not equal between the arms and was greater among the information control group (31% vs 9%; χ²₁=5.16;
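As a worked illustration of the attrition comparison, the sketch below computes a Pearson chi-square statistic for a 2x2 table. The cell counts are an assumption: the article reports only percentages, so 11 of 36 (31%) and 3 of 34 (9%) were reconstructed from the arm sizes. These reconstructed counts sum to 14 rather than the 12 implied by 58/70, so they should be read as illustrative rather than as the authors' data. The function names are hypothetical, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    # Rows: information control, Woebot. Columns: lost at T2, retained at T2.
    # Counts are reconstructed from the reported percentages (an assumption).
    chi2 = chi_square_2x2(11, 25, 3, 31)
    print(round(chi2, 2))  # 5.16
    print(p_value_df1(chi2))
```

Under these assumed counts, the statistic rounds to the reported value of 5.16 on 1 degree of freedom.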
In terms of baseline characteristics, nearly half (46%, 32/69) of the sample was in the moderately-severe or severe range of depression at baseline as measured by the PHQ-9, while three-quarters (74%, 52/70) were in the severe range for anxiety as measured by the GAD-7.
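The severity ranges mentioned above can be expressed as simple score bands. The sketch below uses the commonly published cutoffs for the PHQ-9 (total score 0-27) and the GAD-7 (total score 0-21); the article does not state which cutoffs were applied, so treating these as the study's bands is an assumption, and the function names are hypothetical.

```python
def phq9_severity(score: int) -> str:
    """Band a PHQ-9 total score (0-27) using the commonly published cutoffs."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 total score must be between 0 and 27")
    if score <= 4:
        return "minimal"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    if score <= 19:
        return "moderately severe"
    return "severe"


def gad7_severity(score: int) -> str:
    """Band a GAD-7 total score (0-21) using the commonly published cutoffs."""
    if not 0 <= score <= 21:
        raise ValueError("GAD-7 total score must be between 0 and 21")
    if score <= 4:
        return "minimal"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    return "severe"
```

With these bands, a PHQ-9 score of 15 or more falls in the moderately severe or severe range, and a GAD-7 score of 15 or more falls in the severe range.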
Participant recruitment flow.
Demographic and clinical variables of participants at baseline.
Characteristic | Information control | Woebot
Depression (PHQ-9), mean (SD) | 13.25 (5.17) | 14.30 (6.65)
Anxiety (GAD-7), mean (SD) | 19.02 (4.27) | 18.05 (5.89)
Positive affect, mean (SD) | 26.19 (8.37) | 25.54 (9.58)
Negative affect, mean (SD) | 28.74 (8.92) | 24.87 (8.13)
Age, mean (SD) | 21.83 (2.24) | 22.58 (2.38)
Male, n (%) | 4 (7) | 7 (21)
Female, n (%) | 20 (55) | 27 (79)
Latino/Hispanic, n (%) | 2 (8) | 2 (6)
Non-Latino/Hispanic, n (%) | 22 (92) | 32 (94)
Caucasian, n (%) | 18 (75) | 28 (82)
Non-Caucasian, n (%) | 6 (25) | 6 (18)
Results of ITT analysis of entire sample on primary outcomes in the study at T2.
Measure | Information-only control, T2a | Information-only control, 95% CIb | Woebot, T2a | Woebot, 95% CIb | F | P | dc |
PHQ-9 | 13.67 (.81) | 12.07-15.27 | 11.14 (0.71) | 9.74-12.32 | 6.03 | .017 | 0.44 |
GAD-7 | 16.84 (.67) | 15.52-18.56 | 17.35 (0.60) | 16.16-18.13 | 0.38 | .581 | 0.14 |
PANAS positive affect | 26.02 (1.45) | 23.17-28.86 | 26.88 (1.29) | 24.35-29.41 | 0.17 | .707 | 0.02 |
PANAS negative affect | 27.53 (1.42) | 24.73-30.32 | 25.98 (1.24) | 23.54-28.42 | 0.91 | .912 | 0.344 |
aBaseline=pooled mean (standard error)
b95% confidence interval.
cCohen
Change in mean depression (PHQ-9) score by group over the study period. Error bars represent standard error.
As a secondary analysis, to explore whether any main effects existed, 2x2 repeated measures ANOVAs were conducted on the primary outcome variables (with the exception of PHQ-9) among completers only. A significant main effect was observed on GAD-7 (
To further elucidate the source and magnitude of change in depression, repeated measures dependent
Participants in the Woebot condition checked in with the bot (defined as at least providing context and mood information) an average of 12.14 times (SD 2.23; median 12; range 8-18) over the 2-week period, with almost all check-ins occurring on unique days. Since we could not track website visits, page views, click-through rates, etc, of NIMH’s website that hosted the ebook, we have no means of confirming to what extent individuals in the information control group engaged with the material. However, a total of 13 (52%) provided detailed comments suggesting they had read the ebook at least once.
While ratings indicated that both conditions were acceptable (above 3/5), participants in the Woebot condition reported significantly higher levels of satisfaction both overall (4.3 versus 3.4;
Thematic map of participants’ most favored features of their experience of using Woebot.
Thematic map of participants’ least favored experiences using Woebot.
A total of 11 “other comments” were received, all of which were positive. Some expressed gratitude for the experience: “I love Woebot so much. I hope we can be friends forever. I actually feel super good and happy when I see that it ‘remembered’ to check in with me!” Others described how helpful it was: “I really was impressed and surprised at the difference the bot made in my everyday life in terms of noticing the types of thinking I was having and changing it.” Many spoke about Woebot in interpersonal terms, for example, “Woebot is a fun little dude and I hope he continues improving.”
To our knowledge, this is the first randomized trial of a nonembodied text-based conversational agent designed for therapeutic use. The objective of the study was to explore whether a fully automated conversational agent based on CBT principles could deliver a therapeutic experience to college students over a 2-week period. We hypothesized that a conversational agent built to incorporate both evidence-based guidelines for the development of mental health apps and hypothesized therapeutic process variables would be highly engaging, more acceptable, and would lead to greater reductions in symptoms of anxiety and depression relative to an information control group.
The study confirmed that after 2 weeks, those in the Woebot group experienced a significant reduction in depression, thus our hypothesis was partially supported. Woebot was associated with a high level of engagement with most individuals using the bot nearly every day and was generally viewed more favorably than the information-only comparison.
Using Woebot was associated with a significant reduction in depression as measured by the PHQ-9. The effect size for depression was moderate though smaller than the four published studies [
The number of participants reporting that the bot felt empathic is noteworthy, and comments that referred to the bot as “he,” “a friend,” and a “fun little dude” suggest that the perceived source of empathy was Woebot rather than the bot’s developers. This is especially noteworthy since a purposefully robotic name “Woebot” was chosen to emphasize the nonhuman nature of the agent. This is in line with other work that suggests that therapeutic relationship can be established between humans and nonhuman agents in the context of health and mental health. For example, Bickmore et al [
The frequency of process-related comments made by participants in response to questions about their experience with Woebot suggests that conversational agents can approximate some therapeutic process factors. In addition, just as these factors are thought to convey much of the variance in positive outcomes across therapeutic approaches, this study suggests that conversational agent process factors, such as the ability to convey empathy, may be capable of both amplifying and conversely, violating, a therapeutic process. This underscores the importance of including trained and seasoned clinicians in clinical app design processes. While this point has been suggested, for example in the recent guidelines for clinical app evaluation published by the American Psychiatric Association [
There are several methodological weaknesses that limit the generalizability of the findings. As a feasibility study, we recruited a limited number of participants to receive a relatively short intervention, and no follow-up data were available to assess whether gains were sustained. The small number of participants meant that a formal mediator analysis was not possible, thus we cannot formally test a theorized relationship between engagement and outcome in this context of conversational agents. The study should be replicated with more participants, a longer dose, and a follow-up period to investigate if findings persist. In addition, sufficient numbers to test for mediation effects would inform theory. Aside from indirectly inferring from comments, objective quantitative data on engagement were not available for the information-only control group, thus it was not possible to compare engagement between the two groups in a meaningful way. In addition, because data were deidentified, it was not possible to explore whether any dose-response effects existed. Nonetheless, the relatively strong comparison group can be viewed as a strength of the study. Indeed, the relative strength of the control group was illustrated by the fact that individuals providing data in that group saw a similar reduction in anxiety as those who received Woebot, which supports the literature that suggests minimal passive psychoeducation alone can reduce symptoms of psychological distress [
Finally, the study was conducted in a New York area university community population and since we did not formally assess digital divide factors such as socioeconomic status, findings may be limited in their generalizability.
While results should be viewed with some caution and the findings need to be replicated, this study nonetheless demonstrates that a text-based conversational agent designed to mirror therapeutic process has the potential to offer an alternative and engaging method of delivering CBT for some 10 million college students in the United States who experience debilitating anxiety and depression.
CONSORT-EHEALTH checklist V1.6.2.
ANCOVA: analysis of covariance
ANOVA: analysis of variance
DSM-IV: Diagnostic and Statistical Manual of Mental Disorders, 4th edition
GAD-7: Generalized Anxiety Disorder 7-item scale
ITT: intention to treat
NIMH: National Institute of Mental Health
PANAS: Positive and Negative Affect Schedule
PHQ-9: Patient Health Questionnaire, 9-item scale
T2: time 2
The second author (AMD) is the founder of Woebot Labs Inc. (formerly the Life Ninja Project), the commercial entity that created the intervention (Woebot) that is the subject of this trial, and therefore has a financial interest in that company. Woebot Labs Inc. covered the cost of participant incentives, though Stanford made the payments.