This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on https://mental.jmir.org/, as well as this copyright and license information must be included.
The measurement and monitoring of generalized anxiety disorder require frequent interaction with psychiatrists or psychologists. Access to mental health professionals is often difficult because of high costs or insufficient availability. The ability to assess generalized anxiety disorder passively and at frequent intervals could be a useful complement to conventional treatment and could help with relapse monitoring. Prior work suggests that higher anxiety levels are associated with measurable features of human speech. As such, monitoring speech using personal smartphones or other wearable devices may be a means of achieving passive anxiety monitoring.
This study aims to validate the association of previously suggested acoustic and linguistic features of speech with anxiety severity.
A large number of participants (n=2000) were recruited and participated in a single web-based study session. Participants completed the Generalized Anxiety Disorder 7-item scale assessment and provided an impromptu speech sample in response to a modified version of the Trier Social Stress Test. Acoustic and linguistic speech features were a priori selected based on the existing speech and anxiety literature, along with related features. Associations between speech features and anxiety levels were assessed using age and personal income as covariates.
Word count and speaking duration were negatively correlated with anxiety scores (
Both acoustic and linguistic speech measures are associated with anxiety scores. The amount of speech, acoustic quality of speech, and gender-specific linguistic characteristics of speech may be useful as part of a system to screen for anxiety, detect relapse, or monitor treatment.
Anxiety disorders are among the most common mental health issues, with an incidence of approximately 10% in the Canadian population [
In this work, we focused specifically on GAD [
This paper is organized as follows: the next section summarizes related work in anxiety detection. The
Although it is important to note that some scholarship is skeptical that biomarkers correlate with emotions [
McGinnis et al [
Özseven et al [
Weeks et al [
Laukka et al [
Albuquerque et al [
Wörtwein et al [
Hagenaars and van Minnen [
Di Matteo et al [
In a similar study that used LIWC features, Anderson et al [
Overall, previous work identifies several audio features that are correlated with anxiety. However, the results are mixed because of differences in participants recruited, speech measures assessed, statistical methods used, and amount of mood induction. In addition, the largest sample size among these studies was 112, which limits the potential for generalizability to the larger population, a necessary step before considering the deployment of technologies for passive anxiety monitoring. In this study, we recruited a substantially larger cohort (n=2000) to explore features of speech from previous findings at a greater scale.
Participants from a nonclinical population were recruited for a 10- to 15-minute task implemented through a custom website. Self-report measures of anxiety were collected once at the beginning of the study and at the end of each of 2 specific tasks. In the following subsections, we describe the recruitment of participants, the data collection procedure, and the assessment of anxiety and speech measures.
The study was approved by the University of Toronto Research Ethics Board (37584).
A total of 2000 participants were recruited using the Prolific [
Participants who completed the study were paid £2 (US $2.74). They were able to complete the entire study remotely, using their PCs.
Participants were presented with the opportunity to participate in this study on Prolific if they met the aforementioned inclusion criteria. Those who wished to participate clicked on the study link, which brought them to a consent form that described the procedure and goals of the study and also provided information on data privacy. After they gave consent, a hyperlink brought participants to an external web application (a screenshot of which is presented in
Participants were first asked to fill out the standard Generalized Anxiety Disorder 7-item scale (GAD-7) questionnaire [
For the first speech task (task 1), participants were asked to read aloud a specific passage titled
For the second speech task (task 2), the participant followed a modified version of the widely used TSST [
In this modified version of the TSST, participants were told to imagine that they were a job applicant for a job that they really want (their
It should be noted that in the original TSST [
Our goal was to examine possible correlations between features of speech and GAD, based largely on previously suggested features. To measure the severity of GAD, we used the GAD-7 [
Each of the 7 questions on the GAD-7 has 4 options for the participant to select from, indicating how often they have been bothered by the 7 problems on the scale. These options and their numerical ratings are as follows: 0=not at all, 1=several days, 2=more than half the days, and 3=nearly every day. The final GAD-7 score is a summation of the values for each question, giving a severity measure for GAD in the range of 0 (no anxiety symptoms) to 21 (severe anxiety symptoms).
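Because the scoring rule is purely additive, it can be sketched directly in code. The severity bands below (cut points 5, 10, and 15) follow the conventional GAD-7 interpretation; only the ≥10 screening threshold is used later in this study.

```python
def gad7_score(responses):
    """Total GAD-7 severity: the sum of seven item ratings, each 0-3."""
    if len(responses) != 7 or any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("GAD-7 requires seven ratings in {0, 1, 2, 3}")
    return sum(responses)

def gad7_band(score):
    """Conventional severity bands (cut points 5, 10, and 15)."""
    if score >= 15:
        return "severe"
    if score >= 10:
        return "moderate"  # >=10 is the screening threshold used in this study
    if score >= 5:
        return "mild"
    return "minimal"

print(gad7_score([2, 1, 3, 0, 2, 1, 2]))  # 11
print(gad7_band(11))                      # moderate
```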
We also used a second, informal anxiety measure in this study to serve as an internal check to measure how much, on average, the modified TSST (task 2) induced stress and anxiety compared with task 1 (the reading or speaking of the
Prior work suggested that information about the mental state of a person may be acquired from the signals within speech acoustics [
In this work, we considered both acoustic and linguistic features, which are described in the following sections. These features were extracted from each of the 5-minute speech samples in which the participant responded to the modified TSST task. It should be noted that all the participants were prompted to speak for the full 5 minutes, as described in the
Previous research has identified several acoustic features that are correlated with anxiety, as described in the
These are coefficients derived from a mel-scale cepstral representation of an audio signal. We included 13 MFCCs, a common set of acoustic signals designed to reflect changes in perceivable pitch. The MFCC features were shown to be related to anxiety in 3 studies [
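As an illustration of what these coefficients involve, the following numpy-only sketch implements a standard MFCC pipeline (framing, windowing, power spectrum, mel filter bank, log compression, DCT-II). The frame sizes and filter-bank settings are common defaults, not values taken from this study, which does not specify its extraction toolkit here.

```python
import numpy as np

def mfcc_frames(signal, sr, n_mfcc=13, n_fft=512, hop=256, n_mels=26):
    """Minimal MFCC extraction: frame, window, power spectrum, mel filter
    bank, log compression, then a DCT-II over the log mel energies."""
    sig = np.asarray(signal, dtype=float)
    frames = np.lib.stride_tricks.sliding_window_view(sig, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2

    # Triangular mel filter bank spanning 0 Hz to the Nyquist frequency.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)

    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2.0 * n_mels)))
    return dct @ logmel.T  # shape: (n_mfcc, n_frames)

# Summarize each coefficient over time, mirroring mfcc_mean_k / mfcc_std_k.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)  # synthetic stand-in for a recording
m = mfcc_frames(tone, sr)
mfcc_mean, mfcc_std = m.mean(axis=1), m.std(axis=1)
```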
These are coefficients derived from a linear prediction cepstral representation of an audio signal. The first 13 cepstrum coefficients were used here. The LPCC features were shown to be related to anxiety in the study by Özseven et al [
In the study by McGinnis et al [
This refers to the amount of speech and related metrics such as the percentage of silence. These features have been shown to be related to anxiety in 3 studies [
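A crude way to compute such amount-of-speech features is energy-based voice activity detection, sketched below. The relative energy threshold is an illustrative assumption, not the study's method.

```python
import numpy as np

def speech_amount(signal, sr, frame_len=0.025, hop=0.010, rel_threshold=0.1):
    """Rough energy-based estimate of speaking duration (seconds) and
    percentage of silence. Frames whose RMS energy falls below a fraction
    of the loudest frame are treated as silence; the threshold is a
    tunable assumption."""
    n, h = int(frame_len * sr), int(hop * sr)
    sig = np.asarray(signal, dtype=float)
    frames = np.lib.stride_tricks.sliding_window_view(sig, n)[::h]
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > rel_threshold * rms.max()
    return voiced.sum() * hop, 100.0 * (1.0 - voiced.mean())
```

For example, one second of tone followed by one second of silence yields a speaking duration near 1 s and roughly 50% silence.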
This indicates how fast the participant spoke. The study by Hagenaars and van Minnen [
This is the frequency at which the glottis vibrates, also known as the
These are the F1, F2, and F3 [
This refers to the cycle-to-cycle F0 variation of the sound wave.
This refers to the cycle-to-cycle amplitude variation of the sound wave.
This is the mean of the squared amplitude of the sound wave within a given frame, also known as
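The last three features above reduce to short formulas once pitch cycles (for jitter and shimmer) or frames (for intensity) are available; pitch-cycle extraction itself, typically done with a tool such as Praat, is assumed to have happened elsewhere. A minimal sketch of the local variants:

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute cycle-to-cycle period difference, relative to the mean period."""
    p = np.asarray(periods, dtype=float)
    return np.abs(np.diff(p)).mean() / p.mean()

def local_shimmer(amplitudes):
    """Mean absolute cycle-to-cycle amplitude difference, relative to the mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    return np.abs(np.diff(a)).mean() / a.mean()

def frame_intensity(signal, frame_len=400, hop=160):
    """Mean squared amplitude per frame (25 ms frames, 10 ms hop at 16 kHz)."""
    sig = np.asarray(signal, dtype=float)
    frames = np.lib.stride_tricks.sliding_window_view(sig, frame_len)[::hop]
    return (frames ** 2).mean(axis=1)
```

A perfectly steady signal gives zero jitter and shimmer; real voiced speech gives small positive values.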
Using Amazon’s AWS STT [
To apply the LIWC dictionaries, one simply counts the number of words that belong to each category, and each count becomes a feature. There are 93 categories in the LIWC, although not all are relevant for an STT transcript. We removed those features that were not relevant; for example, informal language words such as
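Scoring a transcript against such dictionaries is simple counting, normalized to a percentage of total words as LIWC reports it. The sketch below uses toy word lists because the real LIWC lexicon is proprietary, and it ignores LIWC's wildcard stems (eg, worr*):

```python
# Toy categories standing in for the proprietary LIWC lexicon.
TOY_CATEGORIES = {
    "negemo": {"nervous", "afraid", "worried", "terrible"},
    "i": {"i", "me", "my", "mine"},
    "negate": {"no", "not", "never"},
}

def category_percentages(transcript, categories=TOY_CATEGORIES):
    """Count words per category and express each as a percentage of total words."""
    tokens = transcript.lower().split()
    if not tokens:
        return {cat: 0.0 for cat in categories}
    return {cat: 100.0 * sum(tok in words for tok in tokens) / len(tokens)
            for cat, words in categories.items()}
```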
The overarching objective of this study was to understand which features of speech, both acoustic and linguistic, are correlated with the GAD-7. However, certain demographic attributes are known to be associated with anxiety. For example, sex is known to influence the prevalence of anxiety [
The partial Pearson correlation coefficient [
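One standard way to compute a partial Pearson correlation, consistent with its usual definition, is to regress both variables on the covariates and correlate the residuals. The sketch below illustrates this with synthetic data; the covariate stands in for columns such as age or income:

```python
import numpy as np

def partial_corr(x, y, covars):
    """Partial Pearson correlation of x and y, controlling for `covars`:
    regress both variables on the covariates (with an intercept) and
    correlate the least-squares residuals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    Z = np.column_stack([np.ones(len(x)), covars])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# A shared covariate inflates the raw correlation between x and y;
# partialling it out removes most of the association.
rng = np.random.default_rng(0)
cov = rng.normal(size=500)
x = cov + 0.1 * rng.normal(size=500)
y = cov + 0.1 * rng.normal(size=500)
raw_r = np.corrcoef(x, y)[0, 1]
partial_r = partial_corr(x, y, cov.reshape(-1, 1))
```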
This section reports the main empirical results. We begin by discussing the recruitment yield, the demographic characteristics of the participants, and the relationship between demographic attributes and the reported GAD-7 score. Next, we report correlations for the features described in the
A total of 4542 participants accepted the offer from the Prolific recruitment platform to participate in the study, of whom 2212 (48.7%) completed the study, giving a recruitment yield of approximately 49%.
Of the 2212 participants who completed the study, 2000 (90.42%) provided acceptable submissions (and thus received payment), giving a submission-to-approval yield of approximately 90%. To be clear, recruitment continued until 2000 acceptable submissions were received. Submissions were deemed unacceptable for the following reasons: a missing video, missing or severely degraded audio, or failure to complete one or both tasks. These acceptability criteria were distinct from those used in the subsequent review of audio quality described in the following paragraphs. The recruitment period ranged from November 23, 2020, to May 28, 2021. Of note, recruitment took place during the global COVID-19 pandemic.
In addition to the aforementioned submission approval criteria, we reviewed the input data and audio for acceptability using the following procedure. To begin, we computed all acoustic and linguistic features described in the
A task 2 word count of <125
A speaking duration for task 2 of <60 seconds (compared with the full 5 minutes)
Any other feature value being beyond 3 SDs from the mean in either direction (outliers)
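These three criteria can be applied mechanically before any listening takes place. A sketch, assuming per-participant feature arrays with hypothetical names:

```python
import numpy as np

def flag_for_review(features):
    """Apply the study's three flagging criteria. `features` maps feature
    names (hypothetical here) to 1-D arrays, one value per participant,
    and must include 'word_count' and 'speaking_duration' (in seconds)."""
    n = len(features["word_count"])
    flags = np.zeros(n, dtype=bool)
    flags |= features["word_count"] < 125          # criterion 1
    flags |= features["speaking_duration"] < 60    # criterion 2
    for vals in features.values():                 # criterion 3: 3-SD outliers
        mu, sd = vals.mean(), vals.std()
        flags |= np.abs(vals - mu) > 3 * sd
    return flags
```

Flagged recordings then go to manual review rather than being rejected outright, matching the procedure described below.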
Of the 2000 participant recordings, 193 (9.65%) were flagged based on these criteria. For each of these, a researcher (BGT) listened to the task 2 audio recording. The researcher discarded any samples that were deemed, subjectively, to be of insufficient audio quality, as well as any whose task 2 recording did not address the task itself. Of the 193 flagged participants, 123 (63.7%) were rejected through this manual review, meaning that of the 2000 samples, 1877 (93.85%) remained.
Finally, the 1877 samples were checked for missing data, and 133 (7.09%) participants had missing demographic information; consequently, the final number of participants included in our analysis was 1744 (92.91%). The flow chart of the study recruitment and quality control is presented in
Study recruitment flow chart.
Of the 1744 participants, 540 (30.96%) scored at or above the GAD-7 screening threshold of 10 and 1204 (69.04%) scored below it. Hereon, we will refer to those participants with a GAD-7 score ≥10 as the group
Demographic characteristics of participants in the group with anxiety and the nonanxious group (N=1744).
| Demographic factors | Group with anxiety (n=540), n (%) | Nonanxious group (n=1204), n (%) | P value |
| Sex | | | <.001 |
| Male | 229 (42.41) | 653 (54.24) | |
| Female | 311 (57.59) | 551 (45.76) | |
| | | | <.001 |
| Yes | 297 (55) | 311 (25.83) | |
| No | 243 (45) | 893 (74.17) | |
| Personal income | | | <.001 |
| <10,000 | 181 (33.52) | 281 (23.34) | |
| 10,000 to 19,999 | 112 (20.74) | 208 (17.28) | |
| 20,000 to 29,999 | 92 (17.04) | 259 (21.51) | |
| 30,000 to 39,999 | 60 (11.11) | 184 (15.28) | |
| 40,000 to 49,999 | 36 (6.67) | 109 (9.05) | |
| 50,000 to 59,999 | 20 (3.7) | 74 (6.15) | |
| ≥60,000 | 39 (7.22) | 89 (7.39) | |
| Age (years) | | | <.001 |
| 18 to 19 | 27 (5) | 44 (3.65) | |
| 20 to 29 | 239 (44.26) | 379 (31.48) | |
| 30 to 39 | 162 (30) | 334 (27.74) | |
| 40 to 49 | 67 (12.41) | 219 (18.19) | |
| 50 to 59 | 39 (7.22) | 132 (10.96) | |
| ≥60 | 6 (1.11) | 96 (7.97) | |
|
As described in the
The
Among the features with the highest correlations in both the male-sample and female-sample data sets were those related to how much the participant spoke during task 2. The 2 specific features used to estimate the amount of speech were speaking duration (the number of seconds of speech present within the 5-minute speech task) and the word count derived from an STT transcript.
Correlation of amount of speech features with the Generalized Anxiety Disorder 7-item scale.
| Sample and feature | Correlation (r) | P value |
| All samples | | |
| Speaking duration | –0.12 | <.001 |
| Word count | –0.12 | <.001 |
| Female samples | | |
| Word count | –0.13 | <.001 |
| Speaking duration | –0.11 | <.001 |
| Male samples | | |
| Speaking duration | –0.13 | <.001 |
| Word count | –0.12 | <.001 |
Speaking duration versus Generalized Anxiety Disorder 7-item scale (GAD-7) scatter plot and distributions.
Word count (WC) versus Generalized Anxiety Disorder 7-item scale (GAD-7) scatter plot and distributions.
Correlation of significant acoustic features with the Generalized Anxiety Disorder 7-item scale.
| Sample and feature | Correlation (r) | P value |
| All samples | | |
| Shimmer | 0.08 | <.001 |
| mfcc_std_2 | –0.08 | .002 |
| mfcc_std_3 | –0.07 | .002 |
| mfcc_mean_2 | –0.07 | .004 |
| f0_std | 0.06 | .01 |
| mfcc_std_5 | –0.06 | .01 |
| mfcc_std_4 | –0.05 | .03 |
| Female samples | | |
| mfcc_std_3 | –0.10 | .002 |
| Shimmer | 0.10 | .004 |
| lpcc_std_6 | –0.09 | .008 |
| lpcc_std_4 | –0.09 | .008 |
| mfcc_mean_2 | –0.09 | .01 |
| Intensity_mean | –0.09 | .01 |
| mfcc_mean_1 | –0.09 | .01 |
| lpcc_std_10 | –0.07 | .03 |
| intensity_std | –0.07 | .03 |
| lpcc_std_12 | –0.07 | .04 |
| mfcc_mean_8 | 0.07 | .04 |
| lpcc_mean_4 | 0.07 | .049 |
| Male samples | | |
| mfcc_std_2 | –0.09 | .005 |
| mfcc_std_5 | –0.09 | .01 |
| mfcc_mean_5 | –0.08 | .01 |
| f0_std | 0.07 | .03 |
| mfcc_std_4 | –0.07 | .04 |
| Shimmer | 0.07 | .04 |
| mfcc_std_11 | –0.07 | .046 |
| f1_mean | 0.07 | .047 |
Correlation of acoustic features not found to be significant.
| Feature | Previous works | All samples, r (P value) | Female samples, r (P value) | Male samples, r (P value) |
| Jitter | Showed a significant increase from a neutral state to an anxious state [ | 0.03 (.18) | –0.01 (.76) | 0.06 (.06) |
| ZCR-zPSDa | ZCR-zPSD was one of the top selected features using the Davies-Bouldin index–based feature selection [ | 0.01 (.67) | –0.04 (.29) | 0.05 (.14) |
| Articulation rate | Patients with panic disorder spoke significantly slower ( | –0.01 (.64) | –0.05 (.12) | 0.02 (.55) |
| F1b SD | Showed a significant change between neutral state and anxious state [ | –0.03 (.18) | –0.02 (.53) | –0.04 (.25) |
| F2c mean | Showed a significant change between neutral state and anxious state [ | 0.004 (.85) | 0.04 (.26) | –0.04 (.22) |
| F2 SD | Showed a significant change between neutral state and anxious state [ | 0.01 (.59) | 0.03 (.38) | –0.02 (.60) |
| F3d mean | Showed a significant change between neutral state and anxious state [ | 0.02 (.49) | 0.04 (.21) | –0.01 (.72) |
aZCR-zPSD: zero crossing rate for the
bF1: first formant.
cF2: second formant.
dF3: third formant.
Comparison of previous works’ correlations with those of this study.
| Feature | Previous work, r (P value) | All samples, r (P value) | Female samples, r (P value) | Male samples, r (P value) |
| Speaking duration | –0.36 (<.01) | –0.12 (<.001) | –0.11 (<.001) | –0.13 (<.001) |
| MFCCa_std_1 | –0.36 (<.05) | 0.01 (.54) | 0.02 (.61) | 0.02 (.52) |
| F0b_mean | Female: 0.02 (.92); male: 0.72 (.002) | 0.02 (.37) | –0.03 (.33) | 0.06 (.06) |
| F0_SD | –0.24 (<.05) | 0.06 (.01) | 0.03 (.30) | 0.07 (.03) |
| Intensity mean | –0.2 (—c) | –0.04 (.13) | –0.09 (.01) | 0.01 (.72) |
aMFCC: mel-frequency cepstral coefficient.
bF0: fundamental frequency.
cNot available.
The quality of the transcript produced using Amazon’s AWS STT program [
Correlation of significant Linguistic Inquiry and Word Count linguistic features with the Generalized Anxiety Disorder 7-item scale.
| Sample and feature | Correlation (r) | P value |
| All samples | | |
| AllPunc | 0.13 | <.001 |
| Period | 0.12 | <.001 |
| assent | 0.10 | <.001 |
| negemo | 0.10 | <.001 |
| relativ | –0.09 | <.001 |
| motion | –0.08 | <.001 |
| swear | 0.08 | <.001 |
| anger | 0.08 | <.001 |
| focusfuture | –0.07 | .003 |
| adverb | –0.07 | .004 |
| time | –0.07 | .004 |
| function | –0.07 | .005 |
| negate | 0.07 | .006 |
| prep | –0.06 | .007 |
| WPSa | –0.06 | .007 |
| anx | 0.06 | .008 |
| hear | 0.06 | .01 |
| death | 0.06 | .01 |
| ipron | –0.06 | .01 |
| see | –0.06 | .01 |
| affect | 0.06 | .02 |
| i | 0.05 | .02 |
| family | 0.05 | .02 |
| sad | 0.05 | .03 |
| ppron | 0.05 | .03 |
| space | –0.05 | .04 |
| article | –0.05 | .04 |
| leisure | 0.05 | .04 |
| friend | 0.05 | .047 |
| Female samples | | |
| Period | 0.16 | <.001 |
| AllPunc | 0.14 | <.001 |
| adverb | –0.11 | <.001 |
| negemo | 0.11 | <.001 |
| anger | 0.11 | .002 |
| motion | –0.10 | .003 |
| assent | 0.10 | .004 |
| see | –0.09 | .006 |
| relativ | –0.09 | .006 |
| sad | 0.08 | .01 |
| Dic | –0.08 | .02 |
| power | 0.07 | .03 |
| WPS | –0.07 | .03 |
| death | 0.07 | .04 |
| percept | –0.07 | .046 |
| Male samples | | |
| AllPunc | 0.13 | <.001 |
| assent | 0.11 | .001 |
| relativ | –0.10 | .002 |
| leisure | 0.10 | .002 |
| hear | 0.10 | .003 |
| swear | 0.10 | .004 |
| time | –0.10 | .004 |
| Apostro | 0.09 | .005 |
| power | –0.09 | .01 |
| ppron | 0.09 | .01 |
| Sixltr | –0.09 | .01 |
| anx | 0.08 | .01 |
| negate | 0.08 | .01 |
| negemo | 0.08 | .01 |
| article | –0.08 | .01 |
| Period | 0.08 | .02 |
| prep | –0.08 | .02 |
| focusfuture | –0.08 | .02 |
| family | 0.08 | .02 |
| ipron | –0.07 | .04 |
| affect | 0.07 | .04 |
| motion | –0.07 | .048 |
aWPS: words per sentence.
Our central objective was to test specific acoustic and linguistic features of impromptu speech for their association with anxiety and to do so with a larger number of participants. In this section, we discuss the implications of the findings presented in the previous section, as well as the limitations of the study.
The results presented in the
We also conducted a missingness analysis on the 5.64% (256/4542) of samples excluded from the study (
The proportion of participants in the group with anxiety (those above the GAD-7 screening threshold of 10) was 30.96% (540/1744), which is much higher than the general population rate of approximately 10% [
The demographic data listed in
The rows in
Similarly, with respect to age, younger participants were more likely to be in the group with anxiety, which is consistent with previous work [
As described in the
The results suggest that features related to the amount of speech that participants delivered in response to task 2 were among the most strongly correlated with the GAD-7 across all features explored in this work. In particular, 2 features captured this aspect:
The main purpose of this work was to explore how acoustic features relate to anxiety. We wanted to determine whether associations found in previous studies still hold with the larger sample size.
The following features, listed as relevant in prior work, did not show significant correlations with the GAD-7: F2 and F3, jitter, ZCR-zPSD, and the articulation rate.
Correlations between linguistic features extracted using the LIWC dictionaries [
Other LIWC categories with high correlation in the all-sample data set were negative emotion (
An LIWC category with a significant correlation that is present in the male-sample data set but not in the female-sample data set is the use of apostrophes (
Another differentiation between men and women occurs in the LIWC feature for words related to
In prior work studying associations between LIWC scores and anxiety, words related to anxiety and first-person singular pronouns were shown to be significantly associated with social anxiety [
Furthermore, in prior work, death-related words were shown to have a positive correlation with anxiety [
The fact that there are several single-word categories that have significant correlations suggests that techniques that are able to look at multiple word meanings may have greater potential in making predictions.
A limitation of this study is the use of self-report measures to assess GAD. Self-report measures, by nature, are subjective opinions that individuals have about themselves while filling out the questionnaires and may not completely capture clinical symptoms. In this study, we took these self-report questionnaires as the true label of the audio samples. However, we believe that this is a good first step that gave us encouraging preliminary results. A psychiatric diagnosis would be an improved label but is clearly much more expensive to acquire.
A further limitation of this study is the selection bias that might be introduced during the recruitment of the participants. As presented in
Another limitation concerns the differences in the recording devices and recording locations of the participants performing each task. Ideally, we would want every sample to be recorded using the same microphone in the same location with the same acoustics. This would reduce the potential bias introduced by different factors such as recording quality or background noise. At the same time, in a real-life scenario where an application to detect anxiety might be deployed, the recording equipment and the location will likely differ for everyone. Hence, this limitation could be unavoidable, and it might even be essential to take these types of differences into consideration.
We present results from a large-N study examining the relationship between speech and GAD. Our data collection relied on participants using home recording devices, hence capturing variations in acoustic environments, which will need to be factored in when deploying tools for the detection of mental health disorders in the wild. Our goal was to provide a useful benchmark for future research by assessing the extent to which results from previous research are generalizable to our data collection approach and larger data set. We tested the most common acoustic and linguistic features associated with anxiety in previous studies and provided detailed correlation tables broken down by demographics.
Our findings are decidedly mixed. On the one hand, with our larger data set, we found modest but statistically significant correlations between anxiety and several features of speech from previous research, including speaking duration and acoustic features such as MFCCs, LPCCs, shimmer, F0, and F1. This suggests that there may be a fundamental pathway between anxiety and the production of speech, one that is robust enough to generalize to the population. On the other hand, several features shown to correlate with anxiety elsewhere (including F2 and F3, jitter, and ZCR-zPSD) were not significantly associated with anxiety in our study. Although these null findings do not entirely rule out the potential of more sophisticated learning models for this task, we believe that researchers should be wary of inherent difficulties. Readers should also note that our data collection already sidestepped additional challenges that we expect to influence the detection of anxiety disorders from speech, such as variations in accents, dialects, and spoken language.
Future investigations could explore whether features of speech from task 1 (simple reading of a passage) exhibit correlations with the GAD-7 or whether these features could be used as a control for the features of task 2 (the modified TSST task). It may also be informative to separate out different age groups (eg, younger and older) to see whether there is a specific impact of speech features on the GAD-7.
Web application screenshot.
My Grandfather passage.
Speech encouragement statements.
Excluded data analysis.
Correlation between demographics and acoustic and linguistic features.
Significant feature intercorrelations of the all-sample data set.
Significant feature intercorrelations of the female-sample data set.
Significant feature intercorrelations of the male-sample data set.
fundamental frequency
first formant
second formant
third formant
generalized anxiety disorder
Generalized Anxiety Disorder 7-item scale
Linguistic Inquiry and Word Count
linear prediction cepstral coefficient
mel-frequency cepstral coefficient
social anxiety disorder
speech-to-text
Trier Social Stress Test
zero crossing rate for the z score of the power spectral density
This research was funded by a University of Toronto XSeed Grant, Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN-2019-04395), and Social Sciences and Humanities Research Council Partnership Engage Grant (892-2019-0011).
WS is an employee of Winterlight Labs and holds equity in the company, and DDD is a former employee of Winterlight Labs.