This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on https://mental.jmir.org/, as well as this copyright and license information must be included.
Emotions and mood are important for overall well-being. Therefore, the search for continuous, effortless emotion prediction methods is an important field of study. Mobile sensing provides a promising tool and can capture one of the most telling signs of emotion: language.
The aim of this study is to examine the separate and combined predictive value of mobile-sensed language data sources for detecting both momentary emotional experience and global individual differences in emotional traits and depression.
In a 2-week experience sampling method study, we collected self-reported emotion ratings and voice recordings 10 times a day, continuous keyboard activity, and trait depression severity. We correlated state and trait emotions and depression with language features, distinguishing between speech content (spoken words), speech form (voice acoustics), writing content (written words), and writing form (typing dynamics). We also investigated how well these features predicted state and trait emotions, using cross-validation to select features and a hold-out set for validation.
Overall, the reported emotions and mobile-sensed language demonstrated weak correlations. The strongest correlations were found between state emotions and speech content and between state emotions and speech form, reaching up to 0.25. Speech content provided the best predictions for state emotions. None of the trait emotion–language correlations remained significant after correction. Among the emotions studied, valence and happiness displayed the largest number of significant correlations and the highest predictive performance.
Although using mobile-sensed language as an emotion marker shows some promise, correlations and predictive performance were weak overall, remaining below 0.25.
Emotions are crucial to human survival, functioning, and well-being. They alert us to opportunities and challenges in our environment and motivate us to act on them to serve our goals and concerns [
The current gold standard for researching emotion (dynamics) in daily life is the experience sampling method (ESM). Participants complete a short survey on how they feel multiple times a day, allowing data to be collected during their normal routine [
However valuable, the ESM has some drawbacks. Interrupting daily activities for a survey multiple times a day can be burdensome [
One such unobtrusive (passive) data collection method as an alternative to ESM is mobile sensing [
To use mobile sensing for the detection of emotions and mood disorders, we need emotionally valid data that can be captured by a smartphone. Language is one of the ways in which people (digitally) express their emotions [
Speech content: spoken words
Speech form: voice acoustics (eg, pitch and timbre)
Writing content: written words
Writing form: typing dynamics (eg, typing speed and key press duration)
Studies on speech and emotional word use have generally focused on positive or negative emotions. Induced positive emotions coincide with more positive and fewer negative emotion words between persons [
Because of these inconsistencies,
Each voice has a unique sound because of age, gender, and accent. However, psychological features such as attitudes, intentions, and emotions also affect how we sound [
Higher-arousal emotions (eg, fear, anger, and joy) generally induce a higher speech intensity, F0, and speech rate, whereas lower-arousal emotions (eg, sadness and boredom) induce a lower speech intensity, F0, and speech rate (see the table below).
Expected emotion–speech form correlations.
Emotion | F0a-mean | F0-SD | F0-range | F0-rise | F0-fall | Loudness mean | Loudness rise | Loudness fall | Jitterb | Shimmerc | HNRd | Speech rate | Pause duration
Valence | | | | | | | | | | | | |
Arousal | (+)e | + | + | + | + | + | + | + | | | | (+=)f | (−)g
Anger | + | + | + | + | + | + | + | + | += | += | + | (+−)h | −
Anxiety | + | +− | + | +− | +− | + | | | + | + | − | +− | +−
Sadness or depression | − | − | − | − | − | − | − | − | + | | − | − | +
Stress | + | + | + | + | + | + | + | + | | | | + | −
Happiness | + | + | + | + | + | + | + | + | += | += | + | + | −
aF0: fundamental frequency.
bDeviations in individual consecutive fundamental frequency period lengths.
cDifference in the peak amplitudes of consecutive fundamental frequency periods.
dHNR: harmonics to noise ratio (energy in harmonic components and energy in noise components).
ePositive correlation.
fPositive or no correlation.
gNegative correlation.
hUndirected correlation.
Higher valence has repeatedly been associated with more positive and less negative emotion words on a within- and between-person level, along with a higher word count in both natural and induced emotion conditions (see the table below).
Expected emotion–speech and writing content correlations.
Emotion | WCa | I | We | You | Negate | Posemob | Negemoc | Anxd | Anger | Sad | Certaine | Swear | Exclamf
Valence | (+)g | (−)h | | | | | | | | + | | + | +
Arousal | | | | | | | | | | | | |
Anger | | + | + | + | | | | | + | | | |
Anxiety | | + | + | + | + | + | + | + | | | + | |
Sadness | | + | + | | + | | | + | | | | |
Stress | | + | | + | | | | | | | | |
Happiness | + | − | | | | | | | | + | | + | +
Depression | | + | + | + | + | + | + | + | | | + | |
aWC: word count.
bPosemo: positive emotions.
cNegemo: negative emotions.
dAnx: anxiety.
eCertain: absolutist words.
fExclam: exclamation marks.
gPositive correlation.
hNegative correlation.
Negative emotion, anxiety, and anger words recur as linguistic markers of anger within and between persons [
Initially, studies concerning typing dynamics used external computer keyboards to predict stress and depression, among other emotions [
Despite the high predictive accuracies of deep learning models, separate correlations between emotional states and typing dynamics are small (see the table below).
Expected emotion–writing form correlations.
Emotion | Number of characters | Typing speed | Average key press duration | Number of entries | Backspaces | Typing duration
Valence | (+)a | | | | |
Arousal | | + | (−)b | | | −
Anger | | | | | |
Anxiety | | | | | |
Sadness | | | | | |
Stress | | + | − | − | − | −
Happiness | + | | | | |
Depression | | | − | | |
aPositive correlation.
bNegative correlation.
Despite this body of research, crucial questions remain. For instance, most research has focused on between-person relationships, whereas few studies have looked at state emotions within persons. Therefore, it is unclear to what extent mobile-sensed language can help predict moment-to-moment changes within individuals. Previous research has typically also examined particular language features in isolation. As a result, we do not know how the different types of language data compare in their predictive value nor to what extent combining them may enhance the prediction of moment-to-moment and trait emotions.
In this study, we will examine the separate and combined predictive value of 4 mobile-sensed language data sources for detecting momentary emotional experience as well as emotional traits and depression. A 2-week ESM study was designed, asking participants to indicate their valence, arousal, anger, anxiety, sadness, stress, and happiness on their smartphones 10 times a day. In addition, a custom-built app recorded data from several sensors. Relevant to this study, the participants were asked to use the provided custom keyboard software as often as possible and to make a voice recording regarding their emotional state at the end of each ESM survey. On the basis of these data, we will examine how self-reported emotional experience correlates with, and can be predicted from, spoken and written word use, acoustic voice features, and typing dynamics.
This study goes beyond previous work by comparing and combining all 4 sources of language behavior: speech content, speech form, writing content, and writing form. In addition, this study will examine the prediction of emotion traits as well as moment-to-moment emotional fluctuations in daily life, providing a comprehensive picture of the potential of language-based smartphone-sensing data for emotion detection.
Participants were recruited through notices on social media groups and notice boards around university buildings. The notice directed people to a web survey for selection purposes. This web survey queried an email address, age, gender, and questions regarding the inclusion criteria. These entailed Dutch as mother tongue, availability for the duration of the study, ownership of an Android smartphone that supported the sensing app (not iPhone, Huawei, Wiko, Medion, or Xiaomi), always carrying that smartphone, and activating it at least 10 times a day. A total of 230 people completed the web survey, of whom 116 (50.4%) were excluded based on the aforementioned criteria. Of the remaining 114 people, 69 (60.5%) agreed to participate in the study. In the laboratory, 3% (2/69) of participants declined to sign the informed consent, and the installation of the apps failed for another 3% (2/69) of participants, leaving 65 actual participants. For the analyses, an extra inclusion criterion of having answered at least 30 surveys led to the exclusion of another 8% (5/65) of participants. Of the remaining 60 participants, 17 (28%) were men, and 43 (72%) were women (mean age 21.85 years, SD 2.31 years; range 17-32 years).
The participants were reimbursed depending on their cooperation in the study. A maximum of €50 (US $56) could be earned. A total of €10 (US $11.2) was earned after completing the baseline trait questionnaires at the start of the study. Another €5 (US $5.6) could be earned per 10% of completed ESM surveys, capped at 80% completed surveys. This is standard practice in ESM research. This study was approved by the Societal Ethical Commission of Katholieke Universiteit Leuven (G-2018 01 1095).
A total of 2 apps were installed on each smartphone. The first one, a custom-built app called Actitrack, recorded data from multiple mobile sensors, such as screen locks, light sensors, and location. The software also provided a custom onscreen keyboard that could be used instead of the default soft keyboard on the host smartphone. This way, the app could register all typing activity with the custom keyboard, as it had no access to the default keyboard. Because of the sensitivity of these data, privacy measures were taken. All data were securely sent over HTTPS to a central server of Katholieke Universiteit Leuven and stored in 2 different files.
This study solely focused on the sensed keyboard and voice data. The participants were asked to use the custom-made keyboard as often as possible to render enough writing data. While doing so, the following variables were stored: content of the message, number of backspaces, number of characters, typing speed, typing duration, average duration of a key press, number of positive emojis, and number of negative emojis.
After each ESM survey, the participants were redirected to the sensing app to record a voice message. In the app, there was a button to start and a button to decline, and the instruction read “Make a recording of about one minute about what you have done and how it made you feel. Good luck!” This meant that keyboard activity was passively sensed the entire time of the study, whereas voice recordings were actively prompted and initiated by the participants. As the keyboard messages and voice recordings might contain sensitive personal information, the files were encrypted separately and could only be stored and handled on computers with an encrypted hard drive.
The second app, MobileQ, delivered the ESM surveys [
At the beginning of the study, each participant completed a mental health and personality survey. In this study, only the depression subscale of the Depression, Anxiety, and Stress Scale (DASS) was used [
After meeting the inclusion criteria, the participants attended a session in the laboratory. During each session, informed consent was first presented and signed. Next, the 2 apps were installed on the participants’ smartphones, and they received a booklet with user instructions and a unique participant number. The booklet included instructions to keep the phone turned on, charge it at night, not lend it to a friend, switch off the screen lock, and be connected to Wi-Fi as much as possible. It also included a guide on how to install and uninstall the apps. Finally, the participants were asked to complete the trait questionnaires. For each participant, the 2-week study began the day after the session, and the apps were automatically deactivated after 15 days. There was an optional feedback session at the end where the participants could receive a debriefing and help with uninstallation. The 60 participants who reached the cutoff of 30 completed surveys responded on average to 109.3 (SD 22) of the 140 notifications, yielding a compliance rate of 78% (mean compliance 0.78, SD 0.16; range 0.26-0.99).
The voice samples were converted to text files so that the words used in speech could be analyzed. The voice recordings were initially transcribed using the open-source transcriber software Kaldi (NVIDIA) [
The content of the voice recordings was analyzed using the LIWC software [
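LIWC itself is proprietary, but the scoring it performs is straightforward to illustrate: each category score is the percentage of tokens matching that category's dictionary (compare footnote b of the table below). The following is a minimal R sketch with made-up stand-in word lists, not the actual LIWC dictionaries:

```r
# Hypothetical stand-ins for LIWC category dictionaries (illustration only)
posemo_dict <- c("blij", "leuk", "fijn")
negemo_dict <- c("boos", "verdrietig", "bang")

# Percentage of tokens in `text` matching a category dictionary, mirroring
# how LIWC expresses its dimensions as percentages of the total word count
liwc_like_score <- function(text, dictionary) {
  tokens <- tolower(unlist(strsplit(text, "\\W+")))
  tokens <- tokens[tokens != ""]
  100 * sum(tokens %in% dictionary) / length(tokens)
}

transcript <- "Ik ben blij en het was leuk vandaag"
liwc_like_score(transcript, posemo_dict)  # 2 of 8 tokens -> 25%
```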
Descriptive statistics of the speech data.
Item | Value, mean (SD; range)
Emotion and depression ratingsa
Valence | 56.21 (11.3; 22.57 to 83.42)
Arousal | 44.7 (11.35; 18.41 to 77.21)
Anger | 10.63 (9.08; 1.7 to 52.05)
Anxiety | 12.47 (12.62; 1.35 to 56.31)
Sadness | 13.06 (9.38; 1.84 to 43.1)
Stress | 27.58 (15.15; 5.08 to 74.16)
Happiness | 56.44 (11.32; 21.41 to 80.68)
Depression | 0.42 (0.45; 0 to 2.14)
Speech content (LIWC)b
WC (word count) | 60.72 (31.76; 4 to 125.63)
I (first-person singular) | 9.44 (3.69; 0 to 19.09)
We (first-person plural) | 0.58 (0.83; 0 to 3.7)
You (second-person singular) | 0.06 (0.11; 0 to 0.41)
Negate (negations) | 1.29 (0.75; 0 to 3.28)
Posemo (positive emotion words) | 3.54 (2.04; 0 to 12.5)
Negemo (negative emotion words) | 0.98 (0.72; 0 to 2.73)
Anx (anxiety-related words) | 0.36 (0.52; 0 to 2.38)
Anger (anger-related words) | 0.27 (0.35; 0 to 1.47)
Sad (sadness-related words) | 0.16 (0.18; 0 to 0.76)
Certain (absolutist words) | 1.59 (1.36; 0 to 7.71)
Swear (swear words) | 0 (0.03; 0 to 0.19)
Speech form (acoustic features)c
F0d mean | 29.93 (4.26; 20.25 to 40.63)
F0 SD | 0.22 (0.05; 0.13 to 0.42)
F0 range | 7.52 (3.63; 2.29 to 19.4)
F0 mean rising slope | 303.85 (76.4; 126.97 to 556.56)
F0 mean falling slope | 155.13 (50.45; 88.93 to 336.52)
Loudness mean | 0.77 (0.37; 0.19 to 2.1)
Loudness mean rising slope | 12.85 (5.01; 3.43 to 26.76)
Loudness mean falling slope | 10.02 (4.08; 2.52 to 17.81)
Jitter mean | 0.05 (0.01; 0.03 to 0.07)
Shimmer mean | 1.29 (0.16; 1.02 to 1.75)
HNRe mean | 4.61 (2.44; −4.16 to 8.6)
Voiced segments per second (speech rate) | 2.12 (0.48; 0.55 to 3.38)
Mean unvoiced segment length (pause duration) | 0.29 (0.56; 0.11 to 4.16)
aEmotions were rated on a visual analogue scale of 0-100, and depression was rated on a scale of 0-3.
bExcept for word count, all Linguistic Inquiry and Word Count dimensions display percentages of the total word count.
cFundamental frequency measures are logarithmic transformations on a semitone frequency scale starting at 27.5 Hz. Loudness measures are the perceived signal intensity. The harmonics to noise ratio displays an energy-related harmonics to noise ratio and is indicative of voice quality along with jitter and shimmer.
dF0: fundamental frequency.
eHNR: harmonics to noise ratio.
The acoustic features of the voice recordings were extracted using the openSMILE software (audEERING GmbH) [
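For illustration, functionals such as those in the table above (F0, loudness, jitter, shimmer, and HNR summaries) can be extracted by calling the openSMILE command-line tool with an eGeMAPS configuration. Below is a hedged sketch from R, assuming an openSMILE 3.x installation; the config path and output option may differ by version, and this is not necessarily the exact pipeline used in this study:

```r
# Extract eGeMAPS functionals for each recording via the SMILExtract binary.
# Paths are assumptions for a typical openSMILE 3.x checkout.
wav_files <- list.files("recordings", pattern = "\\.wav$", full.names = TRUE)

for (wav in wav_files) {
  out_csv <- sub("\\.wav$", ".csv", wav)
  system2("SMILExtract", args = c(
    "-C", "config/egemaps/v01b/eGeMAPSv01b.conf",  # assumed config location
    "-I", wav,                                     # input audio
    "-csvoutput", out_csv                          # one row of functionals
  ))
}
```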
The content of the writing was analyzed in the same way as the content of the voice recordings—by using the LIWC software and the 12 chosen categories, adding also exclamation marks and the number of positive and negative emojis (see the table below).
Descriptive statistics of the writing data.
Item | Value, mean (SD; range)
Emotion and depression ratingsa
Valence | 56.07 (10.88; 22.57-83.42)
Arousal | 44.34 (11.67; 9.27-77.21)
Anger | 10.49 (8.8; 1.5-52.05)
Anxiety | 12.14 (12.2; 0.15-56.31)
Sadness | 12.82 (9.35; 1.84-43.1)
Stress | 26.85 (15.12; 3.31-74.16)
Happiness | 56.31 (10.97; 21.41-80.68)
Depression | 0.45 (0.48; 0-2.14)
Writing content (LIWC and emojis)b
Positive emojis | 1.4 (5.94; 0-45.09)
Negative emojis | 0.15 (0.26; 0-1.35)
WC (word count) | 82.4 (58.45; 1.8-358.93)
I (first-person singular) | 3.21 (1.22; 0-5.31)
We (first-person plural) | 0.57 (0.34; 0-1.38)
You (second-person singular) | 2.21 (0.82; 0.52-5)
Negate (negations) | 1.44 (0.81; 0-3.83)
Posemo (positive emotion words) | 0.1 (0.11; 0-0.38)
Negemo (negative emotion words) | 3.48 (1.57; 0-8.25)
Anx (anxiety-related words) | 0.85 (0.39; 0-1.68)
Anger (anger-related words) | 0.12 (0.12; 0-0.55)
Sad (sadness-related words) | 0.26 (0.2; 0-0.81)
Certain (absolutist words) | 0.24 (0.15; 0-0.65)
Swear (swear words) | 2.31 (1.06; 0-4.86)
Exclam (exclamation marks) | 1.56 (1.8; 0-9.26)
Writing form (typing dynamics)c
Characters, N | 480.11 (293.67; 12.7-1764.5)
Typing speed (characters per second) | 2.1 (0.55; 1.18-4.68)
Average key press duration (ms) | 79.95 (16.48; 20.68-122.83)
Entries, N | 15.37 (10.69; 1-71.92)
Total backspaces, N | 0.17 (0.06; 0-0.3)
Total typing duration (seconds) | 2.68 (2.18; 0.44-9.85)
aEmotions were rated on a visual analogue scale of 0-100, and depression was rated on a scale of 0-3.
bExcept for word count, all Linguistic Inquiry and Word Count dimensions display percentages of the total word count.
cNumber of backspaces and typing duration are divided by the total number of keystrokes (characters + backspaces).
The typing dynamics were recorded directly during typing without any additional software. The variables extracted from the custom-made keyboard were the number of backspaces, typing duration, typing speed, number of characters, and average duration of a key press (see the table above).
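As an illustration, the sketch below derives these writing form features from a raw keystroke log in R, including the normalization described in footnote c of the table above (backspaces and typing duration divided by the total number of keystrokes). The log format and column names are hypothetical, not the actual Actitrack schema:

```r
library(dplyr)

# Hypothetical keystroke log: one row per key event, grouped by keyboard entry
keystrokes <- data.frame(
  entry_id = c(1, 1, 1, 1, 2, 2),
  key      = c("h", "o", "BACKSPACE", "i", "o", "k"),
  down_ms  = c(0, 180, 420, 600, 0, 210),   # key-down timestamps
  up_ms    = c(70, 260, 480, 660, 90, 280)  # key-up timestamps
)

features <- keystrokes %>%
  group_by(entry_id) %>%
  summarise(
    n_characters     = sum(key != "BACKSPACE"),
    n_keystrokes     = n(),  # characters + backspaces
    backspace_rate   = sum(key == "BACKSPACE") / n_keystrokes,
    typing_duration  = (max(up_ms) - min(down_ms)) / 1000,  # seconds
    duration_per_key = typing_duration / n_keystrokes,      # normalized, cf. footnote c
    typing_speed     = n_characters / typing_duration,      # characters per second
    mean_press_ms    = mean(up_ms - down_ms)
  )
features
```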
After standardization, pairwise correlations were computed between the emotions on one side and the language features on the other. At the momentary level, this was done by extracting the slopes of multilevel simple linear regressions using the lme4 and lmerTest packages in R with restricted maximum likelihood estimation. At the trait level, Spearman correlations were applied to the aggregated data set. On each correlation table, a false discovery rate (FDR) correction was applied according to the step-down method by Holm [
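A minimal, self-contained R sketch of both correlation analyses, using simulated data and illustrative variable names (a standardized positive emotion word score predicting standardized valence):

```r
library(lme4)
library(lmerTest)  # adds p values for lmer fixed effects

set.seed(1)
d <- data.frame(participant = rep(1:20, each = 30), posemo_z = rnorm(600))
d$valence_z <- 0.2 * d$posemo_z + rnorm(600)

# Momentary (state) level: the fixed slope of a multilevel simple regression
# on standardized variables, with a random intercept per participant (REML)
m <- lmer(valence_z ~ posemo_z + (1 | participant), data = d, REML = TRUE)
summary(m)$coefficients["posemo_z", ]

# Trait level: Spearman correlation on the per-participant aggregated data
d_trait <- aggregate(cbind(valence_z, posemo_z) ~ participant, d, mean)
cor.test(d_trait$valence_z, d_trait$posemo_z, method = "spearman")

# Holm step-down correction over the raw p values of one correlation table
p_raw <- c(0.001, 0.020, 0.300)  # example values
p.adjust(p_raw, method = "holm")
```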
Next, we were interested in how well the language features would predict emotional states and traits. The total data set was divided, separately for voice and keyboard, into an 80% training and 20% test set. For each of the 4 language types separately, we used all significantly correlated features from the previous analyses as possible predictors for a given emotion in a linear regression model with a random intercept and varying slopes for participants at the momentary level, allowing predictor effects to differ between participants. When the correlation analysis yielded no significant correlations for an emotion, the 3 most highly correlated features were chosen as possible predictors. A 10-fold cross-validation on the training set was applied to determine which of the possible predictors had an average positive contribution to the predictive R2 across the folds; the resulting models were then validated on the hold-out test set.
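A hedged sketch of this screening step, reusing the simulated data frame from the sketch above. The exact retention criterion is not fully specified here, so keeping features with a positive average out-of-fold R2 is an assumption rather than the authors' stated rule:

```r
library(lme4)

# 10-fold CV: fit a mixed model with a per-participant random slope on the
# training folds and compute out-of-fold R2 on each held-out fold
cv_r2 <- function(form, data, outcome, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  sapply(1:k, function(fold) {
    fit  <- lmer(form, data = data[folds != fold, ])
    pred <- predict(fit, newdata = data[folds == fold, ], allow.new.levels = TRUE)
    obs  <- data[folds == fold, outcome]
    1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  })
}

form <- valence_z ~ posemo_z + (1 + posemo_z | participant)  # varying slope
mean(cv_r2(form, d, outcome = "valence_z"))  # assumed rule: keep if > 0
```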
As we noticed that a different split of the test and training sets yielded different results, especially for the trait level, we chose to perform a 50-fold variation of the training and test sets in a bootstrap-like manner. This means we randomly created 50 different splits of the observations into 80% training and 20% test sets.
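In code, this resampling could look as follows; a plain fixed-effects lm() stands in for the selected (multilevel) model to keep the sketch short, and the data frame d is again the simulated one from the sketches above:

```r
set.seed(42)
r2_test <- numeric(50)

for (i in 1:50) {
  idx   <- sample(nrow(d), size = floor(0.8 * nrow(d)))  # random 80% training rows
  train <- d[idx, ]
  test  <- d[-idx, ]

  fit  <- lm(valence_z ~ posemo_z, data = train)
  pred <- predict(fit, newdata = test)

  # Out-of-sample R2 on the held-out 20%
  r2_test[i] <- 1 - sum((test$valence_z - pred)^2) /
    sum((test$valence_z - mean(test$valence_z))^2)
}
summary(r2_test)  # distribution of test-set R2 across the 50 splits
```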
A total of 51 participants (51/60, 85%) recorded between 1 and 96 voice samples out of the total number of ESM surveys they completed, with an average compliance rate of 19% (speech mean 0.19, SD 0.21; range 0.01-0.94). Within participants, there was a significant correlation between the day of the study (1 to 14) and the number of voice recordings (
A total of 59 participants (59/60, 98%) used the custom-made keyboard between 5 and 117 times in the hour around their completed ESM surveys, with an average use rate of 60% (writing mean 0.60, SD 0.21; range 0.07-0.95). Here, again, use declined throughout the study within participants (
After the FDR correction at the momentary level,
Multilevel correlations between the state emotions and speech content variables (n=1015). *
Spearman correlations between the trait emotions and speech content variables (n=51). *
After the FDR correction at the momentary level,
Multilevel correlations between the state emotions and speech form variables (n=1015). *
Spearman correlations between the trait emotions and speech form variables (n=51). *
After the FDR correction at the momentary level,
Multilevel correlations between the state emotions and writing content variables (n=3929). *
Spearman correlations between the trait emotions and writing content variables (n=59). *
After the FDR correction at the momentary level,
Multilevel correlations between the state emotions and writing form variables (n=3929). *
Correlations between the trait emotions and writing form variables (n=59). *
The highest predictive R2
Predictive R2
Predictive R2
Afterward, this process was repeated for the speech content and form features combined in 1 model, the writing content and form features combined in 1 model, and all 4 language features combined in 1 model (see below).
Predictive R2
Predictive R2
In this study, we investigated the potential of mobile-sensed language features as unobtrusive emotion markers. We looked at pairwise multilevel correlations between emotions or mood and language features—distinguishing between speech content, speech form, writing content, and writing form—and at the (combined) predictive performance of those features in regression models.
Most of the significant correlations were found between speech content features and momentary emotions but were rather low, varying between |0.11| and |0.25|. However, they are in the range of those found in previous studies [
Speech form and momentary emotions also displayed some significant correlations, ranging from |0.11| to |0.23|. Most of the literature has focused on discriminating high-arousal from low-arousal emotions [
Writing content features showed only a few weak significant correlations. Varying between |0.05| and |0.10|, these were lower than expected yet not entirely surprising given the mixed results throughout previous work [
Writing form showed the fewest significant correlations, also in the range of |0.05| to |0.10|. In the literature, typing speed and average key press duration have seemingly been linked to emotions; in this study, however, the number of characters and the number of keyboard entries were the most telling. They followed the direction of the word count correlations. Here, again, valence and happiness showed the most correlations, and all trait-level correlations were nonsignificant.
As could be expected based on the number of significant correlations, valence and happiness showed the highest predictive R2 values.
When combining multiple types of momentary language data into the same models, writing does not contribute to better predictions. Combining speech content and form features yields more or less the same results as their separate models, whereas adding writing content and form features does not further improve the predictive performance. An important remark here is that not all ESM surveys with voice recordings had accompanying keyboard activity. Because of this, the combined data set was further reduced in size, which might partly explain why the combined models seemingly have no added value.
No significant correlations were found at the trait level. In addition, aggregation reduced the number of data points from multiple observations to a single observation per participant. As a result, our expectations for trait predictive performance were lower than those for the momentary models. As can be seen in
Overall, the found relationships were largely in the predicted direction but were very modest in size. For speech, these values are more or less in line with previously obtained results; however, writing performed below expectations. There are 3 main differences between voice recordings and keyboard activity that might account for this. One is the nature of collection—voice recordings were deliberately initiated, whereas keyboard activity was unobtrusively recorded. Second, keyboard activity was gathered without any instruction, whereas voice recordings came with the explicit instruction for the participants to say what they were doing and how they felt. Finally, although LIWC was able to categorize on average 87.17% (SD 38.83%) of the spoken words, it only recognized on average 54.23% (SD 25.81%) of the text messages because of typos and other distortions.
A second dichotomy exists between the momentary and trait values. Previous work has often focused on between-group designs, whereas this study could only record significant within-person correlations. At the trait level, we found no significant correlations, and predictive trait models showed more variability and 0 or negative predictive R2 values.
The first limitation entails that data collection depended on the participants’ willingness to use the custom-made keyboard instead of their default one and to make recordings. This reduced the number of observations and created an unbalanced data set. Ideally, the smartphone’s own keyboard and microphone would be activated and logged at will, but this is impossible because of technical and ethical constraints. A solution might be to link reimbursement directly to the provision of valid data in the form of keyboard use or voice recordings, although this might create a perception of coercion.
A second limitation lies in the software used. We worked with LIWC as it is widely used in the literature and provides a fast and easy-to-use interface. A downside is that it only recognizes single words and not phrases. When the participants talked about feeling
A third limitation is inherent to the sample of participants. Despite the strong representation of the depression–language relationship in the literature, this study was not able to link depressive symptoms to any language feature. Replicating this study in a more diverse or clinical population might yield other results for depression.
Finally, language is strongly dependent on the chosen medium. Talking to a smartphone following a specific instruction restrains the natural flow of language and can compromise the generalizability of these findings. More technically, this also means that the participants would sometimes talk softly in a quiet room, whereas they might be shouting over background noise in another recording. We should keep in mind that loudness is as much a product of the environment as of the voice. Then again, this context might itself say something about the emotional experience.
This study investigated the relationship between self-reported emotions and 4 types of mobile-sensed language. The correlations and predictive performances found were weak overall, remaining <0.25. The best-performing language type was speech content, which displayed the largest number of significant correlations and the largest predictive R2 values.
Depression, Anxiety, and Stress Scale
experience sampling method
fundamental frequency
false discovery rate
Linguistic Inquiry and Word Count
The research reported in this paper is supported by Katholieke Universiteit Leuven Research Council grants C14/19/054 and C3/20/005 and by the European Research Council under the Horizon 2020 research and innovation program of the European Union and the European Research Council Consolidator grant SONORA (773268). This paper reflects only the authors’ views, and the Union is not liable for any use that may be made of the contained information.
CC contributed to writing and analysis. KN contributed to writing and analysis. MM contributed to the methodology and investigation. MB contributed to review, editing, and data curation. PV contributed to review, editing, and data curation. LG contributed to review and editing, methodology, and investigation. TvW contributed to review and editing, methodology, and investigation. PK contributed to writing and analysis.
None declared.