This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.
Traditional offline assessment of suicide probability is time consuming, and convincing at-risk individuals to participate is difficult. Identifying individuals with high suicide probability through online social media has the advantage of efficiency and the potential to reach otherwise hidden individuals, yet little research has focused on this specific field.
The objective of this study was to apply two classification models, Simple Logistic Regression (SLR) and Random Forest (RF), to examine the feasibility and effectiveness of identifying microblog users in China with high suicide probability through profile and linguistic features extracted from Internet-based data.
A total of 909 Chinese microblog users completed an Internet survey. Those scoring one SD above the mean of the total Suicide Probability Scale (SPS) score, as well as one SD above the mean of each of the four subscale scores, were labeled as high-risk individuals for the corresponding scale. Profile and linguistic features were fed into two machine learning algorithms (SLR and RF) to train models that identify high-risk individuals for overall suicide probability and for each of its four dimensions. Models were trained and then tested by 5-fold cross-validation, in which both training and test sets were generated from the whole sample under the stratified random sampling rule. Three classic performance metrics (Precision, Recall, F1 measure) and a specifically defined metric, "Screening Efficiency," were adopted to evaluate model effectiveness.
Classification performance was generally comparable between SLR and RF. At their best, the classification models retrieved over 70% of the labeled high-risk individuals for overall suicide probability as well as for each of the four dimensions. Screening Efficiency of most models ranged from 1/4 to 1/2, and Precision was generally below 30%.
Individuals in China with high suicide probability are recognizable from the profile and text-based information of their microblogs. Although there is still much room to improve the performance of the classification models, this study sheds light on preliminary screening of at-risk individuals via machine learning algorithms, which can work side by side with expert scrutiny to increase the efficiency of large-scale surveillance of suicide probability on online social media.
Identifying individuals with suicide probability at an early stage is vital for suicide intervention and prevention. Over the past few decades, researchers have dedicated themselves to identifying the characteristics of individuals with high suicide probability. Clinicians have found high suicide risk in individuals with physical or psychological disease, for example, cancer, Acquired Immune Deficiency Syndrome, and depression [
As the Internet has become a fast-growing platform for social interaction in recent years, a large number of social network platforms contain suicide-related information, providing a rich source for monitoring suicide probability [
In this study, we examine the feasibility and effectiveness of identifying high suicide probability microblog users automatically based on Internet accessible data. As the dominant microblog service provider in China, Sina Weibo now has 167 million active users, and more than 100 million posts are published daily [
Participants were invited to take part in this Internet survey via three approaches on Sina Weibo: (1) recruiting information was published on our laboratory’s official Sina Weibo account with over 5000 followers, and some of the followers took part in the survey voluntarily; (2) a verified celebrity of Sina Weibo, a prestigious psychologist in mainland China with more than 970,000 followers, retweeted our recruiting information and attracted more participants; and (3) another nonofficial Weibo account was created to send invitation messages randomly to users’ home pages. All participants interested in this survey were asked to log on to the Internet survey system with their Sina Weibo accounts. After they finished reading and signing an informed consent form specifying the objective of the survey and their rights, they were invited to complete a survey on demographic information and mental health status, including the Suicide Probability Scale (SPS) in Mandarin. They received a compensation of 30 Renminbi if they completed the whole survey. Contact information for a national suicide prevention hotline was shown on the survey Web page, and participants were encouraged to seek help if they felt stressed or suicidal. The ethical aspects of the study were reviewed and approved by the Review Board of the Institute of Psychology, Chinese Academy of Sciences.
Participant screening was conducted to ensure the quality of the whole process. First, to comply with the ethics code, only participants above 18 years of age were included. Next, to decrease the possibility that one person completed the survey more than once with different microblog accounts, participants’ Internet Protocol (IP) addresses were examined; duplicate submissions from the same IP address were eliminated, and only the first submission was used. Last, because a participant needed an adequate number of microblog posts for feature extraction to avoid a “floor effect”, we kept only participants with more than 100 posts in total.
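The three screening rules above can be sketched as a simple sequential filter. This is an illustrative reconstruction, not the study's actual code; the field names (`age`, `ip`, `post_count`) are assumptions.

```python
def screen_participants(submissions):
    """Apply the three screening rules in order: adults only,
    first submission per IP address, and more than 100 posts.
    `submissions` is assumed to be in chronological order, and each
    entry is a dict with hypothetical keys: age, ip, post_count."""
    seen_ips = set()
    kept = []
    for s in submissions:
        if s["age"] < 18:            # rule 1: ethics code, adults only
            continue
        if s["ip"] in seen_ips:      # rule 2: keep first submission per IP
            continue
        seen_ips.add(s["ip"])
        if s["post_count"] <= 100:   # rule 3: enough posts for features
            continue
        kept.append(s)
    return kept
```

Note that the IP rule marks an address as seen even when its first submission is later discarded by the post-count rule; a real pipeline would need to decide that edge case explicitly.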
From May 22 to July 13, 2014, 1196 Weibo users took part in the survey; 1040 completed the whole survey, and 909 of them passed the screening. The final sample consisted of 909 Sina Weibo users (561 female, 348 male; mean age 24.3 years, SD 5.0).
The SPS was developed by Cull and Gill to assess suicide risk of adults and adolescents above the age of 14. Previous studies have verified that SPS could be utilized as an effective screening tool in the community for individual suicide prevention and intervention [
SPS is substantially related to an externally developed index of suicide risk; individuals identified with high suicide probability require further expert scrutiny, or conditional evaluation with family members and friends. The Ontario Hospital Association and Canadian Patient Safety Institute suggested a total raw score of 78 as the cutoff point for high suicide risk [
SPS score distribution and score-based categorization.
| Name of scale | Average score, mean (SD) | Cutoff for high score class | Cutoff for low score class |
| --- | --- | --- | --- |
| SPS (total) | 69.4 (11.8) | >81 | <58 |
| Hostility subscale | 13.0 (2.5) | >15 | <11 |
| Suicide ideation subscale | 11.5 (3.2) | >14 | <9 |
| Negative self-evaluation subscale | 20.5 (4.4) | >24 | <17 |
| Desperation subscale | 24.6 (4.7) | >29 | <20 |
All publicly available digital records of users were downloaded by calling the application programming interfaces provided by the Sina Weibo Data Center; profile and linguistic features were then extracted from these records to train the models.
Profile features fall into three categories: (1) participant profile or general behavior; (2) user settings; and (3) participant’s microblog behavior.
Category (1) includes: gender; length of username; total number of favorites/followers/follows/friends (mutual follows); length of self-description; length of domain name; count of digits in domain name; number of openly published microblogs; number of originally published microblogs; number of originally published microblogs with photos; number of originally published posts with URLs; number of originally published posts with “@”; number of microblogs published between 22:00 and 6:00; number of times that the participant used first person plural/singular words; number of total/positive/negative emoticons; and number of days that the participant stayed active. To determine positive and negative emoticons, five psychology professionals were recruited to evaluate all 1983 Sina Weibo emoticons. Based on their agreement, 48 positive emoticons and 118 negative emoticons were ultimately identified.
Category (2) includes: whether the user enables private message sending; whether the user allows all users to leave comments; whether the user enables geotagging of their account; and whether the user includes “I” in self-description.
Category (3) includes: the average/maximum/minimum/median number of words in the participant’s single microblog; the average number of comments on the participant’s single microblog; the average number of times that the participant’s single microblog was retweeted; the average number of “likes” for the participant’s single microblog; microblog originality (original posts/total posts in public domain); microblog transitivity (posts containing hyperlinks/total posts in public domain); microblog interaction (posts @ other users/total posts in public domain); group reference (the average number of first person plural words per post); self-reference (the average number of first person singular words per post); nocturnal activeness (posts published during 22:00 to 6:00/total posts in public domain); adoption of positive emoticons (the average number of positive emoticons per post); adoption of negative emoticons (the average number of negative emoticons per post); and social activeness (number of friends/number of followers). Ratio data were adopted for many of the Category (3) features to eliminate the impact of time discontinuity, since participants varied in the length of their active periods on Weibo.
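A few of the Category (3) ratio features could be computed along the following lines. This is a minimal sketch under assumed field names (`original`, `text`, `hour`); the study's real pipeline, including its Chinese tokenization and emoticon matching, is not reproduced here.

```python
def microblog_ratios(posts, n_friends, n_followers):
    """Compute a subset of the Category (3) ratio features.
    Each post is a dict with hypothetical keys:
    original (bool), text (str), hour (int, 0-23 posting hour)."""
    total = len(posts)
    if total == 0:
        return None  # screening already requires >100 posts
    return {
        # original posts / total posts
        "originality": sum(p["original"] for p in posts) / total,
        # posts containing hyperlinks / total posts (crude substring check)
        "transitivity": sum("http" in p["text"] for p in posts) / total,
        # posts mentioning other users / total posts
        "interaction": sum("@" in p["text"] for p in posts) / total,
        # posts published between 22:00 and 6:00 / total posts
        "nocturnal": sum(p["hour"] >= 22 or p["hour"] < 6 for p in posts) / total,
        # number of friends / number of followers
        "social_activeness": n_friends / n_followers if n_followers else 0.0,
    }
```

Expressing the counts as ratios, as the text explains, makes users with short and long active periods comparable.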
We adopted those features according to three criteria: (1) a few features were raised in previous research. For example, there has been much work focusing on the connection between suicide intention, depressed thinking, and insomnia [
Using Simplified Chinese Micro-blog Word Count Dictionary (SCMBWC), a Chinese version of Language Inquiry and Word Count [
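Dictionary-based linguistic feature extraction in the LIWC/SCMBWC style amounts to counting, per post, how many tokens fall into each psychologically meaningful word category. The sketch below is illustrative only: the categories and words are examples, not actual SCMBWC entries, and real Chinese text would first need word segmentation.

```python
# Illustrative word categories (NOT the real SCMBWC dictionary).
CATEGORIES = {
    "first_person_singular": {"我"},          # "I/me"
    "negative_emotion": {"难过", "绝望"},      # "sad", "desperate"
}

def category_counts(tokens):
    """Count tokens per dictionary category for one segmented post."""
    counts = {name: 0 for name in CATEGORIES}
    for tok in tokens:
        for name, words in CATEGORIES.items():
            if tok in words:
                counts[name] += 1
    return counts
```

Per-user linguistic features would then be averages of these counts over all of the user's posts, mirroring the per-post ratios used for the profile features.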
We built our models on a training set and then evaluated them on a held-out test set. To do so, we first divided all the participants into three classes. As mentioned above, participants scoring one SD above the mean (mean+1SD) were labeled as high-risk individuals. Accordingly, participants scoring below mean-1SD were labeled as low-risk, and those scoring in between were labeled as medium-risk. Intuitively, there may exist significant differences in behavioral and linguistic features between high-risk and low-risk individuals; thus, models built upon these two groups might capture the appropriate patterns to differentiate high-risk individuals from low-risk ones. To ensure model applicability to the general Weibo user crowd, the proportion of each class in a test set follows the distribution of the whole participant sample, so that the performance of the models can be genuinely reflected.
Therefore, the training sets are drawn from the two extreme groups only, but the test sets consist of participants from all three groups, since we want to test the performance of the model in a real-world scenario. Here, we ran training and testing by 5-fold cross-validation. Each training set consisted of 80% of the high-risk and low-risk individuals (suicide probability, 216/269; hostility, 224/279; suicide ideation, 201/250; negative self-evaluation, 272/339; and desperation, 196/245), and each test set consisted of 20% of high-risk, medium-risk, and low-risk individuals (181/909). Both training set and test set were randomly generated 5 times from the whole participant pool to balance the variance of stratified random sampling.
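The labeling and split scheme described above can be sketched as follows; one fold of the 5-fold procedure is shown, and the use of the population SD is an assumption made for simplicity.

```python
import random
import statistics

def label_by_sd(scores):
    """Label each participant high/medium/low risk by mean +/- 1 SD."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population SD (assumption)
    labels = []
    for s in scores:
        if s > mean + sd:
            labels.append("high")
        elif s < mean - sd:
            labels.append("low")
        else:
            labels.append("medium")
    return labels

def one_fold_split(indices_by_label, test_frac=0.2, seed=0):
    """Train on 80% of the high/low extremes only; test on a stratified
    20% sample drawn from all three groups."""
    rng = random.Random(seed)
    train, test = [], []
    for label, idx in indices_by_label.items():
        idx = idx[:]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        if label in ("high", "low"):   # medium-risk never enters training
            train.extend(idx[n_test:])
    return train, test
```

Because the test set is stratified over all three groups while training uses only the extremes, test performance reflects deployment against the full Weibo score distribution, which is exactly the design motivation stated above.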
Two machine learning algorithms were employed for training the classification models: SLR and RF. SLR is a probabilistic classification model, a special case of the linear model with a binary dependent variable. RF is an ensemble method that trains multiple decision trees, where the final result is the mode of all decision trees’ outputs. Both algorithms have been used in previous research to triage health problems [
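The study trained these classifiers in WEKA; as a rough scikit-learn analogue, `LogisticRegression` and `RandomForestClassifier` can stand in for WEKA's SimpleLogistic and RandomForest (they are related but not identical algorithms), and the toy data below is purely illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix: one informative feature, one noise-like feature.
X = [[float(i), float(i % 3)] for i in range(40)]
y = [1 if i >= 20 else 0 for i in range(40)]  # 1 = high risk (toy labels)

slr = LogisticRegression().fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

In the study's setup, `X` would hold the profile and linguistic features and `y` the high/low-risk labels of the two extreme groups in the training set.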
In addition, we defined “Screening Efficiency” to measure the workload saved compared with traditional clinical suicide scrutiny. Screening Efficiency was calculated as (total number of instances - total number of instances predicted to be positive)/total number of instances. For example, if there were 100 individuals in total, and 40 of them were prescreened by our model as high risk, then only those 40 would have to move forward to expert evaluation; thus, the workload saved would be (100-40)/100*100%=60%. Training and testing of models were all conducted via WEKA, a widely adopted machine learning workbench for data mining [
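The four evaluation metrics can be written out directly from their definitions in the text (Screening Efficiency is the study's own metric; the other three are standard):

```python
def evaluate(y_true, y_pred):
    """y_true/y_pred are 0/1 flags per instance (1 = high risk)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Screening Efficiency: share of instances NOT flagged as high risk,
    # i.e. spared from further expert evaluation.
    n = len(y_true)
    screening_efficiency = (n - sum(y_pred)) / n
    return precision, recall, f1, screening_efficiency
```

Plugging in the worked example from the text (100 individuals, 40 flagged) gives a Screening Efficiency of 0.60 regardless of how many of the 40 are true positives, which is why the metric trades off against Recall.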
The majority of users (873/909, 96.0%) were adults below the age of 35, which is consistent with the current age distribution in Sina Weibo.
Model performance for classifying overall suicide probability.
| Classifier | Trial number | Precision | Recall | F1 measure | Screening efficiency |
| --- | --- | --- | --- | --- | --- |
| SLR | 1 | 0.13 | 0.50 | 0.20 | 0.38 |
| SLR | 2 | 0.14 | 0.54 | 0.23 | 0.42 |
| SLR | 3 | 0.23 | 0.79 | 0.35 | 0.46 |
| SLR | 4 | 0.13 | 0.50 | 0.21 | 0.41 |
| SLR | 5 | 0.19 | 0.79 | 0.31 | 0.36 |
| RF | 1 | 0.13 | 0.57 | 0.21 | 0.32 |
| RF | 2 | 0.18 | 0.75 | 0.29 | 0.34 |
| RF | 3 | 0.20 | 0.82 | 0.32 | 0.36 |
| RF | 4 | 0.16 | 0.64 | 0.26 | 0.38 |
| RF | 5 | 0.15 | 0.64 | 0.24 | 0.33 |
Model performance for classifying hostility.
| Classifier | Trial number | Precision | Recall | F1 measure | Screening efficiency |
| --- | --- | --- | --- | --- | --- |
| SLR | 1 | 0.12 | 0.30 | 0.17 | 0.62 |
| SLR | 2 | 0.16 | 0.37 | 0.22 | 0.65 |
| SLR | 3 | 0.18 | 0.52 | 0.26 | 0.56 |
| SLR | 4 | 0.16 | 0.44 | 0.24 | 0.60 |
| SLR | 5 | 0.21 | 0.70 | 0.33 | 0.50 |
| RF | 1 | 0.14 | 0.56 | 0.22 | 0.40 |
| RF | 2 | 0.17 | 0.67 | 0.27 | 0.42 |
| RF | 3 | 0.14 | 0.48 | 0.21 | 0.47 |
| RF | 4 | 0.12 | 0.44 | 0.18 | 0.42 |
| RF | 5 | 0.14 | 0.52 | 0.22 | 0.44 |
Model performance for classifying suicide ideation.
| Classifier | Trial number | Precision | Recall | F1 measure | Screening efficiency |
| --- | --- | --- | --- | --- | --- |
| SLR | 1 | 0.19 | 0.81 | 0.31 | 0.29 |
| SLR | 2 | 0.22 | 0.84 | 0.34 | 0.33 |
| SLR | 3 | 0.19 | 0.74 | 0.30 | 0.33 |
| SLR | 4 | 0.16 | 0.65 | 0.26 | 0.31 |
| SLR | 5 | 0.20 | 0.81 | 0.32 | 0.30 |
| RF | 1 | 0.17 | 0.84 | 0.28 | 0.15 |
| RF | 2 | 0.17 | 0.81 | 0.29 | 0.20 |
| RF | 3 | 0.18 | 0.84 | 0.29 | 0.18 |
| RF | 4 | 0.17 | 0.77 | 0.28 | 0.21 |
| RF | 5 | 0.17 | 0.77 | 0.27 | 0.20 |
Model performance for classifying negative self-evaluation.
| Classifier | Trial number | Precision | Recall | F1 measure | Screening efficiency |
| --- | --- | --- | --- | --- | --- |
| SLR | 1 | 0.25 | 0.68 | 0.37 | 0.49 |
| SLR | 2 | 0.24 | 0.59 | 0.34 | 0.53 |
| SLR | 3 | 0.20 | 0.47 | 0.29 | 0.55 |
| SLR | 4 | 0.21 | 0.62 | 0.32 | 0.45 |
| SLR | 5 | 0.24 | 0.74 | 0.36 | 0.41 |
| RF | 1 | 0.22 | 0.71 | 0.33 | 0.39 |
| RF | 2 | 0.23 | 0.65 | 0.34 | 0.47 |
| RF | 3 | 0.22 | 0.65 | 0.33 | 0.46 |
| RF | 4 | 0.22 | 0.74 | 0.34 | 0.38 |
| RF | 5 | 0.20 | 0.62 | 0.30 | 0.41 |
Model performance for classifying desperation.
| Classifier | Trial number | Precision | Recall | F1 measure | Screening efficiency |
| --- | --- | --- | --- | --- | --- |
| SLR | 1 | 0.15 | 1.00 | 0.26 | 0 |
| SLR | 2 | 0.17 | 0.89 | 0.29 | 0.22 |
| SLR | 3 | 0.15 | 1.00 | 0.26 | 0 |
| SLR | 4 | 0.14 | 0.48 | 0.21 | 0.48 |
| SLR | 5 | 0.15 | 0.63 | 0.24 | 0.36 |
| RF | 1 | 0.14 | 0.67 | 0.24 | 0.31 |
| RF | 2 | 0.13 | 0.67 | 0.22 | 0.26 |
| RF | 3 | 0.13 | 0.56 | 0.21 | 0.37 |
| RF | 4 | 0.10 | 0.44 | 0.17 | 0.37 |
| RF | 5 | 0.15 | 0.78 | 0.25 | 0.21 |
The key finding of our study is that a high level of suicide probability, along the dimensions of hostility, suicide ideation, negative self-evaluation, and desperation, can be identified with acceptable performance from the profile and text data of microblog users. Classification performance was generally comparable between SLR and RF: Precision varies from 10% to 25%, Recall varies from 30% to 89%, F1 measures vary from 17% to 37%, and Screening Efficiency varies from 21% to 65%. The performance of the classifiers seems to depend on the randomization of data between the training and test sets. For example, Recall on hostility using SLR varies by 40 percentage points (0.30-0.70), but only by 7 percentage points for suicide ideation using RF (0.77-0.84). This may suggest that the degree of generalizability differs across the four risk factors measured by the subscales; for example, future studies may be designed to verify whether suicide ideation has the greatest potential for identifying individual suicide risk among all the emotional factors.
For any risky individual, suicide prevention and intervention is a continuous process, involving a constantly alternating process of suicide risk evaluation and intervention therapy [
As the evaluation results show, among the three classic performance metrics, Recall is generally higher than the other two. This suggests that the models attempt to retrieve as many at-risk individuals as possible, even at the cost of some increase in false alarms. Considering the severity of the suicide act, we do not want to miss any at-risk individual; therefore, Recall is our primary concern in this study. However, the low Precision and F1 measure indicate that the current models alone can only serve as a preliminary screening tool for suicide probability. Some of the latest research findings also suggest that even though prediction of psychological problems by machine learning algorithms has advanced in accuracy, it still cannot take the place of expert scrutiny [
It is thus of particular interest to explore to what extent preliminary screening of high-risk individuals via machine learning algorithms can reduce the workload of traditional scale-based assessment of suicide risk. Our newly defined metric, “Screening Efficiency,” shows that, even assuming the proposed models serve at their best performance, we can currently save less than half of the traditional workload in general. Although not directly complementary to Recall, a tradeoff was revealed in many of the experimental trials between the amount of workload saved for further scrutiny and the proportion of correctly retrieved high-risk individuals. Combining the model evaluation results, we believe there is still much room for improving the predictive power of the models in subsequent research. Nevertheless, this study is a promising start for progressive work on feature extraction, model design, and classifier selection.
To improve the usability of our Internet survey system, we allowed participants to complete the survey discontinuously. In other words, if a participant was interrupted and forced to pause the partly completed survey, the progress could be saved for the next access. We did find a few participants with long completion times and were unable to tell whether they were interrupted or whether other factors might have biased the value of the self-report assessment. This concern calls for the optimization of Internet assessment methodology. Some researchers have already been working on developing short, good-quality tools to test suicidal behavior on the Internet [
It is natural to wonder whether some features have the strongest predictive power among all the proposed features. According to the model outputs of our study, the powerful indicators are not consistent among different models; the predictive features in models built with the same algorithm even differ among trials. In addition, the predictive features are often uninterpretable. Although one of the advantages of machine learning is discovering hidden relations that do not fit the current knowledge system, we admit that we currently have better knowledge of the overall predictive power of the models than of the specific predictive power of any single feature. It is of interest to us to consolidate the feature system and to strengthen output interpretation.
In this pilot study, we categorized users into three classes, and in particular labeled those who scored above mean+1SD as high-risk individuals to indicate that they are more likely to need careful clinical evaluation of suicide risk. Because there is no norm group for suicide probability scores among China’s Sina Weibo users, we are aware of possible bias in this user sample and in the cutoff points for high suicide probability based on it. Future studies that intend to advance Internet-based suicide research in China may investigate the localization of this measuring tool for specific Internet populations.
Social media is now widely used. Our study indicates that high suicide probability can be evaluated via the publicized profile and text information of microblog users. Although our model currently cannot reach sufficient accuracy to provide a diagnosis, this approach does shed light on the value of monitoring large-scale populations and enables the detection of potentially suicidal individuals for further follow-up by suicide prevention professionals. Future studies need to focus on increasing classification accuracy and testing performance on a larger scope of social media users.
CAS: Chinese Academy of Sciences
IP: Internet Protocol
RF: Random Forest
SCMBWC: Simplified Chinese Micro-blog Word Count dictionary
SLR: Simple Logistic Regression
SPS: Suicide Probability Scale
The authors gratefully acknowledge the generous support from the National High-Tech R&D Program of China (2013AA01A606), the National Basic Research Program of China (2014CB744600), the Key Research Program of Chinese Academy of Sciences (CAS) (KJZD-EWL04), and the CAS Strategic Priority Research Program (XDA06030800). The study was also partly supported by the Research Grants Council Strategic Public Policy Grant (HKU 7003-SPPR-12).
None declared.