This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on https://mental.jmir.org/, as well as this copyright and license information must be included.
Previous research has shown the feasibility of using machine learning models trained on social media data from a single platform (eg, Facebook or Twitter) to distinguish individuals with a diagnosis of mental illness, or those experiencing an adverse outcome, from healthy controls. However, the performance of such models on data from social media platforms unseen in the training data (eg, Instagram and TikTok) has not been investigated in previous literature.
Our study examined the feasibility of building machine learning classifiers that can effectively predict an upcoming psychiatric hospitalization from social media data belonging to platforms unseen in the classifiers’ training data, given the preliminary evidence of identity fragmentation on the investigated social media platforms.
Windowed timeline data of patients with a diagnosis of schizophrenia spectrum disorder before a known hospitalization event and healthy controls were gathered from 3 platforms: Facebook (254/268, 94.8% of participants), Twitter (51/268, 19% of participants), and Instagram (134/268, 50% of participants). We then used a 3 × 3 combinatorial binary classification design to train machine learning classifiers and evaluate their performance on testing data from all available platforms. We further compared results from models in intraplatform experiments (ie, training and testing data belonging to the same platform) to those from models in interplatform experiments (ie, training and testing data belonging to different platforms). Finally, we used Shapley Additive Explanation values to extract the top predictive features to explain and compare the underlying constructs that predict hospitalization on each platform.
We found that models in intraplatform experiments on average achieved an
We demonstrated that models built on one platform’s data to predict critical mental health treatment outcomes such as hospitalization do not generalize to another platform. In our case, this is because different social media platforms consistently reflect different segments of participants’ identities. With the changing ecosystem of social media use among different demographic groups and as web-based identities continue to become fragmented across platforms, further research on holistic approaches to harnessing these diverse data sources is required.
Despite its relatively low prevalence compared with other mental health disorders, the burden of schizophrenia spectrum disorder (SSD) on patients, families, and society is substantial [
Given this information, there has been an established body of research on using social media data to identify and predict psychiatric outcomes of social media users with SSD using machine learning classifiers [
Although such results demonstrate the potential of automated techniques in predicting the mental health outcomes of individuals with SSD via social media data, many research gaps remain that need to be addressed before psychiatrists can reliably deploy such techniques for clinical purposes. Most prior work in this area primarily focused on a single source of social media data, either exclusively from Twitter or Facebook, for downstream classification and analysis tasks [
The research question we attempted to answer was as follows: given the preliminary evidence of fragmented identities that are reflected on the investigated social media platforms, can we build classifiers that can effectively detect users at risk of an upcoming psychiatric hospitalization using social media data from platforms unseen in the training data?
To answer our research question, we collated textual and image content (if available) from consenting participants’ social media data from Facebook, Twitter, and Instagram. We then trained platform-specific classifiers to distinguish between social media data from healthy controls and data from patients with SSD with an upcoming psychiatric hospitalization. We compared the performance of classifiers on testing data between seen and unseen social media platforms from the training data. We also compared and analyzed the top predictive features and the feature importance distributions between the 3 platform-specific classifiers, with a view toward finding potential empirical evidence for fragmented identities between the various social media platforms.
We recruited participants clinically diagnosed with SSD and clinically verified healthy controls aged between 15 and 35 years. These data were collected as part of a broader research initiative involving the authors of this paper to identify technology-based health information to provide early identification, intervention, and treatment for young adults with SSD [
For participants with SSD aged between 15 and 35 years (141/268, 52.6%), diagnoses were based on clinical assessment of the most recent episode and were extracted from participants’ medical records at the time of their consent. Participants in this group were recruited from the Northwell Health Zucker Hillside Hospital and collaborating institutions located in East Lansing, Michigan. Participants were excluded if they had an IQ of <70 (per clinical assessment), autism spectrum disorder, or substance-induced psychotic disorder.
In addition, healthy volunteers aged between 15 and 35 years (127/268, 47.4%) were approached and recruited from an existing database of eligible individuals who had already been screened for previous research projects at Zucker Hillside Hospital and had agreed to be recontacted for additional research opportunities. Healthy status was determined by either the Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders conducted within the past 2 years or the Psychiatric Diagnostic Screening Questionnaire [
All consenting participants were asked to download and share their Facebook, Twitter, and Instagram data archives. We collected all linguistic content from participants’ Facebook and Twitter archives (ie, status updates and comments on Facebook and posts shared on Twitter). In addition, we collected image content from participants’ Facebook and Instagram archives, including profile pictures and story photos.
Next, we collected the medical history of each participant (following consent and the adoption of Health Insurance Portability and Accountability Act–compliant policies). This included primary and secondary diagnosis codes, the total number of hospitalizations, and admission and discharge dates for each hospitalization event. Hospitalization data were collected from the medical records at the time of consent. As all consented patient participants in the study had also received care at the Zucker Hillside Hospital, the medical records at the hospital were, to the best of the hospital’s efforts, accurate and up to date. We counted only psychiatric hospitalizations (not hospitalizations for other, nonpsychiatric reasons). Thereafter, the study team accessed the corresponding consented patients’ medical records to extract all their recorded hospitalization events in a manner similar to previous studies using this source of data [
Finally, for each participant with at least one known hospitalization event, we collected social media data from all available platforms within a 6-month window before the latest hospitalization event, ensuring that no other hospitalization events occurred within these 6 months. This was done to ensure that the data gathered were representative of the participants’ healthy mental status before symptomatic exacerbation and subsequent hospitalization. A 6-month period, which we refer to as the
Diagram representing the windowing process used to gather participants’ social media data before hospitalization events. Bold text represents the selected data windows. Crosses represent hospitalization events. The X represents invalid data windows. A: Windowing—with hospitalizations; B: Windowing—without hospitalizations.
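The windowing procedure described above can be sketched as follows. This is a minimal illustration with hypothetical data structures (timestamp lists); the paper's actual pipeline and exact 6-month boundary handling are not specified, so the 183-day window length is an assumption.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=183)  # approximately 6 months (assumed boundary)

def select_window(post_times, hospitalizations):
    """Return the post timestamps falling in the 6-month window
    immediately before the latest hospitalization, or None if another
    hospitalization falls inside that window (an invalid window, per
    the windowing procedure described above)."""
    latest = max(hospitalizations)
    start = latest - WINDOW
    # A window containing an earlier hospitalization is discarded
    if any(start <= h < latest for h in hospitalizations if h != latest):
        return None
    return [t for t in post_times if start <= t < latest]
```

A window is returned only when it is hospitalization-free, mirroring the requirement that the data represent the participant's status before symptomatic exacerbation.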
To encode participants’ social media data for the downstream classification and analysis tasks outlined in our research objectives, we identified and extracted the following categories of features from these data for all 3 investigated social media platforms: (1) n-gram language features (n=500), (2) Linguistic Inquiry and Word Count (n=78), (3) lexico-semantic features (n=3), (4) activity features (n=9), and (5) image features (n=23; Instagram and Facebook only).
The specific feature categories were chosen based on relevant previous literature, particularly relating to the use of social media data to infer mental health attributes and psychiatric outcomes [
Using the aforementioned features, for each of the 3 examined social media platforms, we encoded available participants’ textual and image data on Facebook and Instagram into 613-dimensional feature vectors and textual data on Twitter into 590-dimensional feature vectors. This yielded a Facebook data set of dimension 254 × 613, a Twitter data set of dimension 51 × 590, and an Instagram data set of dimension 134 × 613. We shall refer to these data sets as F, T, and I for Facebook, Twitter, and Instagram, respectively.
As the feature set might contain noisy and irrelevant features, classification models trained on it may be unstable and produce suboptimal results [
We trained a random forest model, with 5-fold stratified cross-validation to fine-tune hyperparameters, on data sets F, T, and I with an 80:20 train-test split, using only the top
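A sketch of this feature-selection step, assuming a scikit-learn implementation. The library choice, the illustrative hyperparameter grid, and the function name are our assumptions; the fraction of retained features follows the top 20% figure reported later in the Classification section.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

def select_top_features(X, y, keep_frac=0.20, seed=0):
    """Rank features by random forest (Gini) importance, tuning the
    forest with 5-fold stratified cross-validation on the 80% training
    split, and return the indices of the top `keep_frac` of features."""
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=100, random_state=seed),
        param_grid={"max_depth": [5, 10, 15]},  # illustrative grid
        cv=cv,
    )
    search.fit(X_train, y_train)
    importances = search.best_estimator_.feature_importances_
    k = max(1, int(keep_frac * X.shape[1]))
    return np.argsort(importances)[::-1][:k]
```

Holding out the 20% test split before ranking avoids leaking test information into the feature-selection step.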
To answer the research question laid out in the Introduction section, we adopted a 3 × 3 combinatorial classification design, where we trained and tested machine learning models on the psychiatric hospitalization prediction task using all possible pairs of training and testing data sets.
Diagram representing the classification experiments performed and their nature within the 3 × 3 combinatorial design.
For both intra- and interplatform experiments, training data represented by the top 20% of features (as described in the Feature Selection section) were fed into a model to learn the classification task. We trained models using several algorithms, including random forest, logistic regression, support vector machine, and multilayer perceptron [
We measured the performance of the models using the metrics outlined in
| Model | Hyperparameters |
| Random forest | max_depth: 15; n_estimators: 100; max_features: none |
| Logistic regression | penalty: l2; C: 0.1 |
| Support vector machine | kernel: rbf; C: 0.01; gamma: scale |
| Multilayer perceptron | alpha: 0.0001; hidden_layer_sizes: (512, 256, 128) |
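Assuming a scikit-learn implementation (the toolkit is our inference from the parameter names), the reported hyperparameters map onto estimators as below; the grouping of values by model family follows scikit-learn's parameter naming.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hyperparameters as reported above, expressed as scikit-learn estimators
models = {
    "random_forest": RandomForestClassifier(
        max_depth=15, n_estimators=100, max_features=None),
    "logistic_regression": LogisticRegression(penalty="l2", C=0.1),
    "svm": SVC(kernel="rbf", C=0.01, gamma="scale",
               probability=True),  # probabilities needed for ROC curves
    "mlp": MLPClassifier(alpha=0.0001, hidden_layer_sizes=(512, 256, 128)),
}
```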
Accuracy: Also known as Rand accuracy, the ratio of correct predictions to all predictions
Precision: The ratio of correct positive predictions to the total number of positive predictions
Recall: The ratio of correct positive predictions to the total number of actual positive instances
F1-score: The harmonic mean between precision and recall
AUROC: The area under the receiver operating characteristic curve, which plots the true positive rate against the false positive rate and, in practice, is often estimated using the trapezoidal rule: AUC ≈ Σ (FPRi+1 − FPRi)(TPRi + TPRi+1)/2, summed over consecutive points (FPRi, TPRi) on the curve ordered by increasing false positive rate
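A minimal numpy-based sketch of the trapezoidal AUROC estimate mentioned above (the function name is ours):

```python
import numpy as np

def auroc_trapezoidal(fpr, tpr):
    """Estimate the area under the ROC curve with the trapezoidal rule:
    AUC ~= sum_i (FPR[i+1] - FPR[i]) * (TPR[i] + TPR[i+1]) / 2,
    assuming the (fpr, tpr) points are sorted by increasing fpr."""
    fpr, tpr = np.asarray(fpr, dtype=float), np.asarray(tpr, dtype=float)
    return float(np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2.0))
```

For a perfect classifier the curve passes through (0, 1) and the estimate is 1.0; for the chance diagonal it is 0.5.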
We used Shapley Additive Explanations (SHAP) to examine how certain features affected our model’s decision to predict users with potential psychiatric hospitalization because of SSD given their social media data from the 3 inspected social media platforms. Our decision to use SHAP rather than other explainability methods stems from the fact that SHAP is not only model-agnostic but also the most theoretically sound explainability framework among the available options. This is because SHAP feature scores can be calculated for localized samples and for the entire global data set [
For each of the intraplatform experiments within the 3 × 3 combinatorial design and each machine learning model, we calculated the average SHAP values for each of the features (ie, their importance to the prediction) across all instances within the testing set. We then recorded the list of features sorted in descending order according to the average SHAP values measured by each model. In the case of models with native support for feature importance extraction, including random forest (Gini importance) and logistic regression (feature coefficients), we also calculated and recorded them in an equivalent manner to SHAP values.
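The ranking step can be sketched as follows, using the common mean-absolute-SHAP convention for global feature importance (whether the authors averaged signed or absolute values is not stated; in practice the `shap_values` array would come from the shap package's explainers):

```python
import numpy as np

def rank_features_by_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across all test
    instances. `shap_values` has shape (n_instances, n_features),
    e.g., the `.values` array produced by a shap explainer."""
    mean_abs = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]
    return [(feature_names[i], float(mean_abs[i])) for i in order]
```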
To ensure that our findings regarding differences in model performance between models and between intra- and interplatform experiments still held when certain aspects of the training and testing data sets were made more ideal, we performed several robustness checks, which are described in
The study was approved by the institutional review board of Northwell Health (the coordinating institution) and the institutional review board of the participating partners (Georgia Tech approval H21403). Participants were recruited from June 23, 2016, to December 4, 2020. Written informed consent was obtained from adult participants and legal guardians of participants aged <18 years. Assent was obtained from participating minors.
In total, 268 participants (mean age 24.73, SD 5.64 years; male: 127/268, 47.4%; SSD: 141/268, 52.6%) with nonempty windowed data for at least one platform were included. Of these 268 participants, 254 (94.8%; SSD: 133/254, 52.4%) had valid windowed Facebook data, 51 (19%; SSD: 7/51, 13.7%) had valid windowed Twitter data, and 134 (50%; SSD: 42/134, 31.3%) had valid windowed Instagram data. Among participants with valid data for more than one platform, 17.5% (47/268; SSD: 5/47, 10.6%) had valid data for both Facebook and Twitter, 14.2% (38/268; SSD: 4/38, 10.5%) had valid data for both Twitter and Instagram, and 44.4% (119/268; SSD: 34/119, 28.6%) had valid data for both Facebook and Instagram. Finally, 14.2% (38/268; SSD: 4/38, 10.5%) of participants had valid data for all 3 platforms.
Demographic and clinical characteristics of the participants (N=268).
| Characteristic | SSDa (n=141) | Control (n=127) | Full sample |
| Age (years), mean (SD) | 24.86 (5.49) | 24.57 (5.82) | 24.73 (5.64) |
| Sex, n (%) | | | |
| Male | 89 (63.1) | 38 (29.9) | 127 (47.4) |
| Female | 52 (36.9) | 89 (70.1) | 141 (52.6) |
| Race and ethnicity, n (%) | | | |
| African American or Black | 64 (45.4) | 19 (15) | 83 (31) |
| Asian | 20 (14.2) | 23 (18.1) | 43 (16) |
| White | 37 (26.2) | 75 (59.1) | 112 (41.8) |
| Mixed race or other | 15 (10.6) | 5 (3.9) | 20 (7.5) |
| Hispanic | 5 (3.5) | 4 (3.1) | 9 (3.4) |
| Pacific Islander | 0 (0) | 1 (0.8) | 1 (0.4) |
| Diagnosis, n (%) | | | |
| Schizophrenia | 67 (47.5) | N/Ab | 67 (25) |
| Schizophreniform | 26 (18.4) | N/A | 26 (9.7) |
| Schizoaffective | 25 (17.7) | N/A | 25 (9.3) |
| Unspecified SSDs | 23 (16.3) | N/A | 23 (8.6) |
| No diagnosis | N/A | 127 (100) | 127 (47.4) |
aSSD: schizophrenia spectrum disorder.
bN/A: not applicable.
Summary statistics for windowed data for both the control class and the schizophrenia spectrum disorder (SSD) class (ie, participants hospitalized with SSD). In this table, we consider data from Facebook, Twitter, and Instagram, as mentioned previously.
| Statistic | Facebook (users: n=254; posts: n=169,425) | | Twitter (users: n=51; posts: n=23,777) | | Instagram (users: n=134; posts: n=23,551) | |
| | SSD class | Control class | SSD class | Control class | SSD class | Control class |
| Total users, n (%) | 133 (52) | 121 (48) | 7 (14) | 44 (86) | 42 (31) | 92 (69) |
| Total posts, n (%) | 114,793 (68) | 54,632 (32) | 991 (4) | 22,786 (96) | 7111 (30) | 16,440 (70) |
| Posts, mean (SD) | 863.1 (2365.1) | 451.5 (818.87) | 141.6 (255) | 519.9 (1166.9) | 169.3 (445.4) | 178.7 (234.6) |
| Posts, median | 260 | 184 | 37 | 138 | 54.5 | 103 |
| Posts, range | 2-23,589 | 1-4852 | 1-758 | 1-7056 | 1-2909 | 1-1328 |
Cumulative distribution function (CDF) curves of users and their number of posts for the schizophrenia spectrum disorder and control classes per data set: (A) Facebook (left), (B) Twitter (center), and (C) Instagram (right).
We report the full results of the intraplatform experiments in
Elaborating on the results from
By contrast, by aggregating the metrics for the interplatform experiments presented in
Classification results for all intraplatform classification experiments. In this table, for instance, Facebook indicates the Facebook-Facebook experiment.
| Model | Facebook: Acca | Pb | Rc | F1 | AUROCd | Twitter: Acc | P | R | F1 | AUROC | Instagram: Acc | P | R | F1 | AUROC |
| Random forest | 0.739 | 0.739 | 0.738 | 0.738 | 0.709 | 0.745 | 0.150 | 0.116 | 0.116 | 0.494 | 0.7 | 0.648 | 0.637 | 0.637 | 0.681 |
| SVMe | 0.722 | 0.747 | 0.692 | 0.715 | 0.723 | 0.854 | 0.541 | 0.45 | 0.463 | 0.697 | 0.740 | 0.737 | 0.757 | 0.743 | 0.805 |
| MLPf | 0.506 | 0.406 | 0.507 | 0.367 | 0.516 | 0.845 | 0.458 | 0.45 | 0.426 | 0.692 | 0.792 | 0.771 | 0.794 | 0.77 | 0.840 |
| Logistic regression | 0.759 | 0.767 | 0.758 | 0.756 | 0.727 | 0.881 | 0.742 | 0.6 | 0.63 | 0.772 | 0.792 | 0.771 | 0.801 | 0.773 | 0.848 |
aAcc: accuracy.
bP: precision.
cR: recall.
dAUROC: area under the receiver operating characteristic curve.
eSVM: support vector machine.
fMLP: multilayer perceptron.
Classification results for the interplatform classification experiments for Facebook training data.
| Model | Test on Twitter: Acca | Pb | Rc | F1 | AUROCd | Test on Instagram: Acc | P | R | F1 | AUROC |
| Random forest | 0.392 | 0.221 | 0.88 | 0.354 | 0.579 | 0.379 | 0.328 | 0.952 | 0.488 | 0.537 |
| SVMe | 0.545 | 0.253 | 0.72 | 0.373 | 0.612 | 0.432 | 0.337 | 0.860 | 0.483 | 0.550 |
| MLPf | 0.587 | 0.240 | 0.55 | 0.334 | 0.573 | 0.435 | 0.332 | 0.812 | 0.471 | 0.539 |
| Logistic regression | 0.628 | 0.246 | 0.47 | 0.323 | 0.567 | 0.472 | 0.344 | 0.775 | 0.476 | 0.555 |
aAcc: accuracy.
bP: precision.
cR: recall.
dAUROC: area under the receiver operating characteristic curve.
eSVM: support vector machine.
fMLP: multilayer perceptron.
Classification results for the interplatform classification experiments for Twitter training data.
| Model | Test on Facebook: Acca | Pb | Rc | F1 | AUROCd | Test on Instagram: Acc | P | R | F1 | AUROC |
| Random forest | 0.531 | 0.569 | 0.378 | 0.452 | 0.536 | 0.628 | 0.331 | 0.207 | 0.252 | 0.512 |
| SVMe | 0.514 | 0.53 | 0.537 | 0.530 | 0.513 | 0.563 | 0.340 | 0.42 | 0.373 | 0.523 |
| MLPf | 0.533 | 0.561 | 0.440 | 0.492 | 0.536 | 0.557 | 0.325 | 0.395 | 0.356 | 0.512 |
| Logistic regression | 0.534 | 0.552 | 0.522 | 0.535 | 0.535 | 0.578 | 0.362 | 0.47 | 0.408 | 0.548 |
aAcc: accuracy.
bP: precision.
cR: recall.
dAUROC: area under the receiver operating characteristic curve.
eSVM: support vector machine.
fMLP: multilayer perceptron.
Classification results for the interplatform classification experiments for Instagram training data.
| Model | Test on Facebook: Acca | Pb | Rc | F1 | AUROCd | Test on Twitter: Acc | P | R | F1 | AUROC |
| Random forest | 0.51 | 0.523 | 0.612 | 0.563 | 0.507 | 0.751 | 0.369 | 0.42 | 0.386 | 0.624 |
| SVMe | 0.524 | 0.544 | 0.51 | 0.524 | 0.525 | 0.691 | 0.213 | 0.25 | 0.229 | 0.521 |
| MLPf | 0.554 | 0.584 | 0.48 | 0.526 | 0.557 | 0.683 | 0.201 | 0.23 | 0.214 | 0.51 |
| Logistic regression | 0.516 | 0.524 | 0.689 | 0.595 | 0.51 | 0.628 | 0.256 | 0.52 | 0.342 | 0.587 |
aAcc: accuracy.
bP: precision.
cR: recall.
dAUROC: area under the receiver operating characteristic curve.
eSVM: support vector machine.
fMLP: multilayer perceptron.
Receiver operating characteristic (ROC) curves for the classification experiments given the best logistic regression model. (A), (B), and (C) are curves for the Facebook, Twitter, and Instagram intraplatform results, respectively, from
We hypothesized that the decrease in performance from intraplatform to interplatform experiments, as presented previously, was driven by differences in the feature importance learned by models trained on data from different social media platforms (even when they shared the same feature set). By extracting the list of SHAP features from the models per the method described previously, we found support for this hypothesis. Specifically, when holding the model constant, we observed little overlap in the top 25 features across platforms. On average, there were only 4.66 overlapping features across platforms for the same logistic regression classification model (the best-performing model based on the previous discussions). In addition, we found that the lists of feature importance for each of the platforms, based on the logistic regression model, had very weak pairwise rank correlations. Elaborating on the statistical results for the Kendall rank correlation coefficient, we found very weak rank correlations between the ranked lists of feature importance for Facebook and Twitter (τb=0.081;
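The rank-correlation computation can be sketched in pure Python. This computes tau-a, assuming no tied importance values; the paper reports τb, which additionally corrects for ties (eg, via scipy.stats.kendalltau), and the two coincide in the absence of ties.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation (tau-a, no ties assumed) between two
    feature-importance vectors defined over the same, order-aligned
    feature set."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1  # pair ranked the same way in both lists
        elif s < 0:
            discordant += 1  # pair ranked in opposite ways
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs
```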
Top 10 features for the logistic regression (LR) model for each of the platforms (Linguistic Inquiry and Word Count features are italicized) based on their Shapley Additive Explanations (SHAP) values.
| Platform and feature acronym | Feature description | SHAP value | LR coefficient | SSDa group average (SD) | Control group average (SD) |
| Facebook | | | | | |
| Avg_post_readability | Average post readability, as measured using the SMOGb index | 0.761 | −0.268 | 5.6341 (2.74) | 6.8048 (1.92) |
| | Ratio of words within the “quantifiers” category | 0.4195 | −0.189 | 0.0012 (0.0012) | 0.0016 (0.0012) |
| | Ratio of words within the “negative emotions” category | 0.0953 | 0.244 | 0.0043 (0.0035) | 0.0031 (0.0022) |
| | Ratio of words within the “money” category | 0.0739 | −0.216 | 0.0007 (0.001) | 0.0011 (0.002) |
| | Ratio of words within the “swear” category | 0.0628 | 0.236 | 0.0017 (0.0025) | 0.0007 (0.001) |
| Ratio_octile8 | Ratio of activities from 9 PM to midnight | 0.0443 | 0.077 | 0.1443 (0.149) | 0.1241 (0.158) |
| Ratio_octile7 | Ratio of activities from 6 PM to 9 PM | 0.0409 | 0.177 | 0.1561 (0.1745) | 0.1054 (0.125) |
| | Ratio of words within the “anger” category | 0.0095 | 0.191 | 0.0018 (0.002) | 0.0009 (0.001) |
| Dream | Ratio of “dream” within the overall bag of words | 0.0077 | 0.224 | 0.2028 (0.468) | 0.0746 (0.24) |
| Fun | Ratio of “fun” within the overall bag of words | 0.0043 | −0.209 | 0.5722 (1.19) | 1.1315 (1.76) |
| Twitter | | | | | |
| | Ratio of words within the “conjunctions” category | 0.2319 | −0.063 | 0.0001 (0.0002) | 0.0003 (0.0004) |
| | Ratio of words within the “adjectives” category | 0.1825 | −0.05 | 0.0057 (0.004) | 0.0080 (0.005) |
| Avg_post_negativity | Average post negativity, as calculated using the VADERc library | 0.1509 | 0.082 | 0.071 (0.042) | 0.0519 (0.036) |
| | Ratio of words within the “male” category | 0.1355 | 0.039 | 0.0011 (0.0013) | 0.0007 (0.001) |
| Ratio_octile_8 | Ratio of activities from 9 PM to midnight | 0.1265 | 0.045 | 0.0231 (0.356) | 0.1227 (0.188) |
| | Ratio of words within the “ingest” category | 0.0627 | −0.056 | 0.0003 (0.0007) | 0.0014 (0.0018) |
| | Ratio of words within the “insight” category | 0.0516 | 0.053 | 0.0044 (0.004) | 0.0035 (0.003) |
| | Ratio of words within the “power” category | 0.0308 | −0.058 | 0.0024 (0.0026) | 0.0042 (0.0036) |
| | Ratio of words within the “we” category | 0.0196 | −0.056 | 0.0001 (0.0002) | 0.0002 (0.0004) |
| | Ratio of words within the “prepositions” category | 0.0117 | 0.063 | 0.0028 (0.0026) | 0.0017 (0.0017) |
| Instagram | | | | | |
| Avg_post_readability | Average post readability, as measured using the SMOG index | 0.761 | −0.203 | 5.1018 (1.15) | 6.2564 (1.638) |
| | Ratio of words within the “space” category | 0.733 | −0.147 | 0.0031 (0.0025) | 0.0042 (0.0025) |
| | Ratio of words within the “affiliation” category | 0.6839 | −0.181 | 0.0032 (0.0027) | 0.0056 (0.0034) |
| | Ratio of words within the “friend” category | 0.5336 | −0.159 | 0.0009 (0.0027) | 0.0018 (0.0034) |
| | Ratio of words within the “female” category | 0.4576 | −0.168 | 0.0008 (0.001) | 0.0019 (0.0023) |
| | Ratio of words within the “sad” category | 0.4554 | 0.113 | 0.0011 (0.0008) | 0.0007 (0.0012) |
| | Ratio of words within the “quantifier” category | 0.4195 | −0.118 | 0.0012 (0.0013) | 0.0019 (0.0016) |
| Away | Ratio of “away” within the overall bag of words | 0.4064 | −0.105 | 0.0768 (0.276) | 0.2505 (0.5) |
| | Ratio of words within the “assent” category | 0.3913 | −0.102 | 0.0008 (0.0012) | 0.0013 (0.0014) |
| Next | Ratio of “next” within the overall bag of words | 0.3854 | −0.12 | 0.0957 (0.267) | 0.6466 (1.236) |
aSSD: schizophrenia spectrum disorder.
bSMOG: Simple Measure of Gobbledygook.
cVADER: Valence Aware Dictionary and sEntiment Reasoner.
What could explain the observed differences in construct validities of the intraplatform models? Early in this paper, we posited that these differences might stem from people’s identities being fragmented across different platforms. To establish that these divergent identities are indeed the drivers behind differential cross-platform model construct validities and, by extension, performance, we adopted a strategy to measure the differences within the extracted feature space between the investigated platforms for a given participant. As social media data for participants on all platforms are encoded via feature vectors in this study, we calculated the pairwise similarity between platform-specific data using cosine similarity [
We found that the average between-platform, within-participant cosine similarity was 0.3093 for Facebook-Twitter, 0.2304 for Facebook-Instagram, and 0.3905 for Twitter-Instagram. This was either lower than or similar to the average within-platform, between-participant cosine similarity for the investigated platforms: 0.5072 for Facebook, 0.5427 for Twitter, and 0.373 for Instagram. The same trend holds even when calculating the averages using data from both participants with SSD and healthy controls with data from all 3 platforms.
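A sketch of this similarity computation (function names are ours; the vectors are assumed to lie in a shared feature subspace across platforms, restricted to features common to both):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_within_participant(vecs_a, vecs_b):
    """Average between-platform, within-participant similarity:
    vecs_a[p] and vecs_b[p] hold the same participant p's feature
    vectors on two different platforms."""
    return float(np.mean([cosine(vecs_a[p], vecs_b[p]) for p in vecs_a]))
```

Low values of this within-participant average, relative to the within-platform, between-participant average, indicate that the same person's data diverge across platforms.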
Our study aimed to measure the ability (or inability) of mental health classifiers to generalize across platforms and to surface evidence of fragmented identities on social media among patients with SSD. Overall, we found that models trained on data from one social media platform generalize poorly when evaluated on data from other platforms, even when the feature set is held constant across training and testing data. This trend holds in the 2 robustness tests, where the same participants and data set sizes were used in the training and testing data (as described in the Methods section), and it persists even when the training data come from a platform with high data availability and the testing data come from a platform with low data availability. For instance, the best
Next, we discuss the findings regarding feature importance in more detail. First, looking at the theoretical validity of the top 10 features per platform and interpretation of the sign of the features’ logistic regression coefficient, we found alignment with previous literature and evidence of clinical meaningfulness [
That said, each model corresponding to each platform seemed to pick up contrasting signals from its respective training data, which is why we note the low overlap in the aforementioned top SHAP features. Among the few that overlap in the top 10 features reported previously, we found “avg_post_readability” to be picked up as a highly predictive feature by both Facebook and Instagram models, whereas “ratio_octile8” was selected by both Facebook and Twitter models. In our case, “avg_post_readability” is calculated using the Simple Measure of Gobbledygook index, which approximates the years of education needed to fully comprehend a piece of written text. The negative logistic regression coefficient and the averages of the SSD and control groups for this feature suggest that texts written by patients with SSD are simpler in nature, which is indicative of language dysfunction. This is a known negative symptom of schizophrenia and related psychotic disorders, as observed in prior work [
At the crux of these differences, we found that the models had inherently different construct validity across platforms. Data on each platform reflect only a segment of an individual’s identity—a segment that may be absent in another platform. The fragmentation of one’s identity on social media can be most clearly seen among participants with data on all 3 platforms. In the analysis presented at the end of the Results section, we found low average pairwise cosine similarities within participants between platforms, especially when comparing with cosine similarities of different participants within the same platform. This indicates that, even within the same feature space for the same participant, social media data between platforms are likely to diverge into multiple distinct directions mapping to these fragments of identities. This divergence is at least equal to, if not even greater than, the divergence in identity presentation between different individuals within the same social media platform. Therefore, when models trained on data from one platform learn this specific fragment of identity, they are less effective on testing data that capture a different identity.
Our findings provide replicative validity to several threads in previous research. Specifically, we found that the performance of models trained on social media data with clinically verified labels (ie, SSD or control) is consistent with similar models presented in previous research, including those trained on similar patient populations and clinical sites [
Our findings have important implications for mental health research and practice. Hospitalization prediction for psychiatric illnesses by harnessing digital trace data has been of significant interest in recent years. These previous studies have explored the utility of smartphone sensor data (ie, geolocation, physical activity, phone use, and speech), wearables, and social media activity to predict symptom fluctuations as well as understand the diagnostic process and hospitalization identification [
Finally, digital interventions that are touted to be powered by social media data should consider the significant aspect of fragmented web-based identities of patients [
Our work has some limitations that could be addressed in future research. First, despite the use of data augmentation techniques to rebalance the ratio between SSD data and control data for each data set and make the data set sizes of the 3 examined platforms (ie, Instagram, Twitter, and Facebook) comparable with each other, we acknowledge that a limited quantity of available data may have affected the observed classification performance. Although it is widely recognized that patient social media data are challenging to collect, as was the case in this study, future research may consider the potential of creating large benchmarked data sets that may support better reproducible research in this field [
In this study, we showed that it is challenging to build effective models for predicting future psychiatric hospitalizations of patients with SSD on new social media data from platforms previously unseen in the models’ training data. Specifically, we demonstrated that models built on one platform’s data do not generalize to another as each platform consistently reflects different segments of participants’ identities. This fragmentation of identity is empirically backed up by both significant differences in the construct validity of intraplatform classifiers and divergent feature vectors within participants between the 3 investigated social media platforms. To ensure the effective incorporation of digital technology into early psychosis intervention, especially in the prevention of relapse hospitalizations, further research must explore precisely how symptoms of mental illness manifest on the web through changing patterns of language and activity on various platforms as well as how comprehensive, ethical, and effective treatment and engagement strategies should be devised that function seamlessly across patients’ fragmented web-based identities.
Additional information on the feature selection process and robustness checks.
AUROC: area under the receiver operating characteristic curve
SHAP: Shapley Additive Explanations
SSD: schizophrenia spectrum disorder
This research was partly funded by National Institute of Mental Health grant R01MH117172 (principal investigator: MDC; co–principal investigators: MLB and JMK). The research team acknowledges the assistance of Anna Van Meter and Asra Ali in the early phases of patient data collection. The authors also thank members of the Social Dynamics and Wellbeing Lab at Georgia Tech for their valuable feedback during the various phases of the study.
MLB is a consultant for HearMe and Northshore Therapeutics. JMK is a consultant to or receives honoraria from Alkermes, Allergan, Boehringer-Ingelheim, Cerevel, Dainippon Sumitomo, H. Lundbeck, Indivior, Intracellular Therapies, Janssen Pharmaceutical, Johnson & Johnson, LB Pharmaceuticals, Merck, Minerva, Neurocrine, Newron, Novartis, Otsuka, Roche, Saladax, Sunovion, Teva, HLS, and HealthRhythms and is a member of the advisory boards of Cerevel, Click Therapeutics, Teva, Newron, Sumitomo, Otsuka, Lundbeck, and Novartis. He has received grant support from Otsuka, Lundbeck, Sunovion, and Janssen and is a shareholder of Vanguard Research Group; LB Pharmaceuticals, Inc; and North Shore Therapeutics. The other authors have no conflicts of interest to declare.