This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.

A majority of adults in the United States are exposed to a potentially traumatic event, but only a minority go on to develop impairing mental health conditions such as posttraumatic stress disorder (PTSD).

Identifying those at elevated risk shortly after trauma exposure is a clinical challenge. The aim of this study was to develop computational methods to more effectively identify at-risk patients and, thereby, support better early interventions.

We proposed machine learning (ML) induction of models to automatically predict elevated PTSD symptoms in patients 1 month after a trauma, using self-reported symptoms from data collected via smartphones.

We show that an ensemble model accurately predicts elevated PTSD symptoms, with an area under the curve (AUC) of .85, using a bag of support vector machines, naive Bayes, logistic regression, and random forest algorithms. Furthermore, we show that only 7 self-reported items (features) are needed to obtain this AUC. Most importantly, we show that accurate predictions can be made 10 to 20 days posttrauma.

These results suggest that simple smartphone-based patient surveys, coupled with automated analysis using ML-trained models, can identify those at risk for developing elevated PTSD symptoms and thus target them for early intervention.

Posttraumatic stress disorder (PTSD) is a psychiatric condition that leads to significant disability and impairment [

A diagnosis of PTSD requires symptoms to be present for at least 30 days. Previous studies suggest that symptoms first appear in the days and weeks after a traumatic event and gradually increase over time [

In this paper, we take the initial steps toward such a predictive model. We investigated whether correlations exist between PTSD symptoms present shortly after trauma and at 1 month after an event and whether these correlations can be discovered by supervised

The study presented here uses data collected during a clinical study involving 90 individuals who experienced a criterion A traumatic event and who were recruited from the critical care service of a level-1 trauma center in Northern New England [

In this study, we took a comparative approach to investigating not only whether predictive ML-induced models may exist but also which are the best approaches to model the induction. The development of PTSD symptoms among those who go on to have severe PTSD symptoms follows a complex course, which may be nonlinear [

Specifically, this study considered 4 research hypotheses:

Hypothesis 1: ML can demonstrate significant statistical correlations between observable symptoms and elevated PTSD 1 month after trauma.

Hypothesis 2: ML can identify the relevance of early symptoms used to predict PTSD by care providers.

Hypothesis 3: ML can identify the number of days needed to predict elevated PTSD 1 month after trauma.

Hypothesis 4: ML-induced models can be used to predict elevated PTSD 1 month after a trauma, given that symptoms are displayed between 10 and 20 days posttrauma.

In this section, we describe the dataset we used for our study, which was collected during a clinical study involving 90 individuals who experienced a criterion A traumatic event and were recruited from the critical care service of a level-1 trauma center in Northern New England [

To recruit participants in the cited study [

Participants (N=90) had a mean age of 35 (SD 10.41) years, and most were male (n=57); 36 had completed college. The sample was predominantly white (n=80). The most common type of injury was motor vehicle accident (n=45). Cell phone ownership included 52 iPhones and 35 Android devices. In addition, 3 participants reported owning another type of device but had access to an Android or iPhone device. PTSD symptoms were assessed with the PTSD checklist-5 at 1-month posttrauma [ Based on the Diagnostic and Statistical Manual of Mental Disorders, 5^{th} Edition (DSM-5) criteria, the PTSD checklist for DSM-5 (PCL-5) is a 20-item self-reported measure that assesses PTSD symptoms experienced over the last month. Items assess symptoms across 4 symptom clusters of PTSD (re-experiencing, negative mood, avoidance, and hyperarousal) on a 0- to 4-point Likert scale. Total scores range from 0 to 80. A score of 33 or higher is associated with a likely diagnosis of PTSD [

Each mobile assessment consisted of 10 items. These included the 8 items (items 1, 4, 6, 7, 9, 12, and 18) of the abbreviated PCL-5 [

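Data, in the form of 11 main features, were collected from each patient, namely, days since trauma, Reexp_{1}, Reexp_{2}, Avoid_{1}, Avoid_{2}, NACM_{1}, NACM_{2}, AAR_{1}, AAR_{2}, sleep difficulty, and self-reported pain. The binary target label, PTSD_{33}, was derived from the 1-month PCL-5 total score based on the following conditions:

PTSD_{33}=1, when the PCL-5 total score is 33 or higher;

PTSD_{33}=0, otherwise.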

The experiments were conducted using a score of 33, which corresponds to a clinical cutoff for likely PTSD [

Although the features shown in

After determining the relevant features and target binary classification labels (PTSD or no PTSD at threshold value 33), the resulting data still contained a nontrivial amount of missing data. Specific patient response instances with missing values were not removed as they could potentially retain relevant information. Instead, missing values were replaced with an average calculated from the associated patient’s previous entries.
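The imputation rule described above can be sketched as follows. This is a minimal illustration rather than the study's code: the function name `impute_with_previous_mean`, the use of `None` to mark a missing response, the choice to average only over observed (not previously imputed) entries, and the choice to leave a value missing when the patient has no earlier observed entry are all assumptions.

```python
from typing import List, Optional


def impute_with_previous_mean(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing entry with the mean of the patient's earlier observed entries.

    `values` holds one patient's responses to a single item, in chronological order.
    """
    filled: List[Optional[float]] = []
    observed_sum = 0.0
    observed_count = 0
    for value in values:
        if value is None:
            # With no earlier observation the gap is kept (assumption, see lead-in).
            filled.append(observed_sum / observed_count if observed_count else None)
        else:
            filled.append(value)
            observed_sum += value
            observed_count += 1
    return filled


# Hypothetical responses of one patient to one 0-4 item across five surveys.
responses = [None, 2, None, 4, None]
imputed = impute_with_previous_mean(responses)
print(imputed)
```

In this toy sequence the third entry is filled with the single earlier observation and the last entry with the mean of the two observed values.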

Standardization or normalization of features is a common preprocessing step in ML, producing features centered around a zero mean with unit variance. Feature standardization is a requirement for gradient descent–based ML algorithms (such as SVMs and logistic regression) for faster convergence and better performance. The general method for calculating standardized features is:

where for a given feature
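As an illustration of the standardization step, the sketch below assumes the usual z-score definition, $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the feature over the samples. The function name `standardize`, the use of the population standard deviation, and the toy pain scores are assumptions made for the example.

```python
from statistics import fmean, pstdev
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Return z-scores: subtract the mean, then divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical self-reported pain scores (0-10 scale).
pain_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(pain_scores)
print(z_scores)
```

After this transformation the feature has zero mean and unit variance, which puts items measured on a 0-4 scale and on a 0-10 scale on a comparable footing.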

Dataset table (higher values signify more severe pathology).

Attribute (feature) | Description | Nonnull value | Range |

Posttraumatic stress disorder symptoms 1-month posttrauma | 975 | 0-80 | |

Days since trauma occurred | 1144 | 1-49 | |

Reexp_{1} | Distress related to trauma-related intrusive thoughts | 651 | 0-4 |

Reexp_{2} | Emotional reactivity to trauma cues | 649 | 0-4 |

Avoid_{1} | Avoidance of thoughts about trauma | 650 | 0-4 |

Avoid_{2} | Avoidance of environmental trauma-related reminders | 651 | 0-4 |

NACM_{1} | Negative beliefs about self and the world | 651 | 0-4 |

NACM_{2} | Loss of interest in activities | 649 | 0-4 |

AAR_{1} | Exaggerated startle reaction | 650 | 0-4 |

AAR_{2} | Difficulty in concentrating | 650 | 0-4 |

Sleep difficulty | 650 | 0-4 | |

Self-reported pain | 646 | 0-10 |

(a) Missing feature value distribution across all data input vectors. Yellow signifies missing values. (b) Missing feature percentage. Reexp_{1}: distress related to trauma-related intrusive thoughts; Reexp_{2}: emotional reactivity to trauma cues; Avoid_{1}: avoidance of thoughts about trauma; Avoid_{2}: avoidance of environmental trauma-related reminders; NACM_{1}: negative beliefs about self and the world; NACM_{2}: loss of interest in activities; AAR_{1}: exaggerated startle reaction; AAR_{2}: difficulty in concentrating;

To identify both feature relevance and potential duplication of information among the early symptoms used to predict PTSD, we measured the correlation between all pairs of features. The correlation between 2 features was calculated as the average, over all samples, of the products of their standardized values. This is the Pearson correlation coefficient, that is, the covariance of the standardized features, and it summarizes the strength of the linear relationship between the 2 features.
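The pairwise correlation computation can be sketched as below. The code follows the description literally (standardize each feature, multiply the standardized values sample by sample, and average the products). The function names and the toy feature names `item_a`, `item_b`, and `item_c` are hypothetical, and the sketch assumes complete data with nonconstant features.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Average of the products of the standardized values of x and y."""
    mean_x, mean_y = fmean(x), fmean(y)
    std_x, std_y = pstdev(x), pstdev(y)  # both must be nonzero (nonconstant features)
    products = [((a - mean_x) / std_x) * ((b - mean_y) / std_y) for a, b in zip(x, y)]
    return fmean(products)


def correlation_pairs(features: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of features."""
    names = list(features)
    return {
        (names[i], names[j]): pearson(features[names[i]], features[names[j]])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }


toy = {
    "item_a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "item_b": [0.0, 2.0, 4.0, 6.0, 8.0],  # exactly 2 * item_a
    "item_c": [4.0, 3.0, 2.0, 1.0, 0.0],  # exactly 4 - item_a
}
pairs = correlation_pairs(toy)
for pair, value in pairs.items():
    print(pair, round(value, 3))
```

Pairs whose absolute correlation is close to 1 carry largely redundant information, which is the rationale for keeping only one feature from such a group.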

In general, removing correlated features does not always improve model performance, but it can simplify data preparation for ML algorithms. More importantly, this process can reduce the number of symptoms needed to predict PTSD. Reexp_{1}, Reexp_{2}, Avoid_{1}, and Avoid_{2} were highly correlated. Thus, Reexp_{2} was retained, whereas Reexp_{1}, Avoid_{1}, and Avoid_{2} were removed from our input feature set. Later, in the Results section, we discuss in detail the effect of this feature selection on model performance.

Correlation between features in our dataset, prior to feature selection. Reexp_{1}: distress related to trauma-related intrusive thoughts; Reexp_{2}: emotional reactivity to trauma cues; Avoid_{1}: avoidance of thoughts about trauma; Avoid_{2}: avoidance of environmental trauma-related reminders; NACM_{1}: negative beliefs about self and the world; NACM_{2}: loss of interest in activities; AAR_{1}: exaggerated startle reaction; AAR_{2}: difficulty in concentrating;

Feature importance methods score each feature by providing a quantitative measurement surrounding its relevance. The RF algorithm is capable of providing an importance score for each feature. RF can score the relevance of each feature through either statistical permutation tests or the Gini impurity index, which is used in this study, as shown in

where _{i}

To calculate feature importance, we sum the Gini impurity index values for each feature in the dataset over RF trees. These sums are then normalized and ranked to indicate the feature importance index. For more details on the Gini variable importance approach, see the study by Garcia-Lorenzo et al [
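A small sketch of the quantities involved is given below. It assumes the standard definition of Gini impurity, $G = 1 - \sum_{i} p_{i}^{2}$, with $p_{i}$ the fraction of samples in class $i$ at a node, and the common mean-decrease-in-impurity formulation in which each split credits its feature with the weighted drop in impurity. The function names and the raw importance sums are hypothetical; the study's exact computation may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum_i p_i^2, where p_i is the fraction of samples in class i."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def impurity_decrease(parent: Sequence[int], left: Sequence[int], right: Sequence[int]) -> float:
    """Weighted drop in Gini impurity obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children


def rank_importances(raw_sums: Dict[str, float]) -> List[Tuple[str, float]]:
    """Normalize per-feature sums so they add up to 1, then sort from most to least important."""
    total = sum(raw_sums.values())
    normalized = {name: value / total for name, value in raw_sums.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


# A perfectly separating split of a balanced node removes all impurity.
decrease = impurity_decrease([1, 1, 0, 0], left=[1, 1], right=[0, 0])

# Hypothetical per-feature sums of impurity decreases accumulated over all trees.
ranking = rank_importances({"feature_x": 1.0, "feature_y": 3.0})
print(decrease, ranking)
```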

Features with smaller importance values can be removed from the dataset, thus further reducing the number of relevant early symptoms to be used for PTSD prediction. Three features, including AAR_{2} and Avoid_{2}, are less important than the others. Furthermore, although the

In the Results section of this study, we discuss the effect of removing the 3 features that are low in importance, as well as Reexp_{1}, Avoid_{1}, and Avoid_{2}, which are highly correlated with Reexp_{2}, as discussed above. Notice that 2 of these features are both low in importance and highly correlated with other features.

Ranked feature importance determined using the Gini method. Reexp_{1}: distress related to trauma-related intrusive thoughts; Reexp_{2}: emotional reactivity to trauma cues; Avoid_{1}: avoidance of thoughts about trauma; Avoid_{2}: avoidance of environmental trauma-related reminders; NACM_{1}: negative beliefs about self and the world; NACM_{2}: loss of interest in activities; AAR_{1}: exaggerated startle reaction; AAR_{2}: difficulty in concentrating;

In this paper, we studied multiple classifiers—logistic regression, naive Bayes, SVM, and RFs—to classify PTSD versus non-PTSD cases. In addition, we proposed

Multiple binary classifiers were chosen for use in this study because of their established predictive power.

We applied logistic regression because it is widely used for binary classification problems. We built a linear classifier without performing any nonlinear transformation on the features. For more information about logistic regression classifiers, refer to the study by Held [

The naive Bayes classifier is simple, fast, and reliable and is derived from the Bayes theorem. The naive Bayes classifier assumes independent features with conditional independence, making the computation simpler (hence,

SVMs are known for their generalization power; the SVM kernel trick implicitly applies a nonlinear transformation to the input features. In this study, we used linear, Gaussian Radial basis function (RBF), and polynomial kernels. We expected the linear SVM to perform similarly to the logistic regression classifier. For more information about the SVM algorithm, refer to the study by Burges [

RFs are an ensemble learning approach made up of multiple small decision trees, which are trained on a subset of data and features at each node split. In this study, we used RFs because of their predictive power and ability to work despite missing data (in light of missing data in our dataset). We did

Ensemble methods work by combining several weaker classifier predictions, thus improving overall robustness. In this study, we ensembled the single classifiers: SVM (linear, Gaussian, and polynomial kernels), logistic regression, naive Bayes, and RF algorithms. We investigated 2 main techniques.

In the case of hard voting, the final predicted class is taken to be the majority class label, as predicted by each individual classifier.

In the case of soft voting, the class label is calculated by summing the predicted probabilities across each class label and classifier and subsequently selecting the class with the highest probability. For this ensemble method, we used a uniform weight distribution.
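The difference between the 2 voting schemes can be illustrated with a short sketch. The function names and the toy probabilities are hypothetical; the soft vote uses uniform weights, as stated above, and the hard vote resolves ties in favor of the label seen first, which is an assumption of this example.

```python
from collections import Counter
from typing import Sequence


def hard_vote(predicted_labels: Sequence[int]) -> int:
    """Majority class label among the individual classifiers' predictions."""
    counts = Counter(predicted_labels)
    # On a tie, most_common returns the label that was encountered first.
    return counts.most_common(1)[0][0]


def soft_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Class with the largest summed probability; probabilities[k][c] comes from classifier k."""
    n_classes = len(probabilities[0])
    summed = [sum(p[c] for p in probabilities) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: summed[c])


# Three hypothetical classifiers; columns are P(no PTSD) and P(PTSD).
probs = [
    [0.55, 0.45],
    [0.55, 0.45],
    [0.10, 0.90],
]
labels = [max(range(2), key=lambda c: p[c]) for p in probs]  # each classifier's own label

print(hard_vote(labels))  # two weakly confident classifiers outvote the third
print(soft_vote(probs))   # the one confident classifier dominates the summed probabilities
```

The example is constructed so that the 2 schemes disagree: hard voting counts labels only, whereas soft voting also takes each classifier's confidence into account.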

To study the effect of time posttrauma on prediction performance, we trained the proposed classifiers on several different cutoff days. Specifically, we evaluated our models on data over 7, 10, 15, 20, 25, 30, 35, and all days posttrauma. Through comparative analysis, our aim was to determine how long surveys need to be performed to accurately predict elevated PTSD symptoms 1-month posttrauma.
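The cutoff-day experiment amounts to restricting the survey records before training and evaluation. A minimal sketch of that restriction is shown below. The record layout, the key name `days_since_trauma`, and the reading of each cutoff as "records collected on or before that day" are assumptions.

```python
from typing import Dict, List, Optional

# Cutoffs listed in the text; None stands for "all days posttrauma".
CUTOFF_DAYS: List[Optional[int]] = [7, 10, 15, 20, 25, 30, 35, None]


def records_up_to(records: List[Dict], cutoff: Optional[int]) -> List[Dict]:
    """Keep only the survey records collected on or before the cutoff day."""
    if cutoff is None:
        return list(records)
    return [r for r in records if r["days_since_trauma"] <= cutoff]


# Hypothetical survey records for a single patient.
toy_records = [{"patient": "p1", "days_since_trauma": d} for d in (1, 5, 9, 14, 22, 40)]

subset_sizes = {cutoff: len(records_up_to(toy_records, cutoff)) for cutoff in CUTOFF_DAYS}
print(subset_sizes)
```

Each restricted subset would then be passed through the same training and evaluation procedure, so that performance can be compared across cutoffs.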

We also studied the effect of reducing the number of indicators (features) based on the feature correlation and feature importance methods discussed above. On the basis of these, we modeled without the features AAR2, Reexp1, Avoid1, and Avoid2, as they are highly correlated with other features or low in importance.

Standard scoring metrics for ML models include accuracy (or error rate), true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), recall-precision curves, and receiver operating characteristics (ROC) curves. These metrics provide a simple and effective way to measure the performance of a classifier [

These scoring methods have been evaluated using 2 main methods: the holdout method [

For the implementation of the ML algorithms, our dataset was partitioned randomly into 70.0% and 30.0% for training and testing, respectively. The training set is used to train the models and to find the model hyperparameters, whereas the testing set is used to evaluate the model performance and its ability to generalize to new unseen data. The hyperparameters used for all the classifiers were manually assigned, and then hyperparameter tuning was performed using random search, as described in the study by Bergstra and Bengio [
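A random 70/30 holdout partition can be sketched as follows. The function name `holdout_split`, the fixed seed, and splitting at the level of individual records are assumptions of this example; the study may have partitioned the data differently.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    samples: Sequence[T], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly partition samples into disjoint training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_indices]
    test = [s for i, s in enumerate(samples) if i in test_indices]
    return train, test


# 100 hypothetical records identified by their index.
train_set, test_set = holdout_split(list(range(100)))
print(len(train_set), len(test_set))
```

Hyperparameters would be selected using only the training portion, so that the testing portion remains unseen until the final evaluation.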

Accuracy is a common metric to evaluate the performance of ML algorithms. It gives the ratio of correct predictions over the total number of predictions. In the case of imbalanced datasets, classification accuracy alone is insufficient to determine if the model is robust. For example, in a notable degenerate case, a model can predict only the majority class label and still achieve high classification accuracy.
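The degenerate case can be made concrete with a toy example. The labels below are invented for illustration (9 negatives and 1 positive) and do not reflect the class balance of the study data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0] * 9 + [1]   # imbalanced toy labels: a single positive case
y_pred = [0] * 10        # a "model" that always predicts the majority class

acc = accuracy(y_true, y_pred)
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
print(acc, positives_found)
```

The classifier reaches 90% accuracy while identifying none of the positive cases, which is why accuracy is complemented by class-aware metrics.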

Models trained using a holdout technique might overfit or underfit depending on how the data happen to be split. To overcome this issue, K-fold cross-validation was performed on the dataset. This technique divides the data into K disjoint subsets (folds) of approximately equal size. The model being evaluated is then trained on all folds except one, which is reserved for testing. This process is performed K times, so that each fold is used for testing exactly once. Finally, the results from the K testing folds are averaged and returned as the final result. In this study, we used 10 folds; each fold was used once for testing and 9 times for training. This 10-fold cross-validation reduces the variance in the results by averaging over 10 different partitions, providing a more reliable performance estimate than the holdout method.
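The fold construction can be sketched as below. The function name `k_fold_indices` is hypothetical, and the sketch assigns contiguous index ranges to folds without shuffling, which is a simplification; in practice the indices are usually shuffled first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=25, k=10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

A model would be fitted on each `train` index set and scored on the matching `test` set, and the 10 scores would then be averaged.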

Confusion matrices offer a comprehensive evaluation of the quality of an ML algorithm. In contrast to the singular dependence on 1 number from the accuracy metric, a confusion matrix provides a method of evaluating performance across all of the classes. For binary classification, the confusion matrix is simplified to 2 classes as follows:

where TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, and TN is the number of true negatives. TP and TN count the correctly predicted labels, whereas FP and FN count those mislabeled by the classifier. The higher the TP and TN counts relative to FP and FN, the better, as this indicates more correct predictions.
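The 4 counts, and the rates derived from them, can be computed as in the following sketch. The function names and the toy labels are hypothetical; label 1 denotes the positive class (elevated PTSD symptoms) and label 0 the negative class, and both classes are assumed to be present.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """True/false positive and negative rates; requires both classes to be present."""
    return {
        "TPR": tp / (tp + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = rates(*counts)
print(counts, metrics)
```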

The ROC curve is a simple graphical representation and powerful methodology to evaluate binary classifiers. It has become a popular method because of its ability to evaluate overall performance [

The ROC space is plotted with TPR on the y-axis and FPR on the x-axis, where TPR=TP/(TP+FN) and FPR=FP/(FP+TN). Each point (FPR, TPR) represents the classifier at a different threshold applied to the predicted class probability [

Independent of class distribution and error costs, the ROC curve connects the points in the ROC space. ROC curves describe the predictive performance and characteristics of a classifier at different probability levels. The area under the ROC curve, denoted as area under the curve (AUC), can be used to rank or compare the performance of classifiers [
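How the ROC points and the AUC arise from predicted probabilities can be sketched as follows. The function names and the 4 toy scores are hypothetical. The sketch sweeps the threshold over the distinct scores, integrates with the trapezoidal rule, and assumes that both classes are present.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) points obtained by lowering the threshold over the distinct scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class
points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

An AUC of 1 corresponds to a perfect ranking of positive cases above negative cases, and 0.5 corresponds to chance-level ranking.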

For RFs, we used the Gini [

Logistic regression, Lambda=0.02380

SVM-linear kernel, C=62

SVM-RBF kernel, C=57, Sigma=0.004

SVM-polynomial kernel, C=44, Sigma=0.017, degree=3

Where C and Lambda are the regularization terms and Sigma is the Gaussian kernel parameter. For ensemble methods, we investigated 2 main techniques: hard voting (majority voting) and weighted average probabilities (soft voting). For soft voting, we equally weighted the predicted probabilities from each classifier. We ensembled all the classifiers, that is, logistic regression, naive Bayes, SVM with linear kernel, SVM with Gaussian kernel, SVM-polynomial kernel, and RF.

We removed the AAR_{2}, Reexp_{1}, Avoid_{1}, and Avoid_{2} features because of their high correlation and low predictive power, as discussed in the Feature Correlation and Feature Importance subsections.

Accuracy results.

Machine learning method | Train-test accuracy | Cross-validation accuracy |

Logistic regression | .8735236 | .82110961 |

Naive Bayes | .8711414 | .82210961 |

SVM^{a}-linear kernel | .8349277 | .76406263 |

SVM-Gaussian kernel | .8632561 | .81908724 |

SVM-polynomial kernel | .8682245 | .81857010 |

Random forest | .8212457 | .77888143 |

Voting classifier-soft | .8798578 | .82045190 |

Voting classifier-hard | .85919181 | .80702013 |

^{a}SVM: support vector machine.

Receiver operating characteristics graphs for single and ensemble models. SVM: support vector machine; RBF: Radial basis function.

Area under the curve results.

Machine learning model | Receiver operating characteristics area under the curve |

Logistic regression | .8325350 |

Naive Bayes | .8422145 |

SVM^{a}-linear kernel | .8179543 |

SVM-Gaussian kernel | .8465576 |

SVM-polynomial kernel | .8337800 |

Random forest | .7844874 |

Voting classifier-soft | .8559346 |

Voting classifier-hard | .8357976 |

^{a}SVM: support vector machine.

Receiver operating characteristics graphs for single and ensemble models with difficulty in concentrating, distress related to trauma-related intrusive thoughts, avoidance of thoughts about trauma, and avoidance of environmental trauma-related reminders features eliminated. SVM: support vector machine; RBF: Radial basis function.

Receiver operating characteristics area under the curve for reduced features models.

Machine learning model | Receiver operating characteristics area under the curve |

Logistic regression | .87685409 |

Naive Bayes | .88154251 |

SVM^{a}-linear kernel | .87553263 |

SVM-Radial basis function kernel | .88182758 |

SVM-polynomial kernel | .88158036 |

Random forest | .85092592 |

Voting classifier-soft | .88920514 |

Voting classifier-hard | .88145900 |

^{a}SVM: support vector machine.

Area under the curve for the system trained and evaluated on different numbers of days. ROC: receiver operating characteristics.

As discussed in the Results section,

In addition, as shown in the reduced-feature results, including AAR_{2}, Reexp_{1}, Avoid_{1}, and Avoid_{2} did not improve the performance, indicating that these features might be considered noise and could be eliminated from the study. This reduction eliminated symptoms from the avoidance cluster of PTSD. Although these results suggest that the removal of these symptoms did not impact prediction, replication is needed before firm conclusions can be drawn about the role these symptoms play in PTSD prediction. A shorter survey without these items reduces the burden of each assessment and is likely to increase survey compliance, which will provide a more accurate assessment of recovery.

Finally, as a key result,

In summary, our results shed light on our research hypotheses stated in the Introduction section, as follows.

An ML-induced ensemble model is able to demonstrate significant statistical correlations between observable symptoms and elevated PTSD 1 month after trauma with an AUC of 0.85, as shown in

As detailed in the Results section, under the

In the Results section, under the

In the Results section, under the

Our experimental results are promising in that they suggest the potential for using a combination of self-reported symptoms and ML-induced models to automatically predict elevated PTSD symptoms as early as 10 to 20 days posttrauma, in a manner that supports earlier interventions by care providers. These results were obtained using only data collected with a mobile device, suggesting that this method of symptom tracking is widely disseminable. Furthermore, our results suggest that smartphone surveys for self-reporting symptoms can be simplified more than previously understood.

We also explored various techniques for building predictive models. Although nonlinear learners did not outperform linear learners, an ensemble method with nonlinear models performed marginally better than single-linear models and will form the basis of our ongoing work in this area. In future studies, we plan to explore the application of these tools in a real clinical setting as a means to provide better care for at-risk patients. The prediction algorithm might also be improved if additional data were incorporated, such as baseline PTSD symptoms, demographic variables, and trauma histories, which is also an interesting topic for future studies.

AUC: area under the curve

FPR: false positive rate

ML: machine learning

PTSD: posttraumatic stress disorder

RBF: Radial basis function

RF: random forest

ROC: receiver operating characteristics

SVM: support vector machine

TPR: true positive rate

None declared.