This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.
In an electronic health context, combining traditional structured clinical assessment methods and routine electronic health–based data capture may be a reliable method to build a dynamic clinical decision-support system (CDSS) for suicide prevention.
The aim of this study was to describe the data mining module of a Web-based CDSS and to identify suicide repetition risk in a sample of suicide attempters.
We analyzed a database of 2802 suicide attempters. Clustering methods were used to identify groups of similar patients, and regression trees were applied to estimate the number of suicide attempts among these patients.
We identified 3 groups of patients using clustering methods. In addition, relevant risk factors explaining the number of suicide attempts were highlighted by regression trees.
Data mining techniques can help to identify different groups of patients at risk of suicide reattempt. The findings of this study can be combined with Web-based and smartphone-based data to improve dynamic decision making for clinicians.
Over 800,000 people die of suicide every year, and it is estimated that for each suicide, there may have been >20 other attempted suicides. A previous attempt is the major predictor of death by suicide [
Decision-support tools help providers in their decision-making process. The use of these tools has been on the rise in recent years owing to their ability to bring evidence-based medicine to the point of care. A clinical decision-support system (CDSS) is a health information system that is integrated into electronic health records (EHR), enabling easy and effective use by physicians [
Thus, there is still an important need to develop a CDSS that helps clinicians choose, for example, the most appropriate treatment, the nature of a psychosocial strategy, or the duration of treatment in suicide prevention strategies. A key feature of such a CDSS is the ability to identify a patient’s risk of a repeated attempt, the number of reattempts, or suicide death within a given period of time. The development of both passive and active collection of patients’ data provides the opportunity to improve clinician knowledge and thus determine risk factors and relevant combinations of risk factors [
This study aims to combine data from EHRs to provide support to decision making for clinicians in suicide prevention. We present the main results of a data mining process on a sample of suicide attempters to first identify groups of similar patients and then identify risk factors associated with the number of suicide attempts. We hypothesize that a data mining process helps to better characterize the population of suicide attempters by identifying the most relevant groups of patients and their associated risk factors for suicide reattempt (or other variables of interest). The ultimate goal is to build a CDSS for clinician decision support and propose a personalized prevention and intervention strategy to each patient.
Suicide attempters aged >18 years were recruited from consecutive admissions to the Emergency Department or specialized Acute Care Unit of three university hospitals (University Hospital Ramon y Cajal, Madrid, Spain; Fundación Jimenez Diaz, Madrid, Spain; and Academic Hospital of Montpellier, Montpellier, France) between 1994 and 2006. Owing to their specific characteristics [
The French or Spanish version of the Mini-International Neuropsychiatric Interview (MINI) [
Suicide risk was assessed using the Suicide Intent Scale [
A robust data-qualification process was performed to ensure data quality and consistency before statistical analyses. Although data were intended to be collected according to the same clinical procedures, quality variations were expected between the hospitals. Missing data were identified, and a variable was retained only when its completion rate was at least 70%. When needed, new variables were created. For instance, the 34 answers of the Barratt Impulsiveness Scale survey, version 10 (BIS10), were processed to build 3 impulsiveness scores: motor impulsivity, attentional impulsivity, and nonplanning impulsivity [
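The completion-rate rule described above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the variable names, and the toy records are all hypothetical, and a missing answer is assumed to be stored as `None`.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]  # assumption: None marks a missing answer


def completion_rates(records: List[Record], variables: List[str]) -> Dict[str, float]:
    """Return, for each variable, the share of patients with a non-missing value."""
    if not records:
        raise ValueError("at least one patient record is required")
    n = len(records)
    return {v: sum(r.get(v) is not None for r in records) / n for v in variables}


def retained_variables(records: List[Record], variables: List[str],
                       threshold: float = 0.70) -> List[str]:
    """Keep the variables whose completion rate is at least `threshold`."""
    rates = completion_rates(records, variables)
    return [v for v in variables if rates[v] >= threshold]


# Invented toy records for illustration only; these are not study data.
patients = [
    {"age": 34, "marital_status": 1, "bis_item_1": 3},
    {"age": 51, "marital_status": None, "bis_item_1": None},
    {"age": 28, "marital_status": 2, "bis_item_1": None},
    {"age": 45, "marital_status": 1, "bis_item_1": None},
]
variables = ["age", "marital_status", "bis_item_1"]

rates = completion_rates(patients, variables)
kept = retained_variables(patients, variables)
print(rates)
print(kept)
```

In this toy example, `bis_item_1` is answered by only 1 of 4 patients and is dropped, whereas the other two variables pass the 70% threshold.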
Unidimensional and two-dimensional analyses were carried out for both quantitative and qualitative variables. In addition, Fisher-Snedecor procedures were used to compare the two subgroups (male vs female) when needed. An unsupervised approach was used to extract homogeneous patterns from the data without any prior hypothesis. The approach is based on a multiple correspondence analysis (MCA) of the qualitative variables to reduce dimensionality. It consists of representing patients in a factorial space where each dimension is a combination of the initial variables. Quantitative variables (eg, age) are not used in the calculations but are projected onto the factorial space. Hierarchical Clustering on Principal Components is then performed from the patients’ representation in the factorial space. Hierarchical clustering has several advantages, including the construction of a hierarchical tree, called a dendrogram, that enables a visual interpretation of the dataset. The dendrogram depicts the emergence of groups of patients who share common risk patterns. In addition, it facilitates discussion between statisticians and practitioners when choosing the optimal number of clusters. Each cluster was then interpreted through the association between the cluster and the list of qualitative and quantitative variables (V test). In the second step, the focus was on the variable of interest: the number of suicide attempts. Recursive partitioning was used as a multivariable procedure that classifies individuals (patients) by successively splitting them into subpopulations. A regression tree was built in which the number of suicide attempts was explained by successive binary tests on predictive variables.
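To make the recursive partitioning step concrete, the following minimal sketch grows a small regression tree by repeatedly choosing the binary test that most reduces the sum of squared errors of the outcome. It is an illustration under stated assumptions rather than the software used in the study: the splitting criterion, the stopping rules (`min_leaf`, `max_depth`), the variable names, and the toy data are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Row = Dict[str, float]


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows: List[Row], target: str, predictors: List[str],
               min_leaf: int) -> Optional[Tuple[str, float]]:
    """Return the (predictor, threshold) with the largest SSE reduction, if any."""
    parent = sse([r[target] for r in rows])
    best, best_gain = None, 0.0
    for p in predictors:
        for t in sorted({r[p] for r in rows})[:-1]:  # candidate cut points
            left = [r[target] for r in rows if r[p] <= t]
            right = [r[target] for r in rows if r[p] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # both subpopulations must be large enough
            gain = parent - sse(left) - sse(right)
            if gain > best_gain + 1e-12:
                best, best_gain = (p, t), gain
    return best


def grow(rows: List[Row], target: str, predictors: List[str],
         min_leaf: int = 2, max_depth: int = 3, depth: int = 0) -> dict:
    """Recursively split the rows; a leaf stores the mean outcome of its rows."""
    split = best_split(rows, target, predictors, min_leaf) if depth < max_depth else None
    if split is None:
        return {"n": len(rows), "mean": mean(r[target] for r in rows)}
    p, t = split
    low = [r for r in rows if r[p] <= t]
    high = [r for r in rows if r[p] > t]
    return {
        "split": (p, t),  # binary test: is predictor p <= t?
        "low": grow(low, target, predictors, min_leaf, max_depth, depth + 1),
        "high": grow(high, target, predictors, min_leaf, max_depth, depth + 1),
    }


# Invented toy data for illustration only; these are not study results.
rows = [
    {"alcohol_misuse": 0, "bis_motor": 20, "attempts": 1},
    {"alcohol_misuse": 0, "bis_motor": 22, "attempts": 1},
    {"alcohol_misuse": 0, "bis_motor": 30, "attempts": 2},
    {"alcohol_misuse": 1, "bis_motor": 24, "attempts": 3},
    {"alcohol_misuse": 1, "bis_motor": 31, "attempts": 4},
    {"alcohol_misuse": 1, "bis_motor": 33, "attempts": 5},
]
tree = grow(rows, "attempts", ["alcohol_misuse", "bis_motor"])
print(tree)
```

Each internal node holds one binary test on a predictor, and each leaf reports the mean number of attempts in the corresponding subpopulation, which is the kind of output a regression tree on a count outcome provides.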
From the original database, the first step relied on data qualification. Several redundancies (eg, duplicated surveys or alternative coding) were observed among the 263 initial variables. Subsequently, a completion threshold was applied to the resulting variables, and only 23 variables satisfied the 70% minimum completion rate. Three additional variables related to the types of impulsivity (as described above) were added. Of the 2802 initial patients, we kept only suicide attempters with a 100% completion rate for these 26 variables. In the final filtering, 5 variables were discarded because they were redundant or not useful for the analysis (the type of patients, assessment date, source, and year and day of birth). This rigorous process ensured high data quality for both patients and variables; it also provided a final dataset of 681 patients and 21 variables. Participants were predominantly young (mean age 40.1 years), female, employed, and married. Most patients included in the final analysis had a history of mental disorders, including major depression (482/681, 70.8%), bipolar disorder (160/681, 23.5%), dysthymic disorder (30/681, 4.4%), obsessive-compulsive disorder (58/681, 8.5%), and alcohol misuse (178/681, 26.1%) (
Clinicosociological main features of the postfiltering dataset of 681 suicide attempters.
Features | Value
Female | 493 (72.4)
Male | 188 (27.6)
Single | 239 (35.1)
Married | 240 (35.2)
Separated or divorced | 181 (26.6)
Widowed | 21 (3.1)
No | 272 (39.9)
Yes | 409 (60.1)
Low | 31 (4.6)
Intermediate | 368 (54.0)
High | 282 (41.4)
Employed | 451 (66.2)
Unemployed | 110 (16.2)
Incapacity | 41 (6.0)
Retired | 79 (11.6)
Quantitative variable, age (years), median (Q1-Q3) | 40.6 (28-49.6)
No | 6 (0.9)
Yes | 675 (99.1)
No | 424 (62.3)
Yes | 257 (37.7)
No | 199 (29.2)
Yes | 482 (70.8)
No | 521 (76.5)
Yes | 160 (23.5)
No | 651 (95.6)
Yes | 30 (4.4)
No | 623 (91.5)
Yes | 58 (8.5)
No | 571 (83.8)
Yes | 110 (16.2)
No | 465 (68.3)
Yes | 216 (31.7)
No | 586 (86.0)
Yes | 95 (14.0)
No | 503 (73.9)
Yes | 178 (26.1)
Number of suicide attempts, median (Q1-Q3) | 2 (1-3)
Motor impulsivity | 26 (22-30)
Attentional impulsivity | 27 (23-30)
Nonplanning impulsivity | 28 (24-31)
The first step of the analysis was to perform an MCA to reduce the dimension, followed by a hierarchical clustering from the principal components to highlight groups of homogeneous patients. The tree structure (in terms of inertia gain) and a discussion between statisticians and practitioners allowed us to study patients partitioned into three clusters.
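As a rough illustration of clustering guided by inertia gain, the sketch below applies agglomerative clustering with Ward's criterion to patient coordinates on factorial axes. Ward's criterion is an assumption here (the text only mentions inertia gain), the coordinates are invented, and the MCA step that would produce them is not shown.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def ward_cost(ca: Point, na: int, cb: Point, nb: int) -> float:
    """Increase in within-cluster inertia caused by merging two clusters."""
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * d2


def ward_clustering(points: List[Point],
                    n_clusters: int) -> Tuple[List[List[int]], List[float]]:
    """Merge the cheapest pair of clusters until `n_clusters` remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [(tuple(p), [i]) for i, p in enumerate(points)]  # (centroid, members)
    merge_costs: List[float] = []
    while len(clusters) > n_clusters:
        def cost(pair: Tuple[int, int]) -> float:
            (ca, ma), (cb, mb) = clusters[pair[0]], clusters[pair[1]]
            return ward_cost(ca, len(ma), cb, len(mb))

        i, j = min(combinations(range(len(clusters)), 2), key=cost)
        merge_costs.append(cost((i, j)))
        (ca, ma), (cb, mb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((centroid, ma + mb))
    return [sorted(members) for _, members in clusters], merge_costs


# Invented coordinates on two factorial axes, for illustration only.
coords = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # first group
    (5.0, 5.0), (5.1, 5.0),               # second group
    (10.0, 0.0), (10.0, 0.2),             # third group
]
groups, costs = ward_clustering(coords, n_clusters=3)
print(sorted(groups))
print(costs)
```

The recorded merge costs play the role of the inertia gain: a sharp jump in cost at the next merge suggests stopping, which is one input to the discussion between statisticians and practitioners about the number of clusters.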
We conducted an in-depth analysis of the 3 groups for data interpretation. Statistical association tests (V tests) enabled identification of over- or underrepresented modalities in the three clusters. Cluster 1 was mainly related to an average patient profile of women (positive association V test,
Hierarchical clustering (left) and multiple correspondence analysis factor map (right) with three projected clusters.
The second step of the analysis aimed to identify factors associated with a higher risk of suicide attempts (variable “number of suicide attempts”) for men and women separately, following the principal outcome.
While analyzing both groups, the first conclusion is a clear difference between genders. For instance, eating disorders are linked to a higher number of suicide attempts for women (mean 2.9 in women vs 2.3 in men,
The decision tree on the variable “number of suicide attempts” according to gender “male”. BIS: Barratt Impulsiveness Scale.
The decision tree on the variable “number of suicide attempts” according to gender “female”. BIS: Barratt Impulsiveness Scale.
A systematic assessment before hospital discharge allowed us to build a large database suitable for modern data mining techniques. In this study, we identified clusters of suicide attempters and variables that may explain the repetition of suicide attempts. This study shows how a simple structuring of the assessment of patients discharged after a suicide attempt may provide relevant data for clustering methods. The clustering may help clinicians allocate a patient to a risk cluster and is therefore the first step of the CDSS design. This model may lead to a stratified approach in decision making for suicide prevention. Furthermore, analyzing larger datasets could allow the discovery of new risk factors that are not currently considered relevant during clinical interviews. However, we did not propose a model for suicide prediction; rather, mining large databases in this way is a prerequisite for better decision making in suicide prevention. This approach could also be applied to other data sources such as personal health records or ecological momentary assessment (EMA).
Our findings are in line with recent studies showing how suicide risk assessment could lead to patient clustering from a preventative perspective [
In this study, participants were assessed by trained clinicians before discharge from the emergency department. Data were captured using paper-based forms of the actual MeMind Web-based EHR [
In this study, patients were recruited after a suicide attempt. We postulate that the development of a CDSS would be more relevant in a population of suicide attempters. Suicide attempters are also defined as an “indicated population” [
This study illustrates the need for large, high-quality databases for extracting significant patient profiles or risk factors. Starting from an initial set of 2802 patients with 263 variables, the data-qualification process resulted in a final sample of 681 patients with 21 complete variables. Although this volume of data already ensures statistical significance, it underlines the importance of better ways to standardize data collection in participating institutions. The quality of a CDSS strongly depends on its input data. Moreover, a critical challenge may be clinician acceptance of such tools, which directly affects the completion rate of the EHR [
Guidelines recommend that all patients presenting to the hospital services with self-harm should receive a psychological assessment before discharge, to determine the risk of further reattempt [
Most clinicians use EHRs daily in emergency services and psychiatric units. However, few institutions have taken advantage of the opportunities that recent technological advances offer for risk assessment. Combining electronic health–based assessment with data mining techniques represents an opportunity to foster suicide-prevention research. This new paradigm is useful in providing personalized intervention strategies by itself, but it also affords the opportunity to identify novel mechanisms to be targeted in suicide-prevention strategies. In addition, we believe that computational models can provide data-assisted ideas emerging from these repositories and will have special appeal for empirically minded clinicians [
Although studies have highlighted the value of self-reports in clinical assessment, they are rarely routinely implemented [
The next step is to take advantage of new technologies and current developments of Web-based mobile apps to design the next-generation dynamic CDSS (
The decision-support system based on ecological momentary assessment and data mining.
CDSS: clinical decision-support system
EHR: electronic health records
EMA: ecological momentary assessment
MCA: multiple correspondence analysis
mHealth: mobile health
MINI: Mini-International Neuropsychiatric Interview
Authors acknowledge the French
None declared.