This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.
Each year, approximately 800,000 people die by suicide worldwide, accounting for 1–2 in every 100 deaths. It is always a tragic event with a huge impact on family, friends, the community and health professionals. Unfortunately, suicide prevention and the development of risk assessment tools have been hindered by the complexity of the underlying mechanisms and the dynamic nature of a person’s motivation and intent. Many of those who die by suicide had contact with health services in the preceding year, but identifying those most at risk remains a challenge.
To explore the feasibility of using artificial neural networks with routinely collected electronic health records to support the identification of those at high risk of suicide when in contact with health services.
Using the Secure Anonymised Information Linkage Databank UK, we extracted the data of those who died by suicide between 2001 and 2015 and paired controls. Looking at primary (general practice) and secondary (hospital admissions) electronic health records, we built a binary feature vector coding the presence of risk factors at different times prior to death. Risk factors included: general practice contact and hospital admission; diagnosis of mental health issues; injury and poisoning; substance misuse; maltreatment; sleep disorders; and the prescription of opiates and psychotropics. Basic artificial neural networks were trained to differentiate between the suicide cases and paired controls. We interpreted the output score as the estimated suicide risk. System performance was assessed with 10x10-fold repeated cross-validation, and its behavior was studied by representing the distribution of estimated risk across the cases and controls, and the distribution of factors across estimated risks.
We extracted a total of 2604 suicide cases and 20 paired controls per case. Our best system attained a mean error rate of 26.78% (SD 1.46), with a sensitivity of 64.57% and a specificity of 81.86%. While the distribution of controls was concentrated around estimated risks below 0.5, cases were almost uniformly distributed between 0 and 1. Prescription of psychotropics, depression and anxiety, and self-harm increased the estimated risk by approximately 0.4. At least 95% of those presenting these factors were identified as suicide cases.
Despite the simplicity of the implemented system, the proposed methodology obtained an accuracy comparable to that of other published methods based on data generated from specialized questionnaires. Most of the errors came from the heterogeneity of patterns shown by suicide cases, some of which were identical to those of the paired controls. Prescription of psychotropics, depression and anxiety, and self-harm were strongly linked with higher estimated risk scores, followed by hospital admission and long-term drug and alcohol misuse. Other risk factors, such as sleep disorders and maltreatment, had more complex effects.
The World Health Organization recognizes suicide as a public health priority. The World Health Organization Member States are committed to working towards a 10% reduction in suicide rates worldwide by 2020 [
Unfortunately, the prediction of suicide risk has proven to be a challenging problem for epidemiological studies and how they apply to health care practice. The pathways to suicide are mediated by highly complex processes, integrating many interdependent risk factor variables [
Short-term suicide risk prediction (ie, days, weeks, or months) is particularly useful for targeted interventions; but less is known about the processes underlying short-term suicidality than longer-term presentations [
At the same time, we have databanks curating a wealth of electronic health records (EHRs) and administrative information, which, when linked, could provide a representative picture of the biological, societal and health status of an individual at any point in time. Use of these data at scale is expected to make a pivotal contribution to the study of many diseases [
Although the application of AI techniques in different areas of medicine is extensive [
In the last decade, the use of machine learning (a branch of AI) to analyze EHRs has grown dramatically, spurred in part by advances in artificial neural networks (ANNs) and deep learning [
Indeed, the application of AI in psychiatry is a field that has received relatively little attention but has great potential for innovation [
Passos and colleagues [
Kessler and colleagues [
We aim to explore the use of ANNs with routinely collected EHRs to estimate suicide risk within the general population. This approach builds on the research of Passos et al and Kessler et al, taking it a step further by relying on routinely collected EHRs across health settings rather than mental health questionnaires. Hence, our system would not depend on information that is collected only in specific circumstances (eg, outpatient visits or hospital admissions), and could therefore be used to screen the entire population without increasing the workload of health care practitioners.
Our system aims to improve not only the quality of suicide risk assessment, but also its coverage. This is a crucial factor when considering that only 35% of those who died by suicide in Wales between 2010 and 2015 were admitted to hospital in the year prior to death, and around 40% had an emergency department admission. Furthermore, of those who died by suicide in Wales between 2001 and 2015, 65% did not have a mental health record in the year prior to death, and 40% never had one. However, approximately 83% of these suicide cases had at least one contact with their general practitioner (GP) during that period. Therefore, our system seeks to utilize these contacts to assess suicide risk and increase population coverage.
Additionally, our system has the potential to perform risk assessment continuously over time and in the background (ie, without human intervention) across healthcare settings. Rather than using this as an assessment of immediate “at risk” or “not at risk,” it will be used to flag patients, even those attending for reasons other than mental health, so that appropriate questions can be asked. The UK National Institute for Health and Care Excellence recommends that risk assessment tools and scales should not be used to predict future suicide or repetition of self-harm [
The goal of this study is to test the feasibility of this concept, validating the methodology from functionality (performance) and medical (validity of factors-risk model) points of view. Using an oversimplified system (shallow ANN), conservative results regarding model complexity and performance are ensured. We combine data from primary and secondary care, use repeated cross-validation during evaluation, and explore the distribution of factors across different levels of estimated suicide risk to describe the system’s behavior.
In the remainder of this article, we describe the data sources used, how we defined our cohorts of suicide cases and controls, and the risk factors used during experimentation. A brief introduction to ANNs is provided, followed by a detailed description of the models evaluated here. We detail the analyses that were run to assess raw performance and the resulting factors-risk model. Following the presentation of the results, we discuss their interpretation as well as the potential of the proposed model, how it compares with the current state of the art approaches, its limitations and implications for practice, and conclusions.
Data available within the Secure Anonymized Information Linkage (SAIL) Databank [
For this study, we linked and analyzed the National Statistics Annual District Deaths Extract (ADDE), the Welsh Demographic Service (WDS), the Welsh Primary Care GP dataset (WGP), the Patient Episode Database for Wales (PEDW) and the Emergency Department Data Set (EDDS). While all datasets were used to define the study case-control cohort, only WDS, WGP and PEDW were used to build the feature vectors for experimentation.
Data availability varied across individuals and databases. While the ADDE and PEDW datasets have nationwide coverage, WGP contains data from 348 out of 474 (73%) GP practices in Wales. This variation was reduced by restrictions applied during the cohort definition (see below). At the same time, while the WGP and PEDW datasets were available over the full study period (2001 to 2015), ADDE was only available from 2009. However, ADDE data were used only to determine a key date before death, not to train or test the ANN system, and therefore we do not expect this to have significantly biased our results.
We extracted our cohort from SID-Cymru, a population based electronic case-control study on completed suicide in Wales between 2001 and 2015 defined within SAIL [
The case-control study cohort was built according to the following steps:
We identified those that died through suicide at age 10 or older between 2001 and 2015. Deaths of undetermined intent in those under 10 years of age may be related to abuse or neglect and thus were excluded.
We followed individuals’ health histories retrospectively from death date to identify the full calendar of health services contact leading up to death (CLD). This could include multiple entries within the WGP, PEDW and EDDS databases (eg, attendance at A and E, admission to hospital, transfer to another hospital, and finally GP letters received from hospitals notifying of deaths). A maximum CLD duration of one month was considered to avoid including unrelated hospital stays. The CLD was subsequently removed from the analysis to avoid using information directly linked with the death of cases.
Only those residing in Wales at the time of their death, with GP data available for at least 80% of the five years prior to CLD were included in the study. This ensured that similar data coverage was available for all cases and controls. The value of five years was chosen to balance between the length of health history and number of cases retained.
For each case, 20 controls were randomly selected, without replacement and excluding cases, after matching by gender and week of birth (±1 year). During control selection, those with a similar period of Welsh residency and GP data coverage were prioritized to ensure similar coverage quality. Although this number is unnecessarily large for traditional paired case-control studies, the proposed methodology benefitted from increased data availability during training.
A total of 2604 suicide cases were identified, of whom 2012 (77.3%) were male, along with 52,080 controls. These had a perfect (deterministic) or very high (probabilistic) linkage score (between 0.95 and 1) within SAIL.
Only data from WDS, WGP and PEDW were used during experimentation. Not all events recorded in WGP and PEDW represent face-to-face contact with the patient, and a single event may have multiple associated entries (eg, multiple diagnoses).
Each entry was categorized in WGP and PEDW into types of health event: depression and anxiety; other common mental disorders; other mental health; non-intentional injury and poisoning; self-harm; alcohol misuse; drugs misuse; possible maltreatment; physical sleep disorders; non-physical sleep disorders; and “others.” We also identified the prescription of opiates and psychotropics from WGP (PEDW has no prescription information) and recorded whether there were any entries recorded in WGP or PEDW (representing a hospital admission). This made a total of 15 factors (11 diagnoses, two prescriptions, WGP entries and hospital admissions).
The above categories were defined in terms of ReadCodes for WGP and ICD10 for PEDW with the help of expert clinicians and based on previous publications when available (depression and anxiety [
We identified the presence of the above 15 health events during four non-overlapping time-frames:
1M: Between CLD and 1 month before CLD [CLD – 1 month, CLD].
6M: Between 1 and 6 months before CLD [CLD – 6 months, CLD – 1 month).
1Y: Between 6 and 12 months before CLD [CLD – 1 year, CLD – 6 months).
5Y: Between 1 and 5 years before CLD [CLD – 5 years, CLD – 1 year).
The final feature vector also included age at CLD and sex, resulting in a length of 62: 1 float age + 1 binary sex + 15 binary health events * 4 time-frames. This feature vector does not include data directly related to the CLD. Interactions between these factors are automatically taken into account by the ANN.
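As an illustration, the encoding above can be sketched as follows. The factor names, the tuple-based event records, the ordering of the factor-by-time-frame bits, and the approximate day counts for the time frames are our assumptions for illustration; the paper does not specify them:

```python
from datetime import date

# 15 binary health-event factors (11 diagnosis groups, 2 prescriptions,
# WGP entries and hospital admissions); names are hypothetical labels.
FACTORS = [
    "depression_anxiety", "other_common_mental", "other_mental_health",
    "non_intentional_injury", "self_harm", "alcohol_misuse", "drugs_misuse",
    "possible_maltreatment", "physical_sleep", "non_physical_sleep",
    "others", "opiates_rx", "psychotropics_rx", "wgp_entry", "hospital_admission",
]

# Non-overlapping windows [start, end) in days counted back from the CLD:
# 1M, 6M, 1Y and 5Y (approximate day counts).
TIME_FRAMES = [(0, 30), (30, 182), (182, 365), (365, 5 * 365)]

def build_feature_vector(age, sex, events, cld):
    """events: list of (factor_name, event_date) tuples; cld: CLD date.
    Returns the 62-element vector: age + sex + 15 factors x 4 time frames."""
    vec = [float(age), float(sex)]
    for factor in FACTORS:
        for lo, hi in TIME_FRAMES:
            present = any(
                f == factor and lo <= (cld - d).days < hi
                for f, d in events
            )
            vec.append(1.0 if present else 0.0)
    return vec
```

Note that any event inside the CLD itself (days before the CLD date) would already have been removed upstream, per the cohort definition.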
Artificial neural networks (ANNs) are biologically inspired computing systems capable of learning tasks through examples and experience, without the need of explicit programming of task-specific rules or any a priori knowledge of the solution [
ANNs are typically composed of an input layer, one or more hidden layers and an output layer (
The term “black-box” is sometimes used to describe ANNs. This has contributed to the widespread misconception that ANNs are not transparent, which in turn has earned them a bad reputation in fields such as medicine, where understanding how and why decisions are taken is important. However, “black-box” alludes to the fact that the input-output model generated by the network is too complex to be expressed by a set of simple, syntactically meaningful rules. Such a model can nevertheless be expressed as a mathematical equation. For example, a simple ANN composed of no hidden layers and a single output neuron with a logistic activation function is equivalent to the logistic regression model r = 1 / (1 + e^(−(w·x + b))), where x is the input feature vector, w the learned weight vector, b the bias term, and r the output score.
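For concreteness, this equivalence can be written out in a few lines of plain Python; the helper name is ours:

```python
import math

def logistic_neuron(x, w, b):
    """A single output neuron with logistic activation: identical to the
    logistic regression model r = 1 / (1 + exp(-(w . x + b)))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

With zero weighted input the neuron outputs 0.5, the decision boundary of logistic regression; adding hidden layers composes several such functions, which is what makes the resulting equation hard to summarize as simple rules.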
Structure of an artificial neural network with 1 input layer (red), 2 hidden layers (green) and one output layer (blue) all fully connected.
A simple ANN was implemented with seven different configurations: no hidden layers (nn0); one hidden layer of size 10, 50 or 100 (nn10, nn50, nn100); and two hidden layers of size 10, 50 or 100 (nn10-10, nn50-50, nn100-100). All layers were fully connected (ie, each neuron in one layer was connected to every neuron in the following layer).
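The seven configurations can be sketched as below, in plain NumPy rather than TensorFlow for brevity. The logistic hidden activations and the 0.1 weight-initialization scale are our assumptions; the paper does not state them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden-layer sizes for the seven configurations evaluated in the paper.
CONFIGS = {
    "nn0": [], "nn10": [10], "nn50": [50], "nn100": [100],
    "nn10-10": [10, 10], "nn50-50": [50, 50], "nn100-100": [100, 100],
}

def init_network(n_inputs, hidden, n_outputs=1):
    """Fully connected weight matrices and biases: input -> hidden... -> output."""
    sizes = [n_inputs] + list(hidden) + [n_outputs]
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Logistic activation at every layer; the output is a risk score in (0, 1)."""
    a = np.asarray(x, dtype=float)
    for W, b in params:
        a = 1.0 / (1.0 + np.exp(-(a @ W + b)))
    return a
```

For example, `forward(init_network(62, CONFIGS["nn10-10"]), x)` maps the 62-element feature vector to a single estimated risk score.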
The mean square error was adjusted to account for data imbalance (20 controls per case) so that the resulting cost of both classes (case and control) was equal to 1. The final cost included l2 weight regularization with scale 0.01.
All ANNs were trained with the gradient descent algorithm and exponential learning rate decay starting at 1. Training was performed sequentially with three different batch sizes: 25, 100 and all cases and their respective controls (ie, total batch size 525, 2100 and full). The learning rate was reset with every change in batch size. Training within each batch size continued until a maximum number of epochs was reached, the change of cost function evaluated on the validation set was lower than a threshold or the change was in the negative direction (ie, not improving).
Using the oversimplified system (ie, small number of features and shallow ANNs) described above, we favored obtaining conservative results in terms of model complexity and performance, which we hope counteracts some of the limitations of the study (described below). In addition, in a practical application the cost of misidentifying suicide cases and controls will probably not be the same. Whether the system should be tuned to have a high sensitivity at the cost of low specificity, or vice versa, depends on many factors and is beyond the scope of this study. For simplicity, we set this cost to be the same for both classes. Hence, accounting for the unbalanced 1:20 distribution of cases and controls, the cost of misclassifying a case was 1, while the cost of misclassifying a control was 1/20. All experiments and ANNs were designed and executed using TensorFlow in Python.
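A minimal sketch of this class-balanced cost together with the l2 term is shown below; the function names and the exact normalization are our assumptions:

```python
import numpy as np

def weighted_mse(y_true, y_pred, case_weight=1.0, control_weight=1.0 / 20):
    """Mean square error with per-class weights, so that the total cost
    contributed by cases and by controls is balanced despite the 1:20 ratio."""
    y_true = np.asarray(y_true, dtype=float)
    w = np.where(y_true == 1, case_weight, control_weight)
    err = (np.asarray(y_pred, dtype=float) - y_true) ** 2
    return float(np.sum(w * err) / np.sum(w))

def l2_penalty(weights, scale=0.01):
    """l2 weight regularization term added to the final cost (scale 0.01)."""
    return scale * sum(float(np.sum(W ** 2)) for W in weights)
```

The total training cost is then `weighted_mse(...) + l2_penalty(...)`, penalizing large weights to discourage overfitting.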
We followed a 10x10-fold cross-validation approach to evaluate the performance of the ANNs. On each iteration, one-fold was used for testing, one for validation (used to inform the early stopping training algorithm) and eight for training. Cases were randomly distributed across folds, followed by their respective controls so that case-control pairs were always maintained during partitioning (this partitioning rule was also applied during batch partitioning in training).
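The pair-preserving partitioning can be sketched as follows; the round-robin assignment after shuffling is our simplification of "randomly distributed across folds":

```python
import random

def make_folds(case_ids, controls_by_case, k=10, seed=0):
    """Assign each case, together with its paired controls, to one of k folds,
    so that case-control pairs are never split across folds."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    folds = [[] for _ in range(k)]
    for i, case in enumerate(ids):
        folds[i % k].extend([case] + list(controls_by_case[case]))
    return folds
```

On each cross-validation iteration, one fold serves as the test set, one as the validation set for early stopping, and the remaining eight for training.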
On each iteration, as well as measuring the classification error obtained with the threshold resulting from training, the threshold was varied to compute the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). We compared performance between systems using a corrected resampled t test.
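The threshold sweep can be sketched as below; the helper is ours (an analysis at scale would typically use a library routine), and the trapezoid integration is the standard way to obtain the AUC from the curve:

```python
def roc_auc(scores_cases, scores_controls):
    """Sweep the decision threshold over all observed scores to trace the ROC
    curve (FPR, TPR pairs), then integrate with the trapezoid rule for the AUC."""
    thresholds = sorted(set(scores_cases) | set(scores_controls), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_cases) / len(scores_cases)
        fpr = sum(s >= t for s in scores_controls) / len(scores_controls)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc
```

A perfectly separating system yields an AUC of 1.0; a system no better than chance yields 0.5.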
Finally, we repeated the above analysis after shuffling the labels of the samples; that is, we randomly assigned the label “case” to one of the 20 paired controls of each case and relabeled the original case as “control.” This aims to evaluate whether our initial results reflect real relationships between labels and data, rather than random idiosyncratic patterns in the data.
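The shuffling step can be sketched as follows; the data layout (each group as a case plus its list of paired controls) is hypothetical:

```python
import random

def shuffle_labels(groups, seed=0):
    """For each (case, controls) group, randomly promote one of the paired
    controls to 'case' and relabel the original case as a 'control'."""
    rng = random.Random(seed)
    shuffled = []
    for case, controls in groups:
        new_case = rng.choice(controls)
        new_controls = [case] + [c for c in controls if c != new_case]
        shuffled.append((new_case, new_controls))
    return shuffled
```

Training and evaluating on this relabeled data should, if the original results are genuine, collapse performance to chance level, which is exactly what was observed (50% error rate, 0.5 AUC).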
In addition to measuring system performance, we attempted to assess the factors-risk model obtained by the best performing ANN. Due to the dimensionality of the feature vector (ie, number of input factors) and the freedom of the ANN to build complex models with numerous non-linear interactions, getting the full representation of the factors-risk model was not practical. However, the following results gave us insight into how large a role each factor played in the computation of the risk score:
The histogram of the number of cases and controls across estimated risk scores. This will provide information additional to the performance measurements about the classification capability for cases and controls.
The histogram of the estimated risk difference when turning specific factors “on” and “off” across the whole dataset. This will show an estimated role of each individual factor in the computation of the risk score, and how it varies due to interactions with other factors.
The distribution of each factor (ie, individuals presenting a factor) across estimated risk scores. This will work in conjunction with the previous point to draw an estimate of the role of each individual factor.
The incidence of each factor within estimated risk scores. This will allow us to compare incidences across risk levels for cases and controls.
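The second of the analyses above, toggling one factor "on" and "off" and recording the change in estimated risk, can be sketched as follows; here `model` stands for any trained scorer mapping feature rows to risk scores:

```python
import numpy as np

def risk_shift(model, X, feature_index):
    """Estimated-risk difference when a single binary factor is forced 'on'
    versus 'off' for every sample, with all other inputs left unchanged."""
    X_on, X_off = X.copy(), X.copy()
    X_on[:, feature_index] = 1.0
    X_off[:, feature_index] = 0.0
    return model(X_on) - model(X_off)
```

Because the ANN is non-linear, this difference varies across samples, which is why the paper reports it as a histogram rather than a single coefficient.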
Results of this analysis refer to the factor-risk model built by our ANN and do not necessarily agree with the real factor-risk model. Our confidence of how similar these two are depends on the size and quality of the testing data and on the performance of our system. This is true for any AI application, but it is especially important in medical applications such as the one proposed here.
The error rate of the described ANNs decreased slightly from 28.9% to 26.8% when increasing the number of hidden layers from 0 to 2 (
Crucially, results after shuffling the labels were characteristic of a random process (ie, 50% error rate and 0.5 AUC).
The distribution of cases and controls across estimated risk scores reflects the results of
Prescription of psychotropics, depression and anxiety, and self-harm seem to have the strongest effect on the estimated risk, increasing
Most samples were assigned a risk below the 0.5 threshold, with only 70 individuals resulting in a very low risk r≤0.17 (
Looking at how factors (individuals with factors “on”) were distributed across risk scores (
Mean and standard deviation of the error rate, sensitivity, specificity and AUC obtained on the 10x10-fold experiments for each neural network.
ANNa model | Error rate, mean (SD) | Sensitivity, mean (SD) | Specificity, mean (SD) | AUCb, mean (SD) |
nn0c | 28.89% (1.47) | 57.28% (2.97) | 84.94% (0.54) | 0.78 (0.02) |
nn10d | 27.12% (1.42) | 64.19% (2.94) | 81.57% (0.57) | 0.79 (0.02) |
nn50e | 27.09% (1.42) | 64.25% (2.92) | 81.57% (0.58) | 0.79 (0.02) |
nn100f | 27.11% (1.42) | 64.18% (2.93) | 81.61% (0.61) | 0.79 (0.02) |
nn10-10g | 26.78% (1.46) | 64.57% (3.00) | 81.86% (0.58) | 0.80 (0.02) |
nn50-50h | 26.83% (1.43) | 64.52% (2.92) | 81.82% (0.59) | 0.80 (0.02) |
nn100-100i | 26.83% (1.47) | 64.54% (3.04) | 81.79% (0.61) | 0.80 (0.02) |
aANN: artificial neural network.
bAUC: area under the ROC curve.
cnn0: No hidden layers.
dnn10: 1 hidden layer with 10 neurons.
enn50: 1 hidden layer with 50 neurons.
fnn100: 1 hidden layer with 100 neurons.
gnn10-10: 2 hidden layers with 10 neurons.
hnn50-50: 2 hidden layers with 50 neurons.
inn100-100: 2 hidden layers with 100 neurons.
Receiver operating characteristic (ROC) curve for nn0, nn50 and nn10-10. FPR: false positive rate; TPR: true positive rate; nn0: no hidden layers; nn50: 1 hidden layer with 50 neurons; nn10-10: 2 hidden layers with 10 neurons.
Distribution of cases and controls across estimated risk score levels. Those with risk score >0.5 were identified as “cases.”
Histogram of the difference in estimated risk score when turning specific factors ‘on’ and ‘off’ across the whole dataset.
Number of individuals, gender and mean age for controls, cases and estimated risk levels from very low to very high.
Description | Number of Individuals | Number of Males, n (%; 95% CI) | Mean age, years |
Controls | 52080 | 40240 (77.37%; 76.9%-77.6%) | 48.04 |
Cases | 2604 | 2012 (77.27%; 75.9%-78.6%) | 48.04 |
Very low risk (r≤0.17) | 70 | 4 (5.7%; 2.6%-12.1%) | 54.32 |
Low risk (0.17<r≤0.33) | 25744 | 17884 (69.5%; 68.9%-69.9%) | 48.07 |
Moderate-low risk (0.33<r≤0.5) | 17818 | 15850 (88.9%; 88.6%-89.3%) | 46.52 |
Moderate-high risk (0.5<r≤0.67) | 6011 | 4765 (79.3%; 78.4%-80.1%) | 49.31 |
High risk (0.67<r≤0.83) | 3675 | 2703 (73.5%; 72.3%-74.7%) | 53.03 |
Very high risk (r>0.83) | 1366 | 1046 (76.6%; 74.6%-78.4%) | 47.75 |
Samples presenting a specific factor and their distribution across cases and controls, and across estimated risks from very low (VLR) to very high (VHR). To the left of each bar group, the total number of individuals presenting the factor (sample size). At the top, the distribution of the full population.
Incidence of factors for cases, controls and estimated risk levels from very low (VLR) to very high (VHR). Panels on the right hand column (shaded) have y-axis limits between 0% and 30% to facilitate visualization.
In terms of incidence (
The presented oversimplified system successfully differentiated between 2604 suicide cases and 52,080 matched controls in 73.22% of tested instances during 10x10-fold cross-validation. It achieved this using only routinely collected EHRs from GP and hospital admissions in the five years before the case’s CLD.
The reduction in error rate as the number of hidden layers increased is representative of the complexity of the underlying suicide factors-risk model. In our case, results barely changed when the number of neurons in the hidden layers increased. In fact, performance differences between networks with the same number of layers came from a better tuning of the output scores resulting in an operational point closer to the optimal (ie, equal error rate). Overall, we expect the advantages of having more layers and neurons to become obvious when more factors are fed into the model.
The disparity observed between sensitivity and specificity, and in the score distributions of cases and controls, highlights the difference in difficulty between analyzing the two groups. Controls seem to follow more uniform patterns and are therefore easier to identify, hence the higher specificity and the clustering of controls below a 0.5 score. On the other hand, patterns among the cases are more heterogeneous, with some having feature vectors identical to those of controls, which explains the lower sensitivity and the almost uniform distribution of cases across risk scores.
The presented behavioral evaluations do not unequivocally explain the factor-risk model built by the network. However, they do provide a general idea of what is driving the output score upwards. The input factors prescription of psychotropics, depression and anxiety, and self-harm, and, to a lower degree, drugs and alcohol misuse, were strongly linked with increasing estimated risk scores. This is in keeping with previous literature [
On the other hand, some risk factors identified in the literature did not exhibit the same behavior in our results. Physical sleep disorders seemed to decrease rather than increase the estimated risk. Due to the relatively low incidence of this factor in our data, its effect may be attenuated by and highly dependent on more active factors. This would also explain the dispersion of its effect on the estimated risk score (
Given the imperfect specificity and relatively low sensitivity obtained, results from the behavioral analysis should not be directly extrapolated to the real-world factor-risk model. Having said that, the remarkable agreement between our model and the existing literature serves as an indication of the feasibility of our proposal. Additionally, we expect to substantially improve performance with a more complex system design, which will in turn increase our confidence in the validity of the obtained factors-risk model.
Perfect estimation of suicide risk using EHRs will never be possible, mainly because some individuals take their own life without ever seeking help or without presenting to health care services with signs of being at risk. In addition, of those that seek help or present with evidence, signs may be missed or inaccurately or insufficiently recorded. Others may simply present insufficient evidence to distinguish them from controls (ie, having a pattern identical to controls).
According to our data, around 90% of those that died through suicide in Wales had one or more contacts with health services in the year prior to their CLD, and approximately 30% of them had a contact related to their mental health. Therefore, the proposed methodology still has a good scope for application.
To our knowledge, Passos’ [
Interestingly, while Kessler’s method also suffered from low sensitivity, Passos’ system obtained comparable sensitivity and specificity. This may be due to the latter using data from the Structured Clinical Interview for DSM-IV Axis I Disorders questionnaire, which records highly specific diagnoses. In addition, Passos’ system aimed at differentiating previous suicide attempters from non-attempters, rather than identifying future risk.
The results presented here are limited by the purposely oversimplified system design used both in terms of the number of factors considered (only 15 over four time-frames) and the design of the ANN (a maximum of two hidden layers). Still, our system improved chance identification by almost 50%. As we move from feasibility to pilot study and increase the complexity of the system we expect to increase performance substantially.
The problem of suicide risk estimation suffers not only from a highly complex factors-risk model, but also from a lack of a quantitative measure of the real risk of suicide which is only known with certainty within a short time span before a recorded attempt. At any other time-point, we do not know the real risk for any individual. Someone at risk may refrain from ever attempting suicide, whereas another person may become at risk and attempt suicide within a very short period. This will have implications for a more practical evaluation (compared to the feasibility analysis presented here), as we will need to find ways to assess performance fairly without knowing the real risk ourselves.
Without properly labelled data, we need to rely on clinicians to assess the factors-risk model constructed by the algorithm. In our case, most of the individuals with a self-harm event were classified as cases or as being at risk (ie, r>0.5). Some of them belonged to the control group, and we considered these as errors in our evaluation. However, should all these instances be considered errors? The answer to this question is not trivial, and has technical, clinical and ethical implications that we need to explore in more depth.
Our proposal will be most practical in settings where professionals do not have specialist mental health training but are in contact with individuals at risk of suicide. Nurses, emergency department staff, ambulance services, police and prison workers would be among the ones benefiting the most from the tool proposed here. These professionals face both the challenge of seeing large numbers of people where it is difficult to discern those at risk, and of assessing the suicidality of individuals often without having received sufficient training and under staff shortages [
Prescription of psychotropics, depression and anxiety, and self-harm were strongly linked with higher estimated risk scores, followed by hospital admissions and long-term drugs and alcohol misuse which is in keeping with the current literature. Other risk factors such as sleep disorders and maltreatment had more complex effects.
The system presented here is an oversimplified one, using a short feature vector and shallow ANNs to assess the practicality of using EHRs in this way. As a feasibility study, we are more interested in (a) confirming the existence of discriminant information, and (b) validating the proposed methodology, than in obtaining high accuracy rates. Nevertheless, our system obtained an accuracy comparable to that of other published methods based on specialized questionnaire data.
Prescription of psychotropics, depression and anxiety, and self-harm were strongly linked with higher estimated risk scores, followed by hospital admissions and long-term drugs and alcohol misuse. Age and gender had no effect on risk. Interestingly, possible maltreatment had the opposite effects in the short and long terms, decreasing risk when recent and increasing it when more than a year before CLD.
The promising performance obtained with a basic ANN, and the fact that the resulting factors-risk model was for the most part in line with the literature, support the hypothesis that a tool capable of estimating suicide risk in the general population can be built using only routinely collected EHRs. We are a long way from employing such methods in clinical practice, but this is a first step towards harnessing the potential of routinely collected electronic health records to support clinical practice in real time.
National Statistics Annual District Deaths Extract
artificial intelligence
artificial neural networks
area under the ROC curve
contact leading to death
Emergency Department Data Set
electronic health records
general practice
high risk (0.67<r≤0.83)
low risk (0.17<r≤0.33)
moderate-high risk (0.5<r≤0.67)
moderate-low risk (0.33<r≤0.5)
Patient Episode Database for Wales
receiver operating characteristic
Secure Anonymised Information Linkage databank
very high risk (r>0.83)
very low risk (r≤0.17)
Welsh Demographic Service
Welsh Primary Care GP dataset
None declared.