This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.

Depression in people with bipolar disorder is a major cause of long-term disability and may lead to early mortality, yet few safe and effective therapies currently exist. Although existing monotherapies such as quetiapine have limited proven efficacy and practical tolerability, treatment combinations may lead to improved outcomes. Lamotrigine is an anticonvulsant currently licensed for the prevention of depressive relapses in individuals with bipolar disorder. A double-blind randomized placebo-controlled trial (comparative evaluation of Quetiapine-Lamotrigine [CEQUEL] study) was conducted to evaluate the efficacy of lamotrigine plus quetiapine versus quetiapine monotherapy in patients with bipolar type I or type II disorders.

Because the original CEQUEL study found significant depressive symptom improvements, the objective of this study was to reanalyze CEQUEL data and determine an unbiased classification accuracy for active lamotrigine versus placebo. We also wanted to establish the time it took for the drug to provide statistically significant outcomes.

Between October 21, 2008 and April 27, 2012, 202 participants from 27 sites in the United Kingdom were randomly assigned to 2 treatment groups: 101 to lamotrigine and 101 to placebo. The primary variable used for estimating depressive symptoms was the Quick Inventory of Depressive Symptomatology–Self-Report version 16 (QIDS-SR16). The original CEQUEL study findings were confirmed by performing

From weeks 10 to 14, the mean difference in QIDS-SR16 ratings between the groups was −1.6317 (

Adding lamotrigine to quetiapine treatment decreased depressive symptoms in patients with bipolar disorder. Our classification model suggested that lamotrigine increased the coefficient of variation in the QIDS-SR16 scores. The lamotrigine group also tended to have a lower DFA exponent, implying a substantial temporal instability in the time series. The performance of the model over time suggested that a trial of at least 44 weeks was required to achieve consistent results. The selected model confirmed the original CEQUEL study findings and helped in understanding the temporal dynamics of bipolar depression during treatment.

EudraCT Number 2007-004513-33; https://www.clinicaltrialsregister.eu/ctr-search/trial/2007-004513-33/GB (Archived by WebCite at http://www.webcitation.org/73sNaI29O).

Bipolar disorder, a psychiatric condition characterized by repeated elevated mood (mania) and low mood (depression) states [

There is, however, no consensus on the effectiveness or safety of antidepressant drugs, such as fluoxetine, for bipolar depression [

The comparative evaluation of quetiapine plus lamotrigine versus quetiapine monotherapy (CEQUEL) trial was a double-blind randomized placebo-controlled parallel group trial comparing a lamotrigine plus quetiapine treatment and a quetiapine monotherapy treatment in patients diagnosed with a bipolar I or II disorder (EudraCT Number: 2007-004513-33) [

The original analysis reported significant improvements in depressive symptoms for the lamotrigine subjects compared with the placebo subjects. In this paper, the data collected in the trial were reanalyzed using machine learning approaches, with the main objective of identifying the most appropriate binary classifier to distinguish between the lamotrigine and placebo effects. In addition to replicating the findings of the original statistical analysis, we wanted to determine how long it took for the drug to produce statistically significant outcomes. This estimate offers some guidance on the minimum duration required for trials that aim to establish treatment or drug efficacy.

To assess the differences between patients taking lamotrigine and those taking the placebo, a binary classification approach was used to identify the relevant features to be extracted from the time series. This approach considered the different characteristics of the collected data and sought to classify the participants into 2 distinct groups based on the observed features: the group taking lamotrigine and the group taking the placebo. The features were statistical metrics computed from the data, and the analysis determined which of them would facilitate the classification; for example, it was expected that the mean QIDS-SR16 scores would differ between the 2 groups. For every feature, kernel-based density estimation was used to examine its probability distribution in each of the 2 groups and to assess whether the 2 distributions differed; the larger the difference, the greater the explanatory power of the feature. For the same reason, the performance of a classifier using each individual feature was also examined. Finally, the features and classifier with the best cross-validated accuracies were identified.
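The density comparison described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the Gaussian kernel, the normal-reference bandwidth rule, the overlap measure, and the feature values are all assumptions introduced here.

```python
import math

def gaussian_kde(sample, bandwidth=None):
    """Return a function that estimates the density of `sample` with Gaussian kernels."""
    n = len(sample)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption; the paper does not state its bandwidth).
        mean = sum(sample) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-1 / 5)
    scale = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)

    return density

def overlap(f, g, lo, hi, steps=2000):
    """Midpoint-rule integral of min(f, g) on [lo, hi]: near 1 if identical, near 0 if disjoint."""
    h = (hi - lo) / steps
    mids = (lo + (i + 0.5) * h for i in range(steps))
    return h * sum(min(f(x), g(x)) for x in mids)

# Hypothetical values of one feature for each group (not trial data).
group_a = [0.31, 0.35, 0.42, 0.38, 0.45, 0.33, 0.40, 0.36]
group_b = [0.22, 0.27, 0.30, 0.25, 0.33, 0.28, 0.24, 0.29]

f_a, f_b = gaussian_kde(group_a), gaussian_kde(group_b)
shared = overlap(f_a, f_b, lo=-0.5, hi=1.5)  # smaller overlap suggests a more informative feature
```

A smaller overlap between the two estimated densities corresponds to the "bigger difference" that the text associates with greater explanatory power.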

Data were collected from 202 participants over a period of 52 weeks. Of these, 149 had bipolar type I disorder and 53 had bipolar type II disorder; 90 were male and 112 were female.

Subject compliance: number of subjects reporting in a particular week.

Initially, there were 202 registered participants. After the first 2 weeks, there was a large drop in the number of people who continued to report; however, there were almost equal numbers of subjects in both the lamotrigine and placebo groups. A more detailed trial profile is presented elsewhere [

There was a 60% overall compliance with the self-reported mood ratings at 12 weeks. Although fewer participants submitted ratings for the entire 52 weeks, no between-group differences were observed. For this reason, 2 data subsets were explored: 153 participants who submitted mood data for at least 5 weeks and 138 participants who reported for at least 10 weeks. Another challenge was that the participants did not submit mood ratings at regular time intervals. Because patients were able to submit scores at any time during the week, the target frequency of one report per week was not always achieved, and the resulting data were unevenly sampled. As a result, extra care was required when using general time series techniques that had originally been developed for uniformly sampled data.

Because the overall goal was to build a binary classifier that could differentiate the patients taking lamotrigine from those taking the placebo, the first exploratory step was to study different statistical metrics, called features, calculated from the dataset. The aim was to identify the “good” features to feed into the classifier, that is, those with sufficient explanatory power to facilitate the classification task. To this end, the contribution of each feature to the classification accuracy when used individually was also investigated.

A common approach when evaluating the explanatory power of a feature has been to assess its associated probability distribution. Kernel smoothing density estimation [

A receiver operating characteristic (ROC) curve is a graphical plot that shows how the performance of a binary classifier changes as its classification threshold is varied. Given a labeled dataset, a binary classifier produces 4 possible outcomes when the predicted class is compared with the original label: true positives, false positives, true negatives, and false negatives. As the classification threshold is varied, the ROC curve plots the true positive rate against the false positive rate, and the area under the ROC curve provides a single metric for evaluating the classification model; the larger the area under the curve, the greater the possibility of achieving a high true positive rate together with a low false positive rate.
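As a small worked illustration of this definition, the sketch below sweeps the threshold over a set of classifier scores and integrates the resulting curve with the trapezoidal rule. The scores and labels are hypothetical, and the function names are introduced here for illustration only.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by lowering the threshold through the scores.

    `labels` holds 1 for the positive class and 0 for the negative class;
    both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so tied scores share one threshold.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2 for k in range(1, len(fpr)))

# Hypothetical scores (higher means "more likely positive") and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
area = auc(fpr, tpr)  # 8/9: eight of the nine positive-negative pairs are ranked correctly
```

The area equals the fraction of positive-negative pairs in which the positive case receives the higher score, which is why a larger area indicates a better ranking.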

In practice, using one single variable or using all available variables for the classification may not result in an optimal classifier. Various feature selection techniques can be used to select the best model from among the set of available variables, which involves selecting those variables that are representative of the data variability, while improving the classification accuracy. Two types of techniques for feature selection were considered. On one hand, lasso, elasticnet and ridge regression [_{1} and L_{2} norms of the error obtained on a cross-validation set. With these techniques, the best performing features were added to the model one by one up to the point at which the cross-validation error did not reduce. On the other hand, sequential feature selection techniques work in a similar manner but use the misclassification rate as an error metric. We studied both techniques for feature selection to test the robustness of our results.

The different variables computed from the original dataset were investigated as candidate features to feed into the classifiers. The raw dataset contained subject attributes, such as demographic information (age and gender) and bipolar type, together with the reported QIDS-SR16 and ASRM values. The age, gender, and bipolar type were kept as they were, and a range of statistics was computed from the QIDS-SR16 values. The ASRM values were not investigated further because they did not yield any valuable information in the early exploratory stage. Further, because lamotrigine was being used to treat bipolar depression, the QIDS-SR16 values were more informative. The QIDS-SR16 mean and SD were the initial candidate features; for example, if the drug worked, the mean QIDS-SR16 value would be lower for the subjects taking lamotrigine than for those taking the placebo. The coefficient of variation, which is defined as the ratio of the SD to the mean, was also used as a separate feature. Other simple statistics considered were the skewness and kurtosis of each subject’s QIDS-SR16 time series, which summarized the QIDS-SR16 data distribution but did not reflect the temporal dynamics.
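The simple per-subject statistics listed above can be computed as in the following sketch. The choice of the sample SD (n − 1 denominator) and of moment-based skewness and non-excess kurtosis are assumptions, because the text does not specify which estimators were used; the example scores are hypothetical.

```python
import math

def summary_features(values):
    """Distribution-level features of one subject's series (temporal order is ignored)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with a 1/n denominator, used for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2 * n / (n - 1))  # sample SD (assumed estimator)
    return {
        "mean": mean,
        "sd": sd,
        "cv": sd / mean,              # coefficient of variation; undefined when the mean is 0
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,     # non-excess kurtosis (3 for a normal distribution)
    }

# Hypothetical weekly QIDS-SR16 scores for one subject (not trial data).
features = summary_features([12, 10, 14, 9, 11, 8, 13, 10])
```

Shuffling the order of the scores leaves every value in `features` unchanged, which illustrates why these statistics cannot capture temporal dynamics.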

In addition to these basic statistics, other more complicated features were computed from the QIDS-SR16 time series data. The Lomb periodogram [

Detrended fluctuation analysis (DFA) [

To maximize the classification performance, a number of linear and nonlinear classifiers were also investigated. A linear classifier is a classification algorithm, the objective function of which is a function of a linear combination of features. A binary linear classifier has a linear decision boundary, and a nonlinear classifier has a nonlinear decision boundary. Detailed algorithmic procedures for the classifiers used here are not given and interested readers can consult the relevant references. The linear classifiers investigated were logistic regression [

In short, the classification performance was evaluated using the different linear and nonlinear methods for each of the individual features and for the subset of features obtained in the feature selection step. Complex nonlinear models can easily overfit the data; to avoid this problem, the evaluation metric was the classification accuracy on out-of-sample data, estimated using 10-fold cross-validation. The construction, evaluation, and comparison of the many different models for the several features involved an exhaustive search and comparison of classifier performance. Because no model was perfect, the more these different models agreed with each other, the more confidence there was in the results. Parsimonious models are attractive not only because the risk of overfitting is reduced but also because a simpler model is easier to interpret. Data from one, two, and three quarters of the trial period were also considered so that the performance of the classifier could be monitored over time to assess the optimal trial duration for the lamotrigine and placebo comparison.
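The out-of-sample evaluation can be illustrated with a short sketch of 10-fold cross-validation. The nearest-centroid classifier is used here only as a compact stand-in and is not one of the classifiers examined in the study; the synthetic two-feature data, the function names, and the random seeds are likewise assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def cross_validated_accuracy(X, y, k=10, seed=0):
    """Fraction of cases classified correctly when each case is predicted by a model
    that was fitted without the fold containing it."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        model = fit_centroids(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)

# Synthetic, well-separated two-feature data for two groups (not trial data).
rng = random.Random(1)
X = [[rng.gauss(0.3, 0.05), rng.gauss(1.0, 0.05)] for _ in range(50)]
X += [[rng.gauss(0.5, 0.05), rng.gauss(0.7, 0.05)] for _ in range(50)]
y = [0] * 50 + [1] * 50
accuracy = cross_validated_accuracy(X, y, k=10)
```

Because every prediction comes from a model that never saw the held-out case, the resulting accuracy is less likely to be inflated by overfitting than the in-sample accuracy.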

The initial data exploration analyzed the trends in the QIDS-SR16 time series.

The probability distributions for the reported QIDS-SR16 and ASRM values were also investigated. The kernel density estimates shown in

The results in

After designing many features to feed into the classifier, feature selection was conducted using lasso, elasticnet, ridge regression, and the sequential feature selection techniques. All these methods agreed on the choice of only 2 variables, the scaling exponent alpha from DFA and the coefficient of variation (σ/μ). The performance evaluations for these 2 features are summarized in

These 2 selected features produced easily understood information. The coefficient of variation (σ/μ) reflected the shape of the QIDS-SR16 distribution and was not affected by the weekly fluctuations, and the DFA exponent alpha quantified the nature of the temporal weekly fluctuations and was not affected by the distribution. Therefore, intuitively, the coefficient of variation (σ/μ) was seen as a standardized measure of the dispersion and the DFA exponent alpha as a measure of the temporal stability.
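To make the role of the DFA exponent more concrete, the following sketch implements first-order DFA for a uniformly sampled series. It is a textbook-style illustration under stated assumptions, not the authors' procedure: the window sizes and the synthetic series are chosen here, and the sketch ignores the uneven sampling of the trial data, for which the text does not describe a specific adjustment.

```python
import math
import random
from itertools import accumulate

def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def dfa_alpha(series, window_sizes):
    """Scaling exponent alpha from first-order detrended fluctuation analysis."""
    mean = sum(series) / len(series)
    profile = list(accumulate(v - mean for v in series))  # integrated, mean-centred series
    log_n, log_f = [], []
    for n in window_sizes:  # each window needs at least 3 points for a non-trivial fit
        xs = list(range(n))
        squared_residuals, points = 0.0, 0
        for start in range(0, len(profile) - n + 1, n):  # non-overlapping windows
            ys = profile[start:start + n]
            slope, intercept = linear_fit(xs, ys)
            squared_residuals += sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
            points += n
        log_n.append(math.log(n))
        log_f.append(0.5 * math.log(squared_residuals / points))  # log of RMS fluctuation F(n)
    alpha, _ = linear_fit(log_n, log_f)  # slope of log F(n) against log n
    return alpha

# Synthetic reference signals (not trial data).
rng = random.Random(0)
white = [rng.gauss(0, 1) for _ in range(2000)]
walk = list(accumulate(white))  # Brownian noise: cumulative sum of white noise
sizes = [4, 8, 16, 32, 64]
alpha_white = dfa_alpha(white, sizes)
alpha_walk = dfa_alpha(walk, sizes)
```

In theory, uncorrelated white noise gives an exponent near 0.5 and Brownian noise gives an exponent near 1.5, so a lower alpha corresponds to a rougher, less persistent series.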

Trend analysis: The linear regression lines for the Quick Inventory of Depressive Symptomatology—self report version 16 (QIDS-SR16) data were plotted.

Upper half: cumulative distribution function (CDF) density plot for the Quick Inventory of Depressive Symptomatology and Altman self-rating scale value estimates. Lower half: plot for the expanding window average for the Quick Inventory of Depressive Symptomatology—self report version 16 (QIDS-SR16) and Altman self-rating scale (ASRM).

Results from 2 sample

Four-week period (weeks) | Sample size, lamotrigine | Sample size, placebo | QIDS-SR16 difference | P value | 95% CI |
1-12 | 101 | 101 | −0.33 | .63 | −1.02 to 1.69 |
10-14 | 81 | 77 | −1.63 | .09 | −0.24 to 3.50 |
20-24 | 62 | 64 | −1.52 | .18 | −0.73 to 3.77 |
30-34 | 52 | 55 | −3.03 | .007 | 0.83 to 5.23 |
40-44 | 50 | 49 | −2.09 | .08 | −0.22 to 4.40 |
48-52 | 54 | 48 | −2 | .09 | −0.34 to 4.35 |

Cumulative distribution function (CDF) density estimates and receiver operating characteristic (ROC) curves for the selected variables: detrended fluctuation analysis exponent alpha (DFAα) and coefficient of variation (σ/μ).

Because the chosen model only had 2 variables, it was easy to visualize the classification decision boundary.
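For a two-feature logistic regression, the decision boundary is the straight line on which the predicted probability equals 0.5. The sketch below fits such a model by plain gradient descent and solves for that line; the data points, learning rate, and number of iterations are hypothetical choices made for illustration, and the fitting routine used in the study is not specified here.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, learning_rate=0.5, epochs=5000):
    """Fit weights [w0, w1, w2, ...] by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the intercept
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            err = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) - t
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - learning_rate * g / n for wi, g in zip(w, grad)]
    return w

def boundary_x2(w, x1):
    """Second feature on the line w0 + w1*x1 + w2*x2 = 0 (requires w2 != 0)."""
    return -(w[0] + w[1] * x1) / w[2]

def predict_label(w, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) >= 0.5)

# Hypothetical (coefficient of variation, DFA alpha) pairs for two groups (not trial data).
X = [(0.20, 1.2), (0.30, 1.1), (0.25, 1.3), (0.60, 0.6), (0.70, 0.5), (0.65, 0.7)]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
alpha_on_boundary = boundary_x2(w, 0.4)  # DFA alpha at which a subject with sigma/mu = 0.4 is undecided
```

Plotting `boundary_x2` over a range of coefficient-of-variation values gives the line that separates the two predicted classes in the feature plane.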

Given the linear separation of the dataset, it was also possible to visualize the time series for the participants lying in the 4 decision area corners;

A binary classification analysis was conducted, for which a variety of linear and nonlinear models were applied to the individual features, all features, and to the 2 selected features (coefficient of variation and DFA exponent alpha). The results are summarized in

As mentioned, participant compliance was important and was found to yield time series of varying durations. A clinician who was using the classifier to test whether lamotrigine was better than the placebo would need to know how many weeks of data were required before a definitive decision could be made. For this reason, the performance of the logistic regression model was evaluated against the amount of data reported by participants. For this evaluation, the participants were selected if they had provided at least 10 QIDS-SR16 responses for trial durations of 13 to 52 weeks. ^{th} to the 52^{nd} weeks). The maximum of the fractions for the lamotrigine and placebo participants served as the “no information” benchmark for the classifier. In addition, for most periods from 20 weeks onwards, the classifier was outperforming the benchmark. For trials ranging from 44 to 52 weeks, the classification accuracy was greater than the benchmark and was statistically significant.
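The "no information" benchmark described above is the accuracy obtained by always predicting the larger group. One way to check whether an observed accuracy exceeds it is a one-sided exact binomial test, sketched below. The choice of this test is an assumption, because the text does not name the significance test that was used, and the counts in the example are hypothetical.

```python
from math import comb

def no_information_rate(n_lamotrigine, n_placebo):
    """Accuracy of always predicting the larger of the two groups."""
    return max(n_lamotrigine, n_placebo) / (n_lamotrigine + n_placebo)

def binomial_p_value(correct, total, p0):
    """One-sided exact P(X >= correct) for X ~ Binomial(total, p0)."""
    return sum(comb(total, k) * p0 ** k * (1 - p0) ** (total - k)
               for k in range(correct, total + 1))

# Hypothetical example: 54 and 46 participants per group, 65 of 100 classified correctly.
benchmark = no_information_rate(54, 46)
p_value = binomial_p_value(65, 100, benchmark)
```

A small `p_value` would indicate that the observed number of correct classifications is unlikely under a classifier that performs no better than the benchmark.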

Classification decision boundary using logistic regression. BN: Brownian noise; DFAα: detrended fluctuation analysis alpha; PN: pink noise; WN: white noise; σ/μ: coefficient of variation.

The four Quick Inventory of Depressive Symptomatology—self report version 16 (QIDS-SR16) time series for the subjects lying near the 4 decision boundary plot corners.

Data requirements for individual subjects and classification accuracy for data available after 13 weeks.

The results from the classification model confirmed the original findings of the CEQUEL study that the addition of lamotrigine to quetiapine for the treatment of severe bipolar depression decreased depressive symptoms compared with a placebo. Although the original CEQUEL data analysis had relied on linear regression model fitting, the classification model used in this study not only examined the data from a different perspective but also provided robust explanatory power because it allowed for an estimation of the extent to which the 2 groups of observations (ie, treatment allocation: active lamotrigine vs placebo) could be distinguished based only on the QIDS-SR16 scores. Therefore, a model was constructed without the need for any prior information about the clinical interventions during the treatment period. Using out-of-sample classification accuracy and simpler models allowed us to avoid overfitting, which is problematic when complicated machine learning models are applied to shallow datasets. Another advantage of the machine learning models was the ability to test multiple features computed from the raw data and select only those features that had sufficient explanatory power.

The use of DFA on the QIDS-SR16 time series allowed us to examine the temporal stability of the treatment, an aspect that was not considered in the original study. Future data analyses could attempt to explain why the QIDS-SR16 scores of the participants taking the active lamotrigine were generally temporally unstable. The effect of other clinical interventions and concurrent treatments during the study period could also be considered. Finally, it was demonstrated that machine learning techniques could be generally used in clinical trials to provide greater insights into what the data represent beyond classical statistical analyses, especially when there are large, complex datasets available. One drawback of using machine learning techniques, however, is that the analyst must deal with the bias-variance tradeoff. Another disadvantage is that some powerful machine learning models require very large datasets to be able to generalize well.

This study confirmed that the use of lamotrigine decreases depressive symptoms in bipolar patients. The selected classification features suggested that lamotrigine increased the coefficient of variation (achieved by increasing SD or decreasing the mean of the QIDS-SR16 time series). It was also found that patients taking lamotrigine tended to have rougher time series, which was indicative of a greater temporal instability in the time series. The 2 features, the coefficient of variation and DFA exponent, implied that a two-dimensional visualization diagram and linear decision boundary can be constructed to better understand bipolar disorder and the ways that the participants are affected by lamotrigine. The statistical significance of the classification was evaluated, from which it was determined that a trial of at least 44 weeks was required to distinguish between lamotrigine and the placebo. It would be useful to conduct additional studies to obtain a larger cohort of compliant participants. The selected features provided a deeper understanding of the temporal dynamics of subjects experiencing bipolar disorder and offered the potential for the better monitoring of symptoms over time.

Comparison of models: Logistic regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Linear Support Vector Machine (LSVM), Gaussian Kernel SVM (GSVM) and K-Nearest Neighbors (KNN). We show both the in-sample and out-of-sample classification accuracy.

ASRM: Altman self-rating scale

CEQUEL: comparative evaluation of Quetiapine-Lamotrigine

DFA: detrended fluctuation analysis

QIDS-SR16: Quick Inventory of Depressive Symptomatology–Self-Report version 16

ROC: receiver operating characteristic

None declared.