This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.
A major problem in mental health clinical trials, such as those for depression, is low assay sensitivity in primary outcome measures. This has contributed to clinical trial failures, resulting in the exodus of the pharmaceutical industry from the Central Nervous System space. This reduced assay sensitivity in psychiatry outcome measures stems from inappropriately broad measures, recall bias, and poor interrater reliability. Limitations in the ability of traditional measures to differentiate between the trait-like versus state-like nature of individual depressive symptoms also contribute to measurement error in clinical trials. In this viewpoint, we argue that ecological momentary assessment (EMA), that is, frequent, real-time, in-the-moment assessment of outcomes delivered via smartphone, can both overcome these psychometric challenges and reduce clinical trial failures by increasing assay sensitivity and minimizing recall and rater bias. Used in this manner, EMA has the potential to further our understanding of treatment response by allowing for the assessment of dynamic interactions between treatment and distinct symptom response.
Mental health treatment development and testing have been at an impasse for the past several decades; our clinical trials fail more often than those in other fields [
The randomized, placebo-controlled trial is still considered the gold standard test of treatment efficacy. However, over the past 60 years of treatment research in psychiatry, we have observed that treatment effect sizes remain stable, whereas placebo responses rise [
The contribution of poor measures to treatment failures is particularly well-illustrated in antidepressant trials [
It seems unlikely, given the financial and intellectual resources brought to bear in the early phases of discovery, that investigators could have gotten the scientific rationale so wrong. A more probable explanation for the failed studies might lie in how the primary outcome was determined and measured. Although the MADRS is considered a standard assessment tool in depression research, poor interrater reliability (ie, imprecision of measurement) is one of many limitations to this measure’s assay sensitivity.
Measurement
Self-report measures may incorporate reporter bias, whereas clinician-administered assessments incorporate bias on the part of the clinician. For example, there may be bias in recruitment or sample ascertainment, such as career patients who serially enroll in research studies for financial reasons and are thus motivated to answer questions in such a way as to increase likelihood of enrollment. Investigators may unconsciously inflate baseline measures of psychiatric symptoms to meet recruitment goals [
Nonetheless, these arguments fail to explain why academic studies, in which less financial gain accrues to the patient and investigator, also see a high placebo response and failure rate [
The idea of using technology to increase the accuracy and precision of symptom assessment in clinical trials is gaining momentum. For example, the National Institutes of Health toolbox was designed specifically for this purpose [
Clearly, we are not the first to contemplate the problem of assay sensitivity in our field. However, public discussion as to why progress in the field of psychometrics has stalled has not extended to industry trials. Open scientific discourse has also been limited on the subject of developing novel, effective, Food and Drug Administration (FDA)–sanctioned instruments, which could be used to track mental health disorder outcomes with greater assay sensitivity. As the success or failure of antidepressant treatment trials often rests solely on the presumed validity and reliability of symptom measures, it should follow that these assessments deserve the same degree of scrutiny regarding assay sensitivity as any laboratory test.
In this viewpoint, we will examine 3 major problem areas we believe the field needs to address in getting to precision assessment: overly complex assessment tools, contributions of human error, and limitations of infrequent sampling. First, we will review the 2 gold standard depression instruments used at present to track psychiatric symptoms in industry-funded drug trials. Next, we will examine the role of clinician assessment and how human involvement in measurement contributes to error. We will then discuss challenges to adequate measurement frequency in obtaining valid self-report data. Finally, we propose a solution to the measurement problem in depression clinical trials. We will explore contributions from the fields of mathematics, human psychology, and computer science to the development of mobile technology–based measures, which we believe may offer significant improvements over traditional symptom assessment.
Key point:
Overly broad measures that attempt to cover multiple symptoms or symptom domains compromise signal detection. To meaningfully reduce error, consensus on what to measure is needed.
Psychiatric rating scales frequently use diagnostic criteria or descriptive psychopathology to track a patient’s progress throughout a clinical trial. The descriptive psychopathology for a given psychiatric disorder is by nature more expansive than the diagnostic criteria alone, which can be helpful for identifying clinically significant features for treatment targets. This problem is not restricted to mental health research; trials in cardiology have also been compromised by failing to adequately confine outcome measures for meaningful signal detection [
Take for example the MADRS discussed above [
To further complicate matters, measuring multiple constructs inflates the chance that items tied to each construct will shift unpredictably over time (eg, due to lack of longitudinal factorial invariance) [
The shortened 6-item HAM-D and MADRS scales, which favor core items such as low mood, anhedonia, and guilt, have both been shown to be more sensitive than HAM-D-17 and the 10-item MADRS, respectively [
Consensus on the most clinically, functionally, or personally relevant features of treatment response or remission is needed to improve signal detection. If we simply wish to use our existing scales more pragmatically, we would take a treatment we know to be effective and choose the individual items from a selected scale that reveal the greatest amount of separation in favor of the proven treatment. We would then use the items from that same scale to determine whether or not an unproven treatment is effective. Alternatively, the field could adopt a universal consensus around measuring the core emotional symptoms of the illness to determine treatment success or failure. This is a difficult and unlikely scenario as we do not have the evidence base at present necessary to establish what exactly these core symptoms might be. In either case, improvement from a functional or pharmacoeconomic perspective may not map well onto any of the items in the measures we currently use. This may force the field to revisit some of its a priori assumptions about clinical relevance. In short, although we can confidently say that our current approach is suboptimal, fixing it will not be so easy.
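The pragmatic item-selection approach described above can be sketched in a few lines. This is a minimal illustration only, assuming per-item improvement scores are available for a proven treatment and for placebo; the item names, the data, and the helper functions `cohens_d` and `rank_items` are hypothetical and are not drawn from any trial or published method. Separation is summarized here as a standardized mean difference with a pooled SD, which is one reasonable choice among several.

```python
from statistics import mean, stdev


def cohens_d(treated, placebo):
    """Standardized mean difference using a pooled standard deviation."""
    n1, n2 = len(treated), len(placebo)
    s1, s2 = stdev(treated), stdev(placebo)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treated) - mean(placebo)) / pooled


def rank_items(treated_by_item, placebo_by_item):
    """Return (item, d) pairs sorted from largest to smallest separation."""
    scores = {
        item: cohens_d(treated_by_item[item], placebo_by_item[item])
        for item in treated_by_item
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical improvement scores (baseline minus end point) per participant.
treated = {"low_mood": [4, 5, 3, 4, 5], "sleep": [1, 2, 1, 2, 1]}
placebo = {"low_mood": [2, 1, 2, 3, 2], "sleep": [1, 2, 2, 1, 1]}

ranking = rank_items(treated, placebo)
for item, d in ranking:
    print(f"{item}: d = {d:.2f}")
```

Items at the top of such a ranking would be the candidates carried forward to evaluate an unproven treatment. Selecting items and testing them on the same data would overstate separation, so in practice the ranking would need to come from an independent sample.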
Key points:
Clinician-administered scales compound response bias
Self-report alone is imperfect but minimizes rater contribution to measurement error
Psychiatric treatment research has traditionally considered clinician-administered assessments to be the
Error or bias on the part of the clinician is routine, rather than idiosyncratic. It would be unfair to presume it to be the result of malice or laziness. It may happen unconsciously and even in good faith because clinical judgment is not completely objective. Interviewers are also susceptible to either a positive or negative rater bias depending on whether research participant attributes, often irrelevant to the assessment at hand, are perceived as positive or negative. This can result in sometimes pronounced unconscious alterations of judgment [
Although the evidence is still far from conclusive, a decent body of literature has elevated the stature of PROs vis-a-vis traditional, clinician-administered rating scales. Self-report assessments represent an improvement over clinician-administered assessments insofar as they eliminate rater bias and reduce the likelihood that participants will feel compelled to give socially desirable responses (a type of response bias) or affirmative answers when interviewed face-to-face [
Key points:
Retrospective patient symptom report in the context of a clinical trial may be inaccurate
Ecologically valid symptom reports collected in real time are needed to interpret treatment effects
Self-report also has inherent limitations. This was recognized by Arthur Schopenhauer in the 19th century [
Infrequent measurement or sampling in clinical trials tacitly makes the assumption that we know enough about how an illness behaves over time to ask questions with a time frame modifier (eg, “In the last week...”) and is associated with measurement error in clinical trials. This has been illustrated in disciplines outside of psychiatry. For example, the Heart Outcomes Prevention Evaluation trial evaluated the effect of the angiotensin-converting enzyme inhibitor ramipril in patients at high risk for adverse cardiovascular events [
Similar to blood pressure, depressive symptoms also appear to fluctuate throughout the day or in response to specific situations [
Symptoms of many psychiatric illnesses are characterized as trait-like in advance of any evidence to support this assumption. However, variation is routinely observed in behaviors studied over time, irrespective of how trait-like they seemed to be (eg, personality traits such as sociability) [
Despite this, we continue to measure mood as a stable trait-like symptom (eg, “in the last 7 days, how has your mood been?”). This is the case for most psychiatric symptom assessments, where the dynamic versus stable or trait-like nature of symptoms is poorly described. The only way to ascertain variation or lack thereof is to sample the illness frequently
Even if the symptoms of psychiatric illness are predominantly trait-like, we would continue to favor frequent sampling, even if this requires us to use a smaller number of items. This is in contrast to classical test theory, from which we take the maxim that adding equally good items to a measure leads to greater reliability and therefore, a better shot at validity [
Ecological momentary assessment (EMA) is frequent, real-time, patient-reported assessment delivered via surveys (eg, “right now, my mood is...”) that the patient typically completes on a mobile device, yielding information about the patient in a real-world setting [
Frequent, real-time EMA sampling has been shown in the same study to both qualify positive findings in clinical trials and detect treatment effects that the HAM-D was unable to detect between groups after 18 weeks of treatment [
An example of how infrequent sampling adversely affects assay sensitivity in clinical trials was recently provided by Moore et al [
EMA may also increase measurement precision by tracking how symptoms of an illness behave and interact over time [
It may also be possible to discern which symptoms are central to the disorder under study and how certain upstream symptoms may influence a cascade of symptoms downstream. How many EMAs are
Another question that might be asked is whether patients responding to an intervention or placebo get better in the same way. In other words, do the
EMA may also help us detect the phenomenon of regression to the mean. This phenomenon occurs when a baseline assessment of symptoms in a clinical research study is inflated at the initial visit before regressing to where those symptoms normally
Once individual symptom characteristics are known, targeted interventions can be developed. For instance, if insomnia leads to anergia the following day, which in turn leads to anhedonia, one might examine whether applying an intervention at the onset of insomnia changes the observed course of symptomatology downstream. This sort of intervention is called an ecological momentary intervention (EMI), or a just-in-time adaptive intervention, because it is informed by data gathered by EMA. Researchers are already using EMA data to deliver EMIs. For example, EMI has been shown to be very successful in providing patients with substance use disorders relapse prevention tools precisely when they need them the most [
Multiple methods, including multilevel vector autoregression and multilevel dynamic structural equation modeling, can help researchers examine how individuals may vary from group trends over time [
The use of EMA to gather the data needed to deliver a just-in-time EMI is also consistent with the concept of target engagement raised by the National Institute of Mental Health in an effort to address the declining success of clinical trials in mental health. A target is defined as something “molecular, cellular, circuit, behavioral or interpersonal, commensurate with the intervention,” which is expected to be changed in some way by the intervention being studied [
Although smartphone ownership is not universal, it is increasing, particularly among individuals with psychiatric conditions. John Torous found in a recent survey of 457 individuals with schizophrenia or schizoaffective disorder that more than half (54%) of such individuals owned a smartphone [
Use of EMA in the real world often leads to missing data, which have historically made analysis problematic. Users may not complete the required number of surveys in a timely manner, and, as described above, the frequency of assessments increases precision only up to a point. Beyond this point, with too frequent assessment, the risk increases of either introducing noise by sampling irrelevant aspects of the human condition or of the assessment itself becoming a negative part of the intervention. Investigators will have to consider an assay sensitivity assessment as part of the startup process to determine how the target population will best respond to EMA.
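As one small piece of such a startup assessment, survey compliance in a pilot sample could be summarized per participant. The sketch below is illustrative only: the participant identifiers, the responses, the function names, and the 80% threshold are all assumptions made for the example, not recommendations from the literature.

```python
def compliance_rate(responses):
    """Fraction of scheduled prompts answered; None marks a missed prompt."""
    if not responses:
        raise ValueError("no scheduled prompts")
    answered = sum(1 for r in responses if r is not None)
    return answered / len(responses)


def flag_low_compliance(schedule, threshold=0.8):
    """Return sorted participant IDs whose compliance falls below threshold.

    The default threshold of 0.8 is an arbitrary value chosen for illustration.
    """
    return sorted(
        pid for pid, responses in schedule.items()
        if compliance_rate(responses) < threshold
    )


# Hypothetical pilot data: one mood rating per scheduled prompt.
pilot = {
    "p01": [3, 4, None, 2, 3, 3, None, 4, 2, 3],
    "p02": [None, None, 5, None, 4, None, None, 3, None, None],
}

rates = {pid: compliance_rate(r) for pid, r in pilot.items()}
flagged = flag_low_compliance(pilot)
print(rates, flagged)
```

Repeating this summary at several candidate sampling frequencies would give a rough sense of where added prompts stop yielding added data in the target population.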
Although the FDA has made its expectations for PRO measures clear [
The conceptualization of disorders based on Diagnostic and Statistical Manual of Mental Disorders/International Classification of Diseases criteria has been called into question and may eventually be replaced altogether by Research Domain Criteria [
EMA may not be ideal for detecting rare events, especially if they occur infrequently relative to the sampling frequency (ie, as the sampling frequency decreases so too does the probability of capturing
EMA should not be mistaken for a panacea so long as p-hacking, publication bias, and alpha inflation continue to affect the integrity of clinical research. Any scale used to evaluate the efficacy of an intervention in large industry-sponsored clinical trials must be uniform and well-validated. Thus, to create a standard efficacy measure for a given psychiatric disorder, we first must form a consensus about the types of items that should be included in the EMA scales, the frequency and duration of assessments, and the types of analytical approaches that will be used to interpret the data. The FDA would be unlikely to accept an EMA-based primary outcome measure over existing efficacy end point measures without standardization across multiple field trials in different populations. These data should then clearly establish test-retest reliability, external validity, and other parameters necessary to validate an EMA scale.
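To make one of these parameters concrete, test-retest reliability is often summarized as the correlation between scores from two administrations of the same measure under comparable conditions. The following is a minimal sketch using a Pearson correlation on hypothetical per-participant mean EMA scores; the data and the function name are invented for illustration, and other indices (eg, an intraclass correlation coefficient) may be preferred in a formal validation.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical mean EMA mood scores for 4 participants in two sampling weeks.
week_1 = [2, 4, 6, 8]
week_2 = [3, 4, 7, 8]

retest_r = pearson_r(week_1, week_2)
print(f"test-retest r = {retest_r:.3f}")
```

Because EMA is intended to capture symptoms that fluctuate, a low correlation across weeks could reflect true change rather than unreliability, which is one reason the assessment interval would need to be part of any consensus.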
Moving from clinician-administered rating scales toward real-time patient-reported measures such as EMA offers significant advantages across medical settings. In clinical research studies, EMA may reduce placebo response and increase intervention-placebo separation. EMA also offers an obvious advantage over clinician-administered rating scales in inpatient and community settings given that time, cost, and staff pressures make use of the latter measure impractical. In community and inpatient settings, EMA can be used to identify individual factors leading to relapse, provide a more accurate picture of how a patient has been doing between clinical visits, and link real-world functional outcome measures over time (eg, rates of rehospitalization, days lost because of disability, and likelihood of self-harm) to
Overall, we believe that the continued use of clinician-administered retrospective self-report assessments in clinical trials contributes significantly to observed treatment failures and squanders innovative potential. As we have described, the instruments currently being used are too broad to adequately assess outcomes, suffer from poor interrater reliability, make inappropriate assumptions about how the illness being studied behaves, and rely on patient recall despite a sizeable body of research that cautions against this. EMA instruments may play an increasingly important role in addressing the disparity between the need for and investment in novel mental health treatments. Self-report assessment via EMA addresses the limitations of traditional assessment methods but has not yet made its way into large multisite clinical trials sponsored by the industry. Although the FDA’s recent efforts to advance mobile technology in clinical trials [
CNS: central nervous system
EMA: ecological momentary assessment
EMI: ecological momentary intervention
FDA: Food and Drug Administration
HAM-D: Hamilton Rating Scale for Depression
HAM-D-17: 17-item Hamilton Rating Scale for Depression
HAM-D-24: 24-item Hamilton Rating Scale for Depression
MADRS: Montgomery-Asberg Depression Rating Scale
MBSR: mindfulness-based stress reduction
NNT: number needed to treat
PRO: patient-reported outcome
None declared.