Utilizing a Personal Smartphone Custom App to Assess the Patient Health Questionnaire-9 (PHQ-9) Depressive Symptoms in Patients With Major Depressive Disorder

Background: Accurate reporting of patient symptoms is critical for diagnosis and therapeutic monitoring in psychiatry. Smartphones offer an accessible, low-cost means to collect patient symptoms in real time and aid in care.

Objective: To investigate adherence among psychiatric outpatients diagnosed with major depressive disorder in utilizing their personal smartphones to run a custom app to monitor Patient Health Questionnaire-9 (PHQ-9) depression symptoms, as well as to examine the correlation of these scores to traditionally administered (paper-and-pencil) PHQ-9 scores.

Methods: A total of 13 patients with major depressive disorder, referred by their clinicians, received standard outpatient treatment and, in addition, utilized their personal smartphones to run the study app to monitor their symptoms. Subjects downloaded and used the Mindful Moods app on their personal smartphone to complete up to three survey sessions per day, during which a randomized subset of PHQ-9 symptoms of major depressive disorder were assessed on a Likert scale. The study lasted 29 or 30 days without additional follow-up. Outcome measures included adherence, measured by the percentage of completed survey sessions, and estimates of daily PHQ-9 scores collected from the smartphone app, as well as from the traditionally administered PHQ-9.

Results: Overall adherence was 77.78% (903/1161) and varied with time of day. PHQ-9 estimates collected from the app strongly correlated (r=.84) with traditionally administered PHQ-9 scores, but app-collected scores were 3.02 (SD 2.25) points higher on average. More subjects reported suicidal ideation using the app than they did on the traditionally administered PHQ-9.

Conclusions: Patients with major depressive disorder are able to utilize an app on their personal smartphones to self-assess their symptoms of major depressive disorder with high levels of adherence. These app-collected results correlate with the traditionally administered PHQ-9. Scores recorded from the app may potentially be more sensitive and better able to capture suicidality than the traditional PHQ-9.


Survey Questions
In this section, we will detail the questions and assumptions used in our study.
The PHQ-9 asks nine questions with the intent of assessing the severity of depression. These questions are answered on a 0-3 Likert scale, with increasing scores indicating increased severity. Questions presented in the PHQ-9 are phrased retrospectively, asking patients to assess the severity of their symptoms over the last two weeks. The difference between this retrospective perspective and a momentary one is accounted for in the estimation procedures that follow.
The Mindful Moods application adapts the PHQ-9 for Ecological Momentary Assessment (EMA). For each of the nine retrospective questions asked in the PHQ-9, Mindful Moods asks an analogous question that is phrased to assess the severity of symptoms in the present or recent past. Furthermore, in order to make Mindful Moods more engaging, each question has two phrasings with opposite valence, and one phrasing is selected at random each time the question is asked. The two phrasings are assumed to be interchangeable after transforming scores with opposite valence to match that of the original PHQ-9. Specifically, a response $Y$ to the phrasing with the same valence as its analogous PHQ-9 question is used as given, and a response $Y$ to the oppositely phrased version is assumed to be comparable after it is transformed as $3 - Y$. Table 1 provides both the original phrasing of the PHQ-9 (as well as the retrospective preamble) and the two versions of each question posed by the Mindful Moods application.
Original PHQ-9: Over the past 2 weeks, how often have you been bothered by any of the following problems?
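The valence transformation described above can be sketched in a few lines. This is a minimal illustration, not the app's actual code; the function name `to_phq9_valence` and the `same_valence` flag are hypothetical.

```python
MAX_SCORE = 3  # PHQ-9 items are scored on a 0-3 scale


def to_phq9_valence(score: int, same_valence: bool) -> int:
    """Map a Mindful Moods response onto the direction of the PHQ-9 item.

    `same_valence` is a hypothetical flag: True when the phrasing shown
    has the same valence as the analogous PHQ-9 question.
    """
    if not 0 <= score <= MAX_SCORE:
        raise ValueError("score must be between 0 and 3")
    # Oppositely phrased items are reversed as 3 - Y; others pass through.
    return score if same_valence else MAX_SCORE - score
```

For example, a response of 0 to an oppositely phrased item is treated as a 3 on the PHQ-9 scale, and a response of 2 as a 1.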

Estimation
The PHQ-9 asks all nine of its questions at one time, and the administrator requires the subject to give a response to each question. In contrast, the Mindful Moods application is a continual ecological momentary assessment, and patients might not answer all nine questions if they were asked repeatedly over time.
To this end, the application asks only a subset of its questions throughout the day, with the goal of achieving high response rates over time. However, because only a subset of the questions are asked at a given time, the complete momentary PHQ-9 score requires estimation.
Let these responses to application questions be notated $Y_{tq} \in \{0, \dots, 3\}$, where $q = 1, \dots, 9$ indexes the question and $t = 1, \dots, T$ represents the day the question was asked. Thus, if every question was asked and answered, a daily PHQ-9 score is exactly $P_t = \sum_{q=1}^{9} Y_{tq}$. In our scheme, 3 sets of questions were asked throughout the day, and each set independently sampled 3 questions without replacement from $1, \dots, 9$; the app therefore measures $Y_{t a_{tq}}$, where each $a_{tq} \in \{1, \dots, 9\}$. Thus, on most days not every question will be asked, and some will be asked more than once. We wish to estimate $P_t$ as $\hat{P}_t$ with this information.
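To make the consequence of this sampling scheme concrete, the sketch below draws one day's question sets and computes how many distinct questions are expected to be asked per day. It assumes that each set is drawn uniformly at random and that every session is completed; the function names are hypothetical.

```python
import random
from fractions import Fraction

N_ITEMS = 9        # PHQ-9 question items
SETS_PER_DAY = 3   # survey sessions per day
ITEMS_PER_SET = 3  # questions sampled per session


def sample_day(rng: random.Random) -> list[list[int]]:
    """Draw one day's sessions: each samples 3 of 9 items without replacement."""
    items = range(1, N_ITEMS + 1)
    return [rng.sample(items, ITEMS_PER_SET) for _ in range(SETS_PER_DAY)]


def expected_distinct_items() -> Fraction:
    """Expected number of distinct items asked in a day (uniform sampling assumed)."""
    # Probability that one particular item is left out of a single session.
    p_missed_one_set = Fraction(N_ITEMS - ITEMS_PER_SET, N_ITEMS)
    # Sessions are independent, so the item is missed all day with this probability.
    p_never_asked = p_missed_one_set ** SETS_PER_DAY
    return N_ITEMS * (1 - p_never_asked)
```

Under these assumptions the expected number of distinct questions per day is 9(1 − (2/3)^3) = 19/3, or about 6.3, which is why a complete daily score has to be estimated rather than summed directly.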
If our estimate of a patient's daily PHQ-9 score were a naive sum of the responses to the questions we asked, the estimate would be biased: a patient might reliably score higher on some questions than others, so the estimate could reflect either how they feel or which questions we happened to ask. Furthermore, we must account for missing data, which might be common with EMA questionnaires. An estimate is required that accounts for this heterogeneity in the data.
Our approach is to predict a subject's likely responses to each PHQ-9 question for that day, and to replace these predictions with the actual responses for the questions that were asked and answered. Because PHQ-9 questions are phrased retrospectively and Mindful Moods questions are asked momentarily, we assume that the best prediction for a PHQ-9 question response $Y_{tq}$ is $\tilde{Y}_{tq}$, the average of the Mindful Moods responses given for the same question item over the last two weeks. This leads to an a priori prediction for each day's PHQ-9 score, $\tilde{P}_t = \sum_{q=1}^{9} \tilde{Y}_{tq}$. We then estimate the patient's PHQ-9 score as this predicted score, plus the total difference between the day's measured responses and their predictions:

$$\hat{P}_t = \tilde{P}_t + \sum_{q} \left( Y_{t a_{tq}} - \tilde{Y}_{t a_{tq}} \right),$$

where the sum runs over the questions asked and answered on day $t$. In the special case that all 9 questions are asked and answered daily, $a_{tq} = q$, so $\hat{P}_t = P_t$. This also acts as an imputation for questions or days missing at random.
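The prediction-plus-correction estimate can be sketched as follows. This is an illustration under stated assumptions, not the study's code. Responses are assumed to be on the PHQ-9 valence already, the two-week window is assumed to be selected by the caller, and every item is assumed to have at least one response in that window. The names `predict_items` and `estimate_daily_score` are hypothetical.

```python
from statistics import mean

Response = tuple[int, int]  # (question item 1-9, score 0-3)


def predict_items(history: list[Response]) -> dict[int, float]:
    """Average the recent responses for each item (the per-item prediction)."""
    by_item: dict[int, list[int]] = {}
    for item, score in history:
        by_item.setdefault(item, []).append(score)
    return {item: mean(scores) for item, scores in by_item.items()}


def estimate_daily_score(predictions: dict[int, float],
                         todays_responses: list[Response]) -> float:
    """Predicted total plus the differences observed on the items answered today."""
    predicted_total = sum(predictions.values())
    # Each answered item contributes its difference from the prediction.
    correction = sum(score - predictions[item]
                     for item, score in todays_responses)
    return predicted_total + correction
```

If each of the nine items is answered exactly once on a given day, the predictions cancel and the estimate reduces to the plain sum of that day's responses. If nothing is answered, the estimate falls back to the predicted total.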

Inference
The severity of depression symptoms varies over time, so a momentary PHQ-9 score is expected to vary over time as well. This variance can be used to assess how unusual a subject's daily PHQ-9 score is when compared to variation in previous scores. We can capture this variation in estimated daily PHQ-9 scores by using the variation in recent responses to Mindful Moods question items. We begin by estimating the variance in question responses $Y_{tq}$ as $\hat{\sigma}^2_{tq}$, the maximum likelihood variance in responses to that item over the last two weeks. Assuming the responses to questions are independent and that the predicted responses $\tilde{Y}_{tq}$ are approximately fixed, the variance in $\hat{P}_t$ may therefore be estimated as the sum of the variances of the given responses, $\sum_{q} \hat{\sigma}^2_{t a_{tq}}$. Assuming estimated PHQ-9 scores $\hat{P}_t$ are approximately normal, we may proceed to construct level $100(1 - \alpha)\%$ confidence intervals and level $\alpha$ tests for the daily PHQ-9 score. Thus the daily score falls outside this region with estimated minimum probability $\alpha$. We traditionally consider this an error rate. However, in our context, $\alpha$ represents the approximate proportion of unusual daily PHQ-9 scores.
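A minimal sketch of this variance calculation and the resulting normal interval is given below. It assumes the population (divide-by-n) variance as the maximum likelihood variance and a two-sided interval. Which score the interval is centered on is left to the caller, since that choice is an assumption here. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, pvariance

Response = tuple[int, int]  # (question item 1-9, score 0-3)


def item_variances(history: list[Response]) -> dict[int, float]:
    """Per-item maximum likelihood (divide-by-n) variance of recent responses."""
    by_item: dict[int, list[int]] = {}
    for item, score in history:
        by_item.setdefault(item, []).append(score)
    return {item: pvariance(scores) for item, scores in by_item.items()}


def score_interval(center: float, variances: dict[int, float],
                   asked_items: list[int],
                   alpha: float = 0.05) -> tuple[float, float]:
    """Two-sided level 100(1 - alpha)% normal interval around `center`."""
    # Independence assumption: variances of the answered items add up.
    total_variance = sum(variances[item] for item in asked_items)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(total_variance)
    return center - half_width, center + half_width
```

An item asked twice in a day appears twice in `asked_items`, so its variance is counted twice, in line with summing over the given responses.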

Results
Many of the considerations detailed in previous sections can be summarized in a single plot for each patient. For example, PHQ-9 predictions $\tilde{P}_t$ and estimates $\hat{P}_t$ are defined for each day of the study, with the most information being used after the first two weeks of gathering data. Confidence intervals can also be constructed for each day. These results can be compared to naïve averages of daily question responses. A summary of this information is shown for Patient 1 in our dataset in Figure 1. The vertical gray bar indicates the two-week marker, after which a full two weeks of prior responses is available for the estimates and confidence intervals. The plot displays significant scores for the question assessing suicidal tendencies (when $Y_{t9} = 2$ or $3$), shown with red diamonds. For comparison, the plot also shows complete paper PHQ-9 assessments in green, which were administered at days 1 and 30.

Figure 1: Daily PHQ-9 predictions, confidence intervals, and estimates for Patient 1. Naïve daily averages are included for comparison. Significant suicidal thoughts are shown in red, and paper PHQ-9 scores are shown in green. (Legend entries: Estimate, Suicide > 1, Prediction, Naive, Paper.)