Assessing the Equivalence of Paper, Mobile Phone, and Tablet Survey Responses at a Community Mental Health Center Using Equivalent Halves of a ‘Gold-Standard’ Depression Item Bank

Background: The computerized administration of self-report psychiatric diagnostic and outcomes assessments has risen in popularity. If results are similar enough across different administration modalities, then new administration technologies can be used interchangeably and the choice of technology can be based on other factors, such as convenience in the study design. An assessment based on item response theory (IRT), such as the Patient-Reported Outcomes Measurement Information System (PROMIS) depression item bank, offers new possibilities for assessing the effect of technology choice upon results.

Objective: To create equivalent halves of the PROMIS depression item bank and to use these halves to compare survey responses and user satisfaction among administration modalities—paper, mobile phone, or tablet—in a community mental health care population.

Methods: The 28 PROMIS depression items were divided into 2 halves based on content and simulations with an established PROMIS response data set. A total of 129 participants were recruited from an outpatient public sector mental health clinic based in Memphis. All participants took both nonoverlapping halves of the PROMIS IRT-based depression items (Part A and Part B): once using paper and pencil, and once using either a mobile phone or tablet. An 8-cell randomization was done on technology used, order of technologies used, and order of PROMIS Parts A and B. Both Parts A and B were administered as fixed-length assessments and both were scored using published PROMIS IRT parameters and algorithms.

Results: All 129 participants received either Part A or B via paper assessment. Participants were also administered the opposite assessment, 63 using a mobile phone and 66 using a tablet. There was no significant difference in item response scores for Part A versus B. All 3 of the technologies yielded essentially identical assessment results and equivalent satisfaction levels.

Conclusions: Our findings show that the PROMIS depression assessment can be divided into 2 equivalent halves, with the potential to simplify future experimental methodologies. Among community mental health care recipients, the PROMIS items function similarly whether administered via paper, tablet, or mobile phone. User satisfaction across modalities was also similar. Because paper, tablet, and mobile phone administrations yielded similar results, the choice of technology should be based on factors such as convenience and can even be changed during a study without adversely affecting the comparability of results.


Executive Summary
Vector Psychometric Group, LLC (VPG) was contracted by Telesage, Inc. to conduct an analysis of provided item response theory (IRT) scale scores for depression to determine the predictive effect, if any, that mode of item administration (paper, tablet, smartphone) and form had on the final scores. A sample of 129 unique subjects was provided, in which each subject completed two visits. Each subject completed a paper administration at one visit and an electronic administration (either on a tablet or smartphone) during the other; forms and mode of administration were based on random assignment. Analysis of variance (ANOVA) models allowing for subject-specific random intercepts were fit to the data as the primary analyses. Results indicated that all scores were equivalent across the forms and modes of administration; that is, neither form, mode, nor their interaction was a statistically significant predictor of the provided depression IRT scale scores. Additional analyses (i.e., t-tests) using simpler models were also conducted to aid in the dissemination of results.

Methods
Data were provided by Telesage for 129 individuals, each of whom had completed two administrations of depression items from the PROMIS item bank. Available demographic characteristics of the sample are presented in Table 1.
In terms of research design, one data collection instance was administered on paper for all subjects; the other administration was via an electronic data collection (EDC) medium, either on a tablet or a smartphone. Additionally, across the two visits, subjects were presented with both Form A and Form B, which contained non-overlapping subsets of 14 items each from the previously validated 28-item PROMIS depression item bank. It is VPG's understanding that the order of form presentation and administration mode were randomly assigned at the subjects' first visit. Electronic device (tablet or smartphone), when warranted by randomization to EDC, was also randomly assigned.
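The abstract describes this design as an 8-cell randomization over EDC device, administration order, and form order. The full design space can be enumerated as a cross product; the cell labels below are illustrative stand-ins, not Telesage's actual randomization table:

```python
from itertools import product

# Illustrative factor labels (hypothetical names, not from the study protocol)
edc_devices = ["smartphone", "tablet"]          # which EDC device, if randomized to EDC
modality_orders = ["paper first", "EDC first"]  # order of the two administrations
form_orders = ["Form A first", "Form B first"]  # order of the non-overlapping forms

# 2 x 2 x 2 = 8 randomization cells
cells = list(product(edc_devices, modality_orders, form_orders))
for i, cell in enumerate(cells, start=1):
    print(f"cell {i}: {cell}")
```

Randomizing subjects evenly across these 8 cells counterbalances device, order, and form effects against one another.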
Data were analyzed using mixed effects models with a random intercept, which allowed subjects to vary in their depression levels; subjects were specified as repeated in the dataset so that the model could account for within-subject dependencies across visits. Visit order was not available, so change over time could not be modeled explicitly. Fixed effects predictors included modality (paper, smartphone, tablet), form (Form A or B), and the interaction between modality and form. Analyses were run twice, once using all available subjects and a second time omitting observations that had been flagged by Telesage as questionable due to data irregularities.
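A random-intercept model of this kind can be sketched with statsmodels' `MixedLM`. The actual Telesage scores are not reproduced here, so the snippet simulates a structurally similar dataset (129 subjects, one paper and one EDC visit each, Forms A and B counterbalanced, no true mode or form effect); all data-generating values are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_subjects = 129

# Simulate the two-visit design: paper at one visit, tablet or smartphone at the other
rows = []
for s in range(n_subjects):
    device = rng.choice(["tablet", "smartphone"])
    modes = ["paper", device]
    forms = ["A", "B"]
    if rng.random() < 0.5:   # randomize administration order
        modes.reverse()
    if rng.random() < 0.5:   # randomize form order
        forms.reverse()
    theta = rng.normal()     # subject-specific depression level (random intercept)
    for mode, frm in zip(modes, forms):
        score = theta + rng.normal(scale=0.3)  # no true modality/form effect
        rows.append({"subject": s, "modality": mode, "form": frm, "score": score})

df = pd.DataFrame(rows)

# Fixed effects for modality, form, and their interaction;
# random intercepts grouped by subject handle within-subject dependence
model = smf.mixedlm("score ~ C(modality) * C(form)", data=df, groups=df["subject"])
fit = model.fit()
print(fit.summary())
```

With no simulated effect, the modality and form coefficients should be near zero, mirroring the null results reported below.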
Further, to provide more easily disseminated methods and results, several t-tests were conducted. Each of these analyses, being limited to comparing only two groups at a time, fails to model some of the dependencies in the data, but the presentation of group means, SDs, and t-tests is intended for descriptive purposes and an intuitive understanding of the trends seen in the data, rather than as definitive analyses. Specifically, an independent samples t-test was conducted between the smartphone and tablet groups, and paired (repeated measures) t-tests were conducted comparing the group means of Form A and Form B, as well as paper versus smartphone scores and paper versus tablet scores. Because subjects were randomized to one of the two EDC devices, the available sample sizes for these analyses were approximately half of the total sample.
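The two kinds of t-tests described above can be sketched with `scipy.stats`. The score vectors below are simulated placeholders with illustrative means and SDs, not the study's reported values; only the group sizes (63 smartphone, 66 tablet, 129 total) come from the report:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Independent samples: smartphone vs. tablet are different subjects
phone = rng.normal(50, 10, size=63)    # smartphone subgroup (simulated)
tablet = rng.normal(50, 10, size=66)   # tablet subgroup (simulated)
t_ind, p_ind = stats.ttest_ind(phone, tablet)

# Paired samples: every subject has both a Form A and a Form B score
form_a = rng.normal(50, 10, size=129)
form_b = form_a + rng.normal(0, 3, size=129)  # correlated within subject
t_rel, p_rel = stats.ttest_rel(form_a, form_b)

print(f"smartphone vs tablet (independent): t = {t_ind:.2f}, p = {p_ind:.3f}")
print(f"Form A vs Form B (paired):          t = {t_rel:.2f}, p = {p_rel:.3f}")
```

The paired test uses each subject as their own control, which is why it, rather than an independent test, is appropriate for the within-subject form and paper-versus-EDC comparisons.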

Results
For the full data analyses, the first model included modality, form, and the modality-by-form interaction as predictors. Results showed a statistically non-significant interaction, indicating that the difference between forms did not depend on modality, F(2,125) = 0.44, p = .64. For parsimony, the non-significant modality-by-form interaction was dropped and a second, main-effects-only model was estimated using modality and form as predictors. Results from this main effects model demonstrated that there was not a statistically significant effect of either form, F(1,126) = 0.06, p = .81, or modality, F(2,126) = 0.16, p = .85, on the provided IRT scale scores for depression.
Similar results were obtained for the model using only data from unflagged observations, which resulted in the removal of 12 subjects for a reduced N of 117. Initial model results showed a statistically non-significant interaction, indicating that the difference between forms did not depend on modality, F(2,113) = 0.39, p = .68. For parsimony, the non-significant modality-by-form interaction was dropped and a second model was estimated with only modality and form as main effect predictors. Results from this model again demonstrated that there were no statistically significant effects of either form, F(1,114) = 0.15, p = .70, or modality, F(2,114) = 0.23, p = .79, on depression scores.
The dataset was also analyzed using t-tests to provide what VPG considers a more accessible analysis for non-statistically oriented reviewers. The group means, SDs, and associated t-test values for the previously described comparisons are presented in Tables 3 and 4. As can be seen, and confirming the general findings of the previously reported ANOVAs, no statistically significant differences were found among any of the modality comparisons or across forms.

Conclusions
Overall, analyses supported the contention that the forms were equivalent and that mode of administration played little role in influencing the IRT-based scale scores for depression. The primary analyses that should be referenced in supporting this conclusion are the reported F-test values from the ANOVAs; the reported t-test analyses provide an incomplete accounting of the dependencies among the data points and were provided primarily for descriptive purposes (Dunlop, Cortina, Vaslow, & Burke, 1996).