This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.
Efficient screening questionnaires are useful in general practice. Computerized adaptive testing (CAT) is a method to improve the efficiency of questionnaires, as only the items that are particularly informative for a certain responder are dynamically selected.
The objective of this study was to test whether CAT could improve the efficiency of the Four-Dimensional Symptom Questionnaire (4DSQ), a frequently used self-report questionnaire designed to assess common psychosocial problems in general practice.
A simulation study was conducted using a sample of Dutch patients visiting a general practitioner (GP) with psychological problems (n=379). Responders completed a paper-and-pencil version of the 50-item 4DSQ and a psychometric evaluation was performed to check if the data agreed with item response theory (IRT) assumptions. Next, a CAT simulation was performed for each of the four 4DSQ scales (distress, depression, anxiety, and somatization), based on the given responses as if they had been collected through CAT. The following two stopping rules were applied for the administration of items: (1) stop once the standard error of the trait estimate falls below a predefined level, or (2) stop once half of the items of the subscale have been administered.
In general, the items of each of the four scales agreed with IRT assumptions. Application of the first stopping rule reduced the length of the questionnaire by 38% (from 50 to 31 items on average). When the second stopping rule was also applied, the total number of items could be reduced by 56% (from 50 to 22 items on average).
CAT seems useful for improving the efficiency of the 4DSQ by 56% without losing a considerable amount of measurement precision. The CAT version of the 4DSQ may be useful as part of an online assessment to investigate the severity of mental health problems of patients visiting a GP. This simulation study is the first step needed for the development of a CAT version of the 4DSQ. A CAT version of the 4DSQ could be of high value for Dutch GPs since increasing numbers of patients with mental health problems are visiting general practices. In further research, the results of a real-time CAT should be compared with the results of the administration of the full scale.
General practitioners (GPs) are often the first point of contact for persons with mental health problems, and they make important decisions about treatment and referrals. However, GPs vary in their ability to detect mental problems in patients during consultations [
Using a short, good quality screener to distinguish between mild psychological symptoms and severe disorders has become of particular importance for Dutch GPs, as they have been restricted to refer only patients with a Diagnostic and Statistical Manual of Mental Disorders 4th edition (DSM-IV) disorder [
The Four-Dimensional Symptom Questionnaire (4DSQ;
Computerized adaptive testing (CAT) is a method to reduce the patient burden of traditional questionnaires, by letting a computer dynamically select only the items that give new information about the patient. Based on a patient’s answer to a single first item, the responder’s underlying trait (eg, level of depression) is estimated. Next, an automated algorithm selects the item that is most appropriate or informative for this responder. The benefit of using CAT is the reduction in items without a loss in reliability or precision in measurement [
CAT relies on item response theory (IRT) [
The aims of this simulation study were (1) to investigate if responses of a clinical sample to a paper-and-pencil version of the 4DSQ meet the psychometric requirements needed for IRT; and (2) to determine if a simulated adaptive version of the 4DSQ would yield inferences similar to those based on the full version of the 4DSQ. This simulation study is the first step necessary for the development of a CAT version of the 4DSQ.
We used data collected in the baseline measurement of a study evaluating triage decisions in general practice. All patients with mental health problems visiting a GP working in a primary care center in the northern part of the Netherlands between January 1 and December 31, 2014 were included in the study (N=408). All included participants provided informed consent. Participants filled in the Dutch paper-and-pen version of the 4DSQ and only patients with complete data were included in the analyses (92.9%, 379/408). As a result, our final sample consisted of 379 participants with a mean age of 44.8 years (SD 16.5, range 16 to 87). Of the participants, 66.8% (253/379) were female. No significant differences in age (
Since all four of the 4DSQ scales are used and interpreted separately, we performed the psychometric evaluation and our analyses for each of the four scales separately. We followed the five steps described in the analysis plan used for the PROMIS study, which was aimed at improving patient-reported outcome instruments [
Descriptive statistics were calculated for each single item (
Within IRT, data have to agree with three basic assumptions: unidimensionality, local independence, and monotonicity [
Unidimensionality means that a person’s response to an item is accounted for by his or her level on the underlying trait and not by any other factor. A confirmatory factor analysis (CFA) with ordinal data was performed to study unidimensionality for each scale. The model’s fit was assessed using four frequently used fit indices: comparative fit index (CFI) greater than 0.95 for good fit, root mean square error of approximation (RMSEA) less than 0.06 for good fit, Tucker-Lewis index (TLI) greater than 0.95 for good fit, and standardized root mean square residual (SRMR) less than 0.08 for good fit.
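The four cut-offs above can be written down as a small check. The following is a minimal sketch in Python, not the analysis code used in the study; the function name `check_fit` and the example values are hypothetical and serve only to illustrate how the criteria are applied.

```python
# Cut-offs for "good fit" as stated in the text above.
GOOD_FIT = {
    "CFI": (">", 0.95),
    "TLI": (">", 0.95),
    "RMSEA": ("<", 0.06),
    "SRMR": ("<", 0.08),
}


def check_fit(indices):
    """Return {index name: True/False} for each supplied fit index."""
    result = {}
    for name, value in indices.items():
        direction, cutoff = GOOD_FIT[name]
        result[name] = value > cutoff if direction == ">" else value < cutoff
    return result


# Hypothetical values, for illustration only (not taken from the study).
example = {"CFI": 0.97, "TLI": 0.96, "RMSEA": 0.08, "SRMR": 0.05}
print(check_fit(example))
```

A scale with these illustrative values would meet three of the four criteria and miss the RMSEA criterion.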
Local independence means that there should be no significant association among item responses, except for the association controlled for by the underlying trait. This assumption was checked by inspecting residual correlations between item pairs within the CFA. Items with high residual correlations (greater than 0.2) were considered as possibly locally dependent.
The assumption of monotonicity means that the probability of endorsing a higher response category of an item should increase with the level of the trait. This assumption was studied by plotting trace lines. In addition, we studied the scalability coefficients of the items (a coefficient greater than 0.3 indicates monotonicity).
Within IRT, several models are commonly used; however, because of the ordered-response categories of the 4DSQ, a graded response model (GRM) was preferred for our data [
An item displays differential item functioning (DIF) if persons with different characteristics (eg, males and females) respond differently to an item, despite equivalent levels of the underlying trait [
The GRM parameter estimates from Step 3 were used for a CAT simulation. As no information on a subject is available before the first item is administered, θ is initially set at 0. After the first item is answered, the choice of the next item is based on the GRM parameters of all potential next items in relation to the response to the first item. Each next item is selected using the maximum Fisher information method, that is, the item that is most informative at the current estimate of θ is administered next. The CAT selects new items until a pre-defined stopping rule is reached. A stopping rule is based on either a maximum number of items administered or on a pre-specified level of measurement precision [
We combined the following two stopping rules: (1) stop when the standard error of the trait is similar to the standard error of the full-length scale, or (2) stop when half the number of items of the full scale has been administered. We compared CAT outcomes with the first stopping rule only and with both stopping rules. Regarding the first stopping rule, we inspected varying levels of standard error (from 0.2 to 0.8). The pre-defined standard error of theta that corresponded with the standard error of the full scale was used as a reference point. Correlations were calculated between trait levels based on CAT and on the scores from the full version of the 4DSQ. We added a second stopping rule because questionnaires in mental health often are most informative for patients with relatively high levels of clinical outcomes [
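The simulation logic described above (start at θ=0, pick the most informative remaining item, re-estimate θ, and stop on either rule) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the software used in the study. The item parameters and responses are hypothetical, and the use of an expected a posteriori (EAP) estimate with a standard normal prior is an assumption, as the estimation method is not specified here.

```python
import numpy as np

GRID = np.linspace(-4, 4, 161)  # quadrature grid for theta


def category_probs(theta, a, b):
    """GRM category probabilities; rows = theta values, columns = categories."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(b)[None, :])))
    cum = np.hstack([np.ones((theta.size, 1)), p_star, np.zeros((theta.size, 1))])
    return cum[:, :-1] - cum[:, 1:]


def item_information(theta, a, b):
    """Fisher information of one GRM item at a single theta value."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))
    cum = np.concatenate([[1.0], p_star, [0.0]])
    slope = a * cum * (1.0 - cum)  # derivative of each cumulative curve
    probs = cum[:-1] - cum[1:]
    d_probs = slope[:-1] - slope[1:]
    return float(np.sum(d_probs ** 2 / np.maximum(probs, 1e-12)))


def eap(administered, responses, items):
    """Posterior mean and SD of theta given the responses so far."""
    log_post = -0.5 * GRID ** 2  # standard normal prior (assumption)
    for j in administered:
        a, b = items[j]
        p = category_probs(GRID, a, b)[:, responses[j]]
        log_post = log_post + np.log(np.maximum(p, 1e-12))
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    mean = float(np.sum(w * GRID))
    sd = float(np.sqrt(np.sum(w * (GRID - mean) ** 2)))
    return mean, sd


def simulate_cat(responses, items, se_target, max_items):
    """Replay stored responses as if they had been collected adaptively."""
    theta, se, administered = 0.0, np.inf, []
    # Rule 2: at most max_items; rule 1: stop once SE drops below se_target.
    while len(administered) < max_items and se >= se_target:
        remaining = [j for j in range(len(items)) if j not in administered]
        if not remaining:
            break
        best = max(remaining, key=lambda j: item_information(theta, *items[j]))
        administered.append(best)
        theta, se = eap(administered, responses, items)
    return theta, se, administered


# Hypothetical parameters (a, [b1, b2]) for a 6-item scale; not the study's estimates.
items = [(2.5, [0.2, 1.0]), (2.0, [0.0, 0.9]), (1.8, [0.5, 1.4]),
         (2.2, [-0.3, 0.6]), (1.5, [0.8, 1.8]), (2.8, [0.4, 1.1])]
responses = [1, 2, 0, 2, 0, 1]  # one hypothetical responder's stored answers

theta, se, used = simulate_cat(responses, items, se_target=0.4,
                               max_items=len(items) // 2)
print(round(theta, 2), round(se, 2), used)
```

Setting `max_items` to the full scale length corresponds to applying the first stopping rule only; setting it to half the scale length adds the second rule.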
The descriptive statistics and the estimation of the GRM parameters were done in STATA 14.0. The CFA model was estimated using the lavaan package in R [
The sample’s mean total score on the 4DSQ distress scale was 18.6 (SE 0.43, range 0-32, median 20), with an overall Cronbach alpha of .92. The mean depression score was 3.4 (SE 0.20, range 0-12, median 2), with a Cronbach alpha of .90. The mean score for anxiety was 5.5 (SE 0.27, range 0-23, median 4), with a Cronbach alpha of .87. Finally, for the somatization scale, the sample scored 11.6 on average (SE 0.35, range 0-32, median 11), with a Cronbach alpha of .85. These results were comparable to other studies [
Regarding the first assumption, unidimensionality, we concluded that the items of the anxiety scale showed a good model fit for all four fit indices of the CFA. The items of the distress and depression scales showed a good fit for three of the four indices, but not for RMSEA, although they nearly did. For good fit, RMSEA should be lower than 0.06, but it was 0.08 (distress) and 0.07 (depression). The items of the somatization scale showed good fit for two out of four indices, but not for RMSEA (0.07 instead of less than 0.06) and TLI (0.94 instead of greater than 0.95).
Regarding the second assumption, out of 321 item pairs within the four scales (equation 1), two item pairs with a residual correlation above 0.2 were observed, indicating possible local dependence. They were items 20 and 39 (sleep-related), and items 47 and 48 (trauma-related), all from the distress scale.
321=(½)(6)(5) + (½)(16)(15) + (½)(12)(11) + (½)(16)(15) (1)
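The count in equation 1 is the sum of n(n-1)/2 over the four scales. A short Python check, using the scale lengths given in the paper (the dictionary name `scale_sizes` is only illustrative):

```python
from math import comb

# Number of items per scale, as given in the paper.
scale_sizes = {"depression": 6, "distress": 16, "anxiety": 12, "somatization": 16}

# Each scale with n items has n * (n - 1) / 2 distinct item pairs.
pairs_per_scale = {scale: comb(n, 2) for scale, n in scale_sizes.items()}
total_pairs = sum(pairs_per_scale.values())
print(pairs_per_scale, total_pairs)
```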
The scalability coefficient of all items was higher than 0.3, indicating that all items met the third assumption of monotonicity.
The parameter estimates of the GRM for all items of the four scales are shown in
It was found that 43 items showed CRCs as expected. Five items on the anxiety scale (40, 42, 43, 49, and 50) and two items on the somatization scale (5 and 14) did not show CRCs as expected. For those items, the probability to answer “sometimes” was always lower than the probability for one of the other responses, regardless of the trait level.
As an example,
Category response curves of items 33 and 35 of the Four-Dimensional Symptom Questionnaire depression scale. The probability (y-axis) represents the chance of a certain response (0=never; 1=sometimes; 2=regularly, often, very often, or constantly) given a certain level of theta. Theta (x-axis) represents the underlying trait level; in this figure, depression. The abbreviation Pr is probability.
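To make the curves in the figure concrete, the following Python sketch computes category probabilities for a three-category item under a graded response model. It assumes the usual logistic form of the GRM; the discrimination `a` and thresholds `b` used here are hypothetical values, not the estimates for items 33 or 35.

```python
import numpy as np


def grm_category_probs(theta, a, b):
    """Category probabilities under a graded response model.

    theta: trait level(s); a: discrimination; b: increasing thresholds.
    Returns an array of shape (len(theta), len(b) + 1).
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # P(X >= k | theta) for k = 1..K, padded with P(X >= 0) = 1 and P(X >= K + 1) = 0.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    cum = np.hstack([np.ones((theta.size, 1)), p_star, np.zeros((theta.size, 1))])
    return cum[:, :-1] - cum[:, 1:]


# Hypothetical parameters for one item with responses 0, 1, and 2.
a, b = 2.0, [0.0, 1.0]
probs = grm_category_probs([-2.0, 0.5, 3.0], a, b)
print(probs.round(3))
```

Each row gives the probabilities of responses 0, 1, and 2 at one trait level. When the two thresholds lie very close together, the middle category is never the most probable response at any trait level, which corresponds to the pattern reported for the items that did not show the expected curves.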
For the depression, anxiety, and somatization subscales, no items showed DIF. The only item that showed significant and relevant uniform and non-uniform DIF was item 41 (“I quickly get emotional”) from the distress scale for the covariate gender.
The characteristics of the simulated CAT under different levels of measurement precision (allowing the standard error of the estimated underlying trait to gradually increase; stopping rule 1) are shown in
Mean number of items administered under varying levels of measurement precision and correlations between computerized adaptive testing scores and full version scores of the Four-Dimensional Symptom Questionnaire.
Stopping rule | Distress: number of items, mean (SD) | Distress: correlation^{a} | Depression: number of items, mean (SD) | Depression: correlation^{a} | Anxiety: number of items, mean (SD) | Anxiety: correlation^{a} | Somatization: number of items, mean (SD) | Somatization: correlation^{a} |
None | 16 | 1.00 | 6 | 1.00 | 12 | 1.00 | 16 | 1.00 |
SE^{b} (θ)<0.2 | 15.7 (0.8) | 1 | 5.7 (0.9)^{c} | 1^{c} | 12 (0) | 1 | 16 (0) | 1 |
SE (θ)<0.3 | 8.8 (4.5) | 0.98 | 5.4 (1.2) | 0.99 | 8.7 (4.3)^{c} | 0.97^{c} | 14 (0) | 0.97 |
SE (θ)<0.4 | 6.3 (4.3)^{c} | 0.96^{c} | 5.0 (1.3) | 0.99 | 8.3 (4.3) | 0.97 | 12.9 (2.1)^{c} | 0.95^{c} |
SE (θ)<0.5 | 4.9 (3.8) | 0.92 | 4.9 (1.4) | 0.99 | 8.1 (4.4) | 0.97 | 11.2 (4.9) | 0.95 |
SE (θ)<0.6 | 4.1 (2.6) | 0.86 | 4.6 (1.4) | 0.99 | 5.9 (4.2) | 0.94 | 7.5 (4.6) | 0.86 |
SE (θ)<0.7 | 3.8 (2.5) | 0.84 | 3.9 (1.3) | 0.97 | 5.9 (4.1) | 0.94 | 4.6 (3.4) | 0.73 |
SE (θ)<0.8 | 3.7 (2.3) | 0.79 | 3.9 (1.3) | 0.97 | 5.6 (4.0) | 0.93 | 4.6 (3.4) | 0.73 |
^{a}Correlation between CAT θ and complete test θ.
^{b}SE: standard error.
^{c}The standard error of theta (θ) is equal to the standard error of the full version scale.
The results of combining the first stopping rule with the second stopping rule are shown in
Mean number of items administered and correlation with total estimated theta under one or two stopping rules.
Stopping rule | Distress: number of items, mean (SD) | Distress: correlation^{a} | Depression: number of items, mean (SD) | Depression: correlation^{a} | Anxiety: number of items, mean (SD) | Anxiety: correlation^{a} | Somatization: number of items, mean (SD) | Somatization: correlation^{a} |
None | 16 | 1.00 | 6 | 1.00 | 12 | 1.00 | 16 | 1.00 |
SE^{b} (θ) = SE (full) | 6.3 (4.3) | 0.96 | 5.4 (1.2) | 0.99 | 8.7 (4.3) | 0.97 | 12.9 (2.1) | 0.95 |
Maximum items^{c} | 5.0 (2.1) | 0.79 | 3.0 (0) | 0.96 | 4.9 (1.4) | 0.92 | 7.9 (0.3) | 0.92 |
^{a}Correlation between CAT θ and complete test θ.
^{b}SE: standard error.
^{c}Maximum items are determined by dividing the number of items by 2.
In summary, when applying CAT to the 4DSQ with both stopping rules for the anxiety, depression, and somatization subscales and with the first stopping rule only for the distress subscale, the total number of items on the 4DSQ could be reduced by 56% on average (from 50 to 22 items), without losing a considerable amount of measurement precision.
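The reported reduction can be retraced from the table of mean items per scale. The Python sketch below follows one reading of that table that reproduces the reported figure: the distress value under the precision rule only, and the maximum-items values for the other three scales. The variable names are illustrative.

```python
full_length = 50

# Mean items per scale: distress under the precision rule only;
# the other three scales under the maximum-items rule.
mean_items = {"distress": 6.3, "depression": 3.0, "anxiety": 4.9, "somatization": 7.9}

cat_length = round(sum(mean_items.values()))  # about 22.1, rounded to 22
reduction = (full_length - cat_length) / full_length
print(cat_length, f"{reduction:.0%}")
```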
Our simulation study showed that CAT may increase the efficiency of the 4DSQ and could reduce responders’ burden by more than 50%. These results were also found in other CAT studies, such as on the
Some CATs to measure anxiety and depression have already been used and evaluated in clinical (specialist) care [
A CAT version of the 4DSQ seems especially useful in general practices, for example, as part of a broad online assessment to investigate the severity of psychological problems of patients. As the number of patients visiting their GP with mental health problems is increasing [
However, some obstacles to the successful implementation of a CAT version of the 4DSQ in general practice exist. First, current information and communication technology (ICT) possibilities in general practices are insufficient for the implementation of CAT, which requires sophisticated statistical software. Second, it is not clear to what extent GPs are willing to implement a CAT version of the 4DSQ. GPs may use responses from individual 4DSQ items, such as item 47 or 48 on traumatic events, for a quick clinical evaluation, and this information may be lost when applying CAT. Lastly, it is not clear if CAT is appropriate for all patients. Previous research on CAT after inpatient rehabilitation suggests that it might only be feasible to collect (complete) data for a specific subset of patients [
As this was a simulation study, we used responses to a paper-and-pencil version of the 4DSQ. In reality, responders might behave differently when receiving a computerized adaptive assessment. For example, we do not know if the actual computer administration might influence responses or what effect differences in the item order may have. However, a previous study showed that differences between results from a simulation CAT and a real CAT were small [
Regarding the psychometric evaluation, our data showed some weaknesses. For most items of the four subscales of the 4DSQ, the assumptions for an IRT analysis were met. The assumption of unidimensionality was not met perfectly for all four scales, although it nearly was. Moreover, some items showed other limitations, such as correlations between item pairs or differential item functioning. These items might be left out in future (real-time) CAT versions of the 4DSQ. As in other studies, we found relevant DIF for the item “emotionality” on the distress scale. Women tend to more easily agree with this item compared to men, even when they have a similar underlying level of distress. When looking at the individual responses to the CAT of the distress scale, the item “emotionality” was only administered to participants with a very low level of distress. This indicates that the DIF on this item does not bias the CAT outcomes, as this item is not informative enough to be included in the final CAT. When looking at the distribution and the CRC of some items of the anxiety and somatization scales, participants either endorse option 0 or option 1 to 2. Patients apparently have difficulties differentiating between response categories 1 and 2. This might be solved in future studies by grouping response options 1 and 2 for certain items, making them dichotomous.
Data from this simulation study in general agreed with assumptions needed for CAT. CAT seems useful for improving the efficiency of the 4DSQ by 56%, without losing a considerable amount of measurement precision. Of course, this simulation study is only the first step towards a CAT version of the 4DSQ that could be implemented in clinical practice and it should be followed by a study on a real-time CAT and eventually by an evaluation of the developed CAT version in a clinical setting.
English version of the Four-Dimensional Symptom Questionnaire.
Descriptive statistics of items of the Four-Dimensional Symptom Questionnaire.
Graded response model parameter estimates of the Four-Dimensional Symptom Questionnaire.
4DSQ: Four-Dimensional Symptom Questionnaire
CAT: computerized adaptive testing
CRC: category response curve
CFA: confirmatory factor analysis
DIF: differential item functioning
GP: general practitioner
GRM: graded response model
IRT: item response theory
RMSEA: root mean square error of approximation
We would like to thank Marjolein Jansen and Thomas de Kok for their contributions to data collection.
TM, DB, and PFV designed the study. TM and DB analyzed the data. All authors contributed to and approved the final manuscript.
BT is the copyright owner of the 4DSQ and receives copyright fees from companies that use the 4DSQ on a commercial basis (the 4DSQ is freely available for non-commercial use in health care and research). BT received fees from various institutions for workshops on the application of the 4DSQ in primary care settings.