Test-retest reliability of an upper-extremity discomfort questionnaire in an industrial population

RA. Test-retest reliability of an upper-extremity discomfort questionnaire in an industrial population. Scand J Work Environ Health 1997;23(4):299-307.
Objectives Efforts to understand or to monitor upper-extremity musculoskeletal disorders among workers have usually involved the use of questionnaires. The goal of this study was to assess the test-retest reliability of an upper-extremity discomfort questionnaire among industrial workers.
Methods Test-retest agreement among 148 workers was analyzed using the kappa coefficient for categorical outcomes. Values of kappa greater than 0.75 are considered excellent, values between 0.40 and 0.75 are fair to good, and values of less than 0.40 represent poor agreement beyond chance alone. Test-retest results of continuous measures (eg, visual analogue scale responses) were compared with paired t-tests.
Results The test-retest reliability of the questionnaire used to elicit demographic information, medical history, exercise participation, and information on musculoskeletal symptoms among industrial workers appears to be good to excellent in most instances.
Conclusions These results suggest that most results of this discomfort questionnaire are reliable and suitable for use in epidemiologic studies. For reassurance of the robustness of these findings, similar studies should be carried out in other worker populations with this, and other, questionnaire instruments.


Subject recruitment
The participants in this study were recruited from among workers at a unionized plant manufacturing spark plug and engine components in the midwestern United States. The plant was selected as part of a larger ongoing investigation of the relationship between workplace ergonomic exposures and upper-extremity musculoskeletal disorders. For the purposes of this ongoing investigation, certain jobs in the plant were selected on the basis of the frequency of repetitive hand movements ["low", "medium" and "high" (7)]. All the workers in selected jobs with at least 6 months' tenure in such jobs and employed on the day or afternoon shifts were invited to participate in the medical survey. For logistical reasons, only workers with "medium" repetition jobs on the nightshift were invited to participate.
There were 263 subjects eligible for, and invited to participate in, the first round of this study, and 202 (77%) were ultimately included. Eighteen of the round 1 participants were employed on the nightshift; for logistical reasons these 18 persons were not solicited to participate in round 2. One hundred and forty-eight (80%) of the remaining 184 round 1 participants completed questionnaires in round 2. Table 1 displays the demographic characteristics for the 184 subjects who completed round 1 and were eligible for round 2, and table 2 shows job-related hand repetition categories, seniority, and symptom prevalences for the same group.
Table 1. Demographic characteristics of the study group.
The study participants provided written informed consent which had been approved by the Human Subjects Review Committee of the University of Michigan School of Public Health. No personally identifiable results were provided to the company or union. Each participant was sent a confidential summary of their personal medical survey test results, an interpretation of the results, and recommendations for medical follow-up, if indicated. All letters to subjects were sent after completion of the data collection in round 2 of the study.

Survey procedures
This study included 2 rounds of data collection from the same cohort of workers. The round 1 medical survey procedures included a self-administered questionnaire, a physical examination focused on the upper extremities, ulnar and median sensory nerve conduction studies in both wrists, and measurement of anthropometric factors. The examiners were masked to the data collected by other members of the study team. The physical examination procedures, electrodiagnostic tests, and anthropometric factors have been described elsewhere (8); these results were not used in the present analyses.
The self-administered questionnaire focused on demographic information, prior medical conditions, current health status, and symptoms potentially related to upper-extremity musculoskeletal disorders. A variety of medical conditions potentially related to such disorders was listed: diabetes, rheumatoid arthritis, thyroid dysfunction, ruptured cervical disk, carpal tunnel syndrome, ulnar neuropathy, other peripheral neuropathy, tendinitis, sprains, broken bones in the upper extremities, thoracic outlet syndrome, gout, rotator cuff injury, ganglion, degenerative arthritis, use of medication, or history of any surgery involving the neck or upper extremities.
The subjects were instructed to report a symptom if it had been present in at least 3 separate episodes, or 1 episode had lasted more than 1 week, in the 12 months preceding the survey. The survey queried subjects about 9 symptoms (burning, stiffness, pain, cramping, tightness, aching, soreness, tingling and numbness) in each of 15 body locations (neck and right or left of each of the following: shoulder, upper arm, elbow, forearm, wrist, hand and fingers). For purposes of the analyses, if one or more symptoms were checked for a body location, then the finding was "positive" (eg, stiffness and pain in the neck); if none of the symptoms was checked for a location, then the finding was "negative". In essence, checking 1 symptom or 9 symptoms resulted in a "positive" response for a body location. If any symptoms were present, then the subjects were asked to rate the frequency of discomfort, the impact of such discomfort on work standard or quality, whether he or she had sought a job change related to the symptoms, and whether medical treatment had been sought. The questionnaire also instructed the subjects to rate the severity of their current discomfort and their worst discomfort in the 30 days preceding the survey, in each of the 3 body regions (neck, shoulders or upper arms; elbows or forearms; and wrists, hands or fingers) using 10-cm visual analogue scales. The left and right verbal anchors for the scales were, respectively, "no discomfort" and "worst discomfort imaginable".
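The "positive" or "negative" rule for a body location can be sketched in a few lines of Python. This is only an illustration of the rule as described above; the function name score_locations, the location labels, and the input layout (a mapping from location to the set of checked symptoms) are assumptions and are not taken from the study's data files.

```python
# The 9 symptoms and 15 body locations named in the questionnaire description.
SYMPTOMS = ("burning", "stiffness", "pain", "cramping", "tightness",
            "aching", "soreness", "tingling", "numbness")
PAIRED_SITES = ("shoulder", "upper arm", "elbow", "forearm",
                "wrist", "hand", "fingers")
LOCATIONS = ("neck",) + tuple(
    f"{side} {site}" for site in PAIRED_SITES for side in ("right", "left")
)


def score_locations(checked):
    """Return {location: True/False}, True meaning a "positive" finding.

    `checked` (hypothetical layout) maps a location name to the collection of
    symptoms the subject ticked there. One or more ticked symptoms makes the
    location positive; no ticked symptoms makes it negative.
    """
    findings = {}
    for location in LOCATIONS:
        marks = set(checked.get(location, ()))
        unknown = marks - set(SYMPTOMS)
        if unknown:
            raise ValueError(f"unrecognized symptoms at {location}: {unknown}")
        findings[location] = len(marks) >= 1
    return findings


# Example: stiffness and pain in the neck, nothing elsewhere.
example = score_locations({"neck": {"stiffness", "pain"}})
```

Under this rule, a subject who checks 1 symptom and a subject who checks all 9 at the same location receive the same "positive" finding.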
The women were asked to complete additional questions related to reproductive hormone status (currently pregnant, current use of birth control pills, history of double oophorectomy, history of hysterectomy, and history of natural menopause).
Because of concern and interest related to the carpal tunnel syndrome, results for symptoms in the wrists, hands, and fingers were grouped into various clinically relevant combinations. In addition to analyzing results for "all symptoms" in the wrists, hands and fingers, symptoms potentially more indicative of carpal tunnel syndrome (numbness, tingling, burning or pain in the wrists, hands, or fingers) were also analyzed separately. The subjects were also asked to complete hand diagrams reflecting these 4 symptoms, which were then scored for the likelihood of carpal tunnel syndrome (9).
The round 1 medical screening examinations were performed by University of Michigan personnel in the plant medical department during each worker's normal work hours; no one was required to arrive early or stay late. In particular, workers were relieved from their job duties for the time required to complete the entire survey protocol (mean: approximately 70 min for the entire protocol, typically about 30 to 40 min to complete the self-administered questionnaire). Note that this time was separate from, and in addition to, regularly scheduled breaks and lunch. All data from round 1 were collected in 3 consecutive days.
The round 2 survey consisted solely of the readministration of a slightly abridged version of the questionnaire. This procedure was completed in 1 day for all subjects 21 to 23 days after the first round of data collection. During the interval between rounds, there were no modifications in job tasks, and no significant changes in work load or production rates. In order to avoid additional disruption to the work process, the workers who participated in round 2 completed the questionnaire during one of their regularly scheduled morning or afternoon breaks, or their lunch break (about 22 min each). In particular, they were not relieved of regular job duties to participate in the second round, but they also did not have to arrive early or stay late. Most participants in round 2 completed the survey in less than 10 min.
The instrument used in round 2 was identical to the questionnaire used in round 1, except for the deletion of all psychosocial questions and some of the demographic questions (eg, gender and race). Since the time available to subjects for completing the questionnaires in round 2 was more limited, it was decided to reduce the number of questions so that the participants would be more likely to finish in the time allotted. The psychosocial questions were largely derived from previously existing survey tools whose reliability had been studied (10, 11, unpublished report: Karasek RA, Pieper C, Schwartz J. Job content questionnaire and user's guide. Revision 1.1. Department of Work Environment, University of Massachusetts-Lowell, Lowell, MA 1985), so it was also felt to be less critical to include these items in round 2.

Statistical analyses
The analyses were performed using STATA for Windows, version 4.0 (12). The tests were considered statistically significant if P ≤ 0.05. Test-retest agreement was analyzed using the kappa coefficient. Where appropriate, weighted kappa coefficients were calculated using quadratic weights included with the program, which corresponded to the intraclass correlation coefficient. Values of kappa greater than 0.75 are considered excellent, values between 0.40 and 0.75 are fair to good, and values of less than 0.40 represent poor agreement beyond chance alone (13). It is known that the value of kappa depends, in part, on the prevalence of "positive" findings: kappa approaches zero when the true prevalence approaches zero or 100% (14). Therefore, to aid the interpretation of the kappa coefficients, the prevalences of "positive" findings are presented for rounds 1 and 2, and these findings are compared using the McNemar χ2 test statistic. The McNemar χ2 test statistic (and corresponding P-value) is shown only for the items which were significantly different between rounds (P < 0.05).
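For readers who wish to reproduce this type of analysis without STATA, the following Python sketch computes an unweighted kappa with an approximate 95% confidence interval and the McNemar χ2 statistic for a yes/no item. It is not the STATA routine used in the study. The function names are hypothetical, the standard error is the simple large-sample approximation sqrt(po(1 - po) / (n(1 - pe)^2)), and the McNemar statistic is computed without a continuity correction; both choices are assumptions.

```python
import math


def two_by_two(pairs):
    """Count the test-retest table from (round1, round2) boolean pairs."""
    a = sum(1 for r1, r2 in pairs if r1 and r2)        # positive in both rounds
    b = sum(1 for r1, r2 in pairs if r1 and not r2)    # positive in round 1 only
    c = sum(1 for r1, r2 in pairs if not r1 and r2)    # positive in round 2 only
    d = len(pairs) - a - b - c                         # negative in both rounds
    return a, b, c, d


def kappa_with_ci(pairs):
    """Return (kappa, lower, upper), or None when chance agreement equals 1."""
    a, b, c, d = two_by_two(pairs)
    n = a + b + c + d
    if n == 0:
        raise ValueError("no paired responses")
    p1, p2 = (a + b) / n, (a + c) / n                  # prevalence in rounds 1 and 2
    po = (a + d) / n                                   # observed agreement
    pe = p1 * p2 + (1 - p1) * (1 - p2)                 # agreement expected by chance
    if math.isclose(pe, 1.0):
        return None                                    # kappa is undefined (0/0)
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))  # assumed approximation
    return kappa, kappa - 1.96 * se, kappa + 1.96 * se


def agreement_band(kappa):
    """Label kappa with the bands quoted in the text (13)."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"


def mcnemar(pairs):
    """Return (chi2, p) for the change in prevalence between rounds (1 df)."""
    _, b, c, _ = two_by_two(pairs)
    if b + c == 0:
        return 0.0, 1.0                                # no discordant pairs
    chi2 = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))                 # upper tail of chi-square, 1 df
    return chi2, p


# Hypothetical data: 40 yes/yes, 10 yes/no, 5 no/yes, 45 no/no.
pairs = ([(True, True)] * 40 + [(True, False)] * 10
         + [(False, True)] * 5 + [(False, False)] * 45)
kappa, lower, upper = kappa_with_ci(pairs)
chi2, p_value = mcnemar(pairs)
```

With these invented counts the prevalences are 50% and 45%, kappa is 0.70 ("fair to good"), and the McNemar test does not indicate a significant change in prevalence between rounds.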

Results
All values of kappa related to prior medical conditions were in the "good" to "excellent" range. (See table 3.) Questions related to female hormone status demonstrated excellent test-retest reliability, except for current pregnancy, for which the kappa coefficient is undefined because the prevalence was 0 in at least 1 round. Table 3 also lists the results for hand dominance, any participation in regular exercise (yes, no), frequency of participation in regular exercise (< 1 time a month, ≥ 1 time a month but < 1 time a week, 1-2 times a week, and ≥ 3 times a week), and level of education (less than high school, high school, and more than high school). The test-retest results for these latter items were all good or excellent. Only "history of surgery involving the neck or upper extremities" demonstrated a significant change in the overall prevalence between rounds 1 and 2 (P1 = 31% and P2 = 23%).
Table 3 footnotes: Kappa computed for women only. When prevalence1 = 0 or prevalence2 = 0, the kappa coefficient is undefined, as indicated for "current pregnancy". Weighted kappa. Level of education = less than high school, high school graduate, more than high school.
The results pertaining to symptoms involving the neck, shoulders, or upper arms are shown in table 4. The overall agreement on presence or absence of symptoms anywhere in the neck, shoulders, or upper arms was excellent (kappa = 0.76, 95% CI 0.60-0.92). Similarly, the test-retest reliability for the presence or absence of symptoms separately in the neck, right shoulder, or left shoulder was excellent or nearly so. The kappa coefficients for upper-arm symptoms were good, but the results were lower than for the neck and shoulders. When the subjects were queried about difficulty in keeping up with production or one's usual standard of quality because of neck, shoulder or upper-arm symptoms, the kappa equaled 0.64 (95% CI 0.48-0.80). A similar question inquiring if the worker had asked to change to a different job in the last year because of such symptoms yielded poor results (kappa = 0.39, 95% CI 0.23-0.55). It should be noted that the prevalences of positive answers to this question were low (P1 = 1% and P2 = 2%); this result may help to explain the poor test-retest reliability. The subjects demonstrated excellent reliability in reporting whether they felt that their symptoms were related to a particular workstation or work activity (kappa = 0.86, 95% CI 0.69-1.00), but reliability was lower when asked whether such symptoms were attributable to an accident or acute injury (kappa = 0.55, 95% CI 0.39-0.71). Questions pertaining to the "worst" body location (5 choices), frequency, and duration of episodes in the "worst" location (5 ordered categories), the occurrence of symptoms in the "worst" area in the last week (yes/no), current symptoms (yes/no), and having sought medical treatment (yes/no) all yielded similar (good) kappa values.
The results pertaining to the elbows and forearms are shown in table 5. The kappa coefficients for symptoms in the elbows and forearms were excellent, or nearly so. The kappa coefficient for the question eliciting symptoms in the week preceding the survey (kappa = 0.68, 95% CI 0.52-0.84) was significantly better than for the question about current symptoms (kappa = 0.46, 95% CI 0.30-0.62).
Table footnotes: Weighted kappa. Duration of episodes in last year = < 1 hour, < 1 day, < 1 week, < 1 month, > 1 month.
As shown in table 6, the kappa coefficients for symptoms in the wrists, hands or fingers were all good. The test-retest agreement was good for the hand diagrams (using weighted kappa coefficients). The questions inquiring about nocturnal occurrence of wrist, hand and finger symptoms yielded near-excellent or excellent test-retest results.
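For ordered items such as the hand diagram scores or the frequency and duration categories, a weighted kappa gives partial credit to near misses. The sketch below uses the quadratic agreement weights w_ij = 1 - ((i - j) / (k - 1))^2, which are the weights usually meant by "quadratic"; whether they match the STATA implementation used in the study exactly is an assumption, and the function name and integer category coding (0 to k - 1) are hypothetical.

```python
import math


def weighted_kappa(pairs, n_categories):
    """Quadratic-weighted kappa for (round1, round2) category indices 0..k-1.

    Returns None when the weighted chance agreement equals 1 (undefined).
    """
    k = n_categories
    if k < 2:
        raise ValueError("need at least 2 ordered categories")
    n = len(pairs)
    if n == 0:
        raise ValueError("no paired responses")

    # Observed joint proportions of round 1 (rows) by round 2 (columns).
    observed = [[0.0] * k for _ in range(k)]
    for i, j in pairs:
        observed[i][j] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Full credit on the diagonal, less credit the further apart the answers.
        return 1.0 - ((i - j) / (k - 1)) ** 2

    po = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    pe = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if math.isclose(pe, 1.0):
        return None
    return (po - pe) / (1.0 - pe)


# Hypothetical 3-category item: one disagreement by 1 step versus by 2 steps.
near_miss = weighted_kappa([(0, 0), (1, 1), (2, 2), (0, 1)], 3)
far_miss = weighted_kappa([(0, 0), (1, 1), (2, 2), (0, 2)], 3)
```

In this invented example the single 1-step disagreement yields a higher weighted kappa than the single 2-step disagreement, which is the behavior the weighting is intended to capture.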

The results of the visual analogue scale scores are summarized in table 7. The subjects rated their current discomfort significantly worse in round 2 than in round 1 for the neck, shoulders or upper arms and the wrists, hands or fingers. However, the 30-day discomfort ratings were indistinguishable for the 2 rounds for all the body regions. Most of the subjects reported no discomfort whatsoever in all the body regions in both rounds, a finding which might tend to overstate the concordance of results between rounds. Hence the analyses for each body region (ie, paired t-tests) were restricted to subjects who reported symptoms in at least 1 round. The results based on this "symptomatic" subset of the study group again demonstrated that "current" discomfort ratings are less stable than 30-day discomfort ratings.
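The restriction to the "symptomatic" subset followed by a paired t-test can be sketched as follows. As an assumption made only for this illustration, a nonzero visual analogue score in either round stands in for "reported symptoms in at least 1 round"; the study may have defined the subset from the symptom checklist instead. The function name and the scores are hypothetical. The sketch returns the t statistic and degrees of freedom, which would then be compared against a t distribution.

```python
import math
from statistics import mean, stdev


def paired_t_symptomatic(round1, round2):
    """Paired t statistic on round 2 minus round 1 scores (cm, 0 to 10).

    Subjects scoring 0 in both rounds are dropped first (assumed stand-in for
    the "symptomatic" subset). Returns (t, degrees_of_freedom).
    """
    if len(round1) != len(round2):
        raise ValueError("rounds must list the same subjects in the same order")
    diffs = [r2 - r1 for r1, r2 in zip(round1, round2) if r1 > 0 or r2 > 0]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least 2 symptomatic subjects")
    sd = stdev(diffs)                       # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for 6 subjects; the first 2 report no discomfort at all.
round1_scores = [0.0, 0.0, 2.0, 3.0, 1.0, 0.0]
round2_scores = [0.0, 0.0, 3.0, 5.0, 1.5, 1.0]
t_stat, dof = paired_t_symptomatic(round1_scores, round2_scores)
```

Dropping the subjects with no discomfort in either round removes the zero differences that would otherwise make the two rounds look more concordant than they are among those with symptoms.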
Table 6. Agreement between rounds 1 and 2 with respect to the wrist, hand, and finger symptoms. (95% CI = 95% confidence interval, NA = not applicable, N/T/B/P = symptoms of numbness, tingling, burning, or pain in the wrist, hand or fingers)
Table 6 footnotes: When the prevalence = 0, the kappa coefficient is undefined, as indicated for "asked for different job". 95% confidence interval for kappa = kappa ± 1.96 (SE). Frequency of episodes in last year = 3-12, 13-36, 37-52, 53-150, > 150. Duration of episodes in last year = < 1 day, < 1 week, < 1 month, ≥ 1 month.
One possible approach to analyzing and presenting data from the questionnaire used in this study would be to present individual symptoms for each body location. There was no consistent pattern of the kappa coefficients for any of the symptoms among the body locations (ie, "soreness" did not consistently have the highest kappa value, and "tingling" did not always have the lowest kappa value).
A concern, particularly when questionnaires are used in epidemiologic studies, is whether, or how, the test-retest reliability of survey results may be related to demographic or ergonomic exposure covariates. The issue is whether any of these covariates might differentially impact on the agreement of responses across trials. As an example, consider the variable "any neck-shoulder-upper arm problems" (table 4). We constructed a new variable as follows: "neck, shoulder or upper-arm problems" = 1 if there was agreement of responses from rounds 1 and 2; "neck, shoulder or upper-arm problems" = 0 if there was disagreement of the responses from rounds 1 and 2. "Neck, shoulder or upper-arm problems" was then used as the dependent variable in a multiple logistic regression model with the following covariates: age, gender, repetition category, level of education (as reported in round 1), and seniority. This model was not significant, and none of the individual odds ratios were either (not shown). This finding suggests that agreement between rounds for "any neck, shoulder or upper-arm problems" was not related significantly to any of the covariates in the logistic model. We constructed all similar symptom response variables and examined each in a multiple logistic regression model using the same covariates (age, gender, repetition category, level of education, and seniority). None of the models were significant, except for 1 isolated finding for gender and age as covariates for right upper-arm symptoms (not shown).
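The construction of the agreement variable can be illustrated with a short Python sketch. It builds the 1/0 indicator described above and, as a purely descriptive stand-in for the multiple logistic regression (which is not reproduced here), tabulates the crude proportion of agreement within levels of a single covariate. The function names and the example data are hypothetical.

```python
from collections import defaultdict


def agreement_indicator(round1, round2):
    """Return 1 where the round 1 and round 2 responses agree, else 0."""
    if len(round1) != len(round2):
        raise ValueError("rounds must list the same subjects in the same order")
    return [1 if r1 == r2 else 0 for r1, r2 in zip(round1, round2)]


def agreement_rate_by_group(indicator, groups):
    """Crude proportion of agreement within each level of one covariate."""
    if len(indicator) != len(groups):
        raise ValueError("one group label is needed per subject")
    totals = defaultdict(int)
    agreed = defaultdict(int)
    for flag, group in zip(indicator, groups):
        totals[group] += 1
        agreed[group] += flag
    return {group: agreed[group] / totals[group] for group in totals}


# Hypothetical responses to "any neck, shoulder or upper-arm problems".
round1_answers = [True, False, True, False]
round2_answers = [True, True, True, False]
indicator = agreement_indicator(round1_answers, round2_answers)
rates = agreement_rate_by_group(indicator, ["F", "F", "M", "M"])
```

The indicator is the dependent variable that would be passed to a logistic regression routine together with age, gender, repetition category, level of education, and seniority; the grouped rates only offer a quick look at one covariate at a time and do not adjust for the others.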

Discussion
The focus of this study was the assessment of the reliability of a discomfort survey employed in field studies of upper-extremity musculoskeletal disorders. "Reliability" in this instance refers to the extent to which a measurement procedure yields the same result on repeated trials. Obviously, with more consistent results obtained on repeated trials, it is implied that the measurement procedure is more reliable.
The approach employed in this study, direct re-testing of the same subjects with essentially the same survey Hence, what appears to be a lack of consistency between trials may actually reflect a true change of the state of a subject, rather than measurement error. One solution to this problem would be to minimize the interval between test administrations, which would reduce the likelihood of any changes in the underlying concepts being measured.
A second issue in test-retest trials is "memory", or the extent to which subjects remember their responses from the initial trial, and simply reproduce them on subsequent trials (15). Hence a short interval between trials may yield inflated estimates of the consistency of responses.
There is no survey procedure which can eliminate the influence of "memory" or a true change of state between trials, but one can attempt to minimize them. We chose to readminister the survey after 3 weeks. During this interval, there were no changes in the production processes, or production rates, and none of the workers changed job titles. Therefore, we believe that these circumstances minimized the likelihood that study participants may have experienced a true change in state with regard to upper-extremity musculoskeletal discomfort. This interval also seems reasonable for reducing the influence of "memory". In essence, there is a trade-off between using a shorter or longer interval between test administrations, and the 3-week interval appears to balance the potential effects of "memory" and change of state.
Another factor that can influence test-retest trials is "reactivity", or the extent to which the measurement process can influence the initial measurement, or future measurements (15). So, for example, the initial measurement may "sensitize" a subject, and such sensitization may influence future responses (eg, subjects may be more or less likely to report musculoskeletal symptoms in the second trial, perhaps related to increased awareness). Sensitization may contribute to over- or underreporting in follow-up trials; it is difficult to predict how a longer or shorter interval between trials influences the impact of sensitization on our results. By presenting, and comparing, the prevalence of positive responses in each trial (P1 and P2), we attempted to gain insight into the degree of "reactivity" or "sensitization". It should also be noted that the circumstances of questionnaire administration differed slightly between rounds 1 and 2. The subjects were on "company time" during round 1, while the study participants were on their personal break time during round 2, and it was observed that the subjects completed the questionnaire much more rapidly in the second round. Overall, few of the comparisons between P1 and P2 differed significantly. This finding suggests that these factors (ie, circumstances of questionnaire administration, "reactivity", or "sensitization") did not have much influence on the observed results. For all the items in which P1 and P2 differed significantly, there was a decline in the prevalence of the item reported (ie, P1 > P2). It is our impression that the circumstances of survey administration which led to more rapid completion of the questionnaires in round 2 probably contributed to these observed findings.
Overall, kappa coefficients for the presence or absence of symptoms were all good or excellent. The results of the discomfort ratings on the 10-cm visual analogue scale were very stable when the subjects were asked to rate their "worst" discomfort in the 30 days preceding the survey. Ratings based on "current" discomfort at the time of the survey appeared to change significantly between rounds; these latter findings may be related to "reactivity" or "sensitization", as previously discussed, or possibly reflect a true "change of state" since the intensity of musculoskeletal symptoms can fluctuate over time. Overall, the results of the visual analogue scale suggest that quantitative discomfort ratings should be "averaged" over some time period, rather than based on current symptoms alone. In the present study, 30 days appeared to work well. The responses concerning presence or absence of symptoms in the week preceding the completion of the survey also yielded good to excellent results. (See tables 4, 5, and 6.) The test-retest agreement of the responses for individual symptoms (ie, numbness, stiffness, etc) demonstrated considerable variability. These findings suggest that questions for eliciting individual symptoms should be combined, as was done in the present instance, rather than analyzed separately.
The logistic regression models indicate that ergonomic and common demographic covariates do not appear to have any impact on the reliability of responses. Thus the consistency of reporting symptoms would appear to be unconfounded by these factors.
Dickinson et al (5) described results of the "repeatability" of the Nordic musculoskeletal questionnaire among 44 subjects (all were cashiers). The percentage of the subjects who completed both rounds, the demographic information on subjects, and other circumstances of administration (eg, was test administration performed during normal workhours) were not stated. The nonidentical answers ranged from 0% to 26%. Kuorinka et al (4) also described results of repeat administrations of the Nordic questionnaire to various groups of workers. In 1 trial using the "general questionnaire" and involving 29 safety engineers, 17 medical secretaries, and 22 railway maintenance workers, the nonidentical answers ranged from 0% to 23%. The interval between test administrations was not stated. Another trial using the "neck-shoulder questionnaire" involved 27 female clerical workers who completed the questionnaire twice in a 3-week interval. The percentages of nonidentical answers ranged from 0% to 30%. A third trial assessed the repeatability of the "low-back questionnaire" among 25 nurses (gender not stated). The interval between test administrations was 15 days. The nonidentical answers ranged from 0% to 25%, with a mean of 4.4%. Aside from being small and providing limited information on the study groups, neither of these studies made adjustments for chance agreement (ie, use of kappa coefficient or similar technique), and so it is not possible to make direct comparisons with results in the present study. On the basis of these studies, it is also not possible to determine the extent to which the test-retest reliability of Nordic survey(s) is better than what would occur by chance alone (13). We are not aware of any other published studies which describe results of the readministration of a discomfort survey to workers.
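The reason that percentages of nonidentical answers cannot be compared directly with kappa can be seen in a small numerical sketch. The counts below are invented for illustration and are not taken from either Nordic questionnaire study or from the present study; the function name is hypothetical.

```python
import math


def nonidentical_and_kappa(a, b, c, d):
    """Return (proportion of nonidentical answers, kappa) for a 2x2 table.

    a = yes/yes, b = yes/no, c = no/yes, d = no/no across the 2 rounds.
    Kappa is returned as None when chance agreement equals 1.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty table")
    nonidentical = (b + c) / n
    po = (a + d) / n                                   # observed agreement
    p1, p2 = (a + b) / n, (a + c) / n                  # prevalence in each round
    pe = p1 * p2 + (1 - p1) * (1 - p2)                 # agreement expected by chance
    if math.isclose(pe, 1.0):
        return nonidentical, None
    return nonidentical, (po - pe) / (1 - pe)


# Hypothetical rare symptom: 6% prevalence in each round among 100 subjects.
share_nonidentical, kappa_value = nonidentical_and_kappa(a=1, b=5, c=5, d=89)
```

In this example only 10% of the answers differ between rounds, yet kappa is about 0.11, which falls in the "poor" range: with a rare item, most of the 90% raw agreement is what chance alone would produce.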

Concluding remarks
The test-retest reliability of the questionnaire used to elicit demographic information, medical history, exercise participation, and musculoskeletal symptom-related information among industrial workers appears to be good to excellent. These results suggest that most of the results of this discomfort questionnaire are reliable and suitable for use in epidemiologic studies. For reassurance of the robustness of these findings, similar studies should be carried out in other worker populations with this, and other, questionnaire instruments.
We thank Joseph Kearns, Wendi Latko, Bryan Nakfoor, Susan Nalepa, Caroline Serna, Jill Sheiman, and Patricia Strasser for their assistance with the medical field studies. We also thank the workers and managers at the plant who made this study possible.
This study was supported by grant number 1 R01 OH02941-02 from the National Institute for Occupational Safety and Health (NIOSH) and also by a gift from the Office Ergonomics Research Committee (OERC). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIOSH or OERC.