The repeatability and validity of questionnaires assessing occupational physical activity – a systematic review

The repeatability and validity of questionnaires assessing occupational physical activity – a systematic review. Scand J Work Environ Health. 2011;37(1):6–29.
Objectives This study aims to review systematically the repeatability and validity of questionnaires used to assess occupational physical activity among healthy adults.
Methods We searched PubMed and Embase using occupational, work-related, job-related, physical activity, motor activity, and questionnaires as keywords. Two reviewers independently performed article selection, data extraction, and quality assessment. The methodological quality and results of the studies were evaluated based on an existing checklist. The level of evidence and repeatability, criterion, and construct validity were rated.
Results We included 31 papers describing 30 questionnaires in the review. Repeatability was assessed in 22 studies, of which 11 used appropriate measures to assess 12 questionnaires. Intra-class correlation coefficients and weighted Cohen’s kappa ranged between 0.43 and 0.95. Six studies used appropriate measures to assess criterion validity of 13 questionnaires. One questionnaire, the Tecumseh Self-Administered Occupational Physical Activity Questionnaire (TOQ), showed good criterion validity against a physical activity (PA) record. Eighteen studies used appropriate measures to assess the construct validity of 23 questionnaires. Comparisons included those against accelerometers, maximal oxygen uptake, questionnaires, and body composition measures. None showed good construct validity.
Conclusions There is strong evidence for good reliability of four questionnaires. None of the reviewed questionnaires showed good criterion validity compared to objective measures. Compared to PA records, moderate-to-good validity was observed for two questionnaires. Objective measures of occupational PA are needed.

Regular physical activity (PA) has been shown to provide a variety of health benefits, including a reduction in the risk of morbidity, such as cardiovascular disease, diabetes, high blood pressure, and obesity, as well as a reduction in the risk of premature mortality (1). The results of recent studies have led to the suggestion that the health benefits of PA might differ for different domains of PA (2)(3)(4). The adult population spends most time in the work domain (5). The PA performed in this domain is referred to as occupational or work-related PA and includes all PA done as part of a job (6). Little is known, however, regarding the health effects of occupational PA, as few studies to date have adequately examined the contribution of occupational PA when studying the health benefits of PA (7). Available data provide conflicting information; while some studies observe protective effects of occupational PA against, for example, cardiovascular disease (3,8,9), others show no or negative associations (4,10,11). Moreover, recent studies have shown contrasting cardiovascular effects of PA performed in different domains, such as during leisure time and work (3,9,11,12).
In order to draw any conclusions regarding the amount of occupational PA and the influence of occupational PA on health, it is essential to have a reliable and valid measurement instrument. Several methods are available for assessing PA, for example accelerometers, pedometers, observations, and questionnaires (13), with the latter frequently used in surveys or studies. There are numerous different questionnaires that assess occupational PA, some of which have been tested on repeatability and/or validity. To date, an overview of the measurement properties of the questionnaires assessing occupational PA is, however, lacking. As the contradictory findings with respect to the relation between occupational PA and health could be due to unreliable and invalid questionnaires, a review of the repeatability and validity of questionnaires measuring occupational PA is needed. The purpose of our study was to conduct a systematic review of published evidence on the repeatability and validity of questionnaires used to assess occupational PA among healthy adults.

Literature search
In March 2009, we searched for relevant peer-reviewed English-language papers in the PubMed electronic bibliographic database (complete database until 28 February 2009). Subsequently, additional unique papers were searched in Embase following a screening of the reference lists of retrieved articles. The following full search strategy was used in PubMed: [(occupational OR "work related" OR "job related") AND ("physical activity"[tiab] OR "motor activity"[mesh]) AND (questionnaire[mesh] OR questionnaire*[tiab]) AND (English[lang])]. In Embase, the Emtree terms "physical activity" AND occupation AND questionnaire AND [English]/lim were used.

Eligibility criteria
We screened all hits for possible inclusion based on the title and abstract. The following inclusion criteria were used: (i) the study was a validation and/or repeatability study of one or more questionnaires measuring occupational PA, which included the validation and/or repeatability of occupational PA questions; (ii) the questionnaire could be used to measure occupational PA in the general population; (iii) information on (at least one of) the measurement properties of the questionnaire should be provided; (iv) the article should be published in the English language; and (v) the study was published before March 2009. Occupational PA was defined as a type of physical activity performed that is related to energy expended during work. Studies that were performed among a specific population, such as patients or pregnant women, were excluded, as were studies that measured occupational PA in relation to specific disorders and/or symptoms (eg, back pain). Finally, we excluded studies that lacked sufficient information on the protocol used to examine the validity and/or repeatability of the questionnaire.

Data extraction
Two reviewers independently performed data extraction and quality assessment. A description of the questionnaires and the protocols used in the studies was extracted from the included papers using a standardized data extraction form. Data extracted included: (i) sample characteristics (ie, sample size, age, gender, employment status); (ii) information on the protocol used [ie, methods, time interval between test and re-test, reference method, type of administration (self, interview)]; (iii) description of the questionnaire studied [ie, unit of measurement (energy expenditure, work index), number of occupational PA questions, scoring protocol]; (iv) statistical information (tests performed, package used); and (v) results of repeatability and validity.

Quality assessment of the studies and results
A slightly modified version of the checklist developed by de Vries and colleagues (14,15) was used to assess the methodological quality and results of the studies included [see appendix (table A)]. For the assessment of the methodological quality of the study, information was extracted and evaluated regarding the study design (sample characteristics, protocol, measurements, and statistical analyses). All items were scored 0, 0.5 or 1 point and summed per study. Accordingly, repeatability, criterion, and construct validity were rated depending on the results of the study (see below).

Repeatability
Repeatability concerns the degree to which repeated measurements among stable persons (test-retest) provide similar answers (16). Intra-class correlation coefficients (ICC) and weighted Cohen's kappa (Kw) were considered appropriate methods to quantify repeatability with regard to continuous (17) and ordinal measures (18), respectively. An ICC or Kw ≥0.70 was rated positively (+), an ICC or Kw of 0.40-0.70 was rated as moderate (±), and an ICC or Kw <0.40 was scored negatively (-).
One point was given when a study assessed repeatability and an additional point assigned if the ICC or Kw was used. These points were added to the study design score. On the basis of the total score, three levels of evidence were formulated: strong evidence (≥4.0 points), moderate evidence (2.0-4.0 points) and poor evidence (<2.0 points).
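The rating rules above can be sketched as two small functions. This is a minimal illustration, not the reviewers' own tooling: the function names are hypothetical, and because the published bands touch at 0.70 and at 4.0 points, the sketch assumes the boundary value belongs to the higher category, as the "≥" signs in the text suggest.

```python
def rate_repeatability(coefficient: float) -> str:
    """Map an ICC or weighted kappa to the review's +, ± or - rating.

    Assumption: a value of exactly 0.70 counts as positive and a value of
    exactly 0.40 counts as moderate.
    """
    if coefficient >= 0.70:
        return "+"
    if coefficient >= 0.40:
        return "±"
    return "-"


def evidence_level(total_score: float) -> str:
    """Map a study's total score (design score plus method points) to a level.

    Assumption: a score of exactly 4.0 is strong and exactly 2.0 is moderate.
    """
    if total_score >= 4.0:
        return "strong"
    if total_score >= 2.0:
        return "moderate"
    return "poor"


# Example with values of the kind reported in the Results section.
print(rate_repeatability(0.69), evidence_level(5.2))
```

Under these assumptions an ICC of 0.69 falls in the moderate band, while an ICC of 0.73 would be rated positively.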

Criterion validity
Criterion validity refers to the extent to which scores on a particular instrument relate to a gold standard (ie, an instrument that measures the same construct) (18). Comparisons with accelerometers, when limited to occupational time, were considered appropriate methods for objective criterion validity. Comparisons with PA records, diaries, and logbooks measuring occupational PA were considered appropriate methods of subjective criterion validity. Correlation coefficients >0.75 were scored positively (+), correlations of 0.50-0.75 moderately (±), and correlations <0.50 negatively (-) (14,16). The cut-off of 0.75 was used because a correlation above this value indicates that the occupational measure and the criterion measure share >50% (~56%) of their variance (16).
One point was given when a study assessed criterion validity and an additional point if sensitivity, specificity, Pearson's product moment, or Spearman's rank correlation coefficients were used; 0.5 points were given if a Bland Altman plot was used. These points were added to the study design score. On the basis of the total score, three levels of evidence were formulated: strong evidence (≥4.0 points), moderate evidence (2.0-4.0 points) and poor evidence (<2.0 points).
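The shared-variance figure quoted for the criterion cut-off follows from squaring the correlation coefficient. A short worked version, using only the thresholds stated above:

```latex
\[
r = 0.75 \;\Rightarrow\; r^{2} = 0.75^{2} = 0.5625 \approx 56\%,
\qquad
r = 0.50 \;\Rightarrow\; r^{2} = 0.50^{2} = 0.25 = 25\%.
\]
```

A correlation at the lower bound of the moderate band therefore corresponds to only a quarter of the variance being shared between the questionnaire and the criterion measure.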

Construct validity
Construct validity refers to the extent to which scores on a particular instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured (17,19). Methods that can be used to measure the same or similar aspects of occupational PA, or aspects that are related to occupational PA, were considered appropriate for the assessment of construct validity (eg, accelerometers, doubly labeled water, fitness tests, and body composition measurements). A positive score (+) was given if the correlation coefficient was >0.60, a moderate score (±) if the correlation was 0.30-0.60, and a negative score (-) for correlations <0.30 (14,16). Correlation coefficients >0.60 indicate that the occupational measure and the comparison measure share more than 36% of their variance.
One point was given when a study assessed construct validity and an additional point allocated if the Pearson's product moment, Spearman's rank correlation coefficients, t-test, Mann-Whitney U-test or chi-square test was used; 0.5 points were given if a Bland Altman plot was used. These points were added to the study design score. On the basis of the total score, three levels of evidence were formulated: strong evidence (≥4.0 points), moderate evidence (2.0-4.0 points) and poor evidence (<2.0 points).
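The criterion and construct ratings use the same three-band scheme with different cut-offs, which can be sketched as one function taking a pair of thresholds. The names below are hypothetical, and the sketch assumes the thresholds are applied to the correlation as reported; the text does not state how inverse correlations that were expected to be negative were handled.

```python
# Cut-offs as (upper, lower): r above `upper` is positive,
# r from `lower` up to and including `upper` is moderate.
CRITERION = (0.75, 0.50)
CONSTRUCT = (0.60, 0.30)


def rate_validity(r: float, thresholds: tuple[float, float]) -> str:
    """Return "+", "±" or "-" for a correlation coefficient."""
    upper, lower = thresholds
    if r > upper:
        return "+"
    if r >= lower:
        return "±"
    return "-"


def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures correlated at r."""
    return r * r


# Example: the same coefficient is rated differently under the two schemes.
print(rate_validity(0.72, CRITERION), rate_validity(0.72, CONSTRUCT))
```

Keeping the thresholds as data rather than hard-coding them makes it easy to re-rate the tabulated correlations with different cut-off points.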

Results
The literature search resulted in 962 hits, of which 55 were selected on the basis of relevant titles and/or abstracts (figure 1). Of the full-text articles, 24 were excluded after reading the article; the main reasons for exclusion were: (i) lack of information on occupational PA measures and/or results, (ii) a focus on physical inactivity, or (iii) execution in a specific sample (eg, pregnant women). Finally, 31 papers (20-50) describing 30 questionnaires were included in the review (see table B in the Appendix). Two studies assessed the accuracy of several questionnaires simultaneously (20,33). Of the included questionnaires, the Kuopio Ischemic Heart Disease Occupational Physical Activity Interview (KIHD-O), Occupational Physical Activity Questionnaire (OPAQ), Saltin & Grimby Lifetime Occupational Activity (SGLOA), Saltin & Grimby Present Occupational Activity (SGPOA), and Tecumseh Self-Administered Occupational Physical Activity Questionnaire (TOQ) assessed "only" occupational PA. The remaining questionnaires measured diverse domains of PA, including occupational PA. A full description of the included questionnaires is provided in table 1.

Repeatability
In total, 22 studies assessed the repeatability of 26 questionnaires, of which 11 studies used appropriate measures to quantify the repeatability of 12 questionnaires (table 2). The repeatability of each questionnaire was assessed in only one study. The level of evidence was strong for all studies (mean 5.2±0.5, range 4.5-6.0). Six of the studies were conducted in mixed-gender samples (27,29,43,44,47,48), three among males (34,35,39) and two among females (22,38). The average age of the study populations ranged from ~31 to 65 years, and the sample sizes varied from 39 to 399 individuals (average N=132). The time intervals between the test and re-test varied from one week to one year. ICC and Kw ranged from 0.43 to 0.95 [...] (39) and the SMC-PAQ among females, but only for historical PA at ages 15, 30 and 50 years (ICC 0.73-0.75) and in some of the body mass index (BMI) and age subgroups (38). Moderate repeatability was observed for two questionnaires: the PYTPAQ [except among those with high levels of PA (ICC=0.78), for which it showed good repeatability (29)] and the KIHD-O among males (ICC=0.69) (35).

Validity
Criterion validity. Six studies used appropriate measures to assess the criterion validity of, in total, 13 questionnaires. In two studies, criterion validity was assessed by validation against accelerometer data limited to occupational time only (41,43) (table 3), with both studies providing strong levels of evidence (mean 5.3±0.4, range 5.0-5.5). One of these studies was conducted in a mixed-gender sample (N=166) (43), the other in an all-male sample (N=41) (41). The average age of the study populations was around 39 years for both studies. Observed correlations (r) ranged from -0.20 to 0.50.
The objective criterion validity of the work index of the Baecke and the TCQ was assessed among males, and both showed poor validity against an accelerometer (41). The TCQ did, however, show moderate objective criterion validity with regard to energy expenditure in the same sample of males, but only when validated against the sum of work counts of the accelerometer (r=0.50) (41). Poor criterion validity was observed for the OPAQ items "sitting", "walking", and "heavy labor" (all in hours per week) when compared against "light", "moderate occupational PA", and "heavy occupational PA" (all in hours per week), respectively, assessed with an accelerometer in a mixed-gender sample (43).
In six studies (20,22,31,37,41,43), subjective criterion validity was assessed by validation against a PA record, diary, or logbook. Three of the studies were conducted among a mixed-gender sample (20,31,43), two in all-male samples (37,41), and one among a female sample (22). Except for the studies conducted among males only, all provided strong levels of evidence for their findings (mean 5.5±0.6, range 4-6). The average age of the study populations varied between ~37 and 63 years, and the study samples ranged from 41 to 166 individuals. Correlations ranged from -0.05 to 0.92.
The subjective criterion validity of the work index/activity score was assessed for the following questionnaires: Atherosclerosis Risk In Community Study/Baecke (ARIC/Baecke), Coronary Artery Risk Development in Young Adults-Physical Activity Questionnaire (CARDIA-PAQ), Health Insurance Plan (HIP), KPAS, Lipid Research Clinics Physical Activity Questionnaire (LRC), and Minnesota Heart Health Program (MHHP). None showed good criterion validity. The KPAS showed moderate criterion validity among females but only for two items [sitting (r=0.58) and walking (r=0.50)] when validated against similar categories of the PA record (22).
With regard to energy expenditure, the subjective criterion validity was assessed for the Coronary Artery Risk Development in Young Adults-Seven Day Recall (CARDIA-SDR), IPAQ-L, PAQ, and TOQ. The work score of the TOQ (activity units per week) showed good criterion validity against a PA record (r=0.92) in a mixed-gender sample, as did its item "sitting at work" (activity units per week; r=0.77) (20). Moderate criterion validity was observed in mixed-gender samples for the IPAQ-L "work activity" (MET hours per day) against a PA logbook (r=0.64) (31) and for the TOQ "work" (MET; r=0.52) and "standing at work" (activity units per week; r=0.57) against similar activities of the PA record (20).
The subjective criterion validity of duration of activity was assessed for the CARDIA-SDR, OPAQ, TOQ, and TCQ. None showed good criterion validity overall. Individual items of the OPAQ and TOQ showed good or moderate criterion validity against similar activities of a PA record in mixed-gender samples: TOQ "sitting" (hours per week) showed good criterion validity (r=0.82), while TOQ "standing" (hours per week; r=0.57) (20) and OPAQ "walking" (hours per week; r=0.74) (43) showed moderate criterion validity. The TCQ "time at work" (hours per week) showed moderate criterion validity against an activity log (r=0.58) among males (41).
Construct validity. The construct validity of the work index/activity score was assessed for the ARIC/Baecke, Baecke, CARDIA-PAQ, European Prospective Investigation into Cancer (EPIC), HIP, HUNT-2, KPAS, LRC, MHHP, SGLOA, and SGPOA. Overall, none showed good construct validity. Four questionnaires showed moderate construct validity for their work index/activity score: (i) the EPIC against an accelerometer in a mixed-gender sample (MET hours per week; r=0.37) (26); (ii) the Baecke against doubly labeled water (DLW) among males [average daily metabolic rate (ADMR) and physical activity level (PAL): r=0.37 and r=0.52, respectively] (40); (iii) the HUNT-2 against several accelerometer measures among males (r=-0.45 to 0.48) (34); and (iv) the SGLOA against several comparison measures among females (50). The KPAS item "compared to others" showed moderate construct validity among females against a PA record (r=0.41) (22). Moderate-to-poor validity was found among females for the SGPOA (50). The work index/activity score of the ARIC/Baecke, HIP, LRC, and MHHP showed moderate construct validity against some of their comparison methods, but overall showed poor construct validity (20).
The construct validity of energy expenditure was assessed for the adapted IPAQ long version (A-IPAQ-L), CARDIA-SDR, EPAQ2, KIHD-O, IPAQ-L, Modifiable Activity Questionnaire (MAQ), MOSPA-Q, Questionnaire d'Activité Physique Saint-Etienne (QAPSE), Sub-Saharan Africa Activity Questionnaire (SSAAQ), and TOQ. Overall, none of the questionnaires showed good construct validity. The SSAAQ work (MET per day) showed good construct validity, but only among a subsample of urban females when validated against a heart rate monitor (r=0.72); in other subsamples it showed moderate construct validity (46). Among females, moderate construct validity was observed for the MAQ "work" (MET hours per week) when validated against accelerometer total activity (counts per day; r=0.43), but poor validity against accelerometer sedentary time (hours per week; r=-0.19) (32).

Discussion
This is the first systematic review of studies assessing the measurement properties of occupational PA questionnaires, in which both the results and methodological quality of the included studies have been taken into account. In general, the quality of the studies was not very high, mostly as a result of the use of inadequate measures, such as the use of Pearson and/or Spearman correlation coefficients instead of ICC or Kw as repeatability measures and the use of inadequate reference measures for validation. Moreover, our results show that few questionnaires were tested in more than one study, or tested for both repeatability and validity.
The quality of the studies assessing repeatability was in general poor (50% calculated Pearson or Spearman correlation coefficients instead of ICC or Kw); moreover, the repeatability of only six questionnaires was assessed in a mixed-gender sample. Of these, the work index of the BRFSS showed good repeatability, as did energy expenditure and duration of activity of the IPAQ-L and MOSPA-Q, and duration of activity of several of the OPAQ items (ICC 0.76-0.83), all based on a strong level of evidence. The six questionnaires are, however, all very different, varying in length, completion time, form of administration and occupational PA outcome. The choice of any of them depends largely on the purpose for which it will be used. The BRFSS is a single-item surveillance measure, designed to categorize occupational PA into three components, and can provide a rapid assessment of occupational PA levels (27,51). When a more comprehensive analysis of time spent in various occupational categories is desired, the multiple-item OPAQ or MOSPA-Q is more suitable (43,44). Both are relatively short and take less than five minutes to complete. An advantage of the MOSPA-Q over the OPAQ is that it contains questions on several PA domains and proved to be reliable in assessing both duration of activity and energy expenditure (44). The OPAQ, on the other hand, was additionally tested for duration of activity of each separate item: "sitting/standing", "heavy labor", and "employed for wages" all showed good repeatability. Like the MOSPA-Q, the IPAQ-L occupational PA was also reliable when expressed in both duration of activity and energy expenditure and contains questions on several PA domains. However, its aim is to assess health-enhancing PA and it therefore focuses on moderate- and vigorous-intensity PA; walking at work is included, but sitting is not (31,47).
Three studies objectively assessed criterion validity in a mixed-gender sample (by comparing occupational PA with similar activities from either an accelerometer or PA record) and provided a high level of evidence for their findings. This small number of studies results from the fact that many of the studies were not primarily designed to assess the validity of the occupational PA measures, but of the PA questionnaires overall. The TOQ was the only questionnaire that showed good criterion validity, namely for energy expenditure, for "sitting", both when expressed in energy expenditure and duration of activity and "standing" when expressed in duration of activity, all validated against a PA record. Moreover, moderate construct validity was observed for the TOQ "work" when expressed in duration of activity (20). The TOQ, however, proved less valid for assessing "time spent at work" and "walking at work". The TOQ, which is a 29-item self-administered questionnaire that assesses time spent in various types of occupational PA for 3 jobs in the previous year, can be seen as a useful questionnaire for identifying specific occupational PA habits that differ by type and intensity over time (20). However, when time and space allow only a brief assessment, it might be a less suitable questionnaire (the TOQ takes about 20 minutes to complete). Moreover, its questions are limited to occupational PA. An additional questionnaire that showed moderate criterion validity with regard to energy expenditure, when validated against a PA logbook, is the IPAQ-L; it also showed good repeatability (31,47). The IPAQ-L might be a more suitable questionnaire when one is also interested in assessing different domains of PA; however, its main focus is on health-enhancing PA.
An important finding of our review was that the criterion- and construct-related validity correlation coefficients were generally low for most comparisons. This could be due to shortcomings of the questionnaires. However, it is not necessarily an indication of poor validity; it might also be a reflection of the complexity of assessing the validity of (occupational) PA questionnaires. One particular problem is the choice of an appropriate comparison instrument. Even though accelerometry was considered an objective criterion standard, accelerometers may not be sensitive enough for evaluating the criterion validity of occupational questionnaires (eg, as a result of their inability to detect upper-body movement while a person is sitting or standing) (43). Using PA records, diaries, or other forms of self-report methods as a validation method might produce higher correlations; however, the risk of correlated error is also higher as both the method under scrutiny and the validation method are subject to the same forms of bias (48). An additional objective criterion measure for occupational PA is observation, which is likely to cover those PA that cannot all be captured with an accelerometer. However, no observation studies were found in our search. Validating occupational PA measures against physiological constructs (eg, cardiovascular fitness, overall energy expenditure, VO2max, and body composition), which are related to PA behavior in general, might also not be optimal, as the short time spent in heavy occupational PA and the longer hours spent in sitting and standing activities at work might not be sufficient to result in any changes in the above-mentioned physiological constructs (21,52). Moreover, a perfect correlation cannot be expected, as many other factors besides PA (including genetic predisposition and environmental conditions) influence changes in constructs, such as body composition and cardiovascular fitness (22). Thus, the observed criterion- and construct-related validity correlation coefficients should be interpreted in light of these considerations. Finally, as no standardized method for assessing PA exists, it is necessary to evaluate several measures of occupational PA.
Table 4. Construct validity of questionnaires assessing occupational physical activity. For full names of questionnaires, see appendix. Evidence was rated as follows: strong evidence (3), moderate evidence (2) and poor evidence (1). Acceptable level of criterion validity rated as: correlation coefficient (r)≥0.60 (+), 0.30≤r<0.60 (±) and r<0.30 (-). Inadequate measure used to assess criterion validity (0).

Limitations of the review
One limitation of this review is a potential publication bias, as our search strategy only located articles that were published in peer-reviewed journals and referenced in electronic databases. Moreover, the inclusion of only English-language articles may have excluded some studies, originating from a greater diversity of countries, that could have added relevant information regarding the repeatability and/or validity of (occupational) PA questionnaires. In addition, despite the abundance of studies that have evaluated the repeatability and/or validity of measures for physical working tasks and/or positions, these are not included in the review. A second limitation concerns the checklist that was used to score the results and methodological quality of the included studies. Others might have chosen different cut-off points for scoring negative or positive on repeatability or validity, for scoring the study design, or for scoring the use of Spearman correlation coefficients in repeatability studies. However, we feel confident in the choices made in our study as they are based on an existing checklist (14,15). We believe that using the ICC as the sole adequate method to quantify repeatability with regard to continuous measures, as proposed in the checklist, is appropriate: the ICC reduces the risk of overestimating repeatability, which might occur when using the Pearson correlation coefficient, as the latter does not take systematic differences between the two measurements into account (17). The information provided in the tables makes it possible for readers to interpret the findings using their own insights.
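The argument for preferring the ICC over the Pearson coefficient can be illustrated with a small numerical sketch. The data below are invented for illustration (a retest that is uniformly 2 units higher than the test), and the ICC form used, a single-measure two-way ICC for absolute agreement, is an assumption; the review does not state which ICC variant the included studies computed.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_agreement(x, y):
    """Single-measure, two-way ICC for absolute agreement (two occasions)."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]   # per-subject means
    col_means = [sum(x) / n, sum(y) / n]              # per-occasion means
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (v - rm - cm + grand) ** 2
        for a, b, rm in zip(x, y, row_means)
        for v, cm in zip((a, b), col_means)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical test-retest scores with a constant upward shift at retest.
test_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
retest_scores = [v + 2.0 for v in test_scores]

print(pearson_r(test_scores, retest_scores))      # perfect linear association
print(icc_agreement(test_scores, retest_scores))  # penalized for the shift
```

With these made-up scores the Pearson coefficient equals 1 because the ranking and spacing of subjects are unchanged, whereas the agreement ICC is clearly lower because every retest value differs systematically from its test value.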

Recommendations for future research
Our results indicated that few questionnaires were examined for repeatability and/or validity in more than one study and often inadequate measures were used to determine the repeatability or validity. In our opinion, future studies are needed to test the reliability and/or validity of existing questionnaires. Moreover, in order to enhance the quality of occupational research with respect to physical work exposures and health, more insight is needed into optimal comparison methods for validation and optimal ways of assessing occupational PA, by using both self-reported and objective measures.

Concluding remarks
In conclusion, based on our review of the literature on measurement properties of questionnaires measuring occupational PA, there is strong evidence for: (i) good repeatability of the work index of the BRFSS, energy expenditure and duration of activity of the IPAQ-L and MOSPA-Q, and duration of activity of the OPAQ and (ii) moderate-to-good validity of energy expenditure of the TOQ and IPAQ-L. However, because of the great diversity of the questionnaires and the purposes for which questionnaires will be used, we feel that no further conclusion can be drawn regarding the best questionnaire. Finally, as a result of the poor criterion validity of the questionnaires, objective measures of occupational PA are needed.