Possible bias from rating behavior when subjects rate both exposure and outcome

Possible bias from rating behavior when subjects rate both exposure and outcome. Scand Work Environ Health 1997;23(5):370-7. Objectives In many epidemiologic studies the subjects rate both the exposure and the outcome, assigning numerical values to the variables according to their perceptions and judgments. Hypothetically, subjects who tend to overestimate, exaggerate, or use high numerical values in rating tasks would rate both exposure and outcome higher than subjects who tend to underestimate, dissimulate, or use low numerical values. A range of such rating behaviors among the subjects would introduce uncontrollable bias to relative risk estimates, in most cases an overestimation. The aim of this study was to assess the possible presence of and effects on relative risk estimates of such high and low rating behavior among subjects in an epidemiologic study of musculoskeletal disorders. Methods Rating behavior was analyzed by intercorrelating the ratings of 19 different stimuli. High positive correlations would indicate the presence of high and low rating behavior. Results The correlations were, however, both positive and negative and close to zero. Adjusting for rating behavior did not affect relative risk estimates, based on subjective ratings of both exposure and outcome. Conclusion There is no support in this study for the existence of a range of high and low rating behavior among subjects who rate neutral and nonaffective stimuli, such as time, weight, number and physical exposure, as well as pain and other symptoms. There is therefore no support for the idea of a bias to relative risk estimates from such rating behavior in studies where subjects rate both exposure and outcome variables of this kind.

In epidemiologic studies quantitative data about both exposure factors and outcome phenomena are often acquired by subjective judgments or ratings. In the science of psychometrics such rated phenomena are called "stimuli" and the resulting judgments or ratings are referred to as "ratings". "Stimuli", in the context of epidemiology, include studied or confounding exposure factors (physical, psychosocial, etc) and outcome phenomena (sick leave, pain, well-being, etc). "Ratings" in the context of epidemiology would be the overt judgments or ratings of these phenomena as a result of a perceptual and cognitive process, which by its nature must be described as subjective. Such judgments or ratings could be given as verbal expressions ("very heavy", "now and then"), as free numerations ("23 kg", "5 times/day") or as values on rating scales [Likert scales, VAS (visual analogue scales), etc].
The relation between stimulus and rating magnitudes has been described as a power function by Stevens (1) and later modified by Ekman and others (algorithm 1) (2): where R = rating magnitude, S = stimulus magnitude, n = exponent, a & b = constants.
Such stimulus-rating functions have been determined empirically for many stimulus modalities (3,4). There are, however, many sources of error and bias, random or systematic, in subjective ratings (5). One source of systematic bias is individual differences in the use of rating scales and the use of numeric values. Such differences in rating behavior are clearly described in psychometrics, mainly concerning the range and standard deviation of numerics in ratings (6-10). The spread of ratings used by each subject affects the exponent (n) in the mentioned power function, with a greater spread resulting in a higher exponent.
Individual differences in the average value of the numerics used in rating procedures have, however, received less attention. Such differences in rating behavior could be described as a stable trait, a general tendency, to use high or low numerics when rating different phenomena, or as "over-" or "underestimators" if the ratings concern phenomena with true values. High and low rating behavior would affect the exponent (n) in the aforementioned algorithm. High raters would have a higher value of n (figure 1). When this possibility is applied to epidemiologic studies, "high raters" would rate both exposure and outcome as higher than "low raters" and vice versa, even when there are no interindividual differences in exposure or outcome. In a hypothetical study, if there is a range of such rating behavior among the subjects, and both exposure and outcome are rated by the same person (usually the subject of study), an association would be introduced between the exposure and outcome ratings (figure 2). This association would, however, solely be an effect of rating behavior, an artifact that would introduce bias into the results. In typical cases, where both exposure and outcome measures are scaled in the same direction, the relative risk would be overestimated. Differences in the spread of the ratings among the subjects ("narrow" and "wide" raters) can likewise introduce similar bias to relative risk estimates.
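The artifact described above can be illustrated with a minimal sketch. It assumes the basic Stevens form R = S^n, leaving out the constants a and b for simplicity, and the true values and exponents are hypothetical numbers chosen only for illustration. Every subject has the same true exposure and the same true outcome, yet the rated values are almost perfectly correlated, because each subject's exponent acts on both ratings.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical values: identical "true" exposure and outcome for all subjects.
TRUE_EXPOSURE = 5.0
TRUE_OUTCOME = 3.0

# Hypothetical exponents, one per subject, from "low raters" to "high raters".
exponents = [0.8, 0.9, 1.0, 1.1, 1.2]

# Simplified power function R = S**n (assumption: a = 0, b = 1).
rated_exposure = [TRUE_EXPOSURE ** n for n in exponents]
rated_outcome = [TRUE_OUTCOME ** n for n in exponents]

r = pearson(rated_exposure, rated_outcome)
print(f"correlation between rated exposure and rated outcome: {r:.3f}")
```

The correlation arises although no subject differs from any other in true exposure or outcome, which is the hypothetical false association the text refers to.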
If only one of the components, exposure or outcome, is rated by the subjects, then high and low rating behavior would bias relative risk estimates towards unity, since the relation to the true values is probably random.
No studies in epidemiology have been found regarding the existence and the potential uncontrollable biasing effect of such postulated high and low rating behavior. The aim of this investigation was therefore to study whether there is a range of high and low rating behavior, in particular among subjects in an epidemiologic study of musculoskeletal disorders, and whether there are effects on relative risk estimates when such rating behavior is stratified for when both exposure and outcome are rated by the same subjects.

Subjects
The subjects were participants in an epidemiologic study, approved by the regional ethical committee, on musculoskeletal disorders among the general working population aged 40-59 years (11). The number of subjects from whom data were available in the present study varied due to missing data and to the fact that some of the ratings only included the last 174 subjects of the total 484 (252 women, 232 men) examined in the main study (table 1).
Figure 2. Hypothetical false association between rated exposure and outcome magnitudes among subjects with a range of high and low raters. All the subjects have the same "true values" on both the exposure and outcome variables.

Methods
Rating behavior was determined by asking the subjects to rate the following fixed stimuli without information on the "true" values: (i) the taste of acidity of a 0.03 molar citric acid solution [using both a 10-cm VAS scale with end-point anchors of "no acidity at all" and "maximum acidity" and a CR-10 scale (category ratio-10 scale) (see appendix)] (12), (ii) the number of small objects in a box after a 3-s glimpse (true number = 72 pieces), (iii) the weight of a box lifted bimanually (true weight = 8.2 kg), (iv) the time given for completion of 2 subparts of a psychomotor test (true time = 30 and 60 s, respectively) (Purdue peg board test, Lafayette Instrument Co, Indiana, USA). Some additional nonfixed stimuli were also rated by the subjects. These stimuli concerned ratings of the subjects' own performance and feelings of exertion and of pain.
The subjects rated performance (= number of exercises) immediately after the following endurance tests: (i) curl-ups from a supine to a seated position, (ii) squats from erect to squatting position, (iii) 1-hand dumbbell lifts (male = 10 kg, female = 5 kg). Ratings of exertion were made after 5 minutes on a submaximal bicycle ergometer test (minimum steady-state heart rate = 120) using an RPE (rating of perceived exertion) scale (13). (See the appendix.) Ratings of pain were obtained using a CR-10 scale during a pressure pain threshold (PPT) test on the right trapezius muscle halfway between cervical vertebra number 7 and the right acromion using a traditional transducer with a rounded tip of 1 cm² (Algometer, Somedic Sales AB, Farsta, Sweden).
The subjects were not informed about the purpose of these ratings. They rated physical exposure, in general and to the back and shoulders, in their present work by answering questions on (i) perceived exertion (RPE scale), (ii) the proportion of the day spent in a seated posture (10-cm VAS scale), (iii) the frequency of work postures with the hands held above the shoulder level (5-point category scale), and (iv) the frequency of handling loads heavier than 15 kg (5-point category scale).
The following symptoms in the shoulder and low-back regions were rated by the subjects during a medical interview: (i) number of days with symptoms in the shoulder regions during the past year (6-point category scale), (ii) number of days with symptoms in the low-back region during the past year (6-point category scale), (iii) intensity of present pain in the shoulder region (CR-10 scale), and (iv) intensity of present pain in the low-back region (CR-10 scale).

Statistical methods
The postulated existence of a range of high and low rating behavior was examined by analyzing rank-correlation coefficients (Spearman r_s) between ratings of the different fixed stimuli and also between the fixed and nonfixed stimuli, exposure, and outcome variables. The presence of such a range would result in high positive correlations. Rating behavior was studied in the entire study group and also in different subgroups. Mean ratings and correlations were therefore calculated separately for the men and women, subjects 40-49 and 50-59 years of age, subjects with lower or higher skilled professions according to the Swedish Socio-Economic Classification (14), and subjects reporting symptoms from the shoulder or low-back regions during the past year and those without such symptoms.
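A rank correlation of the kind described above can be sketched as follows. This is a minimal standard-library illustration, not the SAS procedure the authors ran. The function names and the example ratings are hypothetical, and tied values are given average ranks, which is an assumption about tie handling.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings by six subjects of two fixed stimuli.
weight_kg = [4.0, 5.5, 3.0, 6.0, 5.0, 4.5]
time_s = [50, 40, 45, 60, 55, 35]
print(f"r_s = {spearman(weight_kg, time_s):.2f}")
```

Under the hypothesis of stable high and low raters, coefficients computed this way for pairs of unrelated stimuli would be clearly positive; values scattered around zero speak against it.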
Rating behavior was categorized by the following procedure. The subjects were ranked by the magnitude of their ratings of each of the 4 fixed stimuli. Rank numbers were divided by the number of subjects, giving each subject a relative rank number for each stimulus. The subjects were then categorized as "low", "medium", or "high" raters by cut-off points at approximately the 33rd and 67th percentile of the average relative rank of all fixed stimuli.
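The categorization procedure can be sketched as below. The function names and the ratings are hypothetical, ties are given average ranks, and the tertile cut-offs come from `statistics.quantiles` with `method="inclusive"`; the exact percentile rule used in the study is not stated, so that choice is an assumption.

```python
from statistics import quantiles


def average_rank(values, v):
    """1-based rank of v within values; ties receive the average rank."""
    less = sum(1 for u in values if u < v)
    equal = sum(1 for u in values if u == v)
    return less + (equal + 1) / 2


def categorize_raters(ratings_by_stimulus):
    """Label each subject 'low', 'medium' or 'high' from mean relative rank.

    ratings_by_stimulus maps a stimulus name to a list of ratings,
    one per subject, in the same subject order for every stimulus.
    """
    stimuli = list(ratings_by_stimulus.values())
    n = len(stimuli[0])
    mean_relative_rank = []
    for s in range(n):
        # Relative rank = rank divided by the number of subjects.
        rel = [average_rank(vals, vals[s]) / n for vals in stimuli]
        mean_relative_rank.append(sum(rel) / len(rel))
    # Two cut points that split the subjects into approximate thirds.
    low_cut, high_cut = quantiles(mean_relative_rank, n=3, method="inclusive")
    labels = []
    for value in mean_relative_rank:
        if value <= low_cut:
            labels.append("low")
        elif value > high_cut:
            labels.append("high")
        else:
            labels.append("medium")
    return labels


# Hypothetical ratings of the 4 fixed stimuli by six subjects.
ratings = {
    "acidity_mm": [30, 55, 42, 60, 25, 48],
    "count_pieces": [80, 60, 75, 90, 50, 70],
    "weight_kg": [4.0, 5.5, 3.0, 6.0, 5.0, 4.5],
    "time_s": [50, 40, 45, 60, 55, 35],
}
print(categorize_raters(ratings))
```

The resulting three-level variable is what the adjusted analysis stratifies on.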
The potential effect of a range of high and low rating behavior on the relative risk estimate was studied by analyzing the prevalence ratio (PR) of intensive pain in the shoulder region (case = CR-10 ratings ≥5) among the subjects rating the frequency of work with the hands held above shoulder level as high compared with those rating it as low. A corresponding analysis was done regarding symptoms in the low-back region and frequency of work with the handling of loads heavier than 15 kg.
The unadjusted PR (PRcrude) was calculated first. The PR adjusted for "low", "medium", or "high" rating behavior (PRadj) was then calculated according to the method described by Mantel and Haenszel (15). The effects of possible bias due to high and low rating behavior were studied by comparing the PRcrude with the PRadj.
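The comparison of PRcrude and PRadj can be sketched as follows. The code uses the Mantel-Haenszel-type summary estimator for ratios of proportions; whether the authors applied exactly this weighting is an assumption, and all counts are hypothetical. Each stratum is a tuple of exposed cases, exposed total, unexposed cases, and unexposed total for one rating-behavior category.

```python
def crude_pr(strata):
    """Prevalence ratio after collapsing all strata into one table."""
    a = sum(s[0] for s in strata)   # exposed cases
    n1 = sum(s[1] for s in strata)  # exposed total
    c = sum(s[2] for s in strata)   # unexposed cases
    n0 = sum(s[3] for s in strata)  # unexposed total
    return (a / n1) / (c / n0)


def mantel_haenszel_pr(strata):
    """Mantel-Haenszel-type summary prevalence ratio across strata."""
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator


# Hypothetical counts per rating-behavior stratum:
# (exposed cases, exposed total, unexposed cases, unexposed total)
strata = [
    (20, 100, 10, 100),  # "low" raters
    (40, 100, 20, 100),  # "medium" raters
    (10, 50, 5, 50),     # "high" raters
]

pr_crude = crude_pr(strata)
pr_adj = mantel_haenszel_pr(strata)
print(f"PRcrude = {pr_crude:.2f}, PRadj = {pr_adj:.2f}")
```

In this hypothetical example rating behavior is unrelated to exposure, so the two estimates agree. A PRadj clearly below the PRcrude would instead point to rating behavior inflating the crude association.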
Analyses were done using the SAS (statistical analysis system) computer program (SAS Institute, North Carolina, USA).

Results
The interindividual variation and the range of the ratings showed satisfactory distributions allowing studies of high and low rating behavior (table 1). Most variables followed a normal distribution curve (data not shown).
The correlations between the ratings of fixed stimuli were all close to zero and both positive and negative (table 2). Correlations between ratings of fixed and nonfixed stimuli and between exposure and outcome variables were also mostly close to zero and both positive and negative (table 2). Table 2 presents only the results for the ratings of acidity using the VAS scale, not the CR-10 scale, and only for the time rating of the 60-s test, not the 30-s test, as the 2 ratings gave very similar results (r_VAS×CR-10 = 0.839, r_60×30 = 0.697). All correlations between the ratings using the same scale were low. No systematic differences were observed for the mean ratings of fixed or nonfixed stimuli between the different subgroups (gender, age, socioeconomic class, symptom status). The most substantial differences were related to gender. The mean ratings of fixed stimuli by gender (females/males) were acidity 41.6/42.7 mm [difference -1.08, 95% CI (95% confidence interval) -7.78 to 5.62], count 81.4/69.4 pieces (difference 12.1, 95% CI -0.88 to 25.0), weight 4.73/5.28 kg (difference -0.55, 95% CI -1.26 to 0.15), time 51.5/44.7 s (difference 6.79, 95% CI 1.68 to 11.9). The correlations between the ratings of the fixed stimuli within different subgroups were all close to zero (data not shown).
The prevalence ratios for intensive symptoms differed between the genders. The PR values of the men were higher than those of the women. The calculations were therefore done separately for the men and the women (table 3). No substantial effects of adjustment for high, medium, or low rating behavior were noted within either group. This finding applied also when other cutoff points on the symptom scale were used for case definition (data not shown).

Discussion
In this study concerning bias from high and low rating behavior in epidemiologic research, different sensory modalities and cognitive demands were chosen for the rating tasks: estimation of taste, weight, quantity, frequency, time elapsed, exertion and pain. Different rating methods were used: free ratings and Likert, RPE, CR-10 and VAS scales. No signs of a range of high and low rating behavior were found among the subjects in this study, as the correlations between the ratings were low and both positive and negative. Low correlations were also seen between ratings using the same type of scale. This finding further supports the absence of such rating behavior. This is a welcome result, as the consequences of the reverse outcome would have been problematic. The presence of such rating behavior would imply that the relative risk estimates in studies where both exposure and outcome data are based on subjective ratings by the same subject could be uncontrollably biased, typically being overestimated. Special adjusting procedures would have to be considered in such cases. One such procedure would be to measure and adjust for individual high or low rating behavior, using the same methods as in this study. Another would be to design the rating scales to balance the effects of such rating behavior. An alternative would be to refrain from studies based on subjective ratings of both exposure and outcomes. Many other rating behaviors and personality traits, reported to bias ratings or judgments, have been studied in the science of psychometrics [eg, "response set or style", "social desirability", "self-deceptors", "halo effects", and "yeasayers and naysayers" (16-19)]. Bias in rating behavior can be divided into that associated with the content of the rated item ("response set") and that without association to the content ("response style") (20).
Examples of the former are "social desirability" and "negative or positive affectivity", and an example of the latter is "extreme response bias". Except for the range or spread of the numerics used in ratings (6-10,21), few consistent "response style" biases have been demonstrated (20, 21). The hypothetical "high and low rating behavior" could be considered a "response style", and therefore our negative results are consistent with the previous findings. It has been stated that the more ambiguous the rating or judging task, the more probable the introduction of different rating bias (5,20). The rating tasks in our study varied in ambiguity. Some tasks were self-evident and easy, such as the rating of the number of curl-ups or dumbbell lifts. Other tasks were more ambiguous and difficult, such as ratings of acidity or pain. No systematic associations could, however, be noted regarding the ambiguity of the rating task and rating behavior.
Another potential source of bias to ratings, with similar effects as from high and low rating behavior, is the suggested phenomenon of "negative or positive affectivity". "Negative affectivity" has been defined as a mood-dispositional dimension that reflects pervasive individual differences in the experience of negative emotion and self-concept (22) and "positive affectivity" as an ability to cope unusually well with stressful situations and to have a sense of coherence or dispositional optimism (23). Many studies have shown that different perceived stressors are associated with perceived symptoms, distress, and health (24-28). Negative affectivity has been shown to correlate with both the perceived stressors and the strain and to mediate between these (23, 29, 30). A bias (overestimation) from such affectivity to measures of association between stressful exposure and different outcomes has been argued, but also disputed (23, 31-33).
Possible effects of negative or positive affectivity were not included or controlled for in our study. The stimuli rated in our study are not considered to be stressful or emotionally loaded. All stimuli, with the exception of the pain ratings, can be considered as "neutral" stimuli, without affective or emotional connotations. Ratings of pain in the PPT test showed only minimal correlations with ratings of present pain in the shoulders or low back, a finding indicating that these ratings were not substantially affected by some common factor like negative or positive affectivity. In other studies, however, thresholds for pain, but not for pure sensation, have been found to be sensitive to personal characteristics, such as "self-deceptiveness" (19).
It is important to distinguish the potential source of bias from rating behavior from bias due to differential misclassification. Both have the same consequence of uncontrollable bias in relative risk estimates (34). Both sources of bias are due to an artifact of irrelevant associations between the exposure and outcome measures.
Differences between genders have been demonstrated regarding mean values in the use of Likert scales, in the validity of ratings of energy demands in current work or in assigning numeric values to verbal expressions like "very often" (21, 35, 36). Differences in rating behavior between different age and socioeconomic groups have been described earlier (21) and could hypothetically have been expected in our study, due to differences in educational level and supposed familiarity with judgments, evaluations, and numbers. Likewise differences connected to pain and ache status could hypothetically have been expected due to a possible higher "arousal level" or "alertness" for stimuli among suffering subjects. Subdividing our subjects did not, however, reveal any subgroup characterized by systematically higher or lower ratings or a range in such rating behavior. Our results therefore do not so far support the idea that observed differences in the validity of ratings among different subgroups in epidemiologic studies are explained by differences in high and low rating behavior. The narrow age span, however, limits conclusions regarding the influence of age on rating behavior in our study.
Our objective was not to study the validity of the ratings. It can, however, be noted that most of the stimuli with known true values were underestimated. Weight was pronouncedly underestimated (60% of the true weight), as has been shown in other studies (37). Ratings of acidity of the 0.03 molar citric acid solution were, on the average, 42 mm of the 100 mm VAS scale, which compares well with earlier findings (38). The RPE ratings during the submaximal aerobic capacity test were also mainly close to the expected value of 10% of the heart rate (13). No "true" value can be assigned to the pain ratings during the PPT test. The results were, however, mainly the same if the PPT ratings were related to the level of the pressure pain threshold (CR-10/PPT level). Regarding ratings of the exposure and outcome variables, there are no known true values to compare with the ratings. Low validity of self-reported exposure to work postures has been reported, especially regarding ratings using scales compared with dichotomous variables (39). Our study does not, however, support the idea that this lack of validity can be attributed to bias from high and low rating behavior.
The spread in ratings between subjects was sufficient to examine the correlations between the rated variables. There was no evidence of nonlinearity in the associations among the variables. Thus neither of these 2 factors can explain the findings of low intercorrelations. Our study does not provide data about the reliability of the ratings. It is, however, unlikely that lack of reliability could attenuate hypothetically substantial intercorrelations to those very low, both positive and negative, intercorrelations observed in our study. The (expected) findings of the relatively high correlations when the same stimulus situation was rated twice (acidity with CR-10 and VAS scales; time of 30-s and 60-s tests) further support this. Nonparametric statistics (Spearman rank-correlation coefficients) were used in this study as some of the rating scales were only on the ordinal level. The corresponding Pearson correlation coefficients did not, however, differ much from those reported.
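The attenuation argument above can be made concrete with a small sketch. It assumes the classical psychometric relation in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. This calculation is not reported in the study, and the true correlation and reliabilities below are hypothetical.

```python
from math import sqrt


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Observed correlation expected under the classical attenuation relation."""
    return true_r * sqrt(reliability_x * reliability_y)


# Hypothetical scenario: a substantial true correlation of 0.6 between two
# ratings, each measured with a poor reliability of 0.5.
observed = attenuated_correlation(0.6, 0.5, 0.5)
print(f"expected observed correlation: {observed:.2f}")
```

Under these assumptions a true correlation of 0.6 would still be observed as about 0.3. Producing coefficients scattered around zero from a substantial true correlation would require reliabilities close to zero, which the high correlations between repeated ratings of the same stimulus make implausible.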
The main limitations to the generalizability of the results from our study are to be found in the selection of the neutral and nonaffective stimuli and the middle age span of the subjects.

Concluding remarks
There is no support in this study for the existence of a range of high and low rating behavior among middle-aged subjects who rate neutral and nonaffective stimuli, such as time, weight, number, and physical exposure, as well as pain and other symptoms.
There is therefore no support for the idea of a bias to relative risk estimates from such rating behavior in studies in which subjects rate both exposure and outcome variables of this kind.