Effects of the implementation of an 84-hour workweek on neurobehavioral test performance and cortisol responsiveness during testing

Persson et al. Scand J Work Environ Health 2003;29(4):261–269. Objectives This study examined whether long workhours in combination with an extended workweek (12 hours/7 days), as requested by the workers, impaired attention and cognitive performance and whether the degree of hypothalamic-pituitary-adrenal (HPA) activation was related to the response to the performance tasks. Methods A group of 41 male construction workers between 21 and 60 (mean 39) years of age who worked 84 hours a week, with alternate weeks off, was compared with a group of 23 male construction workers between 24 and 65 (mean 43) years of age who had a traditional 40-hour work schedule. Neurobehavioral test performance, self-ratings of fatigue and sleepiness, and salivary cortisol levels were evaluated in a counterbalanced repeated-measures design. Results The 84-hour group did not show any signs of reduced test performance or elevated fatigue and sleepiness. The 84-hour group had faster reaction times on day 7 than on days 1 and 5. Although the expected activation of the HPA axis was only found in the total study sample when workdays 1 and 5 were collapsed, the HPA activation can be considered normal. Conclusions The results suggest that an 84-hour work regimen in response to requests from workers does not induce more performance deficits than an ordinary 40-hour workweek. An extended work schedule of 84 hours cannot in the short term be considered to affect basic mental capabilities negatively.

The effects of long workhours on health and performance have been much debated. Potential benefits, which may be of a social, economic, or individually health-promoting nature, are often contrasted against the assumed risks of negative effects on individual health and safety, as well as the possibility of negative social consequences (1)(2)(3). Even if working more than 48 to 56 hours a week is considered potentially harmful, current scientific evidence is inadequate to give any firm recommendations about long workhours (4)(5)(6). There is a need to supplement existing knowledge with detailed information regarding the multitude of alternative work schedules that are becoming increasingly implemented in a variety of organizational and industrial settings. In the present case, construction workers building a bridge across the strait between Sweden and Denmark in the Malmö region suggested an 84-hour workweek (12 hours/7 days) with alternate weeks off. Working 84 hours a week with such a combined physically and mentally demanding job assignment was suspected to lead to accumulated fatigue and, therefore, compromise the workers' physical and mental status and jeopardize safety at work. For this reason, the work schedule was initially questioned by the local labor inspectorate, but was later accepted subject to medical supervision and evaluation. To investigate the possible adverse effects of this schedule, a multitude of methods and effect markers was used, for example, questionnaires, hormones in blood and saliva, heart rate variability, muscular fatigue, objective sleep monitoring, and neurobehavioral tests. This paper presents the evaluation of the effects of a long workweek on neurobehavioral test performance and cortisol responsiveness during testing. One aim was to examine whether working long hours would lead to a decline in attention and cognitive performance. 
A supplementary aim was to investigate whether the hypothalamic-pituitary-adrenal (HPA) axis was properly activated during neurobehavioral testing and to determine whether HPA responsiveness was related to test performance. Since short-term (ie, 30 minutes) work with a visual display unit (VDU) with high demands on speed and accuracy has been reported to be associated with an increase in salivary cortisol levels (7), indicating a general stress response during such exposure, one would expect to observe a similar rise in cortisol levels during neurobehavioral testing if the participants are able to mobilize sufficient motivation and arousal across the week.

Participants
Two groups of male construction workers were examined. The first group (N=41) had volunteered to work cycles of 84 hours a week (12 hours/7 days) followed by 7 days off. The daily workhours were between 0700 and 1900, including a sedentary 15-minute transport to and from an off-shore location in the middle of the strait. The second group (N=23) of construction workers worked a regular 40-hour week (8 hours/5 days) between 0700 and 1530 at an on-shore location in the harbor. The mean age of the 84-hour group was 39.2 (SD 10.4, range 21-60) years, and that of the 40-hour group was 42.5 (SD 13.9, range 24-65) years. The participants were identified in collaboration with the employer, and inclusion in the groups was determined by their working under one of the five specified company foremen who were in charge of the workers on the different worksites. It should be noted that the foremen in charge of the 84-hour group had a first-hand opportunity to select and hire workers when the project was started. Five participants in the 84-hour group were omitted from the statistical analysis since their participation rate was too low (ie, less than 2 of the 3 days of measurement). Their reasons for absence were vacation, common cold, working in key positions, and the like. Thirty-eight persons participated on all three workdays. In the 40-hour group, one participant was omitted from the statistical analysis due to poor cooperation. Altogether, 64 participants provided sufficient data for the analysis. When the testing was initiated, the 84-hour group had worked for approximately 6 months, whereas the 40-hour group had worked for approximately 8 months. Most of the participants were from the region and returned home after each workday. One-third of the participants in the 84-hour group commuted on a weekly basis, as they came from more distant parts of Sweden. 
During the workweek these long-distance commuters shared ordinary apartments, two by two, in the vicinity of the transport harbor. Questionnaire data showed that most of the participants in both groups slept well and were satisfied with the amount of sleep they got (table 1). Electrocardiographic data showed that the physical fitness level (estimated by heart rate during rest) of the two groups was similar (table 2).

Work conditions
Both groups were engaged in heavy construction work, involving different types of concrete work and steel reinforcement work. The main task for the 84-hour group was to build two H-pylons (rising 204 meters above sea level) to support the double-decked cable-stayed highway and railroad bridge. The 40-hour group on shore made the bridge components on which the railroad tracks rest (ie, the railroad trough). For both groups there were three scheduled breaks during the workday (ie, morning, lunch, and afternoon). The morning and afternoon breaks lasted 15 minutes. The lunch break lasted 30 minutes. Due to the location of the worksites, unplanned interruptions owing to disturbances in logistics or bad weather conditions occurred more frequently for the 84-hour group. Electrocardiographic data, which were obtained during different weeks but with no known worktask variations, showed that the 40-hour group had a higher workload despite the similar worktasks (table 2).

Design
The 84-hour group was tested on days 1, 5, and 7, whereas the 40-hour group was tested on days 1 and 5.
Since the design aimed to compare the 84-hour group with the 40-hour group, tests on days 1 and 5 were given in a counterbalanced order to minimize the influence of possible training and learning effects. For the 84-hour group, the test session on day 7 was always scheduled last.

Measures
Karolinska Sleepiness Scale (KSS). A slightly modified version of the KSS (8), in which the intermediate scale steps without verbal anchoring were removed, was responded to on a VDU. The subjects rated their current sleepiness on the following 5-point scale: (i) very alert, (ii) alert, (iii) neither alert nor sleepy, (iv) sleepy, no difficulty remaining awake, and (v) extremely sleepy, fighting sleep.
Fatigue symptom ratings. Eleven items, inspired by, and partially derived from, previous research (9), were used to assess aspects of physiological and mental fatigue. The items were presented on a VDU and responded to on a 4-point scale indicating degree of compliance with a verbal expression concerning, for example, difficulties to concentrate, feeling stressed, overworked, and the like. A higher score indicated more symptoms of fatigue. In our present study, a global fatigue index comprised of all 11 items was used. The internal consistencies for this index were calculated as Cronbach's alpha and ranged between 0.74 and 0.85 across groups and workdays. The neurobehavioral tests were selected for their empirically demonstrated high sensitivity to detect minor performance decrements caused by, for example, subtle brain dysfunction (10).
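The internal consistency reported for the fatigue index can be illustrated with a short calculation. The following is a minimal Python sketch of the standard Cronbach's alpha formula; the function name and the ratings are made up for illustration (three items instead of eleven) and are not study data.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings on a 4-point scale: 4 respondents, 3 items.
ratings = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 3, 4],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is commonly read as an index of internal consistency.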
WAIS-R Digit Symbol: a test of perceptual and fine motor speed. With the WAIS-R Digit Symbol test the participants were instructed to enter code symbols as rapidly as possible in empty squares according to a code list that covered a series of numbers ranging from 1 to 9. The number of correctly substituted symbols within 90 seconds constituted the score (11). To reduce learning effects in the WAIS-R Digit Symbol substitution test, three different versions, one version for each specific workday, were used.
APT two-choice visual reaction time (APT RT-2). This simple reaction time test is part of the Automated Psychological Test (APT) system developed by Levander & Elithorn (12). The participants were instructed to react as fast as possible to a white square presented either to the left or to the right of a computer screen, and the responses were given with either of two corresponding keys. A fixation cross was presented in the center of the screen. Altogether, 50 stimuli were presented during approximately 5 minutes with an interstimulus interval of 2-6 seconds. The individual results were expressed as level (the mean of the 50 reaction-time responses) and variation (the standard deviation of the 50 reaction-time responses).
Note on the electrocardiographic data (table 2): The electrocardiographic data were obtained during different weeks than the tests (but with no known worktask variations). The 10 youngest participants in the 84-hour group were not subjected to the recordings because of a lack of proper equipment.
APT two-choice visual reaction time with sound inhibition (APT inhibition). This complex reaction-time test is similar to the APT RT-2, but the participants are supposed to inhibit a response if an auditory signal occurs at the same time as a visual stimulus (go/no-go criteria). Half of the stimuli (ie, 25) are accompanied by the auditory stimulus, occurring at random. APT inhibition incorporates a preparatory stimulus (ie, the fixation cross starts to flash before the squares are presented).
The results were expressed in terms of (i) level, (ii) variation, and (iii) failed inhibition percentage.
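The three outcome variables can be sketched in Python as follows. This is only an illustration: the trial format, the function name, the use of the sample standard deviation, and the restriction of level and variation to trials without the auditory signal are assumptions, since the text does not specify how the APT software computes them.

```python
from statistics import mean, stdev

def summarize_trials(trials):
    """trials: list of (reaction_time_ms, is_no_go) tuples.

    reaction_time_ms is None when no key was pressed (hypothetical format).
    """
    # Assumption: level and variation use responses on go trials only.
    go_rts = [rt for rt, is_no_go in trials if not is_no_go and rt is not None]
    no_go_responses = [rt for rt, is_no_go in trials if is_no_go]
    # A key press on a no-go trial counts as a failed inhibition.
    failed = sum(1 for rt in no_go_responses if rt is not None)
    return {
        "level": mean(go_rts),
        "variation": stdev(go_rts),
        "failed_inhibition_pct": 100 * failed / len(no_go_responses),
    }

# Made-up trials: three go trials and four no-go trials (one failed inhibition).
example = [
    (400, False), (420, False), (440, False),
    (None, True), (380, True), (None, True), (None, True),
]
summary = summarize_trials(example)
print(summary)
```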
APT k test: a test of selective attention. Each subject was told to scan a video screen in order to decide whether or not the letter k was present in a set of nontarget letters. If the letter k was present, the participants were instructed to respond by pressing a pair of buttons using the middle fingers. When the letter k was absent, the subject was encouraged to press another pair of buttons using the index fingers. One letter k was absent or present with equal probability (ratio 0.50) among a total of 10 letters. Approximately 100 sets of letters were presented during 5 minutes. The result was expressed as (i) reaction-time level of correct hits (covering all spatial locations on the screen) and (ii) error rates (missed and false hits).

Salivary cortisol
To investigate each participant's psychophysiological response to the neurobehavioral tests, salivary cortisol was sampled. Both cortisol and cortisone were analyzed by high-performance liquid chromatography (HPLC), and the total amount of corticosteroid was used in the statistical analyses. Saliva (2-5 ml) was collected in a 10-ml glass tube, frozen, and sent frozen for the HPLC analysis. Thawed samples were treated with 100% methanol, extracted (Bond Elut C18, Varian, Harbor City, CA, USA), separated (Supercosil LC8, Supelco, Bellefonte, PA, USA), and detected by ultraviolet light (spectra 100 at 240 nm). The data were recorded and integrated by WINer on Windows Labnet (Thermo Separation Products, San José, CA, USA). The within-day coefficients of variation (CV) ranged from 5.4% to 8.9%, and the between-day coefficients from 7.1% to 8.8%.

Procedure
Testing was carried out (for both groups) in the morning between 0800 and 1130 in stationary work barracks placed in the immediate proximity to the worksites. There was no systematic difference in the timing of the tests between the groups. Each session lasted approximately 30 minutes. Typically, four subjects were tested simultaneously under the supervision of two, or three, psychologists. The first sampling of salivary cortisol preceded the tests. To stimulate saliva excretion, the participants chewed on a piece of paraffin. The session then started by having them fill out a computer-administered questionnaire concerning current state of sleepiness and fatigue. The tests were then given in the following order: WAIS-R Digit Symbol, APT RT-2, APT inhibition, and finally the k test. The second sampling of salivary cortisol followed the tests. Since the participants had been awake for at least 2 hours before the testing, the sampling of salivary cortisol occurred in a phase in which cortisol levels are expected to show physiologically a slight decline during a 30-minute period.

Ethics
All of the participants gave their written informed consent to participate. The Ethics Committee of Lund University approved the study (LU 63-98).

Data management
Descriptive data: questionnaire and electrocardiographic data. The awakening score was calculated as the mean score of three items that assessed (i) ease of awakening, (ii) whether the sleep was refreshing, and (iii) exhaustion at awakening. The score ranged from one to five. Higher scores represented a more refreshing sleep that was easily terminated. The overall sleep quality was assessed with one question. The score ranged from one to five, and a higher score represented a better quality of sleep. The disturbed sleep score was calculated as the mean score of four items that assessed whether the subject had (i) difficulties falling asleep, (ii) disturbed or restless sleep, (iii) repeated awakenings, and (iv) premature awakenings. The score ranged from one to five. Higher scores represented less disturbed sleep. The sleep time during workdays was calculated as the difference between the reported time for going to sleep at night and the reported time for awakening. The item on perceived satisfaction with the attained amount of sleep was dichotomized according to yes and no answers. The results were expressed as the percentage satisfied (yes answers) within each group. Confidence intervals for the percentage were calculated according to Altman et al (13). Maximum heart rate was calculated as 210 - (0.662 × age) (14). The percentage of the heart rate ratio (HRR%) was then calculated as (HRwork - HRrest) / (HRmax - HRrest) × 100.
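The two heart rate formulas can be written out as a small Python sketch. The function names and the example values (a 50-year-old with a resting rate of 60 and a working rate of 95 beats per minute) are hypothetical and serve only to show the arithmetic.

```python
def max_heart_rate(age_years):
    # Age-predicted maximum heart rate as given in the text: 210 - (0.662 x age).
    return 210 - 0.662 * age_years

def heart_rate_ratio_pct(hr_work, hr_rest, age_years):
    # HRR% = (HRwork - HRrest) / (HRmax - HRrest) x 100
    hr_max = max_heart_rate(age_years)
    return (hr_work - hr_rest) / (hr_max - hr_rest) * 100

# Hypothetical worker, not study data.
hr_max_50 = max_heart_rate(50)
hrr_50 = heart_rate_ratio_pct(hr_work=95, hr_rest=60, age_years=50)
print(round(hr_max_50, 1), round(hrr_50, 1))
```

Because the resting rate is subtracted in both numerator and denominator, HRR% expresses the working heart rate as a share of the individual's available heart rate range rather than of the absolute maximum.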
Outcome measures: self-ratings and neurobehavioral tests. The neuropsychological test scores and the KSS scores were calculated as raw scores if nothing else was indicated. The ratings of the individual fatigue items were averaged to form a fatigue index score. The distribution of every test variable and rating scale score was plotted and visually inspected for deviations from normality. Positively skewed deviations from normality were found for the two APT RT-2 variables and the three APT RT-inhibition variables. The APT k error variable and the fatigue index score were also found to be positively skewed. To comply with normality assumptions, the positively skewed variables in which only positive values were achievable were logarithmically transformed with the base of 10. The log-transformed variables were antilogged and converted back into the original scale after the statistical testing. Hence the log-transformed variables are presented as geometric means with accompanying confidence intervals. To lessen the impact of extreme scores for the APT RT-inhibition and APT k test error variables, which incorporated possibilities to achieve nonpositive values (ie, zero), the data were subjected to square root transformation.
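The log transformation and back-conversion described above amount to reporting a geometric mean. A minimal Python sketch follows; the function name is hypothetical, and the default critical value of 1.96 is a normal approximation chosen for illustration (a t quantile would usually be preferred for small samples, and the text does not state which was used).

```python
import math
from statistics import mean, stdev

def geometric_mean_ci(values, crit=1.96):
    """Geometric mean with an interval computed on the log10 scale.

    values must be strictly positive; crit is an assumed critical value.
    """
    logs = [math.log10(v) for v in values]
    m = mean(logs)
    se = stdev(logs) / math.sqrt(len(logs))
    # Antilog: convert the mean and interval limits back to the original scale.
    return 10 ** m, 10 ** (m - crit * se), 10 ** (m + crit * se)

# Made-up positive values, not study data.
gm, lower, upper = geometric_mean_ci([10, 100, 1000])
print(round(gm, 1), round(lower, 1), round(upper, 1))
```

Note that the back-transformed interval is asymmetric around the geometric mean, which is expected for positively skewed data.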

Statistical analysis
Most of the statistical computations were made with SPSS 11.0.1 (15). P-values below 0.05 were considered statistically significant. With the use of the General Linear Model module, univariate two-way repeated-measure analyses with group as a between-participant factor and age as a covariate evaluated the group differences in the mean scores for the neuropsychological tests. If a regression with P<0.10 was found between the covariate age and a test score, the former was retained in the final model. As a consequence, nearly all the group computations involving neurobehavioral test scores were adjusted for age (ie, all except the APT RT failed inhibition variable). Age did not yield any statistically significant effects on the cortisol measures.
The main question of the study was tested by the interaction effect between workday (day 1 and day 5) and group (84-hour versus 40-hour). For the 84-hour group only, the possible effect of all workdays (ie, day 1 versus day 5 versus day 7) on the tests and ratings was also evaluated. Analyses were performed on the participants that had a complete set of data for the particular items or tests. In the 84-hour group the homogeneity of variance assumption was tested with Mauchly's test of sphericity. If the sphericity assumption was violated, the Geisser-Greenhouse F-test was used to adjust the degrees of freedom in order to increase the critical F-value; this procedure lessened the risk for type I errors.
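The degrees-of-freedom adjustment can be illustrated with the standard Greenhouse-Geisser epsilon formula. The Python sketch below assumes that the sample covariance matrix of the repeated measures is already available; the function names are hypothetical, and this is not the SPSS routine used in the study.

```python
def greenhouse_geisser_epsilon(cov):
    """cov: k x k covariance matrix (list of lists) of the repeated measures."""
    k = len(cov)
    grand_mean = sum(sum(row) for row in cov) / (k * k)
    diag_mean = sum(cov[i][i] for i in range(k)) / k
    row_means = [sum(row) / k for row in cov]
    sum_sq = sum(v * v for row in cov for v in row)
    numerator = (k * (diag_mean - grand_mean)) ** 2
    denominator = (k - 1) * (
        sum_sq - 2 * k * sum(m * m for m in row_means) + (k * grand_mean) ** 2
    )
    return numerator / denominator

def adjusted_df(epsilon, k, n):
    # Both numerator and denominator df are multiplied by epsilon.
    return epsilon * (k - 1), epsilon * (k - 1) * (n - 1)

# Sphericity holds exactly for an identity covariance matrix (epsilon = 1).
eps_spherical = greenhouse_geisser_epsilon([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# Worst case for three levels: all variance lies along one contrast (epsilon = 0.5).
eps_worst = greenhouse_geisser_epsilon([[1, -1, 0], [-1, 1, 0], [0, 0, 0]])
print(eps_spherical, eps_worst)
```

Epsilon ranges from 1/(k - 1) to 1. With three workdays, for example, an epsilon of about 0.765 would reduce the numerator degrees of freedom from 2 to about 1.53.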
Within-group post-hoc comparisons were made with the Bonferroni-adjusted paired t-test. Given that the within-subject factor represents ordered levels, polynomial contrasts were used to evaluate the linear and quadratic effects. The polynomial contrasts were also screened for higher order contrasts. No higher order contrast fitted better than any linear contrast.
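Linear and quadratic trends over three ordered workdays can be sketched as contrast scores computed per participant. The Python example below uses the usual equally spaced coefficients, which is an assumption because the test days (1, 5, and 7) are not equally spaced and the text does not state the metric used; the function name and the scores are made up.

```python
import math
from statistics import mean, stdev

LINEAR = (-1, 0, 1)      # equally spaced linear contrast (assumed)
QUADRATIC = (1, -2, 1)   # equally spaced quadratic contrast (assumed)

def contrast_f(scores, weights):
    """scores: one (day 1, day 5, day 7) tuple per participant.

    Returns the F statistic for the contrast, with df = (1, n - 1).
    """
    # Weighted sum per participant, then a one-sample t-test against zero.
    c = [sum(w * x for w, x in zip(weights, row)) for row in scores]
    t = mean(c) / (stdev(c) / math.sqrt(len(c)))
    return t * t

# Hypothetical scores for four participants, not study data.
scores = [(10, 9, 8), (12, 10, 9), (11, 10, 8), (13, 12, 9)]
f_linear = contrast_f(scores, LINEAR)
f_quadratic = contrast_f(scores, QUADRATIC)
print(round(f_linear, 2), round(f_quadratic, 2))
```

In this made-up example the linear contrast dominates, which corresponds to a steady change across the three days rather than a curved pattern.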
A univariate three-way repeated-measures analysis was used to evaluate the possible differences in the cortisol levels for the 30 and 15 participants from the 84-hour and 40-hour groups, respectively, who left a complete set of data for workdays 1 and 5. There were no differences in the mean cortisol levels between the participants with a complete set of data and those with an incomplete dataset. Specifically, the analysis involved two factors with repeated measurements with two levels each, that is, workday (day 1 and 5) and pre-post (pretest and posttest levels of cortisol), as well as one between-group factor (84-hour versus 40-hour group). The main question to be evaluated was tested by the three-way interaction between workday, group, and pre-post.
A univariate two-way repeated-measures analysis was used to evaluate the possible differences in the cortisol levels for the 25 subjects in the 84-hour group who had a complete dataset for workdays 1, 5, and 7. There were no differences in the mean cortisol levels between the participants with or without complete sets of data.

Results
Comparisons between the 84-hour and 40-hour work schedules
Self-rated sleepiness and fatigue symptoms. There was no group difference for sleepiness or fatigue development between days 1 and 5. The 40-hour group had higher KSS and fatigue index scores than the 84-hour group (table 3).
Neurobehavioral test performance. There was typically no group difference for performance development between days 1 and 5. Only one interaction effect between workday and group, involving the APT k-test reaction-time variable, was detected (table 1). This interaction was due to the slower reaction times observed in the 40-hour group on day 1, which subsequently became faster and comparable with those observed in the 84-hour group on day 5. Furthermore, the 40-hour group exhibited larger variation scores than the 84-hour group in the APT RT-2 and APT RT-inhibition tests.
Salivary cortisol. No interaction indicating group differences in physiological activation between the days was observed. The levels of cortisol were, on the average, higher after the testing than before it [F=6.15, df=1, P=0.017, partial η²=0.13] (table 4).
Evaluation of performance and cortisol excretion across workdays in the 84-hour group
Self-rated sleepiness and fatigue symptoms. The number of days worked influenced the feelings of fatigue but not sleepiness (table 5). The fatigue index scores increased across the days. Differences were mainly observed when day 7 was compared with day 1 or 5.
Neurobehavioral test performance. The number of days worked influenced performance with respect to the following tests and variables: WAIS-R Digit Symbol, APT RT-2 variation, APT RT inhibition level, and APT k-test level for correct responses (table 5). These differences, which emerged either between days 1 and 7 or between days 5 and 7, showed that performance became faster and less variable after working 7 days. The evaluation with polynomial contrasts suggested a gradual and linear progress for some performance variables.
Salivary cortisol measurement. There were no overall differences in the cortisol levels after the testing compared with before the testing [F=1.57, df=1, P=0.222], and no interaction occurred between workday and the factor pre-post [F=1.69, df=2, P=0.195]. The cortisol levels generally decreased as a consequence of the number of days worked [F=4.32, df=1.53, P=0.029, partial η²=0.15]. Specifically, the pretest levels on day 1 were higher than the pretest levels on day 7, and all the posttest levels of salivary cortisol deviated significantly from each other on all workdays (table 6). Polynomial contrast indicated a linear trend for the salivary cortisol levels, the means decreasing over the workweek [F=17.40, df=1, P<0.001, partial η²=0.42].

Discussion
Our study disclosed no evidence of performance decrement in neurobehavioral tests or of increasing symptoms of fatigue or sleepiness as a result of an 84-hour workweek in comparison with corresponding results for an ordinary 40-hour workweek. Contrary to expectations, the 40-hour group reported more sleepiness and fatigue and had occasionally a more variable and slower performance than the 84-hour group. Interestingly, the 84-hour group had the fastest reaction times and the smallest variation scores on day 7. At first glance, this finding might be suspected to indicate training or learning effects, as day 7 was not balanced out in the study design. However, a previously repeated administration of the same neurobehavioral test has never suggested any signs of learning, in either the short term (ie, 4 times within 2 hours) or the intermediate term (the same procedure repeated 1 week later) (13). On the other hand, training or learning effects cannot be ruled out for the WAIS-R Digit Symbol test (16,17). An alternative interpretation of the lower and less variable reaction times is excessive arousal on day 7. This position is supported by the self-reports of low levels of fatigue and sleepiness prior to the testing, as well as the fact that unpublished observations have shown that the 84-hour group required at least two major sleep episodes (ie, two consecutive nights' sleep) in order to reach baseline levels of fatigue and sleepiness. Following this line of reasoning, a central question becomes whether an increase in arousal is detrimental or beneficial. From a traditional perspective, one might argue that the shorter reaction times reflect improved performance, and are desirable, whereas a more unconventional approach would be to suggest that the central nervous system may accelerate towards a state of dysfunction. 
Yet another possibility is that the faster and less variable performance on day 7 in the 84-hour group was due to more motivation, as the subjects were likely to have positive expectations of the forthcoming leave. It is also imaginable that these subjects, in order to retain the current work schedule and the benefits thereof, have had an unspoken ambition to perform well on the last workday. Regardless of the reason, a motivation hypothesis finds some support in the fact that the decreases in reaction times seemed to become more prominent as a function of the complexity of the tests. For example, between days 5 and 7, no difference in mean reaction time was observed for the APT RT-2 test, whereas the mean reaction time of the slightly more cognitively demanding APT inhibition test decreased by 40 milliseconds (Z=-0.42). In the APT k test, which involved a markedly more complex visual search component, the mean reaction time decreased by 130 milliseconds (Z=-0.62). Possibly more complex tasks leave more room for motivation to become an issue; thus trying harder pays off. The observed rise in cortisol levels in the total study sample during the neurobehavioral testing agrees fairly well with previous findings concerning changes in saliva cortisol levels in response to short-term stress among VDU operators (7). Even if mean levels indicate diminishing cortisol responsiveness during testing in the 84-hour group, the intraindividual variability between days was large, and no workday by pre-post interaction was observed. Similarly, the general rise in cortisol during the testing was not observed when the development across all 7 days was studied. Remarkably, the 84-hour group displayed a general decrease in absolute levels of cortisol across the week. At a glance this finding may give the impression of the start of a fatigue reaction in the HPA axis, as a response to prolonged demands and too little sleep. 
This impression is reinforced by the agreement with falling morning serum cortisol and serum testosterone levels during the workweek, found in the same group during another workweek (18). On the other hand, when the data are considered in combination with the observation that the melatonin level of the 84-hour group was markedly higher on day 1 than on days 5 and 7 (18), another explanation may be that the drop in cortisol merely reflects the normalization process after the phase shift that occurred when the participants entered the work schedule and were forced to wake up earlier.
At this point, some methodological issues need to be addressed. The present design has more than sufficient power to detect a difference between the means of the size of one standard deviation, both regarding main effects and interactions. Clinical experience indicates that smaller effects are hardly worth detecting. Hence, failures to detect workday by group interactions are not primarily due to a lack of statistical power. It is more likely that practical circumstances may have reduced the possibilities of finding the expected negative effects of the 84-hour workweek. One should, for example, note that our study focused on the potential effects of accumulated fatigue on cognitive performance. It is possible that the 84-hour group would have reported more fatigue and performed worse if the measures had been obtained at the end of the shifts. Moreover, despite great effort to find similar worktasks, the electrocardiographic recordings showed that the 40-hour group worked at a higher percentage of their cardiovascular capacity. Since it seems reasonable to assume that workers who feel out of shape, or have a tendency to become easily fatigued, are unlikely to volunteer for this type of extreme work scheduling, one might think that this observation reflects selection bias due to fitness. However, in view of the groups' similar heart rates during rest, this seems not to be a plausible selection mechanism. It is more likely that other incentives have affected the allocation of the participants, for example, economic or social reasons. Nonetheless, even if major construction work is likely to be associated with a considerable degree of self-selection that may be caused by a wide range of factors that do not necessarily have medical relevance, one should note that the participants first and foremost were likely to be representative of their own, and similar, lines of work. 
As with our findings, however, many studies of compressed work scheduling have failed to find effects on performance (19). Our results stress the fact that long workhours are not automatically associated with performance decrements. It seems apparent that work content and context and employed strategies to execute work (individual as well as organizational) are important determinants of the impact of a work schedule on the individual worker's performance and health.
In conclusion, the expected consequences of fatigue build-up were not observed. The results showed that implementing a demanding 84-hour work regimen in response to the workers' requests does not necessarily create poorer cognitive performance than an ordinary 40-hour workweek. Working an extended work schedule of 84 hours cannot in the short term be considered to affect basic mental capabilities negatively.