Influence of errors in job codes on job exposure matrix-based exposure assessment in the register-based occupational cohort DOC*X

Substantial agreement was found between job-exposure-matrix-derived exposure estimates according to DISCO-88 codes based on self-reported job titles and registered in the Danish Occupational Cohort with eXposure data (DOC*X), with respect to airborne, mechanical, and physical exposures. Substantial agreement was also found between the two sets of DISCO-88 codes. The results are promising with respect to future studies based on the DOC*X.
Objective Job-exposure matrices (JEM) may be efficient for exposure assessment in occupational epidemiological studies, but they rely on valid job information. We evaluated the agreement between JEM-based exposure estimates according to self-reported job titles converted to DISCO-88 codes and according to register-based DISCO-88 codes in the Danish Occupational Cohort with eXposure data (DOC*X). Furthermore, we evaluated the agreement between these two sets of DISCO-88 codes.
Methods We used JEM regarding wood dust, lifting, standing/walking, arm elevation >90°, and noise from DOC*X. Participants from previous questionnaire studies were assigned JEM-based exposure estimates using (i) self-reported job titles converted to DISCO-88 codes and (ii) DISCO-88 codes registered in DOC*X, in four time periods (1976–78: N=7707; 1981–83: N=2193; 1991–94: N=2664; 2004: N=11 782). Agreement between the exposure estimates and between the DISCO-88 codes (digit levels 1–4) was evaluated by kappa (κ) statistics. Sensitivities were calculated using the self-reported observation as the gold standard.
Results We found substantial agreement (κ>0.60) between exposure estimates for all types of job-exposures and all time periods except for one κ. Low sensitivity (30–65%) was found for the period 1981–83, but for the other time periods the sensitivities varied between 60–91%. For individual 4-digit DISCO-88 codes, the sensitivities varied substantially, and overall the sensitivities increased with lower digit level of DISCO-88.
Conclusion The validity of the DISCO-88 codes in DOC*X was generally high. Substantial agreement was found for the JEM-based exposure estimates and the DISCO-88 codes per se, although the DISCO-88 code-specific agreement varied across digit levels and time periods.

The validity of occupational exposure estimates assigned to individuals by means of JEM depends on the quality of information about exposures in specific jobs in different time periods, as well as on correct job titles or occupational codes (7). The latter aspect of JEM validity is particularly important when occupational codes are retrieved from national registers, without occupational research as the primary objective. While the validity of exposures assigned by JEM has been examined in a number of publications (8-13), the validity of the job titles and occupational codes per se has seldom been examined (7,14). Incorrect occupational codes in registers may be the result of erroneous reporting from the primary sources (eg, tax agents, companies) and, if classification systems have changed over time, errors in translation from one classification system to another. Therefore, the validity of registered occupational codes may vary between industries and occupations and across time periods.
The Danish Occupational Cohort with eXposure data (DOC*X) is a nationwide cohort for occupational research containing occupational histories in terms of year-by-year codes according to the Danish version of the International Standard Classification of Occupations (DISCO) on an individual level from 1970 through 2015 with ongoing updates. DOC*X is an open research resource that provides opportunities to perform register-based epidemiological studies of occupational exposures by use of JEM (15). The validity of the DISCO codes in the nationwide registers, which form the foundation of DOC*X, has not been investigated.
The overall aim of this study was to evaluate the validity of DISCO codes in DOC*X. Specific aims were to evaluate (i) the agreement between JEM-based exposure estimates according to self-reported job titles converted to DISCO codes and according to register-based DISCO codes in DOC*X; and (ii) the agreement between these two sets of DISCO codes per se.

Methods
Danish Occupational Cohort with eXposure data (DOC*X)
DOC*X is a nationwide database including 6.4 million residents in Denmark from the age of 16, who have been gainfully employed at a private or public workplace in Denmark from 1970 through 2015 (15-17). The database has been compiled and is updated at a secured platform at Statistics Denmark. The backbone of the database is the information on occupation and industry, which includes calendar-year-specific DISCO-88 codes for each individual based on the 1970 Census (16) and the Employment Classification Module (1976-2015) (17). The Employment Classification Module has used three classifications: (i) a scheme developed by Statistics Denmark based on ISCO-68 (1976-1990), (ii) DISCO-88 (1991-2009), and (iii) DISCO-08 (2010 onwards) (15). In DOC*X, the different coding versions have been harmonized to DISCO-88 codes in a code-by-code manner as described previously (15). The codes vary in detail from 1- to 4-digit levels, of which the last-mentioned is the most detailed. The annual DISCO-88 code for each individual is defined by the job with the highest income during each calendar year. We extracted annual DISCO-88 codes by use of the personal identifier (18).

Population used for validation
For 1976-1994, we used occupational data from the Copenhagen City Heart Study (CCHS). In total, 19 698 men and women from the center of Copenhagen were randomly drawn from the Copenhagen Population Register. The sample was age-stratified within 5-year age groups from 35-70 years of age. All participants completed a self-administered questionnaire in 1976-1978, including a free-form question about current job title (N=14 223). Follow-up studies with information on job title were completed in 1981-83 (≥500 20-25-year-olds) and in 1991-94 (≥3000 20-49-year-olds) (19,20). The proportions that responded were 73.6% at baseline and 70.2% and 61.2% at follow-up. In the beginning of 2016, the job title text strings from the stored questionnaires were digitized and assigned DISCO-88 codes by three librarians, who worked independently. The codes were cross-checked and a supervising occupational health specialist resolved discrepancies.
For 2004, we used data from the ASUSI cohort of 14 266 men and women, who completed a questionnaire in a population-based study of working environment and sickness absence (ASUSI is a Danish acronym for working environment, sickness absence, premature exit from the labor market, social inheritance, and intervention) (21). Two trained sociologists digitized the job title text strings from the questionnaires and assigned DISCO-88 codes. Only persons who had been in employment for ≥80% of the time during the previous year or had been employed for 6 out of the 12 weeks preceding 1 July 2004 were included.

Assessment of occupational exposure intensities
We assessed five types of exposure using four JEM: Wood dust estimates were assessed using a wood dust JEM based on expert ratings and 12 704 measurements collected in 1978-2007 in wood related industries in six European countries (22,23). We dichotomized the exposures as non-exposed and exposed because wood dust exposure was rare in the study population.
Work with the arms elevated >90° was assessed using the Shoulder JEM, which is based on expert ratings by five Danish occupational health physicians with a minimum of 10 years of experience (29-32). The expert-rated estimates of time spent working with the arms elevated >90° (hours/day) have been validated against technical measurements (13). We divided the exposure estimates according to a previously used cut-off value for high exposure: 0=non-exposed, 1=medium exposed (>0-0.4 hours/day), and 2=highly exposed (≥0.5 hours/day) (32,33).
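The three-level categorization of arm elevation can be written as a small function. This is a minimal sketch, not the authors' SAS code: the function name is hypothetical, and it assumes that any estimate between 0.4 and 0.5 hours/day (a range the stated cut-offs leave open) falls in the medium category.

```python
def arm_elevation_category(hours_per_day: float) -> int:
    """Map hours/day with the arms elevated >90 degrees to category 0, 1 or 2."""
    if hours_per_day < 0:
        raise ValueError("hours_per_day must be non-negative")
    if hours_per_day == 0:
        return 0  # non-exposed
    if hours_per_day < 0.5:
        # medium exposed (>0-0.4 hours/day); values in (0.4, 0.5) are
        # placed here by assumption, the source does not cover them
        return 1
    return 2  # highly exposed (>=0.5 hours/day)
```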
Noise was assessed using the Noise JEM (35,36), which is based on personal dosimeter measurements of occupational noise exposure in the periods 2001-03 and 2009-10 among 1140 workers (1343 measurements) within the ten industries with the highest reporting of noise-induced hearing loss according to the Danish Working Environment Authority. The measurements represented 100 occupational titles according to the DISCO-88 system. Four experts rated the noise intensity levels for the remaining jobs using 35 benchmark groups. Their ratings were used to construct an expert score dependent on sex, age, and calendar time (34,35). We used the categorical variable for noise exposure (0=<80 dB, 1=80-84 dB, 2=≥85 dB), based on ISO-1999 thresholds (35,36).
We assigned exposure estimates to individuals in the CCHS/ASUSI cohorts with DISCO-88 codes for which a JEM exposure estimate was available. The estimates were assigned by connecting the JEM with their calendar-year specific DISCO-88 codes based on self-report and their DISCO-88 codes in DOC*X for the specific calendar year.
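The linkage step can be pictured as two lookups per person in the same JEM, one keyed on the self-report-based code and one on the registered code, keeping only persons for whom both lookups succeed. The sketch below is illustrative only: the JEM contents, field names, and function names are invented for the example and do not come from DOC*X.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical JEM: (DISCO-88 code, calendar year) -> exposure category.
# The codes and categories are made up for illustration.
Jem = Dict[Tuple[str, int], int]

EXAMPLE_JEM: Jem = {
    ("7124", 1977): 2,
    ("4115", 1977): 0,
    ("4190", 1977): 0,
}


def assign_exposure(code: Optional[str], year: int, jem: Jem) -> Optional[int]:
    """Return the JEM category for a code and year, or None if unavailable."""
    if code is None:
        return None
    return jem.get((code, year))


def linked_pairs(persons: List[dict], jem: Jem) -> List[Tuple[int, int]]:
    """Build (self-report exposure, register exposure) pairs.

    Persons without a JEM estimate for either code are dropped.
    """
    pairs = []
    for person in persons:
        self_exp = assign_exposure(person["self_code"], person["year"], jem)
        reg_exp = assign_exposure(person["register_code"], person["year"], jem)
        if self_exp is not None and reg_exp is not None:
            pairs.append((self_exp, reg_exp))
    return pairs


persons = [
    {"self_code": "7124", "register_code": "7124", "year": 1977},
    {"self_code": "4115", "register_code": "4190", "year": 1977},
    {"self_code": "4115", "register_code": None, "year": 1977},    # no register code
    {"self_code": "9999", "register_code": "7124", "year": 1977},  # code not in JEM
]
pairs = linked_pairs(persons, EXAMPLE_JEM)
```

The resulting pairs are the raw material for a cross-tabulation of the two exposure assignments.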

Statistical methods
From both cohorts (CCHS and ASUSI) and each time period, we excluded persons who stated that they were unemployed or had retired. For each exposure and time period, the final population included only individuals with both sets of DISCO-88 codes and only DISCO-88 codes with ≥10 self-reported observations (37). Furthermore, we only included observations where JEM-based exposure estimates were available for both sets of codes.
We computed kappa coefficients (κ) with 95% confidence intervals (CI) for exposures with two exposure categories (wood dust) and weighted κ with 95% CI for exposures with three exposure categories (all other exposures). Additionally, in the 3×3 tables, we computed sensitivity (the percentage of true exposure categorizations for the highest exposed individuals) and specificity (the percentage of true exposure categorizations for the non-exposed individuals) based on self-report as the gold standard. This means that the medium exposed groups were not included in the interpretation of sensitivity and specificity. We also assessed the sensitivity and agreement (weighted κ) between the DISCO-88 codes per se (specificity was not assessed because it would always be very high due to the low frequency of persons in any DISCO-88 group compared to the total number of persons in the study). Sensitivity was calculated as the percentage of true registrations within each DISCO-88 code digit level (1-4), taking the DISCO-88 codes based on self-report as the gold standard. In addition to the agreement at 1-, 2-, 3-, and 4-digit levels, we computed weighted κ coefficients by time period (1976-78; 1981-1983; 1991-1994; 2004) at DISCO-88 1-digit level (DISCO-88 major groups). We interpreted the κ coefficients as: <0=poor, 0.00-0.20=slight, 0.21-0.40=fair, 0.41-0.60=moderate, 0.61-0.80=substantial, and 0.81-1.00=almost perfect agreement (38). SAS software, version 9.4, (SAS Institute Inc, Cary, NC, USA) was used.
Table 1 presents the number of DISCO-88 codes according to time period, including all digit levels of DISCO-88 (based on self-reported job titles), that met the inclusion criterion of a minimum of ten observations in our final study dataset. These codes represented 29-56% of the total number of codes, including all digit levels of the DISCO-88 system, with the lowest percentage in 1991-94 and the highest in 2004. The number of individuals in each time period is also shown; their distribution across DISCO-88 groups is presented in supplementary table S1, www.sjweh.fi/show_abstract.php?abstract_id=3857.
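The agreement measures described in the statistical methods can be illustrated with a short pure-Python sketch. It is not the SAS code used in the study. It assumes linear disagreement weights for the weighted κ (the text does not state the weighting scheme), it reads sensitivity and specificity off the highest-exposed and non-exposed rows of a table whose rows are the self-report-based categories, and it omits confidence intervals. The function names and example counts are invented.

```python
from typing import List, Tuple

# Square table of counts: rows = self-report (gold standard), columns = register
Table = List[List[int]]


def kappa(table: Table, weighted: bool = True) -> float:
    """Cohen's kappa for a k x k table; linear weights when weighted=True."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        if not weighted:
            return 1.0 if i == j else 0.0
        # Assumed linear weights: full credit on the diagonal,
        # partial credit for adjacent categories
        return 1.0 - abs(i - j) / (k - 1)

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    expected = sum(weight(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / n**2
    return (observed - expected) / (1.0 - expected)


def sensitivity_specificity(table: Table) -> Tuple[float, float]:
    """Sensitivity from the highest-exposed row, specificity from the non-exposed row."""
    sensitivity = table[-1][-1] / sum(table[-1])
    specificity = table[0][0] / sum(table[0])
    return sensitivity, specificity


# Invented 3x3 example (categories 0, 1, 2)
example = [
    [80, 15, 5],
    [10, 30, 10],
    [5, 10, 35],
]
sens, spec = sensitivity_specificity(example)
kappa_weighted = kappa(example)
kappa_unweighted = kappa(example, weighted=False)
```

For a 2×2 table such as the dichotomized wood dust exposure, the linear weights reduce to 0 and 1, so the weighted and unweighted coefficients coincide.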

Results
As seen in table 2, our data showed substantial agreement between JEM-based exposure estimates according to the two sets of DISCO-88 codes based on self-reported job titles and registrations in DOC*X, except for noise in 1981-83. Across time, both the sensitivities and κ estimates were lowest for the time period 1981-83. Overall, the specificities were high, showing substantial agreement for the non-exposed individuals.
Table 3 shows that the agreements between the two sets of DISCO-88 codes were substantial across 1-, 2-, 3-, and 4-digit levels. The highest κ estimates were seen for the 4-digit DISCO-88 group level with estimates between 0.73-0.81. The sensitivities varied between 51.5-73.2% and were highest for the 1-digit DISCO-88 level.
As seen in table 4, the DISCO-88 code-specific agreement at 1-digit level varied from fair to almost perfect across time periods (κ=0.34-0.91). Group 0 (armed forces) had almost perfect agreement, whereas group 1 with legislators, senior officials, and managers showed the lowest agreement; no time trends were evident. The sensitivities generally showed the same pattern as the κ-values.

Discussion
Job titles and occupational codes constitute a crucial basis for the use of JEM, but errors in job titles and assignment of occupational codes have received minimal scientific attention. The present study benefitted from exposure data from JEM concerning five airborne, mechanical, and physical exposures. Self-reported job titles for the CCHS/ASUSI cohorts were translated into DISCO-88 codes, which were connected with the JEM to provide exposure estimates, which were then compared to JEM-based exposure estimates according to DISCO-88 codes registered in DOC*X. High sensitivities and substantial agreement were found for the JEM-based exposure estimates and for the DISCO-88 codes per se, although the DISCO-88 code-specific agreement varied across digit levels and across time periods.
The number of individuals in the study population from 1991-94 was low since only about one third of the individuals with a self-reported job title had a DISCO-88 code in DOC*X. One explanation may be that the mean age of the population increased with calendar time, as the main part of the population was included in 1976 at an age of up to 70 years. For example, persons who retired from the workforce before 1991 have no DISCO code registered in the DOC*X database for the time period 1991-94. The classification system used by Statistics Denmark changed in 1981 and 1993, which may be an explanation for the lower agreement observed in the period 1981-83, and again in 1991-94. In 1981-83, the classification system was less detailed than the DISCO-88 system. This means that it was very difficult to translate specific job groups from that time period to DISCO-88 codes. Therefore, discrepancies between DISCO-88 codes may be due to translation difficulties rather than actual differences between jobs. Because of the less detailed job groups in 1981-83, the solution was to translate job titles to less detailed DISCO-88 group levels. The system for code assignment also changed in 1991, when the DISCO-88 classification system was introduced by Statistics Denmark. The DISCO-88 was based on the ISCO-88. Before 1991, the occupational codes were assigned by trained coders at Statistics Denmark based on self-reported information and union membership, but from 1991 the system was automated and based on tax records and other personal register information. This shift in code assignment led to a temporary reduction of data reporting, which probably also contributed to the low number of individuals in the final study population for 1991-94.
The variation across DISCO-88 codes probably reflected variations in the accuracy by which DISCO codes are reported to the central authorities. Reporting to Statistics Denmark from large public and private companies is undertaken by trained staff according to written guidelines, while small private companies with fewer resources may provide less accurate DISCO codes. It is only mandatory for Danish companies with ≥10 employees to report information on occupation, and therefore significant differences in accuracy may be expected. The misclassification of JEM-based individual exposures assigned by using DISCO-88 codes in DOC*X seems less than might be expected based on comparison of the sensitivities for the DISCO-88 codes per se; overall, the sensitivities were higher when comparing JEM-based exposure estimates than when comparing the two sets of DISCO-88 codes (especially at the 3- and 4-digit levels). This is because DISCO-88 codes belonging to similar job groups in the JEM are assigned similar job-exposures (7,14). For example, the noise JEM will assign the same low level of noise exposure to all types of office workers regardless of the specific DISCO-88 code. Lack of agreement between two sets of DISCO-88 codes will therefore not necessarily affect the agreement between JEM-based exposure estimates.
The variation in agreement between the two sets of individual DISCO-88 codes seems to depend on characteristics of the jobs covered by the code. In general, the codes with lowest sensitivities are broadly defined and not specified, eg, business services agents and trade brokers not elsewhere classified, other teaching associate professionals, and finance and sales associate professionals not elsewhere classified. The two last-mentioned groups will probably be classified as other kinds of office workers, which will reduce the effect of the misclassification on the assigned JEM-based exposure estimates (see above). Another possibility is to exclude DISCO codes with low sensitivities in epidemiological studies (at least in sensitivity analyses) as they may increase the risk of misclassification of exposures. Thus, the actual validity of the DISCO-codes per se may be significantly higher in cleaned data prepared for analysis.
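A toy example may make this point concrete: code pairs that disagree at the 4-digit level can still agree at a coarser digit level, or map to the same exposure category. The sketch below uses an invented noise lookup and invented code pairs, assumes all codes are recorded at the 4-digit level, and computes a simple overall share of matching pairs rather than the per-code sensitivities reported in the tables.

```python
from typing import Callable, List, Tuple

# Hypothetical noise categories per DISCO-88 code (invented for illustration)
NOISE_JEM = {"4115": 0, "4190": 0, "4121": 0, "7124": 2, "8141": 2}

# (self-report code, register code) pairs, also invented
code_pairs: List[Tuple[str, str]] = [
    ("4115", "4190"),  # different office codes, same exposure category
    ("4121", "4121"),
    ("7124", "7124"),
    ("7124", "8141"),  # different major groups, same exposure category
    ("4190", "7124"),  # disagreement in both code and exposure
]


def share_matching(pairs: List[Tuple[str, str]], key: Callable[[str], object]) -> float:
    """Share of pairs whose two codes are equal after applying `key`."""
    matches = sum(1 for self_code, reg_code in pairs if key(self_code) == key(reg_code))
    return matches / len(pairs)


four_digit = share_matching(code_pairs, lambda code: code)        # exact code
one_digit = share_matching(code_pairs, lambda code: code[:1])     # major group
exposure = share_matching(code_pairs, lambda code: NOISE_JEM[code])
```

In this toy data the share of matching pairs rises from the 4-digit level to the 1-digit level and is highest for the assigned exposure category.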

Strengths and limitations
One strength of our study is that we have data from four different time periods during a 24-year long period where Statistics Denmark used different classification systems of occupations in their registers. Furthermore, we have access to self-reported job titles. It may be questioned if self-reported job titles converted to DISCO-88 codes can be taken as a gold standard, but self-reported information on the current job is generally considered to have high validity (14,39).
Table 2. Sensitivity, specificity, and agreement between occupational exposures assigned by job-exposure matrices (JEM) according to self-reported job titles converted to DISCO-88 codes and according to DISCO-88 codes registered in DOC*X. a The percentage of true registrations for the highest exposed individuals. b The percentage of true registrations for the non-exposed individuals. c Dichotomized (non-exposed/exposed). d Observations from the Copenhagen City Heart Study. e Observations from the ASUSI study (ASUSI is a Danish acronym for working environment, sickness absence, premature exit from the labor market, social inheritance, and intervention). f For wood dust the κ and 95% CI are not weighted.
Table 4 footnotes. a DISCO-88 major groups (list as recovered): 5=Service workers and shop and market sales workers; 6=Skilled agricultural and fishery workers; 7=Craft and related trades workers; 8=Plant and machine operators and assemblers; 9=Elementary occupations. b Number of observations with two sets of DISCO-88 codes at major (1-digit) group level. c The proportion of true registrations within each major DISCO-88 group based on self-reported job title as the gold standard. d Agreement between registered DISCO-88 codes in DOC*X and self-reported job titles converted to DISCO-88 codes based on the Copenhagen City Heart Study. e Agreement between registered DISCO-88 codes in DOC*X and self-reported job titles converted to DISCO-88 codes based on the ASUSI Cohort.
One limitation of our study is that we have no self-reported job titles from the years after 2004, and therefore no validation has been performed on DOC*X registrations from 2005 onwards. This limitation particularly pertains to DISCO-88 codes after the time point when Statistics Denmark introduced the DISCO-08 system in 2010 (15). Another limitation is that the DISCO-88 codes, which were available for validation, only represented around half of the codes in the DISCO-88 system so that only frequent occupational titles were validated at the 4-digit level. If the agreements are lower for rare DISCO-88 codes, we may have overestimated the general validity of the DISCO-88 codes in DOC*X. On the other hand, the sensitivities did not seem to depend on the number of observations (all ≥10) per DISCO code.
In our analyses of agreement between exposure levels, we used categorical variables with two or three categories. The JEM exposures for wood dust and noise only exist as categorical variables while the other JEM contain continuous measures, which we categorized to ensure comparability. It may be a limitation that we only validated the DISCO-88 codes based on categorical variables instead of using continuous scales. We chose to focus on the lowest and highest exposure categories to examine whether they were correctly categorized. To the extent that DISCO-88 codes in DOC*X are misclassified so that highly exposed are categorized as medium or non-exposed, the data would not be of a quality that allows future exposure-response analyses.

Validity of DISCO-88 codes in future DOC*X studies
This study concerned selected airborne, mechanical, and physical exposures, and it remains open whether the validity of DISCO-88 codes in DOC*X is similar for other exposures, eg, chemicals. The validity varied across 4-digit DISCO-88 codes and time periods, which should be considered when planning studies in DOC*X. DOC*X also covers industry codes from 1976 and onwards (15) and it can be relevant to use those industry codes together with the DISCO-88 codes to reduce the risk of misclassification of occupations.

Concluding remarks
The validity of the DISCO-88 codes in DOC*X was generally high. Substantial agreement was found for the JEM-based exposure estimates and group-based DISCO-88 codes per se, although the DISCO-88 code-specific agreement varied across digit levels and time periods.

Funding
The Danish Working Environment Research Fund funded this study (grant no.: 43-2014-03 / 20140016763). The funding source played no role in the (i) study design, (ii) the collection, analysis and interpretation of the data, (iii) the writing of the report, or (iv) the decision to submit the paper for publication.