Scientific Data Documentation
OPCS, English Mortality File, 1986-1990
ABSTRACT
Description of the Data Files
Numbers of deaths
The OPCS historic deaths file contains numbers of deaths in England
and Wales by year, sex, age and ICD cause. The international
classification of diseases was introduced in 1900 and has been used
in the tabulation of mortality in England and Wales since 1911.
During the period 1901-1910, when the first revision of the ICD was
used in some countries, a comparably modern (but unnumbered)
classification was used in England and Wales. In
all there have been nine revisions of the international
classification. Each of these provides a coding frame for cause of
death which differs, to varying degrees, from the previous
revision. The years for which each ICD revision was used in the
tabulation of England and Wales mortality data (or a corresponding
local equivalent) are shown in Table 1. It should be emphasized
that no attempt has been made to include on the data file deaths
prior to 1901, although deaths were tabulated for all of England
and Wales from July 1837 using an earlier series of national coding
frames.
While changes in ICD revision have influenced the way in which the
data are stored, the contents of each year's data have been
constrained by a second pragmatic consideration. Material on
individual deaths, coded for tabulation purposes, is only available
as far back as 1959, when computerized record keeping was
introduced by OPCS for statistical tabulations of mortality. Prior
to that date, no regular series of coded or unpublished tabular
material was retained in any form.
In consequence, numbers of deaths for the years 1901-1958 have been
obtained by simply transcribing, on to computer, figures published
in the Registrar General's Annual Reviews. For the years 1921
onwards the figures used were those in Table 17 of the Annual
Review; in earlier years figures were copied from an equivalent
unnumbered table in the annual abstracts. This method of data
extraction placed self-evident limitations on the degree of
disaggregation of the data transcribed on to tape; namely that in
respect of age and cause groupings, the tapes cannot provide more
detail than was published at the time.
By contrast, data on the historic death tape for years after 1958
were compiled from computer tapes used in the production of annual
reference volumes. This provided greater flexibility in the choice
of cause groupings and it was possible to store the data held on
file according to individual four digit ICD codes, irrespective of
whether these appeared in routine publications at the time.
These two methods of constructing the database also required
different validation techniques and, correspondingly, different
means of resolving quality check queries. The details of these are
discussed in sections 5-7.
As indicated above, the choice of age groupings held on tape was
determined by the availability of published material in earlier
years; in later years a standard quinquennial age grouping was used
(sub-dividing deaths under one and those at ages one to four and
aggregating those at 85 and over). The details of the age groups
available in each ICD revision are shown in Table 2.
As far as the other variables on file (year of death and sex) are
concerned, the data are held for individual years and for each sex
separately.
Population Estimates
From time to time OPCS (or, formerly, the General Register Office)
revises its estimates of the population at risk of dying in England
and Wales in each year. This is done so as to incorporate
information which becomes available after the publication of
initial estimates in annual reference volumes. The major sources
of additional information are the counts of population at
subsequent censuses. To accompany the historic deaths tape, a tape
of comparable population estimates has been prepared from the most
recently revised figures released by OPCS.
This tape contains population estimates by sex, year and age in the
same format as the death tapes. Information is held for each year
and each sex separately. Age groupings for each year correspond to
those used for the storage of death data for that year.
The source of these population estimates for each year is shown in Table 3.
Structure of the Files
Each count held on the death file is stored in a separate record,
referenced by the cause, sex, age and year to which it refers.
Where no deaths occurred in a particular year for a specific cause,
sex and age combination, no record is held on file. All counts held
on file are thus positive integers. The data for each ICD revision
are sorted by year, then by
cause, then by sex and finally by age group. The layout of each
record is given in Table 4.
The population data are held in an identical format to that used
for numbers of deaths. However, the cause variable is set to zero
on all records, so that the file contains one record for each year,
sex and age combination, and is sorted in the latter order.
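Because the two files share the same year, sex and age keys, a death rate can be obtained by matching each death record to its population record. The following Python sketch illustrates the idea only. The dictionary layout, the function name `death_rates` and the per-100,000 scaling are assumptions made for the example and are not part of the file specification.

```python
from typing import Dict, Tuple

DeathKey = Tuple[str, int, int, str]   # (cause, year, sex, age_code)
PopKey = Tuple[int, int, str]          # (year, sex, age_code)


def death_rates(deaths: Dict[DeathKey, int],
                population: Dict[PopKey, int],
                per: int = 100_000) -> Dict[DeathKey, float]:
    """Return deaths per `per` population for every death record
    that has a matching population record."""
    rates = {}
    for (cause, year, sex, age), count in deaths.items():
        pop = population.get((year, sex, age))
        if pop:  # skip keys with no (or zero) population estimate
            rates[(cause, year, sex, age)] = per * count / pop
    return rates
```

A cause, sex and age combination that is absent from the deaths file corresponds to zero deaths, so its rate is zero rather than missing.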
BACKGROUND
In the analysis of any particular set of mortality data, a pivotal
role is frequently played by national death rates by age, sex and
cause. For example, the analysis of cause specific time trends and
their correlates generally draws directly upon data of this sort;
instances of this type of application in the literature include
studies of lung cancer, stomach cancer, heart disease and suicide.
At a broader level, international comparisons utilize the rates of
several nations in order to make meaningful inferences about
possible causal associations (such as the role of diet, alcohol or
smoking in particular diseases). By contrast, local mortality
studies call upon national rates to provide a reference set of
background mortality levels against which local experience can be
measured. In this context, the term `local' can usefully be
broadened to encompass the study of any subset or sub-division of
the national population; the application of standard national death
rates to subsets of the population provides one of the most
frequently used techniques in such analyses. Topics covered
include geographic comparisons, occupational comparisons and the
prospective follow-up of a wide variety of possible risk groups.
Each of these applications requires a tedious set of calculations
to be performed. As access to computers becomes more widespread
these calculations are almost invariably automated. However, the
extent to which this can be done is dependent on the availability
of national rates on computer.
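One common form of the calculation referred to above is indirect standardization, in which national age specific rates are applied to a local population to give an expected number of deaths. The Python sketch below assumes rates expressed per person and hypothetical dictionaries keyed by age group; it is offered as an illustration and is not a procedure laid down in this documentation.

```python
from typing import Dict


def expected_deaths(national_rates: Dict[str, float],
                    local_population: Dict[str, float]) -> float:
    """Sum of national rate x local population over age groups."""
    return sum(national_rates[age] * pop
               for age, pop in local_population.items())


def standardized_mortality_ratio(observed: float,
                                 national_rates: Dict[str, float],
                                 local_population: Dict[str, float]) -> float:
    """Observed deaths as a percentage of expected deaths."""
    return 100.0 * observed / expected_deaths(national_rates, local_population)
```

A ratio above 100 indicates more deaths locally than the national rates would predict for a population of that age structure.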
At an international level, WHO showed considerable foresight in
setting up a database in the mid seventies which contained death
rates for a number of countries by age, sex and a short list of
causes. This data file extends back to 1950 and has been
distributed to a number of institutions throughout the world.
Despite the boldness of this concept, the file has two limitations
as far as applications at a national and sub-national level are
concerned. First, the range and specificity of the available causes
are limited to what was readily comparable over time and between
countries. Secondly, the time span covered by the data is limited
to recent history; many developed countries published comparable
mortality statistics for a number of years before 1950. So as to
extend the range of analyses that could be performed directly on
computer, a number of similar but more detailed files were set
up at various institutions throughout the world.
OPCS, in reviewing both its own needs and the contribution which it
could make to the greater availability of national death rates on
computer, decided in 1979 that, rather than devoting scarce
analytic resources to constructing another synthetic database, it
would use its unique position to construct a database comprising
only the basic building bricks for constructing any aggregate
database. In this instance the basic components of the database
comprise numbers of deaths, held to the lowest level to which cause
was routinely coded. For recent years, this comprises four digit
codes laid down in the International Classification of Diseases
(ICD) operational at the time the death was registered. The
calculation of rates is made possible with this set of data by the
provision of a comparable tape of estimates of population at risk.
VARIABLES
Year
Individual years are held on file with each represented by
the final two digits of the year number; thus 1967 appears as 67.
Sex
Figures appear separately for each sex on the file; the code for
males is 1 and that for females is 2.
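The year and sex codes can be decoded as in the Python sketch below. It assumes that every two digit year falls in the twentieth century, which is consistent with a file that begins in 1901; the function names are illustrative.

```python
SEX_LABELS = {"1": "male", "2": "female"}


def decode_year(yy: str, century: int = 1900) -> int:
    """Expand a two digit year code, e.g. '67' -> 1967.
    The default century is an assumption, not part of the file."""
    return century + int(yy)


def decode_sex(code: str) -> str:
    """Translate the one character sex code into a label."""
    return SEX_LABELS[code]
```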
Age
For most years, data are held in the following age groups: under
one, one to four, five year age groups from five to 84 and then all
aged 85 and over. Exceptions to this rule are the years 1901-1910
and those covered by the third, fourth and part of the fifth
revisions of the ICD (1921-1941). During the former period, at
ages 25-84 figures are stored in ten year age groups (rather than
five year groups) while, for the latter period, figures at ages 80
and over are not sub-divided on the deaths tape for any of the 21
years and are not sub-divided on the population tape for the years
1921-1939. In other respects, data by age in these time periods are
stored identically to that in other time periods.
Reference values used on the data tape for each age group are held
in coded form. The codes used for each of the 26 possible age
groupings appearing on the file are shown in Table 5.
CAUSE OF DEATH CODES
Coding Scheme
Since the introduction of the sixth revision of the ICD, cause
codes have been represented in a broadly similar manner in each
revision of the classification. The first three digits of the four
digit code indicate the cause group to which the death is assigned;
the fourth digit is then utilized, for some but not all cause
groups, to provide further amplification of this basic
classification. The first three digits are always represented as a
number between 000 and 999. OPCS convention in publications is to
represent the last digit by either a number when the sub-division
is shown or by a hyphen when no sub-division is shown in the ICD.
For the convenience of those producing computer tabulations, on the
historic deaths file for ICD revisions six to nine, hyphens have been
replaced by zeros. The computer codes are thus purely numeric in the
range 0003-9999.
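The conversion from a published code to a computer code for revisions six to nine can be sketched in Python as follows. The accepted input forms (three digits, an optional decimal point, then a digit or a hyphen) are an assumption about how published codes might be typed in, and the function name is illustrative.

```python
import re

# Three digits, optional point, then a fourth digit or a hyphen.
_PUBLISHED_CODE = re.compile(r"^(\d{3})\.?([0-9-])$")


def to_computer_code(published: str) -> str:
    """Convert e.g. '410-' or '162.1' to a four digit numeric code."""
    match = _PUBLISHED_CODE.match(published.strip())
    if match is None:
        raise ValueError(f"unrecognised cause code: {published!r}")
    stem, fourth = match.groups()
    # A hyphen (no sub-division shown) is stored as zero.
    return stem + ("0" if fourth == "-" else fourth)
```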
According to these four revisions, deaths which are not due to
natural causes may be assigned two codes; one covering the external
cause of injury and the other the nature of injury. To avoid the
possibility that deaths may be double counted, only the counts for
external causes are included on the historic deaths file. Codes in
the range 8000-9999 thus refer to causes listed in the ICD volumes
under the chapter covering external causes of injury.
Causes of death according to the second, third, fourth and fifth
revisions of the ICD were classified to fewer than 200
groups, numbered from one upwards. Some of these groups were then
sub-divided as in later revisions using, in these instances, a
combination of alphabetic and numeric subscripts. In order to
achieve comparability with later revisions, the major cause
grouping (belonging to the list of approximately 200 causes) has
been transcribed directly as a three digit numeric code and the
various possible sub-divisions numbered from one upwards. To
provide a purely numeric fourth digit the absence of any sub-
division is indicated by a zero. The entire four digit computer
code for ICD revisions two to five is thus purely numeric.
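The general rule for revisions two to five can be expressed as a short Python sketch. It takes the ordinal position of the sub-division as input, since the correspondence between the original alphabetic or numeric subscripts and these positions is a matter for the conversion chart; the function name is illustrative.

```python
def early_revision_code(group: int, subdivision: int = 0) -> str:
    """Build the four digit computer code for ICD revisions two to five.

    group       -- major cause group number (transcribed as three digits)
    subdivision -- 1..9 for the first to ninth sub-division,
                   0 when the group has no sub-division
    """
    if not 1 <= group <= 999:
        raise ValueError("group must fit in three digits")
    if not 0 <= subdivision <= 9:
        raise ValueError("only nine sub-divisions can be represented")
    return f"{group:03d}{subdivision}"
```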
A chart, giving approximate conversions between ICD codes and
computer codes, is given in Table 6. One potential difficulty
created by this method of conversion is that the technique only
works for causes with nine or fewer sub-divisions. In only one
instance did this actually prove to be a problem. In the fifth
revision of the ICD, there were 14 sub-divisions of code 157,
congenital malformations. A purely pragmatic solution to this was
achieved by using 1571-1585 (excluding 1580) to represent these
sub-divisions; where congenital debility, 158, should have been
allocated a code 1580, it was instead allocated a code 1589, to
avoid overlap of the two causes. Tabulation of mortality from
congenital anomalies in the years 1940-49 should therefore be
undertaken with extreme caution.
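The special treatment of codes 157 and 158 in the fifth revision can be sketched in Python as below. The text states only which range of computer codes was used; the assumption that the 14 sub-divisions were assigned to that range in sequential order is made here for illustration and should be verified against the conversion chart before use.

```python
CONGENITAL_DEBILITY_CODE = "1589"  # code 158, moved from 1580


def congenital_malformation_code(subdivision: int) -> str:
    """Map sub-division 1..14 of fifth revision code 157 to
    1571-1579 and 1581-1585 (sequential order is assumed)."""
    if not 1 <= subdivision <= 14:
        raise ValueError("code 157 has 14 sub-divisions")
    if subdivision <= 9:
        return str(1570 + subdivision)
    return str(1580 + (subdivision - 9))


def is_congenital_malformation(code: str) -> bool:
    """True for fifth revision computer codes 1571-1585, excluding 1580."""
    value = int(code)
    return 1571 <= value <= 1585 and value != 1580
```

A tabulation of congenital malformations for fifth revision years would therefore select on the whole range and exclude 1589, rather than select on the leading three digits 157.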
During the period 1901-1910, an unnumbered list of causes was used
in England and Wales in place of the new International
Classification which had not yet been widely adopted. On the
historic mortality data file, the lowest level of disaggregation of
cause has simply been numbered from 001-191, as shown in Table 6.1.
In general a fourth digit of zero has been used to avoid
unnecessary complexity. However one (little used) category,
"other specified diseases", has been given a code 1741 to enable it
to be distinguished both from identifiable specified cause
groupings and from the (then rather large) group of vague or
unspecified causes of death.
Missing values
A cautionary note concerns the years 1901-1958, when data
were obtained from the Annual Review Volumes. Where the ICD
provided a sub-division of a cause but none was published in one or
more years or where the sub-division was only partially utilized,
deaths in those years have been arbitrarily assigned to one of the
unused sub-divisions of that cause.
The details of codes and years affected are shown in Table 7.
CONSISTENCY CHECKS ON THE DATA
Using some simple methods of checking, an effort has been made to
ensure that the data are internally consistent and are in close
agreement with the figures published at the time. At this distance
removed in time, it has not always been possible to achieve exact
agreement, as indicated below.
In terms of quality, three time periods may be distinguished. From
1968, it was possible to utilize computer summaries already created
for routine tabulation purposes. The only checks considered
necessary for these years were to ensure that every four digit cause
had been carried across and to make spot checks on the
values carried across. These checks have been made and have
revealed only one problem in so far as the final database is
concerned. During the course of the eighth revision, changes were
made to the OPCS implementation of the ICD. These involved the
introduction of additional fourth digit subdivisions in 1975-6.
These changes are described in Table 8 and call for caution when
attempting detailed fourth digit tabulations
covering 1968-1978.
Between 1959 and 1967, it was necessary to re-tabulate counts of
deaths from computer records of individual deaths. These computer
tapes are held in archived form and have been subject to re-copying
as old tapes have deteriorated and to reformatting as old computers
have been replaced. A certain amount of data corruption and loss
must be considered inevitable in these circumstances. Table 9
shows the magnitude of discrepancies between figures on the
database and those published at the time by age, sex and year.
These are considered tolerable for statistical purposes. At a
cause specific level, similar error rates are found.
A somewhat different method of validation was employed in the
checking of manually transcribed figures for 1901-1958. For each
year, the computer summed age specific figures for each four digit
cause and checked these against the corresponding published figure
for all ages. At the same time, cause specific figures for each
age group were summed to produce totals for each ICD chapter.
These were then checked against corresponding published figures.
By this dual checking procedure errors were pinpointed and
corrected. Consequently, on the final tape each cause contributes
the correct number of deaths at all ages and figures by age are
correct for each chapter. Despite the cross-checking that was
undertaken, it is possible that some mutually compensating errors
are present within the sub-matrices defined in
this way. However, it is unlikely that any such errors are of
sufficient magnitude to distort any statistical analysis performed
using these data.
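The dual checking procedure for transcribed years can be sketched in Python as follows. The data structures (dictionaries keyed by cause and age code, and a mapping from cause to chapter) and the function names are hypothetical; the sketch shows the two marginal checks and is not a reconstruction of the program actually used.

```python
from collections import defaultdict
from typing import Dict, Tuple


def check_cause_totals(counts: Dict[Tuple[str, str], int],
                       published: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Sum counts over age for each cause and report every cause whose
    sum differs from the published all-ages figure as (computed, published)."""
    sums = defaultdict(int)
    for (cause, _age), n in counts.items():
        sums[cause] += n
    return {c: (sums.get(c, 0), t) for c, t in published.items()
            if sums.get(c, 0) != t}


def check_chapter_totals(counts: Dict[Tuple[str, str], int],
                         chapter_of: Dict[str, str],
                         published: Dict[Tuple[str, str], int]
                         ) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Sum counts over causes within each (chapter, age) cell and report
    every cell that differs from the published chapter figure."""
    sums = defaultdict(int)
    for (cause, age), n in counts.items():
        sums[(chapter_of[cause], age)] += n
    return {k: (sums.get(k, 0), t) for k, t in published.items()
            if sums.get(k, 0) != t}
```

Both checks constrain only the margins of each chapter's cause-by-age table, which is why mutually compensating errors within a chapter can pass undetected.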
A minor problem was revealed for the years 1934, 1938 and 1939 in
the course of checking the data. Some 265 deaths in a colliery
disaster in 1934 were not registered until 1938-39. The figures
relating to these deaths were excluded from Table 17 in both the
year of occurrence and of registration. For the sake of
completeness it was decided to include them on the historic
mortality data file, but to allocate them, contrary to normal
practice, to the year of occurrence (namely 1934). They have been
allocated cause code 1942.
Inevitably in a checking procedure such as that described above,
occasional printing errors are uncovered in published figures.
Where this has occurred, figures on the historic deaths file have
been adjusted to achieve the internal numerical consistency
rather than agreement with evidently incorrect published figures.
However, no attempt has been made to reclassify deaths for any
reason. Thus changes in coding rules from year to year or any
error in the compilation of cause statistics in isolated years all
appear on this tape, in full agreement with figures published at
the time.
TABLES
List of Tables
Table Title
1 Years for which each ICD revision was
implemented in England and Wales
2 Age groups available for the periods covered by
each ICD revision
3 Sources from which population estimates
computer tape was derived
4 Computer record layout
5 Computer codes for age groups
6 Computer codes for cause of death in 1st Revision ICD
Table 1
Table 1 Years for which each ICD revision was implemented in England & Wales
ICD Revision    Years
1+              1901-10
2*              1911-20
3*              1921-30
4*              1931-39
5*              1940-49
6               1950-57
7               1958-67
8               1968-78
9               1979-
+ An unnumbered list was used in England and Wales rather than the
international classification during this period
* As amended for use in England and Wales
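Since the coding frame depends on the year, analyses spanning several decades generally need to look up the revision in force. A minimal Python sketch of Table 1 as a lookup follows; the function name is illustrative and four digit years are assumed.

```python
# (revision, first year, last year); None marks the open-ended ninth revision.
REVISION_YEARS = [
    (1, 1901, 1910), (2, 1911, 1920), (3, 1921, 1930),
    (4, 1931, 1939), (5, 1940, 1949), (6, 1950, 1957),
    (7, 1958, 1967), (8, 1968, 1978), (9, 1979, None),
]


def icd_revision(year: int) -> int:
    """Return the ICD revision (or local equivalent) used for `year`."""
    for revision, first, last in REVISION_YEARS:
        if year >= first and (last is None or year <= last):
            return revision
    raise ValueError(f"no revision listed for {year}")
```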
Table 2
Table 2 Age groups available for the periods covered by each ICD revision
ICD Revision    Age groups
1               Under 1, 1-4, 5-9, 10-14, 15-19,
                20-24, 25-34, 35-44, 45-54, 55-64,
                65-74, 75-84, 85+
2               Under 1, 1-4, 5-9 and 5 year age
                groups to 80-84, 85+
3,4             Under 1, 1-4, 5-9 and 5 year age
                groups to 75-79, 80+
5-9+            Under 1*, 1-4, 5-9 and 5 year age
                groups to 80-84, 85+
+ For the years 1940-41 age groups used for mortality are as for
ICD revisions 3 and 4; that is to say, for these two years the
highest available age group is 80+ rather than 85+.
* Neonatal deaths by cause are not included on the tape from 1984
onwards. The all-cause total for neonatal deaths is held under ICD code 0000.
Table 3
Table 3 Sources from which population estimates computer tape was derived
Years Source of data
1901-1910 Population based on final 1911 Census data using the
method described in the 73rd Report of the Registrar
General, 1910, but with revised factors and with
an adjustment for the leap years 1904 and 1908.
Figures for age `under 1 year' have been estimated
by taking births minus deaths aged under 1 year for
two years and dividing by two.
1911-1920 The Registrar General's Decennial Supplement 1921,
pt III, Table 1, page 8. During the period
1915-1920, for males, figures used correspond to the
Civilian population.
1921-1930 The Registrar General's Decennial Supplement 1931,
pt III, Table 1, page 2.
1931-1939 The Registrar General's Statistical Review of
England and Wales for the years 1938 and 1939 Text,
Table XCIV, page 156. Figures used correspond to
mid-year estimates for the period 1931-1938 and to
the male and female population in 1939.
Figures for age `under 1 year' have been estimated
from Annual Statistical Reviews.
1940-1945 The Registrar General's Statistical Review of
England and Wales for the six years 1940-1945, Text,
Volume II Civil, Table III, page 10. Civilian
populations for males and females were used (where
these were distinguished from total populations).
Figures for age `under 1 year' have been estimated
from Annual Statistical Reviews.
1946-1950 The Registrar General's Statistical Review of
England and Wales for the five years 1946-50 Text,
Civil, Table III, pages 10-11. Figures used
correspond to the mean Civilian population for 1946,
Civilian populations for the period 1947-1949 and
the Home population for 1950.
Figures for age `under 1 year' have been estimated
from Annual Statistical Reviews.
1950-1952 The Registrar General's Statistical Review, 1952
Text, Table 1, page 6. These figures were revised
using 1951 Census results.
1953-1959 The Registrar General's Statistical Reviews annually
1953-1959, pt II, Table A2, Home population.
1960 The Registrar General's Quarterly Return for
England and Wales, Quarter ended 31st March 1966,
Appendix J, page 46. These figures were revised
using 1961 Census results.
1961-1980 OPCS Monitor Series PP1 84/1 issued 10 January 1984,
Tables 1(a) and 1(b). These figures include
revisions based on 1981 Census results.
1981-1983 OPCS Monitor Series PP1 84/3 issued May 1984, Tables
1-3 (final estimates).
Table 4.1
TABLE 4.1 IBM/ICL 2900
HISTORIC MORTALITY DATA TAPE RECORD DESCRIPTION
START     SIZE  TYPE       DESCRIPTION       RANGE
POSITION                                     OF CODES
1         7     character  Number of deaths  0000001-0012649
8         4     character  ICD code          0000-9999
12        2     character  Year              01-83
14        1     character  Sex               1, 2
15        2     character  Age code          01-26
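A record laid out as in Table 4.1 can be split into its fields by position. The Python sketch below assumes the record has already been read as a 16 character text string; the class and function names are illustrative.

```python
from typing import NamedTuple


class DeathRecord(NamedTuple):
    deaths: int      # positions 1-7
    cause: str       # positions 8-11, four digit computer code
    year: str        # positions 12-13, last two digits of the year
    sex: str         # position 14, '1' male or '2' female
    age_code: str    # positions 15-16


def parse_death_record(line: str) -> DeathRecord:
    """Split one fixed width death record into named fields."""
    if len(line) < 16:
        raise ValueError("record shorter than 16 characters")
    return DeathRecord(
        deaths=int(line[0:7]),
        cause=line[7:11],
        year=line[11:13],
        sex=line[13:14],
        age_code=line[14:16],
    )
```

The cause, year and age codes are kept as strings so that leading zeros are preserved.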
Table 4.2
TABLE 4.2 IBM/ICL 2900
HISTORIC POPULATION DATA TAPE RECORD DESCRIPTION
START     SIZE  TYPE       DESCRIPTION       RANGE
POSITION                                     OF CODES
1         5     character  Number of deaths  0000001-3096687
6         4     character  Filler            0000
10        2     character  Year              01-83
12        1     character  Sex               1, 2
13        2     character  Age code          01-26
Table 4.3, IBM
TABLE 4.3 IBM
HISTORIC DATA FILES TAPE DESCRIPTION
Tape width            0.5 inch
Record length         16 bytes
Block size            2400 bytes
Header labels         Unlabelled
Packing density       1600 b.p.i
Data representation   standard EBCDIC
Recording mode        Phase encoded
Parity                Odd
Interblock gap        0.6 inch
Number of tracks      9
Table 4.3, ICL 2900
TABLE 4.3 ICL 2900
HISTORIC DATA FILES TAPE DESCRIPTION
Tape width            0.5 inch
Record length         16 bytes
Block size            8192 bytes
Header labels         Standard ICL 2900
Packing density       1600 b.p.i
Data representation   standard EBCDIC
Recording mode        Phase encoded
Parity                Odd
Interblock gap        0.6 inch
Number of tracks      9
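Both block sizes are exact multiples of the 16 byte record length (150 records per 2400 byte block, 512 per 8192 byte block), so a block read from a tape image can be cut into records without remainder. The Python sketch below assumes that "standard EBCDIC" can be decoded with Python's `cp037` codec; the choice of code page is an assumption, although the records contain only digits. The function name is illustrative.

```python
from typing import List

RECORD_LENGTH = 16  # bytes, from the tape description


def split_block(block: bytes,
                record_length: int = RECORD_LENGTH,
                encoding: str = "cp037") -> List[str]:
    """Cut one tape block into fixed length records and decode each
    from EBCDIC to text. The code page is an assumption."""
    if len(block) % record_length:
        raise ValueError("block is not a whole number of records")
    return [block[i:i + record_length].decode(encoding)
            for i in range(0, len(block), record_length)]
```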
Table 5
Table 5 Computer codes for age groups
Age       Code    Age       Code
Under 1   01      45-49     11
1-4       02      50-54     12
5-9       03      55-59     13
10-14     04      60-64     14
15-19     05      65-69     15
20-24     06      70-74     16
25-29     07      75-79     17
30-34     08      80-84     18
35-39     09      80+       19
40-44     10      85+       20
* On ICD 1st revision these 10-year age groups are used.
** On ICD 3rd, 4th revision, the highest age group is 80+; for
deaths, this also holds true for 5th revision years 1940-41.
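The codes listed in Table 5 can be turned into labels with a small lookup, sketched in Python below. Only codes 01 to 20 appear in the table as reproduced here, so any other code is returned as unlisted and no meaning is assumed for it; the function names are illustrative.

```python
from typing import Dict


def build_age_labels() -> Dict[str, str]:
    """Labels for age codes 01-20 as listed in Table 5."""
    labels = {"01": "Under 1", "02": "1-4"}
    for code in range(3, 19):          # codes 03-18: 5-9 up to 80-84
        lower = 5 * (code - 2)
        labels[f"{code:02d}"] = f"{lower}-{lower + 4}"
    labels["19"] = "80+"
    labels["20"] = "85+"
    return labels


AGE_LABELS = build_age_labels()


def age_label(code: str) -> str:
    """Return the age group for a code, or flag it as unlisted."""
    return AGE_LABELS.get(code, f"unlisted code {code}")
```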
Table 6
Computer codes for cause of death in each ICD revision
Unnumbered cause list used in England and Wales 1901-1910 (1st revision)
CAUSE OF DEATH  COMPUTER CODE
Small-pox:-Vaccinated 0010
" " Not vaccinated 0020
" " Doubtful 0030
Cow-pox and other effects of vaccination 0040
Chicken-pox 0050
Measles (Morbilli) 0060
German measles 0070
Scarlet fever 0080
Typhus 0090
Plague 0100
Relapsing fever 0110
Influenza 0120
Whooping cough 0130
Mumps 0140
Diphtheria 0150
Cerebro-spinal fever 0160
Pyrexia (origin uncertain) 0170
Enteric fever 0180
Asiatic cholera 0190
Diarrhoea due to food 0200
Infective enteritis, Epidemic diarrhoea 0210
Diarrhoea (not otherwise defined) 0220
Dysentery 0230
Tetanus 0240
Malaria 0250
Rabies, Hydrophobia 0260
Glanders 0270
Anthrax (Splenic fever) 0280
Syphilis 0290
Gonorrhoea 0300
Phlegmasia alba dolens 0310
Puerperal septicaemia, Puerperal septic intoxication 0320
" " pyaemia 0330
" " fever (not otherwise defined) 0340
Infective endocarditis 0350
Pneumonia:-Lobar 0360
" " Epidemic 0370
" " Broncho 0380
" " Not defined 0390
Erysipelas 0400
Septicaemia, Septic intoxication (not puerperal) 0410
Pyaemia (not puerperal) 0420
Phlegmon, Carbuncle (not anthrax) 0430
Phagedaena 0440
Other infective processes 0450
Pulmonary tuberculosis (tuberculous phthisis) 0460
Phthisis (not otherwise defined) 0470
Tuberculous meningitis 0480
" " peritonitis 0490
Tabes mesenterica 0500
Lupus 0510
Tubercle of other organs 0520
General tuberculosis 0530
Scrofula 0540
Parasitic diseases 0550
Starvation 0560
Scurvy 0570
Alcoholism, Delirium tremens 0580