The study of the spatial distribution of disease incidence and mortality is a basic approach to identifying possible causes. Traditionally, aggregations (clusters) of health events have been recorded for large administrative areas, such as the provinces and municipalities of Spain, and this approach to geocoding health-related events has been used in most countries. As a result, the specific factors that cause aggregations of cases cannot be investigated unless the size and geographical boundaries of a cluster match the spatial units coded. This has been a recognised limitation for years (Gatrell and Loytonen, 1998). Attempts to overcome it have always clashed with the lack of an alternative method for the spatial encoding of health events. In recent years, tools such as Google Maps and R packages such as ggmap and ggplot2, together with the development of spatial epidemiology (Elliott et al., 2000), have radically changed this scenario by allowing health events to be geocoded at the address-point level. Consequently, the old, limited and inefficient approach of assigning individual records to predefined areas or regions is no longer justified.
Knowing the real cancer incidence in a given population requires access to a high-quality population-based cancer registry (Forman et al., 2013), which in turn requires physical infrastructure and trained personnel. The inclusion of each case usually demands considerable effort, which can delay the closure of a registration period by several years and hamper the epidemiological surveillance of certain diseases. In this connection, the epidemiology services of the Autonomous Communities in Spain are making great efforts to integrate their information systems (cancer incidence registries, mortality registries, health surveys, clinical information) into an instrument for assessing morbidity and mortality for any type of pathology and for the surveillance of non-communicable diseases (Mayoral Cortes et al., 2016). Examples of these systems in Spain are the Minimum Basic Data Set (MBDS), the hospital cancer registries and the healthcare information systems.
The MBDS records information on patients and health centres based on the coding of all diagnoses of hospitalised patients discharged each month, which is sent to the department of health in each local county. The MBDS is the only source that combines nationwide coverage, mandatory completion and linkage of administrative data with diagnoses for all patient information except mortality. Many of the information systems described above, including the MBDS, hold the information needed for geocoding, which makes spatial epidemiological investigations possible, such as the detection of outbreaks of cases (Abellan et al., 2002; Lopez-Abente and Ibanez, 2002). This information has been used for regional epidemiological surveillance reports and for the study of certain diseases (Gil Prieto et al., 2009; Garcia-Garcia et al., 2010). However, its use in the epidemiological surveillance of cancer is rare (Gil et al., 2007), both because little is known about the capacity or coverage of these systems for case detection and because of the long induction periods of many cancers. The MBDS itself has been used in numerous studies of epidemiological surveillance of infectious diseases in Spain, but in cancer it has only been used to study the extent of admissions for cervical cancer during the period 1999-2002 (Marquez Cid et al., 2006).
The development of information systems in public health services invites the use of this infrastructure for health-related epidemiological studies. In this context, the aim of this paper is to explore the feasibility of MBDS as a tool for epidemiological research on cancer, specifically in monitoring its occurrence through the detection of spatial clusters of cases.
Materials and Methods
Case-control studies of cancer patients were designed with special reference to the geographical coordinates of the patients' addresses of residence. These designs belong to the area of spatial data analysis known as point processes (Diggle, 2003).
The location of the area of study, the town of Alcala in Madrid, Spain, can be seen in Figure S1 (Supplementary file). This area is located in the centre of the country near the capital (Madrid).
The MBDS hospital discharge data were obtained directly from the Prince of Asturias University Hospital (HUPA) for the period between January 2012 and June 2014. This is the only public hospital in the study area, and the admissions it reports to the MBDS come mostly from the study region. The HUPA has a hospital cancer register (HCR), which is not the case in all hospitals in Spain. The HCR started its activity in 2008 based on pathology reports only, but in mid-2011 the MBDS was added as a data source. The sources that currently feed the HUPA's HCR are the MBDS, pathology and haematology reports, the minutes of the hospital tumour committee and chemotherapy lists. This circumstance allowed us, in a previously published study in the same area (Fernandez-Navarro et al., 2016), to assess the MBDS as an alternative source of cancer records using the HCR as the reference standard.
The MBDS records contain the birth, admission and discharge dates of every patient, as well as the patient's personal data and up to 13 diagnosis codes according to the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). In the case of cancer, diagnoses can correspond to incident tumours as well as to a history of cancer (prevalent and cured). All cancer diagnoses recorded from hospital admissions in the MBDS during the study period were collected. For this selection, the 13 diagnosis fields recorded in the MBDS for each patient over the whole period were assessed. Once one of the cancer diagnoses listed below appeared for a patient, he or she was selected as a case. The same patient can therefore be counted in more than one cancer group. For the analysis, we selected the following tumour sites: stomach cancer, colorectal cancer, lung cancer, breast cancer in women, prostate, bladder and kidney cancers, melanoma and haematological tumours (non-Hodgkin's lymphomas, myeloma and leukaemias).
Patients older than 39 years who were registered in the MBDS during the study period with any of the cancer diagnoses described above were selected.
Geocoding and selection of control groups
The Statistics Institute of the Community of Madrid automatically and routinely geocodes the home addresses of all those enrolled in the municipal register of the region to build a geocoded municipal register. The tools developed are applicable to any records containing postal addresses. The cases in our study were geocoded by locating all patients registered in the MBDS in the annual municipal rolls and extracting the coordinates of their address of residence. For this purpose, a file with identifiers but without diagnoses was sent to the institute. All patients not domiciled in the municipality were excluded.
In this type of study, the control group provides information on the spatial heterogeneity of the population. The control group was selected from the geocoded continuous register of inhabitants provided by the Statistics Institute of the Community of Madrid. For each tumour site, a random sample of 10 controls per case, frequency-matched by age and sex, was obtained. The recorded variables were the coordinates (x and y) in the ED50/UTM zone 30N projection (European Petroleum Survey Group spatial reference 23030) (http://spatialreference.org/ref/epsg/ed50-utm-zone-30n/), sex and year of birth. To preserve confidentiality, the last digit of each coordinate (x, y) was assigned randomly. Because the coordinates are expressed in metres, this change displaces each point by less than 10 m along each axis and does not alter the results of the spatial analysis.
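The control selection and coordinate masking described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually run by the institute: the record layout (dictionaries with `sex`, `birth_year`, `x`, `y`), the function names and the use of exact year of birth as the age stratum are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_matched_controls(cases, register, ratio=10, seed=0):
    """Draw `ratio` controls per case within each (sex, birth_year) stratum.

    Assumption: exact year of birth is used as the age stratum; the source
    only states that controls were frequency-matched by age and sex.
    """
    rng = random.Random(seed)
    # The number of cases in a stratum fixes how many controls to draw there.
    cases_per_stratum = Counter((c["sex"], c["birth_year"]) for c in cases)
    pool = defaultdict(list)
    for person in register:
        pool[(person["sex"], person["birth_year"])].append(person)
    controls = []
    for stratum, n_cases in cases_per_stratum.items():
        candidates = pool[stratum]
        # Never request more controls than the register holds in the stratum.
        k = min(ratio * n_cases, len(candidates))
        controls.extend(rng.sample(candidates, k))
    return controls


def mask_last_digit(coord_m, rng):
    """Replace the units digit of a whole-metre coordinate by a random digit."""
    return (int(coord_m) // 10) * 10 + rng.randrange(10)
```

Because only the units digit changes, a masked coordinate always stays within the same 10 m interval as the original one, which is why the masking is harmless at the scale of hundreds of metres used in the cluster analysis.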
The population coverage of the MBDS was evaluated to obtain an approximate idea of the cases not included in the hospital database of the study (Table 1). For this purpose, the age-specific incidence rates of the population-based cancer registries of Cuenca and Tarragona were used to calculate the number of cases expected in Alcala if its incidence were similar to that of either province. These rates were obtained from the European population-based cancer registries database (EUREG) (http://eco.iarc.fr/eureg/). These registries were selected because, among the Spanish provinces with a registry, they are the most similar to the municipality of Alcala.
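The coverage estimate described above amounts to indirect standardisation: reference age-specific rates are applied to the local population and the result is compared with the observed count. A minimal sketch is given below; the function names, the age groups and any numbers passed to them are hypothetical, and the rates are assumed to be annual rates per 100,000 inhabitants.

```python
def expected_cases(rates_per_100k, population, years):
    """Expected number of cases when reference rates apply to a population.

    rates_per_100k: annual incidence per 100,000 for each age group
    population:     number of residents in each age group
    years:          length of the observation period in years
    """
    total = 0.0
    for age_group, rate in rates_per_100k.items():
        total += rate / 100_000 * population[age_group] * years
    return total


def coverage(observed, expected):
    """Observed cases as a fraction of the expected cases."""
    return observed / expected
```

For the study period (January 2012 to June 2014) the `years` argument would be 2.5. A coverage above 1 for a tumour site corresponds to the over-representation mentioned for some sites in Table 1.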
Average socioeconomic status
Low socioeconomic levels have been associated with the development of different chronic diseases such as cancer. To assess whether socioeconomic level is a possible cause of the spatial clustering of this type of disease, the average socioeconomic status (ASE) by census tract, taken from the 2001 Spanish census, was represented on a map. The ASE is a combination of occupation variables, such as the activity and professional status of the household, in which a high value represents a high socioeconomic level.
Statistical analysis
To assess possible aggregations of cases, we studied the spatial distribution of cases by tumour site using Ripley's K function and its difference between cases and controls (Diggle and Chetwynd, 1991), which is a standard procedure for cluster detection in point processes. The statistical significance of spatial aggregation was assessed by random labelling and Monte Carlo simulation. To locate possible clusters, the distribution of risk in the study area was evaluated through the spatial intensity of the process, i.e. by estimating the frequency of cases at each location. The parameters of the scanning window were defined empirically. The ratio of the spatial intensity of cases to that of controls is interpretable as a relative risk (RR) and allows the risk surface to be estimated by detecting the peaks that exceed the margin of statistical significance. These tolerance surfaces, which show areas with excess risk, were obtained by random labelling and Monte Carlo simulation (Kelsall and Diggle, 1995). To facilitate this analysis, a simple polygon containing all cases and controls was defined. The estimated ratios of smoothed intensities for each tumour site, computed with the default kernel functions of the software used, were represented in maps (Kelsall and Diggle, 1995). The results are shown in figures, each of which includes the K-difference (K for cases minus K for controls) and two lines marking the 95% confidence interval (CI) for each type of cancer. A K-difference (y-axis) that crosses the confidence bands indicates clustering at the corresponding distance on the x-axis (distance in metres). In addition, tolerance surfaces and the kernel ratio of the intensities of cases and controls (RR) are shown in maps to locate possible spatial clusters for each type of cancer.
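The K-function difference and its random-labelling envelope can be illustrated with the simplified sketch below. It is not the implementation used in the study (which relied on the splancs package in R): the estimator here applies no edge correction, the function names are hypothetical, and the envelope is a pointwise band taken from the empirical 2.5% and 97.5% ranks of the simulated differences.

```python
import math
import random


def k_hat(points, distances, area):
    """Naive Ripley's K estimate (no edge correction) at each distance."""
    n = len(points)
    if n < 2:
        return [0.0] * len(distances)
    counts = [0] * len(distances)
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            d = math.hypot(xi - points[j][0], yi - points[j][1])
            for k, s in enumerate(distances):
                if d <= s:
                    counts[k] += 2  # each unordered pair counts in both directions
    return [area * c / (n * (n - 1)) for c in counts]


def k_difference(cases, controls, distances, area):
    """D(s) = K_cases(s) - K_controls(s) for each distance s."""
    k_cases = k_hat(cases, distances, area)
    k_controls = k_hat(controls, distances, area)
    return [a - b for a, b in zip(k_cases, k_controls)]


def random_labelling_envelope(cases, controls, distances, area,
                              n_sim=199, seed=0):
    """Pointwise envelope of D(s) when case/control labels are permuted."""
    rng = random.Random(seed)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    sims = []
    for _ in range(n_sim):
        rng.shuffle(pooled)  # random relabelling keeps the locations fixed
        sims.append(k_difference(pooled[:n_cases], pooled[n_cases:],
                                 distances, area))
    tail = int(0.025 * n_sim)  # number of simulations cut from each tail
    lower, upper = [], []
    for k in range(len(distances)):
        values = sorted(sim[k] for sim in sims)
        lower.append(values[tail])
        upper.append(values[n_sim - 1 - tail])
    return lower, upper
```

An observed `k_difference` value above the upper band at a given distance would be read as evidence of clustering of cases relative to controls at that scale. The double loop is quadratic in the number of points, so for samples of the size used in the study an optimised library routine would be preferable.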
Cartography and software
For mapping, the official cartography of the municipal and provincial census sections of the National Statistics Institute of Spain (INE) (http://www.ine.es/en/welcome.shtml) was used. All analyses and cartographic representations were carried out in R with the maptools and splancs packages (R Development Core Team, 2005).
Results
The HUPA registered 283,796 patients between 1997 and 2011, 75.3% of whom (213,526) were found in the geocoded municipal register. Of those, 98.9% (211,165) had assigned coordinates, while 1.1% had no geocodes. The remaining 70,270 patients (24.7%) were not located in the municipal rolls and probably correspond largely to non-residents of the municipality.
Table 1 shows the estimated population coverage for all tumour sites analysed. In general, coverage was good relative to what was expected from the incidence rates of the population-based cancer registries, although some tumour sites, such as bladder cancer, were over-represented. Total coverage was about 74% and 87% using the registries of Tarragona and Cuenca as reference, respectively. Table 2 shows the number of cases recorded in the MBDS and the number of cases and controls included in the analysis. The study included 2,683 cases and 27,825 controls; the most frequent tumour was colorectal cancer, followed by breast cancer (in women) and prostate cancer.
Clusters of cases
As an example, Figure 1 shows the distribution of lung cancer cases and their controls over the census-section map of the study region. The grey points correspond to the control group, which represents the spatial heterogeneity of the population. In this case, there appears to be a high density of lung cancer cases in the centre and in the south of the study region.
Figures 2-4 show the results of the detection of spatial clustering of cases relative to controls for the different tumour sites studied. Specifically, Figure 2 shows the results for cancers of the lung, bladder and kidney, for which the main risk factor is smoking. These three tumours show signs of spatial aggregation of cases in the difference of the K functions. Comparison of the spatial density of cases and controls points to an area in the south-west of the municipality, marked with a thick continuous line in Figure 2, with a higher incidence than expected by chance. In contrast, there is no relevant spatial aggregation of this kind for stomach or colorectal cancer (Figure 3).
Figure 4 shows the results of the analysis for breast cancer (in women) and prostate cancer. For both, the spatial density distributions of cases and controls are similar. The same figure shows the results for the haematological tumours, for which there is some aggregation of cases, although it does not exceed the tolerance envelope of the K-function difference. The comparison of spatial densities identifies an area of possible aggregation that partly coincides with the one shown by the tumours in Figure 2. Finally, the area detected in the south-west, as seen in the map of Figure 1, corresponds to an area called the Catholic Monarchs, which has a low socioeconomic status according to the data. Figure 5 shows the average socioeconomic status of each census tract in the study region. This indicator combines variables on the occupation, activity and professional status of the household.
Discussion
This work shows the feasibility of designing and carrying out cancer case-control studies from the MBDS. These data originate in the National Health System and were not designed for this purpose, but the idea is transferable to other public administration data. Moreover, the main result of the study is a clear and similar pattern of aggregation of cases for cancers of the lung, bladder and kidney in a particular part of the urban study area. These cancers share tobacco use as a common risk factor (Adami et al., 2008).
Internationally, smoking rates are particularly high among people who are socioeconomically disadvantaged (Hiscock et al., 2012). This could explain the aggregation of cases of tumours closely related to smoking found in the area of low socioeconomic status depicted in Figure 5, where the rate of smoking is possibly high.
Minimum Basic Data Set as a tool in epidemiological research on cancer
This work shows the feasibility of using the MBDS as an exploratory tool in epidemiological cancer research, providing useful information for cancer monitoring in a region and for identifying potential risk factors. The suitability of the MBDS for any of these functions depends on the sensitivity of the registry in identifying cancer cases. A previous study by our group (Fernandez-Navarro et al., 2016) showed that sensitivity is 74% (95% CI: 72-76) for all cancer sites, although it varies by tumour type and reaches its highest value for bladder cancer (96%; 95% CI: 92-98). Specificity and negative predictive value (NPV) were very high for all types of cancer studied, always above 95%. These results suggest that, except for certain tumour sites, the MBDS can be a valid source of information for epidemiological studies. Although this study was carried out with the MBDS in a specific location, i.e. the region of Alcala, it could serve as a pilot for other regions of Spain, except for big cities such as Madrid in central Spain and Barcelona in the north-east of the country. In these cities, health care is based on more complex models involving several public and private hospitals and health centres. The design could also be applied to hospital tumour registers where they exist. In our case, knowing that not all hospitals have such registers, we selected the MBDS as the source of cases to make the design portable.
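For readers less familiar with these validity measures, the sketch below shows how sensitivity, specificity and negative predictive value follow from a two-by-two comparison of the MBDS against a reference standard, together with one common way of attaching a 95% interval to a proportion. The Wilson score interval is an assumption made here for illustration; the source does not state which interval method the earlier study used, and the function names are hypothetical.

```python
import math


def validity_measures(tp, fp, fn, tn):
    """Validity of a data source against a reference standard.

    tp: cancer cases in the reference that the source also records
    fp: records flagged by the source but absent from the reference
    fn: reference cases missed by the source
    tn: non-cases correctly not flagged by the source
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
    }


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for about 95%)."""
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total
                         + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half
```

With this layout, a sensitivity of 74% means that roughly one in four reference cases is absent from the source, which is the quantity that matters most when the source is used to detect clusters of cases.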
Cancer vs other diseases
Although this study focused on cancer, a pathology with a long induction period (Schottenfeld and Fraumeni, 2006), there are other situations in which the spatial techniques used could play an important role. Studies that attempt to elucidate the environmental origin of a series of cases clustered in time and space, like the one presented here, may perform better for acute conditions, such as communicable diseases and conditions associated with emission sources of biological agents or toxic substances (e.g. outbreaks of Legionella infections or respiratory diseases such as asthma), where induction periods are very short. Apart from detecting clusters of disease, such studies could help to find the emission sources and to issue alarms or recommendations to the population.
Limitations
The main limitation of this study is the possible existence of unidentified cases, due to problems related to geocoding and to the sensitivity of the MBDS registry. Regarding the first problem, geocoding relied on the geocoded municipal register, in which addresses were geolocated by professionals of the Community of Madrid, and the addresses of 19% of all cancer cases were not found in this register. This percentage does not vary greatly with the type of tumour analysed (range 13-25%). This limitation does not seem too relevant given the exploratory objectives of the study. Furthermore, the sensitivity of the MBDS for detecting cancer cases, described above, is quite high. The cancer cases included in the analysis therefore seem to reflect reality well.
Another possible limitation is that covariates were not included in the detection of spatial clusters of cases. Including covariates would also allow a statistical evaluation of the influence of certain factors on the spatial aggregations, but this option was beyond the objectives of this work. In addition, kernel smoothing techniques require the specification of parameters related to the window size, which could modify the images obtained. Given the exploratory nature of the study, we did not examine possible differences by gender, nor did we take into account variables such as ethnicity or other intrinsic characteristics of census tracts. An added difficulty in controlling for contextual variables such as socioeconomic status is that the INE census does not provide such data stratified by sex, so this information would have to be obtained from other sources.
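The dependence of the smoothed risk surface on the window size can be illustrated with a small sketch of a kernel ratio evaluated at a single location. The Gaussian kernel and the normalisation of each density by its sample size are assumptions made for this example; the study used the default kernel functions of its software, and the function names below are hypothetical.

```python
import math


def kernel_density(points, x, y, bandwidth):
    """Gaussian kernel density estimate at (x, y) from a list of (x, y) points."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * bandwidth ** 2))
        for px, py in points
    )
    return norm * total


def relative_risk(cases, controls, x, y, bandwidth):
    """Ratio of case density to control density at (x, y).

    A value of 1 means that cases and controls are equally concentrated
    around the location; values above 1 suggest a local excess of cases.
    """
    denominator = kernel_density(controls, x, y, bandwidth)
    if denominator == 0.0:
        return float("nan")  # no control information near this location
    return kernel_density(cases, x, y, bandwidth) / denominator
```

Evaluating `relative_risk` at the same location with a small and a large `bandwidth` generally gives different values: a narrow window sharpens local peaks, while a wide window flattens them towards 1. This is why the choice of window can change the appearance of the maps.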
Advantages and strengths
The main advantages of using the MBDS as an exploratory tool for spatial cancer research, as shown in this article, are the speed and low cost of the process. Geocoding by identifying patients in municipal records is also very efficient and makes it easy to select control groups. Additionally, the study design, which involves close collaboration between different public institutions, could form the basis of an automatic surveillance and warning system. Such a system, which would monitor updates of the MBDS, could emit different types of warnings to be managed by epidemiologists. All these processes could be extended to primary care to provide a comprehensive alert system. The study has other strengths derived from the intrinsic inferential validity of case-control studies and from the quality of the data, both in terms of coverage (Table 1) and of the detection of cancer cases in the region. Moreover, the design and the model applied could be exported to many other areas. Finally, in the context of health surveillance, it should be noted that, given the clearly exploratory nature of this study, all results should be interpreted with caution. However, including these techniques in the routine exploitation of hospital administrative data could be an easy first step in exploring health problems, and not only in hospitals, because primary care could be included through its information systems. Furthermore, the study of clusters of disease could generate hypotheses about environmental exposures that may be affecting people's health, which could then be verified by the competent authorities.