Competing interest statement
Conflict of interest: the authors declare no potential conflict of interest.
Introduction
Geocoding is the process of matching postal addresses to their corresponding geographical coordinates (i.e. latitude, longitude) (Rushton et al., 2006). Sophisticated science, data sets, and algorithms underlie this complex process (Boscoe, 2008; Zandbergen, 2008). Numerous published studies (Goldberg, 2008; Ratcliffe, 2001) describe the algorithms used during geocoding to match an input address to an address stored in a reference database. The variability in algorithms, addresses, and databases can lead to a variety of errors in the geocoded results (Ratcliffe, 2001; Gilboa et al., 2006; Schootman et al., 2007; Zandbergen, 2008, 2011; Goldberg et al., 2013). There is no one-size-fits-all geocoding system that works perfectly in every situation and for every user. The accuracy of this complex process can range from the centroid of a rooftop to the centroid of a state (Jacquez and Rommel, 2009). This leads to the following questions: Should inaccuracies be incorporated into research or should they be omitted entirely? Should inaccuracies be corrected? Is there a threshold that inaccuracies should not exceed?
Previous studies have indicated that researchers should attempt to correct inaccurate data so that real world variances can be incorporated into analysis (Krieger, 2003; Zandbergen, 2007; Goldberg et al., 2008; Goldberg and Cockburn, 2012; Murray et al., 2011; Zandbergen, 2012). The practical application of reducing geocode inaccuracies is to improve the source data (i.e. geocoded data) used for spatial analysis (Strickland et al., 2007). However, despite calls to pay heed to geocode quality by type and to employ manual geocode correction methods, there are few documented case studies that evaluate the cost effectiveness of this practice, or the improvements that can be expected by undertaking such an effort (Goldberg et al., 2008). The purpose of this study was to quantify the effort (i.e. time) required to manually correct the geocodes in a health-related dataset, the match rate improvement between the original and the corrected geocodes, and the corresponding spatial shift by geocode quality type resulting from the corrections. The results of this study can be used to help guide researchers as they decide whether or not to undertake manual geocoding correction to improve the geocode quality type of a dataset.
Materials and Methods
Web-based geocoding and interactive geocoding correction procedures were performed using the Texas A&M University (TAMU) Geoservices Online Geocoding service, version 4.01, which was developed by the study authors (Goldberg et al., 2008). The corrections were performed by the study authors, a Ph.D. student and an honors undergraduate student. This web-based system allows for rapid manual intervention on previously geocoded data by drawing from online satellite imagery, street maps, and additional geocoding engines to determine an improved geocode for each record (Goldberg et al., 2008).
This system allows a user to upload a dataset and analyse each record one at a time. It compares the current location of each geocode to that of another location provided by an alternate geocoder (i.e. Google Maps) within the TAMU online geocoding platform, and allows the user the flexibility to execute a manual intervention process to determine a more accurate geocode. The user can select which geocoder produced a more accurate location and the dataset can be updated with the corrected coordinates. In the event that neither geocoder provides an accurate location, the user can utilise online sources to refine an address (e.g. misspelling of an address) as well as aerial imagery and street views to attempt to find the location intuitively, and visually verify a location using Google Maps. The TAMU Geoservices Online Geocoding service utilises publicly accessible data so person-hours are the only cost associated with the geocode correction processes. It is free to all researchers (https://geoservices.tamu.edu/), and the source code can be made available upon request to researchers and/or organisations that wish to use it.
To analyse the impact of the geocode correction process, a health-related dataset was used. This dataset contained 784 addresses of health service facilities located within the state of New Mexico that offered cervical screening (Pap and/or Human Papillomavirus testing), diagnostic testing (colposcopy), and excisional pre-cancer treatment (loop electrosurgical excision procedure or cone biopsy). Although these data are publicly available, it is not practical to obtain information on the specific tests offered by individual clinics or providers. This unique health service facilities dataset was provided by the New Mexico HPV Pap Registry (NMHPVPR). The NMHPVPR is the first population-based statewide cervical screening registry in the United States; it includes address-level data on healthcare facilities providing the aforementioned services in rural and urban areas. Due to the uniqueness of this dataset, the authors invested the effort to obtain the most accurate geocoding possible.
The first step of processing was to geocode the entire set of addresses using the TAMU Geoservices Online Geocoding service. The version of the geocoding service used for this research included the 2015 Navteq Address Points database, the 2010 USPS ZIP+4 reference files, the 2010 Boundary Solutions National Parcel Data Layer, the 2010 US Census TIGER/Line reference files, and the US Census Bureau 2010 Cartographic Boundary files for Minor Civil Divisions, ZIP Code Tabulation Areas, Counties, and States. Once the results were obtained, the geocoded file was uploaded to the TAMU Geoservices Online Geocoding Correction Service; Figure 1 displays the geocode correction tool interface. This service provides a user interface with a map that shows the point obtained from the TAMU geocoding system and the point obtained from the alternate geocoder, i.e. Google Maps. If the alternate geocoder finds a match that is more accurate than the original match, the user can press a button that updates the original geocode with the more accurate geocode. As previously noted, if both geocodes appear to be inaccurate, the next step is to attempt manual interactive geocoding. Online resources can be used to refine the address contained within the input file, and photographs of the building to be geocoded are often available online. In addition, the user can study aerial imagery and street views of the location and attempt to manually locate the site; Figure 2 displays the correction prompt. If the site is located, the user marks that spot on the map and the geocode is updated. These processes were used to update and correct the health service facility dataset analysed for this study. The final file contained information about the original geocodes and the corrected geocodes, which were used for comparative analysis.
Results
This section describes the results that were obtained from manually correcting the 784 geocodes. The same method used in prior research (Goldberg et al., 2008) was used to classify an improved record under one of two criteria (Rushton et al., 2006). A record that was originally non-geocodable and for which a geocode was obtained after processing was categorised as criterion one. A record that was previously geocodable and for which the accuracy of the geocode was improved after processing was categorised as criterion two (Boscoe, 2008). We considered a record that has a lower North American Association of Central Cancer Registries (NAACCR) GIS Coordinate Quality Code (Goldberg, 2008) after processing to be an improvement in accuracy under criterion two. We acknowledge that without direct field observation, it is not possible to assess with 100% certainty that the original geocode was improved. All of the records in the dataset were geocodable in the original file; therefore, no records met criterion one. For measuring improvement, we followed the geocode output type hierarchy of the NAACCR GIS Coordinate Quality Code.
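The two criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_improvement` is hypothetical, and the rank values are illustrative positions in the quality hierarchy described in this paper (lower is better), not the official NAACCR code values. The relative order of Address Range Interpolation and Street Centroid is assumed.

```python
from typing import Optional

# Illustrative ranks only (lower = finer positional quality).
# These are NOT the official NAACCR GIS Coordinate Quality Code values.
QUALITY_RANK = {
    "Building Centroid": 1,
    "Exact Parcel Centroid": 2,
    "Address Range Interpolation": 3,
    "Street Centroid": 4,  # position relative to rank 3 is an assumption
    "USPS Zip Centroid": 5,
}


def classify_improvement(original: Optional[str],
                         corrected: Optional[str]) -> Optional[int]:
    """Return 1 or 2 for the criterion met, or None if the record did not improve.

    `None` as a quality type stands for a record that could not be geocoded.
    """
    if corrected is None:
        return None  # still not geocoded, so nothing improved
    if original is None:
        return 1  # criterion one: non-geocodable record now has a geocode
    if QUALITY_RANK[corrected] < QUALITY_RANK[original]:
        return 2  # criterion two: geocode moved to a finer quality type
    return None  # same or coarser quality type: no improvement


print(classify_improvement("USPS Zip Centroid", "Street Centroid"))
print(classify_improvement("Exact Parcel Centroid", "Exact Parcel Centroid"))
```

Keeping the hierarchy in a single lookup table makes it straightforward to substitute the actual NAACCR codes when they are available in the geocoder output.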
Of the 784 records, 709 met criterion two. Ninety percent of the original addresses were corrected to a higher accuracy after the manual correction processes and 10% did not change. Of the 75 records that did not change, 21 were of the Exact Parcel Centroid quality, 50 were of Address Range Interpolation, and four records were of the USPS Zip Centroid quality. Table 1 shows that the 71 unchanged addresses that matched to either Exact Parcel Centroid or Address Range Interpolation were already of the second or third highest ranked geocode quality types (Goldberg, 2008).
Table 1 contains the original and corrected geocode quality type for the dataset. The original dataset contained zero records that were geocoded to the Building Centroid quality type. The corrected dataset contains 638 (81.38%) geocodes of this quality. It is notable that the original geocoded dataset contained 204 (26%) geocodes that matched to the USPS Zip Centroid quality type and after manual geocoding correction there were only four (<1%) records.
The correction process of the entire dataset consisting of 784 records was completed in 42.21 hours. The average processing time was 194 seconds per record. In the following sections, we will discuss the quality improvement of the dataset. The purpose of analysing both the time taken and the geocode quality improvement is to illustrate the effort that is involved versus the improvement in geocode accuracy gained.
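The reported effort figures can be checked with simple arithmetic. The sketch below uses only numbers stated in this paper (784 records, 42.21 hours, 709 corrected records). The helper `projected_hours` and the 1,000-record example are hypothetical, and they assume the per-record rate observed here would carry over to another dataset, which may not hold for different address types or operators.

```python
TOTAL_RECORDS = 784      # records in the dataset
TOTAL_HOURS = 42.21      # total manual correction time reported
CORRECTED_RECORDS = 709  # records assigned new coordinates

# Average processing time per record, in seconds.
avg_seconds = TOTAL_HOURS * 3600 / TOTAL_RECORDS

# Share of records whose geocode was changed by the correction process.
share_corrected = 100 * CORRECTED_RECORDS / TOTAL_RECORDS


def projected_hours(n_records: int) -> float:
    """Hypothetical projection: assumes the same average rate per record."""
    return n_records * avg_seconds / 3600


print(f"{avg_seconds:.0f} s per record")        # 194 s per record
print(f"{share_corrected:.1f}% corrected")      # 90.4% corrected
print(f"{projected_hours(1000):.1f} h for 1000 records")
```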
Of the 784 geocodes, 709 were assigned a new set of coordinates during the correction process. In this section we review the spatial shift that the majority of the geocodes underwent. This distance was measured in meters (m) using the XY to Line tool within ArcGIS 10.1. Of the addresses that met criterion two, the spatial shift improvements ranged from the smallest (0.018851 m) to the largest (151,368 m); the mean was 1963 m and the median was 114 m (Table 2). For the smallest spatial shift improvement category, i.e. Exact Parcel Centroid to Building Centroid, we found that these geocode quality types were closely aligned and required minimal processing time (mean 100 seconds, median 52 seconds). In the event that the original geocode location of an Exact Parcel Centroid quality type was already accurate but needed to be updated to Building Centroid, the building was selected to reflect its true level of accuracy. The newly selected point was located proximate to the original point, resulting in the small difference between the original and corrected geocodes. For the largest spatial shift, the geocode quality improved from USPS Zip Centroid to Street Centroid and the processing time was 1276 seconds (21.3 min). Figure 3 illustrates an example of the spatial shift between the original and corrected geocoded points. In the bottom left of the diagram, it can be seen that many corrected geocoded points were derived from the same original point. In this case, many addresses were originally geocoded to a zip code centroid and then corrected to more accurate, single location-based geocodes.
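For readers without access to ArcGIS, a spatial shift of this kind can be approximated with the haversine great-circle distance. The sketch below is a stand-in under stated assumptions, not a reproduction of the XY to Line tool: it treats the Earth as a sphere, and the coordinate pairs are hypothetical points near Albuquerque rather than records from the study dataset.

```python
import math
import statistics

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


# Hypothetical (original, corrected) coordinate pairs, for illustration only.
pairs = [
    ((35.0844, -106.6504), (35.0845, -106.6503)),  # small shift
    ((35.1000, -106.6000), (35.1100, -106.6200)),  # moderate shift
    ((35.0000, -106.0000), (35.5000, -106.9000)),  # large shift
]

shifts = [haversine_m(*orig, *corr) for orig, corr in pairs]
print(f"mean shift:   {statistics.mean(shifts):.0f} m")
print(f"median shift: {statistics.median(shifts):.0f} m")
```

Reporting both the mean and the median is useful here because a few very large shifts, such as corrections away from a zip code centroid, can pull the mean far above the typical shift.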
Discussion
Geocoding a list of addresses is often just the first step of a more extensive project (Rushton et al., 2006; Goldberg et al., 2007). This first step, however, is very important because it can ultimately dictate the accuracy and direction of the final result (Oliver et al., 2005; Zandbergen, 2009; Wey et al., 2009). Prior research has demonstrated that geocoded datasets should be evaluated not only for match rate but also by geocode quality type (Goldberg et al., 2008; Rushton et al., 2006). Based on the level of accuracy of the geocodes and the research purpose, we recommend that researchers pause and evaluate whether it is necessary to invest time to improve the accuracy of the geocodes (Krieger et al., 2001; Bonner et al., 2003; Nuckols et al., 2004; Oliver et al., 2005; Grubesic and Matisziw, 2006; Schootman et al., 2007; Zandbergen, 2007, 2009). This study illustrates that a dataset of lower geocode quality types can be improved to a higher level of quality with very little investment of time, effort, or finances. The original dataset contained zero geocodes that matched to a building centroid. After 42 hours (approximately one week of work), 638 (81%) of the geocodes matched to a building centroid. Our spatial shift findings support previous studies demonstrating that inaccurate geocoding produces positional errors (Cayo and Talbot, 2003; Ward et al., 2005). These errors have the potential to affect health analyses, with consequences ranging from inaccurate local disease rates to imprecise accessibility measures; such analyses are frequently used to inform health policy decisions (Jacquez, 2012). The manually corrected geocoded dataset that was produced as part of this study is now more suitable for analysis because it will yield more reliable results.
Conclusions
The current study provides additional motivation and evidence-based findings demonstrating that manual geocoding correction is both a feasible and economical method for improving the quality of geocoded data. We also demonstrated that the manual intervention geocoding processes resulted in increased match rates, higher confidence in geocode quality, and improved geocode match types. Finally, this study supports prior research in the geocoding accuracy and analysis field, and indicates that prior findings are transferable from one geographic region to another as well as across domains of health services (Goldberg et al., 2008). As demonstrated by this study, the TAMU Geoservices geocoder and the geocode correction tool, which is integrated in the online web service, are a low to no cost, easy to use option for improving geocode accuracy.