Leishmaniasis is a neglected tropical disease (NTD) (Alvar et al., 2006), which occurs in three forms: cutaneous, visceral and mucocutaneous. Cutaneous leishmaniasis (CL) is transmitted by the bite of various species of Phlebotomine sand flies. Nearly 350 million people are thought to be at risk of CL (Reithinger et al., 2007). Unlike the incidence of most other NTDs, the incidence of CL is increasing (Karimkhani et al., 2016). Almost 90% of all cases of CL now occurs in Iran, Syria, Saudi Arabia, Afghanistan, Peru, and Brazil (Desjeux, 2004). Leishmaniasis is therefore a major challenge to public health and it is due to increased deforestation, urbanization and population movement together with climate change that is pronounced in many regions of the world (Haouas, 2016).
CL is endemic in some areas of Iran with Isfahan Province one of the main foci. The prevalence of this infection has been reported at 1.8% to 37.9% in different provinces of Iran (Yaghoobi- Ershadi et al., 2002; Talari et al., 2006). The main vector and reservoir host of rural CL are Phlebotomus papatasi and Rhombomys opimus, respectively (Yaghoobi-Ershadi et al., 2001). There are many studies conducted to identify the risk factors associated with incidence of CL diseases and several environmental and topographic characteristics have been found to exhibit important effects on the distribution of the disease (Ali-Akbarpour et al., 2012; Ramezankhani et al., 2017a; Ramezankhani et al., 2017b; Ramezankhani et al., 2018). Environmental factors affect the geographic distribution of vectors and the host and they also change demographic patterns. The combination of these factors determines the contact between vector and susceptible host (Vanwambeke and Lambin, 2013).
Recently, data-mining methods have been widely used to explore potentially useful and ultimately understandable patterns in the data (Han et al., 2011). Some widely used such techniques for classification include the approaches of decision tree, neural networks; K-nearest neighbour and Naive Bayes. Decision trees are one of the most commonly used data-mining methods that automatically explore the most important risk factors and their interactions and have also shown high accuracy in predicting outcomes (Song and Ying, 2015; Ramezankhani et al., 2016a; Ramezankhani et al., 2016b). The decision tree approach is the best non-parametric method to model nonlinear relationships and create simple classification rules, which are easier to understand than traditional methods (Han et al., 2011). The aim of this study was to develop a predictive model to investigate the contributions of environmental factors on the CL incidence by using the decision tree algorithm in Isfahan Province, Iran.
Materials and Methods
Isfahan Province, situated within the 31°43′N to 34°22′N and 49°38′ to 55°31′E, consists of 23 districts with a population exceeding 4,500,000 in 2015 according to the census in this year. It covers an area of approximately 107,000 km2 and is surrounded by the provinces Semnan, Qom and Markazi to the North; Fars and Kohgiluyeh va-Boyer-Ahmad to the South; Yazd to the East; and Chaharmahal va-Bakhtiyari to the West. Due to varying altitudes there are many climatic variations in the province. Generally, with the declining altitude from west to east, the air temperature rises while the precipitation declines. Figure 1 shows the country and its provinces as well as the Isfahan Province with its districts (the study area).
Data collection and preparation
Population at risk
The population at risk in our study consisted of all those exposed to CL. Population data for all 23 districts of Isfahan Province were obtained from the National Institute of Statistics in Iran.
Cutaneous leishmaniasis incidence data
We obtained confirmed data on all CL cases in Isfahan Province between 2007 and 2015 from the Center of Disease Control (CDC) at the Ministry of Health in Tehran. The diagnosis of CL was based on clinical features and laboratory testing (de Vries et al., 2015). The quarterly incidence rate of CL cases was calculated as follows:
The monthly averages of temperature, humidity, rainfall and maximum wind speed were obtained from 34 synoptic stations in Isfahan Province and six neighbouring provinces during 2007-2015. These climatic data were entered in ArcGIS desktop, version 10.3 (ESRI Inc, Redlands, CA, USA) for preparation and analysis. Ordinary Kriging was applied for calculating the monthly average of climatic data for all 23 districts. Kriging is a geostatistical interpolation technique that considers both distance and the interconnected degree of variation (autocorrelation) between known data points which is useful for estimating values in nearby but uncharacterized areas. A Kriging estimate is a weighted linear combination of the known sample values around the point to be estimated (Luo et al., 2008). In the next step, the Isfahan polygon layer at the district level was overlaid with each climatic factor raster and the monthly averages of all climatic factors was obtained using zonal statistics separately for each district (Ramezankhani et al., 2017b). Finally, the quarterly mean of all climatic factors were computed for each district during the whole study period.
The NASA Shuttle Radar Topographic Mission (SRTM) has provided digital elevation model (DEMs) for over 80% of the globe. We obtained the SRTM 90 m DEM of Isfahan through the United States Geological Survey (USGS) website. The DEM data were overlaid the Isfahan Province layer and then zonal statistics were applied to compute the average of elevation and slope for each district (Ramezankhani et al., 2017b).
Normalized difference vegetation index data
The Normalized Difference Vegetation Index (NDVI) is the most popular and commonly used index for vegetation monitoring, crop yield and drought assessment. NDVI data with 250 m spatial resolution was obtained from Moderate Resolution Imaging Spectroradiometer (MODIS) (16-day composites) on board NASA’s Terra satellite. NDVI values vary from –1.0 to +1.0. Higher NDVI readings indicate greater vigour and amount of vegetation (Peters et al., 2002). NDVI ≥0.7 is considered as densely vegetation, while 0.3 to 0.5 indicate sparse vegetation (Yan and Wang, 2016). The NDVI data for Isfahan Province from April 2007 to 2015 for each month was downloaded and the quarterly mean each district computed for analysis.
The descriptive statistics carried out included the CL incidence, which was calculated for each year during 2007-2015. The mean, minimum, maximum and standard deviation (SD) of climatic and topographic factors and vegetation coverage data in Isfahan Province were calculated for the same time period.
Data preparation and variable selection
The final data included the quarterly mean of 11 variables (maximum temperature, minimum temperature, average temperature, maximum relative humidity, minimum relative humidity, mean relative humidity, mean rainfall, maximum wind speed, DEM, slope and NDVI). The linear correlation between these variables (collinearity) was checked by the Variance Inflation Factor (VIF) index for each independent variable quantified in the regression model. After removing variables with high VIF values, seven (slope, altitude, NDVI, average rainfall, average maximum wind speed, average temperature and average mean humidity) were selected for model building. The entire dataset was divided into two sets with the data for the period 2007-2012 used as training data to build the decision tree model, and the data for 2013-2015 (the test data) used for testing the model.
Decision tree model building
A decision tree is a flowchart-like structure based on predictors. Each tree is a set of nodes and branches with growth starting from the root node, which divides into several child nodes based on variables that produce maximum heterogeneity among them. The criterion to be applied in the selection process is to choose those variables that best split the root node into distinct groups. If the samples in a node belong to the same group, then that node becomes a leaf node. Each path from root node to leaf node represents an if-then situation called a rule which has been described by Han et al. (2011). In the building of a regression tree, we utilized the classification and regression tree algorithm, the rpart in the R software package version, which uses the Poisson method to split the node. In this way, the criterion of dividing a node into two child nodes was based upon the probability ratio test for two Poisson groups, i.e. Deviance Parent – (Deviance Left node + Deviance Right node). The deviance value for each node was obtained according to the following formula:
where ci is the number of cases of the disease for each observation, i.e. the number of cases of the disease for each city in each season, ni the population at each observation site, and λ the incidence of the disease:
The pruning method was used to select the best tree size. In this method, the tree is first grown to its final size and then pruned on the basis of a complexity parameter so that subgroups that cannot increase the predictive power to a certain value can be eliminated.
The overall performance of the regression tree was evaluated for the test data using the two of the most common measures of regression models: the Root Mean Square Error (RMSE) and the correlation factor (R2). RMSE indicates the differences between values predicted by the model and the values actually observed and varies between 0 and 1 where values close to 0 indicate smaller errors and best fit. The R2 measures the linear relationship between model predictions and the observed values where R2 ranges from 0 to 1, with a value closer to 1 indicating that a greater proportion of variance is accounted for by the model.
A total of 26,347 confirmed CL cases were reported in the Isfahan Province between 2007 and 2015. As the graph depicted in Figure 2 shows, the highest incidence occurred in 2009, (approximately 82 per 100,000). The averages of mean temperature, mean relative humidity, mean rainfall, and mean maximum wind speed amounted to 16.4°C, 35.7%, 16.8mm and 13.5 m/s, respectively. The mean altitude in the study area was 1,888 m and had a NDVI level of 0.11 (Table 1).
Figure 3 depicts the decision tree model, including the selected variables and the cut-off points for each variable. Six variables were used to generate the tree. As mentioned earlier, the algorithm first selects the most important variable based on the deviance index and then determines the best point for this variable based on the same index. As seen in Figure 3 the wind speed was the first variable dividing the data into two completely heterogeneous groups. The algorithm determined the best point for this variable to be at 14 m/s. In the right of the tree, node 3 contains data with wind speeds less than 14 m/s and in the left part of the tree, node 2 contains data with wind speeds ≥14 m/s. In node 2, the next important variable was shown to be the mean temperature, which divides the data into two groups (nodes 4 and 5) using a cut-off at 30°C. For the data in node 4, another variable was found by the algorithm where the slope of the area divides the node into two terminal nodes by a cut-off at 8.4 degrees. In the right part of the decision tree, the values of NDVI, DEM, rainfall and air temperature were also considered significant in splitting node 3 into its next nodes.
The nine if-then rules derived from the decision tree are displayed in Table 2. According to rule 9, the seasonal incidence would be 163.28 per 100,000 persons if the wind speed of the area is less than 14 m/s, the altitude between 1234-1810 m, the vegetation coverage less than 0.12, rainfall less than 1.6 mm and the air temperature more than 30°C. The lowest incidence is found at rule 5, which says that the mean seasonal incidence of the disease in that area would be 2.27 per 100,000 persons if the wind speed is less than 14 m/s, the altitude less than 1,810 m, and the NDVI above 0.12. The RMSE for the decision tree model was 31.67 and R2 0.60 (P<0.001).
The decision tree approach identified six environmental and climatic factors as the most important predictors of CL in Isfahan province. They included average maximum wind speed, average temperature, altitude, slope, NDVI and average rainfall. The decision tree model showed an acceptable level of predictive performance, as assessed by RMSE and R2 on the test data.
Several studies have been carried out on the relationship between the wind speed and the incidence of leishmaniasis in Iran and other countries around the world. Also, the relationship between the wind speed and a number of diseases where vectors are responsible for their transmission has been investigated. According to these studies, the wind has dual effects on the vector and the host in that strong wind reduces the chance of insect bites while it increases their flight paths (Wu et al., 2016). While in other studies the threshold for the wind speed has not been determined, our study shows that CL incidence is greatly influenced by wind speeds less than 14 m/s. In other words, the relationship between the wind speed and the incidence of the disease is nonlinear. A study in southern Iran showed a negative correlation between the wind speed and the spatial distribution of leishmaniasis (Ali-Akbarpour et al., 2012).
A large number of studies have been conducted about the effect of altitude on the diversity and distribution of sand flies. In a study by Guernaoui et al. (2006), 2742 sand flies belonging to 9 species were collected from 25 stations with altitudes between 400 and 1400 m. The findings of that study demonstrated that P. papatasi was the main vector of leishmaniasis in lower-altitude areas and less frequent in mountainous areas with higher altitudes. Holakouee et al. (2017) showed that the distribution of leishmaniasis in Iran had a specific pattern and was less common in northern and North-western provinces with high altitudes. Our study, demonstrated that the altitude alone is not responsible for the disease incidence; rather, this factor in the interaction with other factors can add to the problem. For example, an altitude of less than 1810 m and greater than 1234 m along with other factors can increase the incidence of the disease.
According to NDVI classification, Isfahan Province is relatively bare as also was corroborated by the present study (NDVI = 0.11). Low coverage increases the temperature and evaporation, which in turn provides good conditions for sand flies, a relationship which has been investigated in different studies with reference to leishmaniasis. Shirzadi et al. (2015) showed that vegetation coverage is an important factor in predicting the incidence of leishmaniasis and that it increased in areas with less vegetation coverage. Nonetheless, results from around the world vary widely. A study in Brazil found that hot and humid climates with dense vegetation coverage causes higher incidences of leishmaniasis (Karagiannis- Voules et al., 2015).
Different results from around the world demonstrate that interactions between environmental factors have different effects on the disease, and each one of them alone cannot play an effective role in the occurrence of the disease. In the present study, these interactions were well illustrated. For example, the threshold for vegetation coverage in our study was equal to 0.12, which along with other factors can increase the incidence of the disease. The relationship between rainfall and leishmaniasis has also been studied both in Iran and different parts of the world. For example, a study in Sudan confirmed that altitude and precipitation were the best predictors in the incidence of visceral leishmaniasis; since rainfall affects the humidity, temperature and vegetation of the area, and, therefore, influences the host and vector cycles (Elnaiem et al., 2003). Another study in Afghanistan revealed that rainfall and temperature contributed significantly to the development of leishmaniasis, especially in the month of April (Adegboye and Adegboye, 2017). The other study in Algeria, based on data from 13 years of leishmaniasis cases demonstrated a strong correlation between rainfall and the incidence of the disease (Khezzani and Bouchemal, 2017). As noted earlier, the results vary from region to region. In this study, a cut-off of 1.6 mm was determined for the precipitation, which together with other factors, increased the incidence of the disease. The decision tree in our study shows that the lowest incidence of the disease occurs when the wind speed is less than 14 m/s, altitude below 1810 m, and NDVI greater than 0.12. As previously explained, low wind speed and low altitude alone cannot be a risk factor for the incidence of the disease, since the incidence of the disease will decrease if both variables are accompanied by the NDVI greater than 0.12.
Leishmaniasis incidence was investigated based on environmental and topographic factors in an endemic area of Iran. In the last two decades, cases of CL have increased in Isfahan Province because of rapid urbanization, construction of buildings on farmland near storage of waste materials with large rodent populations encouraging sand fly breeding around its towns. Decision tree algorithm application on spatial and environmental dataset provided valuable information which should be used by policy makers for implementing strategies to control and prevention of this disease.