## Introduction

Dengue fever, which is transmitted by *Aedes* mosquitoes, have been increasing in Malaysia lately, specifically in its most urbanized state, Selangor (Hassan *et al.*, 2012). Uncontrolled population growth, climate change and unplanned urbanization are some of the main factors linked to an increase in dengue cases (Hassan *et al.*, 2012). Control against the vector population is still the best way to curb this disease as vaccination is still under study with mixed outcomes. A dengue vaccine named *Dengvaxia* has been developed by Sanofi Pasteur Ltd (Sanofi Pasteur Press Release, 2016), but this vaccine has caused many young children who underwent vaccination to become more sick after a three-year time period (Warner, 2016). As a result of this, dengue prediction models are still important and complementary tools for controlling the vector population at present.

Most of the dengue prediction models use meteorological parameters such as mean temperature, relative humidity and rainfall as predictors (Brunkard *et al.*, 2008; Olson *et al.*, 2009; Cheong *et al.*, 2012; Cheong *et al.*, 2013; Dom *et al.*, 2013; Naish *et al.*, 2014). Inclusion of air quality as a potential predictor is still a topic that is not well researched in the literature; there are only a few reported studies, especially from Singapore, which have addressed this issue (Smith *et al.*, 2013). The name of measurement that refers to air quality also varies among countries. For example, it is known as the Air Pollution Index (API) in Malaysia, as Air Quality Index (AQI) in Thailand and as Pollutant Standard Index (PSI) in Singapore (ASEAN Specialised Meteorological Centre, 2017). A previous study (Massad *et al.*, 2010) based in Singapore hypothesized that a severe haze condition (with high PSI levels) causes a drop in dengue since severe haze will produce smoke that would kill the adult mosquitoes. The authors’ discussion was based on observations of drops in dengue cases during the Southeast Asia extreme haze event in 2006. On the other hand, Smith *et al.* (2010) carried out a haze-dengue analysis, also in Singapore, and concluded oppositely: High PSI value did not necessarily reduce dengue cases. They observed that during the bad haze conditions in years of 1997 and 1998, dengue cases was also at its highest levels. Therefore, they concluded that severe haze conditions did not necessarily bring about a drop in dengue cases. These two opposite findings can be further explained by another study by Oliver *et al.* (2014) which have discussed that only when the API level is enough high, its negative effect on the mosquito population (these vectors could be destroyed) can be felt and dengue occurences could be reduced (Oliver *et al.*, 2014). Therefore, the relationship between air quality levels and dengue cases remains uncertain.

In view of this, our study aimed to explore API as a potential predictor of dengue cases for zones involved in our study in Malaysia. A challenge in this study is that previous dengue cases provided a positive feedback for current dengue cases due to the nature of transmission of this disease. Thus, the effect of API had to be assessed over and above this feedback effect. To address whether API levels correlate with dengue cases, a number of autoregressive integrated moving average (ARIMA) and autoregressive integrated moving average with exogenous variable (ARIMAX) models were built using the Box-Jenkins (BJ) variable transformation approach. Subsequently, the ARIMA and ARIMAX models were optimized with respect to the extent of their feedback. The optimized ARIMA and ARIMAX models were then compared using the Bayesian Information Criteria (BIC) as the criteria for model selection. Our approach of deciding the presence or absence of correlation of API with dengue cases was based on three stages of model-based analysis: model identification, validation and selection. Our models were able to find long-term correlations between API levels and dengue cases, if any, after accounting for other explanatory attributes proxied through lagged dengue cases.

## Materials and Methods

### Study area

This study involved the following five zones in Selangor: i) Petaling Jaya; ii) Pelabuhan Klang; iii) Shah Alam; iv) Kuala Selangor; v) Banting. The state of Selangor was selected as it recorded the highest number of dengue cases in Malaysia (Hassan *et al.*, 2012). Selangor is also well known for facing deterioration of air quality every year, where it recorded a number of 59, 48, and 37 unhealthy days (API levels were in the range of 101-200) during the years of 2009-2012. This poor air quality is believed to occur due to a number of factors such as commercial and industrial developments, emissions from motor vehicles and the trans boundary haze (Rahmah *et al.*, 2015).

Data on weekly dengue cases were obtained from the Ministry of Health Malaysia, and it represents the total number of dengue cases that occurred in the respective zones. On the other hand, data on API measurements were obtained from the Department of Environment, Malaysia. The API measurements for Malaysia comprise measurements of five main air pollutants, *i.e*. carbon monoxide (CO), ozone (O_{3}), nitrogen dioxide (NO_{2}), sulphur dioxide (SO_{2}) and particulate matter with a diameter of less than 10 micron (PM_{10}) (Department of Environment Malaysia, 2017). All dengue cases and API data for the five zones of study were reported on a weekly basis, starting from the year 2009 till 2015, except for Banting where the dengue cases and API data were obtained from year 2010 till 2015. The surrounding environment for air quality monitoring stations varied: the stations in Petaling Jaya, Shah Alam and Pelabuhan Klang are located in the middle with main roads which connect industrial and high density housing areas. The Kuala Selangor station is located in a comparatively less populated area, whereas the Banting station is near the roadside of a highway, thus located in a area that is less populated and having trees and oil palm plantations (Rahmah *et al.*, 2015). These different environments contribute to the variation in levels of deterioration of air quality measurements (Azmi *et al.*, 2010). Figure 1 shows the spatial locations of the API monitoring stations as well as the boundaries of the zones considered. The areas corresponding to these zones are as follows: Petaling Jaya (PJ) = 97.2 km^{2}, Pelabuhan Klang (PK) = 261 km^{2}, Shah Alam (SA) = 293 km^{2}, Kuala Selangor = 217 km^{2} and Banting = 261 km^{2}. Four of the areas are approximately similar but PJ is half the size of the rest.

The locations of API stations for the 5 zones, Petaling Jaya, Pelabuhan Kelang, Shah Alam, Kuala Selangor and Banting, are marked with black dots and their zonal boundaries indicated by the shaded grey patches.

### Presence of feedback

It is proposed that dengue cases can be explained through various time-lagged models with lags of up to ten weeks. The extent of feedback can be understood based on the study of the lifecycle of the *A. aegypti* mosquito. The mosquito breeding period from mosquito eggs to adulthood (Life cycle of Aedes Agypti, 2017) lasts beween 1.5 to 3 weeks, which is followed by a biting period where the adult mosquitoes, if successful in biting an infected human, will acquire the dengue virus. The infected mosquito then undergoes an extrinsic incubation period (EIP) which takes from 8 to 12 days before becoming infectious (Chen *et al.*, 2012). This is followed by a mosquito transmission period where the infected mosquitoes will bite susceptible (uninfected) humans (Dengue Transmission, 2014). The next phase is the human tramsmission phase where the dengue virus is transferred from an infected person to an uninfected person (Cheong *et al.* 2013). The virus will undergo an Intrinsic Incubation Period (IIP) inside the newly infected human which lasts for 4-10 days after which the infected person will exhibit symptoms of dengue (World Health Organization, 2009). The number of infected humans at previous time points, therefore, serves as a determinant of current levels of dengue cases based on the lifecycle of *Aedes* mosquitoes as described above. As the number of infected humans at any time point is difficult to determine, a reliable substitute is the actual number of dengue cases observed at these previous time points. Variability in the biting period, the mosquito transmission period, human transmission period, the EIP and the IIP cause us to consider all previous dengue cases up to lag-10 weeks as potential predictors for the current number of dengue cases, which is the maximum extent of feedback that we consider in our ARIMA and ARIMAX models. Figure 2 shows the timeframe of the dengue infection process. For dengue cases, ARIMA models are considered as models that can incorporate feedback information via lagged terms. In this study, API is the exogenous variable considered in our ARIMAX models. Both models are used to represent stationary time series data (Alain *et al.*, 2007; Cressie and Wikle, 2011). However, when a time series is non-stationary, as is the case for the dengue case trajectories, ARIMA and ARIMAX models cannot be applied directly. The standard practice is to have the original time series differenced up to the first or second order to yield stationarity after which the models can be fit to the observed data (Hyndman and Athanasopoulos, 2013).

### Statistics

In our case, first order differencing was found to be sufficient to yield stationarity; higher order differencing was found to be unnecessary. The general mathematical equation of an ARIMA model for a differenced time series *Y** is:

*ϕ*= (

*ϕ*,

_{1}*ϕ*,….

_{2}*ϕ*) and θ = (

_{p}*θ*,

_{1}*θ*,…,

_{2}*θ*) are unknown coefficients with each

_{q}*ϕ*∈ ℜ and

_{i}*θ*∈ ℜ, and ε’s random errors independent and identically distributed with zero mean and constant but unknown variance, σ

_{j}^{2}. In Eq. 1,

*q*is the extent of lag (

*i.e.*the extent of dengue case feedback) and the extent of lag of the errors (

*i.e.*the extent of correlation in

*Y*

_{t}^{*}). The ARIMA model in Eq. 1 with lag orders

*p*and

*q*is denoted ARIMA (

*p*,

*d*,

*q*) where

*d*is the order of differencing to yield stationarity. The values of

*p*and

*q*were obtained based on inspecting the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of

*Y*

_{t}^{*}. The method of choosing

*p*and

*q*based on inspection is a wellknown technique in the literature (see Nau, 2014) and hence, we omit further details here. To avoid complicated models, the

*ϕ -*and

*θ*- coefficients in Eq. 1 corresponding to insignificant lags in the ACF and PACF plots were set to zero. The final model was denoted by ARIMA(

*p, d, q*)* where * indicates the reduced model with omission of insignificant coefficients.

The remaining significant but unknown coefficients in the ARIMA(*p, d, q)** model were estimated using the maximum likelihood (ML) approach (Wang and Jain, 2003; Levine and Justice, 2013). The ML approach was used as it reports good model fit and prediction accuracy. Subsequently, residual diagnostics based on the fitted model was performed. The fitted model is deemed appropriate if the ACF and PACF plots of the residuals do not show any significant autocorrelations (*i.e.* significant peaks at lagged time points). If not, the above process mus be repeated for different choices of *p* and *q* until desired results are obtained. We provide an illustration of the techniques for one of the zones under study in the Results section. The details on the same methodology are omitted in the analysis corresponding to the other four zones.

Exogenous variables can be incorporated into the ARIMA(*p, d, q)** model forming an ARIMAX(*p*, *d*, *q, m*)* model with *m* lags. The corresponding mathematical equation of an ARIMAX(*p*, *d*, *q, m*)* model is:

*m*. The

*ϕ*-and

*θ*- coeffiients which were insignificant in the original ARIMA(

*p, d, q)**model were set to zero in the ARIMAX (

*p*,

*d*,

*q, m*)* model in Eq. 2 as well. The remaining significant

*ϕ*- and

*θ*-coefficients as well as the

*m+1 β*- coefficients in the ARIMAX(

*p*,

*d*,

*q, m*)* model in Eq. 2 were re-estimated using ML. We noted that the API time series also needs to be stationary to avoid spurious regression (Hyndman and Athanasopoulos, 2013) which was achieved by performing first order differencing as mentioned earlier.

We selected an upper bound for *m*, namely *m _{0}*, and fit

*m*ARIMAX (

_{0}*p*,

*d*,

*q, m*)* models for

*m=0,1, 2,…,m*. Among the ARIMA

_{0}*(p, d, q)** and the

*m*ARIMAX(

_{0}*p*,

*d*,

*q, m*)* models, the best one was selected using (BIC):

*L*is the model likehood,

*k*is the number of model parameters and

*n*the number of observations in the time series data. Model selection based on BIC accounts for both model fit and model parsimony (Bhatnagar

*et al.*, 2012; Sitepu

*et al.*, 2013; Sriyasatien

*et al.*, 2016). In Eq. 3,

*k*=

*k*+ 1 where

_{0}*k*is the number of significant

_{0}*ϕ*- and

*θ*- coefficients in the ARIMA

*(p, d, q)** model; the extra

*+*1 corresponds to the variance parameter

*σ*

^{2}. For the ARIMAX(

*p*,

*d*,

*q, m*)* model, the value of

*k*=

*k*+

_{0}*m*+ 2 since there are

*m +*1

*β*- coefficients and one variance parameter σ

^{2}. The best model is the one with the lowest BIC value in Eq. 3. We also considered spatiotemporal analysis where temporal API measurements from spatially neighbouring zones were incorporated into Eq. 2 in addition to the API measurements from that zone. In this case, the value of in Eq. 3 where

*M*is the total number of zones considered with

*m*denoting the lag order for zone

_{j}*j*= 1, 2,…

*M*.

## Results

### Description of data

A total of 354 weekly data were used (covering the period between January 2009 up to middle of October 2015) involving observations of dengue cases and API levels; for Banting, the number of weeks covered was 290 from March 2010 to October 2015. Table 1 gives the range and average values of the data for each zone. The average API readings ranged between 51 and 100 which is classified as moderate, that is, *moderate pollution that does not pose any bad effect on health* (Department of Environment Malaysia, 2017). This moderate API range should be acceptable for most of the public except for a small group of people who are sensitive to ozone or particulate pollution (Department of Environment Malaysia, 2017).

### Prediction models

Prior to model building and selection, we normalized both the response and explanatory variables (*i.e.* dengue cases and API levels) into the same range (0-1): Letting *H* represent a generic notation for a variable, normalization entails

*H*is the normalized variable, and

_{N}*H*and

_{min}*H*, the minimum and maximum values of

_{max}*H*, respectively, in the data set.

Three models were compared in this study, *i.e.* the ARIMA models, ARIMAX models with single-API station and the ARIMAX with paired-API stations from spatially neighboring zones. To build the ARIMA model, the BJ approach was used: first, differencing was carried out to ensure that the data were stationary; next, the significant lag orders of *p* and *q* were selected by inspecting the ACF and PACF plots; finally, the ARIMA model was estimated to obtain the parameter values using the ML approach. An example of ARIMA model building is illustrated for the Kuala Selangor zone in Figures 3-5. Figure 3 shows the weekly dengue case trajectory for Kuala Selangor during the years 2009-2015 which is clearly seen to be non-stationary. First order differencing (*d=1*) was carried out and significant lag orders identified using the ACF and PACF plots. The ACF plot in Figure 4 shows a significant peak at *q=2*, hence, the ARIMA *(0, 1, 2)** model was selected. We found that significant peaks in both the ACF and PACF plots can be eliminated by choosing exactly one of *p* or *q* appropriately, that is, by setting the other to zero, which also reduces model complexity. Figure 5 shows the ACF and PACF plots of the residuals from the fitted model. No significant peaks existed outside the 95% confidence lines indicating that the residuals are indeed white noise. Hence, the ARIMA model (0, 1, 2)* was selected as the model for dengue cases for Kuala Selangor.

To investigate if API had any effect on dengue cases, six ARIMAX (0,1,2,*m*) *models were formed for different lag orders *m=0,1, 2,…,m _{0} =* 5

*.*Table 2 gives the BIC values for ARIMA and best ARIMAX models (out of the six, based on BIC) for Kuala Selangor. A similar methodology was also applied to the rest of the study zones (Table 2).

Next, a spatio-temporal analysis was carried out to find if inclusion of API parameter from spatially neighbouring stations (the closest one) could enhance the model fit. Table 3 shows the results obtained for the ARIMAX model for Kuala Selongor when API data from its closest neighbour, Shah Alam, was additionally incorporated into the ARIMAX model (Figure 1). Results from Table 3 show that best ARIMAX model is from the pairing of API parameters at lag orders of zero and zero. A comparison of the BIC values in Tables 3 and 4 shows that the it reaches its lowest value for the ARIMA model without any API component. We concluded that the dengue trajectory for Kuala Selangor was very localized and intrinsic to that zone only. Once the localized, best model based on feedback had been identified, it was realized that API observations for that zone as well as API data from spatially neigbouring zones do not contribute to better models. Indeed, similar results were found for all other zones: Comparing the best ARIMA and ARIMAX models from Table 2 with the best ARIMAX models from Table 4 (with API data from spatially neighbouring zones), we found that the ARIMA models had best BIC values in all the five zones of study.

## Discussion

In Malaysia, poor air quality mostly occurrs during the haze episodes, known as the trans-boundary haze (see BBC News Online, 2017) due to smoke from open burning and wildfires in Indonesia. Haze and other local particulate matters were believed to kill the mosquito vectors, causing lower dengue transmission and decreased number of dengue incidences as a whole (Massad *et al*., 2010). Indeed, when their modelled prediction of a dengue outbreak did not materialize, Massad *et al.* (2010) suggested that haze could have prevented it. This suggestion was based on Singapore data on dengue and the pollutant standards index (PSI) for the year of 2006 and 2007. Smoke is also known to act as an insect repellent and therefore used as such (Tabuti, 2007).

Studies which investigate the relationship between haze (or air quality index) and the number of dengue cases in Malaysia are at best scarce. Rafdzah *et al.* (2015) conducted an extensive review of existing literature for such studies and came up with a list that does not contain any studies from Malaysia. This motivated us to conduct this study in the state of Selangor in Malaysia. Based on our analysis, we found no relationship between API level and dengue cases in all five study zones in Selangor. Prediction models were better elicited without API as a predictor. Based on data from the region of Kuala Selangor, we found that the optimum ARIMA model was better than the optimal ARIMAX model based on BIC values. The reason was that ARIMAX(*p*, *d*, *q, m*)* models are more complex compared to ARIMA*(p, d, q)** and will be penalized by the BIC criteria (*i.e.* result in higher BIC values) if the inclusion of API as the exogeneous variable is redundant, that is, if API has no effect on dengue cases. A spatio-temporal analysis also yielded similar conclusions. Incorporation of API measurements from Shah Alam, the station closest to Kuala Selangor, in addition to API measurements from Kuala Selangor station did not lower the BIC values, which means that their incorporation into the model was redundant. Similar results were observed for the four other study areas.

Dengue case observations were similar in the closely situated neighbouring zones Petaling Jaya and Shah Alam as can be seen in Table 1. The average and range of dengue cases were 267.92 [0, 1585.00] for Petaling Jaya and 213.27 [0, 1782.00] for Shah Alam, respectively. These similar average and range values differed significantly from the other zones in Table 1. In terms of API levels, there were similarities too, *i.e.* in the average and range of API levels. As seen in Table 1, the average and range of API levels were 58.61 [34.46, 201.29] for Petaling Jaya and 61.53 [34.33, 248.29] for Shah Alam, respectively. These average and range values again differed significantly compared to the other zones. Due to this similarity between Petaling Jaya and Shah Alam, the best models (which were the ARIMA models although with differing ARIMA components) had BIC values closer to each other compared to the other zones (Table 2), meaning that these models provided similar model fit and parsimonious assessment to be used for prediction purposes.

Our study findings on dengue cases and API levels are consistent with findings from a previous study (Smith *et al.*, 2010) based on 2001-2008 data from Singapore and air quality measured according to the PSI. They concluded that there was no significant association between PSI and dengue. However, their data were limited to a small range of dengue cases (minimum of 45 and maximum of 265) and PSI (minimum of 34.2 and maximum of 41.9) (Rafdzah *et al.*, 2015). We considered a higher range of dengue cases as well as API values in this paper and still did not find any correlation among them.

Interestingly, some positive impact of haze on dengue has been suggested (Ooi *et al.*, 2010; Smith *et al.*, 2013) based on the 1997-1998 dengue outbreak in Singapore, which faced severe haze at that time. This impact was explained in Smith *et al.* (2013) where they suggested the high haze readings have caused both human and mosquitoes to stay indoors bringing higher possibilities for mosquito bites and dengue transmission. However, Massad *et al.* (2010) suggested that the observed outbreak during this period could be explained by the high annual average temperature which occurred together with the haze conditions. Annual average temperatures were at the highest during the latest 15 years at the time, and this could have caused an increase in the number of infectious *Aedes* mosquitoes since the EIP period would have been shortened. This, in turn, would have caused more possibilities for infectious bites and dengue transmission. Thus, in our model, it was imperative to include such explanations in the model development so that the effect of API levels could be assessed after such inclusions; note that our model used time-lagged dengue cases as a proxy for high temperature. Thus, residual correlation of API levels and dengue cases, if any, would only be reflected in our models if truly present and not due to other confounding factors such as high temperature. Indeed, we found that after the inclusion of time-lagged dengue cases, API levels did not provide any additional explanation to the increase or decrease in the number of dengue cases.

## Conclusions

This is one of the first studies to investigate the relationship between API levels and the number of dengue cases in Malaysia using feedback modelling techniques such as ARIMA and ARIMAX. Our study found that API level did not correlate with dengue cases in all five study zones in the state of Selangor, Malaysia, which were selected due to their high levels of dengue cases. The inclusion of API for all study zones did not enhance our prediction models as assessed by the BIC values. The BIC values did not improve, either by adding API data from spatially neighbouring zones. We conclude that the dengue cases trajectory for each zone was very localized and intrinsic to that zone only. Once the localized best model was identified; other variables (the API observations for the focal zone as well as API data from spatially neigbouring zones) did not contribute to yield better prediction models. The finding that the ARIMA models with the lowest BIC values outperformed the rest of the models in all five zones should be of future interest.