Title | A REGRESSIVE METHODOLOGY FOR ESTIMATING MISSING DATA IN RAINFALL DAILY TIME SERIES |
Abstract | The presence of gaps in environmental data time series is a very common but extremely critical problem, since it can produce biased results (Rubin, 1976). Missing data plague almost all surveys. The problem is how to deal with missing data once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue that plays an important role in the choice of any recovery approach is the evaluation of the "missingness" mechanism. When the missingness is conditioned on some other variable observed in the data set (Schafer, 1997), the mechanism is called MAR (Missing at Random). Otherwise, when the missingness mechanism depends on the actual value of the missing data, it is called NMAR (Not Missing at Random). The latter is the most difficult condition to model. In the last decade, interest arose in estimating missing data by regression (single imputation). More recently, multiple imputation, which returns a distribution of estimated values, has also become available (Scheffer, 2002). This paper presents an automatic methodology for estimating missing data. In practice, given a gauging station affected by missing data (the target station), the methodology checks the randomness of the missing data and classifies the "similarity" between the target station and the other gauging stations spread over the study area. Among the different methods available for defining the degree of similarity, whose effectiveness strongly depends on the data distribution, the Spearman correlation coefficient was chosen. Once the similarity matrix had been defined, a suitable nonparametric, univariate regression method, the Theil method (Theil, 1950), was applied to estimate the missing data at the target station. Even though the methodology proved rather reliable, the estimation of missing data can be improved by generalizing it.
A first possible improvement consists in extending the univariate technique to a multivariate approach. Another follows the paradigm of "multiple imputation" (Rubin, 1987; Rubin, 1988), which consists in using a set of "similar stations" rather than only the most similar one. In this way, an estimation range can be determined, which allows uncertainty to be introduced. Finally, the time series can be grouped on the basis of monthly rainfall rates into classes of wetness (i.e., dry, moderately rainy and rainy), so that the estimation uses homogeneous data subsets. We expect that integrating these enhancements into the methodology will improve its reliability. The methodology was applied to the daily rainfall time series recorded in the Candelaro River Basin (Apulia, southern Italy) from 1970 to 2001. |
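The core steps named in the abstract (rank the candidate stations by Spearman correlation with the target, then fit a Theil line against the most similar one) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the station names, the sample values, the median-residual intercept, and the clipping of estimates at zero are all assumptions made for the example.

```python
from itertools import combinations
from statistics import median


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(target, other):
    """Spearman coefficient over the days on which both stations have a value."""
    pairs = [(t, o) for t, o in zip(target, other)
             if t is not None and o is not None]
    t, o = zip(*pairs)
    return pearson(average_ranks(t), average_ranks(o))


def theil_fit(x, y):
    """Theil line: median of pairwise slopes; the intercept is taken as the
    median residual, which is one common choice and an assumption here."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def fill_gaps(target, donor):
    """Fill None entries of target from the donor via the fitted Theil line."""
    pairs = [(d, t) for t, d in zip(target, donor)
             if t is not None and d is not None]
    x, y = zip(*pairs)
    slope, intercept = theil_fit(x, y)
    filled = []
    for t, d in zip(target, donor):
        if t is None and d is not None:
            # Clipping at zero is an assumption: rainfall cannot be negative.
            t = max(0.0, intercept + slope * d)
        filled.append(t)
    return filled


# Hypothetical daily rainfall (mm); None marks a missing day.
target = [0.0, 2.0, None, 4.0, 10.0, None, 6.0]
stations = {
    "A": [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 3.0],
    "B": [5.0, 0.0, 1.0, 7.0, 2.0, 0.0, 1.0],
}
best = max(stations, key=lambda name: spearman(target, stations[name]))
filled = fill_gaps(target, stations[best])
print(best, filled)
```

With these made-up values station "A" is selected and the two gaps are filled with 6.0 and 8.0. Because the Theil slope is a median of pairwise slopes, a few extreme rainfall days influence the fitted line far less than they would in a least-squares fit.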
Source | 6th EGU General Assembly, Vienna (AUT), 19-24 April 2009 |
Year | 2009 |
Type | Abstract in conference proceedings |
Authors | BARCA E., PASSARELLA G. |
Text | 119397. 2009. A REGRESSIVE METHODOLOGY FOR ESTIMATING MISSING DATA IN RAINFALL DAILY TIME SERIES. BARCA E., PASSARELLA G. E. Barca and G. Passarella, Water Research Institute, National Research Council, Bari, Italy (emanuele.barca@ba.irsa.cnr.it / 00390805313365). http://meetingorganizer.copernicus.org/EGU2009/EGU2009-12496.pdf. International conference organized by EGU, the European Geosciences Union: 6th EGU General Assembly, Vienna (AUT), 19-24 April 2009. International contribution. Abstract presented at an international conference organized by EGU, the European Geosciences Union. INT_Abstract_Poster_06.pdf. Abstract in conference proceedings. Copernicus GmbH. 1029-7006. Geophysical Research Abstracts (Geophys. res. abstr.). giuseppe.passarella (PASSARELLA GIUSEPPE); emanuele.barca (BARCA EMANUELE). TA.P04.005.008 Integrazione di metodologie per il monitoraggio e la modellizzazione per la gestione delle risorse idriche (Integration of monitoring and modelling methodologies for water resources management) |
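Two of the enhancements proposed in the abstract, an estimation range drawn from a set of similar stations and a grouping of months into wetness classes, might look roughly like the sketch below. It is offered under stated assumptions only: the donor series, the function names, the zero clipping, and both class thresholds (30 mm and 80 mm per month) are hypothetical and do not come from the source.

```python
from itertools import combinations
from statistics import median


def theil_line(x, y):
    """Median of pairwise slopes; intercept as median residual (an assumption)."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    slope = median(slopes)
    return slope, median(yi - slope * xi for xi, yi in zip(x, y))


def estimate_range(target, donors, day):
    """Estimate target[day] once per donor; return (low, central, high).

    At least one donor must have a value on that day.
    """
    estimates = []
    for series in donors:
        if series[day] is None:
            continue  # this donor has no observation on the missing day
        pairs = [(d, t) for t, d in zip(target, series)
                 if t is not None and d is not None]
        x, y = zip(*pairs)
        slope, intercept = theil_line(x, y)
        # Clipping at zero is an assumption: rainfall cannot be negative.
        estimates.append(max(0.0, intercept + slope * series[day]))
    return min(estimates), median(estimates), max(estimates)


def wetness_class(monthly_total_mm, dry_below=30.0, rainy_from=80.0):
    """Label a month by its rainfall total; both thresholds are placeholders."""
    if monthly_total_mm < dry_below:
        return "dry"
    if monthly_total_mm < rainy_from:
        return "moderately rainy"
    return "rainy"


# Hypothetical daily rainfall (mm); the target is missing on day index 2.
target = [0.0, 2.0, None, 4.0, 10.0, 6.0]
donors = [
    [0.0, 1.0, 3.0, 2.0, 5.0, 3.0],
    [0.0, 4.0, 10.0, 8.0, 20.0, 12.0],
    [1.0, 3.0, 8.0, 5.0, 11.0, 7.0],
]
low, central, high = estimate_range(target, donors, day=2)
print(low, central, high)
print(wetness_class(12.0), wetness_class(50.0), wetness_class(120.0))
```

The spread between the lowest and highest donor estimate gives a rough indication of uncertainty, which a single donor cannot provide. To estimate from homogeneous subsets, the fit for a missing day could be restricted to days falling in months of the same wetness class.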