
Marine Structures

Volume 21, Issues 2–3, April–July 2008, Pages 177-195

Filling up gaps in wave data with genetic programming

https://doi.org/10.1016/j.marstruc.2007.12.001

Abstract

A given time series of significant wave heights invariably contains smaller or larger gaps or missing values due to a variety of reasons, ranging from instrument failures to loss of recorders following human interference. In-filling of missing information is widely reported and well documented for variables like rainfall and river flow, but not for the wave height observations made by rider buoys. This paper attempts to tackle this problem through one of the latest soft computing tools, namely, genetic programming (GP). The missing information in hourly significant wave height observations at one of the data buoy stations maintained by the US National Data Buoy Center is filled up by developing GP models through spatial correlations. Gaps of different lengths are artificially created and filled up by appropriate GP programs. The results are also compared with those derived using artificial neural networks (ANN). In general, it is found that the in-filling done by GP rivals that by ANN and is often more satisfactory, especially when the gap lengths are small. Although the accuracy reduces as the gap length increases, missing values over a duration as long as a month or so can be filled up with a maximum average error of 0.21 m in the high seas.

Introduction

The time series of ocean wave heights has applications in many studies related to coastal, offshore and ocean engineering. Analysis of such a series can yield a variety of information for design and operational use, such as the long-term wave height corresponding to a certain return period and the exceedance probability of a given wave height. An analysis of the time series usually requires that the sequential observations contained in it are equally spaced, made over long periods and reported in an uninterrupted manner. However, despite precautions, gaps in the collected records cannot be avoided. This is due to many reasons, such as failure of collection and transmission equipment, noise and synchronization problems between the buoy and the receivers, hardware- as well as software-related failures, aging of equipment, accidental or weather-induced snapping of mooring lines, severe weather rendering the system inoperable, theft, and the cloud-cover problem in satellite image-based records.

The amount of gaps in a given record changes from one buoy location to another, but in general appears to range anywhere from less than 1% a year to more than 40%, as shown in Table 1, which gives an example of the percentage of missing values in the wave rider buoy measurements made by the US National Data Buoy Center (NDBC) [1] at selected locations in the Gulf of Mexico. Arena et al. [2], based on the Italian national wave measurement programme involving 14 buoys, mention the occurrence of 15% gaps over a 12-year period, associated with a maximum repair time of 24 days. In the Sea Wave Monitoring Network (SWAN) of 10 buoys around Italy, Puca et al. [3] report loss of data ranging from less than 5% to 15%. While analysing data collected under the Indian data buoy programme in a separate exercise, the authors have noted that for three locations along the west coast, namely DS1, SW2 and SW4, the loss of data was around 19%, 39% and 9% over periods of 4.0, 2.5 and 5.5 years, respectively, during 1998–2003.
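For context, the percentages quoted above amount to a simple count of flagged observations in each record. The sketch below shows one way such a figure could be computed; the sentinel value used to mark missing Hs readings is an assumption and should be taken from the actual archive documentation.

```c
/* Minimal sketch: estimating the percentage of missing Hs values in a record.
   Assumes (hypothetically) that missing observations are flagged with a
   sentinel value such as 99.0, as is common in NDBC-style text archives. */
#include <stdio.h>

#define MISSING_SENTINEL 99.0   /* assumed missing-value flag */

double percent_missing(const double *hs, int n)
{
    int missing = 0;
    for (int i = 0; i < n; i++)
        if (hs[i] >= MISSING_SENTINEL)   /* treat sentinel (or larger) as a gap */
            missing++;
    return n > 0 ? 100.0 * missing / n : 0.0;
}

int main(void)
{
    /* illustrative hourly Hs record with two gaps */
    double hs[] = {1.2, 1.4, 99.0, 1.5, 99.0, 1.3};
    int n = sizeof hs / sizeof hs[0];
    printf("missing: %.1f%%\n", percent_missing(hs, n));
    return 0;
}
```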

The lost information can be valuable and hence needs to be retrieved. The presence of gaps obviously affects the quality of information obtained through the analysis of such a gappy time series and also the performance of applications based on it, such as real-time wave forecasting and the derivation of wave height–duration curves. The presence of missing information may introduce a bias in the results so obtained.

The techniques of substituting missing values in a given time series of a random variable have been well studied and routinely employed in the case of variables like river discharge and runoff [4], but the same cannot be said for ocean waves. This is probably due to the relatively smaller sample sizes involved in many hydrological studies, like the analysis of annual peak river flows or of monthly rainfalls, where a single missing value may introduce a very large bias in the results. The methods employed in such applications include a random choice within the observed range, linear and non-linear interpolation [4], autoregressive schemes [5], chaos theory [6] and artificial neural networks [7]. The problem of gappy data in general oceanography has been addressed by investigators like Thompson [8], who suggested that a random sampling of data points might be an optimally efficient approach, and Sturges [9], who used a Monte Carlo technique to make up gaps at random in a known time series of monthly mean sea level. Emery and Thomson [10] gave an account of such attempts in the wider domain of oceanography. As regards the time history of wave heights (rather than that of the other variables in the works referred to earlier), studies directly addressing the issue are relatively sparse. Stefanakos and Athanassoulis [11] made use of a residual wave height series with the same probability distribution as the original one, created after removing the trend and periodicity from the observed series. The use of soft computing tools like artificial neural networks (ANNs) for the in-filling of wave data is very recent. Puca et al. [3] filled up gaps at one location by developing the spatial correlation with two nearby sites, while Balas et al. [12] resorted to temporal correlations, probably due to the smaller gaps (2–24 h or so) in their series and the shorter period of observation (24 months). In general, small gaps (a few in number) appeared to have been filled up by simple interpolation, medium gaps by stochastic model fitting and large gaps by spatial correlation, although the distinction made between small, medium and large gaps is not very clear [11], [13].
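As a concrete illustration of the simplest of these approaches, a short internal gap in an Hs series can be bridged by linear interpolation between the nearest valid observations on either side. The sketch below is a minimal example of that idea, not the scheme of any of the cited works; missing values are assumed to be marked as NaN.

```c
/* Minimal sketch: filling a short internal gap in an hourly Hs series by
   linear interpolation between the nearest valid neighbours.
   Missing values are assumed to be marked as NaN. */
#include <stdio.h>
#include <math.h>

void fill_gaps_linear(double *hs, int n)
{
    for (int i = 1; i < n - 1; i++) {
        if (!isnan(hs[i]))
            continue;
        int j = i;                       /* find the end of the gap */
        while (j < n && isnan(hs[j]))
            j++;
        if (j == n)
            break;                       /* gap runs to the end; leave it */
        double left = hs[i - 1], right = hs[j];
        int len = j - i + 1;
        for (int k = i; k < j; k++)      /* interpolate across the gap */
            hs[k] = left + (right - left) * (k - i + 1) / len;
        i = j;
    }
}

int main(void)
{
    double hs[] = {1.2, 1.3, NAN, NAN, 1.6, 1.5};
    fill_gaps_linear(hs, 6);
    for (int i = 0; i < 6; i++)
        printf("%.2f ", hs[i]);
    printf("\n");
    return 0;
}
```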

A review of past works indicates that, when it comes to in-filling of missing values in a wave height series, there is scope to carry out a systematic analysis based on various gap lengths using a large database, and thereby to present useful information to future investigators. Further, in the recent past, researchers dealing with uncertainties in data have found new soft computing approaches more attractive than traditional schemes, and it is therefore worthwhile to try out such newer techniques to retrieve the missing information. The present work is directed towards this. It involves application of one of the latest and so far untried soft computing tools, genetic programming (GP), for filling up the missing information at a given location based on the same variable collected at nearby stations. GP can iteratively generate new values till such values reach a certain level of acceptance as per the selected criterion, and it thus looks attractive for the current problem of retrieving missing values. In the present work the suitability of this new approach is assessed for different gap lengths and its outcome is compared with that of an ANN.

Unlike in the past, large amounts of wave rider as well as satellite wave data are now becoming increasingly available for multiple locations, over long durations and in many parts of the world, and this study would therefore be useful when analysing such databases.

Section snippets

The database used

The wave rider buoy observations pertaining to four stations in the Gulf of Mexico maintained by the US National Data Buoy Center were downloaded from the web site [1]. These stations were: FPSN7, 41002, 41008 and 41004 (Fig. 1), located in the Gulf of Mexico off the US south-east coast. The measurements were made by 'frying pan shoals' and 'moored' types of buoys. Station 41002 is in deep water (depth = 3786 m), while stations 41004, FPSN7 and 41008 are in shallower depths of 33.5, 23.5 and 18.0 m,
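As an aside, archives of this kind are typically whitespace-separated text files whose header line names the significant wave height column (often labelled WVHT). The sketch below shows one possible way to pull that column out of such a file; the file name is hypothetical and the exact layout varies between stations and years, so the column is located by name rather than by fixed position.

```c
/* Minimal sketch: extracting significant wave height (Hs) from an
   NDBC-style whitespace-separated text archive.  The column holding Hs is
   assumed to be labelled "WVHT" in the header line; the file name below is
   hypothetical. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("41004_hourly.txt", "r");   /* hypothetical file name */
    if (!fp) { perror("fopen"); return 1; }

    char line[1024];
    int hs_col = -1;

    /* locate the WVHT column from the header line */
    if (fgets(line, sizeof line, fp)) {
        int col = 0;
        for (char *tok = strtok(line, " \t\r\n"); tok; tok = strtok(NULL, " \t\r\n"), col++)
            if (strcmp(tok, "WVHT") == 0)
                hs_col = col;
    }
    if (hs_col < 0) { fprintf(stderr, "WVHT column not found\n"); return 1; }

    /* print Hs for each data row */
    while (fgets(line, sizeof line, fp)) {
        int col = 0;
        for (char *tok = strtok(line, " \t\r\n"); tok; tok = strtok(NULL, " \t\r\n"), col++)
            if (col == hs_col) {
                printf("%s\n", tok);
                break;
            }
    }
    fclose(fp);
    return 0;
}
```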

Implementing GP

The present problem of establishing spatial correlations can be handled by GP either through the equation mode or through the program mode. Considering the greater flexibility in data mining offered by the latter approach, individuals consisting of computer programs only were used. The software Discipulus [23] was used to generate the GP programs. Turbo C in the C++ environment was employed to run the evolved programs and to implement them by applying them to a new data set (the applied data set). The
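Conceptually, the evolved program behaves like a C function that maps the simultaneous Hs values at the neighbouring stations to a predicted Hs at the target station. The sketch below illustrates how such an exported function might be driven over an applied data set; the body of the evolved function is only a placeholder and all names are hypothetical, since the actual code is whatever the GP run produces.

```c
/* Minimal sketch of how an evolved GP program, exported as a C function,
   might be applied to a new (applied) data set.  The body of
   gp_evolved_hs() is only a placeholder standing in for the code actually
   produced by the GP run; all names are hypothetical. */
#include <stdio.h>

/* Placeholder for the evolved program: predicts Hs at the target station
   from simultaneous Hs at the three neighbouring stations. */
static double gp_evolved_hs(double hs_a, double hs_b, double hs_c)
{
    /* a real evolved expression or program would appear here */
    return 0.4 * hs_a + 0.35 * hs_b + 0.25 * hs_c;
}

int main(void)
{
    /* illustrative simultaneous observations at the three neighbours */
    double neighbours[][3] = {
        {1.1, 1.3, 1.2},
        {2.0, 1.8, 1.9},
        {0.9, 1.0, 1.1},
    };
    int n = sizeof neighbours / sizeof neighbours[0];

    for (int i = 0; i < n; i++) {
        double hs_hat = gp_evolved_hs(neighbours[i][0],
                                      neighbours[i][1],
                                      neighbours[i][2]);
        printf("predicted Hs at target station: %.2f m\n", hs_hat);
    }
    return 0;
}
```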

Results and discussion

The objective of the study was to fill up gaps in the time series of Hs values at location 41004, given the measurements of Hs at the three surrounding locations (Fig. 1). The calibration as well as the testing of the GP model was done only for those cases where observations at all four locations were simultaneously available. The calibration or training was thus done with the help of 60% of the total data (input–output pairs) of 48 months belonging to the initial observed sequence. Once
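For the artificially created gaps, the in-filled values can be scored against the withheld observations. The sketch below uses mean absolute error and root-mean-square error as typical measures of this kind, with purely illustrative numbers.

```c
/* Minimal sketch: scoring in-filled values against the observations that
   were withheld to create an artificial gap.  Mean absolute error and RMSE
   are used here as typical measures; the arrays are illustrative. */
#include <stdio.h>
#include <math.h>

void score_gap(const double *observed, const double *infilled, int n)
{
    double sum_abs = 0.0, sum_sq = 0.0;
    for (int i = 0; i < n; i++) {
        double e = infilled[i] - observed[i];
        sum_abs += fabs(e);
        sum_sq  += e * e;
    }
    printf("MAE  = %.3f m\n", sum_abs / n);
    printf("RMSE = %.3f m\n", sqrt(sum_sq / n));
}

int main(void)
{
    double observed[] = {1.20, 1.35, 1.50, 1.40};  /* withheld Hs values   */
    double infilled[] = {1.25, 1.30, 1.45, 1.50};  /* model's in-filled Hs */
    score_gap(observed, infilled, 4);
    return 0;
}
```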

Conclusions

GP was found to be an effective tool to retrieve the missing information in a given wave height time history by establishing spatial correlation with neighbouring locations. The technique of GP was able to learn the non-linear trends in the underlying time series satisfactorily.

The results obtained by adopting GP were marginally better than those of the best-trained feed-forward type of ANN, especially when a small interval (a few days or so) of gaps was to be filled up.

Although in

Acknowledgement

The authors gratefully acknowledge the financial support given to the above project by the Department of Science and Technology, Government of India.

References

  • Puca S, Tirozzi B, Arena G, Corsini S, Inghilesi R. A neural network approach to the problem of recovering lost data in...
  • Mutreja KN. Applied hydrology; 1987.