1. Introduction
In the last decades, both the frequency and intensity of extreme weather and climate events have increased worldwide, leading to huge economic losses [
1]. In Argentina, the number of intense storms due to climate change has already increased in the last few years [
2]. The frequency of hail events is higher in midlatitudes during the warm season [
2]. The Pampas region of Argentina, the most important crop production area in the country, is located in midlatitudes, and the time of higher frequency of hail events overlaps both the stage of the highest vulnerability of winter crops (such as wheat, which reaches the fruiting stage in the warm season) and the whole growth period of summer crops such as soybeans and corn.
Farmers aim to mitigate the consequences of natural catastrophes through management strategies within their farming operations, but certain risks, such as hail events, are beyond their financial means. For such cases, farmers usually transfer the risk to specialized risk management companies by taking out agricultural insurance policies, with hail damage policies being very widespread in the country.
Following a hailstorm, an insurance claims adjuster goes to the field and estimates the yield losses due to hail damage. The accuracy of estimates relies heavily on the detection of Homogeneous Damage Zones (HDZs), which allows extrapolating the data from field samples to the whole area of the damaged field. Currently, estimates of the surface area of HDZs are performed visually, and the adjuster has no information about either the affected area within the field or the degree of damage before getting to the site for inspection. Since crop fields in Argentina are often extensive, with areas of 500 hectares being usual, it is unfeasible to determine HDZs in the field, particularly in the case of tall crops and with very heterogeneous degrees of damage.
One alternative is to apply machine-learning techniques to remote sensing data for the detection of crop changes due to hailstorms. The ability and accuracy of remote sensing to detect and classify changes in the physical, physiological, and chemical properties of a crop vary both with the type of sensor and algorithms used for data processing. Thus, for correct implementation, it is necessary to have precise knowledge of the advantages and limitations of each sensor and the machine-learning techniques available. Two types of sensors were used in this research: active microwave Synthetic Aperture Radars (SAR), and passive multispectral. Data obtained with these sensors are pre-processed and used as input variables for the machine-learning algorithm to detect HDZs.
The total scattered microwave signal that returns to the sensor from a vegetated surface is known to be affected directly by plant cover, by the soil (attenuated by the canopy), and by vegetation-underlying soil interactions [
3]. Thus, the microwave signal is sensitive to changes in canopy architecture and geometric configuration, whether such changes have been caused by normal plant growth or by the occurrence of a hailstorm. Since microwave signals are also sensitive to surface roughness and the dielectric coefficient of the materials observed [
4], backscatter between two successive captures might be more influenced by wind or rainfall, which affect surface roughness or the dielectric properties of the materials, than by the growth of the crop itself [
5] or by hail-caused changes. Several authors [
6,
7,
8,
9,
10] have proposed using indices calculated with SAR data to estimate biophysical crop parameters, with procedures for smoothing the undesired characteristics in the backscattered signal. The microwave spectrum range, on the other hand, has a clear advantage over multispectral data in not requiring generally atmospheric corrections since it is not sensitive to atmospheric interferences.
Optical sensors capture the sunlight reflected by objects, thus allowing accurate identification of crop characteristics. Filters placed in the sensor record the reflectance received in different layers of an image as wavelength ranges. Optical images require correction of intermediate sun glint caused by atmosphere interferences as well as detection of clouds and cloud shadows. Once this drawback is overcome, optical images are easier to interpret than those obtained from SAR sensors. Since each type of sensor has its advantages, it is reasonable to infer that a combination of both types of signals, microwave and multispectral, would optimize the detection of changes in crop biophysical parameters.
Following this approach, an arithmetic formulation including two or more wavelength ranges in the multispectral signal or types of polarization in the microwave signal is used to generate vegetation indices to describe crop biophysical variables. Although vegetation indices are easy to use, the limited number of variables they include may not reflect the complexity of field conditions [
11]. This disadvantage can be minimized by selecting the indices most sensitive to the changes in the variable of interest, which will allow differentiating plant groups with different degrees of change.
Once the biophysical variables to be assessed along the crop growth period have been defined, the challenge is to achieve a correct classification of the changes observed, with machine-learning techniques having been successfully applied for this purpose in agriculture [
12]. These techniques generate specific, accurate, and automated outputs from high volumes of remote-sensing complex data [
13]. However, implementing machine-learning methods is not easy, and the literature provides contradictory information regarding the key steps in the process. The complex decisions involved include the choice of algorithm, requirements of input data, selection and optimization of user-defined parameters, and computational costs [
14].
ML techniques are most generally classified into supervised and unsupervised. They differ from each other in that the former uses labels in a data subset to train the model to be subsequently applied to the whole set, whereas the latter searches for patterns in unlabeled data. Since there was no training sample available for this research, we used an unsupervised technique. The technique was applied to identify damage patterns in data from different crops affected by hailstorms at different phenological stages.
Although Earth satellite data have been available for more than 30 years, to the best of our knowledge no research has been conducted on the use of SAR data for hail damage to crops. Multispectral data have been used to analyze some local and sporadic cases [
15,
16,
17] which, however, pose practical limitations for regional and operational application [
18]. Thus, the purpose of this work was to develop an algorithm for determining HDZs in crops affected by hailstorms, following a standardized and accurate methodology for processing satellite data. Such a development will confer objectivity and transparency to crop damage assessment.
The following section describes the study area and the importance of agricultural insurance in the region. The characteristics and pre-processing of satellite data are also included in this section. Next, we explain the methodology used to select the indices with the highest sensitivity to hail-induced changes in the crop, as well as the steps followed by the algorithm to classify pixels into HDZs and the algorithm validation methodology.
Section 3 details the results of the sensitivity analysis for the different vegetation indices and identifies the most sensitive for each signal type. The results and discussion of the algorithm steps are also included in this section, which finishes with the results and discussion of algorithm validation with in situ data. Conclusions are presented in
Section 4.
2. Materials and Methods
2.1. Study Area
The study was conducted in the Pampas, a huge sedimentary basin formed on top of the Brasília Massif in South America. The climate in this region is subtropical humid in the north and temperate in the south, with warm summers and cool and variable winters with frequent frosts but no snowfall. The average annual temperature is 17 °C. It presents a monsoon rainfall regime, with annual averages ranging from 1200 mm in the east, an area under the influence of Atlantic winds, to 300 mm in the west, where the dry, hot winds blowing across the Andes Mountains predominate. The Pampas is an extensive plain extending over the Argentine provinces of Buenos Aires, Entre Ríos, Santa Fe, Córdoba, La Pampa, and San Luis (
Figure 1). Formerly characterized by rangelands and the absence of trees, today it is the most industrialized region in the country. The industrialization process began with cattle raising and the manufacturing of animal products for export continued with agricultural development and is currently fueled by soybeans, corn, wheat, and their by-products.
In Argentina, the regions with the highest hailstorm frequencies are located near the Córdoba, Tandil, and La Ventana Mountains, where orographic storms occur, and along the Atlantic coast, where the disturbances caused by cold air currents reaching the continent contribute to the formation of hailstorms [
19]. Historically, the damage is greater in winter crops such as wheat since intense storms occur mainly in late spring and early summer, when the filled grains are exposed [
20].
The area allocated to crop production shows small variations between years. In the 2019–2020 growing season, the areas sown with soybeans, corn, and wheat were 16.9, 9.5, and 6.9 million hectares respectively (
http://datosestimaciones.magyp.gob.ar/ (accessed on 10 March 2020)). In the same period, the policies issued by all the insurance market for Agricultural and Forestry Risks amounted to US
$ 250 million (
https://www.argentina.gob.ar/superintendencia-de-seguros (accessed on 25 July 2020)), 91.5% of which was taken out for hail damage, as follows: 37.6% for extensive soybean crops, 37.1% for corn, and 16.8% for wheat.
2.2. Data
Data were obtained from Sentinel-1 and -2 Missions of the Copernicus Program (European Space Agency and European Environment Agency, EU), and from in situ damage assessments in 91 plots affected by hailstorms between 2017 and 2020. All plots are located within the study area, 44 planted with soybeans, 32 with wheat, and 15 with corn (
Figure 1). The evaluated plots had an average surface of 137.97 hectares, with a minimum value of 13 and a maximum of 590 hectares; the values of the quartiles Q1, Q2, and Q3 were 69.75, 97.5, and 162.75 hectares respectively.
Table S1 shows the central point of each plot, the crop name, and the date of the hailstorm which caused the damage.
Sentinel-1 is equipped with C-band, dual-polarization Synthetic Aperture Radar (SAR) sensors operating at a center frequency of 5.405 GHz. Sensors are mounted on two satellite platforms (A and B), each of which registers images of the same area every 12 days. In most of the Argentinian territory, images overlap, and their incidence angles differ considerably between the platforms. To avoid this problem, when data from platforms A and B overlapped, the first to capture images after a given hailstorm was selected.
The images used were Ground Range Detected, processed from Single Look Complex products with Sentinel-1 Toolbox (
http://step.esa.int/main/toolboxes/snap (accessed on 12 December 2018)). This process generates a calibrated, ortho-rectified product, with thermal noise removal, and radiometric calibration, and terrain correction using the SRTM 30 digital elevation model, generating a product with a 10 m spatial resolution.
Sentinel-2 is a remote sensing system that uses the sensors MSI (MultiSpectral Instrument) capturing images in the visible light spectrum, near-infrared, and short-wavelength infrared bands, to record solar radiation reflected by objects on Earth in thirteen 16-bit spectral ranges. Similarly to Sentinel 1, two MSI sensors are mounted on two S2 platforms (A and B). The wavelength ranges used were blue (B2), red (B4), and near-infrared (B8), all of them with 10 m spatial resolution and obtained from a product with Level-1C Processing. The data thus obtained represent the reflectance in the upper layer of the atmosphere. The time resolution of Sentinel-2 is 5 days, or 2 and 3 days in sectors with orbit overlap.
For each assessed plot, data were obtained from satellite images from both missions, within the plot boundaries, with a time frame of 60 days prior to and 60 days after the hailstorm. When the time frame included any date prior to sowing or after harvest of each crop, the series was interrupted to avoid entering data from other crops. Satellite data were retrieved from and processed in the Google Earth Engine platform.
In situ damage assessment for each crop was performed following the United States Department of Agriculture and Federal Crop Insurance Corporation guidelines [
21,
22,
23]. At least ten sampling stations were established in each of the 91 plots analyzed, totaling 1174 observations (646 in soybean plots, 153 in corn, and 375 in wheat). Latitude, longitude, and estimated damage percentage were recorded in each sampling station.
2.3. Satellite Data Pre-Processing
SAR images are formed through coherent processing of the return of successive RADAR pulses. When a RADAR sheds light upon a rough surface, the distances between the elementary scatterers and the receiver vary due to surface roughness. As a result, the backscattered waves are coherent in frequency but not in phase. The signal received is strong if the waves add together in a relatively constructive way, whereas it is weak if phase shifts lead to intensity variations between adjacent pixels. These variations appear as a granular pattern, called speckle [
24], which makes visual and digital image interpretation more difficult. The technique most widely used is to remove speckle noise in spatial averaging. This technique, however, lowers image resolution significantly, resulting in blurry images. Morphological filters reduce speckle noise without losing image focus [
25] and also have an excellent computational performance. These filters transform images by applying successive convolutions. To do this, a kernel function (e.g., maximum, minimum, mean, median) is assigned to a moving window. The kernel moves over the original image performing the calculations for each pixel. The value obtained as output is assigned as the new value of the central pixel in the image. Thus, the pixel values in the new image are modified according to the values of its neighboring pixels in the original image. Kernels can be circular, square, cross-or diamond-shaped, or octagonal, and their size is defined by the user. To remove speckle noise, the median for each pixel was calculated on a circular kernel with a 15-pixel radius. A median rather than an average was used because speckles appear in random positions and their values vary greatly from other pixels representing the same object, which could lead to distortions in the transformation of speckled areas of an image if an average were used. Kernel size was defined according to plot size and the areas to be differentiated within the plot, whereas kernel shape was defined according to the shape of the areas to be differentiated. The crop changes to be detected are determined by soil variables such as salinity, texture, and topography before the hailstorm, and by hail after the storm. A variation in any of these variables will be strongly correlated with the spatial variable, generating irregular patterns [
26]. The circular kernel best fitted this pattern.
Although there are numerous methodologies to correct atmospheric interferences in multispectral data, not all can be applied to Sentinel-2 data because it does not record data in the thermal infrared wavelength range. Since atmospheric interferences appear as unexpected and occasional changes along with successive images in a collection, and crop changes manifest themselves gradually along time, temporal methods such as Tmask [
27] and MAJA [
28] are considered most suitable to perform atmospheric corrections. However, Tmask and MAJA have a high computational cost. Their application to HDZs would slow the algorithm, making it unsuitable for the quick responses required for crop damage insurance claims. Thus, a quick-running time method was selected to filter clouds and their shadows and correct other atmospheric interferences, namely the mobile linear regressions method, which is applied separately to each pixel time series [
29]. A 15-day bandwidth was used to ensure interference removal without adversely affecting the values by data degradation due to excessive smoothing.
2.4. Selection of Vegetation Indices
Vegetation indices allow identifying and describing structural, phenological, and biophysical parameters, which is a temporal line basis help detecting changes in vegetation [
30]. To define the variables to be used for pixel classification into HDZs, we considered the vegetation indices which are most informative about crop condition and can be calculated from Sentinel-1 and -2 satellite missions’ data. For this purpose, five microwave indices (calculated with the S1 data set) and five multispectral indices (calculated with the S2 data set) were evaluated. Subsequently, one index was selected for each signal type and named
and
.
The microwave indices considered were:
Dual-Pol Diagonal Distance (DPDD) [
6]:
where
is the value of pixel
i in the co-polarized signal, and
the value of pixel
i in the signal with cross-polarization. A low DPDD value represents bare soil, increasing gradually as biomass increases, and reaching maximum values with thick vegetation.
Inverse Dual-Pol Diagonal Distance (IDPDD) [
6]:
where
and
are defined as in (1) and
is the highest value of the co-polarized signal observed in each image within the plot. IDPDD is better than DPDD to discriminate between vegetated pixels and pixels with bare soil but presents some degree of inaccuracy to differentiate vegetation from water bodies.
Vertical Dual Depolarization Index (VDDPI) [
6]:
where
and
are defined as in (1). This index quantifies biomass by measuring the ratio between total polarized power and co-polarized power.
Microwave Polarization Difference Index (MPDI) [
31]:
where
and
are defined as in (1). This index is particularly sensitive to surface roughness and soil moisture, both being considerably altered after a hailstorm.
Dual Polarization SAR Vegetation Index (DPSVI) [
6]:
where
and
are defined as in (1), and
as in (2). DPSVI is a combination of VDDPI and IDPDD which keeps the advantages of both indices while allowing differentiating vegetation from water bodies.
The spectral indices considered were:
Normalized Difference Vegetation Index (NDVI) [
32]:
where NIR is the pixel value in the near-infrared range (B8); Red is the pixel value in the red range (B4). NDVI is associated with the amount of chlorophyll in the vegetation. High NDVI values correspond to areas with higher reflectance in the near-infrared spectrum, which correspond to thicker, healthier vegetation. This index reduces noises such as differing amounts of light, cloud shadows, atmospheric attenuation, or certain topographic variations [
27]. However, it saturates with high biomass and is very sensitive to variations in soil underlying the canopy [
33].
Enhanced Vegetation Index (EVI) [
34]:
where NIR and Red are defined as in (6), and Blue is the pixel value in the blue range (B2). EVI performs better than NDVI in high biomass situations. It also has a better response to canopy structure variations such as leaf area index, canopy type, plant shape, and canopy architecture [
35]. EVI requires a thorough previous removal of both external noise and sensor noise to estimate surface reflectances [
34].
Soil Adjusted Vegetation Index (SAVI) [
33]:
where NIR and Red are defined as in (6). SAVI is derived from the NDVI formula, to which a coefficient is added for the correction of soil brightness in areas with low vegetative cover. This correction is important in open canopies where the underlying soil affects the reflectance recorded.
Advanced Vegetation Index (AVI) [
36]:
where NIR and Red are defined as in (6). Like NDVI, AVI uses the red and near-infrared reflectances. However, since AVI uses a power degree of the infrared response, it is more sensitive to slight differences in canopy density [
36].
Normalized Pigment Chlorophyll Ratio Index (NPCRI) [
37]:
where Red and Blue are defined as in (6) and (7). This index is sensitive to carotenoid: chlorophyll content ratios. Being especially useful for characterizing leaf senescence and fruit ripening, it also differentiates senescent crops from bare soil.
Indices
and
were selected by analyzing data from the hailstorm on February 10th, 2019, near Lozada town, province of Cordoba, Argentina. The analysis was performed in a 90-hectare soybean field (plot 79 in
Table S1, hereafter referred to as model plot) in the R3 phenological stage according to the Fehr scale [
38]. At the time of the hailstorm, the plot presented total and homogeneous crop cover. After the storm, damage ranged from null to total crop loss, with intermediate degrees of damage in some sectors.
All Sentinel-1 and -2 images containing the model plot were selected from 12 December 2018 (60 days before the storm) to 11 April 2019 (60 days after the storm). Each image was cut out to fit the plot shape, and two datasets were built: S1 with images from mission Sentinel-1, and S2 with images from Sentinel-2. Datasets were pre-processed as described in
Section 2.3. In each dataset, the five vegetation indices for each signal were calculated for each image pixel. A graph was built for each index showing the time series of the model plot pixels. Standard deviations of index values were calculated for each image. Next, the maximum values of the standard deviations pre-storm (
) and post-storm (
) were calculated for each of the ten vegetation indices
(
j = 1, …, 5 with S1 dataset, and
j = 6, …, 10 with S2 dataset). Then we selected the lowest quotient among the vegetation indices for each dataset S1 and S2, defined as
and
respectively and calculated as:
Thus, the microwave index and the spectral index with the lowest pre-storm backscatter compared to the post-storm backscatter were selected.
2.5. Algorithm for Identification of HDZs
An unsupervised learning algorithm is proposed to identify HDZs within a plot using Sentinel-1 and-2 satellite images. The steps of the algorithm are: (1) selection and pre-processing of satellite data to remove speckle noise, presence of clouds, and other atmospheric phenomena; (2) calculation of vegetation indices with the highest sensitivity to hail-caused damage to constructing the data matrix for pixel classification; (3) preparation of the variables for classification and data matrix construction; (4) unsupervised pixel classification into HDZs.
- (1)
Selection and pre-processing of satellite data
For each plot, satellite images within the 60-day pre- and post-storm time window, as described in
Section 2.2, were selected. Each image was cut out to fit the plot shapes and datasets S1 and S2 were assembled. Both datasets were pre-processed by correcting speckle on microwave signals and atmospheric interferences on the multispectral signal, as described in
Section 2.3.
- (2)
Calculation of vegetation indices
Indices
and
were calculated for each pixel of datasets S1 and S2 respectively, as described in
Section 2.4.
- (3)
Preparation of classification variables and construction of data matrix
To determine HDZs in a field with hail-damaged crops, pixels were classified using the data in Sentinel-1 and -2 images. The number of pixels depended on the size of the affected field, with analyzed fields having an average area of 60 hectares. Thus, for the average field size, 6000 pixels of 10 m resolution (100 m2) each were analyzed.
The algorithm was expected to classify pixels into HDZs. Since the pixels in a given data matrix always corresponded to a single plot, they invariably represented a single crop and variety, sown on the same date. Classification into HDZs considered the pre- and post-hailstorm crop evolution. The resulting HDZs were expected to cluster plants with a similar response to the biophysical variables estimated by the indices, both prior to and after the hailstorm.
In order to capture the diversity and complexity of possible hail-induced crop changes, both microwave and multispectral signals were used. As mentioned in
Section 1, microwave signals are more sensitive to changes in structure and geometry, while the multispectral signal is more sensitive to physiological changes manifesting themselves in canopy coloration.
Since the revisit time of the Sentinel 2 mission is shorter than that of Sentinel 1, Sentinel 2 will capture a larger number of images than Sentinel 1 (RADAR) for a given time window. For this reason, a linear interpolation of spectral index values was performed to estimate a value for each of the dates of SAR image capture. This allowed having the same amount of data for both microwave and spectral signals in the matrix, thus balancing the influence of each dataset. The procedure also provided multispectral images with temporal equidistance from each other, which allowed calculating the rates of index value changes.
The rates of change (derivative functions) were calculated by differentiating each of the time series obtained when calculating the different indices of vegetation in the plots. Differentiating an equispaced time series consists of subtracting the immediately preceding value from each value and, therefore, it served us to calculate the derivative of the function. Exploration of both series, the original and the derivative, is interesting since in certain cases the different groups may differ due to the indices observed in a given period, but in other cases due to their change ratios [
39]. It was not possible to calculate the derivative of the first image due to a lack of previous data, so it was removed from the matrix after calculating the derivatives. Thus, the data matrix for each plot had as many rows as pixels, and the variables
,
and their corresponding derivatives (
,
) calculated for each observation within the time window. Thus, if the sowing (harvesting) date was prior to (following) the time window (60 days pre- and post-storm date), 9 observations were obtained (4 pre-storm and 5 post-storm) for each variable (
,
,
,
), resulting in a maximum of 36 values in the vector for each pixel.
- (4)
Unsupervised Pixel Classification
HDZs will be defined using unsupervised machine learning. Unsupervised learning aims to find the underlying structure of a dataset and cluster its data by similarity. The methods most widely used include K-means [
40], hierarchical clustering [
41], and mixture modeling [
42]. Hierarchical clustering requires calculating the distance matrix between all data, which increases processing costs considerably in the case of big datasets. Mixture modeling requires assuming restrictive hypotheses on the probability distribution of the clusters defined. We used K-means [
40] because it is simple, easy to implement, and quick and cost-effective for computing big datasets [
43]. K-means is an iterative clustering algorithm used for partitioning
n observations (
x1,
x2, …,
xn) into
k groups (
S = {
S1,
S2, …,
Sk}). Each observation is allocated to the cluster with the nearest mean value in order to minimize within-cluster variances, based on the following equation:
where
is the mean of observations in
.
This method, which has been proven to have an optimal performance in scientific research for analyzing data clusters, requires specifying the number of groups into which data will be classified. For HDZ detection, plot pixels were grouped into three categories (k = 3). In this sense, three classes were defined per plot because it is the number of homogeneous damage zones most frequently detected in the field using visual techniques; thus, it is validated by the expertise of claims adjusters seeking to simplify and standardize field surveys. The Euclidean distance was used to measure proximity to centroids. For different applications of the algorithm, this parameter can be modified to include other classification objectives.
The algorithm was programmed in Python 3 (
https://www.python.org/ (accessed on 31 January 2021)), using the pandas, GeoPandas, ee, eeconvert and DateTime libraries.
2.6. Algorithm Validation with In Situ Data
The quality of clusters provided by the algorithm proposed in this work was validated by external data, different from the data used to determine HDZs in each plot. Even though no information was available about the actual HDZ to which each pixel belonged for any of the plots analyzed, we had in situ damage measurements for some spots.
Pixels for which in situ data were identified in each plot. These pixels constituted a sub-sample of the three HDZs identified with the algorithm, for which actual damage data was available from in situ measurements. To validate the quality of pixel clusters into HDZs, we compared the means of the damage percentages calculated with in situ data in the three HDZs identified by the algorithm. If the algorithm identified HDZs correctly, the mean actual damage percentage had to differ between the zones compared. Homogeneity between HDZs was tested using one-way ANOVA tests. Differences were considered statistically significant if p-value < 0.05.