Next Article in Journal
The Assessment of Climate Variables and Geographical Distribution on Residential Drinking Water Demand in Ethiopia
Previous Article in Journal
Unusual Catalytic Effect of Fe3+ on 2,4-dichlorophenoxyacetic Acid Degradation by Radio Frequency Discharge in Aqueous Solution
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Clustering for Regional Time Trend in the Nonstationary Extreme Distribution

1
Department of Statistics, University of Seoul, Seoul 02504, Korea
2
Department of Statistics, Seoul National University, Seoul 02504, Korea
*
Author to whom correspondence should be addressed.
Water 2022, 14(11), 1720; https://doi.org/10.3390/w14111720
Submission received: 27 March 2022 / Revised: 16 May 2022 / Accepted: 19 May 2022 / Published: 27 May 2022
(This article belongs to the Section Hydrology)

Abstract

:
Since the estimation of tail properties requires a stationarity of observations, it is necessary to develop a de-trending method not dependent on underlying distributions for nonstationary hydrological processes. Moreover, de-trending has been independently applied to hydrological processes, even though the processes are observed in geometrically adjacent sites. This paper presents a distribution-free de-trending method for nonstationary hydrological processes. Our method also provides clustered regional trends obtained by sparse regularization in a general distribution. It aggregates the parameter estimation and clustering within a unified framework. In the simulation study, our proposed method has superiority over other compared methods with respect to MSE and variance of coefficients. In real data analysis, the clustered trends of the annual maximum precipitation in the South Korean peninsula are reported, and the patterns of the estimated trends are visualized.

1. Introduction

One of the main problems in extreme data analysis is the lack of observations and the scarcity of data, leading to large variances of the estimators, which deteriorate the predictive performances of the statistical model [1,2,3,4]. The regional frequency analysis (RFA) (e.g., [1,2]) has been commonly employed to overcome this problem. The RFA assumes that the hydrometeorological variables across homogeneous regions follow a common probability distribution. The observations are pooled on the identified homogenous regions, and the parameters of the distributions are estimated by the pooled observations. Therefore, the regionalization of the analyzed basins is a prerequisite for the RFA procedure, and its success mainly depends on the selection of appropriate clustering methodology.
Various clustering techniques have been developed and applied in RFA [1,2,3,4,5,6,7,8]. The geomorphological variables (e.g., drainage area, basin elevation, soil runoff coefficient) and meteorological variables (e.g., mean precipitation, quantiles of precipitations, annual mean degree days over 0 °C) are used to cluster the homogeneous regions. There is no consensus as to which variables should be considered in clustering. Rather, the choice of variables and clustering technique depends on the purpose of the analysis. However, since a specific class of distributions is determined after the identification of homogeneous regions, it is common to use the distribution-free clustering method.
A few researchers [9,10,11,12,13] proposed a new RFA under nonstationary circumstances. The main problem of the conventional RFA is its asymptotical validity based on the stationary assumption. For example, the L-moments used in RFA [1,2] should be obtained by the order statistics from a stationary distribution. When the underlying distribution is nonstationary, the large sample theory is not valid anymore. Furthermore, the homogeneity of the distributions is not clearly defined under nonstationary circumstances, so there is no universal framework to characterize regional homogeneity in the RFA.
Ref. [11] assumes the nonstationary distribution with a linear trend and proposes the nonstationary RFA where homogeneity is defined over de-trended distributions. According to Ref [11], considering the homogeneous regions improves the estimation of nonstationarity, which affects the final decision of the RFA. Thus, we propose a distribution-free de-trending method for nonstationary hydrological processes, which aggregates the parameter estimation and clustering within a unified framework. Building on Ref [11], a two-stage de-trending method is considered: the decomposition of regional homogeneities into stationarity and the estimation of nonstationarity. The first one includes regional specific factors considered in the conventional RFA, and the second one includes the patterns of changes in hydrometeorological variables.
In this study, it is assumed that the patterns are represented by linear functions of time, which implies nonstationarity, and we focus on detecting homogeneities on the linear trends. This study proposes a new methodology to cluster the linear trends with use of sparse regression models [14,15,16,17,18,19,20]. The proposed method simultaneously facilitates detection of potential nonstationarity and estimation of the regression parameters without any assumption of an underlying distribution. That is, the proposed method is a distribution-free clustering technique for grouping nonstationarity.
This paper is organized as follows. Section 2 introduces the regression model to estimate linear time trend of a nonstationary distribution. Section 3 explains the regularization method for the regression model and defines the penalized risk function for estimating the linear trends in the distribution. In addition, the optimization and the model selection follow in this section. Section 4 illustrates some numerical results by grouping coefficients and conducts a real data analysis. Discussion and concluding remarks are followed in Section 5.

2. Nonstationary Distribution with a Linear Time-Variant Mean

In the RFA, the class of distributions is identified only after determining the homogeneous regions. Because the regionalization contains the identification of the characteristics in distributions, it is not recommended to pre-determine the class of distributions before the regionalization. To avoid the misspecification of an underlying distribution, it is assumed that U t j = β j t + ϵ t j , where β j is a trend parameter at site j   and ϵ t j ~ i i d   f j for all t . Here, f j denotes a general probability density function (pdf) at a site   j . For example, f j is the pdf of a three-parameter distribution with the location ( ξ j ) , scale ( α j ) and shape ( κ j ) parameters, such as the GEV distribution. The homogeneity on the distribution f j s is investigated by L-moments in the conventional RFA. Apart from the homogeneity of the distributions f j s, the homogeneity in the linear time trend is also defined by the values of β j s. Thus, two types of homogeneity in stationary and nonstationary parts are considered in the model, and estimation of the latter one is of our interest in this study.
Let Υ j be the time domain of the site j , and let the true parameter of linear trend coefficients at a site j in Equation (1) be β j * . Then, the distribution of u ˜ t t j = u t j u t j is symmetric at ( t t ) β j * for all t and t . That is, it can be rewritten by u ˜ t t j = ( t t ) β j * + ϵ ˜ t t j with ϵ ˜ t t j = ϵ t j ϵ t j of which the distribution is symmetric at zero. If β j * = β * for all j , or if there is a global mean trend in considered sites, then the β * can be estimated by β ^ = ( β 1 ^ , , β p ^ )   , which is minimized. Let Υ ˜ j = { ( t , t ) : t , t   Υ j ,   0 < | t t | m j } for some m j ,   j = 1 ,   ,   p , then the objective function is given by
j = 1 p ( t , t ) Υ ˜ j | u ˜ t t j ( t t ) β j | ,
subject to β j = β j for all 1 j , j p . When m j is the number of elements of Υ j , Equation (1) is known as rank-based regression model, and β ^ is a generalization for the classical Wilcoxon–Mann–Whitney rank statistics for independent observations [11,21]. In addition, β ^ can be understood as the least absolute deviation (LAD) estimator, which is an M-estimator for minimizing the empirical risk with l 1 loss function. Its asymptotic properties are well studied by Refs [22,23].
The estimation method of minimizing the empirical risk function with l 1 loss function can be applied to analyze the linear trends combining the regionalization method. Assume that the time trends are clustered with q groups. Let β = ( β 1 , , β p ) R p and U ( β ) = { γ 1 ,   , γ q } be the set of the unique values of β . Additionally, let G ( β ) = { G 1 ( β ) , , G q ( β ) } be the set of indices of grouped variables with β j = γ k for all j G k ( β ) . For example, let p = 4 and β 1 = 1 ,   β 2 = 1 ,   β 3 = 1 ,   β 4 = 1 . Then, set q = 2 , γ 1 = 1 ,   γ 2 = 1 , and G 1 ( β ) = { 1 , 3 } ,   G 2 ( β ) = { 2 , 4 } such that U ( β ) = { 1 , 1 } and G ( β ) = { { 1 , 3 } ,   { 2 , 4 } } . When ideally G ( β * ) , the true cluster, is known, the parameter β can be estimated by
β ^ = a r g m i n β = ( β 1 , ,   β p ) j = 1 p ( t , t )   Υ ˜ j | u ˜ t t j ( t t ) β j | ,
subject   to   β j = β j   for   all   j , j G k ( β * )   ( k = 1 , , q ) .
By the constraint of Equation (2), each estimate in G k ( β * ) for k = 1 , , q has the same value. However, G k ( β * ) is unknown in practice, such that such β ^ is an ideal estimate in Equation (2). If the regions are clustered with the same linear time trend, the estimation of the coefficient β j is expected to be improved by pooling data. Otherwise, the improvement is not expected due to poor clusters of regions. That is, when the pooled observations for regionalization are used, the bias of the estimator can be caused by heterogeneity of populations with wrong clusters. That is, the clustering regions play a key role of regionalization. In the next section, we explain a new method to estimate the clusters and the linear time trend simultaneously.
Remark 1.
It can be assumed that  U t j = h j ( t ) + ϵ t j , where ϵ t j ~ i i d   f j , where h j is a time-varying locational function. In this case, U ˜ t t j = U t j U t j is also symmetric at h j ( t ) h j ( t ) , and the nonparametric method [24] can be employed to estimate h j .

3. Proposed Method

The proposed method is based on a regularization method in the regression model, which is widely studied in statistics and computer science. First, candidates of clusters are obtained by the regularization method [16]. Second, sets of coefficients are recovered by applying the method to estimate the linear time trend. At last, the estimated coefficients are selected according to Bayesian information criterion [25].

3.1. Regularization Method for Regression Models

First, the empirical risk function Equation (2) is reformulated as a typical regression model with predictors. Let l j be the number of elements in the set Υ ˜ j for j = 1 , ,   p and r j : Υ ˜ j { 1 , , l j } be a bijective function, which denotes the index of each element ( t , t )   Υ ˜ j . Let x ˜ s j = t t and y ˜ s j = u ˜ t t j for s = r j ( t , t ) , and let x ˜ j = ( x ˜ 1 j , , x ˜ l j j   ) T R l j and y ˜ j = ( y ˜ 1 j , , y ˜ l j j ) T R l j be the predictor and response vector in Equation (2). By concatenating the x ˜ j s and the y ˜ j s , extended vectors X j = ( 0 s j 1 T , x ˜ j T , 0 n s j T ) T and y = ( y ˜ 1 T ,   ,   y ˜ p T ) T are defined, where 0 p is the p -dimensional zero valued column vector and s j = k = 1 j l k , and n = s p . Finally, let X = ( X 1 X p ) be n × p design matrix in the regression model and x i be the i th row vector of X , and let y i be the i th element of y . Then, the empirical risk function Equation (2) is written by the terms of y i and x i as L ( β ) = i = 1 n | y i x i T β | , which is used in estimating the coefficients in the LAD regression model. The regression coefficient is estimated by minimizing L ( β ) , and the regularization method is applied to the L ( β ) .
The regularization method gives a sparse estimate whose values are exactly zero or grouped. The regularized estimator is defined by the minimizer of L ( β ) + p λ ( β ) , where p λ ( ) is a non-negative valued function depending on λ 0 . When p λ ( β ) = λ j = 1 p | β j | ,   p λ ( β ) is called the lasso penalty function [26], which gives a shrinkage estimate toward zero, such that some components in the estimate become exactly zero. Using this property, a sparse estimator, simultaneously achieving model selection and parameter estimation, is obtained. In our example, the model automatically detects the site with no linear trend. There is another penalty function called the fused lasso penalty, defined by p λ ( β ) = λ ( j , k ) N | β j β k | with N { ( j , k ) : 1 j , k p } . The fused lasso penalty produces a shrinkage estimate toward centers of the neighborhoods pre-defined by the network N . By a proper design of the network N , the fused lasso penalty is used for detecting change point of time series [27,28] and grouping regression coefficient [16,19] in the regression model. A regularized method using this fused lasso penalty in the regression model is proposed, as follows. In our case, the estimator is given by
β ^ λ 1 , λ 2 = L ( β ) + λ 1 j = 1 p | β j | + λ 2 ( j , k ) N | β j β k | ,
where N is a set of index pairs of sites, and λ 1 and λ 2 are the tuning parameters with a non-negative value.
The tuning parameters control the complexity of the estimate model, which is roughly represented by the effective number of parameters in the model. When λ 1 = λ 2 = , β ^ λ 1 , λ 2 = 0 . Especially, when λ 1 = 0 ,   λ 2 = ,   β ^ λ 1 , λ 2 is given by the minimizer of L ( β ) with constraints β j = β j for all 1 j , j p . When λ 1 = 0 ,   λ 2 = 0 , β ^ λ 1 , λ 2 is the minimizer of L ( β ) without the constraints. For λ 1 = λ 2 , the proposed estimator is a minimizer of the empirical risk function with constraints:
β ^ = argmin β   L ( β ) ,
subject to j = 1 p | β j | + ( j , k ) N | β j β k | C for some C , which is a non-negative constant corresponding to λ 1 . Figure 1 shows the boundaries of regions created by the constraints of lasso and fused lasso penalty function. The non-differentiable points on the boundary give a sparse or grouped solution minimizing L ( β ) .
Remark 2.
The network N { ( j , k ) : 1 j , k p } in the penalty function p λ ( β ) can be chosen adaptively by observations. If an initial estimator β ˜ j for β j with asymptotic variance being s j 2 is available, then we can use N = { ( j , k ) : | β ˜ j β ˜ k | s j 2 + s k 2   C ,   1 j , k p } for some C . The edge in the network N is based on the test statistics for the regression coefficients from two independent populations. That is, for the pairs with significantly different regression coefficients, the associated parameters are not regularized for grouping.

3.2. Recovering Procedure

The estimator β ^ λ 1 , λ 2 is known as a biased estimator, which shrinks toward zero or grouped coefficients. Even when the coefficients are well clustered, the bias can mislead the estimate of the linear trends. To avoid this problem, a recovering procedure is considered, which refits the linear trends on the clustered sites. Recall that U ( β ) is a set of unique elements of β , and G ( β ) is the set of indices of grouped variables. With β ^ λ 1 , λ 2 , homogeneous sites are given by G ( β ^ λ 1 , λ 2 ) . Then, a linear trend in a homogeneous cluster is estimated by
β ^ r c h ,   λ 1 , λ 2 = argmin γ j G k ( β ^ λ 1 , λ 2 ) t , t Υ j | u ˜ t t j ( t t ) γ | ,
For h G k ( β ^ λ 1 , λ 2 ) and k = 1 , , | U ( β ^ λ 1 , λ 2 ) | . We call β ^ r c λ 1 , λ 2 a recovered estimator for linear trends. Then, a set of recovered parameters is constructed by varying the regularization parameters λ 1 and λ 2 , denoted by B = { β ^ r c λ 1 , λ 2 R p : 0 λ 1 , λ 2 } , and the best estimate is chosen according to a model selection criteria.

3.3. Model Selection

The tuning parameters λ 1 and λ 2 control a complexity of the estimated model. The desired asymptotic properties, such as selection consistency, risk optimality, always require a proper selection of the tuning parameters in the regularization method. Under too large λ 1 and λ 2 , too simple a model is obtained, where the unique groups of estimated coefficients are not separated.
Under too small λ 1 and λ 2 , the grouping of homogeneous sites fails. Generally, the Bayesian information criterion (BIC) [25] is widely used to choose the tuning parameters. In the quantile regression, BIC is derived by Ref [29], which showed the robustness and model selection consistency of the BIC. We modify the conventional BIC [29] in the proposed regression model because the response variables are highly correlated. The BIC of the estimated model through β ^ r c λ 1 , λ 2 is defined by
BIC ( λ 1 , λ 2 ) = log ( i = 1 n | y i x i T β ^ r c λ 1 , λ 2   | ) + | U ( β ^ r c λ 1 , λ 2 ) | log ( n * ) 2 n * ,
where | U ( β ^ r c λ 1 , λ 2 ) | is the number of unique elements in U ( β ^ r c λ 1 , λ 2 ) , and n * is the number of observations. In the proposed regression model, y i for i = 1 , , n , is a pairwise difference of two response variables. For a site j , the number of { y i } corresponding to the site j is the number of elements in Υ j ˜ such that the value of   i = 1 n | y i x i T   β ^ r c λ 1 , λ 2   | has too much influence on BIC. For this reason, correcting the n by n * can be understood as employing the effective number of observations. By choosing λ 1 and λ 2 , the model minimizing BIC ( λ 1 , λ 2 ) on B = { β ^ r c λ 1 , λ 2 R p : 0 λ 1 , λ 2 } is selected.

3.4. Optimization

The proposed estimator is the minimizer of the convex function with non-differentiable points. Generally, the optimization of the non-differentiable function requires much more computational cost than the differentiable function. In particular, the fused lasso penalty increases computational complexity in obtaining β ^ λ 1 , λ 2 . For example, the coordinate algorithm [30] fails to achieve the minimum of the objective function due to non-differentiability of ( j , k ) N | β j β k | at β j = β k for ( j , k ) N . It is well known that the quantile regression can be solved by linear programing, which is an optimization algorithm of a linear objective function, subject to linear equality constraints. Thus, the objective function and its constraint are given as follows:
min i = 1 n ( ζ i + + ζ i ) + λ 1 j = 1 p ( β j + + β j   ) + λ 2 ( j , k ) N ( β j k + + β j k )
s . t .   ζ i + ζ i = y i x i T β ,   ζ i + , ζ i 0   ( i = 1 , , n )
β j = β j + β j ,     β j + , β j 0   ( j = 1 , , p )
β j k + β j k = β j β k ,   β j k + , β j k 0   ( j , k ) N .
Equation (7) is the minimization problem with respect to ζ i + ,   ζ i for i = 1 , , n ,   β j + , β j for j = 1 , , p and β j k + ,   β j k for ( j , k ) N . Then, the solution of β = ( β 1 , , β p ) T in Equation (3) is given by β j = β j + β j for j = 1 , , p , where   β j + and β j are the solution of Equation (7). We implement the optimization algorithm using R, whose code is found in https://github.com/chulhongsung/TrendClustering (accessed on 24 May 2021).
Remark 3.
When n is large, the computational problem frequently occurs in maintaining physical memory on the linear programing. In this case, we can replace the loss function in Equation (3) by Huber loss [31] and apply the alternative directional multiplier method (ADMM) [32] to minimize Equation (3). The ADMM consists of two convex problems, which can be easily solved by the MM algorithm [33]. The computational cost of the ADMM does not depend much on n , such that we can avoid the scalable problem in memory. However, the ADMM generally requires a large number of iterations, so we should consider the time cost before applying the ADMM.

4. Numerical Studies

4.1. Simulation

We compare the performance of the proposed method with those of the other four methods: the naive LSE(nLSE), the naive LAD(nLAD) estimator, the rank LSE(rLSE), the rank LAD(rLAD) estimator. The first two methods use the conventional linear regression model with u t j , as the response variable and estimate coefficients by minimizing the empirical risk with l 2 and l 1 loss function, respectively. The latter two methods use the differences of the observed value u t j u t j and estimate the coefficients with l 2 and l 1 loss function, which is known as rank regression [21]. These methods are known as being robust to the underlying distributions of u t j , while all methods except the proposed one do not provide grouped coefficients. u t j is generated from GEV ( ξ j + β j t ,   α j , κ j ) for t = 1 , , T and j = 1 , , p , and the heterogeneity of characteristics of sites is modeled by ξ j ~ N ( 120 , 20 ) ,   α j ~ N ( 40 , 10 ) , κ j ~ U ( 0 , 3 , 0.3 ) . a r denotes ( a , , a ) R r . Throughout all simulations, let p = 30 and T = 30 , and the performances of the considered estimators are measured by 200 repetitions. In simulation 1, let β = ( 5 10 , 0 10 , 5 10 ) and N = { ( j , k ) : 0 < | j k | 15 } . In this case, the linear time trends in the sites are clustered into three large groups. Figure 2 shows the mean squared error of differences between the estimated parameters and the true parameters.
The proposed method is the best, and the naive LSE, or the rank LSE, is worst (note that naive LSE or rank LSE are exactly equal in linear time trend model). It is found that the naive LAD is slightly better than the naive LSE. Using l 1 loss function, the variances are reduced due to the robustness of the LAD estimator to heavy tailed distribution. The rank LAD has smaller MSE than the naive LAD. The proposed method shows critical improvements in estimating the coefficients compared with the rank LAD. Since the proposed method gives grouped estimates, the variances are reduced, such that the MSE, surprisingly, decreases, as shown in the left panel of Figure 3. In this simulation, the proposed method gives 3.6 clusters of estimated coefficients, on average. In simulation 2, we let β = ( 5 5 , 2.5 5 , 0 10 ,   2.5 5 , 5 5 ) and N = { ( j , k ) : 0 < | j k | 15 } . In this case, the strength of signal in the time trend is weaker, and the grouping coefficient is more difficult than in simulation 1. As shown in Figure 4, in MSE, the proposed method is still the best, and the naive LSE is the worst. In this simulation, the rank LAD estimator is still better than the naive LAD estimator. However, the improvement of the proposed method becomes less compared with the result in simulation 1. Figure 5 shows that a promising improvement of the proposed method is not achieved when the true coefficients between clusters are not separated well.

4.2. Real Data Analysis

We apply the proposed method to estimate the linear time trend of the annual maximum of daily precipitation in South Korea. Our research area, the southern portion of the Korean Peninsula, is located in east Asia between the longitude 126°–132° and latitude 33°–38°. About 70% of the Korean Peninsula’s topography is composed of mountains. One of the important geographical factors affecting the climate in South Korea, the Taebaek Mountain range, is located on the east side from north to south. It creates overarching geographical property, which is east high and west low and climate diversity of South Korea (Figure 6). This area has four seasons, with glaringly distinctive climate features. In particular, the amount of precipitation is mainly in the summer, and 50 to 60% of the annual precipitation is concentrated at this time. The annual maximum of daily precipitation from 1971 to 2021 collected from 60 meteorological observatories is analyzed, and the information about the observatories is listed in Table A1. Define a set of indices of site pairs by neighborhood N defined in Section 3.1, whose distance is less than 100 km. In addition, let Υ ˜ j by { ( t , t ) : t , t   Υ j ,   0 < | t t | 50 } . By varying the tuning parameter, the set of estimates is obtained, and the tuning parameter, the minimizing BIC in Section 3.3, is chosen. The BIC is minimized at λ 1 = λ 2 and λ 2 = 330 Figure 7 illustrates the solution sets defined by { β : β = β ^ r c λ 1 , λ 2 ,   λ 1 = λ 2 and λ 2 > 0 } and shows the BIC corresponding to the solution sets. The estimated trend coefficients achieving the minimum BIC are provided in Table A2.
The coefficients are clustered into twelve groups. The largest group consists of 27 sites, which are located in southeastern coastal line of South Korea. The estimated trend coefficient of the group is about 0.47, such that the increasing trend of extreme rainfall is detected. The second largest group consists of 22 sites, which are located in the midland, and the estimates are about 0.16. From the obtained estimates, the trends in the west and the midland are much weaker than the ones in the south and east. In the western coastal site, the homogenous site is estimated, and relatively weak trend is detected.
Figure 8 summarizes the results that display the trend coefficient on the southern portion of the Korean Peninsula by the Kriging method with Gaussian filter. The trend coefficients illustrated in Figure 8 are estimated with λ 1 = λ 2 and λ 2 = 510 ,   330 ,   270 . Even though the trend maps are different to each other, it is remarkable that the pattern in the increasing trend is separated in western sites and southern sites when the number of clusters grows. Moreover, Figure 8 indicates that weak trends in the western coastal line and the midland are estimated, and strong positive trends in Jeju, which is the largest island in South Korea, and Ulleung, which is an island located 120 km east of the Korean Peninsula, are estimated. As the clusters are subdivided, the negative trend in Ganghwa, which is located in the northwestern end, tends to be clear. Compared to the west site, the east site has a relatively strong and nonstationary trend in heavy rainfall. These results imply that (1) the intensity of heavy rainfall tends to be increasing, and nonstationarity, represented by trend coefficients of the Jeju and Ulleung islands, is more severe; (2) the heavy rainfall trend in the coastal site has distinct characteristics depending on its location; (3) the trend in Ganghwa uniquely shows a negative trend.

5. Discussion and Concluding Remarks

We proposed a new method of simultaneously detecting and clustering regional time trends in the nonstationary distribution. The proposed method has three advantages. The first is that it does not depend on the underlying distribution of the observations. When the de-trending method is considered in the conventional RFA, it is essential to estimate the linear trend without any knowledge of the unknown distribution. The proposed method is distribution free, so the moment condition corresponding to the underlying distribution is not required. The second is that the proposed method provides an easy way to determine the number of clusters. Even though the typical clustering algorithm can be applied to group the statistics for the linear trends, it is not clear how to consider the estimation errors in the clustering algorithm. However, since our method gives clustered coefficients based on the empirical risk, the information criteria guarantee the selection of optimal clusters, theoretically. Lastly, the variance of estimated linear time trends is reduced. By pooling observations [21], more accurate estimated trends can be obtained.
In the real data analysis for the annual maximum daily precipitation, it is found that there is a pattern of homogeneity of nonstationarity across the South Korean peninsula. The result of our analysis shows that there are different patterns in the south and east sites and the west sites of the South Korean peninsula. The decreasing or stationary trends of extreme precipitation were found in the western sites. On the contrary, it was found that positive linear trends are clustered in the sites of eastern and southern coastal lines, and it is expected that the extreme events are more frequently observed on those sites. These results are interesting, in that Ref [11] found that the probability of extreme events was higher in the southern sites than others; however, our result shows that temporal trends of extreme events are decreasing in the southern sites. This means that a further analysis should be performed to detect the patterns of extreme climate events. We believe that our work can contribute to future research on regional frequency analysis under nonstationary distribution.
However, the proposed estimator severely depends on the tuning parameter selection. For example, a careless selection of the tuning parameter can mislead the result of estimating the linear time trend. Thus, it is essential to develop or justify a tuning parameter selection method, and we leave this problem for our future work.

Author Contributions

Conceptualization, J.-J.J., Y.K.; methodology, S.H., J.-J.J.; writing—review and editing, S.H., J.-J.J., Y.K.; visualization, S.H.; supervision, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

Jeon was supported by the 2018 Research Fund of the University of Seoul.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available at https://data.kma.go.kr/data/grnd/selectAsosRltmList.do (accessed on 18 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Site information.
Table A1. Site information.
Site IDSiteLatitudeLongitudeSite IDSiteLatitudeLongitude
90Sokcho38°15′128°33′202Yangpyeong37°29′127°29′
100Daegwallyeong37°40′128°43′203Icheon37°15′127°29′
101Chuncheon37°54′127°44′211Inje38°03′128°10′
105Gangneung37°45′128°53′212Hongcheon37°41′127°52′
108Seoul37°34′126°57′221Jecheon37°09′128°11′
112Incheon37°28′126°37′226Boeun36°29′127°44′
114Wonju37°20′127°56′232Cheonan36°46′127°07′
115Ulleung37°28′130°53′235Boryeong36°19′126°33′
119Suwon37°16′126°59′236Buyeo36°16′126°55′
129Seosan36°46′126°29′238Geumsan36°06′127°28′
130Uljin36°59′129°24′243Buan35°43′126°42′
131Cheongju36°38′127°26′244Imsil35°36′127°17′
133Daejeon36°22′127°22′245Jeongeup35°33′126°51′
135Chupungnyeong36°13′127°59′247Namwon35°24′127°19′
136Andong36°34′128°42′256Juam35°04′127°14′
138Pohang36°01′129°22′260Jangheung34°41′126°55′
140Gunsan36°00′126°45′261Haenam34°33′126°34′
143Daegu35°53′128°37′262Goheung34°37′127°16′
146Jeonju35°49′127°09′272Yeongju36°52′128°31′
152Ulsan35°33′129°19′273Mungyeong36°37′128°08′
156Gwangju35°10′126°53′277Yeongdeok36°31′129°24′
159Busan35°06′129°01′278Uiseong36°21′128°41′
162Tongyeong34°50′128°26′279Gumi36°07′128°19′
165Mokpo34°49′126°22′281Yeongcheon35°58′128°57′
168Yeosu34°44′127°44′284Geochang35°40′127°54′
170Wando34°23′126°42′285Hapcheon35°33′128°10′
184Jeju33°30′126°31′288Miryang35°29′128°44′
189Seogwipo33°14′126°33′289Sancheong35°24′127°52′
192Jinju35°09′128°02′294Geoje34°53′128°36′
201Ganghwa37°42′126°26′295Namhae34°48′127°55′
Table A2. Trend coefficients of sites.
Table A2. Trend coefficients of sites.
Site IDSiteTrend Coef.Site IDSiteTrend Coef.
90Sokcho0.474071202Yangpyeong0.159259
100Daegwallyeong−0.465909203Icheon0.159259
101Chuncheon0.474071211Inje0.474071
105Gangneung0.474071212Hongcheon0.159259
108Seoul−0.465909221Jecheon0.474071
112Incheon0.070833226Boeun0.159259
114Wonju0.159259232Cheonan0.159259
115Ulleung0.766667235Boryeong0.159259
119Suwon0.159259236Buyeo0.159259
129Seosan0.159259238Geumsan0.159259
130Uljin0.793181243Buan0.159259
131Cheongju0.159259244Imsil0.159259
133Daejeon0.159259245Jeongeup0.474071
135Chupungnyeong0.159259247Namwon0.474071
136Andong0.474071256Juam0.474071
138Pohang0.474071260Jangheung0.474071
140Gunsan0.159259261Haenam0.159259
143Daegu0.474071262Goheung0.474071
146Jeonju0.159259272Yeongju0.159259
152Ulsan0.474071273Mungyeong0.159259
156Gwangju0.159259277Yeongdeok0.474071
159Busan0.474071278Uiseong−0.220588
162Tongyeong0.474071279Gumi0.474071
165Mokpo0.023529281Yeongcheon0.474071
168Yeosu0.474071284Geochang0.474071
170Wando−0.341463285Hapcheon0.474071
184Jeju0.5288Miryang0.474071
189Seogwipo1.634286289Sancheong0.474071
192Jinju0.474071294Geoje0.474071
201Ganghwa−0.47295Namhae0.474071

References

  1. Hosking, J.; Wallis, J. Some statistics useful in regional frequency analysis. Water Resour. Res. 1993, 29, 271–281. [Google Scholar] [CrossRef]
  2. Hosking, J.; Wallis, J. Regional frequency analysis. In Regional Frequency Analysis: An Approach Based on L-Moments; Cambridge University Press: Cambridge, UK, 1997; pp. 1–13. [Google Scholar] [CrossRef]
  3. Rao, A.R.; Srinivas, V.V. Regionalization of watersheds by fuzzy cluster analysis. J. Hydrol. 2006, 318, 57–79. [Google Scholar] [CrossRef]
  4. Shu, C.; Ouarda, T. Regional flood frequency analysis at ungauged sites using the adaptive neuro-fuzzy inference system. J. Hydrol. 2008, 349, 31–43. [Google Scholar] [CrossRef]
  5. Chebana, F.; Ouarda, T. BMJ. Multivariate L-moment homogeneity test. Water Resour. Res. 2007, 43, 1–14. [Google Scholar] [CrossRef] [Green Version]
  6. Sadri, S.; Burn, D.H. A Fuzzy C-Means approach for regionalization using a bivariate homogeneity and discordancy approach. J. Hydrol. 2011, 401, 231–239. [Google Scholar] [CrossRef]
  7. Yu, Y.; Shao, Q.; Lin, Z. Regionalization study of maximum daily temperature based on grid data by an objective hybrid clustering approach. J. Hydrol. 2018, 564, 149–163. [Google Scholar] [CrossRef]
  8. Asadi, P.; Engelke, S.; Davison, A.C. Optimal regionalization of extreme value distributions for flood estimation. J. Hydrol. 2018, 556, 182–193. [Google Scholar] [CrossRef] [Green Version]
  9. Leclerc, M.; Ouarda, T. Non-stationary regional flood frequency analysis at ungauged sites. J. Hydrol. 2007, 343, 254–265. [Google Scholar] [CrossRef]
  10. Sun, X.; Thyer, M.; Renard, B.; Lang, M. A general regional frequency analysis framework for quantifying local-scale climate effects: A case study of ENSO effects on Southeast Queensland rainfall. J. Hydrol. 2014, 512, 53–68. [Google Scholar] [CrossRef] [Green Version]
  11. Sung, J.H.; Kim, Y.O.; Jeon, J.J. Application of distribution-free nonstationary regional frequency analysis based on L-moments. Theor. Appl. Climatol. 2018, 133, 1219–1233. [Google Scholar] [CrossRef]
  12. Bracken, C.; Holman, K.D.; Rajagopalan, B.; Moradkhani, H. A Bayesian hierarchical approach to multivariate nonstationary hydrologic frequency analysis. Water Resour. Res. 2018, 54, 243–255. [Google Scholar] [CrossRef]
  13. Yang, Z. The unscented Kalman filter (UKF)-based algorithm for regional frequency analysis of extreme rainfall events in a nonstationary environment. J. Hydrol. 2021, 593, 125842. [Google Scholar] [CrossRef]
  14. Bondell, H.D.; Reich, B.J. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 2018, 64, 115–123. [Google Scholar] [CrossRef] [PubMed]
  15. Shen, X.; Huang, H.C. Grouping pursuit through a regularization solution surface. J. Am. Stat. Assoc. 2010, 105, 727–739. [Google Scholar] [CrossRef] [Green Version]
  16. Petry, S.; Flexeder, C.; Tutz, G. Pairwise Fused Lasso; Technical Reports; LMU: Munich, Germany, 2011; p. 102. [Google Scholar] [CrossRef]
  17. Shen, X.; Huang, H.C.; Pan, W. Simultaneous supervised clustering and feature selection over a graph. Biometrika 2012, 99, 899–914. [Google Scholar] [CrossRef] [Green Version]
  18. Ke, Z.T.; Fan, J.; Wu, Y. Homogeneity pursuit. J. Am. Stat. Assoc. 2015, 110, 175–194. [Google Scholar] [CrossRef]
  19. Jeon, J.J.; Kwon, S.; Choi, H. Homogeneity detection for the high-dimensional generalized linear model. Comput. Stat. Data Anal. 2017, 114, 61–74. [Google Scholar] [CrossRef]
  20. Li, F.; Sang, H. Spatial homogeneity pursuit of regression coefficients for large datasets. J. Am. Stat. Assoc. 2019, 114, 1050–1062. [Google Scholar] [CrossRef]
  21. Jung, S.H.; Ying, Z. Rank-based regression with repeated measurements data. Biometrika 2003, 90, 732–740. [Google Scholar] [CrossRef]
  22. Bassett, G., Jr.; Koenker, R. Asymptotic theory of least absolute error regression. J. Am. Stat. Assoc. 1978, 73, 618–622. [Google Scholar] [CrossRef]
  23. Pollard, D. Asymptotics for least absolute deviation regression estimators. Economet. Theory 1991, 7, 186–199. [Google Scholar] [CrossRef]
  24. Yu, K.; Lu, Z.; Stander, J. Quantile regression: Applications and current research areas. J. R. Stat. Soc. Ser. D-Stat. 2003, 52, 331–350. [Google Scholar] [CrossRef]
  25. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464. Available online: https://www.jstor.org/stable/2958889 (accessed on 24 May 2022). [CrossRef]
  26. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  27. Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2005, 67, 91–108. [Google Scholar] [CrossRef] [Green Version]
  28. Jeon, J.J.; Sung, J.H.; Chung, E.S. Abrupt change point detection of annual maximum precipitation using fused lasso. J. Hydrol. 2016, 538, 831–841. [Google Scholar] [CrossRef]
  29. Machado, J.A. Robust model selection and M-estimation. Economet. Theory. 1993, 9, 478–493. [Google Scholar] [CrossRef]
  30. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1–22. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/ (accessed on 24 May 2022). [CrossRef] [Green Version]
  31. Huber, P.J. Robust estimation of a location parameter. Ann. Stat. 1964, 35, 73–101. [Google Scholar]
  32. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122. [Google Scholar] [CrossRef]
  33. Hunter, D.R.; Lange, K. A tutorial on MM algorithms. Am. Stat. 2004, 58, 30–37. [Google Scholar] [CrossRef]
Figure 1. Boundary regions of the constraints: lasso (a) and fused lasso (b).
Figure 1. Boundary regions of the constraints: lasso (a) and fused lasso (b).
Water 14 01720 g001
Figure 2. MSE of estimated coefficients of five methods in simulation 1. The five methods are our proposed method(proposed), the naive LSE(nLSE), the naive LAD(nLAD), the rank LSE(rLSE) and the rank LAD(rLAD).
Figure 2. MSE of estimated coefficients of five methods in simulation 1. The five methods are our proposed method(proposed), the naive LSE(nLSE), the naive LAD(nLAD), the rank LSE(rLSE) and the rank LAD(rLAD).
Water 14 01720 g002
Figure 3. Estimated coefficients of the proposed method (a), rLSE (b) and rLAD (c) in simulation 1.
Figure 3. Estimated coefficients of the proposed method (a), rLSE (b) and rLAD (c) in simulation 1.
Water 14 01720 g003
Figure 4. MSE of estimated coefficients of five methods in simulation 2. The five methods are our proposed method(proposed), the naive LSE(nLSE), the naive LAD(nLAD), the rank LSE(rLSE) and the rank LAD(rLAD).
Figure 4. MSE of estimated coefficients of five methods in simulation 2. The five methods are our proposed method(proposed), the naive LSE(nLSE), the naive LAD(nLAD), the rank LSE(rLSE) and the rank LAD(rLAD).
Water 14 01720 g004
Figure 5. Estimated coefficients of the proposed method (a), rLSE (b) and rLAD (c) in simulation 2.
Figure 5. Estimated coefficients of the proposed method (a), rLSE (b) and rLAD (c) in simulation 2.
Water 14 01720 g005
Figure 6. Southern portion of the Korean Peninsula and islands and their elevation (m).
Figure 6. Southern portion of the Korean Peninsula and islands and their elevation (m).
Water 14 01720 g006
Figure 7. Path of coefficients (a) and BIC plot (b) with respect to tuning parameter λ 2 .
Figure 7. Path of coefficients (a) and BIC plot (b) with respect to tuning parameter λ 2 .
Water 14 01720 g007
Figure 8. Trend map of southern portion of the Korean Peninsula with 7 clusters (a), 12 clusters (minimum BIC) (b) and 19 clusters (c).
Figure 8. Trend map of southern portion of the Korean Peninsula with 7 clusters (a), 12 clusters (minimum BIC) (b) and 19 clusters (c).
Water 14 01720 g008
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Hong, S.; Jeon, J.-J.; Kim, Y. Clustering for Regional Time Trend in the Nonstationary Extreme Distribution. Water 2022, 14, 1720. https://doi.org/10.3390/w14111720

AMA Style

Hong S, Jeon J-J, Kim Y. Clustering for Regional Time Trend in the Nonstationary Extreme Distribution. Water. 2022; 14(11):1720. https://doi.org/10.3390/w14111720

Chicago/Turabian Style

Hong, Sungchul, Jong-June Jeon, and Yongdai Kim. 2022. "Clustering for Regional Time Trend in the Nonstationary Extreme Distribution" Water 14, no. 11: 1720. https://doi.org/10.3390/w14111720

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop