Abstract

The electrification of rural communities is crucial from both social and economic perspectives and is aligned with Sustainable Development Goal 7, “Affordable and Clean Energy.” This study presents a comprehensive comparison of clustering techniques, including k-means, Gaussian mixture models (GMM), hierarchical clustering, density-based spatial clustering of applications with noise (DBSCAN), and agglomerative clustering, aimed at enhancing solar irradiance prediction. Leveraging historical climate data from a rural community in the coastal region of Ecuador, each technique is evaluated using error metrics such as mean absolute error (MAE) and coefficient of determination (R2). This assessment identifies the most effective clustering technique in this specific context. To examine these comparisons in more depth, simulations are conducted in AMPL to validate and refine the selection of techniques. These simulations address the sizing and design of a microgrid for the Barcelona community, Ecuador, which integrates various energy sources, including solar. Additionally, a penalty system is introduced for unmet energy demands during less critical periods, thereby optimizing efficiency and enhancing energy availability within the community. In conclusion, this article introduces a scalable methodology to analyze algorithms for solar irradiance prediction, and its main contribution is the comparison of clustering techniques. This advancement in prediction accuracy has the potential to enhance the feasibility and efficiency of renewable energy systems for rural communities, thereby fostering sustainable economic growth and bolstering efforts in climate change mitigation and adaptation.

1 Introduction

Electricity is vital for sustainable development. Progress on Sustainable Development Goal (SDG) 7, which aims to ensure access to affordable, reliable, sustainable, and modern energy for all, can foster progress toward all the other SDGs, including access to clean water and the promotion of sustainable communities [1]. At the same time, greenhouse gas emissions from energy production represent a significant factor in climate change. Globally, in 2020, only 19.1% of total final energy production came from renewable sources, of which 6.6% still corresponded to the traditional use of biomass [2].

To mitigate greenhouse gas emissions, energy systems must adapt to integrate renewable energy sources. Among these, photovoltaic systems are used to harvest solar energy. Their output is variable, considering that solar radiation intensity is dependent on latitude, season, atmospheric conditions, and air quality, among others [3].

Therefore, integrating photovoltaic systems with existing energy systems presents a challenge by adding upstream uncertainty, which can result in output issues such as voltage surges, variations in frequency, harmonic distortion, among others [4]. Ahmed et al. classify the major factors affecting solar energy forecast as follows: forecasting horizons, weather classification, forecast model performance, and forecast model input [3].

Moreover, weather stations for measuring global solar radiation are not widely available. To address this, satellite data serve as a viable alternative, particularly in remote areas, due to their continuous spatial coverage. Additionally, machine-learning models are being developed to enhance the accuracy of solar radiation predictions [5]. These models analyze input and output data to identify patterns and correlations. In the context of forecasting, time-series models use historical solar irradiance data as input to predict future values [6].

Time-series aggregation is commonly used to reduce the complexity of energy system models. However, machine-learning approaches, such as clustering, are gaining prominence due to their ability to handle large volumes of input data. As Teichgraeber and Brandt suggest, these techniques are expected to be widely adopted in the future [7]. Clustering methods, as described by Pérez-Uresti et al., group days with similar characteristics, with each cluster represented by a single representative day. This approach allows a time series to be represented by a selected set of representative days [4]. However, it is important to note that clustering has limitations, such as preserving hourly variations within a representative day, which contrasts with traditional time-series models that typically capture continuity and temporal trends over time.

Unsupervised machine-learning models work without external (expert) intervention and are capable of identifying structures from the inputs alone. They include k-means and k-medoids clustering, hierarchical clustering, and Gaussian mixture models, together with cluster evaluation techniques [6]. k-means algorithms determine the centroid of each cluster as the average value of the data included in that cluster [7]. Pérez-Uresti et al. used a k-means algorithm to determine representative periods for a renewable-based utility plant. The authors concluded from the optimization results that the plant investment and total annual costs increase as larger samples of representative periods are considered [4]. Gabrielli et al. used data clustering through k-means in an optimization framework to perform and evaluate the design of a small-scale electric grid [8]. Mallapragada et al. studied capacity expansion models for renewable energy and found that models with time slice representation are likely to overestimate solar photovoltaic capacity. This was linked to the number of representative days used, because extreme points of the data series are challenging to represent [9].

Hierarchical clustering arranges clusters in a hierarchy, which is represented in a dendrogram, a tree-like structure with a root (the cluster with all observations) and leaves (clusters with individual observations). Agglomerative algorithms, which run from the leaves to the root, are included in this group. On the other hand, Gaussian mixture models represent the data as a weighted combination of multivariate Gaussian distributions [6].

As mentioned, integrating photovoltaic systems with existing energy infrastructures, referred to as hybrid microgrids, presents significant design and sizing challenges. To address these challenges, the use of optimization models is essential [10]. It is worth noting that once solar irradiance predictions are obtained through various clustering methods—which will be compared in this study—these results can be integrated into a classical optimization model. This approach enables a comprehensive evaluation from the perspectives of investment costs and the costs associated with load shedding [11].

In this study, clustering techniques were applied to a case study in the rural community of Barcelona, located in the province of Santa Elena, Ecuador. The outcomes of these clustering methods were subsequently integrated into an optimization model to estimate both investment and load shedding costs. The proposed microgrid is intended to supply power to the water distribution system serving multiple communities in the region.

Energy is required in water treatment and supply systems, which is known as the water-energy nexus [1]. Even though the community is connected to the national grid, the reliability of the service is not optimal. For 2022, in Santa Elena province, CNEL EP, the Ecuadorian energy distribution company, reported an average of 13 h of service interruption per month, with an average of six interruptions per month [12]. This means that for 13 h per month, the area lacked electrical service, and the water distribution system was affected. For the communities in the area, a continuous energy supply means an improvement in the water supply system.

On the other side of the water-energy nexus, water is used in the production of energy. In Ecuador, most of the energy comes from hydroelectric facilities. However, climate change is expected to disrupt the availability of water resources and thus affect access to electricity. Changes in precipitation patterns impact hydropower reserves [1]. In Ecuador, this resulted in the need to implement energy rationing periods at the end of 2023, as reported by CNEL EP [13].

As climate change effects deepen, the need to use reliable energy sources grows more urgent. Renewable energy has a role within the water-energy nexus, though quantitative and qualitative knowledge of its impact remains limited [1]. In this case, the proposed microgrid has the potential to ensure service, regardless of the changes in precipitation patterns and their effect on hydropower reserves.

The remainder of this article is structured as follows. Section 2 describes the algorithms and the method for classifying the renewable resource, and presents the model for the optimal design of the hybrid microgrid, which is used to evaluate the results of each algorithm in a practical application with real data. In Sec. 3, the results of each clustering algorithm and the optimal design of the hybrid microgrid are assessed for the values of solar irradiation and air temperature obtained in each cluster. Finally, the discussion and conclusions are presented in Secs. 4 and 5, respectively.

2 Materials and Methods

The methodology of the present work is summarized in Fig. 1. In this study, an onsite information survey of a rural community was carried out, covering demographic information, the vital drinking water system, the consumption patterns of the inhabitants, and the geographic location, among other relevant data. Then, meteorological data for the site were processed with different clustering algorithms to obtain representative values of each variable. The performance metrics of each algorithm were analyzed. Finally, the optimal design of the microgrid was developed with the parameters obtained from the clustering algorithms, using an existing model for sizing microgrids, adapted to the particularities of the community.

Fig. 1
Flowchart of state variables, parameters, and output variables

2.1 Barcelona Rural Community.

The Barcelona community is located in the province of Santa Elena, Ecuador, with coordinates at latitude 193'5828.0"S and longitude 8068'80.15"W. Its location is shown in Fig. 2. It lies approximately 200 km from the city of Guayaquil and has a population of approximately 3900 inhabitants with a total area of 1800 ha.

Fig. 2
Google Earth view of Barcelona Community

The community bases its economy entirely on the production of citrus fruits, such as lemons, and “paja toquilla”, a weaving traditionally produced in these areas of the country. For its electricity supply, the community is connected directly to the national transmission system through a local distribution system. However, due to its location, the quality of the power supply is low and causes constant interruptions. One of the main consequences is the disruption of the supply of drinking water. The following section explains in detail this system and the current situation of the community.

2.2 Estimating the Electricity Demand.

A survey of information and an estimation of energy demand in the rural community were carried out through a detailed characterization of the population, economic activity, and geographic conditions. In addition, the existing energy infrastructure, historical consumption, and community surveys were analyzed to understand the energy habits and needs of the residents. The potential of local renewable energy sources was also evaluated, and future demand was projected considering the demographic and technological growth of the community.

The survey revealed several significant issues related to the community’s energy supply. First, residential energy consumption is a concern, as the local distribution network provides poor service quality, marked by frequent blackouts. The second major issue, which is the focus of this work, involves the community’s water pumping system. This system includes a reservoir located at an elevation of 120 m above sea level (MASL), which is supplied by four wells situated at approximately 20 MASL. The pumping system must elevate the water by roughly 100 m from the wells to the reservoir. Figure 3 illustrates the system architecture.

Fig. 3
Architecture of the pumping system of the Barcelona community

Two of the four wells are equipped with 25 kVA transformers that power the submersible pumps, while the other two wells have 15 kVA transformers. Although the installed pumps have a peak electrical power consumption of 15 kW, the transformers are currently operating below their maximum capacity. However, with the expected increase in demand, an expansion of the pumping capacity may become necessary.

The installed peak power of the system is 160 kW, with a 220 V three-phase power supply system. To establish the consumption characteristics of the system, the storage capacity of the water tank was analyzed. The reservoir is 25 m in diameter and 6 m high, with an approximate storage capacity of 3000 m3. According to the surveys conducted in the community, the highest water consumption occurs between 1:00 a.m. and 5:00 a.m. due to the production processes of toquilla straw.

The pumping schedule of the system was established during the early morning and during the day, when the solar resource is available, thus guaranteeing the water supply during the day and its availability during the night. The pumping system operates at its maximum capacity from 9:00 a.m. to 4:00 p.m. and partially operates a well from 1:00 a.m. to 5:00 a.m. (30 kW).

For the estimation of demand growth, a growth rate of 1% per year is considered, as projected by the World Bank for territories in Ecuador [14] and considering a horizon of 15 years as the average useful life of the renewable systems. The maximum peak consumption is set at 185 kW. With all these considerations, Fig. 4 presents the estimated energy consumption curve.
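As a quick check of these figures, and assuming that the 160 kW installed peak is the base value and that the 1% rate compounds annually (both are assumptions about how the projection was made, not statements from the survey), the projected peak after 15 years works out as:

```latex
P_{15} = P_0 \,(1 + g)^{15}
       = 160\ \text{kW} \times (1.01)^{15}
       \approx 160\ \text{kW} \times 1.161
       \approx 185.8\ \text{kW}
```

Under these assumptions, the result is consistent with the 185 kW maximum peak adopted for the design.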

Fig. 4
Estimated hourly load profile for a typical day in Barcelona

2.3 Meteorological Data.

In the Barcelona community, the National Institute of Meteorology and Hydrology (INAMHI) operates a meteorological station. However, this station does not monitor certain key variables necessary for the design of microgrids, specifically irradiance, air temperature, and wind speed. These variables are critical for accurately assessing the potential of renewable energy sources within a microgrid. Therefore, to proceed with the analysis, historical data for irradiance, air temperature, and wind were obtained from the NASA meteorological database [15]. Ten years of data, from Jan. 1, 2013, to Dec. 31, 2022, were used, with measurements taken at an hourly resolution, totaling 87,600 samples. Figure 5 presents the time series of these variables for the year 2022.

Fig. 5
Daily average values of irradiance (kW-h/m2/day), temperature (°C), and wind (m/s) for each month of the year 2022

Irradiation values oscillate between 1.55 kW-h/m2/day and 6.4 kW-h/m2/day. These numbers are encouraging because they suggest that solar resources are abundant throughout the year, with some caution needed in periods of low irradiation such as June to August. Identifying such periods is precisely what clustering seeks to do, as explained in the following sections for each of the methods used. Likewise, the minimum average temperature recorded in 1 day is 21.76 °C and the maximum is 26.22 °C. The average wind conditions are also encouraging: at a height of 10 m, the minimum is 1.88 m/s and the maximum is 6.56 m/s, while at 50 m, the minimum is 2.59 m/s and the maximum is 7.77 m/s. Although wind generators are not considered in the sizing and design of the microgrid, wind variables are incorporated in the analysis to obtain a more complete picture of the environmental conditions of the region and to keep open the possibility of integrating this type of renewable energy in future work.

2.4 K-Means Clustering Algorithm.

The k-means clustering algorithm was employed to make predictions of solar energy production as a function of key meteorological variables. The algorithm is a clustering technique that seeks to divide a data set into k clusters, where each cluster is represented by a centroid [16]. It starts by choosing initial centroids and then assigns each point to the cluster whose centroid is closest. The centroids are updated iteratively by recalculating the average of the points in each cluster. The process is repeated until the centroids barely change between iterations, indicating convergence.
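The assignment and update steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the random initialization from data points, the tolerance `tol`, and the Euclidean distance are all choices made here for the example.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Plain Lloyd iterations on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k randomly chosen points
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:  # centroids barely moved: treat as converged
            break
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

With normalized (irradiance, temperature) pairs as input, `kmeans(points, 3)` would return three centroids and one cluster label per sample.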

Before applying the k-means algorithm, extensive data preprocessing is performed to ensure the quality and consistency of the datasets. This includes the detection and handling of outliers, normalization of variables to ensure comparable scales, and imputation of any missing values using appropriate techniques. The dataset is divided into training and test sets to evaluate the performance of the model: in this study, the model is trained with 9 years of data and evaluated with the remaining year, which reserves about 10% of the data for an objective and realistic evaluation of the k-means model.
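The split and normalization steps can be illustrated as follows. The text does not state which normalization was used or which year was held out, so the z-score scaling, the `(year, value)` record layout, and the function names below are assumptions made only for this sketch.

```python
from statistics import mean, pstdev


def split_by_year(records, test_year):
    """Hold out one calendar year; records are (year, value) pairs."""
    train = [v for y, v in records if y != test_year]
    test = [v for y, v in records if y == test_year]
    return train, test


def zscore(train, test):
    """Standardize both sets with the mean and std of the training set only."""
    mu, sigma = mean(train), pstdev(train)
    scale = sigma if sigma > 0 else 1.0  # guard against a constant series
    return [(v - mu) / scale for v in train], [(v - mu) / scale for v in test]
```

Fitting the scaling on the training years alone keeps information from the held-out year out of the model, which is one common way to make the evaluation realistic.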

The k-means algorithm is implemented to group the data into k clusters. The optimal number of clusters k is chosen with the elbow method in conjunction with the silhouette index. The elbow method, also known as the elbow rule, is a technique used to determine the optimal number of clusters (k) in a clustering algorithm such as k-means. The central idea is to observe how the sum of squared intracluster distances varies as a function of the number of clusters and to find the point where further reductions in this sum become markedly smaller, forming a curve that resembles an “elbow.” The procedure involves running the clustering algorithm for different values of k and recording the sum of squared intracluster distances for each case. Figure 6 plots this information, from which the point is sought at which adding another cluster no longer provides a substantial improvement. This point is identified as the “bend” in the curve [17]. Note that the optimum number of clusters can be selected as 3 or 4 depending on the bend observed. To make a final and more informed decision, the silhouette index was applied in this research.
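The elbow procedure can be sketched on a one-dimensional series as follows. The tiny k-means variant with evenly spaced initial centroids, and all function names, are assumptions introduced to keep the example short and deterministic; they are not taken from the study.

```python
def kmeans_1d(values, k, iters=50):
    """Tiny 1-D k-means with deterministic, evenly spaced initial centroids."""
    data = sorted(values)
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in data:
            closest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[closest].append(v)
        # Empty groups keep their previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids


def inertia(values, centroids):
    """Sum of squared distances from each value to its nearest centroid."""
    return sum(min((v - c) ** 2 for c in centroids) for v in values)


def elbow_curve(values, k_values):
    """Inertia for each candidate k; the elbow is where the drop flattens."""
    return {k: inertia(values, kmeans_1d(values, k)) for k in k_values}
```

Plotting the returned dictionary against k gives a curve of the kind described above: the inertia falls steeply while clusters are being separated and flattens once extra clusters only split groups that are already compact.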

Fig. 6
k-Means inertia versus number of clusters for the meteorological dataset (elbow technique), used to obtain the optimal number of clusters “k” in k-means

The silhouette index is a metric used to evaluate the quality of a clustering in terms of how well defined the clusters are. This metric provides a score for each data point based on its cohesion with its own cluster and its separation from the nearest neighboring cluster. The silhouette index varies between −1 and 1, where a value close to 1 indicates that the point is well positioned in its cluster and is separated from neighboring clusters, and a value close to −1 indicates that the point may have been assigned to the wrong cluster. The calculation of the silhouette index for a specific point involves comparing the average distance to points in the same cluster (a) with the average distance to points in the nearest neighboring cluster (b). The general formulation is shown in Eq. (1):
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{1} $$
where a(i) is the average distance from i to the rest of the points in the same cluster and b(i) is the smallest average distance from i to points in a different neighboring cluster.
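A direct, unoptimized computation of this index is sketched below in plain Python. The function name, the Euclidean distance, and the convention of scoring singleton clusters as 0 are assumptions of this example, and the sketch expects at least two clusters in `labels`.

```python
import math


def silhouette(points, labels):
    """Per-point silhouette values s(i) = (b - a) / max(a, b)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point itself contributes a distance of zero to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores
```

Averaging the returned values over all points gives a single score per candidate k, which is the kind of quantity that can be compared across different numbers of clusters.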

Figure 7 shows the silhouette index for the same dataset used with the elbow method. Note that k = 3 and k = 4 have similar scores, while k = 2 has the lowest value. A high silhouette index value indicates that the clustering is appropriate, with well-defined clusters and a clear separation between them. In contrast, a low value suggests that the clusters may be overlapping or that some points may have been incorrectly assigned [18]. Based on this, and considering the two methods, 3 is chosen as the optimal number of clusters “k” in k-means. Each cluster is associated with a specific meteorological profile that will influence solar energy production. The clusters obtained are analyzed to identify meteorological patterns and their correlation with energy production.

Figure 8 presents the clusters obtained in k-means for each variable. Three distinct clusters are revealed based on the features of solar irradiance (Wh/m2) and air temperature (°C). Cluster 0, the curve with the highest peak in the irradiance graph and the lowest peak in the air temperature graph, is associated with lower solar irradiance, primarily below 200 Wh/m2, and moderate air temperatures ranging from approximately 20 °C to 25 °C. This cluster likely corresponds to conditions such as early morning or late afternoon, or cloudy days when the sun’s intensity is lower. Cluster 1 encompasses a broader range of solar irradiance values, mostly between 200 and 600 Wh/m2, and air temperatures from 22 °C to 32 °C. This cluster may represent midday conditions with varying solar intensity, possibly due to partial cloud cover or seasonal variations. Finally, cluster 2, the curve that lies to the right in the irradiance and temperature graphs, is characterized by higher solar irradiance values, typically between 500 and 1000 Wh/m2, and higher air temperatures ranging from 25 °C to 35 °C. This cluster likely corresponds to clear, sunny conditions, typical of midday in warmer seasons, when solar energy production would be optimal.

The same methodology is applied to obtain the clusters for all the algorithms considered in this study. Section 2.9, on metrics, explains how the performance of each algorithm is calculated and describes the methodology proposed for selecting and evaluating the clustering algorithms.

Fig. 7
k-Means silhouette score versus number of clusters for the meteorological dataset, used to obtain the optimal number of clusters “k” in k-means

Fig. 8
Clusters obtained in k-means algorithm and Gaussian distributions for the variables: solar irradiance (Wh/m2) and air temperature (°C)

2.5 Gaussian Mixture Models.

In this section, the Gaussian mixture model (GMM) algorithm is implemented to model the distribution of data as a function of Gaussian components. A GMM is a probabilistic model used to represent subpopulations within an overall population without requiring that the observed dataset explicitly identify them. A GMM assumes that all the data points are generated from a mixture of several Gaussian distributions with unknown parameters. Each Gaussian distribution, or component, within the mixture represents a different subpopulation, and the overall model is the weighted sum of these Gaussian components. Formally, a GMM can be expressed as follows [19]:
$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{2} $$
where $K$ is the number of Gaussian components; $\pi_k$ represents the mixing coefficient for the kth component, with the constraints $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0$; and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the kth Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$. In the context of this study, the GMM makes it possible to model the underlying structure and probabilistic relationships in meteorological data, capturing the intrinsic variability in solar energy production associated with different atmospheric conditions. Each component represents a latent cluster with its own statistical characteristics. The optimal number of components for this algorithm was determined using the Bayesian information criterion (BIC). BIC is a metric used in statistics and machine learning for selecting a model among several candidates. In the context of GMM and other clustering algorithms, the BIC helps determine the optimal number of components or clusters.
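To make the weighted-sum structure concrete, the following sketch evaluates a univariate mixture density and the posterior probability that a sample belongs to each component. The one-dimensional simplification, the function names, and any parameter values passed to them are assumptions for illustration and are not fitted values from the study.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gmm_pdf(x, weights, means, sigmas):
    """Mixture density: sum over k of pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))


def responsibilities(x, weights, means, sigmas):
    """Posterior probability that x was generated by each component."""
    parts = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
    total = sum(parts)
    return [p / total for p in parts]
```

For example, with hypothetical components centered at 150 and 700 Wh/m2, `responsibilities(150.0, [0.6, 0.4], [150.0, 700.0], [60.0, 120.0])` assigns almost all of the probability to the first component, which is how a fitted GMM yields soft cluster memberships.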

This metric balances model accuracy and model complexity, penalizing models that are too complex. The BIC is calculated considering both the likelihood of the model and the number of parameters used, introducing a penalty term proportional to the number of parameters. In the optimization process, the value of k (number of clusters) that minimizes the BIC is sought, which helps to avoid overfitting and to select a model that generalizes well to new data. Choosing the number of clusters based on the BIC helps to balance the model’s goodness of fit against its simplicity, thus improving the interpretability and efficiency of the clustering algorithm [20].
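For reference, the standard definition of the criterion can be written as below, where $n$ is the number of samples, $\hat{L}$ is the maximized likelihood, and $p$ is the number of free parameters. The second expression counts the parameters of a K-component GMM in $d$ dimensions with full covariance matrices; the covariance structure used in the study is not stated, so that count is an assumption.

```latex
\mathrm{BIC} = p \ln n - 2 \ln \hat{L},
\qquad
p = (K - 1) + K d + K \,\frac{d(d+1)}{2}
```

The first term grows with the number of components while the second rewards a better fit, so the minimum of the BIC marks the point where adding components stops paying for itself.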

2.6 Hierarchical Clustering.

Another algorithm incorporated was the hierarchical clustering approach as an integral part of the methodology for energy production prediction. Hierarchical clustering is an unsupervised learning method that organizes data into a tree structure or dendrogram, reflecting the similarity relationships between data points. This approach can be agglomerative, where each point starts as an independent cluster and progressively merges into larger clusters, or divisive, where all data start as a single cluster and are divided into smaller subclusters. Hierarchical clustering provides a structured and understandable representation of the similarity between groups of data, allowing the identification of patterns at different levels of granularity [21].

The optimal number of clusters is chosen using the dendrogram method. Figure 9 shows this procedure. The dendrogram is an essential visual resource for understanding the hierarchical structure of clusters in a dataset. It graphically represents the successive mergers or splits of clusters along the hierarchical process [21]. On the vertical axis, the branches of the dendrogram indicate the distance or dissimilarity between clusters, while the horizontal axis represents the individual data points or clusters. By observing the height at which the branches meet or are cut, the similarity between groups and the cluster formation at different levels of granularity can be determined. Interpretation of the dendrogram provides insights into the hierarchical structure and organization of the data, which facilitates the choice of the optimal number of clusters.
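The merge heights that a dendrogram displays can be produced with a naive bottom-up procedure such as the one sketched below. Average linkage, the largest-gap rule for choosing where to cut, and the function names are assumptions of this example; the text does not state which linkage or cut rule was used, and the cut helper expects at least three points.

```python
import math


def agglomerate(points):
    """Bottom-up merging with average linkage.

    Returns one (height, sorted member indices) pair per merge.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Find and merge the closest pair of clusters.
        height, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        merges.append((height, sorted(merged)))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


def clusters_at_largest_gap(merges, n_points):
    """Cut the tree where consecutive merge heights differ the most."""
    heights = [h for h, _ in merges]
    gaps = [heights[g + 1] - heights[g] for g in range(len(heights) - 1)]
    g = max(range(len(gaps)), key=gaps.__getitem__)
    return n_points - (g + 1)
```

The largest-gap rule is only one heuristic for reading a dendrogram; visual inspection, as described above, may lead to a different cut.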

Fig. 9
Hierarchical dendrogram method graph; graphical representation of classification based on hierarchical clustering

2.7 DBSCAN: Density-Based Spatial Clustering of Applications With Noise.

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm used in the field of unsupervised learning and data mining. Its main strength lies in its ability to identify clusters of arbitrary shapes in datasets, without relying on an assumption about the shape or number of clusters. The algorithm is based on the concept of density, where a cluster is defined as a region with a high density of points, separated from other clusters by sparser regions. DBSCAN classifies data points into three main categories: core points, border points, and noise points [22]. The algorithm proceeds by exploring the feature space and connecting core points with their neighboring points, thus forming the clusters shown in Fig. 10. It is observed that DBSCAN fails to adequately find the clusters. This observation is addressed further in the results section and its respective comparison.
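The three point categories can be illustrated with the sketch below. The parameter names `eps` (neighborhood radius) and `min_samples` (density threshold) follow common DBSCAN usage, but the function itself and any values passed to it are assumptions of this example; the settings used in the study are not stated.

```python
import math


def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border' or 'noise' as DBSCAN would."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual formulation.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("border")  # not dense itself, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds
```

Because the outcome depends on a single global density threshold, data whose density varies smoothly, rather than forming separated dense regions, can end up as one large cluster plus noise.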

Fig. 10
Scatter plot of solar irradiation and air temperature clusters found with DBSCAN algorithm

2.8 Agglomerative Clustering.

The agglomerative algorithm is used to iteratively merge the data points into hierarchical clusters, forming a tree structure that represents similarity relationships. The optimal number of clusters is chosen by methods such as the dendrogram observation carried out earlier in this work [23]. Figure 11 gives an overview of how the groups labeled by the algorithm were formed. Visually, the grouping appears adequate and acceptable for the choice of representatives.

Fig. 11
Cluster plot of solar irradiance and air temperature in agglomerative clustering

However, this grouping is examined further in Sec. 3, where it is compared with the centroids found by the other algorithms using the proposed metrics.

2.9 Metrics: Mean Absolute Error and Coefficient of Determination (R2).

The methodology proposed for the evaluation of the five implemented algorithms is the following: With the clusters found by each algorithm, a curve is created, as shown in Fig. 13, for each variable in each classifier. This curve is compared and evaluated day by day with the real dataset of each variable, which is projected in Fig. 14. From these two—estimated and actual curves—the mean absolute error (MAE) and coefficient of determination (R2) metrics are obtained.

MAE and R2 are metrics commonly used to assess the quality of regression models and to quantify the accuracy of predictions in predictive analytics; here, they are used to assess the machine-learning models.

Mean Absolute Error: It measures the average magnitude of the errors between predictions and true values. For each data point, the absolute difference between the prediction and the true value is calculated. MAE is the average of these differences. It is expressed as a nonnegative quantity, where a lower MAE indicates better model accuracy. Equation (3) presents its calculation.
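$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{3} $$
where $n$ is the number of observations, $y_i$ is the true value, and $\hat{y}_i$ is the model prediction for the ith observation [24].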
Coefficient of Determination (R2): The R2 evaluates the proportion of the variability in the dependent variable that is explained by the model. This coefficient usually ranges from 0 to 1, although it can become negative when the predictions fit the data worse than the mean of the actual values. R2 is calculated by comparing the total variability of the dependent variable with the variability that the model can explain. It is calculated using Eq. (4).
$$ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \tag{4} $$

In this equation, $y_i$ are the actual values, $\hat{y}_i$ are the predictions of the model, and $\bar{y}$ is the mean of the actual values. R2 close to 1 indicates a good model fit, while lower values suggest that the model does not explain the variability of the data well [25,26].
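Both metrics are straightforward to compute from two aligned series of actual and estimated values. The sketch below uses plain Python and hypothetical function names; it assumes that the series have equal length and that the actual values are not all identical (otherwise the denominator of R2 is zero).

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between observations and predictions."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

For instance, with actual values `[0, 200, 600, 200]` and estimates `[0, 250, 550, 200]`, the MAE is 25 and R2 is about 0.974, while predicting the mean everywhere would give an R2 of exactly 0.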

Once the clusters have been found and the five algorithms have been evaluated, the microgrid design is carried out. This serves to evaluate the algorithms in practice and to observe firsthand how the choice of algorithm affects the total costs of microgrid sizing.

2.10 Optimal Microgrid Design.

Microgrids are defined by the integration of various distributed generation resources, energy storage systems, and loads in a small system capable of operating connected to a main grid and, in cases of emergency or scheduled events, capable of operating in an isolated manner, and controlling frequency and voltage.

The sizing and design of microgrids can be modeled as an optimization problem, which defines the technology, size, allocation, and operation of:

  1. Distributed renewable generation (photovoltaic, wind).

  2. Thermal generation (diesel, gas, combined cycle).

  3. Energy storage system (batteries, lithium-ion, flywheel).

The mathematical model used to size and design the microgrid is deliberately basic: it focuses only on analyzing the clustering results from an economic point of view.

As a hypothesis, it was assumed that the microgrid operates with solar generation at its maximum during the hours of highest solar incidence, which means that the inclusion of energy storage systems, such as batteries, is disregarded. The results of the clustering techniques enter the mathematical model as known parameters; hence, their accuracy directly affects the proposed microgrid sizing and design.

The results for the microgrid were obtained by applying the mathematical model for sizing and designing used for other rural communities [11]. The model was implemented in the AMPL optimization program [27] and solved by the CPLEX solver [28].

2.11 The Proposed Optimization Problem.

The proposed microgrid to supply power to the community consists of a genset, a battery energy storage system (BESS), solar PV panels, and a power inverter. This section presents the mathematical formulation of the optimization problem that minimizes the total annual investment cost of the microgrid, including the O&M cost of a genset, in the Barcelona community. Its solution gives the optimal sizing of the microgrid. Linear optimization was used to solve the problem, since the decision variables are continuous and the objective function is linear in the decision variables [29]. The problem was modeled in AMPL and solved with the CPLEX solver, which is based on primal-dual simplex algorithms [30]. AMPL is an algebraic modeling language for linear and nonlinear problems with continuous and discrete variables [31].

For minimizing the annual total cost of investment of the proposed microgrid, the objective function is defined based on the linear cost functions of each component, as follows:
(5)
where F is composed of four terms, as follows:
$$F = C^{IGR}\bar{P}^{GR} + C^{IT}\bar{P}^{T} + C^{IPA}\bar{P}^{AE} + C^{IEA}\bar{E}^{AE} \tag{6}$$

The first term of Eq. (5) describes the cost of hourly genset operation over 365 days, where $\delta$ is a unit conversion factor equal to 1 h, $C^{OT}$ is the rate for the sale of electricity from the genset to the microgrid, and $P_t^{T}$ is the power delivered to the microgrid by the generator in each hour. The second term of Eq. (5) is the penalty associated with the probability of failure of the microgrid during any hour of the day, where $C^{CC}$ is the cost of a power outage (power not supplied), $P_t^{D}$ is the power demand in each hour, and $X_t$ is the percentage of load shedding in each hour.

Finally, F contains the microgrid sizing variables and is composed of four terms according to Eq. (6). The first term, $C^{IGR}\bar{P}^{GR}$, refers to the cost of using electricity generation from the PV system, where $C^{IGR}$ is the cost of investing in PV panels and $\bar{P}^{GR}$ is the total power output of the PV panels (renewable generation). The second term represents the cost of using power from a diesel generator, where $C^{IT}$ is the investment cost of a new generator whose maximum power output is equal to or less than that of the existing genset in the community, and $\bar{P}^{T}$ is the maximum capacity of the genset. The third term refers to the storage system based on batteries, where $C^{IPA}$ is the cost of the charging or discharging power between the BESS and the microgrid and $\bar{P}^{AE}$ is the maximum charging or discharging power of the BESS. The fourth term is composed of $C^{IEA}$, the cost related to the storage capacity of the BESS, and $\bar{E}^{AE}$, the maximum amount of electricity that the BESS can store.

The proposed constraints for the objective function are as follows:

  • Active Power Balance: Equation (7) is the active power balance constraint, i.e., the power produced by the microgrid equals the power consumed by the community, including the charging and discharging of the BESS.
    (7)
    where $P_t^{AEiny}$ is the power injected into the microgrid from the BESS in each hour of the respective scenario and $P_t^{AEext}$ is the charging power from the PV power system to the BESS in the respective scenario.
  • Genset Capacity: Equation (8) requires $P_t^{T}$ to be nonnegative and less than or equal to the maximum capacity of the genset ($\bar{P}^{T}$).
    (8)
  • Active Power Injection Capacity: Equation (9) gives the range of active power that the BESS can deliver to the microgrid, where the maximum value is given by $\bar{P}^{AE}$.
    (9)
  • Active Power Extraction Capacity: Equation (10) limits the active power that the BESS can extract from the microgrid (charging) during each hour of the day, where the maximum value is again given by $\bar{P}^{AE}$.
    (10)
  • Hourly Energy Balance for t>1 in Each Operating Scenario: Equation (11) represents the energy balance: the energy stored in the BESS in each hour t>1 equals the energy stored in the previous hour plus the energy transferred during that hour.
    (11)
    where $\eta$ is the BESS efficiency and $\beta$ is the BESS self-discharge rate.
  • Initial Energy Balance in t=1 in Each Operating Scenario: Equation (12) represents the energy balance at t=1: the energy stored in the BESS equals the initial energy in the BESS plus the energy transferred during that hour.
    (12)
    where $E_0^{AE}$ is the initial energy in the BESS.
  • BESS Maximum Storage Capacity: Equation (13) limits the energy stored in each hour to the range between 0 and $\bar{E}^{AE}$.
    (13)
  • Load-Shedding Percentage: Equation (14) defines the load-shedding percentage used in the optimization model.
    (14)
  • Renewable Source Sizing: Equation (15) establishes the bounds for the sizing of the renewable power source (PV power system).
    (15)
    where $I\bar{P}^{GR}$ is the upper bound on the renewable power generation.
  • Genset Sizing: Equation (16) restricts the power generation of the genset.
    (16)
    where $I\bar{P}^{T}$ is the maximum generation capacity of the community genset.
  • Power Transfer Capacity: Equation (17) indicates that the power exchange capacity (charging and discharging) of the BESS must be greater than or equal to zero.
    (17)
  • BESS Sizing: Equation (18) sets the minimum quantity of electricity that can be stored in the BESS.
    (18)
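The hourly energy balance and the storage capacity limit described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the formulation used in the study: the recursion below is a commonly used storage balance in which the self-discharge rate applies to the previously stored energy and the efficiency applies to both charging and discharging. The function names are hypothetical; the default efficiency (0.95), self-discharge rate (0.02), and initial energy (0 kWh) are the values listed in Table 1.

```python
from typing import List, Sequence


def bess_energy_profile(
    charge_kw: Sequence[float],
    discharge_kw: Sequence[float],
    e0_kwh: float = 0.0,
    efficiency: float = 0.95,
    self_discharge: float = 0.02,
    delta_h: float = 1.0,
) -> List[float]:
    """Hourly stored energy under an ASSUMED balance:
    E_t = (1 - beta) * E_{t-1} + delta * (eta * P_ext_t - P_iny_t / eta).
    """
    if len(charge_kw) != len(discharge_kw):
        raise ValueError("charge and discharge series must have equal length")
    energy = []
    previous = e0_kwh  # for t = 1 the previous state is the initial energy
    for p_ext, p_iny in zip(charge_kw, discharge_kw):
        current = (1.0 - self_discharge) * previous + delta_h * (
            efficiency * p_ext - p_iny / efficiency
        )
        energy.append(current)
        previous = current
    return energy


def within_capacity(energy_kwh: Sequence[float], e_max_kwh: float) -> bool:
    """Check that the stored energy stays between 0 and the storage capacity."""
    return all(0.0 <= e <= e_max_kwh for e in energy_kwh)


# Hypothetical three-hour schedule: charge at 10 kW twice, then discharge at 5 kW.
profile = bess_energy_profile([10.0, 10.0, 0.0], [0.0, 0.0, 5.0])
print([round(e, 2) for e in profile])
print(within_capacity(profile, e_max_kwh=20.0))
```

In the optimization model these relations are constraints on decision variables rather than a forward simulation; the sketch only shows how a given charging schedule would be checked against them.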

Figure 12 categorizes the variables involved in the optimization problem. The sizing variables of the mathematical model are the decision variables: the sizes of the renewable generation, the genset, and the BESS are unknown, and the model seeks an optimal value for each of them.

Fig. 12
Flowchart of state variables, parameters, and output variables

2.12 Input Data to the Model.

Table 1 summarizes the values of the component capacities and other technical–economic values used in the optimization problem. The costs of the components were based on the Ecuadorian market price at the time of the present study.

Table 1

Component capacities and other technical-economic values used in the optimization model

| Description | Value |
|---|---|
| Solar source installation limit | 60.00 kW |
| Cost of installing solar sources | 464.60 $/kW |
| Cost per load shedding | 0.65 $/kW |
| Cost of investing in thermal generation | 997.28 $/kW |
| Power purchase tariff to the generator set | 0.16 $/kWh |
| Diesel generator capacity limit | 185.00 kW |
| Cost of investing battery transfer power | 197.92 $/kW |
| Cost of investing maximum battery storage capacity | 197.92 $/kWh |
| Initial battery energy | 0.00 kWh |
| Battery efficiency | 0.95 |
| Battery self-discharge rate | 0.02 |
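To make the role of these parameters concrete, the sketch below evaluates the four-term investment cost F described in Sec. 2.11 using the unit costs and installation limits of Table 1. The class name, function name, and the component sizes in the example are hypothetical and chosen only for illustration; they are not results of the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCosts:
    """Unit investment costs and sizing limits taken from Table 1."""
    pv_per_kw: float = 464.60            # cost of installing solar sources ($/kW)
    genset_per_kw: float = 997.28        # cost of investing in thermal generation ($/kW)
    bess_power_per_kw: float = 197.92    # cost of battery transfer power ($/kW)
    bess_energy_per_kwh: float = 197.92  # cost of battery storage capacity ($/kWh)
    pv_limit_kw: float = 60.00           # solar source installation limit
    genset_limit_kw: float = 185.00      # diesel generator capacity limit


def investment_cost(p_gr_kw: float, p_t_kw: float, p_ae_kw: float,
                    e_ae_kwh: float, costs: UnitCosts = UnitCosts()) -> float:
    """Four-term investment cost F for a given set of component sizes."""
    if min(p_gr_kw, p_t_kw, p_ae_kw, e_ae_kwh) < 0:
        raise ValueError("component sizes must be nonnegative")
    if p_gr_kw > costs.pv_limit_kw or p_t_kw > costs.genset_limit_kw:
        raise ValueError("size exceeds the installation limit in Table 1")
    return (costs.pv_per_kw * p_gr_kw
            + costs.genset_per_kw * p_t_kw
            + costs.bess_power_per_kw * p_ae_kw
            + costs.bess_energy_per_kwh * e_ae_kwh)


# Hypothetical sizes: 60 kW PV, 100 kW genset, 20 kW / 40 kWh BESS.
example_cost = investment_cost(60.0, 100.0, 20.0, 40.0)
print(f"F = {example_cost:,.2f} USD")
```

In the optimization problem the sizes are decision variables chosen by the solver; here they are fixed inputs so that the cost structure can be inspected directly.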

3 Results

Table 2 shows the clusters found by each algorithm: k-means, GMM, and agglomerative clustering group the data into three clusters, hierarchical clustering does so with four clusters, and DBSCAN finds a single representative for the whole dataset. The k-means representatives allow an estimated curve (Fig. 13) to be constructed using each representative's value in its respective time window: 527.23 Wh/m2 in the hours of highest solar irradiation, from 9 a.m. to 4 p.m.; 241.04 Wh/m2 for the afternoon, from 4 p.m. to 6 p.m.; and 119.03 Wh/m2 in the morning, from 6 a.m. to 9 a.m. This result shows how k-means can be used to construct an envelope that is representative of the irradiance data. In the same way, three temperature values are found: 28.75 °C, 27.23 °C, and 23.37 °C, at their respective times. Regarding the R2 fit for irradiance, k-means and GMM have similar scores: 0.72 for the former and a slightly higher 0.76 for the latter. The same holds for temperature, with 0.64 and 0.61, respectively.
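The construction of the estimated curve from the k-means representatives can be sketched as a piecewise-constant function of the hour of the day. The representative values and time windows are those reported above; treating each window as a half-open interval and assigning zero irradiance outside 6 a.m. to 6 p.m. are assumptions made for this illustration, and the names used are hypothetical.

```python
from typing import List, Sequence, Tuple

# (start hour, end hour, representative irradiance in Wh/m2) from the k-means results.
# Half-open windows [start, end) are an assumption of this sketch.
KMEANS_SCHEDULE: List[Tuple[int, int, float]] = [
    (6, 9, 119.03),    # morning
    (9, 16, 527.23),   # hours of highest solar irradiation
    (16, 18, 241.04),  # afternoon
]


def estimated_irradiance(hour: int,
                         schedule: Sequence[Tuple[int, int, float]] = KMEANS_SCHEDULE) -> float:
    """Return the cluster representative assigned to the given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    for start, end, value in schedule:
        if start <= hour < end:
            return value
    return 0.0  # assumed: no irradiance outside the clustered daytime windows


# One value per hour of the day, forming the estimated daily curve.
daily_curve = [estimated_irradiance(h) for h in range(24)]
print(daily_curve)
```

Swapping in the representatives of another algorithm from Table 2 yields the corresponding estimated curve, which can then be compared with the measured data using MAE and R2.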

Fig. 13
Irradiance curve estimated using k-means clusters; cluster 1 (highest value at 12 o'clock), cluster 2 (mean value at 16 o'clock), and cluster 3 (lowest value at 7 o'clock)
Table 2

Clustering and metric R2 results for each of the algorithms: k-means, GMM, DBSCAN, hierarchical clustering, and agglomerative clustering

| Cluster / metric | k-Means: Irradiation (Wh/m2) | k-Means: Temperature (°C) | GMM: Irradiation (Wh/m2) | GMM: Temperature (°C) | Hierarchical clustering: Irradiation (Wh/m2) | Hierarchical clustering: Temperature (°C) | DBSCAN: Irradiation (Wh/m2) | DBSCAN: Temperature (°C) | Agglomerative clustering: Irradiation (Wh/m2) | Agglomerative clustering: Temperature (°C) |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 527.23 | 28.75 | 498.90 | 28.36 | 320.42 | 25.49 | 310.04 | 26.51 | 465.77 | 28.59 |
| C2 | 241.04 | 27.23 | 182.08 | 26.98 | 229.35 | 24.58 | - | - | 160.39 | 23.49 |
| C3 | 119.03 | 23.37 | 98.11 | 22.82 | 229.05 | 28.80 | - | - | 114.91 | 25.75 |
| C4 | - | - | - | - | 198.35 | 26.77 | - | - | - | - |
| R2 | 0.72 | 0.64 | 0.76 | 0.61 | 0.28 | 0.66 | 0.92 | 0.56 | 0.84 | 0.55 |

Hierarchical clustering finds that the optimal number of clusters is four. Therefore, four representatives compose the daily irradiance and temperature data: 320.42 Wh/m2, 229.35 Wh/m2, 229.05 Wh/m2, and 198.35 Wh/m2; and 25.49 °C, 24.58 °C, 28.80 °C, and 26.77 °C, respectively. For the analysis, only three clusters were considered, as in the k-means and GMM cases, since 229.35 Wh/m2 and 229.05 Wh/m2 are practically the same in magnitude and using only one of them causes no major variation when constructing the curve. On the other hand, its R2 fit for irradiance is the lowest of all the algorithms evaluated, at 0.28, while Table 2 reports 0.66 for temperature. In terms of MAE, it is also the least accurate algorithm. Figure 14 shows that it has low accuracy during all hours of the day, which suggests that a higher number of representatives does not necessarily imply better data prediction.

Fig. 14
Mean absolute error metric for clustering algorithms (for irradiation)

DBSCAN shows the opposite behavior to hierarchical clustering, GMM, and k-means. It establishes a single representative irradiance value for the whole dataset, 310.04 Wh/m2, and obtains an R2 fit of 0.92, the highest of the whole set; for temperature, it obtains 0.56. According to the MAE, however, DBSCAN has low prediction accuracy during the hours of highest irradiance, around 12 p.m.

Finally, agglomerative clustering finds representatives similar to those of k-means and GMM: 465.77 Wh/m2 in the hours of highest solar irradiance, from 9 a.m. to 4 p.m.; 160.39 Wh/m2 for the afternoon, from 4 p.m. to 6 p.m.; and 114.91 Wh/m2 in the morning, from 6 a.m. to 9 a.m. Its R2 fit is 0.84 for irradiance, second only to DBSCAN, and 0.55 for temperature, close to that of the two algorithms mentioned. Figure 14 shows that the MAE of this algorithm is also close to that of k-means and GMM, except around 1 p.m.

From the point of view of the mathematical model, the comparison of clustering techniques for designing and sizing the microgrid (Table 3) shows that k-means and GMM produced the most favorable results for the reference investment, considering the microgrid's characteristics but excluding the monetary cost of load shedding. The MAE is a measure of the imprecision in the prediction of the energy resource, and this imprecision translates into energy service problems that lead to load cuts. Although the investment in the microgrid does not increase significantly, the penalty in dollars due to load cuts changes drastically, with DBSCAN generating the greatest losses.

Table 3

Investment cost of renewable components of the microgrid and the highest MAE value obtained by each algorithm

| Clustering technique | Highest MAE | Microgrid investment (USD) | Load shedding (USD) |
|---|---|---|---|
| k-Means | 196 | 212,040.00 | 0 |
| GMM | 214 | 211,127.00 | 0 |
| Hierarchical | 283 | 213,101.00 | 2,259,290.00 |
| DBSCAN | 266 | 222,175.00 | 6,870,940.00 |
| Agglomerative | 282 | 220,661.00 | 584,757.00 |
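The comparison in Table 3 can be summarized by ranking the techniques by the sum of microgrid investment and load-shedding penalty. Adding the two columns into a single figure is an assumption made here for illustration; the study reports them separately. The figures are copied from Table 3, and the names used in the code are hypothetical.

```python
from typing import List, Sequence, Tuple

# (technique, microgrid investment in USD, load-shedding penalty in USD), from Table 3.
TABLE_3: List[Tuple[str, float, float]] = [
    ("k-Means", 212_040.00, 0.00),
    ("GMM", 211_127.00, 0.00),
    ("Hierarchical", 213_101.00, 2_259_290.00),
    ("DBSCAN", 222_175.00, 6_870_940.00),
    ("Agglomerative", 220_661.00, 584_757.00),
]


def rank_by_combined_cost(rows: Sequence[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """Sort techniques by investment plus load-shedding penalty, lowest first."""
    totals = [(name, investment + penalty) for name, investment, penalty in rows]
    return sorted(totals, key=lambda item: item[1])


ranking = rank_by_combined_cost(TABLE_3)
for name, total in ranking:
    print(f"{name}: {total:,.2f} USD")
```

Under this combined measure, the spread between techniques is dominated by the load-shedding penalty rather than by the investment, which varies by only about 5% across the five algorithms.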

4 Discussion

In the realm of renewable energy, particularly within microgrid development, accurate data analysis is paramount for ensuring the efficiency and reliability of energy systems. Climate and radiation data serve as essential inputs for sizing microgrids, directly influencing the potential generation of renewable energy from sources like solar and wind. Leveraging clustering techniques in data manipulation not only enhances data processing accuracy but also facilitates informed decision-making in microgrid design and management.

Clustering techniques provide a systematic approach to organizing vast and complex datasets, unveiling underlying patterns and structures. In the context of climate and radiation data, these techniques enable the identification of distinct meteorological and environmental conditions affecting renewable energy generation. By categorizing data points into clusters based on similarities, researchers and practitioners gain valuable insights into seasonal variations, weather patterns, and irradiance levels crucial for optimal microgrid sizing.

The integration of clustering techniques in data manipulation significantly influences the efficacy of microgrid sizing methodologies. Organizing climate and radiation data into meaningful clusters enables the development of robust predictive models for renewable energy generation. These models consider environmental factor variability, enabling accurate energy output estimation under different scenarios. As a result, microgrid designers can confidently dimension systems, ensuring optimal resource utilization and improved resilience to environmental fluctuations.

Beyond quantity, the quality of energy generated within microgrids holds utmost importance. Clustering techniques aid in assessing energy quality parameters such as voltage stability, frequency regulation, and harmonic distortion. By integrating environmental conditions and energy output data with clustering analysis, researchers identify correlations between climatic factors and energy quality metrics. This holistic approach optimizes microgrid configurations to maintain stable and reliable power supply, meeting stringent requirements of end-users and grid operators.

5 Conclusion

This article underscores the fundamental role of clustering techniques in bridging the gap between data analytics and microgrid design, paving the way toward sustainable and reliable energy solutions in an era of increasing environmental challenges. k-Means and GMM demonstrated the highest precision, leading to the most accurate microgrid designs in terms of both microgrid investment and load shedding. They are the most suitable techniques for emulating and solving microgrid optimization problems because they account for the behavior of the microgrid, allowing it to operate both on grid and off grid. Furthermore, we highlighted the critical role of accurate clustering in microgrid design. For future work, it is imperative to explore advancements in clustering techniques to enhance accuracy in microgrid design. Additionally, investigating the viability of harnessing wind energy for hydraulic water pumping in remote communities represents a promising avenue, offering sustainable and accessible solutions for water supply. Simultaneously, addressing the social impact arising from water scarcity in community production processes is crucial, necessitating the exploration of strategies to mitigate these effects and ensure sustainable development in these regions.

Footnote

2

Empresa Eléctrica Pública Estratégica Corporación Nacional de Electricidad CNEL EP.

Acknowledgment

The authors would like to thank CERA, Sostenibilidad, SENESCYT, and PRESTIGE Research Group from ESPOL for supporting this research project.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

Nomenclature

AMPL = a mathematical programming language

BIC = Bayesian information criteria

DBSCAN = density-based spatial clustering of applications with noise

GMM = Gaussian mixture model

MAE = mean absolute error

MASL = meters above sea level

INAMHI = National Institute of Meteorology and Hydrology

NASA = National Aeronautics and Space Administration

SDG = sustainable development goals

UNESCO = United Nations Educational, Scientific and Cultural Organization

References

1. Nations, U., 2021, “SDG7 TAG Policy Briefs: Leveraging Energy Action for Advancing the Sustainable Development Goals,” Progress Report 1, United Nations, San Francisco, CA.
2. Nations, U., 2023, “Sustainable Development Report 2023 Implementing the SDG Stimulus Includes the SDG Index and Dashboards,” Progress Report 1, United Nations, San Francisco, CA.
3. Ahmed, R., Sreeram, V., Mishra, Y., and Arif, M., 2020, “A Review and Evaluation of the State-of-the-Art in PV Solar Power Forecasting: Techniques and Optimization,” Renewable Sustainable Energy Rev., 124, p. 109792.
4. Perez-Uresti, S. I., Lima, R. M., Martin, M., and Jimenez-Gutierrez, A., 2023, “On the Design of Renewable-Based Utility Plants Using Time Series Clustering,” Comput. Chem. Eng., 170, p. 108124.
5. Zhou, Y., Liu, Y., Wang, D., Liu, X., and Wang, Y., 2021, “A Review on Global Solar Radiation Prediction With Machine Learning Models in a Comprehensive Perspective,” Energy Convers. Manage., 235, p. 113960.
6. Voyant, C., Notton, G., Kalogirou, S., Nivet, M.-L., Paoli, C., Motte, F., and Fouilloy, A., 2017, “Machine Learning Methods for Solar Radiation Forecasting: A Review,” Renewable Energy, 105, pp. 569–582.
7. Teichgraeber, H., and Brandt, A. R., 2022, “Time-Series Aggregation for the Optimization of Energy Systems: Goals, Challenges, Approaches, and Opportunities,” Renewable Sustainable Energy Rev., 157, p. 111984.
8. Gabrielli, P., Gazzani, M., Martelli, E., and Mazzotti, M., 2018, “Optimal Design of Multi-energy Systems With Seasonal Storage,” Appl. Energy, 219, pp. 408–424.
9. Mallapragada, D. S., Papageorgiou, D. J., Venkatesh, A., Lara, C. L., and Grossmann, I. E., 2018, “Impact of Model Resolution on Scenario Outcomes for Electricity Sector System Expansion,” Energy, 163, pp. 1231–1244.
10. Dolara, A., Grimaccia, F., Magistrati, G., and Marchegiani, G., 2017, “Optimization Models for Islanded Micro-grids: A Comparative Analysis Between Linear Programming and Mixed Integer Programming,” Energies, 10(2), p. 241.
11. Pesantes, L. A., Hidalgo-León, R., Rengifo, J., Torres, M., Aragundi, J., Cordova-Garcia, J., and Ugarte, L. F., 2024, “Optimal Design of Hybrid Microgrid in Isolated Communities of Ecuador,” J. Modern Power Syst. Clean Energy, 12(2), pp. 488–499.
12. CNEL EP, 2023, “Indicadores de Calidad de Servicio Técnico,” Progress Report 1, Empresa Eléctrica Pública Estratégica Corporación Nacional de Electricidad, Ecuador, EC.
13. EPC, 2023, “Indicadores de Calidad de Servicio Técnico,” Progress Report 1, Programación de Corte del Servicio de Energía Eléctrica, Ecuador, EC.
14. Mundial, B., 2024, “Crecimiento de la población (% anual) - Ecuador | Data,” Progress Report 1, Banco Mundial, Ecuador, EC.
15. National Aeronautics and Space Administration, 2024, “Prediction of Worldwide Energy Resources,” Progress Report 1, NASA, Ecuador, EC.
16. Sinaga, K. P., and Yang, M.-S., 2020, “Unsupervised K-Means Clustering Algorithm,” IEEE Access, 8, pp. 80716–80727.
17. Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., and Heming, J., 2023, “K-Means Clustering Algorithms: A Comprehensive Review, Variants Analysis, and Advances in the Era of Big Data,” Inf. Sci., 622, pp. 178–210.
18. Liu, F., and Deng, Y., 2020, “Determine the Number of Unknown Targets in Open World Based on Elbow Method,” IEEE Trans. Fuzzy Syst., 29(5), pp. 986–995.
19. Yang, X., Li, Y., Zhao, Y., Li, Y., Hao, G., and Wang, Y., 2024, “Gaussian Mixture Model Uncertainty Modeling for Power Systems Considering Mutual Assistance of Latent Variables,” IEEE Trans. Sustainable Energy, 0, pp. 1–4.
20. Ke, X., Zhao, Y., and Huang, L., 2021, “On Accurate Source Enumeration: A New Bayesian Information Criterion,” IEEE Trans. Signal Process., 69, pp. 1012–1027.
21. Patel, E., and Kushwaha, D. S., 2020, “Clustering Cloud Workloads: K-Means vs Gaussian Mixture Model,” Procedia Comput. Sci., 171, pp. 158–167.
22. Goudarzi, N., and Ziaei, D., 2022, “Wind Farm Clustering Methods for Power Forecasting,” ASME Power Conference, Vol. 85826, Pittsburgh, PA, July 18–19, American Society of Mechanical Engineers, p. V001T07A015.
23. Govender, P., and Sivakumar, V., 2020, “Application of K-Means and Hierarchical Clustering Techniques for Analysis of Air Pollution: A Review (1980–2019),” Atmos. Pollut. Res., 11(1), pp. 40–56.
24. Tokuda, E. K., Comin, C. H., and Costa, L. d. F., 2022, “Revisiting Agglomerative Clustering,” Physica A: Stat. Mech. Appl., 585, p. 126433.
25. Najian, M., and Goudarzi, N., 2023, “Evaluating Critical Weather Parameters Using Machine Learning Models,” ASME Power Conference, Vol. 87172, Long Beach, CA, Aug. 6–8, American Society of Mechanical Engineers, p. V001T04A006.
26. Turney, S., 2022, “Coefficient of Determination (R2)| Calculation & Interpretation,” Scribbr, https://www.scribbr.com/statistics/coefficient-of-determination/, Accessed December 13, 2022.
27. Fourer, R., Gay, D. M., and Kernighan, B. W., 1995, Ampl, Boyd & Fraser, Danvers, MA.
28. Mahdavi, M., Alhelou, H. H., Hatziargyriou, N. D., and Al-Hinai, A., 2021, “An Efficient Mathematical Model for Distribution System Reconfiguration Using Ampl,” IEEE Access, 9, pp. 79961–79993.
29. Sioshansi, R., and Conejo, A., 2019, Optimization in Engineering: Models and Algorithms, Springer, London.
30. AMPL, “CPLEX IBM ILOG CPLEX Solver,” https://documentation.aimms.com, Accessed January 20, 2023.
31. Olszak, A., and Karbowski, A., 2018, “Parampl: A Simple Tool for Parallel and Distributed Execution of AMPL Programs,” IEEE Access, 6, pp. 49282–49291.