Abstract
The electrification of rural communities is crucial from both social and economic perspectives and is aligned with Sustainable Development Goal 7, “Affordable and Clean Energy.” This study presents a comprehensive comparison of clustering techniques, including k-means, Gaussian mixture models (GMM), hierarchical clustering, density-based spatial clustering of applications with noise (DBSCAN), and agglomerative clustering, aimed at enhancing solar irradiance prediction. Leveraging historical climate data from a rural community in the coastal region of Ecuador, each technique is evaluated using error metrics such as the mean absolute error (MAE) and the coefficient of determination ($R^2$). This assessment identifies the most effective clustering technique in this specific context. To examine these comparisons further, simulations are conducted in AMPL to validate and refine the selection of techniques. These simulations address the sizing and design of a microgrid for the Barcelona community, Ecuador, which integrates various energy sources, including solar. Additionally, a penalty system is introduced for unmet energy demands during less critical periods, thereby optimizing efficiency and enhancing energy availability within the community. In conclusion, this article introduces a scalable methodology to analyze algorithms for solar irradiance prediction, emphasizing the significance of comparing clustering techniques as its main contribution. This advancement in prediction accuracy has the potential to enhance the feasibility and efficiency of renewable energy systems for rural communities, thereby fostering sustainable economic growth and bolstering efforts in climate change mitigation and adaptation.
1 Introduction
Electricity is vital for sustainable development. Progress on sustainable development goal (SDG) 7, which aims to ensure access to affordable, reliable, sustainable, and modern energy for all, can foster progress toward all the other SDGs, including access to clean water and the promotion of sustainable communities [1]. At the same time, greenhouse gas emissions from energy production are a significant factor in climate change. Globally, in 2020, only 19.1% of total final energy production came from renewable sources, of which 6.6% still corresponded to the traditional use of biomass [2].
To mitigate greenhouse gas emissions, energy systems must adapt to integrate renewable energy sources. Among these, photovoltaic systems are used to harvest solar energy. Their output is variable, considering that solar radiation intensity is dependent on latitude, season, atmospheric conditions, and air quality, among others [3].
Therefore, integrating photovoltaic systems with existing energy systems presents a challenge by adding upstream uncertainty, which can result in output issues such as voltage surges, variations in frequency, harmonic distortion, among others [4]. Ahmed et al. classify the major factors affecting solar energy forecast as follows: forecasting horizons, weather classification, forecast model performance, and forecast model input [3].
Moreover, weather stations for measuring global solar radiation are not widely available. To address this, satellite data serve as a viable alternative, particularly in remote areas, due to their continuous spatial coverage. Additionally, machine-learning models are being developed to enhance the accuracy of solar radiation predictions [5]. These models analyze input and output data to identify patterns and correlations. In the context of forecasting, time-series models use historical solar irradiance data as input to predict future values [6].
Time-series aggregation is commonly used to reduce the complexity of energy system models. However, machine-learning approaches, such as clustering, are gaining prominence due to their ability to handle large volumes of input data. As Teichgraeber and Brandt suggest, these techniques are expected to be widely adopted in the future [7]. Clustering methods, as described by Pérez-Uresti et al., group days with similar characteristics, with each cluster represented by a single representative day. This approach allows a time series to be represented by a selected set of representative days [4]. However, clustering has limitations: it preserves hourly variations only within a representative day, whereas traditional time-series models typically capture continuity and temporal trends over time.
Unsupervised machine-learning models work without external (expert) intervention and can identify structure from the inputs alone. They include k-means and k-methods clustering, hierarchical clustering, Gaussian mixture models, and cluster evaluation [6]. k-means algorithms determine the centroid of each cluster as the average value of the data included in that cluster [7]. Pérez-Uresti et al. used a k-means algorithm to determine representative periods for a renewable-based utility plant. The authors concluded from the optimization results that the plant investment and total annual costs increase as more extensive samples of representative periods are considered [4]. Gabrielli et al. used data clustering through k-means in an optimization framework to perform and evaluate the design of a small-scale electric grid [8]. Mallapragada et al. studied capacity expansion models for renewable energy and found that models with time slice representation are likely to overestimate solar photovoltaic capacity. This was linked to the number of representative days used, since extreme points of a data series are challenging to represent [9].
Hierarchical clustering arranges clusters in a hierarchy, which is represented in a dendrogram, a tree-like structure with a root (the cluster with all observations) and leaves (clusters with individual observations). Agglomerative algorithms, which run from the leaves to the root, are included in this group. Gaussian mixture models, on the other hand, describe the data as a weighted combination of a finite number of multivariate Gaussian components [6].
As mentioned, integrating photovoltaic systems with existing energy infrastructures, referred to as hybrid microgrids, presents significant design and sizing challenges. To address these challenges, the use of optimization models is essential [10]. It is worth noting that once solar irradiance predictions are obtained through various clustering methods—which will be compared in this study—these results can be integrated into a classical optimization model. This approach enables a comprehensive evaluation from the perspectives of investment costs and the costs associated with load shedding [11].
In this study, clustering techniques were applied to a case study in the rural community of Barcelona, located in the province of Santa Elena, Ecuador. The outcomes of these clustering methods were subsequently integrated into an optimization model to estimate both investment and load shedding costs. The proposed microgrid is intended to supply power to the water distribution system serving multiple communities in the region.
Energy is required in water treatment and supply systems, a relationship known as the water-energy nexus [1]. Even though the community is connected to the national grid, the reliability of the service is not optimal. For 2022, in Santa Elena province, CNEL EP, the Ecuadorian energy distribution company, reported an average of 13 h of service interruption per month, with an average frequency of six interruptions per month [12]. This means that for 13 h per month, the area lacked electrical service, and the water distribution system was affected. For the communities in the area, a continuous energy supply means an improvement in the water supply system.
On the other side of the water-energy nexus, water is used in the production of energy. In Ecuador, most of the energy comes from hydroelectric facilities. However, climate change is expected to disrupt the availability of water resources and thus affect access to electricity. Changes in precipitation patterns impact hydropower reserves [1]. In Ecuador, this resulted in the need to implement energy rationing periods at the end of 2023, as reported by CNEL EP [13].
As the effects of climate change deepen, the need for reliable energy sources grows more urgent. Renewable energy has a role within the water-energy nexus, though quantitative and qualitative knowledge of its impact remains limited [1]. In this case, the proposed microgrid has the potential to ensure service, regardless of changes in precipitation patterns and their effect on hydropower reserves.
The remainder of this article is structured as follows. Section 2 describes the algorithms and the method for classifying the renewable resource, and presents the model for the optimal design of the hybrid microgrid, which is used to evaluate the results of each algorithm in a practical application with real data. In Sec. 3, the results of each clustering algorithm and the optimal design of the hybrid microgrid are assessed for the different values of solar irradiation and air temperature obtained in each cluster. Finally, the discussion and conclusions are presented in Secs. 4 and 5, respectively.
2 Materials and Methods
The methodology of the present work is summarized in Fig. 1. In this study, an onsite information survey of a rural community was carried out, covering demographic information, the vital drinking water system, the consumption patterns of the inhabitants, and the geographic location, among other relevant data. Then, meteorological data were taken for the site and processed with different clustering algorithms to obtain representative values of each variable. The performance metrics of each algorithm were analyzed. Finally, the optimal design of the microgrid was developed with the parameters obtained from the clustering algorithms, using an existing model for sizing microgrids, adapted to the particularities of the community.
2.1 Barcelona Rural Community.
The Barcelona community is located in the province of Santa Elena, Ecuador, with coordinates at latitude and longitude . Its location is shown in Fig. 2. It lies approximately 200 km from the city of Guayaquil and has a population of approximately 3900 inhabitants and a total area of 1800 ha.
The community bases its economy entirely on the production of citrus fruits, such as lemons, and “paja toquilla”, a weaving traditionally produced in these areas of the country. For its electricity supply, the community is connected directly to the national transmission system through a local distribution system. However, due to its location, the quality of the power supply is low and causes constant interruptions. One of the main consequences is the disruption of the supply of drinking water. The following section explains in detail this system and the current situation of the community.
2.2 Estimating the Electricity Demand.
A survey of information and estimation of energy demand in the rural community was carried out through a detailed characterization of the population, economic activity, and geographic conditions. In addition, the existing energy infrastructure, historical consumption, and community surveys were analyzed to understand the energy habits and needs of the residents. The potential of local renewable energy sources was evaluated, to project future demand, considering the demographic and technological growth of the community.
The survey revealed several significant issues related to the community’s energy supply. First, residential energy consumption is a concern, as the local distribution network provides poor service quality, marked by frequent blackouts. The second major issue, which is the focus of this work, involves the community’s water pumping system. This system includes a reservoir located at an elevation of 120 m above sea level (MASL), which is supplied by four wells situated at approximately 20 MASL. The pumping system must elevate the water by roughly 100 m from the wells to the reservoir. Figure 3 illustrates the system architecture.
Two of the four wells are equipped with 25 kVA transformers that power the submersible pumps, while the other two wells have 15 kVA transformers. Although the installed pumps have a peak electrical power consumption of 15 kW, the transformers are currently operating below their maximum capacity. However, with the expected increase in demand, an expansion of the pumping capacity may become necessary.
The installed peak power of the system is 160 kW, with a 220 V three-phase power supply system. To establish the consumption characteristics of the system, the storage capacity of the water tank was analyzed. The reservoir is 25 m in diameter and 6 m high, with an approximate storage capacity of . According to the surveys conducted in the community, the highest water consumption schedule occurs between 1:00 a.m. and 5:00 a.m. due to the production processes of toquilla straw.
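As a rough cross-check of the reservoir dimensions quoted above, the following sketch computes the geometric volume of the tank. It assumes the reservoir is an upright cylinder filled to its full height, which the text does not state explicitly; the function name is illustrative only, and the result is a geometric estimate rather than the figure reported by the authors.

```python
import math

def cylinder_volume_m3(diameter_m: float, height_m: float) -> float:
    """Volume of an upright cylindrical tank, V = pi * (d / 2)^2 * h."""
    radius_m = diameter_m / 2.0
    return math.pi * radius_m ** 2 * height_m

# Dimensions quoted in the text: 25 m diameter, 6 m high.
# Assumption: cylindrical tank, usable up to its full height.
reservoir_volume = cylinder_volume_m3(25.0, 6.0)
print(f"Geometric volume: {reservoir_volume:.0f} m^3")
```

Under these assumptions the geometric volume is a little under 3000 cubic meters.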
The pumping schedule of the system was established during the early morning and during the day, when the solar resource is available, thus guaranteeing the water supply during the day and its availability during the night. The pumping system operates at its maximum capacity from 9:00 a.m. to 4:00 p.m. and partially operates a well from 1:00 a.m. to 5:00 a.m. (30 kW).
For the estimation of demand growth, a growth rate of 1% per year is considered, as projected by the World Bank for territories in Ecuador [14] and considering a horizon of 15 years as the average useful life of the renewable systems. The maximum peak consumption is set at 185 kW. With all these considerations, Fig. 4 presents the estimated energy consumption curve.
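The relation between the installed peak power, the growth rate, and the design peak can be checked with a compound-growth calculation. The sketch below assumes that the 160 kW installed peak is the base value and that the 1% rate compounds annually over the 15-year horizon; the text does not state the exact formula used, so this is only one plausible reading.

```python
def projected_peak_kw(base_kw: float, annual_rate: float, years: int) -> float:
    """Compound growth: base * (1 + rate) ** years."""
    return base_kw * (1.0 + annual_rate) ** years

# Values quoted in the text: 160 kW installed peak, 1 % per year, 15 years.
peak_15y = projected_peak_kw(160.0, 0.01, 15)
print(f"Projected peak after 15 years: {peak_15y:.1f} kW")
```

Under these assumptions the projection lands close to the 185 kW design peak adopted in the study.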
2.3 Meteorological Data.
In the Barcelona community, the National Institute of Meteorology and Hydrology (INAMHI) operates a meteorological station. However, this station does not monitor certain key variables necessary for the design of microgrids, specifically irradiance, air temperature, and wind speed. These variables are critical for accurately assessing the potential of renewable energy sources within a microgrid. Therefore, to proceed with the analysis, historical data for irradiance, air temperature, and wind speed were obtained from the NASA meteorological database [15]. Ten years of data, from Jan. 1, 2013, to Dec. 31, 2022, were used, with measurements taken at an hourly resolution, totaling 87,600 samples. Figure 5 presents the time series of these variables for the year 2022.
Irradiation values oscillate between 1.55 kWh/m²/day and 6.4 kWh/m²/day. These values suggest that the solar resource is abundant throughout the year, although periods of low irradiation, such as June to August, must be taken into account. Identifying such periods is precisely what clustering seeks to achieve, as explained in the following sections for each of the methods used. Likewise, the minimum average temperature value recorded in 1 day is and the maximum is . Regarding wind, the average values are also encouraging: at 10 m above sea level, there is a minimum of 1.88 and a maximum of ; while at 50 m, there is a minimum of 2.59 m/s and a maximum of 7.77 m/s. Although wind generators are not considered in the sizing and design of the microgrid, wind variables are incorporated in the analysis to obtain a more complete picture of the environmental conditions of the region. In addition, the possibility of integrating this type of renewable energy in future work is examined.
2.4 K-Means Clustering Algorithm.
The k-means clustering algorithm was employed to make predictions of solar energy production as a function of key meteorological variables. The algorithm is a clustering technique that seeks to divide a data set into k clusters, where each cluster is represented by a centroid [16]. It starts by choosing initial centroids and then assigns each point to the cluster whose centroid is closest. The centroids are updated iteratively by recalculating the average of the points in each cluster. The process is repeated until the centroids barely change between iterations, indicating convergence.
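The assignment and update steps described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard iteration, not the implementation used in the study; the random initialization, the tolerance, and the toy (irradiance, temperature) pairs are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # assumed initialization
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        shift = max(squared_distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # centroids barely changed: converged
            break
    return centroids, labels

# Hypothetical (irradiance, temperature) samples, for illustration only.
points = [(50, 22), (60, 23), (55, 22.5), (700, 29), (720, 30), (710, 29.5)]
centroids, labels = kmeans(points, k=2)
```

In practice the variables would be normalized first, as described in the preprocessing step, so that irradiance does not dominate the distance.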
Before applying the k-means algorithm, extensive data preprocessing is performed to ensure the quality and consistency of the datasets. This includes the detection and handling of outliers, normalization of variables to ensure comparable scales, and imputation of any missing values using appropriate techniques. The dataset is divided into training and test sets to evaluate the performance of the model: in this study, the model is trained with 9 years of data and evaluated with the remaining year, which provides an objective and realistic evaluation of the k-means model.
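The split and normalization steps can be illustrated as follows. The sketch assumes z-score standardization and a held-out final year (2022); the text specifies neither the normalization method nor which year is reserved for testing, and the record layout and helper names are hypothetical.

```python
from statistics import mean, pstdev

def split_by_year(records, test_year):
    """Chronological split: one held-out year for testing, the rest for training."""
    train = [r for r in records if r["year"] != test_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test

def fit_standardizer(values):
    """Return (mean, std) estimated on the training values only."""
    mu = mean(values)
    sigma = pstdev(values)
    return mu, (sigma if sigma > 0 else 1.0)

def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

# Toy records with hypothetical values (one sample per year for brevity).
records = [
    {"year": 2020, "irradiance": 400.0},
    {"year": 2021, "irradiance": 600.0},
    {"year": 2022, "irradiance": 500.0},
]
train, test = split_by_year(records, test_year=2022)  # assumed test year
mu, sigma = fit_standardizer([r["irradiance"] for r in train])
train_scaled = standardize([r["irradiance"] for r in train], mu, sigma)
test_scaled = standardize([r["irradiance"] for r in test], mu, sigma)
```

Fitting the scaling parameters on the training years only avoids leaking information from the evaluation year into the model.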
The k-means algorithm is implemented to group the data into k clusters. The optimal number of clusters k is chosen with the elbow method in conjunction with the silhouette index. The elbow method, also known as the elbow rule, is a technique used to determine the optimal number of clusters (k) in a clustering algorithm such as k-means. The central idea is to observe how the sum of the intracluster squared distances varies as a function of the number of clusters and to find the point where the reduction in intracluster distance slows markedly, forming a curve that resembles an “elbow.” The procedure involves running the clustering algorithm for different values of k and recording the sum of the intracluster squared distances for each case. Figure 6 plots this information, and the point at which adding another cluster no longer provides a substantial improvement is identified as the “bend” in the curve [17]. Note that the optimum number of clusters can be selected as 3 or 4, depending on the bend observed. To make a final and more informed decision, the silhouette index was applied in this research.
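The quantity plotted in the elbow curve, and one simple way to locate the bend numerically, can be sketched as follows. The study reads the bend visually from Fig. 6; the second-difference heuristic used here is only one of several possible rules, and the curve values are hypothetical.

```python
def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances for a given assignment."""
    total = 0.0
    for point, label in zip(points, labels):
        total += sum((x - c) ** 2 for x, c in zip(point, centroids[label]))
    return total

def elbow_by_second_difference(ks, wcss_values):
    """Pick the k after which the improvement drops the most."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        gain_before = wcss_values[i - 1] - wcss_values[i]
        gain_after = wcss_values[i] - wcss_values[i + 1]
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Tiny 1-D example: two clusters with centroids at 1 and 10.
toy_wcss = wcss([(0.0,), (2.0,), (10.0,)], [0, 0, 1], [(1.0,), (10.0,)])

# Hypothetical elbow curve, not the values behind Fig. 6.
ks = [1, 2, 3, 4, 5, 6]
hypothetical_curve = [1000.0, 600.0, 250.0, 210.0, 180.0, 160.0]
chosen_k = elbow_by_second_difference(ks, hypothetical_curve)
```

When two neighboring values of k give similar bends, as reported for 3 and 4 in this study, a second criterion such as the silhouette index is needed to decide.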
Figure 7 shows the silhouette index for the same dataset used with the elbow method. Note that 3 and 4 have similar scores, while 2 has the lowest value. A high silhouette index value indicates that the clustering is appropriate, with well-defined clusters and a clear separation between them. In contrast, a low value suggests that the clusters may be overlapping or that some points may have been incorrectly assigned [18]. Based on this, and considering the two methods, 3 is chosen as the optimal number of clusters k in k-means. Each cluster is associated with a specific meteorological profile that will influence solar energy production. The clusters obtained are analyzed to identify meteorological patterns and their correlation with energy production.
Figure 8 presents the clusters obtained in k-means for each variable. As visualized in the figure, three distinct clusters are revealed based on the features of solar irradiance () and air temperature (). Cluster 0, the curve with the highest peak in the irradiation graph and the lowest peak in the air temperature graph, is associated with lower solar irradiance, primarily below , and moderate air temperatures ranging from approximately to . This cluster likely corresponds to conditions such as early morning or late afternoon, or cloudy days when the sun’s intensity is lower. Cluster 1 encompasses a broader range of solar irradiance values, mostly between 200 and and air temperatures from to . This cluster may represent midday conditions with varying solar intensity, possibly due to partial cloud cover or seasonal variations. Finally, cluster 2, the curve that lies to the right of the irradiance and temperature graphs, is characterized by higher solar irradiance values, typically between 500 and , and higher air temperatures ranging from to . This cluster likely corresponds to clear, sunny conditions, typical of midday in warmer seasons, when solar energy production would be optimal.
The same methodology is applied to obtain the clusters for all the algorithms considered in this study. Section 2.9, on metrics, explains how the performance of each algorithm is calculated and the methodology proposed for selecting and evaluating the clustering algorithms.
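The silhouette index used above to decide between candidate values of k can be computed directly from its definition. The sketch below is a minimal plain-Python version that uses Euclidean distance, which is an assumption since the text does not name the distance; the toy data are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_score(points, labels):
    """Mean silhouette: (b - a) / max(a, b) averaged over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good_score = silhouette_score(toy_points, [0, 0, 1, 1])  # compact, well separated
bad_score = silhouette_score(toy_points, [0, 1, 0, 1])   # clusters interleaved
```

Values near 1 correspond to compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.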
2.5 Gaussian Mixture Models.
For the Gaussian mixture model, the number of clusters is selected with the Bayesian information criterion (BIC). This metric balances model accuracy and model complexity, penalizing models that are too complex. The BIC is calculated considering both the likelihood of the model and the number of parameters used, introducing a penalty term proportional to the number of parameters. In the optimization process, the value of k (number of clusters) that minimizes the BIC is sought, which helps to avoid overfitting and to select a model that generalizes well to new data. Choosing the number of clusters based on the BIC balances the model’s goodness of fit against its simplicity, thus improving the interpretability and efficiency of the clustering algorithm [20].
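The selection rule can be made concrete with the standard definition of the criterion, BIC = p ln(n) - 2 ln(L), where p is the number of free parameters, n the number of samples, and L the maximized likelihood. The sketch below assumes full covariance matrices when counting parameters, and the log-likelihood values are hypothetical, since fitting the mixture itself is outside the scope of this example.

```python
import math

def gmm_free_parameters(k: int, d: int) -> int:
    """Free parameters of a k-component, d-dimensional GMM (full covariances assumed)."""
    weights = k - 1
    means = k * d
    covariances = k * d * (d + 1) // 2
    return weights + means + covariances

def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def select_k_by_bic(log_likelihoods, d, n_samples):
    """log_likelihoods maps each candidate k to its maximized log-likelihood."""
    return min(
        log_likelihoods,
        key=lambda k: bic(log_likelihoods[k], gmm_free_parameters(k, d), n_samples),
    )

# Hypothetical log-likelihoods for two features (irradiance, temperature).
hypothetical_ll = {2: -5200.0, 3: -5000.0, 4: -4990.0}
best_k = select_k_by_bic(hypothetical_ll, d=2, n_samples=1000)
```

In this toy case the small likelihood gain from a fourth component does not offset the penalty for its extra parameters, so three components are preferred.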
2.6 Hierarchical Clustering.
Hierarchical clustering was also incorporated into the methodology for energy production prediction. Hierarchical clustering is an unsupervised learning method that organizes data into a tree structure or dendrogram, reflecting the similarity relationships between data points. This approach can be agglomerative, where each point starts as an independent cluster and progressively merges into larger clusters, or divisive, where all data start as a single cluster and are divided into smaller subclusters. Hierarchical clustering provides a structured and understandable representation of the similarity between groups of data, allowing the identification of patterns at different levels of granularity [21].
The choice of the optimal number of clusters is performed using the dendrogram method. Figure 9 shows this procedure. The dendrogram technique in the context of hierarchical clustering is an essential visual resource for understanding the hierarchical structure of clusters in a dataset. The dendrogram graphically represents the successive mergers or splits of clusters along the hierarchical process [21]. On the vertical axis, the branches of the dendrogram indicate the distance or dissimilarity between clusters, while the horizontal axis represents the individual data points or clusters. By observing the height at which the branches meet or cutoff, similarity between groups and cluster formation at different levels of granularity can be determined. Interpretation of the dendrogram provides insights into the hierarchical structure and organization of the data, which facilitates the choice of the optimal number of clusters.
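The merge heights that a dendrogram displays can be produced with a naive bottom-up procedure. The sketch below assumes average linkage and Euclidean distance, neither of which is specified in the text, and stops merging once the requested number of clusters is reached, which corresponds to cutting the dendrogram at that level.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise distance between two clusters of point indices."""
    total = sum(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly; return clusters and merge heights."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merge_heights = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(points, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(d)  # height of this junction in the dendrogram
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_heights

toy_points = [(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)]
toy_clusters, heights = agglomerate(toy_points, n_clusters=3)
```

A large jump between consecutive merge heights is the numerical counterpart of a long vertical branch in the dendrogram and suggests a natural place to cut.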
2.7 DBSCAN: Density-Based Spatial Clustering of Applications With Noise.
Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm used in the field of unsupervised learning and data mining. Its main strength lies in its ability to identify clusters of arbitrary shape in datasets, without relying on an assumption about the shape or number of clusters. The algorithm is based on the concept of density, where a cluster is defined as a region with a high density of points, separated from other clusters by sparser regions. DBSCAN classifies data points into three main categories: core points, border points, and noise points [22]. The algorithm proceeds by exploring the feature space and connecting core points with their neighboring points, thus forming the clusters shown in Fig. 10. It is observed that DBSCAN fails to adequately find the clusters. This observation is further addressed in the results section and the corresponding comparison.
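The three point categories can be illustrated with a short sketch. It assumes the common convention in which a point counts itself among its neighbors; the radius `eps`, the threshold `min_samples`, and the toy data are illustrative and are not the settings used in the study.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_samples):
    """Label every point as 'core', 'border', or 'noise'."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_samples}
    kinds = []
    for i, nb in enumerate(neighbors):
        if i in core:
            kinds.append("core")    # dense neighborhood
        elif any(j in core for j in nb):
            kinds.append("border")  # reachable from a core point
        else:
            kinds.append("noise")   # isolated
    return kinds

toy_points = [(0.0,), (1.0,), (2.0,), (3.5,), (10.0,)]
kinds = classify_points(toy_points, eps=1.5, min_samples=3)
```

Clusters are then formed by linking core points that lie within `eps` of each other and attaching border points to them, so the outcome depends strongly on these two parameters.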
2.8 Agglomerative Clustering.
The agglomerative algorithm is used to iteratively merge the data points into hierarchical clusters, forming a tree structure representing similarity relationships. The choice of the optimal number of clusters is performed by methods such as dendrogram observation that were previously executed in this work [23]. Figure 11 gives an overview of how the different groups that the algorithm manages to label were created and grouped. Visually, it is inferred that the grouping is adequate and seems to be acceptable for the choice of representatives.
However, this is analyzed in Sec. 3 by comparing it with the other centroids found and the proposed metrics.
2.9 Metrics: Mean Absolute Error and Coefficient of Determination ($R^2$).
The methodology proposed for the evaluation of the five implemented algorithms is as follows. With the clusters found by each algorithm, a curve is created, as shown in Fig. 13, for each variable in each classifier. This curve is compared day by day with the real dataset of each variable, which is shown in Fig. 14. From the estimated and actual curves, the mean absolute error (MAE) and coefficient of determination ($R^2$) metrics are obtained.
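MAE and $R^2$ are metrics commonly used to assess the quality of regression models and to quantify the accuracy of predictions in predictive analytics; here, they are used to assess the machine-learning models:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

In these equations, $y_i$ are the actual values, $\hat{y}_i$ are the predictions of the model, $\bar{y}$ is the mean of the actual values, and $n$ is the number of samples. An $R^2$ close to 1 indicates a good model fit, while lower values suggest that the model does not explain the variability of the data well [25,26].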
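Both metrics follow directly from their standard definitions, as the sketch below shows for an actual curve and an estimated curve. The sample values are hypothetical irradiance readings chosen only to make the arithmetic easy to follow.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute deviation between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and estimated values, for illustration only.
actual = [0.0, 200.0, 600.0, 200.0]
estimated = [0.0, 250.0, 550.0, 200.0]
mae = mean_absolute_error(actual, estimated)
r2 = r_squared(actual, estimated)
```

MAE is expressed in the units of the variable, whereas $R^2$ is dimensionless, so the two metrics complement each other when comparing algorithms across irradiance and temperature.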
Once the clusters have been found and the five algorithms have been evaluated, the microgrid design is carried out. This serves to evaluate the algorithms in practice and to observe firsthand how the choice of algorithm affects the total cost of the microgrid sizing.
2.10 Optimal Microgrid Design.
Microgrids are defined by the integration of various distributed generation resources, energy storage systems, and loads in a small system capable of operating connected to a main grid and, in cases of emergency or scheduled events, capable of operating in an isolated manner, and controlling frequency and voltage.
The sizing and design of microgrids can be modeled as an optimization problem that defines the technology, size, allocation, and operation of the following components:
Distributed renewable generation (photovoltaic, wind).
Thermal generation (diesel, gas, combined cycle).
Energy storage system (batteries, lithium-ion, flywheel).
The mathematical model used to size and design the microgrid is deliberately simple and focuses only on analyzing the clustering results from an economic point of view.
As a hypothesis, it was assumed that the microgrid will operate with solar generation at its maximum during the hours of highest solar incidence, which means that the inclusion of energy storage systems, such as batteries, is disregarded. The results of the clustering techniques are entered into the mathematical model as known parameters, and hence, their accuracy directly affects the proposed microgrid sizing and design.
2.11 The Proposed Optimization Problem.
The proposed microgrid to supply power to the community consists of a genset, a battery energy storage system (BESS), solar PV panels, and a power inverter. This section presents the mathematical formulation of the optimization problem, which minimizes the total annual investment cost of the microgrid, including the O&M cost of a genset, for the Barcelona community. The solution of this problem gives the optimal sizing of the microgrid. Linear optimization was used to solve the problem, given that the decision variables are continuous and the objective function is linear in the decision variables [29]. The problem was modeled in AMPL and solved with the CPLEX solver, which is based on primal-dual simplex algorithms [30]. AMPL is an algebraic modeling language for linear or nonlinear problems with continuous and discrete variables [31].
The first term of Eq. (5) describes the cost of hourly genset operation for days, where is a conversion unit, which is equivalent to 1 h, is the rate for the sale of electricity from the genset to the microgrid, and is the power delivered to the microgrid by the generator every hour. The second term of Eq. (5) refers to the penalty due to the probability of failure of the microgrid during any hour of the day, where is the cost of a power outage (power not supplied), is the power demand by each hour, and is the percentage of load shedding by each hour.
Finally, contains the microgrid sizing variables and is composed of four terms according to Eq. (6), as follows: The first term of refers to the cost per electricity generation use from the PV system, where is the cost of investing in PV panels and is the total power output of PV panels (renewable generation). The second term represents the cost of using power from a diesel generator, where is the investment cost in a new generator whose maximum power output is equal to or less than the existing genset in the community. The third term corresponds to the battery-based energy storage system, where is the cost of charging or discharging power from the BESS to the microgrid and is the maximum power of charging or discharging of the BESS. The fourth term is composed of , which is the cost related to the storage capacity of the BESS, and , which is the maximum amount of electricity that the BESS can store.
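To make the structure of the objective more tangible, the sketch below evaluates the cost of a candidate design from the terms described above, using the coefficients of Table 1. All function and key names are hypothetical, the day multiplier is applied only to the genset term because that is how the text describes it, and the sketch evaluates a given design rather than solving the optimization problem.

```python
# Cost coefficients taken from Table 1 of the text; key names are hypothetical.
COSTS = {
    "pv_per_kw": 464.60,
    "gen_per_kw": 997.28,
    "bess_power_per_kw": 197.92,
    "bess_energy_per_kwh": 197.92,
    "gen_tariff_per_kwh": 0.16,
    "shedding_per_kw": 0.65,
}

def sizing_cost(p_pv_kw, p_gen_kw, p_bess_kw, e_bess_kwh, costs=COSTS):
    """Investment terms: PV, genset, BESS power, and BESS energy capacity."""
    return (
        costs["pv_per_kw"] * p_pv_kw
        + costs["gen_per_kw"] * p_gen_kw
        + costs["bess_power_per_kw"] * p_bess_kw
        + costs["bess_energy_per_kwh"] * e_bess_kwh
    )

def operating_cost(gen_kw, demand_kw, shed_fraction, n_days=1, dt_h=1.0, costs=COSTS):
    """Hourly genset energy cost plus the penalty for load shedding."""
    # Assumption: only the genset term is scaled by the number of days.
    genset = n_days * sum(dt_h * costs["gen_tariff_per_kwh"] * p for p in gen_kw)
    penalty = sum(
        costs["shedding_per_kw"] * d * x for d, x in zip(demand_kw, shed_fraction)
    )
    return genset + penalty

# Hypothetical candidate design and a two-hour operating profile.
investment = sizing_cost(p_pv_kw=60.0, p_gen_kw=185.0, p_bess_kw=0.0, e_bess_kwh=0.0)
operation = operating_cost([100.0, 54.0], [100.0, 60.0], [0.0, 0.1])
total = investment + operation
```

In the actual model these quantities are decision variables chosen by the solver, subject to the constraints listed next.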
The constraints of the optimization problem are as follows:
- Active Power Balance: Equation (7) corresponds to the restriction related to active power balance, i.e., the power produced by the microgrid is equal to the power consumed by the community, including the process of charging and discharging of the BESS, where is the power injected to the microgrid from the BESS for each hour in the respective scenario and is the charging power from the PV power system to the BESS in the respective scenario.
- Genset Capacity: Equation (8) shows that must be less than or equal to the maximum capacity of the genset () and higher than zero.
- Active Power Injection Capacity: Equation (9) shows the range of active power that can be delivered by the BESS to the microgrid, where the maximum value is given by .
- Active Power Extraction Capacity: Equation (10) represents the maximum storage capacity limit of the BESS during each hour of the day.
- Hourly Energy Balance for in Each Operating Scenario: Equation (11) represents the energy balance, indicating the energy stored in the BESS in each hour for , which is equal to the energy stored in the BESS in the previous hour plus the energy transferred in that time, where is the BESS efficiency and is the BESS self-discharge rate.
- Initial Energy Balance in in Each Operating Scenario: Equation (12) represents the energy balance, indicating the energy stored in the BESS in , which is equal to the energy stored in the BESS plus the energy transferred in that time, where is the initial energy in the BESS.
- BESS Maximum Storage Capacity: Equation (13) limits the stored energy in each hour, which ranges between 0 and .
- Load-Shedding Percentage: Equation (14) defines the load cutoff percentage to be used in the optimization model.
- Renewable Source Sizing: Equation (15) establishes the boundaries for the sizing of the renewable power source (PV power system), where is the maximum boundary in the renewable power generation.
- Genset Sizing: Equation (16) restricts the power generation of the genset, where is the maximum generation capacity of the community genset.
- Power Transfer Capacity: Equation (17) indicates that the minimum power exchange (charging and discharging processes) of the BESS must be greater than or equal to zero.
- BESS Sizing: Equation (18) indicates the minimum quantity of electricity that can be stored in the BESS.
Figure 12 categorizes and represents the variables involved in the optimization problem. The sizing variables of the mathematical model are the decision variables: the size of each subsystem is unknown, and an optimal value is sought for the size of the renewable generation, the size of the genset, and the size of the BESS.
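The hourly energy balance of the BESS and the active power balance described in the constraints can be simulated for a given schedule. The sketch below uses a common textbook form in which self-discharge acts on the stored energy and the efficiency applies to both charging and discharging; the exact form of Eqs. (7) and (11) is not recoverable from the text, so this is an assumption, as are all function names. The efficiency, self-discharge rate, and initial energy are the values listed in Table 1.

```python
def bess_energy_trajectory(charge_kw, discharge_kw, e_init_kwh,
                           efficiency=0.95, self_discharge=0.02, dt_h=1.0):
    """Stored energy per hour: previous energy (less self-discharge) plus net transfer."""
    energy = []
    previous = e_init_kwh
    for p_ch, p_dis in zip(charge_kw, discharge_kw):
        current = (
            (1.0 - self_discharge) * previous
            + efficiency * p_ch * dt_h   # energy actually stored while charging
            - p_dis * dt_h / efficiency  # energy withdrawn to deliver p_dis
        )
        energy.append(current)
        previous = current
    return energy

def power_balance_residual(p_pv, p_gen, p_dis, p_ch, demand, shed_fraction):
    """Zero when generation plus BESS discharge covers the served load and charging."""
    return p_pv + p_gen + p_dis - p_ch - (1.0 - shed_fraction) * demand

# Hypothetical two-hour schedule: charge 10 kW, then discharge 5 kW.
trajectory = bess_energy_trajectory([10.0, 0.0], [0.0, 5.0], e_init_kwh=0.0)
residual = power_balance_residual(
    p_pv=40.0, p_gen=14.0, p_dis=5.0, p_ch=0.0, demand=60.0, shed_fraction=1.0 / 60.0
)
```

A feasible schedule keeps every value of the trajectory between zero and the BESS capacity and drives the balance residual to zero in every hour.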
2.12 Input Data to the Model
Table 1 summarizes the values of the component capacities and other technical–economic values used in the optimization problem. The costs of the components were based on the Ecuadorian market price at the time of the present study.
Description | Value |
---|---|
Solar source installation limit | 60.00 kW |
Cost of installing solar sources | 464.60 $/kW |
Cost per load shedding | 0.65 $/kW |
Cost of investing in thermal generation | 997.28 $/kW |
Power purchase tariff to the generator set | 0.16 $/kWh |
Diesel generator capacity limit | 185.00 kW |
Cost of investing battery transfer power | 197.92 $/kW |
Cost of investing maximum battery storage capacity | 197.92 $/kWh |
Initial battery energy | 0.00 kWh |
Battery efficiency | 0.95 |
Battery self-discharge rate | 0.02 |
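To show how the unit costs in Table 1 enter a sizing decision, the sketch below evaluates the capital cost of one candidate design. The candidate sizes and the function name `investment_cost` are hypothetical, and the plain sum is an assumed form of the investment term, not the paper's objective function.

```python
# Unit investment costs copied from Table 1 (USD per kW or per kWh).
UNIT_COST = {
    "pv_kw": 464.60,
    "genset_kw": 997.28,
    "bess_power_kw": 197.92,
    "bess_energy_kwh": 197.92,
}

# Installation limits copied from Table 1 (kW).
LIMIT = {"pv_kw": 60.00, "genset_kw": 185.00}


def investment_cost(sizes):
    """Return the capital cost (USD) of a candidate sizing.

    ASSUMPTION: investment is the sum of size times unit cost per subsystem.
    """
    for key, value in sizes.items():
        if key not in UNIT_COST:
            raise KeyError(f"unknown subsystem: {key}")
        if value < 0:
            raise ValueError(f"{key} must be non-negative")
    for key, limit in LIMIT.items():
        if sizes.get(key, 0.0) > limit:
            raise ValueError(f"{key} exceeds the Table 1 limit of {limit}")
    return sum(UNIT_COST[key] * sizes.get(key, 0.0) for key in UNIT_COST)


# Hypothetical candidate design, chosen only for illustration.
candidate = {
    "pv_kw": 60.0,
    "genset_kw": 100.0,
    "bess_power_kw": 20.0,
    "bess_energy_kwh": 80.0,
}
total = investment_cost(candidate)
print(f"Investment: {total:,.2f} USD")
```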
3 Results
Table 2 shows the clusters found by each algorithm: k-means, GMM, and agglomerative clustering group the data into three clusters, hierarchical clustering does so with four clusters, and DBSCAN finds a single representative for the whole data group. k-Means allows an estimated curve (Fig. 13) to be constructed by placing each representative's value in its respective time window: the hours of highest solar irradiation, from 9 a.m. to 4 p.m.; the afternoon, from 4 p.m. to 6 p.m.; and the morning, from 6 a.m. to 9 a.m. This result shows how k-means can be used to construct an envelope that is representative of the irradiance data. In the same way, three temperature values are found (28.75, 27.23, and 23.37 °C), each at its respective time. Regarding the fit, k-means and GMM have similar R² scores for irradiance: 0.72 for the former and a slightly higher 0.76 for the latter. The same holds for temperature, with 0.64 and 0.61, respectively.
Cluster | k-Means irradiation (W/m²) | k-Means temperature (°C) | GMM irradiation (W/m²) | GMM temperature (°C) | Hierarchical irradiation (W/m²) | Hierarchical temperature (°C) | DBSCAN irradiation (W/m²) | DBSCAN temperature (°C) | Agglomerative irradiation (W/m²) | Agglomerative temperature (°C) |
---|---|---|---|---|---|---|---|---|---|---|
C1 | 527.23 | 28.75 | 498.90 | 28.36 | 320.42 | 25.49 | 310.04 | 26.51 | 465.77 | 28.59 |
C2 | 241.04 | 27.23 | 182.08 | 26.98 | 229.35 | 24.58 | – | – | 160.39 | 23.49 |
C3 | 119.03 | 23.37 | 98.11 | 22.82 | 229.05 | 28.80 | – | – | 114.91 | 25.75 |
C4 | – | – | – | – | 198.35 | 26.77 | – | – | – | – |
R² | 0.72 | 0.64 | 0.76 | 0.61 | 0.28 | 0.66 | 0.92 | 0.56 | 0.84 | 0.55 |
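The envelope construction described for k-means can be sketched as a piecewise-constant daily curve, together with the two error metrics used in the comparison. The irradiance values are the k-means representatives from Table 2. Which of C2 and C3 belongs to the morning and which to the afternoon is not stated in the table, so that mapping is an assumption, as are the zero value outside 6 a.m. to 6 p.m. and the helper names `envelope`, `mean_absolute_error`, and `r_squared`.

```python
# (start_hour, end_hour, irradiance) with end_hour exclusive.
# ASSUMPTION: C1 -> peak hours is implied by its magnitude, but the
# assignment of C2 to the morning and C3 to the afternoon is a guess.
WINDOWS = [
    (6, 9, 241.04),    # morning, 6 a.m. to 9 a.m.
    (9, 16, 527.23),   # highest irradiation, 9 a.m. to 4 p.m.
    (16, 18, 119.03),  # afternoon, 4 p.m. to 6 p.m.
]


def envelope(windows, hours=24):
    """Build a piecewise-constant hourly curve from cluster representatives."""
    curve = [0.0] * hours  # ASSUMPTION: zero irradiance outside the windows
    for start, end, value in windows:
        for hour in range(start, end):
            curve[hour] = value
    return curve


def mean_absolute_error(observed, predicted):
    """Average absolute deviation between two equally long series."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


curve = envelope(WINDOWS)
```

Given a measured hourly series `observed` of the same length, `mean_absolute_error(observed, curve)` and `r_squared(observed, curve)` would return the two scores for that day.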
Hierarchical clustering finds that the optimal number of clusters is 4. Therefore, four representatives compose the daily irradiance and temperature data: 320.42, 229.35, 229.05, and 198.35 W/m²; and 25.49, 24.58, 28.80, and 26.77 °C, respectively (Table 2). For the analysis, only three clusters were considered, as in the k-means and GMM cases, since the C2 and C3 irradiance representatives are practically equal in magnitude and using only one of them causes no major variation in the constructed curve. On the other hand, the irradiance fit is the lowest of all the algorithms evaluated, with an R² score of 0.28, against 0.66 for temperature (Table 2). In terms of MAE, it is also the least accurate algorithm. Figure 14 shows that its accuracy is low during all hours of the day, which suggests that a higher number of representatives does not necessarily imply better data prediction.
DBSCAN is the opposite case to hierarchical clustering, GMM, and k-means. It establishes a single representative irradiance value for the whole dataset, 310.04 W/m², and obtains an R² fit of 0.92, the highest of the whole set. In parallel, it obtains 0.56 for temperature. According to the MAE, however, DBSCAN has low prediction accuracy during the hours of highest irradiance, around 12 p.m. Finally, agglomerative clustering finds representatives like those of k-means and GMM, with similar time windows: the hours of highest solar irradiance, from 9 a.m. to 4 p.m.; the afternoon, from 4 p.m. to 6 p.m.; and the morning, from 6 a.m. to 9 a.m. Its R² fit is 0.84 for irradiance, below only DBSCAN, and 0.55 for temperature, close to the two mentioned algorithms. Figure 14 shows that the MAE of this algorithm is also close to that of k-means and GMM, except around 1 p.m.
From the point of view of the mathematical model, after comparing the clustering techniques for designing and sizing a microgrid and reviewing the findings in Table 3, it was observed that k-means and GMM produced the most favorable results for the reference investment, considering the microgrid's characteristics but excluding the monetary cost due to load shedding. Note that the MAE is a measure of the imprecision in the prediction of the energy resource. This imprecision translates into energy service problems that lead to load cuts. Although the investment in the microgrid does not increase significantly, the penalty in dollars due to load cuts changes drastically, with DBSCAN generating the greatest losses.
Clustering technique | Highest MAE | Microgrid investment (USD) | Load shedding (USD) |
---|---|---|---|
k-Means | 196 | 212,040.00 | 0 |
GMM | 214 | 211,127.00 | 0 |
Hierarchical | 283 | 213,101.00 | 2,259,290.00 |
DBSCAN | 266 | 222,175.00 | 6,870,940.00 |
Agglomerative | 282 | 220,661.00 | 584,757.00 |
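The effect of the load-shedding penalty on the comparison can be made explicit by ranking the techniques on the two cost columns of Table 3 taken together. The rows below are copied from the table; adding investment and load-shedding cost into a single figure is an illustrative assumption (the table reports them separately), and `rank_by_total_cost` is a hypothetical helper name.

```python
# Rows copied from Table 3:
# (technique, highest MAE, microgrid investment in USD, load shedding in USD)
TABLE_3 = [
    ("k-Means", 196, 212_040.00, 0.00),
    ("GMM", 214, 211_127.00, 0.00),
    ("Hierarchical", 283, 213_101.00, 2_259_290.00),
    ("DBSCAN", 266, 222_175.00, 6_870_940.00),
    ("Agglomerative", 282, 220_661.00, 584_757.00),
]


def rank_by_total_cost(rows):
    """Sort techniques by investment plus load-shedding cost, cheapest first.

    ASSUMPTION: the two costs are simply added; the source does not
    report a combined figure.
    """
    totals = [(name, invest + shed) for name, _mae, invest, shed in rows]
    return sorted(totals, key=lambda item: item[1])


ranking = rank_by_total_cost(TABLE_3)
for name, total in ranking:
    print(f"{name:<14} {total:>14,.2f} USD")
```

The investment column alone varies by only about 5% across techniques, so the ordering is driven almost entirely by the load-shedding column.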
4 Discussion
In the realm of renewable energy, particularly within microgrid development, accurate data analysis is paramount for ensuring the efficiency and reliability of energy systems. Climate and radiation data serve as essential inputs for sizing microgrids, directly influencing the potential generation of renewable energy from sources like solar and wind. Leveraging clustering techniques in data manipulation not only enhances data processing accuracy but also facilitates informed decision-making in microgrid design and management.
Clustering techniques provide a systematic approach to organizing vast and complex datasets, unveiling underlying patterns and structures. In the context of climate and radiation data, these techniques enable the identification of distinct meteorological and environmental conditions affecting renewable energy generation. By categorizing data points into clusters based on similarities, researchers and practitioners gain valuable insights into seasonal variations, weather patterns, and irradiance levels crucial for optimal microgrid sizing.
The integration of clustering techniques in data manipulation significantly influences the efficacy of microgrid sizing methodologies. Organizing climate and radiation data into meaningful clusters enables the development of robust predictive models for renewable energy generation. These models consider environmental factor variability, enabling accurate energy output estimation under different scenarios. As a result, microgrid designers can confidently dimension systems, ensuring optimal resource utilization and improved resilience to environmental fluctuations.
Beyond quantity, the quality of energy generated within microgrids holds utmost importance. Clustering techniques aid in assessing energy quality parameters such as voltage stability, frequency regulation, and harmonic distortion. By integrating environmental conditions and energy output data with clustering analysis, researchers identify correlations between climatic factors and energy quality metrics. This holistic approach optimizes microgrid configurations to maintain stable and reliable power supply, meeting stringent requirements of end-users and grid operators.
5 Conclusion
This article aims to underscore the fundamental role of clustering techniques in bridging the gap between data analytics and microgrid design, paving the way toward sustainable and reliable energy solutions in an era of increasing environmental challenges. k-Means and GMM demonstrated the highest precision, leading to the most accurate microgrid designs from the point of view of both microgrid investment and load shedding. These techniques are the most suitable for emulating and solving microgrid optimization problems because they capture microgrid behavior in both on-grid and off-grid operation. Furthermore, we highlighted the critical role of accurate clustering in microgrid design. For future work, it is imperative to explore advancements in clustering techniques to enhance accuracy in microgrid design. Additionally, investigating the viability of harnessing wind energy for hydraulic water pumping in remote communities represents a promising avenue, offering sustainable and accessible solutions for water supply. Simultaneously, addressing the social impact arising from water scarcity in community production processes is crucial, necessitating the exploration of strategies to mitigate these effects and ensure sustainable development in these regions.
Footnote
Empresa Eléctrica Pública Estratégica Corporación Nacional de Electricidad CNEL EP.
Acknowledgment
The authors would like to thank CERA, Sostenibilidad, SENESCYT, and PRESTIGE Research Group from ESPOL for supporting this research project.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Nomenclature
- AMPL = a mathematical programming language
- BIC = Bayesian information criterion
- DBSCAN = density-based spatial clustering of applications with noise
- GMM = Gaussian mixture model
- MAE = mean absolute error
- INAMHI = National Institute of Meteorology and Hydrology
- MASL = meters above sea level
- NASA = National Aeronautics and Space Administration
- SDG = sustainable development goal
- UNESCO = United Nations Educational, Scientific and Cultural Organization