Abstract
This study investigates the combined use of generative grammar rules and Monte Carlo tree search (MCTS) for optimizing truss structures. Our approach accommodates intermediate construction stages characteristic of progressive construction settings. We demonstrate the significant robustness and computational efficiency of our approach compared to alternative reinforcement learning frameworks adopted in previous studies, such as Q-learning and deep Q-learning. These advantages stem from the ability of MCTS to strategically navigate large state spaces, leveraging the upper confidence bounds for trees formula to effectively balance exploitation–exploration trade-offs. We also emphasize the importance of early decision nodes in the search tree, reflecting design choices crucial for highly performative solutions. Additionally, we show how MCTS dynamically adapts to complex and extensive state spaces without significantly affecting solution quality. While the focus of this article is on truss optimization, our findings suggest that MCTS is a powerful tool for addressing other increasingly complex engineering applications.
1 Introduction
Machine learning (ML) is impacting engineering applications, from structural health monitoring [1,2] and predictive maintenance [3] to optimal flow control [4] and automation in construction [5]. Thanks to algorithmic advances and increased computational capabilities, ML promises to enable new approaches in computational design synthesis (CDS)—a multidisciplinary research field aimed at automating the generation of design solutions for complex engineering problems [6–8]. By integrating constraints related to the fabrication process, for instance, through physics-based simulation, CDS could unlock the potential of additive manufacturing in various fields [9], e.g., 3D concrete printing [10].
The effectiveness of traditional approaches to truss optimization, such as the ground-structure method [11,12], has been established through decades of research. However, these methods suffer from high computational complexity and solution instability [13,14]. Alternative strategies for discrete truss optimization rely on heuristic techniques, including genetic algorithms [15–17], particle swarm optimization [18,19], differential evolution [20], and simulated annealing [21]. Nevertheless, the applicability of these methods is similarly limited by their high computational burden and slow convergence as the size of the search space increases [14].
The search space of candidate solutions can be narrowed through generative design grammars [22], which facilitate the exploration of alternative designs within a coherent framework [23,24]. These grammars are structured sets of rules that constrain the space of design configurations by accounting for mechanical information, such as the stability of static equilibrium. By integrating these rules within optimization procedures, it is therefore possible to explore incremental construction processes where the final design is reached through intermediate feasible configurations. The use of grammar-based approaches for truss topology generation and optimization has been proposed in Ref. [25], while their integration within heuristic approaches has been explored in Refs. [26–30].
Recently, the optimal truss design problem has been formalized as a Markov decision process (MDP) [31]. The solution to an MDP involves a series of choices, or actions, aimed at maximizing the long-term accumulation of rewards, which in this context measures the design objective. Viewing truss optimal design through the MDP lens, an action consists of adding or removing truss members, with the ultimate goal of optimizing a design objective, e.g., minimize the structural compliance. The final design thus emerges from a series of actions, possibly guided by grammar rules. This procedure is particularly suitable for truss structures, as it naturally accommodates discrete structural optimization, where adding a single member can significantly alter the functional objective of the design problem. Additionally, it can be extended to design optimization in additive manufacturing settings and continuum mechanics. The same methodology is similarly applicable to parametric optimization problems [32], including cases with stochastic control variables [33].
Reinforcement learning (RL) is the branch of ML that addresses MDPs through repeated and iterative evaluations of how a single action affects a certain objective [34]. Relevant instances of RL-based optimization in engineering include two-dimensional kinematic mechanisms [35] and the ground structure of binary trusses [36]. The advantage of RL over heuristic methods lies in its flexibility in handling high-dimensional problems, as demonstrated in Refs. [37–39]. In Ref. [31], the MDP formalizing the optimal truss design has been solved using Q-learning [40], constraining the search space with the grammar rules proposed in Ref. [41]. In a separate work [42], the same authors have also addressed the challenges of large and continuous design spaces through deep Q-learning.
In this article, we demonstrate how addressing optimal truss design problems with the Monte Carlo tree search (MCTS) algorithm [43,44] can offer significant computational savings compared to both Q-learning and deep Q-learning. MCTS is the RL algorithm behind the successes achieved by “AlphaGo” [45] and its successors [46,47] in playing board games and video games. In science and engineering, MCTS has been used for various applications employing its single-player games version [48]. Notable instances include protein folding [49], materials design [50,51], fluid-structure topology optimization [52], and the optimization of the dynamic characteristics of reinforced concrete structures [53].
For truss design, MCTS has been used in “AlphaTruss” [54] to achieve state-of-the-art performance while adhering to constraints on stress, displacement, and buckling levels. The same framework has been extended to handle continuous state-action spaces through either kernel regression [55] or soft actor-critic [56]—an off-policy RL algorithm. Despite the potential of using continuous descriptions of the design problem, the combination of RL and grammar rules proposed in Refs. [31,42] remains highly competitive, as it enables constraining the design process with strong inductive biases reflecting engineering knowledge. Building on this insight, the novelty of our approach lies in the integration of MCTS with grammar rules to strategically navigate the solution space, allowing for significant computational gains compared to Refs. [31,42], where Q-learning and deep Q-learning have been respectively adopted.
The effectiveness of the proposed approach lies in the MCTS capability to propagate information from the terminal nodes of the tree, which are associated with the final design performance, back to the ancestor nodes linked with the initial design states. This feedback mechanism allows for informing subsequent simulations, exploiting previously synthesized designs to enhance the decision-making process at initial branches and progressively refine the search toward optimal designs. Moreover, the probabilistic nature of MCTS enables the discovery of highly performative design solutions by balancing the exploitation–exploration trade-off. This balance is achieved through a heuristic hyperparameter that tunes the upper confidence bounds for trees (UCT) formula, whose effect is investigated through a parametric analysis.
The remainder of the article is organized as follows. Section 2 states the optimization problem and provides an overview of the MDP setting, grammar rules, and the MCTS algorithm. In Sec. 3, the computational procedure is assessed on a series of case studies. We provide comparative results with respect to Refs. [31,42], demonstrating superior design capabilities, and we test our methodology on two novel progressive construction setups. Section 4 finally summarizes the obtained results and draws the conclusions.
2 Methodology
In this section, we describe the methodology characterizing our optimal truss design strategy. This includes the physics-based numerical model behind the design problem in Sec. 2.1, the MDP formalizing the design process in Sec. 2.2, the grammar rules for truss design synthesis in Sec. 2.3, the MCTS algorithm for the optimal truss design formulated as an MDP in Sec. 2.4, and the UCT formula behind the selection policy in Sec. 2.5, before detailing their algorithmic integration in Sec. 2.6.
2.1 Optimal Truss Design Problem.
The design problem involves defining the truss geometry that optimizes a design objective under statically applied loading conditions. In the following, we consider minimizing the maximum absolute displacement experienced by the structure, although this is not a restrictive choice. This design setting, similar to the compliance minimization problem typical of topology optimization [13], has been retained for the purpose of comparison with Refs. [31,42].
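The physics-based model underlying this design setting amounts to a linear finite element (FE) analysis of the truss. As a minimal sketch of this kind of model (a simplified illustration, not the article's implementation; geometry handling and constraint bookkeeping are reduced to essentials), the direct stiffness method for a planar truss can be coded as:

```python
import numpy as np

def solve_truss(nodes, elements, forces, fixed_dofs, E=1.0, A=1.0):
    """Direct stiffness solver for a planar truss.

    nodes: (n, 2) array of coordinates; elements: iterable of (i, j) node pairs;
    forces: (2n,) global load vector; fixed_dofs: indices of constrained DOFs.
    Returns the global displacement vector (zeros at the fixed DOFs).
    """
    n_dof = 2 * len(nodes)
    K = np.zeros((n_dof, n_dof))
    for i, j in elements:
        d = nodes[j] - nodes[i]
        L = np.linalg.norm(d)
        c, s = d / L                      # direction cosines of the member
        b = np.array([-c, -s, c, s])
        dofs = [2 * i, 2 * i + 1, 2 * j, 2 * j + 1]
        K[np.ix_(dofs, dofs)] += (E * A / L) * np.outer(b, b)  # element stiffness
    free = np.setdiff1d(np.arange(n_dof), fixed_dofs)
    u = np.zeros(n_dof)
    u[free] = np.linalg.solve(K[np.ix_(free, free)], forces[free])
    return u

# Design objective: maximum absolute displacement, to be minimized
# u = solve_truss(nodes, elements, forces, fixed_dofs); obj = np.abs(u).max()
```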
2.2 Markov Decision Process Framework for Sequential Decision Problems.
In a decision-making setting, an agent must choose from a set of possible actions, each potentially leading to uncertain effects on the state of the system. The decision-making process aims to maximize, at least on average, the numerical utilities assigned to each possible action outcome. This involves considering both the probabilities of various outcomes and our preferences among them.
In sequential decision problems, the agent’s utility is influenced by a sequence of decisions. MDPs provide a framework for describing these problems in fully observable, stochastic environments with Markov transition models and additive rewards [58]. Formally, an MDP is a four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, comprising a space $\mathcal{S}$ of states that the system can assume, a space $\mathcal{A}$ of actions that can be taken, a Markov transition model $\mathcal{P}$, and a space of rewards $\mathcal{R}$. The characterization of these quantities for truss optimization purposes is detailed below, after discussing their roles in MDPs.
We consider a time discretization of a planning horizon using nondimensional time-steps $t = 0, 1, \ldots, T$, and we denote the system state at time $t$ as $s_t \in \mathcal{S}$, which is the realization of the random variable $S_t$, with $p(s_t)$ being the probability distribution encoding the relative likelihood that $S_t = s_t$. Moreover, we denote the control input at time $t$ as $a_t \in \mathcal{A}$. The transition model encodes the probability of reaching any state $s_{t+1}$ at time $t+1$, given the current state $s_t$ and an action $a_t$, i.e., $p(s_{t+1} \mid s_t, a_t)$. The reward $r_t = r(s_t, a_t, s_{t+1})$, with $r_t \in \mathcal{R}$, quantifies the value associated with each possible set $(s_t, a_t, s_{t+1})$.
We define a control policy $\pi: \mathcal{S} \to \mathcal{A}$ as the mapping from any system state to the space of actions. The goal is to find the optimal control policy $\pi^*$ that provides the optimal action $a_t^*$ for each possible state $s_t$. The optimal policy is learned by identifying the action that maximizes the expected utility over the planning horizon. The problem of finding the optimal control policy is inherently stochastic. Consequently, the associated objective function is additive and relies on expectations [59]. This is typically expressed as the total expected discounted reward over the planning horizon.
The sequential decision problem can be viewed from the perspective of an agent–environment interaction, as depicted in Fig. 1. In this view, the agent perceives the environment and aims to maximize the long-term accumulation of rewards by choosing an action $a_t$ that influences the environment at time $t$. The environment interacts with the agent by defining the evolution of the system state, and providing a reward $r_t$ for taking $a_t$ and moving to $s_{t+1}$.
One way to characterize an MDP is to consider the expected utility associated with a policy $\pi$ when starting in any state and following $\pi$ thereafter. To this aim, the state-value function $v_\pi(s)$ quantifies, for every state $s \in \mathcal{S}$, the total expected reward an agent can accumulate starting in $s$ and following policy $\pi$. In contrast, the action-value function $q_\pi(s, a)$ reflects the expected accumulated reward starting from $s$, taking action $a$, and then following policy $\pi$. In both cases, the probability of reaching any state is estimated using the transition probabilities $p(s_{t+1} \mid s_t, a_t)$.
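For reference, these functions take the standard textbook form [34,58]; the sketch below assumes a discount factor $\gamma \in (0, 1]$:

$$
v_\pi(s) = \mathbb{E}_\pi\!\left[\,\sum_{k=0}^{T-t}\gamma^{k}\, r_{t+k}\;\middle|\; s_t = s\right],
\qquad
q_\pi(s,a) = \mathbb{E}_\pi\!\left[\,\sum_{k=0}^{T-t}\gamma^{k}\, r_{t+k}\;\middle|\; s_t = s,\; a_t = a\right]
$$

so that the optimal policy can be recovered greedily as $\pi^*(s) = \arg\max_{a} q_{\pi^*}(s, a)$.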
For our purposes of optimal truss design, we refer to a grid-world environment with a predefined number of nodes on which possible truss layouts can be defined. The reward function shaping could account for local design objectives, such as the displacement at a prescribed node, or global performance indicators, such as the maximum absolute displacement, stress level, or strain energy. As previously commented, we monitor the maximum absolute displacement experienced by the structure. The state space could potentially include any feasible truss layout resulting from progressive construction processes. Accordingly, the space of actions could account for any possible modification of a given layout. In this scenario, the sizes of $\mathcal{S}$ and $\mathcal{A}$ increase significantly, even considering a reasonably small design domain. For this reason, explicitly modeling the Markov transition model $\mathcal{P}$ is not feasible.
The availability of a transition model for an MDP influences the selection of appropriate solution algorithms. Dynamic programming algorithms, for instance, require explicit transition probabilities. In situations where representing the transition model becomes challenging, a simulator is often employed to implicitly model the MDP dynamics. This is typical in episodic RL, where an environment simulator is queried with control actions to sample environment trajectories of the underlying transition model. Examples of such algorithms include Q-learning, as seen in Refs. [31,42], and MCTS, both of which approximate the action-value function and use this estimate as a proxy for the optimal control policy. As noted in Ref. [44], convergence to the global optimal value function can only be guaranteed asymptotically in these cases. In our truss design problem, optimal planning is achieved via simulated experience provided by the FE model in Eq. (4b), which can be queried to produce a sample transition given a state and an action.
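As an illustration of how such a simulator can be wrapped for episodic RL, the following sketch exposes the reset/step interface typical of this setting, with a sparse reward computed only at the terminal state. The class and the injected callables are ours, standing in for the article's actual routines:

```python
class TrussEnv:
    """Episodic environment wrapping the FE simulator (illustrative sketch).

    The three injected callables are assumptions:
      apply_action(state, action) -> next grammar-legal state (deterministic)
      volume(state)               -> current volume of the truss members
      max_abs_disp(state)         -> FE solve returning max |displacement|
    """

    def __init__(self, seed_state, apply_action, volume, max_abs_disp, volume_max):
        self.seed = seed_state
        self.apply_action = apply_action
        self.volume = volume
        self.max_abs_disp = max_abs_disp
        self.volume_max = volume_max      # terminal condition on member volume

    def reset(self):
        # every episode restarts from the seed configuration s0
        self.state = self.seed
        return self.state

    def step(self, action):
        next_state = self.apply_action(self.state, action)
        done = self.volume(next_state) >= self.volume_max
        # sparse reward: the design objective is evaluated only at s_T
        reward = -self.max_abs_disp(next_state) if done else 0.0
        self.state = next_state
        return next_state, reward, done
```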
2.3 Grammar Rules for Truss Design Synthesis.
To introduce the grammar rules that we employ to guide the process of optimal design synthesis, we refer to a starting seed configuration $s_0$, defined by deploying a few bars to create a statically determinate truss structure. This initial configuration must be modified through a series of actions selected by an agent. Every time an allowed action is enacted on the current state $s_t$, a new configuration $s_{t+1}$ is generated (see Fig. 2). The process continues until reaching a state $s_T$, characterized by a terminal condition, such as achieving the maximum allowed volume of the truss members.
To identify the allowed actions, we use the same grammar rules as those used in Refs. [31,41,42]. Starting from an isostatic seed configuration, these rules constrain the space of design configurations by allowing only truss elements resulting in triangular forms to be added to the current configuration, thereby ensuring statically determinate configurations. Given any current configuration $s_t$, an allowed action is characterized by a sequence of three operations:
Choosing a node among those not yet reached by the already placed truss elements. We term these nodes inactive, to distinguish them from the previously selected active nodes.
Selecting a truss element already in place.
Applying a legal operator based on the position of the chosen node with respect to the selected element (see also Fig. 2). The legal operators are of two types: the first adds the new node and links it to the current configuration without removals, while the second also removes the selected element before connecting the new node. In both cases, the connections to the new node are generated ensuring no intersection with existing elements. A sketch of the resulting action space follows this list.
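A compact way to picture the action space induced by these rules is as the set of legal (node, element, operator) triples. The sketch below is illustrative: the `is_legal` predicate stands in for the geometric checks described above, and the operator names are generic labels for the two legal operators:

```python
from itertools import product

def allowed_actions(inactive_nodes, elements, is_legal):
    """Enumerate grammar-legal actions as (node, element, operator) triples.

    is_legal(node, element, op) is an assumed predicate wrapping the
    geometric checks (e.g., no intersection with existing members).
    """
    return [(n, e, op)
            for n, e in product(inactive_nodes, elements)
            for op in ("add", "replace")   # generic names for the two operators
            if is_legal(n, e, op)]
```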
Exemplary actions following the two legal operators. The current configuration $s_t$ (top) is modified either through an action following the first operator (bottom left) or through an action following the second operator (bottom right), resulting in a new configuration $s_{t+1}$. In both cases, the same truss element is selected and the same inactive node is chosen.
2.4 Monte Carlo Tree Search.
The MCTS algorithm is a decision-time planning RL method [34]. It relies on two fundamental principles: (i) approximating action-values through random sampling of simulated environment trajectories and (ii) using these estimates to inform the exploration of the search space, progressively refining the search toward highly rewarding trajectories.
In the context of optimal truss design formulated as a sequential decision-making problem, MCTS incrementally grows a search tree where each node represents a specific design configuration and edges correspond to potential state transitions triggered by allowed actions (see Fig. 3). During training, the algorithm explores the search space of feasible truss designs to progressively learn a control policy, referred to as the tree policy. This progressive policy improvement is based on value estimates of state-action pairs derived from previous runs of the algorithm, termed episodes. Each episode consists of four main phases [34]:
Selection: Starting from the root node associated with the seed configuration, the algorithm traverses the tree by selecting child nodes according to the tree policy until reaching a leaf node. The tree policy typically uses the UCT formula [44] to select child nodes. This formula ensures that actions leading to promising nodes are more likely to be chosen while still allowing for the exploration of less-visited nodes.
Expansion: If the selected leaf node corresponds to a nonterminal state $s_t$, the algorithm expands the tree by adding one or more child nodes representing unexplored actions from $s_t$. This expansion phase introduces new potential design configurations into the search tree, broadening the scope of exploration.
Simulation (rollout): From one of the newly added nodes, the algorithm performs a path simulation or “rollout” to estimate the value gained by passing through that node. Since the tree policy does not yet cover the newly added nodes, MCTS employs a rollout policy during this simulation phase to pick actions until reaching a terminal state $s_T$. The rollout policy is a random policy satisfying the truss design grammar rules, directing actions along unexplored paths so that the associated reward signal can be backpropagated up the decision tree. While the tree policy expands the tree via selection and expansion, the rollout policy simulates environment interaction based on random exploration.
Backpropagation: Upon reaching a terminal state $s_T$, the associated design is synthesized to evaluate the design objective. The resulting reward is then backpropagated through the nodes traversed during selection and expansion. This process involves updating the visit counts of the nodes and the values for the corresponding state-action pairs, both of which influence the decision-making process through the UCT formula, as detailed in the following section.
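The four phases map directly onto a compact implementation. The following is a generic single-player MCTS sketch in Python, not the authors’ code; the environment interface (`is_terminal`, `allowed_actions`, `apply`, `reward`) is assumed, as in the earlier sketches:

```python
import math
import random

class Node:
    """Search-tree node holding MCTS statistics for one design state."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.total_reward = 0.0

def uct_score(node, c):
    """Canonical UCT score: average reward plus exploration bonus."""
    if node.visits == 0:
        return float("inf")       # force at least one visit per child
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts_episode(root, env, c):
    """One episode of selection, expansion, rollout, and backpropagation."""
    node = root
    # selection: descend via UCT while the node has already been expanded
    while node.children and not env.is_terminal(node.state):
        node = max(node.children.values(), key=lambda n: uct_score(n, c))
    # expansion: add children for all grammar-allowed actions
    if not env.is_terminal(node.state):
        for action in env.allowed_actions(node.state):
            node.children[action] = Node(env.apply(node.state, action), parent=node)
        node = random.choice(list(node.children.values()))
    # simulation (rollout): random grammar-legal actions down to a terminal state
    state = node.state
    while not env.is_terminal(state):
        state = env.apply(state, random.choice(env.allowed_actions(state)))
    reward = env.reward(state)    # e.g., minus the maximum absolute displacement
    # backpropagation: update statistics along the path back to the root
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```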
Exemplary use of grammar rules for the optimal truss design, formalized as a Markov decision process and solved through Monte Carlo tree search. The search tree construction and the corresponding truss design synthesis are achieved by repeating the four steps of selection, expansion, simulation, and backpropagation.
The advantages of MCTS stem from its online, incremental, sample-based value estimation and policy improvement. MCTS is particularly adept at managing environments where rewards are not immediate, as it effectively explores broad search spaces despite the minimal feedback. This makes MCTS especially suitable for progressive construction settings, where the final design requirements often differ from those of intermediate structural states. Intermediate construction stages typically involve sustaining self-load only, while different combinations of dead and live loads are experienced during operations. This capability stems from the backpropagation step, which allows information related to the terminal state $s_T$ to be transferred to the early nodes of the tree. In contrast, bootstrapping methods like Q-learning may require a longer training phase to equivalently backpropagate information, as we demonstrate in Sec. 3. Further advantages of MCTS include: (i) accumulating experience by sampling environment trajectories, without requiring domain-specific knowledge to be effective; (ii) incrementally growing a lookup table to store a partial action-value function for the state-action pairs yielding highly rewarding trajectories, without needing to approximate a global action-value function; (iii) updating the search tree in real time whenever the outcome of a simulation becomes available, in contrast, e.g., with minimax’s iterative deepening; and (iv) focusing on promising paths thanks to the selective process, leading to an asymmetric tree that prioritizes more valuable decisions. This last aspect not only enhances the algorithm’s efficiency but can also offer insights into the domain itself by analyzing the tree’s structure for patterns of successful courses of action.
2.5 Upper Confidence Bounds for Trees.
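The canonical UCT rule [44] scores each child node by the sum of an exploitation term (the average reward accumulated through that child) and an exploration bonus; the formula in Eq. (5) is a variant tuned by the heuristic parameter $c$. As a reference, a sketch of the standard form reads:

$$
a^* = \arg\max_{a}\left[\,\bar{Q}(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}}\,\right]
$$

where $\bar{Q}(s,a)$ is the average reward of the child reached by taking action $a$ in state $s$, $N(s,a)$ is its visit count, and $N(s)$ is the visit count of the parent node. The first term favors children that have paid off so far, while the second grows for rarely visited children, driving exploration; the parameter $c$ weights the two contributions.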
2.6 Algorithmic Description.
The algorithmic description of the optimal truss design strategy using the proposed MCTS approach is detailed in Algorithm 1. It begins by initializing the root node with a seed configuration and then iteratively explores potential truss configurations through a sequence of selection, expansion, simulation, and backpropagation phases. In each episode, the algorithm selects a child node based on the UCT formula, generates and evaluates a new child node from a possible action, simulates random descendant nodes to explore the design space, and backpropagates the computed reward to update the policy.
Monte Carlo tree search for optimal truss design
input: number of episodes
       parametrization of the physics-based model
       grid design domain
       seed configuration $s_0$
       grammar rules for truss design synthesis
       exploration parameter $c$
1: initialize root node for the seed configuration $s_0$
2: for each episode do
3:     $t \leftarrow 0$
4:     set $s_t$ to the root node (seed configuration)
⊳ selection
5:     while $s_t \neq s_T$ and $s_t$ previously explored do
6:         select child $s_{t+1}$ via UCT formula
7:         $t \leftarrow t + 1$
⊳ expansion
8:     if $s_t \neq s_T$ and $s_t$ not previously explored then
9:         for all states $s_{t+1}$ reachable through allowed actions do
10:            solve static equilibrium for $s_{t+1}$
11:            compute design objective
12:        move to one of the new children and set $t \leftarrow t + 1$
⊳ simulation
13:    while $s_t \neq s_T$ do
14:        select a random child $s_{t+1}$
15:        if $s_{t+1}$ not previously explored then
16:            solve static equilibrium for $s_{t+1}$
17:            compute design objective
18:        $t \leftarrow t + 1$
⊳ backpropagation
19:    compute reward $r$ from terminal state $s_T$
20:    while $t \geq 0$ do
21:        append $r$ to the rewards list of the node for $s_t$
22:        increment the visit count of the node for $s_t$
23:        $t \leftarrow t - 1$
24: return deterministic control policy
3 Results
In this section, we assess the proposed MCTS framework on different truss optimization problems. First, we adopt six case studies from Refs. [31,42], each featuring different domain and boundary conditions, to directly compare the achieved performance. Then, we consider two additional case studies to demonstrate the applicability of our procedure for progressive construction purposes. While in the former case studies the seed configuration fully covers the available design domain, in the latter we allow the seed configuration to grow—mimicking an additive construction process—until reaching a terminal node at the far end of the domain.
The experiments have been implemented in Python using the Spyder development environment. All computations have been carried out on a PC featuring an AMD Ryzen™ 9 5950X CPU @ 3.4 GHz and 128 GB RAM.
3.1 Truss Optimization.
In the following, we present the results achieved for the six case studies adapted from Refs. [31,42], providing comparative insights for each scenario. All case studies deal with planar trusses, with truss elements featuring dimensionless Young’s modulus and cross-sectional area; the applied forces are likewise assigned dimensionless values, as per Refs. [31,42]. The monitored displacement refers to the maximum absolute displacement experienced by the structure.
Each row of Table 1 describes a case study in terms of design domain size, number of decision times (planning horizon), and volume threshold. These parameters have been set according to Refs. [31,42] to facilitate the comparison between the proposed MCTS procedure and the Q-learning methods. For each case study, the design domain, structural seed configuration, externally applied force(s), and boundary conditions are shown under the $s_0$ label in Fig. 4. The target optimal configuration, identified through a brute-force exhaustive search of the state space, is illustrated under the $s_T$ label. Case study 4 is the only one that differs from the reference, owing to an additional boundary constraint.
Truss optimization—case studies adapted from Ref. [42]: summary of design domain, seed configuration $s_0$, and target optimal design $s_T$ identified through a brute-force exhaustive search
Truss optimization—problem setting description
| | Domain size | Decision times | Volume threshold |
|---|---|---|---|
| Case 1 | 4 × 3 | 2 | 160 |
| Case 2 | 5 × 3 | 3 | 240 |
| Case 3 | 5 × 5 | 3 | 225 |
| Case 4 | 5 × 9 | 3 | 305 |
| Case 5 | 5 × 5 | 4 | 480 |
| Case 6 | 7 × 7 | 4 | 350 |
For each case study, Fig. 5 shows the evolution of the design objective, i.e., the maximum absolute displacement experienced by the structure, as the number of training episodes increases. Results are reported in terms of average displacement (solid line) and one-standard-deviation credibility interval (shaded area), over ten independent training runs. Each run utilizes MCTS for a predefined number of episodes. In practice, the number of episodes is set after an initial long training run in which we assess the number of episodes required to achieve convergence—which typically depends on the complexity of the case study. After each training run, the best configuration is saved to subsequently compute relevant statistics. The attained displacement values are compared with those associated with the global minima (dashed lines), representing the optimal design configurations in Fig. 4.

Truss optimization—case studies 1–6: evolution of the design objective during training, shown as the average value (solid line) with its one-standard-deviation credibility interval (shaded area) and target global minimum (dashed line). Results averaged over ten training runs.
The heuristic parameter $c$ in Eq. (5) controls the balance between exploitation and exploration. The values employed for the six case studies are overlaid on each learning curve in Fig. 5. Since an optimal value for this parameter is not known a priori, it is set using a rule of thumb derived through a parametric analysis, as explained in the following section for case study 4.
A quantitative assessment of the optimization performance for each case study is summarized in Table 2. Results are reported in terms of the optimal design objective, the percentage ratio of the optimal design objective to the displacement achieved by the learned policy, and the percentile score relative to the exhaustive search space. To clarify, a percentile score of 100% corresponds to reaching the global optimum. A lower score, such as 99.90%, indicates that the design objective achieved with the final design $s_T$, synthesized from the learned optimal policy, is lower than the displacement associated with 99.90% of all the possible configurations explored through an exhaustive search. An exemplary distribution of the design objective across the population of designs synthesized from the exhaustive search of the state space is shown in Fig. 6 for case study 4. Interestingly, the distributions obtained for the other case studies also exhibit a lognormal-like shape, although these are not shown here due to space constraints. While the objective ratio provides a dimensionless measure of how close the achieved design is to the global optimum in terms of performance, the percentile score quantifies the capability of MCTS to navigate the search space and find a design solution close to the optimal one. Both performance indicators are computed by averaging over ten training runs. Additionally, we report the number of FE evaluations required to achieve a near-optimal or optimal policy, also averaged over ten training runs, and indicate the percentage savings in the number of FE evaluations compared to those required by the deep Q-learning strategy from Ref. [42]. It is worth noting that FE evaluations are only performed for the terminal state $s_T$, after it has been selected.
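One plausible implementation of the percentile score is sketched below; counting ties as “no better” makes the global optimum score exactly 100%, consistent with the convention above:

```python
import numpy as np

def percentile_score(achieved, exhaustive_objectives):
    """Share of exhaustively enumerated designs whose objective (to be
    minimized) is no better than the achieved one, in percent."""
    objs = np.asarray(exhaustive_objectives)
    return 100.0 * np.count_nonzero(objs >= achieved) / objs.size
```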

Truss optimization—case study 4: design objective distribution over the population of designs synthesized from an exhaustive search of the state space
Truss optimization—case studies 1–6: optimal design objective, percentage ratio of the optimal design objective to the displacement achieved by the learned policy, percentile score relative to the exhaustive search space, number of finite element evaluations required to achieve a near-optimal or optimal policy, and relative speed-up compared to Ref. [42]. The speed-up is not reported for case study 4, as it differs from the reference owing to an additional boundary constraint. Results averaged over ten training runs.
| | Optimal objective | Objective ratio (%) | Percentile score (%) | FE runs | FE runs versus Ref. [42] (%) |
|---|---|---|---|---|---|
| Case 1 | 0.0895 | 100 | 100 | 106 | −74.70 |
| Case 2 | 0.1895 | 100 | 100 | 517 | −76.27 |
| Case 3 | 0.0361 | 100 | 100 | 966 | −56.51 |
| Case 4 | 0.5916 | 91.91 | 99.90 | 1672 | N/A |
| Case 5 | 0.0390 | 95.23 | 99.99 | 9739 | −70.74 |
| Case 6 | 0.0420 | 90.44 | 99.98 | 7931 | −31.25 |
3.2 Case Study 4—Detailed Analysis.
In this section, we provide a detailed analysis of case study 4. We selected this case study because it is the only one in which we employ boundary conditions different from those reported in Ref. [42], which is useful for checking the MCTS capability to exploit constrained domain portions not included in the seed configuration. Figure 7 illustrates the sequence of structural configurations synthesized from the optimal policy obtained through an exhaustive search. For each decision time, we report the value of the design objective and the volume of the truss lattice below each synthesized configuration.

Truss optimization—case study 4: sequence of design configurations from the target optimal policy, identified through a brute-force exhaustive search, with details about the design objective value and the truss lattice volume
Figure 8 summarizes the impact of varying the parameter $c$ on the attained percentile score to provide insights into the selection of an appropriate value. Specifically, Fig. 8(a) shows the percentile score relative to the exhaustive search space for different values of $c$, averaged over ten training runs. Figure 8(b) illustrates how the percentile score evolves as the number of episodes increases, offering insights into the effect of $c$ on the convergence of MCTS. To compare the achieved performance for varying $c$ with the associated computational burden, Fig. 9 presents the number of FE evaluations required to achieve a near-optimal design policy, revealing an almost linear increase in the number of FE evaluations as $c$ grows. Therefore, we consider the intermediate value adopted here to provide an appropriate balance between exploitation and exploration, yielding an average percentile score of 99.90% across ten training runs, close to the scores attained with the two largest tested values of $c$, but with only 1672 FE evaluations. The achieved ratio of the optimal design objective to the displacement achieved by the learned policy is 91.91% (see Table 2). Similar results from the parametric analysis of $c$ for the other case studies are provided in Appendix A.
Truss optimization—case study 4: impact of varying the parameter $c$ on the attained percentile score relative to the exhaustive search space. For each value of $c$, results are reported in terms of the average percentile score with its one-standard-deviation credibility interval and the evolution of the percentile score during training, shown as the average value with its credibility interval. Results averaged over ten training runs.

Truss optimization—case study 4: number of finite element evaluations required to achieve a near-optimal design policy for varying values of the parameter $c$
3.3 Progressive Construction.
In this section, we showcase the potential of the proposed MCTS strategy in guiding the progressive construction of a truss cantilever beam and a bridge-like structure. Unlike in the previous case studies, where a simplified seed configuration was initially assigned to comply with the target boundary conditions and then refined, here we allow the seed configuration to progressively grow until reaching a prescribed terminal node not included in the initial configuration. Therefore, the agent must account for the intermediate construction stages per se, not just as necessary steps to reach the final configuration. Another difference compared to the previous case studies is that instead of considering a fixed loading configuration, the structure is subjected to self-weight (unit dimensionless density), modifying the loading configuration at each stage. However, as in the previous cases, since the design process aims to maximize the performance of the final configuration, the chosen design objective is again the maximum absolute displacement. Although we did not set a limit on the maximum number of states, the agent must strike a balance between achieving higher structural stiffness by adding additional members and the weight these extra elements bring.
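A common way to include self-weight in a truss model, sketched here under the article’s unit-density assumption (the lumping scheme is a standard modeling choice, not necessarily the authors’), is to lump half of each member’s weight at each of its end nodes:

```python
import numpy as np

def self_weight_loads(nodes, elements, rho=1.0, A=1.0, g=1.0):
    """Lumped self-weight loads: half of each member's weight is applied
    to each end node along the negative vertical direction."""
    f = np.zeros(2 * len(nodes))
    for i, j in elements:
        L = np.linalg.norm(nodes[j] - nodes[i])
        w = rho * A * L * g           # total member weight
        f[2 * i + 1] -= w / 2.0       # vertical DOF of node i
        f[2 * j + 1] -= w / 2.0       # vertical DOF of node j
    return f
```

Because the load vector is rebuilt at every stage, adding a member stiffens the structure but also increases the load it must carry, which is exactly the trade-off the agent faces.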
For the cantilever case study, we assign a rectangular grid design domain, while for the bridge-like case study, we consider a larger domain featuring a central passive area where no truss members can be placed. The sequence of optimal design configurations is shown in Fig. 10 for the cantilever beam and in Fig. 12 for the bridge-like structure. These optimal sequences have been synthesized from an exhaustive search, halted due to computational constraints after scanning a large number of candidate configurations for each structure. The maximum length of the individual elements has been constrained to comply with the typical fabrication, transportation, and on-site assembly limitations encountered in construction projects. This realistic constraint compels the algorithm to explore more detailed designs, avoiding trivial configurations that rely on only a few long elements to reach the target node.

Progressive construction—cantilever case study: sequence of design configurations from the target optimal policy
For the cantilever case study, MCTS synthesizes the optimal configuration in the majority of the ten training runs, requiring an average of 507 FE evaluations per training run to reach the target optimal displacement. It is worth noting how the algorithm identifies the optimal configuration by focusing on the most promising solutions, which feature more elements near the clamped side rather than near the free end (see Fig. 10). Refer to Fig. 11 for the evolution of the attained design objective as the number of episodes increases during training.

Progressive construction—cantilever case study: evolution of the design objective during training, shown as the average value (solid line) with its one-standard-deviation credibility interval (shaded area), and target global minimum (dashed line). Results averaged over ten training runs.
Similarly, for the bridge-like case study, the MCTS policy synthesizes the optimal configuration in the majority of the ten training runs. The evolution of states from the MCTS policy is identical to that of the target optimal policy, as shown in Fig. 12. The algorithm requires an average of 901 FE evaluations per training run to reach the target optimal displacement. Finally, Fig. 13 presents the corresponding evolution of the attained design objective during training.

Progressive construction—bridge-like case study: sequence of design configurations from the target optimal policy

Progressive construction—bridge-like case study: evolution of the design objective during training, shown as the average value (solid line) with its one-standard-deviation credibility interval (shaded area), and target global minimum (dashed line). Results averaged over ten training runs.
These case studies highlight the advantages of MCTS over Q-learning approaches for optimal design synthesis in large state spaces. In such cases, Q-learning struggles because it requires sufficient sampling of each state-action pair to build a Q-table that stores values for every possible pair, leading to exponential growth in memory and computational demands as the number of states increases. In contrast, MCTS dynamically builds a decision tree based on the most promising moves explored through simulation, focusing computational resources on more relevant parts of the search space. This selective exploration allows MCTS to handle large state spaces more efficiently than Q-learning, making it better suited to problems where direct enumeration of all state-action pairs is infeasible. Appendix B provides an overview of the computational burden associated with a pilot MCTS training of 1000 episodes for the bridge-like case study.
3.4 Discussion.
In the case studies used for comparison with Ref. [42], the proposed MCTS framework has been capable of synthesizing a near-optimal solution with significantly fewer FE evaluations. Case study 2 has shown the greatest reduction, requiring 76.27% fewer evaluations (see Table 2). All examples have achieved a percentile score of at least 99.90% when compared to the candidate solutions from the exhaustive search. In the first three case studies, the global optimal solution has been synthesized in every training run. However, this was not the case for the last three.
The heuristic parameter $c$ balances the exploitation and exploration terms in the UCT formula. From Fig. 8(b), we observe that for the two smallest tested values of $c$, MCTS converges early on local minima. This occurs because the first term in the UCT formula dominates the second term, preventing sufficient exploration of child nodes in the tree. While increasing the value of $c$ mitigates this issue, the number of FE evaluations is observed to rise linearly with $c$, as shown in Fig. 9. However, the greater computational burden required by the two largest tested values results only in a limited improvement in the percentile score compared to the intermediate value, as shown in Fig. 8. Thus, the intermediate value of $c$ has been chosen to balance a high percentile score with a low number of FE evaluations.
The cantilever and bridge-like case studies discussed in Sec. 3.3, within the context of progressive construction, demonstrate the potential of the proposed MCTS framework to synthesize optimal designs in problems with very large state spaces. The (partial) exhaustive search has analyzed a vast number of possible design configurations for both the cantilever beam and the bridge-like structure. Despite this, the optimal solution has been synthesized repeatedly, requiring only 507 FE evaluations per training run for the cantilever beam and 901 for the bridge-like structure. In contrast, case study 4, which features a much smaller state space, has required significantly more FE evaluations without achieving the global optimum. This discrepancy is partly due to the fact that, although there are more layers in the trees of the progressive construction case studies, each layer is much narrower, allowing the algorithm to more easily identify promising branches. For instance, in the cantilever case, the first layer of the decision tree has 15 times fewer nodes compared to case study 4, resulting in much lower complexity.
One strength of MCTS over Q-learning [31] and deep Q-learning [42] approaches to optimal design synthesis is its ability to backpropagate reward signals to ancestor nodes in the tree more effectively. The difficulty deep Q-learning faces in performing the backpropagation step has also been mentioned in Ref. [42]. Another limitation of Q-learning stems from the sparsity of the reward signal: Q-learning updates Q-values based on the difference between future and current reward estimates, adjusting only the values for the state-action pairs experienced at each step. These incremental updates become informative only after information from the final stage has been propagated back to the intermediate design configurations. These drawbacks are not present in the MCTS framework because the reward is computed at the end of every episode, as typical in Monte Carlo approaches, in contrast to temporal difference methods like Q-learning. Another key advantage of MCTS is that it builds a tree incrementally and selectively, exploring parts of the state space that are more promising based on previous episodes. This selective expansion is particularly advantageous in environments with extremely large or infinite state spaces, where attempting to maintain a value for every state-action pair (as in Q-learning) becomes infeasible. While we may not synthesize the absolute global optimum solution every time, we are able to achieve very high percentile scores. Most importantly, this framework can scale to large state spaces without significantly compromising the relative quality of the solutions.
4 Conclusion
This study has presented a comprehensive analysis of combining MCTS and generative grammar rules to optimize the design of planar truss lattices. The proposed framework has been tested across various case studies, demonstrating its capability to efficiently synthesize near-optimal (if not optimal) configurations even in large state spaces, with minimal computational burden. Specifically, we have compared MCTS with a recently proposed approach based on deep Q-learning, achieving significant reductions in the number of required finite element evaluations, ranging from 31.25% to 76.27% across different case studies. Moreover, two novel case studies have been used to highlight the adaptability of MCTS to dynamic and large state spaces typical of progressive construction scenarios. A critical analysis has been carried out to explain why the method has been able to get close to the global optimum without reaching it in some cases. Specifically, we have noted that this difficulty is not solely connected to the size of the state space but is due to the width of the tree and the adopted UCT formula. This formula encourages the exploration of tree sections with the best average reward, potentially neglecting sections with lower average rewards that may contain the global optimum.
Compared to Q-learning, the proposed MCTS-based strategy has demonstrated two key advantages: (i) an improved capability of backpropagating reward signals and (ii) the ability to selectively expand the decision tree toward more promising paths, thereby addressing large state spaces more efficiently and effectively.
The obtained results underscore the potential of MCTS not only in achieving high-percentile solutions but also in its scalability to large state spaces without compromising solution quality. As such, this framework is poised to be a robust tool in the field of structural optimization and beyond, where complex decision-making and extensive state explorations are required. In the future, modifications to the UCT formula will be explored to address the occasional challenges in reaching the global optimum. Moreover, we foresee the possibility of exploiting this approach in progressive construction, extending beyond the domain of planar truss lattices.
Acknowledgment
The authors of this article would like to thank Eng. Syed Yusuf and Professor Matteo Bruggi (Politecnico di Milano) for the invaluable insights and contributions during our discussions.
Funding Data
This work is partly supported by ERC advanced grant IMMENSE—101140720. (Funded by the European Union. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.)
Matteo Torzoni acknowledges the financial support from Politecnico di Milano through the interdisciplinary PhD grant “Physics-Informed Deep Learning for Structural Health Monitoring.”
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Appendix A: Parametric Analysis of the Value
The modified UCT formula in Eq. (5) reduces to the standard form in Eq. (A1) for a particular choice of the exploration parameter $c$. This reference choice is based on theoretical bounds that optimize the exploitation–exploration balance for the general case of the multi-armed bandit problem, where rewards are normalized between 0 and 1 [61]. While this serves as a good starting point, the appropriate value of $c$ is challenging to ascertain a priori, as it is case-dependent, as demonstrated in Fig. 14. A small choice of $c$ can heavily prioritize exploitation, as shown in Fig. 14 for case studies 3–6 with the smallest tested values, whose learning curves plateau well below the global optimum. This is also summarized in Table 3, where we provide the percentile scores and the number of required FE evaluations at varying $c$ for the six case studies. More FE runs mean more unique terminal states and more branches explored in the tree.
Truss optimization—case studies 1–6: impact of varying the parameter $c$ on the attained percentile score relative to the exhaustive search space. For each value of $c$, results are reported in terms of the evolution of the percentile score during training, shown as the average value with its one-standard-deviation credibility interval. Results averaged over ten training runs.
Truss optimization—results for case studies 1–6: impact of varying the parameter $c$ on the attained percentile score relative to the exhaustive search space, and on the number of finite element evaluations required to achieve a near-optimal or optimal policy. Columns $c_1$–$c_5$ correspond to the tested values of $c$ in increasing order. Results averaged over ten training runs.
| Case study | Metric | $c_1$ | $c_2$ | $c_3$ | $c_4$ | $c_5$ |
|---|---|---|---|---|---|---|
| Case study 1 | Percentile | 100% | 100% | 100% | 100% | 100% |
| | FE runs | 63 | 128 | 106 | 279 | 279 |
| Case study 2 | Percentile | 100% | 100% | 100% | 100% | 100% |
| | FE runs | 173 | 545 | 517 | 978 | 988 |
| Case study 3 | Percentile | 99.75% | 99.90% | 100% | 100% | 100% |
| | FE runs | 268 | 623 | 961 | 966 | 966 |
| Case study 4 | Percentile | 99.19% | 99.59% | 99.90% | 99.91% | 99.94% |
| | FE runs | 217 | 710 | 1672 | 2800 | 3433 |
| Case study 5 | Percentile | 99.77% | 99.95% | 99.99% | 99.99% | 99.99% |
| | FE runs | 586 | 2699 | 7235 | 9717 | 9739 |
| Case study 6 | Percentile | 99.83% | 99.96% | 99.98% | 99.98% | 99.99% |
| | FE runs | 770 | 3270 | 7931 | 9204 | 9357 |
Appendix B: Computational Cost Analysis
In this appendix, we provide an overview of the computational burden associated with a pilot MCTS training of 1000 episodes for the bridge-like case study. The timing analysis is summarized in Table 4, reporting the computational time taken by each MCTS phase. The simulation phase dominates the computational time, accounting for 82.97% of the total execution time, followed by the expansion phase at 16.89%. In contrast, the selection and backpropagation phases require significantly less time, contributing 0.09% and 0.04%, respectively. The remaining operations, collectively termed “Other,” take up a minimal 0.01% of the total execution time.
Progressive construction—bridge-like case study: timing breakdown of the MCTS phases during a pilot training session of 1000 episodes
| MCTS phase | Time (s) | Percentage |
|---|---|---|
| Selection | 0.91 | 0.09% |
| Expansion | 163.08 | 16.89% |
| Simulation | 801.00 | 82.97% |
| Backpropagation | 0.38 | 0.04% |
| Other | 0.01 | 0.01% |
| Total elapsed | 965.39 | 100% |
We have identified that the most computationally demanding task is not the FE analysis itself, but rather the frequent execution of relatively simple geometric checking functions that determine whether a new configuration violates geometric constraints. Specifically, a function that checks whether a line passes over an active node has been called 14,559,543 times during execution. This function, invoked primarily during the child node population process, has been responsible for a cumulative execution time of 438.77 s, representing 45.45% of the total runtime. These calls occurred in both the expansion and simulation phases, significantly contributing to the overall computational load despite the function’s simplicity. Overall, the simulation phase must populate more layers of child nodes than the expansion phase, which explains its higher computational cost.
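A timing breakdown of this kind can be collected with Python’s standard profiler; the entry-point name below is illustrative, not the article’s code:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_training(n_episodes=1000)   # hypothetical MCTS training entry point
profiler.disable()

# Rank functions by cumulative time to expose hot spots such as geometric checks
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)           # top-20 functions by cumulative time
```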