## Abstract

Conceptual design evaluation is an indispensable component of innovation in the early stage of engineering design. Properly assessing the effectiveness of conceptual design requires a rigorous evaluation of the outputs. Traditional methods to evaluate conceptual designs are slow, expensive, and difficult to scale because they rely on human expert input. An alternative approach is to use computational methods to evaluate design concepts. However, most existing methods have limited utility because they are constrained to unimodal design representations (e.g., texts or sketches). To overcome these limitations, we propose an attention-enhanced multimodal learning (AEMML)-based machine learning (ML) model to predict five design metrics: drawing quality, uniqueness, elegance, usefulness, and creativity. The proposed model utilizes knowledge from large external datasets through transfer learning (TL), simultaneously processes text and sketch data from early-phase concepts, and effectively fuses the multimodal information through a mutual cross-attention mechanism. To study the efficacy of multimodal learning (MML) and attention-based information fusion, we compare (1) a baseline MML model and the unimodal models and (2) the attention-enhanced models with baseline models in terms of their explanatory power for the variability of the design metrics. The results show that MML improves the model explanatory power by 0.05–0.12 and the mutual cross-attention mechanism further increases the explanatory power of the approach by 0.05–0.09, leading to the highest explanatory power of 0.44 for drawing quality, 0.60 for uniqueness, 0.45 for elegance, 0.43 for usefulness, and 0.32 for creativity. Our findings highlight the benefit of using multimodal representations for design metric assessment.

## 1 Introduction

Conceptual design evaluation is a core component of the innovation process in engineering design [1,2]. Numerous design ideas are generated at the early design stages, which creates the need for effective and efficient conceptual design evaluation to facilitate informed decision making [3] and boost designers’ creative and innovative behaviors [4,5]. Creativity is an overarching metric for conceptual design evaluation. Novelty, usefulness, and content are usually considered comprehensively to evaluate the creativity of a single design [6,7]. When designers are the target of evaluation, quantity and diversity are also seen as a part of creativity [8]. In prior studies, creativity has been assessed through a consensual creativity definition relying on the independent judgment of groups of human experts [6,9,10] or by measuring the uncommonness of a design concept in terms of the key attributes, such as working or physical principles identified at various abstraction levels [7,8,11]. In this study, we focus on single design concept evaluation using five design metrics, including the overarching metric creativity and four other metrics, drawing quality, uniqueness, elegance, and usefulness. Many of the existing methods rely on intensive human inputs [6,8,12], which makes the methods inherently subjective, resource-demanding, unscalable, and subject to human fatigue [13].

Artificial intelligence (AI) has shown great potential to overcome these issues with manual evaluation. With its advances in computer vision (CV) and natural language processing (NLP), AI is becoming increasingly capable of comprehending information represented in different data modes. However, conceptual design evaluation using AI is challenging for two primary reasons. First, design ideas are often complex, involving heterogeneous representations, such as sketches, text, and three-dimensional (3D) models. Prior work has shown promise for leveraging AI to evaluate simple unimodal design ideas, such as sketches [14–16] and text [14,17]. The use of AI for evaluating multimodal design ideas remains largely unexplored. Second, compared to more straightforward tasks (e.g., classification and object recognition), design evaluation requires a profound and comprehensive understanding of object functions, behaviors, structures, and esthetics based on the information conveyed by multimodal representations. Training artificial neural networks (ANNs) for such a complex task requires a large volume of labeled data. Since manual evaluation requires design expertise and a large amount of time, most labeled design datasets are small in volume, posing an additional challenge in training ANNs for this purpose. Accordingly, we aim to address these two challenges, namely the lack of multimodal evaluation models and the scarcity of large labeled datasets, to enable AI-based conceptual design evaluation that is more objective, affordable, scalable, and reliable.

To this end, we propose an attention-enhanced multimodal learning (AEMML) model to predict the five design metrics for conceptual design evaluation in this paper. The model exploits multimodal learning (MML) [18] to learn and integrate features from two modalities for conceptual design evaluation. It is developed and validated based on a set of milk frother designs represented by sketches and text descriptions and evaluated by design experts in terms of the five design metrics. Since the training set is relatively small, the proposed model also utilizes transfer learning (TL) [19] to transfer knowledge from large generic datasets to the target dataset. The contributions of this paper are:

1. We develop a baseline MML model by simply joining the unimodal text and sketch models and show that MML improves the model explanatory power by 0.05–0.12 through utilizing the complementary features between the two modalities.

2. We develop an AEMML model using a mutual cross-attention mechanism and show that the proposed attention mechanism further enhances the performance of the baseline MML model by 0.05–0.09 through capturing more interactive information between the two modalities.

3. We compare the predictability of the design metrics and show that with the proposed AEMML model, uniqueness presents the highest predictability, while creativity exhibits the lowest predictability.

The remainder of this paper is organized as follows. Section 2 provides a detailed review of the relevant building blocks of the proposed model. The labeled milk frother design ideas for training the model, the associated data pre-processing modules, and the key components of the AEMML model are introduced in Sec. 3. Section 4 reports and discusses the performance of the AEMML model and summarizes the challenges and opportunities in AI-based conceptual design evaluation. Section 5 concludes this paper by highlighting the findings and contributions of this paper.

## 2 Background

In this section, we provide a background of conceptual design evaluation in the design domain. Since design concepts are most commonly available as free-hand sketches and text descriptions, we review the relevant sketch and text learning techniques. As the main building blocks of the proposed model, the TL techniques for tackling the small dataset issue and the MML models for handling multimodal data are also surveyed.

### 2.1 Conceptual Design Evaluation.

In general, the effectiveness of conceptual design can be evaluated in terms of the concept generation process or outcome [8]. We focus on outcome-based evaluation in this paper. Creativity assessment is an overarching element of conceptual design evaluation. In design literature, consensual assessment technique (CAT) is a commonly used approach to creativity assessment, which relies on a consensual creativity definition and the independent judgment of groups of human experts [6,9,10]. The common CAT metrics considered by experts for creativity evaluation include appropriateness, usefulness, neatness, elaboration, elegance, novelty (or uniqueness), etc. Experts do not need to justify their assessment [20,21]. Although CAT has been implemented and validated to be accurate in various domains and contexts, it requires specific procedural practices and heavily relies on expertise-based human judgment, making it expensive and time-consuming [22]. Researchers have made some efforts to mitigate these limitations [14,23].

Another strand of research assesses creativity by comparing a design concept against a self-defined solution space containing the existing solutions in terms of the key attributes identified at various abstraction levels [7,8,11]. Following this approach, a variety of models have been developed by integrating different attributes at different levels, including the most commonly used Shah, Vargas-Hernandez, and Smith (SVS) model [8] and its variants [24–26], and the SAPPhIRE model developed by Sarkar and Chakrabarti [7,27] and its variants [28,29]. This approach relies on humans to identify and assess the relevant attributes using either qualitative or quantitative identifiers [30]. The drawbacks of this strategy reside in the difficulty of accurately identifying and assessing a set of generalizable attributes across varied design representations, which introduces considerable subjectivity.

In recent years, researchers have started to explore computational methods for design concept evaluation. Ahmed et al. [15,16] proposed to measure the novelty of a design concept as its average distance to all other concepts which is calculated using a design embedding generated based on triplet comparisons. Another computational model mimics the expert process for assessing textual solutions to design questions [14]. The model first examines the correctness of a solution using word embeddings and then assesses the uniqueness of the answer through a clustering algorithm. Then, both aspects are integrated to predict design novelty assessed by a group of experts [14]. Additionally, a set of computational tools utilize large design knowledge databases, such as ConceptNet [31] and TechNet [32,33], to assess the distance between different design concepts for novelty evaluation [34,35]. As reviewed above, the existing computational methods only apply to design concepts represented in single modes and mainly focus on novelty evaluation.

In this study, the design concepts in our dataset were evaluated following the CAT approach in terms of five design metrics, including the overarching metric, creativity, and four other metrics, drawing quality, uniqueness, elegance, and usefulness. This set of metrics covers the expression neatness, content, and effectiveness of each design concept for conceptual design evaluation. We use them to demonstrate and validate the proposed method.

### 2.2 Sketch and Text Learning.

In this study, the sketches are drawn on paper solution sheets, which can be stored and processed as static pixel-based spaces. In recent years, convolutional neural networks (CNNs) [36] have been dominating computer vision using pixel data (e.g., images and sketches). Accordingly, we focus on CNN-based models for sketch learning. While these models are primarily used for images, our focus is on free-hand sketches, which are fundamentally different from realistic photo images. Free-hand sketches have both unique challenges (e.g., highly sparse, abstract, and designer-dependent) and advantages (e.g., lack of background and use of iconic representation) [37]. Previous studies have employed both customized CNNs [38] and standardized CNNs (e.g., ResNet [39], VGG [39], and Inception [40]) for sketch classification and similarity search. Researchers also studied the differences in hyperparameters between CNNs for encoding images versus sketches [41]. Adapted CNN models have also been explored by incorporating an additional channel to learn shape [42] or contour [43] information of sketches to improve model performance. Additionally, when created using touchscreen devices, sketches can also be rendered as dynamic stroke coordinate spaces or graph spaces. This affords the use of different ANNs, such as recurrent neural networks [44] and graph neural networks [45]. While most prior work on sketches focuses on classifying sketches into categories [39,40], we study a more difficult learning task: predicting design metrics, which can be viewed as a regression problem.

To analyze text data, modern ML primarily relies on NLP methods that encode text as continuous vectors. Before the advent of transformer-based language models (TLMs) [46], word2vec [47], global vectors for word representation [48], and bidirectional long short-term memory [49] were common text embedding models. In recent years, TLMs have proven exceedingly effective in many benchmark NLP applications (e.g., translation and search) [46]. Their strengths can be seen from two perspectives: (1) the transformer encoder reads and learns an entire input sequence of words at once, instead of word by word in one direction, which enables it to understand the context of individual words more comprehensively than other models, and (2) transformers employ a self-attention mechanism to strengthen the learning of contextual relations between words [46]. The universal sentence encoder (USE) [4] and bidirectional encoder representations from transformers (BERT) [50] are two of the most popular TLMs. Pre-trained on multiple large text databases (e.g., Wikipedia) for multiple tasks, they are able to transfer the knowledge learned from generic domains to specific data domains. Compared to USE, BERT adopts bidirectional transformer encoders and deeper network architectures [50], leading to better embeddings in general. As most other models were designed for generating word-level embeddings, the transformer-based models also surpass them in generating sentence-level embeddings [51]. In many applications, the pooled output of the BERT model is used as the embedding of an input sequence for various downstream tasks. The pooled output is derived from the last-layer hidden state of the first token of the input sequence (i.e., the classification ([CLS]) token) through a linear layer with the Tanh activation function.
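As context, the derivation of the pooled output described above can be sketched in a few lines of numpy; the weights here are random stand-ins for the trained linear layer, not actual BERT parameters.

```python
import numpy as np

def bert_pooled_output(last_hidden_state, W, b):
    """Pooled output as described above: the last-layer hidden state of the
    first ([CLS]) token passed through a linear layer with Tanh activation."""
    cls_state = last_hidden_state[:, 0, :]  # first token of each sequence
    return np.tanh(cls_state @ W + b)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 10, 512))  # batch of 2 sequences, 10 tokens, 512-dim states
W, b = rng.normal(scale=0.05, size=(512, 512)), np.zeros(512)
pooled = bert_pooled_output(hidden, W, b)
print(pooled.shape)  # (2, 512)
```

The Tanh bounds each pooled dimension to (−1, 1), which gives downstream layers a fixed-scale sentence embedding.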

This study uses the Inception [40] net and the BERT model [50] to embed the sketches and text descriptions of a design concept for design metric evaluation.

### 2.3 Transfer Learning.

Labeled design datasets are often small in size because it is time-consuming and labor-intensive to evaluate designs through expert-based approaches or simulation. TL is a promising approach to tackle this issue. TL is an ML technique that aims at improving the performance of a model within a target domain by transferring the knowledge learned from a different but related source domain [19]. With the transferred knowledge, the model can perform better while less labeled data are required in the target domain. Knowledge transfer from one domain to another relies on the similarities and relevance between the domains [52]. However, knowledge transfer is not always beneficial. For example, transferring pronunciation rules from Spanish to French can be misleading. Negative transfer occurs when TL has negative impacts [53]. Positive TL is built on capturing the transferable and beneficial knowledge elements across domains [53]. In deep learning, the most common approach to knowledge transfer is sharing or transferring the parameters and weights of the model trained on the source domain(s) to the target domain. The bridge for knowledge transfer is the pivot features shared by the source data and the target data [19].
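The parameter-sharing idea can be illustrated with a toy sketch. The layer names and shapes below are hypothetical, not those of any model in this paper; the point is that feature-extractor weights carry over while the task head is re-initialized.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical source model trained on a large generic dataset
source_weights = {
    "conv1": rng.normal(size=(3, 3, 16)),  # low-level pivot features
    "conv2": rng.normal(size=(3, 3, 32)),
    "head":  rng.normal(size=(32, 345)),   # source task: 345-way classification
}

def transfer(source, target_head_shape):
    """Parameter sharing: reuse the feature-extractor weights in the target
    model and re-initialize only the task head for the small target dataset."""
    target = {name: w.copy() for name, w in source.items() if name != "head"}
    target["head"] = np.zeros(target_head_shape)  # fresh head, e.g., for regression
    return target

target_weights = transfer(source_weights, (32, 1))
print(sorted(target_weights))  # ['conv1', 'conv2', 'head']
```

During fine-tuning, the shared layers may be frozen or updated with a small learning rate depending on how close the two domains are.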

TL has been proven effective in many ML domains, including engineering design. We have seen positive knowledge transfer between different design topologies [54], multiple optimization tasks [55], legacy data and new scenarios [56], and high-fidelity and low-fidelity representations [57]. In this study, we take advantage of TL to learn the features of a small set of milk frother designs for design evaluation.

### 2.4 Multimodal Learning.

Because a single design can be represented in multiple data modes, deep learning models that can capture design features simultaneously from multimodal representations are required. To leverage the complementarity and alignment of multimodal data, unimodal models are first constructed to encode different design representations separately. Then, they are fused to generate joint embeddings through one or multiple shared layers at earlier or later training stages.

One category of approaches learns joint embeddings from multimodal data in a supervised manner with given tasks. Among them, the operation-based approach integrates learned embeddings using simple operations, such as concatenation [58,59] and summation or weighted sum [59,60]. For summation or weighted sum, the pre-trained embeddings for all modalities need to have the same dimension and be rearranged in an order suitable for element-wise addition [61]. The attention-based approach employs cross-attention mechanisms to fuse unimodal embeddings [18,62]. By attributing more weight to more relevant features, this strand of methods can dynamically learn the alignment and capture the interactive information between multiple modalities [18,63]. Cross-attention mechanisms can be either directional (i.e., only one modality attends to another) [64] or symmetric (i.e., two modalities attend to each other mutually) [65]. Particularly, the multi-head self-attention mechanism of transformers [46] has been adapted from unimodal learning (i.e., text) to MML. For example, researchers proposed the bimodal transformer models to learn text and images simultaneously by seeing each detected region in an image as an image token in parallel with the word tokens; then, transformers’ attention mechanism is applied between the multimodal tokens [66]. Other supervised approaches include the bilinear pooling-based [67] and graph-based [68] methods, for which interested readers can read the references.
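The two operation-based fusion options mentioned above can be contrasted in a minimal numpy sketch; the batch size and embedding dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb   = rng.normal(size=(4, 128))  # batch x dim, illustrative sizes
sketch_emb = rng.normal(size=(4, 128))

# Concatenation: no dimension constraint; the joint dimension grows
joint_concat = np.concatenate([text_emb, sketch_emb], axis=1)

# Weighted sum: both embeddings must share the same dimension
alpha = 0.6
joint_sum = alpha * text_emb + (1 - alpha) * sketch_emb

print(joint_concat.shape, joint_sum.shape)  # (4, 256) (4, 128)
```

Concatenation preserves both embeddings intact but doubles the input width of the shared layers, whereas the weighted sum keeps the dimension fixed at the cost of requiring aligned, same-sized embeddings.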

The joint embeddings of multimodal data can also be learned through unsupervised learning. The subspace-based methods fall into this category. They learn joint embeddings by projecting different modalities to a common subspace. By maximizing the similarities between the embeddings of different modalities, the approaches can capture the correlations and the mutual information across modalities [18,62]. The learned embeddings of different modalities can be joined using operation-based methods. Similarly, reconstruction models, such as auto-encoders, have also been adapted to embed multimodal data. These methods train multiple unimodal reconstruction models simultaneously and join them through shared layers to obtain a shared embedding by minimizing the reconstruction losses [69,70].

As an emerging topic, researchers have explored MML for a variety of tasks, such as text-to-image or image-to-text generation [71] and multimodal classification [65]. However, its application in engineering design remains limited. Recently, Yuan et al. [72] reported an attention-based MML model that learns from images and text for product evaluation. Their model analyzed detailed orthographic product images and textual product descriptions to predict product user ratings. Unlike the design metrics evaluated in this paper, such user ratings reflect sentiment toward a design rather than a technical assessment of qualities such as creativity and usefulness. Additionally, this study focuses on design idea evaluation at a comparatively earlier stage, when little design detail is available.

Due to the strengths of attention mechanisms in capturing interactive information, we select the attention-based approach and propose a mutual cross-attention mechanism to fuse multimodal information for conceptual design evaluation in this study.

## 3 Data and Method

This section introduces the data used and the AEMML model proposed for design metric prediction in this study. We take the design metric prediction problem as a regression task. The AEMML model can be broken into four modules, as depicted in Fig. 1. First, the raw design data are pre-processed and converted into sketches and text processable by computers. The second and third modules entail two unimodal learning models that learn features from the sketches and text, respectively. Both of them are pre-trained to transfer knowledge from respective large external datasets. In the final module, the two unimodal models are joined together through a mutual cross-attention mechanism to construct the AEMML model. Each module is introduced separately.

Fig. 1

### 3.1 Data.

The AEMML model is developed based on a repository of 1112 milk frother design ideas from a prior study [73]. These design ideas were generated in response to a design challenge that asked participants to design an innovative device that froths milk in a short amount of time. Each design idea was created and recorded free-hand on a solution sheet. The solution sheet consists of a drawing area denoted by “Draw Idea Here” for sketching the idea and a description area starting with “Idea Description” for adding a text description. The collected ideas were then scanned into electronic files. Following the CAT evaluation approach [6], two design experts were asked to evaluate the design ideas in terms of five design metrics: (i) drawing quality, (ii) uniqueness, (iii) usefulness, (iv) creativity, and (v) elegance. Specifically, the drawing quality of a sketch reflects clean lines, accurate proportions, appropriate shading, etc. Uniqueness refers to how original or surprising the idea is. Usefulness refers to how logical, practical, valuable, and understandable the idea is. Elegance indicates the simplicity, clear insight, and concise presentation of an idea [74]. Creativity is defined by ideas that are of high quality and novelty, which is the overarching metric evaluated based on the other four metrics. A six-point Likert scale was used to score each metric, with 1 meaning low quality and 6 meaning high quality regarding a specific metric. The inter-rater reliability test shows that the ratings from the two experts achieve a median Spearman correlation of 0.76 for the five metrics, with the highest value being 0.88 and the lowest value being 0.44. Twenty-six out of the 1112 labeled milk frother design ideas are excluded from this study due to low data quality after scanning or missing any of the five design metrics. As a result, 1086 design ideas constitute the dataset for training and validating the proposed model.
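For reference, the inter-rater agreement statistic used above is the Spearman rank correlation. A minimal numpy version (ignoring tie handling, which matters for real Likert data) is shown below with made-up ratings; it is a stand-in for library implementations such as scipy's.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation computed as the Pearson correlation of
    ranks. No tie correction; tied Likert scores would need average ranks."""
    def rank(a):
        r = np.empty(len(a))
        r[np.argsort(a)] = np.arange(len(a))
        return r
    return np.corrcoef(rank(np.asarray(x)), rank(np.asarray(y)))[0, 1]

# two hypothetical raters scoring six ideas on the 1-6 Likert scale
rater1 = [1, 2, 3, 4, 5, 6]
rater2 = [2, 1, 3, 5, 4, 6]
print(round(spearman(rater1, rater2), 2))  # 0.89
```

A value near 1 indicates the two experts rank the ideas similarly even if their absolute scores differ, which is why rank correlation suits this reliability check.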

### 3.2 Data Pre-Processing.

The raw data involve two representation modes: hand-written text descriptions and free-hand sketches. We utilize a semi-automatic process to pre-process the raw data, as described in a previous study [75]. The text descriptions are first extracted from the solution sheets manually and recorded as strings readable by computers. The maximal, minimal, and average lengths of the text descriptions are 54, 1, and 9 words, respectively. The pre-trained BERT tokenizer is used to pre-process the raw text and prepare the input for the BERT embedding model. Then, the sketches are pre-processed using a five-step module based on the OpenCV Python package to convert the original answer sheets into cleaned images only containing the design sketches. The five steps, respectively, focus on cropping the files, converting red–green–blue images to grayscale images, removing hand-written texts from the drawing area, reducing noise from the sketches, and resizing the sketches. The output sketches have a size of 309 × 309 × 3. Interested readers can refer to our previous study [75] for further details regarding pre-processing.
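The five-step module can be mimicked with a simplified numpy stand-in (the actual pipeline uses OpenCV); the crop box, threshold, and nearest-neighbor resize below are illustrative choices, not the values used in the paper.

```python
import numpy as np

def preprocess_sheet(sheet: np.ndarray, out_size: int = 309) -> np.ndarray:
    """Simplified stand-in for the five-step module: crop, grayscale,
    text removal, denoising, and resize. All parameters are illustrative."""
    # 1. crop the drawing area (hypothetical crop box)
    drawing = sheet[50:-50, 50:-50]
    # 2. RGB -> grayscale via channel averaging
    gray = drawing.mean(axis=2)
    # 3./4. suppress faint marks and noise with a simple brightness threshold
    clean = np.where(gray > 200, 255.0, gray)
    # 5. nearest-neighbor resize to out_size x out_size
    rows = np.arange(out_size) * clean.shape[0] // out_size
    cols = np.arange(out_size) * clean.shape[1] // out_size
    resized = clean[np.ix_(rows, cols)]
    # replicate to 3 channels, matching the 309 x 309 x 3 model input
    return np.repeat(resized[:, :, None], 3, axis=2)

sheet = np.full((800, 600, 3), 255.0)  # a blank white solution sheet
print(preprocess_sheet(sheet).shape)  # (309, 309, 3)
```

In practice, steps 3 and 4 require more care (e.g., locating the hand-written text region and morphological denoising), which is why the actual module relies on OpenCV.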

### 3.3 Models.

In this subsection, we first introduce the unimodal learning models that extract features from the text descriptions or sketches. In the previous study [75], we explored a frequency-based model [76] and two transformer-based models [4,50] to learn the text descriptions, and three CNN-based models to learn the sketches. The BERT model [50] and one Inception model [77] performed best for text and sketch learning, respectively. The unimodal text and sketch models introduced in this section are adapted from the best models from the previous study [75]. Then, we join the two unimodal models through a mutual cross-attention mechanism to build the AEMML model.

#### 3.3.1 BERT Model for Text Learning With Transfer Learning.

In the previous study [75], we adopted a BERT model [50] with 12 transformer encoder layers and frozen parameters. To find the best BERT model for this task, we explore BERT models with different numbers (from 2 to 12) of transformer encoder layers and unfreeze all trainable parameters in this study. Among all BERT models, the small BERT model with four transformer encoder layers and 512-dimensional outputs is selected due to its best prediction performance. This can be explained by the relatively small size of the milk frother dataset, which cannot support effective fine-tuning of larger BERT models with more trainable parameters. The pooled output of the BERT model is used as the text embedding. Following the BERT module, two linear and dropout layer pairs and a linear output layer are sequentially added, as shown by the blocks linked by the solid arrows on the left in Fig. 2. The output layer employs the rectified linear unit (ReLU) activation function for the regression task.
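The head stacked on the pooled output can be sketched as a forward pass in numpy. The hidden widths, weight initialization, and hidden-layer activations below are our own placeholders (the paper specifies only the layer types), and dropout is omitted since it is inactive at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def text_regression_head(pooled, dims=(512, 128, 32)):
    """Two linear (+ dropout during training) pairs followed by a
    ReLU-activated linear output layer; hidden widths are illustrative."""
    h = pooled
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        h = relu(h @ rng.normal(scale=0.05, size=(d_in, d_out)))
    # ReLU output keeps the predicted metric score non-negative
    return relu(h @ rng.normal(scale=0.05, size=(dims[-1], 1)))

scores = text_regression_head(rng.normal(size=(2, 512)))  # batch of 2 embeddings
print(scores.shape)  # (2, 1)
```

The ReLU on the output layer matches the regression target: Likert-based metric scores are never negative.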

Fig. 2

#### 3.3.2 Inception Net for Sketch Learning With Transfer Learning.

The sketch model is also adapted from the best sketch model in the previous study [75]. The core of the sketch model is the Inception net [77] for sketch embedding. To enable TL, we pre-trained the Inception net on 50 million sketches from the QuickDraw dataset for a classification task. Through model parameter and weight sharing, the Inception net can reuse the learned knowledge when it is fine-tuned to embed the milk frother sketches. Taking images with the size of 309 × 309 × 3 as input, the output of the Inception net is a 3D tensor with the shape of 8 × 8 × 2048 in our case. Then, we attach a flatten layer, two linear and dropout layer pairs, and a linear output layer with the ReLU activation function sequentially, as visualized by the blocks connected via the solid arrows on the right in Fig. 2.

#### 3.3.3 Attention-Enhanced Multimodal Learning Model.

The two unimodal models are then integrated through a mutual cross-attention mechanism to construct the AEMML model. Herein, mutual cross-attention means the sketch embedding and the text embedding are used to attend to the features of the other modality mutually [65,78]; more weight is attributed to the features more relevant to the attending embedding [63]. The proposed mutual cross-attention mechanism is inspired by the existing mechanisms [65,78] and adapted according to the characteristics of our data. It integrates two directional cross-attention models, the attention-enhanced text (AET) model and the attention-enhanced sketch (AES) model, as depicted in Fig. 2.

The left half of the architecture (including the blocks linked by the solid and dashed arrows on the left) illustrates the AET model, showing how the sketch embedding attends to the text features. Specifically, the text features (Tf) are derived from the 512-dimensional hidden states of the [CLS] tokens from all four transformer encoder layers in the same way as the pooled output is derived. That is, each of the hidden states of four [CLS] tokens is processed by a linear layer with the Tanh activation function to get the text feature from each transformer layer [78]. Then, the value (VT) and key (KT) matrices input to the directional cross-attention mechanism are derived through linear layers from these features. The output (Sm) of the first linear layer in the unimodal sketch model is used as the sketch embedding, from which the sketch query (QS) vector is derived through a linear layer. The attention output is calculated as follows:
$V_T = T_f \times W_{V_T}, \quad K_T = T_f \times W_{K_T}, \quad Q_S = S_m \times W_{Q_S}, \qquad W_{V_T} \in \mathbb{R}^{512 \times d}, \; W_{K_T} \in \mathbb{R}^{512 \times d}, \; W_{Q_S} \in \mathbb{R}^{1024 \times d}$
(1)
$\mathrm{Attn}_{S \to T} = \mathrm{softmax}\left(\frac{Q_S \times K_T^{\top}}{\sqrt{d}}\right) \times V_T$
(2)
where $W_{V_T}$, $W_{K_T}$, and $W_{Q_S}$ are the weight matrices of the linear layers, $\times$ denotes matrix multiplication, and $d = 64$ is the dimension of $V_T$, $K_T$, $Q_S$, and $\mathrm{Attn}_{S \to T}$; $K_T^{\top}$ is the transpose of $K_T$. Then, the attended text embedding is obtained by concatenating the text embedding ($T_e$) from the unimodal text model, the attention output $\mathrm{Attn}_{S \to T}$, and the sketch query $Q_S$. Accordingly, the attended representation puts more emphasis on the text features highlighted by the sketch embedding while also conveying information from the sketch.
Similarly, the AES model corresponds to the right half (including the solid and dashed arrows on the right) of Fig. 2, exhibiting how the text embedding attends to the sketch features. The 3D output (8 × 8 × 2048) from the Inception net is a pooled map with a size of 8 × 8 and a depth of 2048. Following the approach used in a prior study, we treat each pixel on the pooled map as a region or a feature of the sketch [78]. This yields 64 features, each with a dimension of 2048. Accordingly, the 3D output is reshaped into a 64 × 2048 matrix ($S_f$). The value ($V_S$) and key ($K_S$) matrices input to the directional cross-attention mechanism are derived from this feature matrix ($S_f$) through linear layers. The BERT pooled output ($T_m$) from the unimodal text model is used as the text embedding, from which the text query ($Q_T$) vector is derived through a linear layer. The attention output is calculated as follows:
$V_S = S_f \times W_{V_S}, \quad K_S = S_f \times W_{K_S}, \quad Q_T = T_m \times W_{Q_T}, \qquad W_{V_S} \in \mathbb{R}^{2048 \times d}, \; W_{K_S} \in \mathbb{R}^{2048 \times d}, \; W_{Q_T} \in \mathbb{R}^{512 \times d}$
(3)
$\mathrm{Attn}_{T \to S} = \mathrm{softmax}\left(\frac{Q_T \times K_S^{\top}}{\sqrt{d}}\right) \times V_S$
(4)
where $W_{V_S}$, $W_{K_S}$, and $W_{Q_T}$ are the weight matrices of the linear layers, and $d = 64$ is the dimension of $V_S$, $K_S$, $Q_T$, and $\mathrm{Attn}_{T \to S}$; $K_S^{\top}$ is the transpose of $K_S$. Then, the sketch embedding ($S_e$) from the unimodal sketch model, the attention output $\mathrm{Attn}_{T \to S}$, and the text query $Q_T$ are concatenated as the attended sketch embedding. Accordingly, more weight is attributed to the sketch features highlighted by the text embedding in the attended representation, while the information from the text description is also incorporated.
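The two directional cross-attention computations of Eqs. (1)–(4) can be condensed into one numpy function; the weight matrices here are randomly initialized for illustration, whereas in the model they are trained linear layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared attention dimension, as in the paper

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, query_emb, d_feat, d_query):
    """Directional cross-attention: `query_emb` (one modality) attends to
    `features` (the other modality) via scaled dot-product attention."""
    W_V = rng.normal(scale=0.05, size=(d_feat, d))
    W_K = rng.normal(scale=0.05, size=(d_feat, d))
    W_Q = rng.normal(scale=0.05, size=(d_query, d))
    V, K, Q = features @ W_V, features @ W_K, query_emb @ W_Q
    return softmax(Q @ K.T / np.sqrt(d)) @ V, Q

# AET direction: the 1024-dim sketch embedding attends to 4 text features
T_f = rng.normal(size=(4, 512))   # [CLS] features from the 4 encoder layers
S_m = rng.normal(size=(1, 1024))  # sketch embedding
attn_s2t, Q_s = cross_attention(T_f, S_m, 512, 1024)

# AES direction: the 512-dim text embedding attends to 64 sketch features
S_f = rng.normal(size=(8, 8, 2048)).reshape(64, 2048)  # pooled map as regions
T_m = rng.normal(size=(1, 512))   # BERT pooled output
attn_t2s, Q_t = cross_attention(S_f, T_m, 2048, 512)

print(attn_s2t.shape, attn_t2s.shape)  # (1, 64) (1, 64)
```

Both directions yield a 64-dimensional attention output, so the attended embeddings of the two branches concatenate cleanly with their respective unimodal embeddings and queries.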

Following the concatenation layer, a dropout layer and a linear layer are added to produce the final output. During training, the two directional cross-attention models are partly initialized with the pre-trained weights of the corresponding unimodal models. The trained sketch and text embeddings (Sm or Tm) from the corresponding unimodal models serve as the input to the AET and AES models, respectively. Before being integrated, the two models are separately trained to predict the design metrics.

Finally, the AEMML model joins the AET and AES models by concatenating their attended embeddings. This joint model reuses the pre-trained weights of the AET and AES models to transfer the learned knowledge and avoid modality failure [79]. During training, the trainable parameters of the AET and AES models are fine-tuned jointly to leverage the interactive information between the two representation modalities. To study the effect of the attention mechanism on information fusion, we compare the AEMML model with a baseline MML model, which fuses the text and sketch embeddings through simple concatenation without the attention mechanism. That is, only the text and sketch embeddings from the unimodal models ($T_e$ and $S_e$) are concatenated as the joint embedding. In Fig. 2, the solid blue and yellow arrows correspond to the baseline MML model, while all arrows, both solid and dashed in both colors, describe the AEMML model.
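The difference between the AEMML joint embedding and the baseline concatenation can be seen directly from the tensor shapes. The embedding dimensions below are illustrative placeholders consistent with the attention dimension $d = 64$; the paper does not report the exact sizes of $T_e$ and $S_e$.

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative unimodal embeddings and attention products (dims assumed)
T_e, attn_s2t, Q_s = rng.normal(size=(1, 512)), rng.normal(size=(1, 64)), rng.normal(size=(1, 64))
S_e, attn_t2s, Q_t = rng.normal(size=(1, 1024)), rng.normal(size=(1, 64)), rng.normal(size=(1, 64))

# Attended embeddings of the AET and AES branches
attended_text   = np.concatenate([T_e, attn_s2t, Q_s], axis=1)
attended_sketch = np.concatenate([S_e, attn_t2s, Q_t], axis=1)

# AEMML joint embedding vs. the baseline MML concatenation
joint_aemml    = np.concatenate([attended_text, attended_sketch], axis=1)
joint_baseline = np.concatenate([T_e, S_e], axis=1)
print(joint_aemml.shape, joint_baseline.shape)  # (1, 1792) (1, 1536)
```

The extra 256 dimensions of the AEMML embedding carry the cross-modal attention outputs and queries, which is where the interactive information enters the joint representation.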

## 4 Results and Discussion

In this section, we compare and discuss the performances of the unimodal models, the AET and AES models, and the baseline MML and AEMML models to assess the efficacy of the AEMML model. To train and test the models, we split the 1086 milk frother design ideas with the expert-assessed design metrics into training, validation, and test sets, following the ratio of 0.8:0.1:0.1. The distribution stratification of the design metrics is maintained during the data split, and the design idea split is generated for each design metric uniquely according to its specific stratified distribution. All models for predicting the same metric are trained and tested on the same split. The hyperparameters of the models are tuned through a series of pilot experiments. Specifically, we choose a batch size of 24 for all models, which is limited by the GPU memory. Different learning rates ranging from 5 × 10⁻⁶ to 5 × 10⁻⁴ are explored. We achieve the best model performance with learning rates ranging from 4 × 10⁻⁵ to 6 × 10⁻⁵ for the sketch models, from 2 × 10⁻⁵ to 4 × 10⁻⁵ for the text models, and a learning rate of 2 × 10⁻⁵ for the baseline MML, AET, AES, and AEMML models. We also experiment with different numbers of warmup epochs (from 0 to 50) at the beginning of the training process and end up with a set of numbers ranging from 0 to 10. The maximal number of training epochs is set to 300, while the training process can end early if the validation loss does not decrease for 50 consecutive epochs. Within the range from 20 to 70, the early stopping threshold of 50 best balances the computational cost and the resultant model performance.
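The early-stopping rule described above reduces to a few lines of pure Python; the synthetic loss history below is made up for illustration.

```python
def should_stop(val_losses, patience=50):
    """Early stopping as described above: end training when the validation
    loss has not reached a new minimum for `patience` consecutive epochs."""
    best_epoch = val_losses.index(min(val_losses))
    return (len(val_losses) - 1 - best_epoch) >= patience

# synthetic history: improves until epoch 9, then plateaus for 50 epochs
history = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 50
print(should_stop(history, patience=50))  # True
```

With a patience of 50 and a cap of 300 epochs, training stops early only when the validation loss has clearly stagnated, which is the trade-off between compute cost and performance noted above.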

During training, besides the knowledge transfer enabled by the pre-trained Inception V3 and BERT models, we also explore two other types of knowledge transfer through a set of pilot studies. The first is transfer from unimodal data to multimodal data. We observe that the training of the baseline MML, AET, and AES models benefits from reusing the pre-trained weights of the unimodal models, and the training of the AEMML model further benefits from reusing the pre-trained weights of the AET and AES models. The second is knowledge transfer across tasks, such as predicting different design metrics. We find that a multi-task model predicting all design metrics simultaneously is inferior to models predicting each design metric separately, indicating negative knowledge transfer between the tasks. A possible reason is that the evaluation of different metrics relies on different pivotal features, as indicated by the low average correlation coefficient (0.092) between the metrics.
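
The unimodal-to-multimodal weight reuse can be implemented as a partial state-dict copy. A minimal PyTorch sketch, assuming the multimodal model exposes its unimodal parts as `text_branch` and `sketch_branch` (these names are illustrative):

```python
import torch
import torch.nn as nn

def transfer_unimodal_weights(mm_model: nn.Module, text_state: dict, sketch_state: dict):
    """Copy pre-trained unimodal weights into the matching branches of a
    multimodal model. Any key without a match (e.g., the new fusion head)
    keeps its fresh initialization, so training starts from the transferred
    knowledge instead of from scratch."""
    state = mm_model.state_dict()
    for prefix, ckpt in (("text_branch.", text_state), ("sketch_branch.", sketch_state)):
        for name, tensor in ckpt.items():
            key = prefix + name
            if key in state and state[key].shape == tensor.shape:
                state[key] = tensor
    mm_model.load_state_dict(state)
```

The same pattern applies one level up: the AEMML model can be seeded with the fine-tuned AET and AES weights before joint fine-tuning.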

The performance of each model is evaluated in terms of its explanatory power for the variability of the five design metrics, i.e., the coefficient of determination (R2 value) in statistics. We use the R2 value instead of the mean squared error (MSE) because the R2 value is not affected by the scale and distribution of the predicted metrics, giving a better picture of the quality of a regression model. In this study, each experiment is repeated 15 times, and the results reported in this section are the best results from the repeated experiments. We report and discuss the effects of MML and the attention mechanism below. The discussion of the MSEs of the models and the results of all repeated experiments are available in Appendices A and B, respectively.
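
The scale argument can be checked numerically: rescaling a metric multiplies the MSE but leaves the R2 value unchanged. A small NumPy illustration with synthetic scores (not the study's data):

```python
import numpy as np

def r2(y, pred):
    """Coefficient of determination: fraction of metric variance explained."""
    return 1.0 - np.mean((y - pred) ** 2) / np.var(y)

rng = np.random.default_rng(0)
y = rng.normal(size=200)                        # synthetic "ground truth" scores
pred = y + rng.normal(scale=0.5, size=200)      # imperfect predictions
mse_1, r2_1 = np.mean((y - pred) ** 2), r2(y, pred)

# Rescale the metric by 5: the MSE grows 25-fold, while R2 is unchanged.
mse_5, r2_5 = np.mean((5 * y - 5 * pred) ** 2), r2(5 * y, 5 * pred)
```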

### 4.1 Effect of Multimodal Learning.

Compared to unimodal learning from only sketches or text, MML enables the model to capture interactive features between the two modalities for design metric prediction. The performances of the unimodal models and the baseline MML model are compared in Fig. 3. By joining the text and sketch embeddings, MML improves the explanatory power (i.e., the R2 value) over the unimodal models by 0.05–0.12. The results indicate that the sketch and text representations complement each other when learned jointly, enabling the MML model to capture more informative features for design metric prediction.

Fig. 3

Moreover, the comparison between the text and sketch models reveals which modality is more informative for predicting each design metric. First, the sketch representation is more informative than the text representation for predicting drawing quality, in line with the fact that the evaluation of drawing quality relies more on visual features than on semantic features. Second, the text representation surpasses the sketch representation in uniqueness prediction, suggesting that the features explicitly expressed in the text descriptions play a bigger role in uniqueness evaluation. Third, the text and sketch representations are similarly expressive for predicting creativity, elegance, and usefulness; that is, they convey similar amounts of information regarding these metrics. It is worth noting that the informativeness of sketches and text descriptions may vary across design domains and settings. Although the visual and semantic features are each more expressive for evaluating different metrics, MML enables them to complement and augment each other in evaluating all the metrics. In addition, MML is more beneficial for evaluating the metrics that rely more evenly on the text and sketch representations, such as creativity, elegance, and usefulness. This may imply that the complementarity and alignment between the two representations are more important for evaluating these metrics, which MML can capture to improve model explanatory power.

### 4.2 Effect of the Attention Mechanism.

Attention mechanisms allow ML models to focus on the more informative chunks of information by attributing more weight to the more relevant features [63]. In this study, the text features and sketch features are each attended to by the embedding of the other modality to facilitate multimodal information fusion. Figure 4 shows the performances of the attention-enhanced models. Compared with the unimodal models in Fig. 3, the AET and AES models achieve much higher explanatory power for the variability of the design metrics, which can be explained by their emphasis on the more relevant features and their incorporation of information from both modalities. Moreover, the AES model outperforms the unimodal sketch model to a larger extent (by 0.17 on average) than the AET model surpasses the unimodal text model (by 0.12 on average). This observation may indicate that the visual features conveyed by the sketches are more ambiguous and difficult to learn, but incorporating the semantic information significantly facilitates the interpretation of the visual features. In addition, since the AET and AES models each take the embedding of the other modality as input, the better of the two also outperforms the baseline MML model for all design metrics.
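
One direction of this cross-attention (the AET side) can be sketched with standard multi-head attention, where the sketch embedding queries the token-level text features. This is an illustrative PyTorch sketch of the idea, not the paper's exact layer; the dimensions and head count are assumptions:

```python
import torch
import torch.nn as nn

class DirectionalCrossAttention(nn.Module):
    """AET-style attention sketch: the global sketch embedding acts as the
    query and the text token features act as keys/values, so text features
    that are more relevant to the sketch receive larger attention weights."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, sketch_emb: torch.Tensor):
        # text_tokens: (batch, seq_len, dim); sketch_emb: (batch, dim)
        query = sketch_emb.unsqueeze(1)                    # (batch, 1, dim)
        attended, weights = self.attn(query, text_tokens, text_tokens)
        return attended.squeeze(1), weights                # attended text embedding
```

The AES side mirrors this with the roles of the two modalities swapped; the mutual cross-attention of the AEMML model uses both directions.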

Fig. 4

However, although the directional cross-attention models (i.e., AET and AES) fuse information from both the text and sketch representations, each puts more emphasis on one of the two. For example, the AET model conveys the visual features but attributes more weight to the semantic features, which is a disadvantage for evaluating drawing quality, a metric that relies more on the visual features. To overcome this issue, the AEMML model balances the emphasis on the semantic and visual features through the mutual cross-attention mechanism, i.e., by joining the two directional cross-attention models. This change improves the model explanatory power by 0.02–0.05 for all design metrics. Furthermore, the AEMML model outperforms the baseline MML model in Fig. 3 by 0.05–0.09 for all design metrics. This implies that the mutual cross-attention mechanism enables the MML model to capture more interactive features between the two modalities, achieving more effective information fusion.

Furthermore, the explanatory power of the models shown in Figs. 3 and 4 also informs the predictability of the design metrics. Specifically, through the text and sketch representations, uniqueness presents the highest predictability (R2 value = 0.60), while drawing quality, elegance, and usefulness exhibit moderate predictabilities (R2 values around 0.44), and creativity shows the lowest predictability (R2 value = 0.32). According to the definitions of these metrics introduced in Sec. 3, the assessment of the metrics with higher predictabilities is more straightforward, while assessing creativity mingles more information and requires a deeper and more comprehensive understanding of the design concepts. These findings indicate that current ANNs are less capable of more abstract tasks relying on a deeper and more comprehensive understanding of the contexts, such as predicting creativity. The development of AI for design evaluation needs to address this challenge in the future.

To evaluate the effectiveness of the proposed model, we compare the scores predicted by the AEMML model to the ground truth values evaluated by the design experts. Figure 5 shows the comparison for uniqueness, the metric exhibiting the highest predictability. All the data points in the plot are from the test set, which the model has not seen during training. If the prediction model were perfect, the predicted values would equal the ground truth, i.e., the dots would fall on the diagonal line. In our case, the dots are distributed along the diagonal line but fall within a wide region, which is in line with the moderate R2 value of 0.60. On the whole, a design is more likely to receive a higher predicted value if it has a higher ground truth value, showing the effectiveness of the proposed model. However, the model tends to overestimate designs with low ground truth values and underestimate designs with high ground truth values.
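
This over- and underestimation at the extremes is a typical property of least-squares regressors: whenever R2 < 1, the predictions have a smaller spread than the targets, pulling extreme cases toward the mean. A toy NumPy illustration, unrelated to the study's data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 8))
y = x @ rng.normal(size=8) + rng.normal(scale=2.0, size=500)  # noisy targets

w, *_ = np.linalg.lstsq(x, y, rcond=None)   # ordinary least squares fit
pred = x @ w
# The predictions are compressed toward the mean: their variance is smaller
# than the target variance, so low targets tend to be overestimated and
# high targets underestimated.
```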

Fig. 5

Four example milk frother designs from the test set are shown around the plot to help illustrate the biases of the model. Examples 2 and 3 are predicted to have the lowest and highest uniqueness values, respectively, in line with the expert-evaluated results. Examples 1 and 4 are, respectively, overestimated and underestimated by the model. These observations suggest that the model tends to predict designs with simple sketches comprising common shapes and descriptions consisting of common words to be less unique. In contrast, it predicts designs with complicated sketches and descriptions using uncommon words to be more unique. Herein, common and uncommon are relative to the given dataset. This is consistent with how a model “understands” uniqueness in a statistical sense. However, such statistical uniqueness could stem from a unique way of representing an otherwise similar concept, causing biases. A human rater can differentiate concept uniqueness from representation uniqueness relatively easily, which is more challenging for the proposed model and ML models in general.

We then compare the performance of the AEMML model with the models developed in two prior studies [23,75] aiming at the same task, as shown in Fig. 6. One prior study [23] aimed to utilize the less resource-demanding SVS [8] features to predict the more resource-demanding CAT design metrics, which still required significant human input to label the SVS features. The authors represented each design concept as a one-hot encoding of the corresponding SVS features and employed three regression models to predict the design metrics. The first columns in Fig. 6 show the best performance among these regression models. The MML model from the other prior study [75] joined the best text model and the best sketch model through simple concatenation and showed that the MML model outperforms the unimodal models. The second columns in Fig. 6 present the performance of the MML model from Ref. [75]. In this study, the unimodal models are adapted from the best text and sketch models from Ref. [75], achieving better performances. The last two columns of Fig. 6 show the performances of the baseline MML model and the proposed AEMML model, which join the improved unimodal models through simple concatenation and the mutual cross-attention mechanism, respectively. Among all models, the AEMML model exhibits the highest explanatory power for the variability of the design metrics. The improvements from the first column to the last column in Fig. 6 demonstrate the efficacy of TL, MML, and the cross-attention mechanism.

Fig. 6

In this study, testing the model on five different design metrics provides partial evidence of the generalizability of the proposed model. Meanwhile, these five metrics are only one set of design metrics used to demonstrate and validate the proposed method; the method is expected to be applicable to a wider variety of design metrics. The proposed model is not restricted to the milk frother dataset, as no domain-specific rules are employed. Other large datasets with diverse and accurate labels would be ideal candidates for applying the proposed multimodal learning model. Given the scope of the study, we focus on rigorously validating the model using multiple metrics, multiple models, and multiple runs with this dataset. Future work can expand this method to more diverse design metrics and application domains as more datasets become available.

Moreover, the proposed model trained with the milk frother data is also anticipated to be partly transferable to a new domain with less training data for conceptual design evaluation. By definition, the criteria for assessing drawing quality and elegance are more generic and largely shared across design domains, so it is relatively straightforward to transfer the knowledge learned by the model from one domain to another for evaluating these two metrics. Similarly, since the uniqueness of a design concept can be largely expressed through its visual and semantic features, its evaluation is more objective than that of usefulness and creativity, and the model's capability of learning visual and semantic features for uniqueness evaluation can also be transferred to another domain. In contrast, the evaluation of usefulness and creativity requires more comprehensive domain-specific knowledge, expertise, and understanding, which is challenging even for design experts, making it more difficult to transfer knowledge from the milk frother domain to a new design domain for predicting such metrics. Accordingly, when applying the model to a new domain, we need less data to fine-tune the model for the more straightforward metrics but more data to retrain it for the more complicated metrics. We will validate the transferability of the model across design domains when other datasets become available.

The findings from this study evidence the effectiveness of multimodal learning in comprehending multimodal design representations and are anticipated to encourage its application in broader contexts. Design information is often communicated in multiple data modes, such as products publicized on e-commerce websites and design precedents disclosed in patents. This model provides a supervised approach to learning joint embeddings of multimodal design representations. On this basis, researchers and industrial design practitioners can fine-tune the proposed model with their own data for conceptual design evaluation, user preference prediction, multimodal information retrieval, and so forth. Being able to capture the complementarity and alignment between multimodal information, the proposed cross-attention mechanism can, in principle, also be applied to other deep learning models for multimodal classification and regression problems. We will investigate and demonstrate the generalizability of the cross-attention mechanism in future studies.

Additionally, the magnitude of the explanatory power achieved by the AEMML model indicates that there is still plenty of room for improvement. The biases of the proposed model stem mainly from three sources. First, the data are noisy, and the design metrics evaluated by the expert designers are not perfect ground truth; more accurately evaluated metrics are needed to train more effective ML models. Second, the text annotations in the drawing area are removed and not used by the current model, which causes information loss. Lastly, the current model itself is not highly effective in capturing complex information and relations in the representations. Like most ML models, it is good at learning simple and straightforward features but not as capable as humans at comprehending complex design information, resulting in biased predictions. As a result, the proposed model performs better in predicting metrics that rely on more straightforward features of the representations (e.g., uniqueness, drawing quality, and elegance) but is less capable of predicting complex metrics like creativity. We expect significant improvements in ML models to overcome these technical challenges in the future. In cases where high prediction accuracy is required, the level of explanatory power achieved by the proposed model would not be satisfactory. However, it can still provide an informative reference to human raters and reduce the time and effort needed for manual evaluation.

### 4.3 Challenges in Design Creativity Evaluation Through Sketches and Texts.

This subsection summarizes the challenges and potential opportunities in design metric prediction using AI from the aspects of both data and model.

#### 4.3.1 Design Evaluation Needs to Consider Multiple Representation Modes.

First, it is difficult to comprehensively encode complex design ideas represented in multiple modes into one expressive embedding. This study makes an attempt to embed design ideas represented by text descriptions and sketches using an AEMML model. Further challenges reside in (1) including the textual annotations within the drawing area into the multimodal embedding, (2) adding representations in new modalities, such as 3D models, into the multimodal embedding, (3) relating the features fed to attention mechanisms with intuitive information (e.g., a region in an image) for easier interpretation and inspection of attention mechanisms, and (4) developing more advanced information fusion methods to better capture the complementarity and alignment between different modalities. Future studies can focus on the state-of-the-art methods for image segmentation and text annotation extraction, more advanced unimodal learning of different representation modalities, and more effective approaches to multimodal information fusion.

#### 4.3.2 Most Design Data Are Noisy and May Lack Information.

Second, for the purpose of training deep learning models, the sketches and text descriptions representing design ideas are noisy. Different designers have varying abilities and preferences in expressing conceptual design ideas through sketches and text descriptions, so similar design ideas can be communicated in distinct styles and at distinct abstraction and elaboration levels. Moreover, designers often represent designs in multiple modes, but a certain mode may be missing for some designs. These variations in representation resolution, abstraction level, and elaboration level can become more substantial as the target product becomes more complex. It is still challenging for current ML models to distinguish conceptual differences from representation varieties, make up for a missing mode, and identify the different abstraction and elaboration levels in design representations. In the future, more advanced ML models should be developed to comprehend design information expressed in different ways and at different levels to improve the robustness of AI-based design evaluation approaches. Additionally, while it is relatively easy for human raters to handle such representation inconsistencies and missing information, they are subject to fatigue. Accordingly, future efforts should also be made to build effective human–AI hybrid teams in which humans and AI learn from and augment each other for this challenging task.

#### 4.3.3 Most Design Datasets With High-Quality Labels Are Small in Size.

The third challenge is the lack of large, high-quality labeled datasets to train high-performing AI. Although a few large design datasets exist, such as ShapeNet, PartNet, and Fusion 360 Gallery, they lack labels useful for engineering design. As neural networks go deep with huge numbers of trainable parameters, it is almost impossible to train high-performing AI models without large, closely related training datasets. This study has demonstrated the efficacy of TL in handling small datasets by transferring knowledge from large external datasets to our target dataset. However, the available large datasets (e.g., QuickDraw) from generic ML domains are not ideal sources from which pre-trained models can learn sufficient design knowledge. Therefore, large labeled design datasets are needed to support the training of high-performing deep learning models and effective knowledge transfer for more complex and abstract design tasks. To construct such large datasets with high-quality labels, we encourage researchers and companies to (1) clarify the corresponding design requirements and goals when curating and sharing large datasets; (2) collect and store aligned design data represented in multiple data modalities (e.g., sketch and CAD model pairs); (3) label the datasets with design-related attributes, such as functions, material, size, weight, evaluation metrics, engineering performance, and application contexts; and (4) provide pre-trained embeddings of the designs in a dataset if available.

## 5 Conclusion

In this study, we develop and validate an AEMML model for conceptual design evaluation. In this model, a pre-trained BERT model and a pre-trained Inception V3 model are employed to transfer knowledge from large external datasets to our milk frother design dataset. The two unimodal modules are then joined through a mutual cross-attention mechanism. The AEMML model attributes more weight to the semantic and visual features that are more relevant to the other modality, enabling it to capture more information regarding the interactions between the two modalities during multimodal information fusion. We study the efficacy of the AEMML model via comparisons with the unimodal models, the baseline MML model, and the AET and AES models. By leveraging the power of AI, this model sheds light on efficient and scalable design evaluation. We train and validate the model using a set of milk frother design ideas represented by sketches and text descriptions to predict five design metrics: (i) drawing quality, (ii) uniqueness, (iii) usefulness, (iv) creativity, and (v) elegance.

The results of this study lead to three key findings: (1) Although the unimodal text model and sketch model, respectively, perform better for predicting different design metrics, the baseline MML model outperforms both unimodal models by 0.05–0.12 for evaluating all the design metrics through the utilization of complementary features between the two modalities. (2) The proposed mutual cross-attention mechanism improves the explanatory power of the baseline MML model by 0.05–0.09 by capturing more information regarding interactions between the two modalities. (3) Among all the design metrics, uniqueness presents the highest predictability (0.60), while creativity exhibits the lowest predictability (0.32). These findings evidence the efficacy of our AEMML model in multimodal design concept evaluation. Since engineering designs are commonly represented in multiple modes, MML is becoming a necessity to utilize information comprehensively and effectively when ML is applied for rapid and scalable design evaluation and other tasks. The proposed AEMML model architecture can be generalized to broader application contexts reliant on multimodal data, such as text-to-sketch, image, or 3D geometry generation.

## Footnotes

2. The data and script of this study have been uploaded to GitHub: https://github.com/likeshine/sketch-text-multimodal-transfer-learning

4. The participants are first-year engineering design students.

5. See Note 2.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The data and information that support the findings of this article are freely available.5

### Appendix A: Mean Squared Errors of the Models

We also compare the performance of the unimodal and multimodal models in terms of their MSEs in predicting the five design metrics. As shown in Fig. 7, the multimodal model achieves lower MSEs than both unimodal models for all five design metrics. Meanwhile, the sketch model obtains lower MSEs in predicting drawing quality and elegance, while the text model exhibits a lower MSE in predicting uniqueness. Since a relatively lower MSE indicates the strength of the corresponding model against the other models, these findings are in line with the findings from the comparison in R2 values.

Fig. 7

Similarly, Fig. 8 shows the MSEs of the attention-enhanced models. For the prediction of all five design metrics, the AEMML model achieves lower MSEs than the AET and AES models, suggesting the strength of the multimodal model against the unimodal models. When the attention mechanism is applied, the differences between the AET and AES models in predicting different design metrics are smaller than those between the original text and sketch models in Fig. 7.

Fig. 8

Additionally, the multimodal models exhibit the lowest MSE in predicting drawing quality but the highest MSE in predicting usefulness in both figures (Figs. 7 and 8). However, it is hard to conclude that the multimodal model performs better in predicting drawing quality because the variance of the predicted metrics also affects the MSE of the model. In our dataset, the variances of the metrics are as listed in Table 1. The R2 values are calculated based on both MSEs and variances, giving a better picture of the quality of the regression model.

Table 1 The variances of the five metrics

| Metric | Variance |
| --- | --- |
| Drawing | 1.08 |
| Uniqueness | 1.84 |
| Creativity | 1.09 |
| Elegance | 1.25 |
| Usefulness | 1.68 |

### Appendix B: Results of the Repeated Experiments

In this study, each experiment is repeated 15 times, and the results from the repeated experiments are reported here. Figure 9 shows the comparison in explanatory power between the unimodal and multimodal models, while Fig. 10 shows the comparison among the AET, AES, and AEMML models. The explanatory power of each model varies from run to run. However, the main findings of this study hold when Figs. 9 and 10 are compared with Figs. 3 and 4. The results of these repeated experiments show the robustness of the proposed model and the rigor of the findings.

Fig. 9
Fig. 10

## References

1.
Hammedi
,
W.
,
Van Riel
,
A. C.
, and
Sasovova
,
Z.
,
2011
, “
Antecedents and Consequences of Reflexivity in New Product Idea Screening*
,”
J. Product Innov. Manage.
,
28
(
5
), pp.
662
679
.
2.
Miller
,
S. R.
,
Hunter
,
S. T.
,
Starkey
,
E.
,
Ramachandran
,
S.
,
Ahmed
,
F.
, and
Fuge
,
M.
,
2021
, “
How Should We Measure Creativity in Engineering Design? A Comparison Between Social Science and Engineering Approaches
,”
ASME J. Mech. Des.
,
143
(
3
), p.
031404
.
3.
Starkey
,
E. M.
,
Menold
,
J.
, and
Miller
,
S. R.
,
2019
, “
When Are Designers Willing to Take Risks? How Concept Creativity and Prototype Fidelity Influence Perceived Risk
,”
ASME J. Mech. Des.
,
141
(
3
), p.
031104
.
4.
Cer
,
D.
,
Yang
,
Y.
,
Kong
,
S.-Y.
,
Hua
,
N.
,
Limtiaco
,
N.
,
St John
,
R.
,
Constant
,
N.
,
Guajardo-Céspedes
,
M.
,
Yuan
,
S.
,
Tar
,
C.
,
Sung
,
Y.-H.
,
Strope
,
B.
, and
Kurzweil Google Research Mountain View, R
,
2018
, “
Universal Sentence Encoder
,”
AAAI
, pp.
16026
16028
.
5.
Sarkar
,
P.
, and
Chakrabarti
,
A.
,
2014
, “
Ideas Generated in Conceptual Design and Their Effects on Creativity
,”
Res. Eng. Des.
,
25
(
3
), pp.
185
201
.
6.
Amabile
,
T. M.
,
1996
,
Creativity in Context: Update to the Social Psychology of Creativity
,
Routledge
,
New York
.
7.
Sarkar
,
P.
, and
Chakrabarti
,
A.
,
2011
, “
Assessing Design Creativity
,”
Des. Stud.
,
32
(
4
), pp.
1
36
.
8.
Shah
,
J. J.
,
Vargas-Hernandez
,
N.
, and
Smith
,
S. M.
,
2003
, “
Metrics for Measuring Ideation Effectiveness
,”
Des. Stud.
,
24
(
2
), pp.
111
134
.
9.
Baer
,
J.
, and
Kaufman
,
J. C.
,
2018
,
The Palgrave Handbook of Social Creativity Research
,
Palgrave Macmillan
,
Cham
, pp.
27
37
.
10.
Amabile
,
T. M.
,
1982
, “
Social Psychology of Creativity: A Consensual Assessment Technique
,”
J. Personal. Soc. Psychol.
,
43
(
5
), pp.
997
1013
.
11.
Pahl
,
G.
,
Beitz
,
W.
,
Feldhusen
,
J.
, and
Grote
,
K.-H. H.
,
2007
,
Engineering Design: A Systematic Approach
,
Springer
,
London
.
12.
Oman
,
S. K.
,
Tumer
,
I. Y.
,
Wood
,
K.
, and
,
C.
,
2013
, “
A Comparison of Creativity and Innovation Metrics and Sample Validation Through In-Class Design Projects
,”
Res. Eng. Des.
,
24
(
1
), pp.
65
92
.
13.
Ling
,
G.
,
Mollaun
,
P.
, and
Xi
,
X.
,
2014
, “
A Study on the Impact of Fatigue on Human Raters When Scoring Speaking Responses
,”
Lang. Test.
,
31
(
4
), pp.
479
499
.
14.
Chaudhuri
,
N. B.
,
Dhar
,
D.
, and
Yammiyavar
,
P. G.
,
2020
, “
A Computational Model for Subjective Evaluation of Novelty in Descriptive Aptitude
,”
Int. J. Technol. Des. Edu.
,
32
, pp.
1
38
.
15.
Ahmed
,
F.
,
Ramachandran
,
S. K.
,
Fuge
,
M.
,
Hunter
,
S.
, and
Miller
,
S.
,
2019
, “
Interpreting Idea Maps: Pairwise Comparisons Reveal What Makes Ideas Novel
,”
ASME J. Mech. Des.
,
141
(
2
), p.
021102
.
16.
Ahmed
,
F.
,
Fuge
,
M.
,
Hunter
,
S.
, and
Miller
,
S.
,
2018
, “
Unpacking Subjective Creativity Ratings: Using Embeddings to Explain and Measure Idea Novelty
,”
ASME 2018 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference
,
,
Aug. 26–29
.
17.
Ahmed
,
F.
, and
Fuge
,
M.
,
2017
, “
Capturing Winning Ideas in Online Design Communities
,”
2017 ACM Conference on Computer Supported Cooperative Work and Social Computing
,
Portland, OR
,
Feb. 25–Mar. 1
, pp.
1675
1687
.
18.
Zhang
,
C.
,
Yang
,
Z.
,
He
,
X.
, and
Deng
,
L.
,
2019
, “
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
,”
IEEE J. Select. Top. Signal Process.
,
14
(
3
), pp.
478
493
.
19.
Zhuang
,
F.
,
Qi
,
Z.
,
Duan
,
K.
,
Xi
,
D.
,
Zhu
,
Y.
,
Zhu
,
H.
,
Xiong
,
H.
, and
He
,
Q.
,
2021
, “
A Comprehensive Survey on Transfer Learning
,”
Proc. IEEE
,
109
(
1
), pp.
43
76
.
20.
Jeffries
,
K. K.
,
2012
, “
Amabile’s Consensual Assessment Technique: Why Has It Not Been Used More in Design Creativity Research?
2nd International Conference on Design Creativity
,
Glasgow, UK
,
Sept. 18–20
, pp.
211
220
.
21.
Han
,
J.
,
Forbes
,
H.
, and
Schaefer
,
D.
,
2019
, “
An Exploration of the Relations Between Functionality, Aesthetics and Creativity in Design
,”
International Conference on Engineering Design
,
Delft, The Netherlands
,
Aug. 5–8
, vol. 1 (1), pp.
259
268
.
22.
Weaver
,
M. B.
,
Caldwell
,
B. W.
, and
Sheafer
,
V.
,
2019
, “
Interpreting Measures of Rarity and Novelty: Investigating Correlations Between Relative Infrequency and Perceived Ratings
,”
ASME 2019 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 7
,
Anaheim, CA
,
Aug. 18–21
.
23.
Edwards
,
K.
,
Miller
,
S. R.
, and
Ahmed
,
F.
,
2022
, “
If a Picture Is Worth 1000 Words, Is a Word Worth 1000 Features For
,”
ASME J. Mech. Des
,
144
(
4
), p.
041402
.
24. Sluis-Thiescheffer, W., Bekker, T., Eggen, B., Vermeeren, A., and De Ridder, H., 2016, “Measuring and Comparing Novelty for Design Solutions Generated by Young Children Through Different Design Methods,” Des. Stud., 43, pp. 48–73.
25. Fiorineschi, L., Frillici, F. S., and Rotini, F., 2018, “Issues Related to Missing Attributes in Aposteriori Novelty Assessments,” DESIGN Conference, Dubrovnik, Croatia, May 21–24, Vol. 3, pp. 1067–1078.
26. Johnson, T. A., Cheeley, A., Caldwell, B. W., and Green, M. G., 2016, “Comparison and Extension of Novelty Metrics for Problem-Solving Tasks,” ASME 2016 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Charlotte, NC, Aug. 21–24.
27. Sarkar, P., and Chakrabarti, A., 2007, “Development of a Method for Assessing Design Creativity,” DS 42: Proceedings of ICED 2007, the 16th International Conference on Engineering Design, Paris, France, July 28–31, pp. 349–350.
28. Srinivasan, V., and Chakrabarti, A., 2010, “Investigating Novelty–Outcome Relationships in Engineering Design,” AI EDAM, 24(2), pp. 161–178.
29. Siddharth, L., and Sarkar, P., 2018, “A Multiple-Domain Matrix Support to Capture Rationale for Engineering Design Changes,” ASME J. Comput. Inf. Sci. Eng., 18(2), p. 021014.
30. Brown, D. C., and Gero, J., 2014, “Problems With the Calculation of Novelty Metrics,” Sixth International Conference on Design Computing and Cognition, London, UK, June 23–25, J. Gero, ed., Springer, pp. 1–9.
31. Speer, R., and Havasi, C., 2012, “Representing General Relational Knowledge in ConceptNet 5,” Eighth International Conference on Language Resources and Evaluation, Istanbul, Turkey, May 23–25, pp. 3679–3686.
32. Sarica, S., Luo, J., and Wood, K. L., 2020, “TechNet: Technology Semantic Network Based on Patent Data,” Expert Syst. Appl., 142, p. 112995.
33. Sarica, S., Song, B., Luo, J., and Wood, K. L., 2019, “Technology Knowledge Graph for Design Exploration: Application to Designing the Future of Flying Cars,” ASME 2019 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Anaheim, CA, Aug. 18–21.
34. Han, J., Forbes, H., Shi, F., Hao, J., and Schaefer, D., 2020, “A Data-Driven Approach for Creative Concept Generation and Evaluation,” DESIGN Conference, Virtual, Oct. 26–29, Vol. 1, pp. 167–176.
35. Luo, J., Sarica, S., and Wood, K. L., 2021, “Guiding Data-Driven Design Ideation by Knowledge Distance,” Knowl.-Based Syst., 218, p. 106873.
36. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., 1998, “Gradient-Based Learning Applied to Document Recognition,” Proc. IEEE, 86(11), pp. 2278–2323.
37. Xu, P., Hospedales, T. M., Yin, Q., Song, Y.-Z., Xiang, T., and Wang, L., 2022, “Deep Learning for Free-Hand Sketch: A Survey,” IEEE Trans. Pattern Anal. Mach. Intell., 45(1), pp. 285–312.
38. Seddati, O., Dupont, S., and Mahmoudi, S., 2015, “DeepSketch: Deep Convolutional Neural Networks for Sketch Recognition and Similarity Search,” 2015 13th International Workshop on Content-Based Multimedia Indexing, Prague, Czech Republic, June 10–12.
39. Lu, W., and Tran, E., 2017, “Free-Hand Sketch Recognition Classification,” Tech. Rep., Stanford University.
40. Jahan, N., Nesa, A., and Layek, M. A., 2021, “Parkinson’s Disease Detection Using CNN Architectures With Transfer Learning,” 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems, Chennai, India, Sept. 24–25, pp. 1–5.
41. Yu, Q., Yang, Y., Liu, F., Song, Y.-Z., Xiang, T., and Hospedales, T. M., 2016, “Sketch-a-Net: A Deep Neural Network That Beats Humans,” Int. J. Comput. Vis., 122(3), pp. 411–425.
42. Zhang, X., Huang, Y., Zou, Q., Pei, Y., Zhang, R., and Wang, S., 2020, “A Hybrid Convolutional Neural Network for Sketch Recognition,” Pattern Recogn. Lett., 130, pp. 73–82.
43. Zhang, L., 2021, “Hand-Drawn Sketch Recognition With a Double-Channel Convolutional Neural Network,” 2021(1), pp. 1–12.
44. Ha, D., and Eck, D., 2017, “A Neural Representation of Sketch Drawings,” 6th International Conference on Learning Representations, Vancouver, Canada, Apr. 30–May 3.
45. Yang, L., Wei, X., Tong, S. J., Zhou, K., Zheng, Y., Zhuang, J., and Fu, H., 2021, “SketchGNN: Semantic Sketch Segmentation With Graph Neural Networks,” ACM Trans. Graph., 37(111), p. 2021.
46. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I., 2017, “Attention Is All You Need,” 31st Conference on Neural Information Processing Systems, Long Beach, CA, Dec. 4–9, pp. 5999–6009.
47. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J., 2013, “Distributed Representations of Words and Phrases and Their Compositionality,” Advances in Neural Information Processing Systems, 2, pp. 1–9.
48. Pennington, J., Socher, R., and Manning, C. D., 2014, “GloVe: Global Vectors for Word Representation,” 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, Oct. 25–29, pp. 1532–1543.
49. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L., 2018, “Deep Contextualized Word Representations,” 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, June 1–6, Vol. 1, pp. 2227–2237.
50. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., 2019, “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, June 2–7, pp. 4171–4186.
51. Zan, Z., Li, L., Liu, J., and Zhou, D., 2020, “Sentence-Based and Noise-Robust Cross-Modal Retrieval on Cooking Recipes and Food Images,” 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, Oct. 26–29, Vol. 20, pp. 117–125.
52. Pan, S. J., and Yang, Q., 2010, “A Survey on Transfer Learning,” IEEE Trans. Knowl. Data Eng., 22(10), pp. 1345–1359.
53. Wang, Z., Dai, Z., Poczos, B., and Carbonell, J., 2019, “Characterizing and Avoiding Negative Transfer,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, June 15–20.
54. Whalen, E., and Mueller, C., 2022, “Toward Reusable Surrogate Models: Graph-Based Transfer Learning on Trusses,” ASME J. Mech. Des., 144(2), p. 021704.
55. Cheng, M. Y., Gupta, A., Ong, Y. S., and Ni, Z. W., 2017, “Coevolutionary Multitasking for Concurrent Global Optimization: With Case Studies in Complex Engineering Design,” Eng. Appl. Artif. Intell., 64, pp. 13–24.
56. Pandita, P., Ghosh, S., Gupta, V. K., Meshkov, A., and Wang, L., 2022, “Application of Deep Transfer Learning and Uncertainty Quantification for Process Identification in Powder Bed Fusion,” ASCE-ASME J. Risk Uncertain. Eng. Syst. Part B Mech. Eng., 8(1), p. 011106.
57. Huang, X., Hu, Z., Xie, T., Wang, Z., Chen, L., and Zhou, Q., 2021, “Point-Cloud Neural Network Using Transfer Learning-Based Multi-Fidelity Method for Thermal Field Prediction in Additive Manufacturing,” ASME 2021 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Virtual, Aug. 17–19.
58. Nojavanasghari, B., Gopinath, D., Koushik, J., Baltrušaitis, T., and Morency, L. P., 2016, “Deep Multimodal Fusion for Persuasiveness Prediction,” 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, Nov. 12–16, pp. 284–288.
59. Anastasopoulos, A., Kumar, S., and Liao, H., 2019, “Neural Language Modeling With Visual Features,” 10.48550/arXiv.1903.02930. https://arxiv.org/abs/1903.02930
60. Vielzeuf, V., Lechervy, A., Pateux, S., and Jurie, F., 2018, “CentralNet: A Multilayer Approach for Multimodal Fusion,” Computer Vision – ECCV 2018 Workshops, Munich, Germany, Sept. 8–14, Vol. 11134, LNCS, pp. 575–589.
61. Perez-Rua, J. M., Vielzeuf, V., Pateux, S., Baccouche, M., and Jurie, F., 2019, “MFAS: Multimodal Fusion Architecture Search,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, June 15–20, pp. 6959–6968.
62. Cui, C., Yang, H., Wang, Y., Zhao, S., Asad, Z., Coburn, L. A., Wilson, K. T., Landman, B. A., and Huo, Y., 2022, “Deep Multi-Modal Fusion of Image and Non-Image Data in Disease Diagnosis and Prognosis: A Review,” 10.48550/arXiv.2203.15588. https://arxiv.org/abs/2203.15588
63. Bahdanau, D., Cho, K., and Bengio, Y., 2015, “Neural Machine Translation by Jointly Learning to Align and Translate,” 3rd International Conference on Learning Representations, San Diego, CA, May 7–9.
64. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y., 2015, “Show, Attend and Tell: Neural Image Caption Generation With Visual Attention,” 32nd International Conference on Machine Learning, Lille, France, July 6–11.
65. Tuan, N. M. D., and Minh, P. Q. N., 2021, “Multimodal Fusion With BERT and Attention Mechanism for Fake News Detection,” 2021 RIVF International Conference on Computing and Communication Technologies, Hanoi, Vietnam, Aug. 19–21.
66. Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J., 2019, “VL-BERT: Pre-Training of Generic Visual-Linguistic Representations,” 10.48550/arXiv.1908.08530. https://arxiv.org/abs/1908.08530
67. Tenenbaum, J. B., and Freeman, W. T., 2000, “Separating Style and Content With Bilinear Models,” Neural Comput., 12(6), pp. 1247–1283.
68. Parisot, S., Ktena, S. I., Ferrante, E., Lee, M., Guerrero, R., Glocker, B., and Rueckert, D., 2018, “Disease Prediction Using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer’s Disease,” Med. Image Anal., 48, pp. 117–130.
69. Silberer, C., and Lapata, M., 2014, “Learning Grounded Meaning Representations With Autoencoders,” 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, June 23–25, Vol. 1, pp. 721–732.
70. Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., and Salakhutdinov, R., 2019, “Learning Factorized Multimodal Representations,” 7th International Conference on Learning Representations, New Orleans, LA, May 6–9.
71. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X., 2018, “AttnGAN: Fine-Grained Text to Image Generation With Attentional Generative Adversarial Networks,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, June 18–23, pp. 1316–1324.
72. Yuan, C., Marion, T., and Moghaddam, M., 2022, “Leveraging End-User Data for Enhanced Design Concept Evaluation: A Multimodal Deep Regression Model,” ASME J. Mech. Des., 144(2), p. 021403.
73. Toh, C. A., and Miller, S. R., 2016, “Creativity in Design Teams: The Influence of Personality Traits and Risk Attitudes on Creative Concept Selection,” Res. Eng. Des., 27(1), pp. 73–89.
74. Zheng, X., and Miller, S. R., 2019, “Is Ownership Bias Bad? The Influence of Idea Goodness and Creativity on Design Professionals’ Concept Selection Practices,” ASME J. Mech. Des., 141(2), p. 021106.
75. Song, B., Miller, S., and Ahmed, F., 2022, “Hey, AI! Can You See What I See? Multimodal Transfer Learning-Based Design Metrics Prediction for Sketches With Text Descriptions,” ASME 2022 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, St. Louis, MO, Aug. 14–17.
76. Liu, C. Z., Sheng, Y. X., Wei, Z. Q., and Yang, Y. Q., 2018, “Research of Text Classification Based on Improved TF-IDF Algorithm,” 2018 IEEE International Conference of Intelligent Robotic and Control Engineering, Lanzhou, China, Aug. 24–27, pp. 69–73.
77. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z., 2016, “Rethinking the Inception Architecture for Computer Vision,” IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, June 26–July 1, pp. 2818–2826.
78. Wang, J., Mao, H., and Li, H., 2022, “FMFN: Fine-Grained Multimodal Fusion Networks for Fake News Detection,” Appl. Sci., 12(3), p. 1093.
79. Du, C., Li, T., Liu, Y., Wen, Z., Hua, T., Wang, Y., and Zhao, H., 2021, “Improving Multi-Modal Learning With Uni-Modal Teachers,” 10.48550/arXiv.2106.11059. https://arxiv.org/abs/2106.11059