1 Introduction

The transformative impact of data-driven methods, which have revolutionized fields like image and text analysis, relies on the availability of adequately large and diverse datasets. These datasets have fueled breakthroughs in deep learning, enabling the development of useful artificial intelligence (AI) tools such as ChatGPT, Gemini, Llama, and Stable Diffusion. Similarly, in engineering design, data-driven methodologies are reshaping traditional paradigms—enhancing design theory, decision-making processes, optimization strategies, and educational curricula. By facilitating faster design exploration and automation, these methods are opening new frontiers in the field. Despite these advancements, the adoption of machine-learning and data-driven approaches in engineering design faces significant hurdles, primarily due to dataset-related challenges. Key among these are the scarcity of publicly available datasets, insufficient sample sizes and feature diversity in existing datasets, and the limited integration of critical dimensions such as functional performance. Moreover, the demand for high-quality data presents a persistent challenge, as engineering applications require datasets that are robust, comprehensive, and tailored to the complexities of design tasks.

This editorial highlights a special collection of papers that share design datasets and examine the intersection of data-driven methodologies and engineering design. The selected works not only investigate novel approaches but also provide detailed discussions on the datasets employed, many of which are publicly released to encourage broader use and collaboration. These contributions span several critical areas of engineering, with topics ranging from engineering catalogs [1] and vehicle systems [2] to advanced materials [3,4] and manufacturing processes [5]. The datasets cover various domains including power systems [6], human-centered design [5,7–9], synthetic data generation [10], mechanical artifacts [11,12], and infrastructure monitoring [13]. Each dataset aims to fill a unique gap in current research and application capabilities, providing a valuable resource for future studies and developments in these fields [14,15]. For a comprehensive overview of the datasets discussed, refer to Table 1, which summarizes the application domains, data modalities, and scale of data included in this special issue.

Table 1

Summary of datasets in the special issue

Paper title, objective of dataset usage, and web link
Application domain | Data modality | Number of samples
A dataset generation framework for symmetry-induced mechanical metamaterials [3]. Train machine-learning approaches for inverse design tools for truss materials. https://github.com/DreamLabUIC/Symmetry-Induced-Mechanical-Metamaterials
Metamaterial design | 3D model, tabular | 15,000
A dataset of 3M single-DOF planar 4-, 6-, and 8-bar linkage mechanisms with open and closed coupler curves for machine learning-driven path synthesis [11]. Train machine-learning models for path synthesis of linkage mechanisms. https://www.kaggle.com/datasets/purwarlab/four-six-and-eight-bar-mechanisms-with-curves
Mechanism design | CAD | 3,000,000
A method for synthesizing ontology-based textual design datasets: evaluating the potential of LLM in domain-specific dataset generation [10]. Generate synthetic textual datasets within the conceptual design domain using large language models (LLMs). https://figshare.com/articles/dataset/Synthetic_Design-Related_Data_Generated_by_LLMs/26122543
Engineering design | Text | 7200 sequences
A visual representation of engineering catalogs using variational autoencoders [1]. Map material data onto a visualizable and generative latent space. https://github.com/UW-ERSL/CatalogsVAE
Material selection | Tabular | 200+
Bi-level interturn short-circuit fault monitoring for wind turbine generators with benchmark dataset development [13]. Serve as a benchmark for condition monitoring fault diagnosis algorithms in wind turbine generators. https://doi.org/10.5281/zenodo.13625955
Energy systems | Time series | 76 time series
Customer segmentation and need analysis based on sentiment network of online reviewers and graph embedding [8]. Identify product attributes and customer needs from online reviews. https://github.com/wobuhuiya/Online-Review-Analysis
Product design | Text | 2986 reviews
Dataset on complex power systems: design for resilient transmission networks using a generative model [6]. Generate resilient power system network design. https://doi.org/10.5281/zenodo.14272556
Power systems | Graphs | 9 network datasets
DeepJEB: 3D deep learning-based synthetic jet engine bracket dataset [12]. Develop surrogate models for engineering design based on structural designs. https://www.narnia.ai/dataset
Jet engine bracket design | 3D model, tabular | 2138 designs
DrivAerNet: a parametric car dataset for data-driven aerodynamic design and prediction [16]. Support data-driven aerodynamic design and prediction. https://dataverse.harvard.edu/dataverse/DrivAerNet
Aerodynamic design | Tabular, image, 3D model | 4000 cars
Heterogeneous multi-source data fusion through input mapping and latent variable Gaussian process [14]. Improve predictive accuracy across heterogeneous data sources. https://doi.org/10.5281/zenodo.14681800
Engineering design and material optimization | Tabular | 3 case studies
HM-SYNC: a multimodal dataset of human interactions with advanced manufacturing machinery [5]. Enhance human-centered manufacturing design by analyzing human activity in manufacturing environments. https://huggingface.co/datasets/saluslab/HM-SYNC
Human–machine interaction in manufacturing | Images, text, time series, tabular | 1228 interactions
HUVER: The HyForm uncrewed vehicle engineering repository [2]. Develop predictive models and support human–AI collaboration in UAV design. https://huggingface.co/datasets/raiselab/HUVER
Unmanned aerial vehicle design | Text, image, 3D model, tabular | 6051 designs
Importance-induced customer segmentation using explainable machine learning [17]. Evaluate a customer segmentation methodology and analyze attributes of segmented customers in each network. https://doi.org/10.7910/DVN/3Z9BGT
Product design | Text, tabular | 3 review collections (2761, 16,258, 46,328)
Large language models for computer-aided design (LLM4CAD) fine-tuned: dataset and experiments [18]. Advance research on text-to-CAD generation. https://doi.org/10.18738/T8/KV7HON
Computer-aided design | Image, 3D model, text, tabular | 3325 parts
Large language models for predicting empathic accuracy between a designer and a user [7]. Evaluate the ability of large language models to predict empathic accuracy between designers and users. https://github.com/Oluwatoba-Fabunmi/Empathic-Accuracy
Collaborative design | Text, tabular | 443 user thoughts
Looking beyond self-reported cognitive load: comparing pupil diameter against self-reported cognitive load in design tasks [19]. Investigate the relationship between cognitive load and pupil diameter in design tasks. https://osf.io/xbczt
Engineering design | Text, tabular, time series (eye tracking) | 10 human subjects
MSEval: a dataset for material selection in conceptual design to evaluate algorithmic models [4]. Evaluate the abilities of LLMs for the task of material selection in conceptual design. https://huggingface.co/datasets/cmudrc/Material_Selection_Eval
Material selection | Text, tabular | 9648 answers
Presenting hackathon data for design research: a transcript dataset [9]. Investigate design, teamwork, and hackathon phenomena. https://doi.org/10.5683/SP3/7CCTYU
Collaborative design | Text | 908 speech segments
Topology-agnostic graph U-nets for scalar field prediction on unstructured meshes [15]. Identify regions on a part where a build may fail. https://github.com/kevinferg/graph-field-prediction
Additive manufacturing | 3D model, tabular | 24,880 parts

Ultimately, this special issue seeks to spark dialogue around best practices for managing, curating, publishing, and using datasets within the engineering design community. It also aims to inspire further research and development by making high-quality datasets accessible and promoting transparency in data use. The discussions and findings presented below build upon our collective experiences from this special issue and broader efforts in the field, leading us to explore in-depth the challenges and offer recommendations for future design datasets.

2 Challenges in Datasets for Design

The deployment of machine-learning and data-driven methodologies in engineering design is impeded by critical barriers related to the quality, availability, and applicability of datasets. This section examines each of these barriers and suggests possible strategies to address them.

Data Scarcity.

One of the most prominent challenges is the scarcity of publicly available, high-quality datasets tailored for engineering design research. While datasets in fields like computer vision and natural language processing have flourished, e.g., ImageNet [20], MNIST [21], and KITTI [22] to name a few, engineering design often deals with highly specialized data that is not readily accessible. Data scarcity also stems from costly computational simulations [16], time-consuming real-world experiments [19], and the need for human involvement [7]. Additionally, there is often a lack of incentives for practitioners to share proprietary data, which could significantly enhance the richness and applicability of public datasets. Quantifying design quality may involve digital twin modeling and simulation [15], which can require substantial computational resources. If real-world experiments are conducted, the setup and processing times can also be significant. Furthermore, the need for human annotations or manual expert labeling often limits the total number of experiments. This scarcity impedes the development and validation of data-driven models for design tasks.

Representation.

Engineering design encompasses a wide range of domains and subdisciplines, each with its unique data characteristics and representation formats [5,18]. Datasets need to account for this diversity, capturing not only geometric information but also material properties, functional requirements, manufacturing constraints, and user preferences. As shown in Table 1, design data encompasses everything from sketches to physical artifacts, including text, images, 3D representations (point clouds, meshes, voxels, parametric models), manufacturing codes, and temporal data. Even similar problems can be parametrized very differently, posing a challenge for unification and consistency.

Furthermore, variations in target settings can result in differing performance metrics, making it difficult to conduct comparative evaluations across design challenges. Moreover, efforts to standardize data representations for consistency across applications often result in significant information loss. Together, this diversity and the need for standardization present significant challenges for design data management.

Functional Performance and Constraints.

Including functional performance labels alongside a dataset of high-quality samples significantly enhances its value. For instance, historical geometric data for design artifacts offer insights into the design spaces and variations that have been explored. Adding functional performance metrics, such as design sensitivities that enable more efficient parameterizations or the construction of meta-models for optimization, or human evaluation results, greatly enriches the utility of these datasets, as seen in the DeepJEB [12] and power systems [6] datasets. However, capturing functional performance can be computationally intensive and expensive, is often constrained by the capabilities and accuracy of the computational models used (e.g., resolution and simulation fidelity), and may require extensive human input and experimental setups. Detailed documentation of the settings and constraints under which the performance data were gathered, whether through simulations, experiments, or evaluations, is essential to enhance dataset comparability and utility. Such documentation not only helps in assessing the reliability of the data but also facilitates the enhancement and completion of datasets by addressing performance gaps identified after initial data collection. This challenge of capturing diverse and comprehensive performance data applies across fields and affects the scalability and applicability of the resulting datasets.

Data Quality, Validation, and Real-World Gap.

Ensuring the quality and reliability of datasets is paramount for the trustworthiness of research findings. Design datasets may suffer from inconsistencies, errors, or biases that can undermine the validity of data-driven models. Robust data validation and cleaning processes are essential to guarantee the accuracy and consistency of datasets. Furthermore, computational simulations, such as digital twins used in engineering design and product development, play a critical role in decision-making through extensive what-if analyses and optimizations. As the fidelity and computational power of these models increase, it is crucial to maintain rigorous verification of their parameters to ensure they accurately represent real-world conditions. Common errors in simulation data that need attention include unrealistic boundary conditions, numerical inaccuracies, oversimplified assumptions in physical modeling, and errors in data integration [14]. These errors can lead to significant discrepancies between simulated outcomes and actual performance, potentially leading to flawed conclusions in downstream applications of machine learning. This underscores the need for a comprehensive approach that includes not only data collection and curation but also detailed validation against empirical data to ensure simulations provide reliable and actionable insights.

Bias.

Avoiding the negative effects of biased data on statistics and machine learning is a recurring challenge in dataset generation. For example, design and performance data collected from computationally costly optimization runs usually reflect only the small corridor of the design and search space that leads to an optimal design. Downstream machine-learning models may therefore achieve high accuracy in this corridor but generalize poorly. Likewise, regional influences, such as datasets collected predominantly in the Global North, can introduce inherent inequities into trained models. When collecting data, it is important to apply a reasonable design sampling strategy and design of experiments to balance a fair distribution against the number of generated samples. Data quality checks and outlier detection should also be applied to mitigate biased results as far as possible.
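
As a concrete illustration of such a sampling strategy, the sketch below draws a Latin hypercube sample over a hypothetical two-parameter design space (the parameter names and bounds are invented for illustration); stratifying each dimension spreads samples more evenly than plain uniform sampling and reduces sampling bias:

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng=None):
    """Space-filling Latin hypercube sample in the unit cube.

    Each dimension is split into n_samples equal strata and exactly one
    point is drawn per stratum, so no region of the space is oversampled.
    """
    rng = np.random.default_rng(rng)
    offsets = rng.random((n_samples, n_dims))  # position within each stratum
    strata = np.array([rng.permutation(n_samples) for _ in range(n_dims)]).T
    return (strata + offsets) / n_samples

# Hypothetical design space: beam width in [5, 50] mm, height in [10, 100] mm
samples = latin_hypercube(100, 2, rng=42)      # points in [0, 1)^2
lo, hi = np.array([5.0, 10.0]), np.array([50.0, 100.0])
designs = lo + samples * (hi - lo)             # scaled to physical bounds
```

Outlier detection and quality checks would then run on `designs` and the corresponding performance evaluations before dataset release.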

Ethical Considerations, Data Confidentiality, and Privacy.

Datasets may contain sensitive information about individuals, companies, or intellectual property. Researchers must address ethical considerations related to data privacy, ensuring that datasets are collected, stored, and shared responsibly. Understanding the legal and ethical frameworks surrounding data ownership, confidentiality, and usage is essential for ethical and responsible research practices. For instance, in Ref. [9], the authors obtained informed consent from all participants and ensured that the dataset was anonymized after collection and before publication. Furthermore, industrial data can rarely be disclosed due to confidentiality concerns, which are particularly acute in engineering. Developing methods to share data without disclosing private information is essential to tap into the knowledge held by engineering companies. Synthetic data transformation and federated learning could play an important role in this context.

As we move from identifying challenges to proposing recommendations, it becomes apparent that addressing these barriers requires strategic action, not just technical solutions. The forthcoming recommendations, based on our discussions, aim to address many of these issues to foster the development of better, more effective datasets in engineering design research.

3 Recommendations for Design Datasets

Looking ahead, we would like to highlight important themes for continuing to advance data-driven methods and datasets in engineering design.

Emphasize Domain Relevance and Specificity.

Datasets should be tailored to the specific needs of the design research community, focusing on data directly relevant to tasks and challenges encountered in real-world design scenarios. Two papers that highlight this concept are DeepJEB [12] and the linkage mechanisms dataset [11]. Both discuss the importance of generating data for a specific domain, such as jet engine brackets or planar linkage mechanisms, and demonstrate how these datasets can be used to solve problems using deep learning techniques. Since data generation is usually costly, we recommend recording, calculating, and storing additional data features that may prove useful for future use cases.

To enhance domain-relevant dataset availability, the creation of synthetic datasets is also recommended [23]. These datasets can replicate complex, real-world conditions that are often costly or impractical to capture directly. By employing large multimodal foundation models, advanced simulation tools, and diverse sampling methods, synthetic data can cover extensive design variations, operational scenarios, customer data, or functional requirements [10,24]. Validation against real-world data ensures their practical applicability for training robust models. For example, in Ref. [16], the authors compared car drag coefficients obtained from simulations at three different mesh resolutions with experimental values and reference simulations. Such efforts are crucial in areas where experimental data are scarce or difficult to obtain, thereby supporting the development of predictive models and AI-driven design tools.
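
As a minimal sketch of such a validation step, the following compares hypothetical simulated drag coefficients at three mesh resolutions against a single experimental reference value (all numbers are invented for illustration, not taken from Ref. [16]):

```python
# Hypothetical drag coefficients from simulations at three mesh
# resolutions for one car geometry, compared against a single
# experimental reference measurement.
simulated = {"coarse": 0.301, "medium": 0.289, "fine": 0.284}
experimental = 0.280

def relative_error(sim, ref):
    """Relative deviation of a simulated value from the reference."""
    return abs(sim - ref) / ref

errors = {mesh: relative_error(cd, experimental) for mesh, cd in simulated.items()}
# Flag which resolutions fall within an (assumed) 2% validation tolerance
validated = {mesh: err <= 0.02 for mesh, err in errors.items()}
```

Publishing such a per-fidelity error table alongside a synthetic dataset lets users judge which simulation tier is trustworthy for their application.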

Detail Context and Design Constraints.

In engineering design applications, the credibility of data hinges significantly on design context, such as how well boundary conditions, solver settings, and modeling assumptions mirror real-world scenarios. It is crucial for authors to meticulously detail the conditions under which each dataset was generated—be it through physical experiments, simulations, or a combination of methods—and include any cross-validation or calibration steps undertaken. This vital contextual information not only aids researchers in assessing the dataset’s relevance and suitability for their specific needs but also highlights discrepancies between simulated outcomes and actual performance.

Simulation data are abundant in engineering design (e.g., aerodynamics, power grids, manufacturing), but purely simulated datasets may not fully capture real-world behavior. Including a “validation tier” or smaller subset of real-world measurements—such as wind tunnel tests for aerodynamic models—helps quantify the gap between simulated and actual performance. Several papers [6,16] in the special issue emphasize the significance of explicitly outlining the simulation methods and providing the necessary information for users to rerun or modify simulations. When multifidelity data (low-/mid-/high-fidelity simulations) are available, clearly indicate the conditions and solver settings for each tier, allowing future users to match their own fidelity requirements and assess downstream accuracy.

Additionally, engineering datasets should comprehensively encode specific constraints (e.g., material limits, safety margins, regulatory requirements) and objectives (e.g., cost, weight, performance). Clearly articulating these constraints and objectives, beyond mere geometric or operational parameters, renders the datasets more directly useful for optimization and decision-making processes [4]. For instance, specifying allowable stress ranges or mandated design codes enables researchers to evaluate constraint-handling strategies and foster realistic, applicable design solutions. This integrated approach ensures that datasets are both deeply informative and highly applicable across various design scenarios.

Enhance Complexity, Diversity, and Practicality in Data Usage.

With the advancements in algorithms and computational power, machine-learning models are capable of learning data of higher complexity, dimensionality, and multimodal characteristics. Research such as Ref. [14] demonstrates that integrating diverse data sources enhances model accuracy. Researchers and practitioners should seize opportunities to identify and incorporate additional features or performance indicators that extend beyond the current core application target. Balancing the data to avoid outliers or biases by applying state-of-the-art design of experiment methods will further increase the value of the dataset for various downstream data science tasks [3]. In addition, rich datasets also motivate researchers and scientists to extend existing methods and develop new models to broaden the application spectrum. We recommend adopting open-access practices for nonproprietary data and encouraging collaboration across research domains to overcome data scarcity and enhance dataset diversity. Following the lead of existing design data repositories, we encourage researchers and institutions to establish and support similar platforms. This approach promotes wider data access and fosters a culture of shared innovation within the research community.

Another recommendation is to balance granularity and dataset size for surrogate modeling [15]. High-resolution simulations generate massive datasets (e.g., millions of mesh nodes in computational fluid dynamics or finite element analysis). While these data can train powerful surrogate models, storing every simulation output may be impractical. Guidance should be provided on how each subset or “downsampled” version (e.g., coarser meshes, aggregated performance metrics) can be utilized for scaled analysis, and it should be clarified how closely they track the high-resolution truth. This ensures that others can choose the right balance of granularity and computational feasibility for their needs.
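
A minimal sketch of this idea, using a synthetic scalar field in place of real simulation output, downsamples nodal values into block means and reports how well the coarse version tracks summary statistics of the high-resolution truth:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a high-resolution nodal field, e.g. a stress
# field sampled at one million mesh nodes.
field = rng.gamma(shape=2.0, scale=50.0, size=1_000_000)

def downsample(values, factor):
    """Aggregate runs of `factor` consecutive nodes into block means,
    a crude stand-in for coarsening a mesh."""
    n = len(values) // factor * factor
    return values[:n].reshape(-1, factor).mean(axis=1)

coarse = downsample(field, 100)  # 10,000 "coarse mesh" values

# Quantify how closely the downsampled release tracks the truth, so
# users can judge whether the coarse version suits their task.
summary = {
    "mean_rel_err": abs(coarse.mean() - field.mean()) / field.mean(),
    "max_rel_err": abs(coarse.max() - field.max()) / field.max(),
}
```

The mean survives block averaging almost exactly, while extreme values are smoothed away; documenting exactly such trade-offs for each released resolution is the point of the recommendation.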

Capture Complete Design Processes and Cross-Stage Data.

In engineering design, capturing data throughout the entire process—from initial concept through to operational stages—is crucial. Several papers in this issue highlight data from human-centered activities such as team-based hackathons [9], user interviews [7], and collaborative assembly tasks [5], underscoring the importance of documenting not just the final design artifact but also the process itself. This includes key decision points, intermediate prototypes, and user feedback, providing a comprehensive view that enhances our understanding of collaborative design behaviors, design thinking strategies, and empathic accuracy. Similarly, for products like power grids [6], wind turbines [13], and manufacturing lines [5] that span multiple design, production, and operational stages, it is beneficial to link datasets from upstream decisions (e.g., conceptual choices) to downstream impacts (e.g., maintenance records, quality metrics). By ensuring data continuity through consistent component identifiers or timestamps, researchers can easily integrate and explore these sources, supporting robust life-cycle analyses and design methods. Collectively, these practices not only enrich the data’s detail and applicability but also facilitate a more holistic understanding of the entire design and production life-cycle.
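
The linking idea can be sketched with plain records keyed by a shared component identifier (all records and field names here are hypothetical):

```python
# Hypothetical upstream design decisions and downstream operational
# records, keyed by a shared component identifier.
design = [
    {"component_id": "C1", "concept": "lattice", "design_mass_kg": 1.2},
    {"component_id": "C2", "concept": "solid",   "design_mass_kg": 2.5},
    {"component_id": "C3", "concept": "lattice", "design_mass_kg": 1.4},
]
operation = [
    {"component_id": "C1", "maintenance_events": 0},
    {"component_id": "C1", "maintenance_events": 1},
    {"component_id": "C3", "maintenance_events": 3},
]

# Aggregate downstream records per component, then join them back to
# the upstream decisions; components with no operational records keep
# None so coverage gaps stay visible rather than silently disappearing.
ops = {}
for rec in operation:
    ops[rec["component_id"]] = ops.get(rec["component_id"], 0) + rec["maintenance_events"]

lifecycle = [dict(d, maintenance_events=ops.get(d["component_id"])) for d in design]
```

With consistent identifiers in place, such joins extend naturally to timestamps, quality metrics, and any other life-cycle stage.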

Prepare for Domain-Specific Benchmarking.

Many articles introduce specialized datasets (e.g., CAD design [18], mechanical metamaterials [3], linkage mechanisms [11]). Specifying baseline tasks (e.g., standard optimization, regression, and classification problems) and performance metrics supports fair comparison and accelerates progress. Providing baseline results, such as recommended customer segmentation metrics [8,17], along with train and test splits, helps researchers quickly gauge whether new algorithms outperform the current state of the art.
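
One simple way to make a split reproducible is to derive the train/test assignment deterministically from a stable hash of each sample identifier, so every user of the dataset obtains the same partition; a sketch with invented sample IDs:

```python
import hashlib

def assign_split(sample_id, test_fraction=0.2):
    """Deterministic train/test assignment from a stable hash of the
    sample ID, so every user reproduces exactly the same partition."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical sample identifiers for a dataset of parts
sample_ids = [f"part-{i:04d}" for i in range(1000)]
assignments = {sid: assign_split(sid) for sid in sample_ids}
test_share = sum(a == "test" for a in assignments.values()) / len(assignments)
```

Unlike a seeded random shuffle, the assignment survives reordering, subsetting, or later additions to the dataset, since it depends only on each sample's identifier.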

Standardize Formats, Version Control, and Licensing Information.

Adhering to the Findable, Accessible, Interoperable, and Reusable (FAIR) principles is essential for datasets, as emphasized in several studies [2,4,11]. Maintaining consistent file formats (e.g., CSV, JSON, or HDF5) and clearly defined metadata is critical for promoting usability and interoperability across design research projects. For instance, the authors of the power systems dataset [6] provide all the information required to run power flow calculations, which allows researchers to rerun the problem calculation after changing parts of the data. In addition, applying version control—via platforms like Harvard Dataverse and GitHub—allows researchers to cite specific dataset releases and track modifications over time. Using clear, permissive licenses (e.g., CC-BY or CC0) also helps foster broader reuse and collaboration, while clearly delineating any restrictions tied to intellectual property or confidentiality concerns.
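
A minimal sketch of such a release pairs a CSV table with a JSON metadata sidecar carrying the fields users need to interpret and cite it (the dataset name, columns, and values below are hypothetical):

```python
import csv
import io
import json

# Hypothetical tabular dataset of bracket designs
rows = [
    {"design_id": "D001", "width_mm": 12.5, "max_stress_mpa": 241.0},
    {"design_id": "D002", "width_mm": 14.0, "max_stress_mpa": 198.5},
]
metadata = {
    "name": "example-bracket-dataset",  # hypothetical dataset name
    "version": "1.0.0",                 # cite-able release; bump on changes
    "license": "CC-BY-4.0",
    "units": {"width_mm": "millimetre", "max_stress_mpa": "megapascal"},
    "columns": list(rows[0].keys()),
}

# Serialize the table as CSV and the sidecar as JSON
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=metadata["columns"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buf.getvalue()
meta_text = json.dumps(metadata, indent=2)
```

Both artifacts would be committed together to a versioned repository, so a citation can pin the exact release used in an experiment.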

Ensure Comprehensive and Complete Documentation.

Along with the dataset, comprehensive documentation should be distributed, comprising a detailed description of the experimental setup and process parameters. The most valuable datasets are those that provide clear guidance on their intended applications, limitations, and any assumptions that might impact their external validity. For instance, the wind turbine generators dataset [13] exemplifies this by offering complete descriptions of the dataset, model, and metadata, which are accessible not only in the publication but also on the MATLAB/Simulink wind turbine example webpage and in a Zenodo record. This is especially relevant for design applications where simulation might deviate substantially from real-world behavior. Sharing open-source code or interactive notebooks that showcase basic data loading and exploratory analysis can significantly lower the barrier to adoption. This information enables reproduction with the same setup as well as cross-checks with different but similar settings, e.g., using a different simulation tool. While detailed documentation was explicitly required for this special issue, we hope the accepted papers set an example for making this practice the norm in the design community.

Adopt a Datasheet for Design Datasets.

As a final note, we advocate for the appropriate documentation of datasets. Generating and curating datasets in engineering design is a significant effort. Ensuring that these datasets can be reused is thus an important step toward amortizing this cost across the community. Inspired by the Datasheets for Datasets [25] practice, as demonstrated in Ref. [16], we propose extending this practice to engineering design, ensuring that all datasets are accompanied by detailed documentation that outlines their creation, content, and intended use. Based on the specific characteristics of our community, we suggest that the following datasheet template be added to the metadata of published datasets whenever applicable.

Dataset Name

Motivation.

  • Why was the dataset created?

  • Who created the dataset (team, individual)?

  • Who funded the creation of the dataset?

Composition.

  • What do the instances that comprise the dataset represent?

  • How many instances are there in total?

  • What data does each instance consist of?

  • Is there any missing data? If so, how much?

  • How is the data associated with each instance organized?

Collection Process.

  • How was the data collected?

  • Who was involved in the data collection process (e.g., annotators, researchers)?

  • Over what timeframe was the data collected?

  • How was it decided which data to collect and which to exclude?

Preprocessing/cleaning/labeling.

  • Was any preprocessing, cleaning, or labeling of the data done (e.g., normalization, annotation)?

  • If so, how was this done and by whom?

Uses.

  • For what tasks in engineering design is the dataset suitable?

  • Has the dataset been used for any tasks already? If so, which ones?

  • What performance metrics are relevant for assessing tasks using this dataset?

Distribution.

  • How is the dataset distributed (e.g., Dropbox, Harvard Dataverse)?

  • Are there any restrictions or licenses on its use?

  • What is the duration of availability for the dataset as provided by the authors?

Maintenance.

  • Who is responsible for dataset maintenance?

  • How can individuals submit corrections or updates?

  • Is there a versioning system in place? If so, what is it?
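The template above can also be rendered in a machine-readable form so that it ships inside a dataset's metadata rather than only in a paper. The sketch below is one possible JSON-serializable structure mirroring the template's sections; every field value is a placeholder, not a description of any real dataset.

```python
import json

# Machine-readable datasheet skeleton mirroring the template's sections.
# All values are placeholders to be filled in by dataset authors.
datasheet = {
    "dataset_name": "example-dataset",
    "motivation": {
        "purpose": "why the dataset was created",
        "creators": ["team or individual"],
        "funding": "funding source",
    },
    "composition": {
        "instance_representation": "what each instance represents",
        "num_instances": 0,
        "instance_fields": [],
        "missing_data": "none",
    },
    "collection_process": {
        "method": "how the data was collected",
        "participants": ["annotators", "researchers"],
        "timeframe": "YYYY-MM to YYYY-MM",
        "inclusion_criteria": "how data was selected or excluded",
    },
    "preprocessing": {"steps": [], "performed_by": ""},
    "uses": {"suitable_tasks": [], "prior_uses": [], "relevant_metrics": []},
    "distribution": {"platform": "", "license": "", "availability": ""},
    "maintenance": {"maintainer": "", "update_process": "", "versioning": ""},
}

# Serialize next to the data files, e.g., as datasheet.json.
datasheet_json = json.dumps(datasheet, indent=2)
```

Storing the datasheet as structured data (rather than free text) makes it possible for repositories and tooling to check that required fields are present before a dataset is published.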

4 Conclusion

The availability of information-rich and high-quality datasets is a prerequisite for advancing data-driven methods, including machine learning, in engineering design. In addition to novel data-driven approaches for various engineering domains, the research articles contributing to this special issue provide detailed discussions of the datasets employed, highlighting the effort required to overcome the challenges of dataset creation, curation, and application within design research. As in other scientific fields, this calls for the design and management of such datasets to become an integral part of the research process, supported by adequate resources, methodologies, and practices tailored to the specific needs of engineering design research. Further, with this special issue, we hope to promote the appreciation and recognition of datasets as scientific contributions in their own right. Data sharing not only allows for the validation of results and collaboration with other researchers, but also fosters future work across multiple scientific disciplines. Finally, this calls for the adoption of best practices and regulations regarding ethical considerations, data privacy, and data ownership to foster dataset creation and reuse among all involved stakeholders. We hope that this special issue has successfully initiated the discussion regarding the challenges behind the design and management of datasets and their role in the engineering design research process. We also hope that we have inspired further research and development of data-driven and machine-learning methods to advance engineering design.

Acknowledgment

We would like to thank Dr. Michael Kokkolaras for his guidance during the call for proposals and valuable feedback on the editorial. We also express our gratitude to Dr. Carolyn Seepersad for her invaluable guidance and input at every stage of this special issue.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

No data, models, or code were generated or used for this article.

References

1. Sridhara, S., and Krishnan, S., 2025, "A Visual Representation of Engineering Catalogs Using Variational Autoencoders," ASME J. Mech. Des., 147(4), p. 041708.

2. Karri, A., Stump, G., McComb, C., and Song, B., 2025, "HUVER: The HyForm Uncrewed Vehicle Engineering Repository," ASME J. Mech. Des., 147(4).

3. Abu-Mualla, M., and Huang, J., 2025, "A Dataset Generation Framework for Symmetry-Induced Mechanical Metamaterials," ASME J. Mech. Des., 147(4), p. 041705.

4. Jain, Y. P., Grandi, D., Groom, A., Cramer, B., and McComb, C., 2025, "MSEval: A Dataset for Material Selection in Conceptual Design to Evaluate Algorithmic Models," ASME J. Mech. Des., 147(4), p. 044502.

5. Martins, J., Lin, C., Flanigan, K. A., and McComb, C., 2025, "HM-SYNC: A Multimodal Dataset of Human Interactions With Advanced Manufacturing Machinery," ASME J. Mech. Des., 147(4).

6. Chung, I.-B., and Wang, P., 2025, "Dataset on Complex Power Systems: Design for Resilient Transmission Networks Using a Generative Model," ASME J. Mech. Des., 147(4).

7. Fabunmi, O., Halgamuge, S., Beck, D., and Holtta-Otto, K., 2025, "Large Language Models for Predicting Empathic Accuracy Between a Designer and a User," ASME J. Mech. Des., 147(4), p. 041401.

8. Shen, M., Feng, B., Cheng, A., and Bi, Y., 2025, "Customer Segmentation and Need Analysis Based on Sentiment Network of Online Reviewers and Graph Embedding," ASME J. Mech. Des., 147(4), p. 041706.

9. Flus, M., Litster, G., and Olechowski, A., 2025, "Presenting Hackathon Data for Design Research: A Transcript Dataset," ASME J. Mech. Des., 147(4).

10. Qiu, Y., and Jin, Y., 2025, "A Method for Synthesizing Ontology-Based Textual Design Datasets: Evaluating the Potential of LLM in Domain-Specific Dataset Generation," ASME J. Mech. Des., 147(4).

11. Nurizada, A., Dhaipule, R., Lyu, Z., and Purwar, A., 2025, "A Dataset of 3M Single-DOF Planar 4-, 6-, and 8-Bar Linkage Mechanisms With Open and Closed Coupler Curves for Machine Learning-Driven Path Synthesis," ASME J. Mech. Des., 147(4), p. 041702.

12. Hong, S., Kwon, Y., Shin, D., Park, J., and Kang, N., 2025, "DeepJEB: 3D Deep Learning-Based Synthetic Jet Engine Bracket Dataset," ASME J. Mech. Des., 147(4), p. 041703.

13. Yan, J., Senemmar, S., and Zhang, J., 2025, "Bi-Level Interturn Short-Circuit Fault Monitoring for Wind Turbine Generators With Benchmark Dataset Development," ASME J. Mech. Des., 147(4), p. 041704.

14. Comlek, Y., Ravi, S. K., Pandita, P., Ghosh, S., Wang, L., and Chen, W., 2025, "Heterogeneous Multi-Source Data Fusion Through Input Mapping and Latent Variable Gaussian Process," ASME J. Mech. Des., 147(4).

15. Ferguson, K., Chen, Y.-h., Chen, Y., Gillman, A., Hardin, J., and Kara, L. B., 2025, "Topology-Agnostic Graph U-Nets for Scalar Field Prediction on Unstructured Meshes," ASME J. Mech. Des., 147(4), p. 041701.

16. Elrefaie, M., Dai, A., and Ahmed, F., 2025, "DrivAerNet: A Parametric Car Dataset for Data-Driven Aerodynamic Design and Prediction," ASME J. Mech. Des., 147(4).

17. Park, S., Jiang, Y., and Kim, H., 2025, "Importance-Induced Customer Segmentation Using Explainable Machine Learning," ASME J. Mech. Des., 147(4), p. 044501.

18. Sun, Y., Li, X., and Sha, Z., 2025, "Large Language Models for Computer-Aided Design (LLM4CAD) Fine-Tuned: Dataset and Experiments," ASME J. Mech. Des., 147(4).

19. Cass, M., and Prabhu, R., 2025, "Looking Beyond Self-Reported Cognitive Load: Comparing Pupil Diameter Against Self-Reported Cognitive Load in Design Tasks," ASME J. Mech. Des., 147(4).

20. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., 2009, "ImageNet: A Large-Scale Hierarchical Image Database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, June 20–25, pp. 248–255.

21. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., 1998, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, 86(11), pp. 2278–2324.

22. Geiger, A., Lenz, P., and Urtasun, R., 2012, "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite," Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, June 16–21, pp. 3354–3361.

23. Picard, C., Schiffmann, J., and Ahmed, F., 2023, "DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications," International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Boston, MA, Aug. 20–23, Vol. 87301, American Society of Mechanical Engineers, p. V03AT03A015.

24. Rad, M. A., Hajali, T., Bonde, J. M., Panarotto, M., Wärmefjord, K., Malmqvist, J., and Isaksson, O., 2024, "Datasets in Design Research: Needs and Challenges and the Role of AI and GPT in Filling the Gaps," Proc. Design Soc., 4, pp. 1919–1928.

25. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K., 2021, "Datasheets for Datasets," Commun. ACM, 64(12), pp. 86–92.