Abstract

Multi-modal machine learning (MMML) is advancing rapidly, and its ability to combine multiple data modalities has the potential to reshape many applications. This paper presents a comprehensive overview of the current state, advancements, and challenges of MMML in engineering design. The review begins by examining five fundamental concepts of MMML: multi-modal information representation, fusion, alignment, translation, and co-learning. We then explore state-of-the-art applications of MMML, with particular emphasis on tasks pertinent to engineering design, such as cross-modal synthesis, multi-modal prediction, and cross-modal information retrieval. From this overview, we highlight the inherent challenges of adopting MMML in engineering design and propose potential directions for future research. To advance MMML in engineering design, we advocate concentrated efforts to construct extensive multi-modal design datasets, develop effective data-driven MMML techniques tailored to design applications, and enhance the scalability and interpretability of MMML models. As the next generation of intelligent design tools, MMML models hold promise for changing how products are designed.

References

1.
Bengio
,
Y.
,
Courville
,
A.
, and
Vincent
,
P.
,
2012
, “
Representation Learning: A Review and New Perspectives
,”
IEEE Trans. Pattern Anal. Mach. Intell.
,
35
(
8
), pp.
1798
1828
.
2.
Bhattacharjee
,
K. S.
,
Singh
,
H. K.
, and
Ray
,
T.
,
2018
, “
Multiple Surrogate-Assisted Many-Objective Optimization for Computationally Expensive Engineering Design
,”
ASME J. Mech. Des.
,
140
(
5
), p.
051403
.
3.
Zhu
,
Q.
,
Zhang
,
X.
, and
Luo
,
J.
,
2023
, “
Biologically Inspired Design Concept Generation Using Generative Pre-Trained Transformers
,”
ASME J. Mech. Des.
,
145
(
4
), p.
041409
.
4.
Zhu
,
Q.
, and
Luo
,
J.
,
2023
, “
Generative Transformers for Design Concept Generation
,”
ASME J. Comput. Inf. Sci. Eng.
,
23
(
4
), pp.
1
61
.
5.
Nobari
,
A. H.
,
Chen
,
W.
, and
Ahmed
,
F.
,
2021
, “
PcDGAN: A Continuous Conditional Diverse Generative Adversarial Network for Inverse Design
,”
27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
,
Singapore
,
Aug. 14–18
, pp.
610
616
.
6.
Luo
,
J.
,
Sarica
,
S.
, and
Wood
,
K. L.
,
2021
, “
Guiding Data-Driven Design Ideation by Knowledge Distance
,”
Knowl. Based Syst.
,
218
, p.
106873
.
7.
Meltzer
,
P.
,
Lambourne
,
J. G.
, and
Grandi
,
D.
,
2024
, “
What’s in a Name? Evaluating Assembly-Part Semantic Knowledge in Language Models Through User-Provided Names in Computer Aided Design Files
,”
ASME J. Comput. Inf. Sci. Eng.
,
24
(
1
), p.
011002
.
8.
Song
,
B.
,
Miller
,
S.
, and
Ahmed
,
F.
,
2023
, “
Attention-Enhanced Multimodal Learning for Conceptual Design Evaluations
,”
ASME J. Mech. Des.
,
145
(
4
), p.
041410
.
9.
Feng
,
Y.
,
Li
,
M.
,
Lou
,
S.
,
Zheng
,
H.
,
Gao
,
Y.
, and
Tan
,
J.
,
2021
, “
A Digital Twin-Driven Method for Product Performance Evaluation Based on Intelligent Psycho-Physiological Analysis
,”
ASME J. Comput. Inf. Sci. Eng.
,
21
(
3
), p.
031002
.
10.
Nobari
,
A. H.
,
Chen
,
W.
, and
Ahmed
,
F.
,
2021
, “
Range-GAN: Range-Constrained Generative Adversarial Network for Conditioned Design Synthesis
,”
Proceedings of the ASME Design Engineering Technical Conference
, Vol.
85390
, p.
V03BT03A039
.
11.
Regenwetter
,
L.
,
Abu Obaideh
,
Y.
, and
Ahmed
,
F.
,
2023
, “
Counterfactuals for Design: A Model-Agnostic Method For Design Recommendations
,”
International Design Engineering Technical Conferences & Computers and Information in Engineering Conference
,
Boston, MA
,
Aug 20–23
.
12.
Song
,
B.
,
McComb
,
C.
, and
Ahmed
,
F.
,
2022
, “
Assessing Machine Learnability of Image and Graph Representations for Drone Performance Prediction
,”
Proc. Des. Soc.
,
2
, pp.
1777
1786
.
13.
Gero
,
J. S.
,
1990
, “
Design Prototypes: A Knowledge Representation Schema for Design
,”
AI Mag.
,
11
(
4
), p.
26
.
14.
Tseng
,
W. S.
, and
Ball
,
L. J.
,
2011
, “How Uncertainty Helps Sketch Interpretation in a Design Task,”
Design Creativity
,
T.
Taura
, and
Y.
Nagai
, eds.,
Springer
,
London
, pp.
257
264
.
15.
Häggman
,
A.
,
Tsai
,
G.
,
Elsen
,
C.
,
Honda
,
T.
, and
Yang
,
M. C.
,
2015
, “
Connections Between the Design Tool, Design Attributes, and User Preferences in Early Stage Design
,”
ASME J. Mech. Des.
,
137
(
7
), p.
071408
.
16.
Tsai
,
G.
, and
Yang
,
M. C.
,
2017
, “
How It Is Made Matters: Distinguishing Traits of Designs Created by Sketches, Prototypes, and CAD
,”
International Design Engineering Technical Conferences and Computers and Information in Engineering Conference
, Vol.
58219
, p.
V007T06A037
.
17.
Purcell
,
A. T.
, and
Gero
,
J. S.
,
1998
, “
Drawings and the Design Process: A Review of Protocol Studies in Design and Other Disciplines and Related Research in Cognitive Psychology
,”
Des. Stud.
,
19
(
4
), pp.
389
430
.
18.
Ullman
,
D. G.
,
Wood
,
S.
, and
Craig
,
D.
,
1990
, “
The Importance of Drawing in the Mechanical Design Process
,”
Comput. Graph.
,
14
(
2
), pp.
263
274
.
19.
Chang
,
Y. S.
,
Chien
,
Y. H.
,
Lin
,
H. C.
,
Chen
,
M. Y.
, and
Hsieh
,
H. H.
,
2016
, “
Effects of 3D CAD Applications on the Design Creativity of Students With Different Representational Abilities
,”
Comput. Human Behav.
,
65
, pp.
107
113
.
20.
Atilola
,
O.
,
Tomko
,
M.
, and
Linsey
,
J. S.
,
2016
, “
The Effects of Representation on Idea Generation and Design Fixation: A Study Comparing Sketches and Function Trees
,”
Des. Stud.
,
42
, pp.
110
136
.
21.
Hannibal
,
C.
,
Brown
,
A.
, and
Knight
,
M.
,
2016
, “
An Assessment of the Effectiveness of Sketch Representations in Early Stage Digital Design
,”
Int. J. Archit. Comput.
,
3
(
1
), pp.
107
125
.
22.
Atilola
,
O.
, and
Linsey
,
J.
,
2015
, “
Representing Analogies to Influence Fixation and Creativity: A Study Comparing Computer-Aided Design, Photographs, and Sketches
,”
Artif. Intell. Eng. Des. Anal. Manuf.
,
29
(
2
), pp.
161
171
.
23.
Reid
,
T. N.
,
MacDonald
,
E. F.
, and
Du
,
P.
,
2013
, “
Impact of Product Design Representation on Customer Judgment
,”
ASME J. Mech. Des.
,
135
(
9
), p.
091008
.
24.
Yang
,
M. C.
,
2005
, “
A Study of Prototypes, Design Activity, and Design Outcome
,”
Des. Stud.
,
26
(
6
), pp.
649
669
.
25.
McKoy
,
F. L.
,
Vargas-Hernández
,
N.
,
Summers
,
J. D.
, and
Shah
,
J. J.
,
2020
, “
Influence of Design Representation on Effectiveness of Idea Generation
,”
International Design Engineering Technical Conferences & Computers and Information in Engineering Conference
,
Virtual
,
Aug. 17–19
, Vol. 80258, pp.
39
48
.
26.
Grace
,
K.
,
Maher
,
M. L.
,
Fisher
,
D.
, and
Brady
,
K.
,
2014
, “
Data-Intensive Evaluation of Design Creativity Using Novelty, Value, and Surprise
,”
Int. J. Des. Creat. Innov.
,
3
(
3–4
), pp.
125
147
.
27.
Nomaguchi
,
Y.
,
Kawahara
,
T.
,
Shoda
,
K.
, and
Fujita
,
K.
,
2019
, “
Assessing Concept Novelty Potential With Lexical and Distributional Word Similarity for Innovative Design
,”
Proc. Des. Soc. Int. Conf. Eng. Des.
,
1
(
1
), pp.
1413
1422
.
28.
Xu
,
H.
,
Liu
,
R.
,
Choudhary
,
A.
, and
Chen
,
W.
,
2015
, “
A Machine Learning-Based Design Representation Method for Designing Heterogeneous Microstructures
,”
ASME J. Mech. Des.
,
137
(
5
), p.
051403
.
29.
Wood
,
K.
, and
Otto
,
K.
,
2001
,
Product Design: Techniques in Reverse Engineering and New Product Development.
,
Pearson
,
London, UK
.
30.
Ciavola
,
B. T.
,
Wu
,
C.
, and
Gershenson
,
J. K.
,
2015
, “
Integrating Function- and Affordance-Based Design Representations
,”
ASME J. Mech. Des.
,
137
(
5
), p.
051101
.
31.
Ulrich
,
K. T.
, and
Eppinger
,
S. D.
,
2000
,
Product Design and Development
,
McGraw-Hill
,
New York
.
32.
Fiorineschi
,
L.
,
Frillici
,
F. S.
, and
Rotini
,
F.
,
2018
, “
Issues Related to Missing Attributes in Aposteriori Novelty Assessments
,”
Proc. Int. Des. Conf.
,
3
(
1
), pp.
1067
1078
.
33.
Rosen
,
D. W.
,
Dixon
,
J. R.
, and
Finger
,
S.
,
1994
, “
Conversions of Feature-Based Design Representations Using Graph Grammar Parsing
,”
ASME J. Mech. Des.
,
116
(
3
), pp.
785
792
.
34.
Yukish
,
M. A.
,
Stump
,
G. M.
, and
Miller
,
S. W.
,
2020
, “
Using Recurrent Neural Networks to Model Spatial Grammars for Design Creation
,”
ASME J. Mech. Des.
,
142
(
10
), p.
104501
.
35.
Wyatt
,
D. F.
,
Wynn
,
D. C.
, and
John Clarkson
,
P.
,
2014
, “
A Scheme for Numerical Representation of Graph Structures in Engineering Design
,”
ASME J. Mech. Des.
,
136
(
1
), p.
011010
.
36.
Saadi
,
J. I.
, and
Yang
,
M. C.
,
2023
, “
Generative Design: Reframing the Role of the Designer in Early-Stage Design Process
,”
ASME J. Mech. Des.
,
145
(
4
), p.
041411
.
37.
Veisz
,
D.
,
Namouz
,
E. Z.
,
Joshi
,
S.
, and a
Summers
,
J. D.
,
2012
, “
Computer-Aided Design Versus Sketching: An Exploratory Case Study
,”
Artif. Intell. Eng. Des. Anal. Manuf.
,
26
(
3
), pp.
317
335
.
38.
Babapour
,
M.
,
Ornas
,
V. H. A.
,
Rexfelt
,
O.
, and
Rahe
,
U.
,
2014
, “
Media and Representations in Product Design Education
,”
International Conference on Engineering and Product Design Education
,
The Netherlands
,
Sept. 4–5
, pp.
42
47
.
39.
Kokko
,
E. J.
,
Martz
,
H. E.
,
Chinn
,
D. J.
,
Childs
,
H. R.
,
Jackson
,
J. A.
,
Chambers
,
D. H.
,
Schneberk
,
D. J.
, and
Clark
,
G. A.
,
2006
, “
As-Built Modeling of Objects for Performance Assessment
,”
ASME J. Comput. Inf. Sci. Eng.
,
6
(
4
), pp.
405
417
.
40.
Zhang
,
X.
,
Liu
,
L.
,
Wan
,
X.
, and
Feng
,
B.
,
2021
, “
Tool Wear Online Monitoring Method Based on DT and SSAE-PHMM
,”
ASME J. Comput. Inf. Sci. Eng.
,
21
(
3
), p.
034501
.
41.
Baltrusaitis
,
T.
,
Ahuja
,
C.
, and
Morency
,
L. P.
,
2019
, “
Multimodal Machine Learning: A Survey and Taxonomy
,”
IEEE Trans. Pattern Anal. Mach. Intell.
,
41
(
2
), pp.
423
443
.
42.
Zhang
,
C.
,
Yang
,
Z.
,
He
,
X.
, and
Deng
,
L.
,
2019
, “
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
,”
IEEE J. Select. Top. Signal Process.
,
14
(
3
), pp.
478
493
.
43.
Cui
,
C.
,
Yang
,
H.
,
Wang
,
Y.
,
Zhao
,
S.
,
Asad
,
Z.
,
Coburn
,
L. A.
,
Wilson
,
K. T.
,
Landman
,
B. A.
, and
Huo
,
Y.
,
2022
, “
Deep Multi-Modal Fusion of Image and Non-Image Data in Disease Diagnosis and Prognosis: A Review
,”
Progr. Biomed. Eng.
,
5
(
2
), p.
022001
.
44.
Li
,
X.
,
Wang
,
Y.
, and
Sha
,
Z.
,
2023
, “
Deep-Learning Methods of Cross-Modal Tasks for Conceptual Design of Product Shapes: A Review
,”
ASME J. Mech. Des.
,
145
(
4
), p.
041401
.
45.
Dhariwal
,
P.
, and
Nichol
,
A.
,
2021
, “
Diffusion Models Beat GANs on Image Synthesis
,”
Adv. Neural Inf. Process. Syst.
,
11
, pp.
8780
8794
.
46.
Nichol
,
A. Q.
,
Dhariwal
,
P.
,
Ramesh
,
A.
,
Shyam
,
P.
,
Mishkin
,
P.
,
Mcgrew
,
B.
,
Sutskever
,
I.
, and
Chen
,
M.
,
2022
, “
GLIDE: Towards Photorealistic Image Generation and Editing With Text-Guided Diffusion Models
,”
39 th International Conference on Machine Learning
,
Baltimore, MD
,
July 17– 23
, Vol. 162, pp.
16784
16804
.
47.
Kim
,
G.
,
Kwon
,
T.
, and
Ye
,
J. C.
,
2021
, “
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
,”
Computer Vision and Pattern Recognition Conference (CVPR)
,
New Orleans, LA
,
June 19–23
, pp.
2426
2435
.
48.
Frome
,
A.
,
Corrado
,
G. S.
,
Shlens
,
J.
,
Bengio
,
S.
,
Dean
,
J.
,
Ranzato
,
M.
, and
Mikolov
,
T.
,
2013
, “
DeViSE: A Deep Visual-Semantic Embedding Model
,”
26th International Conference on Neural Information Processing Systems
,
Lake Tahoe, NV
,
Dec. 5–10
.
49.
Rajendran
,
J.
,
Khapra
,
M. M.
,
Chandar
,
S.
, and
Ravindran
,
B.
,
2016
, “
Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning
,”
2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
,
San Diego, CA
,
June
, pp.
171
178
.
50.
Srivastava
,
N.
, and
Salakhutdinov
,
R. R.
,
2012
, “
Multimodal Learning With Deep Boltzmann Machines
,”
Advances in Neural Information Processing Systems 26
,
Lake Tahoe, NV
,
Dec. 5–10
, pp.
171
178
.
51.
Duc Tuan
,
N. M.
, and
Quang Nhat Minh
,
P.
,
2021
, “
Multimodal Fusion With BERT and Attention Mechanism for Fake News Detection
,”
2021 RIVF International Conference on Computing and Communication Technologies
,
Hanoi, Vietnam
,
May 6–8
.
52.
Song
,
B.
,
Miller
,
S.
, and
Ahmed
,
F.
,
2022
, “
Hey, AI! Can You See What I See? Multimodal Transfer Learning-Based Design Metrics Prediction for Sketches With Text Descriptions
,”
International Design Engineering Technical Conferences and Computers and Information in Engineering Conference
, Vol.
86267
,
American Society of Mechanical Engineers
, p.
V006T06A017
.
53.
Yuan
,
C.
,
Marion
,
T.
, and
Moghaddam
,
M.
,
2022
, “
Leveraging End-User Data for Enhanced Design Concept Evaluation: A Multimodal Deep Regression Model
,”
ASME J. Mech. Des.
,
144
(
2
), p.
021403
.
54.
Nguyen
,
D. K.
, and
Okatani
,
T.
,
2018
, “
Multi-Task Learning of Hierarchical Vision-Language Representation
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–22
, pp.
10484
10493
.
55.
Li
,
G.
,
Duan
,
N.
,
Fang
,
Y.
,
Gong
,
M.
, and
Jiang
,
D.
,
2020
, “
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training
,”
AAAI Conference on Artificial Intelligence
,
New York
,
Feb. 7–12
, pp.
11336
11344
.
56.
Su
,
W.
,
Zhu
,
X.
,
Cao
,
Y.
,
Li
,
B.
,
Lu
,
L.
,
Wei
,
F.
, and
Dai
,
J.
,
2019
, “
VL-BERT: Pre-Training of Generic Visual-Linguistic Representations
.”
57.
Li
,
L. H.
,
Yatskar
,
M.
,
Yin
,
D.
,
Hsieh
,
C.-J.
, and
Chang
,
K.-W.
,
2019
, “
VisualBERT: A Simple and Performant Baseline for Vision and Language
.”
58.
Alberti
,
C.
,
Ling
,
J.
,
Collins
,
M.
, and
Reitter
,
D.
,
2019
, “
Fusion of Detected Objects in Text for Visual Question Answering
,”
2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
,
Hongkong, China
,
Nov. 3–7
, Vol. 8, pp.
2131
2140
.
59.
Sun
,
C.
,
Myers
,
A.
,
Vondrick
,
C.
,
Murphy
,
K.
, and
Schmid
,
C.
,
2019
, “
VideoBERT: A Joint Model for Video and Language Representation Learning
,”
IEEE International Conference on Computer Vision
,
Seoul, South Korea
,
Oct. 27– Nov. 2
, Vol. 4, pp.
7463
7472
.
60.
Ngiam
,
J.
,
Khosla
,
A.
,
Kim
,
M.
,
Nam
,
J.
,
Lee
,
H.
, and
Ng
,
A. Y.
,
2011
, “
Multimodal Deep Learning
,”
28th International Conference on Machine Learning
,
Bellevue, WA
,
June 28– July 2
.
61.
Silberer
,
C.
, and
Lapata
,
M.
,
2014
, “
Learning Grounded Meaning Representations With Autoencoders
,”
52nd Annual Meeting of the Association for Computational Linguistics
,
Baltimore, MD
,
June 22–27
, Vol. 1, pp.
721
732
.
62.
Feng
,
F.
,
Wang
,
X.
, and
Li
,
R.
,
2014
, “
Cross-Modal Retrieval With Correspondence Autoencoder
,”
2014 ACM Conference on Multimedia
,
Orlando, FL
,
Nov. 3–7
, pp.
7
16
.
63.
Radford
,
A.
,
Kim
,
J. W.
,
Hallacy
,
C.
,
Ramesh
,
A.
,
Goh
,
G.
,
Agarwal
,
S.
,
Sastry
,
G.
,
Askell
,
A.
,
Mishkin
,
P.
,
Clark
,
J.
,
Krueger
,
G.
, and
Sutskever
,
I.
,
2021
, “
Learning Transferable Visual Models From Natural Language Supervision
,”
International Conference on Machine Learning
,
Virtual
,
July 18–24
, pp.
8748
8763
.
64.
Andrew
,
G.
,
Arora
,
R.
,
Bilmes
,
J.
, and
Livescu
,
K.
,
2013
, “
Deep Canonical Correlation Analysis
,”
30th International Conference on Machine Learning
,
Atlanta, GA
,
June 16–21
, pp.
1247
1255
.
65.
Yang
,
X.
,
Ramesh
,
P.
,
Chitta
,
R.
,
Madhvanath
,
S.
,
Bernal
,
E. A.
, and
Luo
,
J.
,
2017
, “
Deep Multimodal Representation Learning From Temporal Data
,”
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
,
Honolulu, HI
,
July 21–26
, pp.
5447
5455
.
66.
Bachman
,
P.
,
Hjelm
,
D.
, and
Buchwalter
,
W.
,
2019
, “
Learning Representations by Maximizing Mutual Information Across Views
,”
33rd International Conference on Neural Information Processing Systems
,
Vancouver, Canada
,
Dec. 8–14
, pp.
15535
15545
.
67.
Zhang
,
Y.
,
Jiang
,
H.
,
Miura
,
Y.
,
Manning
,
C. D.
, and
Langlotz
,
C. P.
,
2020
, “
Contrastive Learning of Medical Visual Representations From Paired Images and Text
,”
Proc. Mach. Learn. Res.
,
182
, pp.
1
24
.
68.
Kiros
,
R.
,
Salakhutdinov
,
R.
, and
Zemel
,
R. S.
,
2014
, “
Unifying Visual-Semantic Embeddings With Multimodal Neural Language Models
.”
69.
Huang
,
P.-S.
,
He
,
X.
,
Gao
,
J.
,
Deng
,
L.
,
Acero
,
A.
, and
Heck
,
L.
,
2013
, “
Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data
,”
22nd ACM International Conference on Information & Knowledge Management
,
San Francisco, CA
,
Oct. 27–Nov. 1
, pp.
2333
2338
.
70.
Karpathy
,
A.
, and
Fei-Fei
,
L.
,
2014
, “
Deep Visual-Semantic Alignments for Generating Image Descriptions
,”
IEEE Trans. Pattern Anal. Mach. Intell.
,
39
(
4
), pp.
664
676
.
71.
Karpathy
,
A.
,
Joulin
,
A.
, and
Fei-Fei
,
L.
,
2014
, “
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
,”
Adv. Neural Inf. Process. Syst.
,
3
(
Jan.
), pp.
1889
1897
.
72.
Wu
,
H.
,
Mao
,
J.
,
Zhang
,
Y.
,
Jiang
,
Y.
,
Li
,
L.
,
Sun
,
W.
, and
Ma
,
W. Y.
,
2019
, “
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, pp.
6602
6611
.
73.
Plummer
,
B. A.
,
Wang
,
L.
,
Cervantes
,
C. M.
,
Caicedo
,
J. C.
,
Hockenmaier
,
J.
, and
Lazebnik
,
S.
,
2015
, “
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
,”
Int. J. Comput. Vision
,
123
(
1
), pp.
74
93
.
74.
Tan
,
H.
, and
Bansal
,
M.
,
2019
, “
LXMERT: Learning Cross-Modality Encoder Representations From Transformers
,”
2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
,
Hongkong, China
,
Nov. 3–7
, pp.
5100
5111
.
75.
Lu
,
J.
,
Batra
,
D.
,
Parikh
,
D.
, and
Lee
,
S.
,
2019
, “
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
,”
33rd Conference on Neural Information Processing Systems
,
Vancouver, Canada
,
Dec. 8–14
.
76.
Pramanik
,
S.
,
Agrawal
,
P.
, and
Hussain
,
A.
,
2019
, “
OmniNet: A Unified Architecture for Multi-modal Multi-task Learning
.
77.
Sbrolli
,
C.
,
Cudrano
,
P.
,
Frosi
,
M.
, and
Matteucci
,
M.
,
2022
, “
IC3D: Image-Conditioned 3D Diffusion for Shape Generation
.”
78.
Nojavanasghari
,
B.
,
Gopinath
,
D.
,
Koushik
,
J.
,
Baltrušaitis
,
T.
, and
Morency
,
L. P.
,
2016
, “
Deep Multimodal Fusion for Persuasiveness Prediction
,”
18th ACM International Conference on Multimodal Interaction
,
Tokyo, Japan
,
Nov. 10–18
, pp.
284
288
.
79.
Anastasopoulos
,
A.
,
Kumar
,
S.
, and
Liao
,
H.
,
2019
, “
Neural Language Modeling With Visual Features
,”
Undefined.
80.
Vielzeuf
,
V.
,
Lechervy
,
A.
,
Pateux
,
S.
, and
Jurie
,
F.
,
2019
, “
CentralNet: A Multilayer Approach for Multimodal Fusion
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
11134
, LNCS, pp.
575
589
.
81.
Liu
,
B.
,
Liu
,
X.
,
Yang
,
Z.
, and
Wang
,
C. C. L.
,
2022
, “
Concise and Effective Network for 3D Human Modeling From Orthogonal Silhouettes
,”
ASME J. Comput. Inf. Sci. Eng.
,
22
(
5
), p.
051004
.
82.
Shutova
,
E.
,
Kiela
,
D.
, and
Maillard
,
J.
,
2016
, “
Black Holes and White Rabbits: Metaphor Identification With Visual Features
,”
2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
,
San Diego, CA
,
June 12–17
, pp.
160
170
.
83.
Cao
,
Y.
,
Long
,
M.
,
Wang
,
J.
,
Yang
,
Q.
, and
Yuy
,
P. S.
,
2016
, “
Deep Visual-Semantic Hashing for Cross-Modal Retrieval
,”
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
,
San Francisco, CA
,
Aug. 13–17
, pp.
1445
1454
.
84.
Sikka
,
K.
,
Dykstra
,
K.
,
Sathyanarayana
,
S.
,
Littlewort
,
G.
, and
Bartlett
,
M.
,
2013
, “
Multiple Kernel Learning for Emotion Recognition in the Wild
,”
2013 ACM International Conference on Multimodal Interaction
,
Sydney, Australia
,
Dec. 9–13
, pp.
517
524
.
85.
Morvant
,
E.
,
Habrard
,
A.
, and
Ayache
,
S.
,
2014
,
Majority Vote of Diverse Classifiers for Late Fusion
, Vol.
8621 LNCS
,
Springer Verlag
,
Berlin/Heidelberg
, pp.
153
162
.
86.
Perez-Rua
,
J. M.
,
Vielzeuf
,
V.
,
Pateux
,
S.
,
Baccouche
,
M.
, and
Jurie
,
F.
,
2019
, “
MFAS: Multimodal Fusion Architecture Search
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, pp.
6959
6968
.
87.
Zhou
,
T.
,
Thung
,
K. H.
,
Zhu
,
X.
, and
Shen
,
D.
,
2019
, “
Effective Feature Learning and Fusion of Multimodality Data Using Stage-Wise Deep Neural Network for Dementia Diagnosis
,”
Human Brain Map.
,
40
(
3
), pp.
1001
1016
.
88.
Zoph
,
B.
, and
Le
,
Q. V.
,
2016
, “
Neural Architecture Search With Reinforcement Learning
,”
5th International Conference on Learning Representations
,
Toulon, France
,
Apr. 24–26
.
89.
Tenenbaum
,
J. B.
, and
Freeman
,
W. T.
,
2000
, “
Separating Style and Content With Bilinear Models
,”
Neur. Comput.
,
12
(
6
), pp.
1247
1283
.
90.
Zadeh
,
A.
,
Chen
,
M.
,
Cambria
,
E.
,
Poria
,
S.
, and
Morency
,
L. P.
,
2017
, “
Tensor Fusion Network for Multimodal Sentiment Analysis
,”
Conference on Empirical Methods in Natural Language Processing
,
Copenhagen, Denmark
,
Sept. 7– 11
, pp.
1103
1114
.
91.
Chen
,
R. J.
,
Lu
,
M. Y.
,
Wang
,
J.
,
Williamson
,
D. F.
,
Rodig
,
S. J.
,
Lindeman
,
N. I.
, and
Mahmood
,
F.
,
2019
, “
Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis
,”
IEEE Trans. Med. Imag.
,
41
(
4
), pp.
757
770
.
92.
Kim
,
J.-H.
,
On
,
K.-W.
,
Lim
,
W.
,
Kim
,
J.
,
Ha
,
J.-W.
, and
Zhang
,
B.-T.
,
2017
, “
Hadamard Product for Low-Rank Bilinear Pooling
.”
93.
Yu
,
Z.
,
Yu
,
J.
,
Fan
,
J.
, and
Tao
,
D.
,
2017
, “
Multi-Modal Factorized Bilinear Pooling With Co-Attention Learning for Visual Question Answering
,”
IEEE International Conference on Computer Vision
,
Venice, Italy
,
Oct. 22–29
, pp.
1839
1848
.
94.
Yu
,
Z.
,
Yu
,
J.
,
Xiang
,
C.
,
Fan
,
J.
, and
Tao
,
D.
,
2017
, “
Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering
,”
IEEE Trans. Neur. Netw. Learn. Syst.
,
29
(
12
), pp.
5947
5959
.
95.
Gao
,
Y.
,
Beijbom
,
O.
,
Zhang
,
N.
, and
Darrell
,
T.
,
2015
, “
Compact Bilinear Pooling
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Boston, MA
,
June 7–12
, pp.
317
326
.
96.
Fukui
,
A.
,
Park
,
D. H.
,
Yang
,
D.
,
Rohrbach
,
A.
,
Darrell
,
T.
, and
Rohrbach
,
M.
,
2016
, “
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
,”
Conference on Empirical Methods in Natural Language Processing
,
Austin, TX
,
Nov. 1–5
, pp.
457
468
.
97.
Ben-Younes
,
H.
,
Cadene
,
R.
,
Cord
,
M.
, and
Thome
,
N.
,
2017
, “
MUTAN: Multimodal Tucker Fusion for Visual Question Answering
,”
IEEE International Conference on Computer Vision
,
Venice, Italy
,
Oct. 22–29
, pp.
2631
2639
.
98.
Tucker
,
L. R.
,
1966
, “
Some Mathematical Notes on Three-Mode Factor Analysis
,”
Psychometrika
,
31
(
3
), pp.
279
311
.
99.
Ben-Younes
,
H.
,
Cadene
,
R.
,
Thome
,
N.
, and
Cord
,
M.
,
2019
, “
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection
,”
33rd AAAI Conference on Artificial Intelligence
,
Honolulu, HI
,
Jan. 27–Feb. 1
, pp.
8102
8109
.
100.
Jiang
,
S.
,
Hu
,
J.
,
Magee
,
C. L.
, and
Luo
,
J.
,
2022
, “
Deep Learning for Technical Document Classification
,”
IEEE Trans. Eng. Manage.
, pp.
1
17
.
101.
Parisot
,
S.
,
Ktena
,
S. I.
,
Ferrante
,
E.
,
Lee
,
M.
,
Guerrero
,
R.
,
Glocker
,
B.
, and
Rueckert
,
D.
,
2018
, “
Disease Prediction Using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer’s Disease
,”
Med. Image Anal.
,
48
, pp.
117
130
.
102.
Cao
,
M.
,
Yang
,
M.
,
Qin
,
C.
,
Zhu
,
X.
,
Chen
,
Y.
,
Wang
,
J.
, and
Liu
,
T.
,
2021
, “
Using DeepGCN to Identify the Autism Spectrum Disorder From Multi-site Resting-state Data
,”
Biomed. Signal Process. Contr.
,
70
, p.
103015
.
103.
Baltrusaitis
,
T.
,
Ahuja
,
C.
, and
Morency
,
L. P.
,
2017
, “
Multimodal Machine Learning: A Survey and Taxonomy
,”
IEEE Trans. Pattern Anal. Mach. Intell.
,
41
(
2
), pp.
423
443
.
104.
Vaswani
,
A.
,
Shazeer
,
N.
,
Parmar
,
N.
,
Uszkoreit
,
J.
,
Jones
,
L.
,
Gomez
,
A. N.
,
Kaiser
,
L.
, and
Polosukhin
,
I.
,
2017
, “
Attention is All You Need
,”
Advances in Neural Information Processing Systems
,
Long Beach, CA
,
Dec. 4–9
, pp.
5999
6009
.
105.
Graves
,
A.
,
Wayne
,
G.
, and
Danihelka
,
I.
,
2014
, “
Neural Turing Machines
.”
106.
Bahdanau
,
D.
,
Cho
,
K.
, and
Bengio
,
Y.
,
2014
, “
Neural Machine Translation by Jointly Learning to Align and Translate
,”
3rd International Conference on Learning Representations
,
San Diego, CA
,
May 7–9
.
107.
Zhu
,
Y.
,
Groth
,
O.
,
Bernstein
,
M.
, and
Fei-Fei
,
L.
,
2016
, “
Visual7W: Grounded Question Answering in Images
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Las Vegas, NV
,
June 27–30
, pp.
4995
5004
.
108.
Shih
,
K. J.
,
Singh
,
S.
, and
Hoiem
,
D.
,
2015
, “
Where To Look: Focus Regions for Visual Question Answering
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Boston, MA
,
June 7–12
, pp.
4613
4621
.
109.
Xu
,
H.
, and
Saenko
,
K.
,
2015
, “
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
9911
, LNCS, pp.
451
466
.
110.
Anderson
,
P.
,
He
,
X.
,
Buehler
,
C.
,
Teney
,
D.
,
Johnson
,
M.
,
Gould
,
S.
, and
Zhang
,
L.
,
2017
, “
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Honolulu, HI
,
July 21–26
, pp.
6077
6086
.
111.
Mansimov
,
E.
,
Parisotto
,
E.
,
Ba
,
J. L.
, and
Salakhutdinov
,
R.
,
2015
, “
Generating Images From Captions With Attention
,”
4th International Conference on Learning Representations
,
San Juan, Puerto Rico
,
May 2–4
.
112.
Xu
,
T.
,
Zhang
,
P.
,
Huang
,
Q.
,
Zhang
,
H.
,
Gan
,
Z.
,
Huang
,
X.
, and
He
,
X.
,
2018
, “
AttnGAN: Fine-Grained Text to Image Generation With Attentional Generative Adversarial Networks
,”
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
,
Salt Lake City, UT
,
June 18–23
, pp.
1316
1324
.
113.
Li
,
W.
,
Zhang
,
P.
,
Zhang
,
L.
,
Huang
,
Q.
,
He
,
X.
,
Lyu
,
S.
, and
Gao
,
J.
,
2019
, “
Object-Driven Text-to-Image Synthesis Via Adversarial Training
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, pp.
12166
12174
.
114.
Nam
,
H.
,
Ha
,
J.-W.
, and
Kim
,
J.
,
2017
, “
Dual Attention Networks for Multimodal Reasoning and Matching
,”
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
,
Honolulu, HI
,
July 21–26
, pp.
2156
2164
..
115.
Elsen
,
C.
,
Häggman
,
A.
,
Honda
,
T.
, and
Yang
,
M. C.
,
2016
, “
Hierarchical Question-Image Co-Attention for Visual Question Answering
,”
30th International Conference on Neural Information Processing Systems
,
Barcelona Spain
,
Dec. 5–10
, pp.
737
747
.
116.
Osman
,
A.
, and
Samek
,
W.
,
2018
, “
Dual Recurrent Attention Units for Visual Question Answering
,”
Comput. Vision Imag. Understand.
,
185
, pp.
24
30
.
117.
Schwartz
,
I.
,
Schwing
,
A. G.
, and
Hazan
,
T.
,
2017
, “
High-Order Attention Models for Visual Question Answering
,”
Advances in Neural Information Processing Systems
,
Long Beach, CA
,
Dec. 4– 9
, pp.
3665
3675
.
118.
Yang
,
Z.
,
He
,
X.
,
Gao
,
J.
,
Deng
,
L.
, and
Smola
,
A.
,
2015
, “
Stacked Attention Networks for Image Question Answering
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Boston, MA
,
June 7– 12
, pp.
21
29
.
119.
Fan
,
H.
, and
Zhou
,
J.
,
2018
, “
Stacked Latent Attention for Multimodal Reasoning
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–22
, pp.
1072
1080
.
120.
Xiong
,
C.
,
Merity
,
S.
, and
Socher
,
R.
,
2016
, “
Dynamic Memory Networks for Visual and Textual Question Answering
,”
33rd International Conference on Machine Learning
,
New York
,
June 20–22
, Vol. 5, pp.
3574
3583
.
121.
Ren
,
S.
,
He
,
K.
,
Girshick
,
R.
, and
Sun
,
J.
,
2015
, “
Faster R-CNN: Towards Real-Time Object Detection With Region Proposal Networks
,”
Advances in Neural Information Processing Systems
,
Montreal, Canada
,
Dec. 7–12
, p.
6
.
122.
Lu
,
P.
,
Li
,
H.
,
Zhang
,
W.
,
Wang
,
J.
, and
Wang
,
X.
,
2018
, “
Co-Attending Free-Form Regions and Detections With Multi-modal Multiplicative Feature Embedding for Visual Question Answering
,”
32nd AAAI Conference on Artificial Intelligence
,
New Orleans, LA
,
Feb. 2–7
, pp.
7218
7225
.
123.
Rombach
,
R.
,
Blattmann
,
A.
,
Lorenz
,
D.
,
Esser
,
P.
, and
Ommer
,
B.
,
2021
, “
High-Resolution Image Synthesis With Latent Diffusion Models
,”
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
,
Virtual
,
June 19–25
, pp.
10674
10685
.
124.
Baevski
,
A.
,
Hsu
,
W.-N.
,
Xu
,
Q.
,
Babu
,
A.
,
Gu
,
J.
, and
Auli
,
M.
,
2022
, “
Data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language
.”
125.
Kim
,
J. H.
,
Lee
,
S. W.
,
Kwak
,
D.
,
Heo
,
M. O.
,
Kim
,
J.
,
Ha
,
J. W.
, and
Zhang
,
B. T.
,
2016
, “
Multimodal Residual Learning for Visual QA
,”
Advances in Neural Information Processing Systems
, Vol.
6
, pp.
361
369
.
126.
Arevalo
,
J.
,
Solorio
,
T.
,
Montes-Y-Gómez
,
M.
, and
González
,
F. A.
,
2017
, “
Gated Multimodal Units for Information Fusion
,”
5th International Conference on Learning Representations, ICLR 2017 – Workshop Track Proceedings
,
2
.
127.
Noh
,
H.
,
Seo
,
P. H.
, and
Han
,
B.
,
2015
, “
Image Question Answering Using Convolutional Neural Network With Dynamic Parameter Prediction
,”
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
, pp.
30
38
.
128.
Oh
,
S.
,
Jung
,
Y.
,
Kim
,
S.
,
Lee
,
I.
, and
Kang
,
N.
,
2019
, “
Deep Generative Design: Integration of Topology Optimization and Generative Models
,”
ASME J. Mech. Des.
,
141
(
11
), p.
111405
.
129.
Chen
,
Q.
,
Wang
,
J.
,
Pope
,
P.
,
Chen
,
W.
, and
Fuge
,
M.
,
2022
, “
Inverse Design of Two-Dimensional Airfoils Using Conditional Generative Models and Surrogate Log-Likelihoods
,”
ASME J. Mech. Des.
,
144
(
2
), p.
021712
.
130.
Tolstikhin
,
I.
,
Bousquet
,
O.
,
Schölkopf
,
B.
,
Thierbach
,
K.
,
Bazin
,
P. L.
,
de Back
,
W.
, and
Gavriilidis
,
F.
, et al
,
2014
, “
Generative Adversarial Networks
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), NeurIPS
, Vol.
11046
, LNCS, pp.
1
9
.
131.
Mirza
,
M.
, and
Osindero
,
S.
,
2014
, “
Conditional Generative Adversarial Nets
.”
132.
Reed
,
S.
,
Akata
,
Z.
,
Yan
,
X.
,
Logeswaran
,
L.
,
Schiele
,
B.
, and
Lee
,
H.
,
2016
, “
Generative Adversarial Text to Image Synthesis
,”
33rd International Conference on Machine Learning
,
New York
,
June 20–22
, pp.
1681
1690
.
133.
Zhang
,
H.
,
Xu
,
T.
,
Li
,
H.
,
Zhang
,
S.
,
Wang
,
X.
,
Huang
,
X.
, and
Metaxas
,
D.
,
2016
, “
StackGAN: Text to Photo-Realistic Image Synthesis With Stacked Generative Adversarial Networks
,” pp.
5908
5916
.
134.
Zhang
,
H.
,
Xu
,
T.
,
Li
,
H.
,
Zhang
,
S.
,
Wang
,
X.
,
Huang
,
X.
, and
Metaxas
,
D. N.
,
2019
, “
StackGAN++: Realistic Image Synthesis With Stacked Generative Adversarial Networks
,”
IEEE Trans. Pattern Anal. Mach. Intell.
,
41
(
8
), pp.
1947
1962
.
135.
Zhu
,
M.
,
Pan
,
P.
,
Chen
,
W.
, and
Yang
,
Y.
,
2019
, “
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, Vol. 4, pp.
5795
5803
.
136.
Zhang
,
Z.
,
Xie
,
Y.
, and
Yang
,
L.
,
2018
, “
Photographic Text-to-Image Synthesis With a Hierarchically-Nested Adversarial Network
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–22
, Vol. 2, pp.
6199
6208
.
137.
Dash
,
A.
,
Gamboa
,
J. C. B.
,
Ahmed
,
S.
,
Liwicki
,
M.
, and
Afzal
,
M. Z.
,
2017
, “
TAC-GAN – Text Conditioned Auxiliary Classifier Generative Adversarial Network
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Honolulu, HI
,
June 21–26
.
138.
Cha
,
M.
,
Gwon
,
Y.
, and
Kung
,
H. T.
,
2019
, “
Adversarial Learning of Semantic Relevance in Text to Image Synthesis
,”
Thirty-Third AAAI Conference on Artificial Intelligence
,
Honolulu, HI
,
Jan. 27–Feb. 1
.
139.
Qiao
,
T.
,
Zhang
,
J.
,
Xu
,
D.
, and
Tao
,
D.
,
2019
, “
MirrorGAN: Learning Text-to-Image Generation by Redescription
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, Vol. 3, pp.
1505
1514
.
140.
Reed
,
S.
,
Akata
,
Z.
,
Mohan
,
S.
,
Tenka
,
S.
,
Schiele
,
B.
, and
Lee
,
H.
,
2016
, “
Learning What and Where to Draw
,”
Advances in Neural Information Processing Systems
,
Barcelona, Spain
,
Dec. 5–10
, Vol. 10, pp.
217
225
.
141.
Zhao
,
B.
,
Meng
,
L.
,
Yin
,
W.
, and
Sigal
,
L.
,
2018
, “
Image Generation From Layout
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–22
, Vol. 11, pp.
8576
8585
.
142.
Hinz
,
T.
,
Heinrich
,
S.
, and
Wermter
,
S.
,
2019
, “
Generating Multiple Objects at Spatially Distinct Locations
,”
7th International Conference on Learning Representations
,
New Orleans, LA
,
May 6–9
.
143.
Hong
,
S.
,
Yang
,
D.
,
Choi
,
J.
, and
Lee
,
H.
,
2018
, “
Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–22
, Vol. 1, pp.
7986
7994
.
144.
Johnson
,
J.
,
Gupta
,
A.
, and
Fei-Fei
,
L.
,
2018
, “
Image Generation From Scene Graphs
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–22
, Vol. 4, pp.
1219
1228
.
145.
Mao
,
J.
,
Xu
,
W.
,
Yang
,
Y.
,
Wang
,
J.
,
Huang
,
Z.
, and
Yuille
,
A.
,
2014
, “
Deep Captioning With Multimodal Recurrent Neural Networks (m-RNN)
,”
3rd International Conference on Learning Representations
,
San Diego, CA
,
May 7–9
.
146.
van den Oord
,
A.
,
Vinyals
,
O.
, and
Kavukcuoglu
,
K.
,
2017
, “
Neural Discrete Representation Learning
,”
Advances in Neural Information Processing Systems
,
Long Beach, CA
,
Dec. 4–9
.
147.
Sanghi
,
A.
,
Chu
,
H.
,
Lambourne
,
J. G.
,
Wang
,
Y.
,
Cheng
,
C.-Y.
,
Fumero
,
M.
, and
Malekshan
,
K. R.
,
2022
, “
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
,”
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
,
New Orleans, LA
,
June 18–24
, pp.
18582
18592
.
148.
Shetty
,
R.
,
Rohrbach
,
M.
,
Hendricks
,
L. A.
,
Fritz
,
M.
, and
Schiele
,
B.
,
2017
, “
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
,”
IEEE International Conference on Computer Vision
,
Venice, Italy
,
Oct. 22–29
, Vol. 3, pp.
4155
4164
.
149.
Ajit
,
A.
,
Acharya
,
K.
, and
Samanta
,
A.
,
2020
, “
A Review of Convolutional Neural Networks
,”
International Conference on Emerging Trends in Information Technology and Engineering
,
Vellore, India
,
Feb. 24–25
.
150.
Li
,
Z.
,
Liu
,
F.
,
Yang
,
W.
,
Peng
,
S.
, and
Zhou
,
J.
,
2021
, “
A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects
,”
IEEE Trans. Neur. Netw. Learning Syst.
,
33
(
12
), pp.
6999
7019
.
151.
Fathi
,
E.
, and
Maleki Shoja
,
B.
,
2018
, “
Deep Neural Networks for Natural Language Processing
,”
Handb. Statist.
,
38
, pp.
229
316
.
152.
Mikolov
,
T.
,
Chen
,
K.
,
Corrado
,
G. S.
,
Dean
,
J.
,
Sutskever
,
I.
,
Chen
,
K.
,
Corrado
,
G. S.
, and
Dean
,
J.
,
2013
, “
Distributed Representations of Words and Phrases and Their Compositionality
,”
Advances in Neural Information Processing Systems
,
Lake Tahoe, NV
,
Dec. 5–10
, pp.
1
9
.
153.
Yagcioglu
,
S.
,
Erdem
,
E.
,
Erdem
,
A.
, and
Çakici
,
R.
,
2015
, “
A Distributed Representation Based Query Expansion Approach for Image Captioning
,”
53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Beijing, China, July 26–31.
154.
Cordonnier
,
J.-B.
,
Loukas
,
A.
, and
Jaggi
,
M.
,
2020
, “
On the Relationship Between Self-Attention and Convolutional Layers
,”
International Conference on Learning Representations
,
Addis Ababa, Ethiopia
,
Apr. 26–30
.
155.
Dosovitskiy
,
A.
,
Beyer
,
L.
,
Kolesnikov
,
A.
,
Weissenborn
,
D.
,
Zhai
,
X.
,
Unterthiner
,
T.
, and
Dehghani
,
M.
,
2021
, “
An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale
,”
International Conference on Learning Representations
,
Vienna, Austria
,
May 3–7
.
156.
Wang
,
Y.
,
Xu
,
J.
, and
Sun
,
Y.
,
2022
, “
End-to-End Transformer Based Model for Image Captioning
,”
Proc. AAAI Conf. Artif. Intell.
,
36
(
3
), pp.
2585
2594
.
157.
Han
,
X.
,
Wang
,
Y.-T.
,
Feng
,
J.-L.
,
Deng
,
C.
,
Chen
,
Z.-H.
,
Huang
,
Y.-A.
,
Su
,
H.
,
Hu
,
L.
, and
Hu
,
P.-W.
,
2023
, “
A Survey of Transformer-Based Multimodal Pre-Trained Modals
,”
Neurocomputing
,
515
, pp.
89
106
.
158.
Sohl-Dickstein
,
J.
,
Weiss
,
E. A.
,
Maheswaranathan
,
N.
, and
Ganguli
,
S.
,
2015
, “
Deep Unsupervised Learning Using Nonequilibrium Thermodynamics
,”
32nd International Conference on Machine Learning
,
Lille, France
,
July 6–11
, pp.
2246
2255
.
159.
Purwar
,
A.
, and
Chakraborty
,
N.
,
2023
, “
Deep Learning-Driven Design of Robot Mechanisms
,”
ASME J. Comput. Inf. Sci. Eng.
,
23
(
6
), p.
060811
.
160.
Ho
,
J.
,
Jain
,
A.
, and
Abbeel
,
P.
,
2020
, “
Denoising Diffusion Probabilistic Models
,”
Advances in Neural Information Processing Systems
,
Virtual
,
Dec. 6–12
.
161.
Song
,
J.
,
Meng
,
C.
, and
Ermon
,
S.
,
2021
, “
Denoising Diffusion Implicit Models
,”
International Conference on Learning Representations
,
Virtual
,
May 3–7
.
162.
Song
,
Y.
,
Sohl-Dickstein
,
J.
,
Kingma
,
D. P.
,
Kumar
,
A.
,
Ermon
,
S.
, and
Poole
,
B.
,
2020
, “
Score-Based Generative Modeling Through Stochastic Differential Equations
.”
163.
Vahdat
,
A.
,
Kreis
,
K.
, and
Kautz
,
J.
,
2021
, “
Score-Based Generative Modeling in Latent Space
,”
Advances in Neural Information Processing Systems
,
Virtual
,
Dec. 6–14
, pp.
11287
11302
.
164.
Luo
,
S.
, and
Hu
,
W.
,
2021
, “
Diffusion Probabilistic Models for 3D Point Cloud Generation
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Virtual
,
June 19–25
, Vol. 3, pp.
2836
2844
.
165.
Zhou
,
L.
,
Du
,
Y.
, and
Wu
,
J.
,
2021
, “
3D Shape Generation and Completion Through Point-Voxel Diffusion
,”
IEEE International Conference on Computer Vision
,
Virtual
,
Oct. 11–17
, Vol. 4, pp.
5806
5815
.
166.
Zeng
,
X.
,
Vahdat
,
A.
,
Williams
,
F.
,
Gojcic
,
Z.
,
Litany
,
O.
,
Fidler
,
S.
, and
Kreis
,
K.
,
2022
, “
LION: Latent Point Diffusion Models for 3D Shape Generation
,”
Neural Information Processing Systems
,
New Orleans, LA
,
Nov. 28–Dec. 9
.
167.
Liu
,
Z.
,
Tang
,
H.
,
Lin
,
Y.
, and
Han
,
S.
,
2019
, “
Point-Voxel CNN for Efficient 3D Deep Learning
,”
Advances in Neural Information Processing Systems
,
Vancouver, BC, Canada
,
Dec. 8–14
, Vol. 32, p.
7
.
168.
Ho
,
J.
, and
Salimans
,
T.
,
2022
, “
Classifier-Free Diffusion Guidance
,”
NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications
,
Virtual
,
Dec. 6–14
, Vol. 7.
169.
Nichol
,
A.
,
Jun
,
H.
,
Dhariwal
,
P.
,
Mishkin
,
P.
, and
Chen
,
M.
,
2022
, “
Point-E: A System for Generating 3D Point Clouds From Complex Prompts
.”
170.
Ramesh
,
A.
,
Dhariwal
,
P.
,
Nichol
,
A.
,
Chu
,
C.
, and
Chen
,
M.
,
2022
, “
Hierarchical Text-Conditional Image Generation With CLIP Latents
.”
171.
Mao
,
J.
,
Huang
,
J.
,
Toshev
,
A.
,
Camburu
,
O.
,
Yuille
,
A.
, and
Murphy
,
K.
,
2015
, “
Generation and Comprehension of Unambiguous Object Descriptions
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Boston, MA
,
June 7–12
, Vol. 11, pp.
11
20
.
172.
Vinyals
,
O.
,
Toshev
,
A.
,
Bengio
,
S.
, and
Erhan
,
D.
,
2014
, “
Show and Tell: A Neural Image Caption Generator
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Columbus, OH
,
June 23–28
, Vol. 11, pp.
3156
3164
.
173.
Rohrbach
,
A.
,
Rohrbach
,
M.
, and
Schiele
,
B.
,
2015
, “
The Long-Short Story of Movie Description
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
9358
, pp.
209
221
.
174.
Zheng
,
Y.
,
Bao
,
X.
,
Zhao
,
F.
,
Chen
,
C.
,
Liu
,
Y.
,
Sun
,
B.
, and
Wang
,
H.
,
2022
, “
Prediction of Remaining Useful Life Using Fused Deep Learning Models: A Case Study of Turbofan Engines
,”
ASME J. Comput. Inf. Sci. Eng.
,
22
(
5
), p.
054501
.
175.
Yu
,
J.
,
Xu
,
Y.
,
Koh
,
J. Y.
,
Luong
,
T.
,
Baid
,
G.
,
Wang
,
Z.
, and
Vasudevan
,
V.
,
2022
, “
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
,”
ArXiv
.
176.
Ding
,
M.
,
Yang
,
Z.
,
Hong
,
W.
,
Zheng
,
W.
,
Zhou
,
C.
,
Yin
,
D.
, and
Lin
,
J.
,
2021
, “
CogView: Mastering Text-to-Image Generation Via Transformers
,”
Advances in Neural Information Processing Systems
,
Virtual
,
Dec. 6–14
, Vol. 24, pp.
19822
19835
.
177.
Desai
,
K.
, and
Johnson
,
J.
,
2020
, “
VirTex: Learning Visual Representations From Textual Annotations
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Seattle, WA
,
June 16–18
, Vol. 6, pp.
11157
11168
.
178.
Bulent Sariyildiz
,
M.
,
Perez
,
J.
,
Larlus
,
D.
,
Sariyildiz
,
M. B.
,
Perez
,
J.
, and
Larlus
,
D.
,
2020
, “
Learning Visual Representations With Caption Annotations
,”
European Conference on Computer Vision (ECCV)
,
Virtual
,
Aug. 23–28
, pp.
153
170
.
179.
Dinh
,
L.
,
Sohl-Dickstein
,
J.
, and
Bengio
,
S.
,
2017
, “
Density Estimation Using Real NVP
,”
International Conference on Learning Representations
,
Apr. 24–26
.
180.
Wei
,
Y.
,
Vosselman
,
G.
, and
Yang
,
M. Y.
,
2022
, “
Flow-Based GAN for 3D Point Cloud Generation From a Single Image
.”
181.
Chen
,
Z.
, and
Zhang
,
H.
,
2018
, “
Learning Implicit Fields for Generative Shape Modeling
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–22
, Vol. 12, pp.
5932
5941
.
182.
Liu
,
S.
,
Saito
,
S.
,
Chen
,
W.
, and
Li
,
H.
,
2019
, “
Learning to Infer Implicit Surfaces Without 3D Supervision
,”
Advances in Neural Information Processing Systems
,
Vancouver, BC, Canada
,
Dec. 8–14
, Vol. 32, p.
11
.
183.
Park
,
J. J.
,
Florence
,
P.
,
Straub
,
J.
,
Newcombe
,
R.
, and
Lovegrove
,
S.
,
2019
, “
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, pp.
165
174
.
184.
Salimans
,
T.
,
Goodfellow
,
I.
,
Zaremba
,
W.
,
Cheung
,
V.
,
Radford
,
A.
, and
Chen
,
X.
,
2016
, “
Improved Techniques for Training GANs
,”
Advances in Neural Information Processing Systems
,
Barcelona, Spain
,
Dec. 5–10
, Vol. 6, pp.
2234
2242
.
185.
Heusel
,
M.
,
Ramsauer
,
H.
,
Unterthiner
,
T.
,
Nessler
,
B.
, and
Hochreiter
,
S.
,
2017
, “
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
,”
Advances in Neural Information Processing Systems
,
Long Beach, CA
,
Dec. 4–9
, 6, pp. 6627–6638.
186.
Odena
,
A.
,
Olah
,
C.
, and
Shlens
,
J.
,
2016
, “
Conditional Image Synthesis With Auxiliary Classifier GANs
,”
34th International Conference on Machine Learning
,
Sydney, Australia
,
Aug. 6–11
, Vol. 6, pp.
4043
4055
.
187.
Li
,
B.
,
Qi
,
X.
,
Lukasiewicz
,
T.
, and
Torr
,
P. H.
,
2019
, “
ManiGAN: Text-Guided Image Manipulation
,”
Conference on Computer Vision and Pattern Recognition (CVPR)
,
Long Beach, CA
,
June 15–20
, Vol. 12, pp.
7877
7886
.
188.
Achlioptas
,
P.
,
Diamanti
,
O.
,
Mitliagkas
,
I.
, and
Guibas
,
L.
,
2017
, “
Learning Representations and Generative Models for 3D Point Clouds
,”
35th International Conference on Machine Learning
,
Stockholm, Sweden
,
July 11–15
, Vol. 1, pp.
67
85
.
189.
Shu
,
D.
,
Park
,
S. W.
, and
Kwon
,
J.
,
2019
, “
3D Point Cloud Generative Adversarial Network Based on Tree Structured Graph Convolutions
,”
IEEE International Conference on Computer Vision
,
Seoul, South Korea
,
Oct. 27–Nov. 2
, Vol. 5, pp.
3858
3867
.
190.
Ibing
,
M.
,
Lim
,
I.
, and
Kobbelt
,
L.
,
2021
, “
3D Shape Generation With Grid-Based Implicit Functions
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Virtual
,
June 19–25
, Vol. 7, pp.
13554
13563
.
191.
Socher
,
R.
,
Ganjoo
,
M.
,
Sridhar
,
H.
,
Bastani
,
O.
,
Manning
,
C. D.
, and
Ng
,
A. Y.
,
2013
, “
Zero-Shot Learning Through Cross-Modal Transfer
,”
1st International Conference on Learning Representations
,
Scottsdale, AZ
,
May 2–4
, Vol. 1.
192.
Tsai
,
Y.-H. H.
,
Liang
,
P. P.
,
Zadeh
,
A.
,
Morency
,
L.-P.
, and
Salakhutdinov
,
R.
,
2018
, “
Learning Factorized Multimodal Representations
,”
7th International Conference on Learning Representations
,
New Orleans, LA
,
May 6–9
, Vol. 6.
193.
Ba
,
L. J.
,
Swersky
,
K.
,
Fidler
,
S.
, and
Salakhutdinov
,
R.
,
2015
, “
Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions
,”
2015 IEEE International Conference on Computer Vision, ICCV 2015
, Santiago, Chile, Dec. 7–13, pp.
4247
4255
.
194.
Reed
,
S.
,
Akata
,
Z.
,
Lee
,
H.
, and
Schiele
,
B.
,
2016
, “
Learning Deep Representations of Fine-Grained Visual Descriptions
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Las Vegas, NV
,
June 27–30
, Vol. 5, pp.
49
58
.
195.
Nakov
,
P.
, and
Ng
,
H. T.
,
2009
, “
Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages
.”
196.
Hendricks
,
L. A.
,
Venugopalan
,
S.
,
Rohrbach
,
M.
,
Mooney
,
R.
,
Saenko
,
K.
, and
Darrell
,
T.
,
2015
, “
Deep Compositional Captioning: Describing Novel Object Categories Without Paired Training Data
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Boston, MA
,
June 7–12
, Vol. 11, pp.
1
10
.
197.
Socher
,
R.
, and
Fei-Fei
,
L.
,
2010
, “
Connecting Modalities: Semi-Supervised Segmentation and Annotation of Images Using Unaligned Text Corpora
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
San Francisco, CA
,
June 13–18
, pp.
966
973
.
198.
Socher
,
R.
,
Karpathy
,
A.
,
Le
,
Q. V.
,
Manning
,
C. D.
, and
Ng
,
A. Y.
,
2014
, “
Grounded Compositional Semantics for Finding and Describing Images With Sentences
,”
Trans. Assoc. Comput. Linguist.
,
2
, pp.
207
218
.
199.
Feng
,
Y.
, and
Lapata
,
M.
,
2010
, “
Visual Information in Semantic Representation
,”
June.
200.
Bruni
,
E.
,
Boleda
,
G.
,
Baroni
,
M.
, and
Tran
,
N.-K.
,
2012
, “
Distributional Semantics in Technicolor
,”
July.
201.
Kottur
,
S.
,
Vedantam
,
R.
,
Moura
,
J. M. F.
, and
Parikh
,
D.
,
2016
, “
VisualWord2Vec (Vis-W2V): Learning Visually Grounded Word Embeddings Using Abstract Scenes
,”
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
,
Las Vegas, NV
,
June 27–30
, Vol. 6, pp.
4985
4994
.
202.
Gupta
,
T.
,
Schwing
,
A.
, and
Hoiem
,
D.
,
2019
, “
ViCo: Word Embeddings From Visual Co-occurrences
,”
IEEE International Conference on Computer Vision
,
Seoul, South Korea
,
Oct. 27–Nov. 2
, Vol. 8, pp.
7424
7433
.
203.
Mori
,
Y.
,
Takahashi
,
H.
, and
Oka
,
R.
,
1999
, “
Image-to-Word Transformation Based on Dividing and Vector Quantizing Images With Words
,”
MISRM’99 First International Workshop on Multimedia Intelligent Storage and Retrieval Management
,
Orlando, FL
.
204.
Quattoni
,
A.
,
Collins
,
M.
, and
Darrell
,
T.
,
2007
, “
Learning Visual Representations Using Images With Captions
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Minneapolis, MN
,
June 17–22
.
205.
Joulin
,
A.
,
van Der Maaten
,
L.
,
Jabri
,
A.
, and
Vasilache
,
N.
,
2015
, “
Learning Visual Features From Large Weakly Supervised Data
,”
ECCV 2016: Computer Vision – ECCV
, Vol.
9911
, pp.
67
84
.
206.
Li
,
A.
,
Jabri
,
A.
,
Joulin
,
A.
, and
Maaten
,
L. V. D.
,
2016
, “
Learning Visual N-Grams From Web Data
,”
IEEE International Conference on Computer Vision
,
Las Vegas, NV
,
June 27–30
, pp.
4193
4202
.
207.
Mahajan
,
D.
,
Girshick
,
R.
,
Ramanathan
,
V.
,
He
,
K.
,
Paluri
,
M.
,
Li
,
Y.
,
Bharambe
,
A.
, and
van der Maaten
,
L.
,
2018
, “
Exploring the Limits of Weakly Supervised Pretraining
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
11206
, LNCS, pp.
185
201
.
208.
Kiela
,
D.
,
Bulat
,
L.
, and
Clark
,
S.
,
2015
, “
Grounding Semantics in Olfactory Perception
,”
53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing
,
Beijing, China
,
July 26–31
, Vol. 2, pp.
231
236
.
209.
Blum
,
A.
, and
Mitchell
,
T.
,
1998
, “
Combining Labeled and Unlabeled Data With Co-Training
,”
Annual ACM Conference on Computational Learning Theory
,
Santa Cruz, CA
,
July 7–9
, pp.
92
100
.
210.
Levin
,
A.
,
Viola
,
P.
, and
Freund
,
Y.
,
2003
, “
Unsupervised Improvement of Visual Detectors Using Co-Training
,”
IEEE International Conference on Computer Vision
,
Washington, DC
,
Oct. 13–16
, Vol. 1, pp.
626
633
.
211.
Christoudias
,
C. M.
,
Urtasun
,
R.
, and
Darrell
,
T.
,
2012
, “
Multi-View Learning in the Presence of View Disagreement
.”
212.
Girshick
,
R.
,
2015
, “
Fast R-CNN
,”
IEEE International Conference on Computer Vision (ICCV)
, pp.
1440
1448
.
213.
Cornia
,
M.
,
Baraldi
,
L.
, and
Cucchiara
,
R.
,
2022
, “
Explaining Transformer-Based Image Captioning Models: An Empirical Analysis
,”
AI Commun.
,
35
(
2
), pp.
111
129
.
214.
Herdade
,
S.
,
Kappeler
,
A.
,
Boakye
,
K.
, and
Soares
,
J.
,
2019
, “
Image Captioning: Transforming Objects Into Words
,”
Advances in Neural Information Processing Systems
,
Vancouver, Canada
,
Dec. 8–14
, Vol. 32.
215.
Huang
,
L.
,
Wang
,
W.
,
Chen
,
J.
, and
Wei
,
X.-Y.
,
2019
, “
Attention on Attention for Image Captioning
,”
IEEE/CVF International Conference on Computer Vision
,
Seoul, South Korea
,
Oct. 27–Nov. 2
, pp.
4633
4642
.
216.
He
,
S.
,
Liao
,
W.
,
Tavakoli
,
H. R.
,
Yang
,
M.
,
Rosenhahn
,
B.
, and
Pugeault
,
N.
,
2020
, “
Image Captioning Through Image Transformer
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
12625
, LNCS, pp.
153
169
.
217.
Li
,
G.
,
Zhu
,
L.
,
Liu
,
P.
, and
Yang
,
Y.
,
2019
, “
Entangled Transformer for Image Captioning
,”
IEEE/CVF International Conference on Computer Vision
,
Seoul, South Korea
,
Oct. 27–Nov. 2
, Vol. 10, pp.
8927
8936
.
218.
Aneja
,
J.
,
Deshpande
,
A.
, and
Schwing
,
A. G.
,
2017
, “
Convolutional Image Captioning
,”
IEEE Conference on Computer Vision and Pattern Recognition
,
Honolulu, HI
,
July 21–26
, Vol. 11, pp.
5561
5570
.
219.
Deshpande
,
A.
,
Aneja
,
J.
,
Wang
,
L.
,
Schwing
,
A. G.
, and
Forsyth
,
D.
,
2018
, “
Fast, Diverse and Accurate Image Captioning Guided By Part-of-Speech
,”
IEEE/CVF Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, Vol. 5, pp.
10687
10696
.
220.
Li
,
B.
,
Qi
,
X.
,
Lukasiewicz
,
T.
, and
Torr
,
P. H.
,
2019
, “
Controllable Text-to-Image Generation
,”
33rd Conference on Neural Information Processing Systems
,
Vancouver, Canada
, Vol. 32, p.
9
.
221.
Tao
,
M.
,
Tang
,
H.
,
Wu
,
F.
,
Jing
,
X.
,
Bao
,
B.-K.
, and
Xu
,
C.
,
2022
, “
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis
,”
IEEE/CVF Conference on Computer Vision and Pattern Recognition
,
New Orleans, LA
,
June 18–24
, pp.
16494
16504
.
222.
Karras
,
T.
,
Laine
,
S.
, and
Aila
,
T.
,
2018
, “
A Style-Based Generator Architecture for Generative Adversarial Networks
,”
IEEE/CVF Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–23
.
223.
Patashnik
,
O.
,
Wu
,
Z.
,
Shechtman
,
E.
,
Cohen-Or
,
D.
, and
Lischinski
,
D.
,
2021
, “
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
,”
IEEE/CVF International Conference on Computer Vision
,
Montreal, QC, Canada
,
Oct. 11–17
, Vol. 3, pp.
2065
2074
.
224.
Gal
,
R.
,
Patashnik
,
O.
,
Maron
,
H.
,
Bermano
,
A. H.
,
Chechik
,
G.
, and
Cohen-Or
,
D.
,
2021
, “
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
,”
ACM Trans. Graph.
,
41
(
4
), p.
8
.
225.
Chefer
,
H.
,
Benaim
,
S.
,
Paiss
,
R.
, and
Wolf
,
L.
,
2022
, “
Image-Based CLIP-Guided Essence Transfer
,”
Computer Vision – ECCV
,
Tel Aviv, Israel
,
Oct. 23–27
, pp.
695
711
.
226.
Ramesh
,
A.
,
Pavlov
,
M.
,
Goh
,
G.
,
Gray
,
S.
,
Voss
,
C.
,
Radford
,
A.
,
Chen
,
M.
, and
Sutskever
,
I.
,
2021
, “
Zero-Shot Text-to-Image Generation
,”
International Conference on Machine Learning
,
Virtual
,
July 18–24
, Vol. 2, pp.
8821
8831
.
227.
Crowson
,
K.
,
Biderman
,
S.
,
Kornis
,
D.
,
Stander
,
D.
,
Hallahan
,
E.
,
Castricato
,
L.
, and
Raff
,
E.
,
2022
, “
VQGAN-CLIP: Open Domain Image Generation and Editing With Natural Language Guidance
,”
Computer Vision – ECCV 2022
,
Tel Aviv, Israel
,
Oct. 23–27
, pp.
88
105
.
228.
Yu
,
J.
,
Li
,
X.
,
Koh
,
J. Y.
,
Zhang
,
H.
,
Pang
,
R.
,
Qin
,
J.
,
Ku
,
A.
,
Xu
,
Y.
,
Baldridge
,
J.
, and
Wu
,
Y.
,
2022
, “
Vector-Quantized Image Modeling With Improved VQGAN
,”
International Conference on Learning Representations
,
Virtual
,
Apr. 25–29
.
229.
Saharia
,
C.
,
Chan
,
W.
,
Saxena
,
S.
,
Li
,
L.
,
Whang
,
J.
,
Denton
,
E.
, and
Ghasemipour
,
S. K. S.
,
2022
, “
Photorealistic Text-to-Image Diffusion Models With Deep Language Understanding
,”
Advances in Neural Information Processing Systems 35
,
New Orleans, LA
,
Dec. 6–14
.
230.
Frans
,
K.
,
Soros
,
L.
, and
Witkowski
,
O.
,
2022
, “
CLIPDraw: Exploring Text-to-Drawing Synthesis Through Language-Image Encoders
,”
Advances in Neural Information Processing Systems
,
New Orleans, LA
,
Dec. 6–14
.
231.
Ma
,
S.
,
Tang
,
Q.
,
Liu
,
Y.
, and
Feng
,
Q.
,
2022
, “
Prediction of Mechanical Properties of Three-Dimensional Printed Lattice Structures Through Machine Learning
,”
ASME J. Comput. Inf. Sci. Eng.
,
22
(
3
), p.
031008
.
232.
Nguyen
,
C. H. P.
, and
Choi
,
Y.
,
2019
, “
Triangular Mesh and Boundary Representation Combined Approach for 3D CAD Lightweight Representation for Collaborative Product Development
,”
ASME J. Comput. Inf. Sci. Eng.
,
19
(
1
), p.
011009
.
233.
Tucker
,
T. M.
, and
Kurfess
,
T. R.
,
2006
, “
Point Cloud to CAD Model Registration Methods in Manufacturing Inspection
,”
ASME J. Comput. Inf. Sci. Eng.
,
6
(
4
), pp.
418
421
.
234.
Mata
,
M. P.
,
Ahmed-Kristensen
,
S.
, and
Shea
,
K.
,
2019
, “
Implementation of Design Rules for Perception Into a Tool for Three-Dimensional Shape Generation Using a Shape Grammar and a Parametric Model
,”
ASME J. Mech. Des.
,
141
(
1
), p.
011101
.
235.
Toscano
,
J. D.
,
Zuniga-Navarrete
,
C.
,
Siu
,
W. D. J.
,
Segura
,
L. J.
, and
Sun
,
H.
,
2023
, “
Teeth Mold Point Cloud Completion Via Data Augmentation and Hybrid RL-GAN
,”
ASME J. Comput. Inf. Sci. Eng.
,
23
(
4
), p.
041008
.
236.
Choy
,
C. B.
,
Xu
,
D.
,
Gwak
,
J. Y.
,
Chen
,
K.
, and
Savarese
,
S.
,
2016
, “
3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
9912
, LNCS, pp.
628
644
.
237.
Gkioxari
,
G.
,
Johnson
,
J.
, and
Malik
,
J.
,
2019
, “
Mesh R-CNN
,”
IEEE International Conference on Computer Vision
,
Seoul, South Korea
,
Oct. 27–Nov. 2
, Vol. 6, pp.
9784
9794
.
238.
Shrestha
,
R.
,
Fan
,
Z.
,
Su
,
Q.
,
Dai
,
Z.
,
Zhu
,
S.
, and
Tan
,
P.
,
2021
, “
MeshMVS: Multi-View Stereo Guided Mesh Reconstruction
,”
International Conference on 3D Vision
,
London, UK
,
Dec. 1–3
, Vol. 10, pp.
1290
1300
.
239.
Fan
,
H.
,
Su
,
H.
, and
Guibas
,
L.
,
2016
, “
A Point Set Generation Network for 3D Object Reconstruction From a Single Image
,”
30th IEEE Conference on Computer Vision and Pattern Recognition
,
Honolulu, HI
,
July 21–26
, pp.
2463
2471
.
240.
Groueix
,
T.
,
Fisher
,
M.
,
Kim
,
V. G.
,
Russell
,
B. C.
, and
Aubry
,
M.
,
2018
, “
A Papier-Mache Approach to Learning 3D Surface Generation
,”
IEEE/CVF Conference on Computer Vision and Pattern Recognition
,
Salt Lake City, UT
,
June 18–23
, pp.
216
224
.
241.
Li
,
X.
,
Xie
,
C.
, and
Sha
,
Z.
,
2022
, “
A Predictive and Generative Design Approach for Three-Dimensional Mesh Shapes Using Target-Embedding Variational Autoencoder
,”
ASME J. Mech. Des.
,
144
(
11
), p.
114501
.
242.
Wu
,
J.
,
Zhang
,
C.
,
Xue
,
T.
,
Freeman
,
W. T.
, and
Tenenbaum
,
J. B.
,
2016
, “
Learning a Probabilistic Latent Space of Object Shapes Via 3D Generative-Adversarial Modeling
,”
Advances in Neural Information Processing Systems
,
Barcelona, Spain
,
Dec. 5–10
.
243.
Khan
,
S. H.
,
Guo
,
Y.
,
Hayat
,
M.
, and
Barnes
,
N.
,
2019
, “
Unsupervised Primitive Discovery for Improved 3D Generative Modeling
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Long Beach, CA
,
June 15–20
, Vol. 6, pp.
9731
9740
.
244.
Lin
,
H.
,
Xu
,
Q.
,
Xu
,
H.
,
Xu
,
Y.
,
Zheng
,
Y.
,
Zhong
,
Y.
, and
Nie
,
Z.
,
2024
, “
Three-Dimensional-Slice-Super-Resolution-Net: A Fast Few Shooting Learning Model for 3D Super-Resolution Using Slice-Up and Slice-Reconstruction
,”
ASME J. Comput. Inf. Sci. Eng.
,
24
(
1
), p.
011005
.
245.
Maron
,
H.
,
Galun
,
M.
,
Aigerman
,
N.
,
Trope
,
M.
,
Dym
,
N.
,
Yumer
,
E.
,
Kim
,
V. G.
, and
Lipman
,
Y.
,
2017
, “
Convolutional Neural Networks on Surfaces Via Seamless Toric Covers
,”
ACM Trans. Graph. (TOG)
,
36
(
4
), p.
7
.
246.
Ben-Hamu
,
H.
,
Maron
,
H.
,
Kezurer
,
I.
,
Avineri
,
G.
, and
Lipman
,
Y.
,
2018
, “
Multi-chart Generative Surface Modeling
,”
SIGGRAPH Asia 2018
,
Tokyo, Japan
,
Dec. 4–7
.
247.
Saquil
,
Y.
,
Xu
,
Q. C.
,
Yang
,
Y. L.
, and
Hall
,
P.
,
2020
, “
Rank3DGAN: Semantic Mesh Generation Using Relative Attributes
,”
34th AAAI Conference on Artificial Intelligence
,
New York
,
Feb. 7–12
, pp.
5586
5594
.
248.
Alhaija
,
H. A.
,
Dirik
,
A.
,
Knorig
,
A.
,
Fidler
,
S.
, and
Shugrina
,
M.
,
2022
, “
XDGAN: Multi-modal 3D Shape Generation in 2D Space
,”
British Machine Vision Conference
,
London, UK
,
Nov. 21–24
.
249.
Fu
,
R.
,
Zhan
,
X.
,
Chen
,
Y.
,
Ritchie
,
D.
, and
Sridhar
,
S.
,
2022
, “
ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation Model
,”
Advances in Neural Information Processing Systems
,
New Orleans, LA
,
Nov. 28–Dec. 9
.
250.
Liu
,
Z.
,
Feng
,
Y.
,
Black
,
M. J.
,
Nowrouzezahrai
,
D.
,
Paull
,
L.
, and
Liu
,
W.
,
2023
, “
MeshDiffusion: Score-Based Generative 3D Mesh Modeling
,”
The Eleventh International Conference on Learning Representations
,
Kigali, Rwanda
,
May 1–5
.
251.
Alwala
,
K. V.
,
Gupta
,
A.
, and
Tulsiani
,
S.
,
2022
, “
Pre-Train, Self-Train, Distill: A Simple Recipe for Supersizing 3D Reconstruction
,” pp.
3763
3772
.
252.
Liu
,
Z.
,
Dai
,
P.
,
Li
,
R.
,
Qi
,
X.
, and
Fu
,
C.-W.
,
2022
, “
ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops 2022
,
New Orleans, LA
,
June 18–24
.
253.
Nam
,
G.
,
Khlifi
,
M.
,
Rodriguez
,
A.
,
Tono
,
A.
,
Zhou
,
L.
, and
Guerrero
,
P.
,
2022
, “
3D-LDM: Neural Implicit 3D Shape Generation With Latent Diffusion Models
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
,
New Orleans, LA
,
June 18–24
.
254.
Cheng
,
Z.
,
Chai
,
M.
,
Ren
,
J.
,
Lee
,
H.-Y.
,
Olszewski
,
K.
,
Huang
,
Z.
,
Maji
,
S.
, and
Tulyakov
,
S.
,
2022
, “Cross-Modal 3D Shape Generation and Manipulation,”
Computer Vision – ECCV 2022
,
S.
Avidan
,
G.
Brostow
,
M.
Cissé
,
G. M.
Farinella
, and
T.
Hassner
, eds.,
Springer Nature
,
Switzerland
, pp.
303
321
.
255.
Wang
,
N.
,
Zhang
,
Y.
,
Li
,
Z.
,
Fu
,
Y.
,
Liu
,
W.
, and
Jiang
,
Y. G.
,
2018
, “
Pixel2Mesh: Generating 3D Mesh Models From Single RGB Images
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
11215
, LNCS, pp.
55
71
.
256.
Michel
,
O.
,
Bar-On
,
R.
,
Liu
,
R.
,
Benaim
,
S.
, and
Hanocka
,
R.
,
2021
, “
Text2Mesh: Text-Driven Neural Stylization for Meshes
,”
IEEE/CVF Conference on Computer Vision and Pattern Recognition
,
Nashville, TN
,
June 20–25
, pp.
134926
14502
.
257.
Jetchev
,
N.
,
2021
, “
ClipMatrix: Text-Controlled Creation of 3D Textured Meshes
,”
ArXiv
. https://arxiv.org/abs/2109.12922
258.
Malhan
,
R.
, and
Gupta
,
S. K.
,
2023
, “
The Role of Deep Learning in Manufacturing Applications: Challenges and Opportunities
,”
ASME J. Comput. Inf. Sci. Eng.
,
23
(
6
), p.
060816
.
259.
Mai
,
S.
,
Zeng
,
Y.
,
Zheng
,
S.
, and
Hu
,
H.
,
2022
, “
Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis
,”
IEEE Trans. Affect. Comput.
,
14
(
3
), pp.
2267
2289
.
260.
Zhou
,
Y.
,
Yang
,
Y.
,
Ying
,
Q.
,
Qian
,
Z.
, and
Zhang
,
X.
,
2023
, “
Multimodal Fake News Detection Via CLIP-Guided Learning
,” July.
261.
Deng
,
Y.
,
Xu
,
X.
,
Qiu
,
Y.
,
Xia
,
J.
,
Zhang
,
W.
, and
Liu
,
S.
,
2020
, “
A Multimodal Deep Learning Framework for Predicting Drug-Drug Interaction Events
,”
Bioinformatics
,
36
(
15
), pp.
4316
4322
.
262.
Pakdamanian
,
E.
,
Sheng
,
S.
,
Baee
,
S.
,
Heo
,
S.
,
Kraus
,
S.
, and
Feng
,
L.
,
2021
, “
DeepTake: Prediction of Driver Takeover Behavior Using Multimodal Data
,”
CHI Conference on Human Factors in Computing Systems
,
Yokohama, Japan
,
May 8–13
.
263.
Yuan
,
C.
,
Marion
,
T.
, and
Moghaddam
,
M.
,
2023
, “
DDE-GAN: Integrating a Data-Driven Design Evaluator Into Generative Adversarial Networks for Desirable and Diverse Concept Generation
,”
ASME J. Mech. Des.
,
145
(
4
), p.
041407
.
264.
Ordonez
,
V.
,
Kulkarni
,
G.
, and
Berg
,
T.
,
2011
, “
Im2Text: Describing Images Using 1 Million Captioned Photographs
,”
Advances in Neural Information Processing Systems
,
Granada, Spain
,
Dec. 12–15
.
265.
Devlin
,
J.
,
Cheng
,
H.
,
Fang
,
H.
,
Gupta
,
S.
,
Deng
,
L.
,
He
,
X.
,
Zweig
,
G.
, and
Mitchell
,
M.
,
2015
, “
Language Models for Image Captioning: The Quirks and What Works
,”
53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing
,
Beijing, China
,
July 26–31
, Vol. 2, pp.
100
105
.
266.
Kwon
,
E.
,
Huang
,
F.
, and
Goucher-Lambert
,
K.
,
2022
, “
Enabling Multi-modal Search for Inspirational Design Stimuli Using Deep Learning
,”
Artif. Intell. Eng. Des. Anal. Manuf.
,
36
, p.
e22
.
267.
Farhadi
,
A.
,
Hejrati
,
M.
,
Sadeghi
,
M. A.
,
Young
,
P.
,
Rashtchian
,
C.
,
Hockenmaier
,
J.
, and
Forsyth
,
D.
,
2010
, “
Every Picture Tells a Story: Generating Sentences From Images
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
6314
, LNCS, (PART 4), pp.
15
29
.
268.
Xu
,
R.
,
Xiong
,
C.
,
Chen
,
W.
, and
Corso
,
J. J.
,
2015
, “
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework
,”
AAAI Conference on Artificial Intelligence
,
Austin, TX
,
Jan. 25–30
, pp.
2346
2352
.
269.
Hodosh
,
M.
,
Young
,
P.
, and
Hockenmaier
,
J.
,
2013
, “
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics
,”
J. Artif. Intell. Res.
,
47
, pp.
853
899
.
270.
Gero
,
J.
, and
Milovanovic
,
J.
,
2021
, “
The Situated Function-Behavior-Structure Co-Design Model
,”
CoDesign
,
17
(
2
), pp.
211
236
.
271.
Lin
,
T. Y.
,
Maire
,
M.
,
Belongie
,
S.
,
Hays
,
J.
,
Perona
,
P.
,
Ramanan
,
D.
,
Dollár
,
P.
, and
Zitnick
,
C. L.
,
2014
, “
Microsoft COCO: Common Objects in Context
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
8693
, LNCS, (PART 5), pp.
740
755
.
272.
Krishna
,
R.
,
Zhu
,
Y.
,
Groth
,
O.
,
Johnson
,
J.
,
Hata
,
K.
,
Kravitz
,
J.
, and
Chen
,
S.
, et al.,
2016
, “
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
,”
Int. J. Comput. Vision
,
123
(
1
), pp.
32
73
.
273.
Thomee
,
B.
,
Shamma
,
D. A.
,
Friedland
,
G.
,
Elizalde
,
B.
,
Ni
,
K.
,
Poland
,
D.
,
Borth
,
D.
, and
Li
,
L.-J.
,
2015
, “
YFCC100M: The New Data in Multimedia Research
,”
Commun. ACM
,
59
(
2
), pp.
64
73
.
274.
Sun
,
C.
,
Shrivastava
,
A.
,
Singh
,
S.
, and
Gupta
,
A.
,
2017
, “
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
,”
IEEE International Conference on Computer Vision
,
Venice, Italy
,
Oct. 22–29
, Vol. 7, pp.
843
852
. .
275.
Murray
,
N.
,
Marchesotti
,
L.
, and
Perronnin
,
F.
,
2012
, “
AVA: A Large-Scale Database for Aesthetic Visual Analysis
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Providence, RI
,
June 16–21
, pp.
2408
2415
.
276.
Chen
,
K.
,
Choy
,
C. B.
,
Savva
,
M.
,
Chang
,
A. X.
,
Funkhouser
,
T.
, and
Savarese
,
S.
,
2018
, “
Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings
,”
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
, Vol.
11363
, LNCS, pp.
100
116
.
277.
Jahan
,
N.
,
Nesa
,
A.
, and
Layek
,
M. A.
,
2021
, “
Parkinson’s Disease Detection Using CNN Architectures With Transfer Learning
,”
International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES)
,
Chennai, India
,
Sept. 24–25
, pp.
1
5
.
278.
Regenwetter
,
L.
,
Srivastava
,
A.
,
Gutfreund
,
D.
, and
Ahmed
,
F.
,
2023
, “
Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design
.”
279.
Nabian
,
M. A.
, and
Meidani
,
H.
,
2020
, “
Physics-Driven Regularization of Deep Neural Networks for Enhanced Engineering Design and Analysis
,”
ASME J. Comput. Inf. Sci. Eng.
,
20
(
1
), p.
011006
.
280.
Xu
,
P.
,
Hospedales
,
T. M.
,
Yin
,
Q.
,
Song
,
Y.-Z.
,
Xiang
,
T.
, and
Wang
,
L.
,
2022
, “
Deep Learning for Free-Hand Sketch: A Survey
,”
IEEE Trans. Pattern Anal. Mach. Intell.
,
45
(
1
), pp.
285
312
.
281.
Ghadai
,
S.
,
Lee
,
X. Y.
,
Balu
,
A.
,
Sarkar
,
S.
, and
Krishnamurthy
,
A.
,
2019
, “
Multi-Level 3D CNN for Learning Multi-Scale Spatial Features
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
,
Long Beach, CA
,
June 16–17
, Vol. 6, pp.
1152
1156
.
282.
Kong
,
C.
,
Lin
,
D.
,
Bansal
,
M.
,
Urtasun
,
R.
, and
Fidler
,
S.
,
2014
, “
What Are You Talking About? Text-to-Image Coreference
,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
,
Columbus, OH
,
June 23–28
, Vol. 9, pp.
3558
3565
.
283.
Wu
,
F.
,
Lin
,
Y. C.
, and
Lu
,
P.
,
2022
, “
Research on the Design Strategy of Healing Products for Anxious Users During COVID-19
,”
Int. J. Environ. Res. Public Health
,
19
(
10
), p.
5
.
284.
Linardatos
,
P.
,
Papastefanopoulos
,
V.
, and
Kotsiantis
,
S.
,
2021
, “
Explainable AI: A Review of Machine Learning Interpretability Methods
,”
Entropy
,
23
(
1
), pp.
1
45
.
285.
Barredo Arrieta
,
A.
,
Díaz-Rodríguez
,
N.
,
Del Ser
,
J.
,
Bennetot
,
A.
,
Tabik
,
S.
,
Barbado
,
A.
,
Garcia
,
S.
, et al.
,
2020
, “
Explainable Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges Toward Responsible AI
,”
Inf. Fusion
,
58
, pp.
82
115
.