Abstract

The conditions under which data wrangling is undertaken have a profound impact on the quality of the results of the wrangling and analysis. This paper presents an analysis of the sociotechnical aspects of a data wrangling activity at a large, multi-site global manufacturer. The activity was technically demanding, as operational data from multiple sources and in multiple formats needed to be integrated, but it also involved interaction with multiple stakeholders in different parts of the world, each with their own ways of collecting and structuring the data. The data had originally been captured for a different purpose. The clients were not aware that the data followed a different logic at the various sites and in some cases needed to be manually extracted and interpreted. The paper describes the data wrangling process and analyses the assumptions, goals, and biases of the different stakeholders. The analysis raises questions and offers insights about how data can be trusted, and suggests that human intervention with data along the data wrangling process is often unintentional, tacit, and easily overlooked. It is suggested that contextual factors, such as data quality and the assessment of consequences when acting or making decisions on the new data set, be given greater attention during the specification of data wrangling assignments. The paper concludes with recommendations for data wrangling practitioners.
