This article focuses on human skill understanding in the context of surgical assessment and training, which has enormous and immediate application potential to enhance healthcare delivery. Surgical procedural performance involves the interplay of a highly dynamic system of inter-coupled perceptual, sensory, and cognitive components. Computer-Integrated Surgery systems are a quintessential part of the modern surgical workflow owing to developments in miniaturization, sensors and computation. Robotic Minimally Invasive Surgery, and the engendered computer-integration, offers unique opportunities for quantitative computer-based surgical-performance evaluation. However, current skill evaluation metrics require a variety of sensory data, which limits their application to very specific robotic devices. The ability to couple quantitative, validated and stable metrics to surgical performance would lead to improvements in assessment and, subsequently, training methods. Cognitive assessment can now be extended to also include sensorimotor assessment, with the capacity to monitor and track skill across time.
Skilled performance of manipulation tasks, especially in conjunction with innovative tools “to extend the reach of humans” has been instrumental to human progress. Numerous learning traditions have evolved over millennia to help characterize human sensorimotor skill for performing complex manipulation tasks while simultaneously developing the modeling techniques to capture skill acquisition and retention. For example, the careful assessment, nurturing and refinement of sensorimotor task performance has proven equally pertinent to the skilled operation of machinery as well as mellifluous musical performance. Yet there are major gaps in our understanding of the human-operator interactions with tools in complex environments.
In this article, we focus specifically on human skill understanding in the context of surgical assessment and training, which has enormous and immediate application potential to enhance healthcare delivery. Surgical procedural performance involves the interplay of a highly dynamic system of inter-coupled perceptual, sensory, and cognitive components. Traditional surgeries produced limited quantifiable data and, as a consequence, skill acquisition and assessment were reliant on the philosophy of ‘See one, Do one, Teach one’.
Computer-Integrated Surgery (CIS) systems are a quintessential part of the modern surgical workflow owing to developments in miniaturization, sensors and computation. The tremendous proliferation of such devices ensues from their benefits: (i) reduced recovery time for patients; (ii) augmentation of the sensorimotor and cognitive capabilities of the operating surgeon; and (iii) potential cost-savings for healthcare systems. However, these devices constitute additional ‘intervening’ layers between the operating surgeon and the patient, resulting in the loss of physical feedback pathways and potentially compromising performance. Very similar research and development issues arose during the emergence of teleoperator and haptic systems, leading up to the development of insightful R&D roadmaps. There is significant value in building upon these roadmaps to characterize the extent of the attenuation and to study the role of design and control in enhancing overall system-level performance.
It must be noted that overall surgical task performance depends on the bi-directional interaction between the neuromuscular system and its dynamic environment (human-machine interface + task dynamics), as shown in Figure 1. Given the vast spatiotemporal data from CIS systems, there is tremendous interest in generating atomic-level indicators of skill acquisition. Researchers have pursued both ‘constituent-element-based compositional modeling’ and ‘data-driven system-dynamics identification’ perspectives, both of which we discuss later.
The rest of the article is organized as follows. We begin with a basic historical perspective on surgical skill assessment from both clinical and research points of view. The Skill Metrics section outlines the important properties of a skill metric which is independent of the type of surgery, device, interface and surgical environment. As newer robotic devices are introduced into the Operating Room (OR) and more surgical tasks are automated, the interface between patient and surgeon is continuously evolving. Hence, an abstract treatment of skill is essential to develop quantitative metrics that can be applied to a variety of surgical tasks, devices and environments. The Quantitative Performance Assessment section describes recent research that uses a variety of sensor data, such as video and kinematics, and their combinations. Last, we outline some of the open research issues with current skill evaluation techniques.
In 1904, Dr. William Halsted created the first residency program in the United States, applying the ‘See one, Do one, Teach one’ paradigm in which novitiate clinicians learn to perform procedures by observing experienced surgeons. The challenges to creating a successful surgical training regimen arise both from the complexity of the cognitive and sensorimotor skill-sets to be trained and from the mission-critical setting (literally life-and-death). Medical education has relied on subjective or, at best, semi-quantitative metrics, such as Likert scales proctored by experts, to assess the graduation requirements or skill of a surgeon. In select disciplines, such as general surgery and obstetrics, the Objective Structured Assessment of Technical Skills (OSATS) has been developed as an assessment tool: surgical task performance is rated by anonymous experts using task-specific checklists and a global rating scale of performance with demonstrated inter-rater reliability and validity.
The requirement of an experienced surgeon during training and evaluation (in this apprenticeship-based model) places enormous constraints on the number of operations/trials by a trainee. Over the years, training of skills such as suturing, which are common across multiple procedures, has slowly led to a transition from apprenticeship-based to criterion-based models. Over the past decade, the ACGME (Accreditation Council for Graduate Medical Education) has espoused the development of cost-efficient proficiency-based advancement to bypass the limitations of the current apprenticeship-based system. Numerous objective methods for assessing technical skills are being considered for use in many surgical training programs today. OSATS, as well as the Objective Structured Clinical Examination (OSCE), emphasizes quantitative assessment processes that use appropriate measurement hardware, such as the Imperial College Surgical Assessment Device (ICSAD) and the Advanced Dundee Endoscopic Psychomotor Trainer (ADEPT), rather than relying on expert evaluators. But in the vast majority of other sub-specialties, such as pediatric nephrology, the ACGME merely requires logging of performed procedures.
The next phase of research in surgical skill evaluation was enabled by simulation methodologies and quantitative skills-assessment tools using the data collected during simulation. Such surgical simulators are relatively cheap for hospitals to operate and maintain, and they enable novice surgeons to practice and sharpen specific skills. Computer-assisted surgical simulators (virtual-, physical- or augmented-reality) provide significant opportunities to sharpen skills by developing different what-if scenarios and repeated/systematic training. Such systems exploit the quantitative recording and user-feedback capabilities of computer-based instrumentation (video and sensors). These simulators provide various aggregate measures, such as time to completion and path length, to rate surgical skill. An imperfect and incomplete understanding of the underlying relationships, coupled with insufficient computational support, has led to an assessment regimen focused on easy-to-measure, quantitative but simplistic spatially- and temporally-aggregated measures. However, using such aggregated metrics (without established repeatability, stability and, potentially, validity) to steer an entire training regimen may lead to undesirable and unforeseen consequences.
The growth of computer integration in minimally-invasive surgery (MIS), especially in the form of robotic minimally invasive surgery (rMIS), now offers a unique set of opportunities to comprehensively address this situation. Arguably, the growth in MIS (and especially rMIS) has set aside the erstwhile fundamental objection that “nothing can come between a surgeon and his/her scalpel”. A range of physical variables can now be transparently monitored via instrumented tool-usage in both simulated and real-life scenarios.
While the collection of quantitative raw physical measurements is growing (in this era of Big Data), the oversimplification inherent in using aggregated measures often results in the loss of desirable user-specific discriminative characteristics. Key challenges to the assessment and accreditation of surgeons in such a scenario include (1) creating appropriate clinically relevant scenarios and settings and (2) developing uniform, repeatable, stable, verifiable performance metrics; all at manageable cost for ever-increasing cohorts of trainees.
Skill Metrics: Desired Features and Challenges
Skill acquisition is fundamental to the human experience, enabling us to learn from people who have already mastered a task. Our education curricula inherently involve various forms of testing skill acquisition in order to locate and correct skill-specific deficiencies. Entire traditions, such as playing instruments and singing, rely on the pedagogical approach of having an expert with a “trained ear” find mistakes. Specific to surgical training, a variety of cognitive and sensorimotor skills must be learned in order to perform a variety of surgical interventions.
Critical impediments to unified skill representation and estimation arise due to the variety of: (i) surgical procedures; (ii) surgical devices; and (iii) anatomical complexity. Figure 2 depicts the evolution of surgery: from open- to minimally-invasive (MIS) and further to robotic minimally-invasive surgery (rMIS), which has redefined the surgeon-patient relationship in terms of available sensory feedback.
Given this dynamic relationship, it becomes imperative to consider an abstracted/generalized treatment of skill assessment. Over the years, several guidelines on the design of skill metrics have emerged (with clear implications for surgical education and accreditation):
Repeatability and stability (under controlled environments): The skill metric should converge to a predictable set (law of large numbers) under repeated execution of the same task within an environment.
Gradated Feedback Mechanism: Fundamentally, skill evaluation needs to pinpoint areas of improvement. Hence, in addition to a binary answer (Yes/No), the metrics need to provide a gradated scale. This enables the skill metric to serve not just as an accreditation mechanism but also as a means of addressing specific skill deficiencies of trainees.
Real-time: Feedback to a trainee/intermediary needs to be provided in (as close to) real-time conditions to enable course corrections.
Surgical Outcomes: Skill levels, whether binary, discrete or continuous, afford comparison between trainees. However, another key requirement is the correlation of skill levels with actual surgical outcomes.
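The repeatability property lends itself to a simple sanity check: under repeated execution of the same task, a well-behaved metric's running average should settle toward a stable value. The sketch below simulates this with a hypothetical noisy scalar metric; the skill level, noise magnitude and trial count are all invented for illustration.

```python
import random

def trial_metric(true_skill, rng, noise=0.2):
    # One noisy observation of a hypothetical scalar skill metric
    # (e.g., a normalized path-length score) for a single trial.
    return true_skill + rng.gauss(0.0, noise)

def running_means(true_skill, n_trials, seed=42):
    # Cumulative mean of the metric over repeated executions of the same task.
    rng = random.Random(seed)
    total, means = 0.0, []
    for i in range(1, n_trials + 1):
        total += trial_metric(true_skill, rng)
        means.append(total / i)
    return means

means = running_means(true_skill=1.0, n_trials=500)
```

With a fixed underlying skill level, the tail of `means` hovers near 1.0 while early entries fluctuate; a metric whose running mean drifts or fails to settle would violate the repeatability requirement.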
The aforementioned properties of skill metrics are quite broad; ultimately, specific tasks must be designed to evaluate the skill indicators of a trainee. As food for thought, we note that analogies are often made likening surgical-task performance to another complex learned cognitive and sensorimotor behavior: automotive driving.
As in surgery, driving capability can be assessed at a variety of spatial/temporal/hierarchical scales. For example: (i) are we trying to assess a driver's ability to stay in the middle of the road (a local, short space-time-scale task)? (ii) Or the ability to get from City A to City B under all types of road, traffic and weather conditions (a global spatial- and temporal-scale task)? (iii) Or the ability to reject distractions, e.g. cell phones (cognitive vs. spatiotemporal), during task performance? Despite these multi-scale issues, various road-transportation authorities have instituted driving tests to assess performance. Often the test involves ‘controlled performance of experiments’, e.g. a parallel-parking test or a three-point-turn test, scored by a driving examiner. While this manual assessment process is slowly making room for computerized diagnostic and assessment programs, the transition is by no means complete. However, it may provide a useful roadmap of challenges and considerations for the research and development of computer-aided/computer-enhanced surgical assessment.
Quantitative Performance Assessment
Many of the quantitative skill metrics currently available use acquired physical-measurement data from real surgeries as well as simulators. We will further elaborate on the type of data used to generate these metrics.
Contemporary surgical simulators use spatially- and temporally-aggregated measures, such as the MScore used in Intuitive Surgical's Skills Simulator. The MScore provides a binary (Yes/No) qualification answer and a continuous score to evaluate a trainee on various elementary tasks, such as camera targeting and peg-board manipulation. The MScore and other similar scores integrate a variety of acquired sensory data, such as tool drops, master-manipulator range and instrument collisions. Yet the ability of the MScore, as a normalized weighted combination of multiple physical measurements (time-to-task-completion (TTC), distance traveled), to adequately capture the subtle task-performance variations that form the discriminative basis between individuals and/or classes remains unclear. Other limitations, including the uncharacterized reliability, stability and repeatability of the employed metrics, hinder progress towards the final stages of validity.
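Since the MScore's internal weighting is proprietary, the following sketch only illustrates the general form of such a score: a normalized weighted combination of physical measurements yielding both a continuous value and a binary pass/fail. The feature names, reference ranges, weights and threshold below are all invented for illustration and are not Intuitive Surgical's.

```python
def normalize(value, lo, hi):
    # Map a raw measurement onto [0, 1]; lower raw values score higher
    # (shorter time / shorter path is better). Out-of-range values are clamped.
    v = (hi - value) / (hi - lo)
    return max(0.0, min(1.0, v))

# Hypothetical per-task reference ranges and weights (illustrative only).
RANGES = {"ttc_s": (30.0, 180.0), "path_len_cm": (50.0, 400.0), "collisions": (0.0, 10.0)}
WEIGHTS = {"ttc_s": 0.5, "path_len_cm": 0.3, "collisions": 0.2}

def composite_score(measurements):
    # Normalized weighted combination, analogous in spirit to aggregate
    # simulator scores such as the MScore.
    return sum(WEIGHTS[k] * normalize(measurements[k], *RANGES[k]) for k in WEIGHTS)

def qualifies(measurements, threshold=0.7):
    # Binary (Yes/No) qualification sitting on top of the continuous score.
    return composite_score(measurements) >= threshold

score = composite_score({"ttc_s": 60.0, "path_len_cm": 120.0, "collisions": 1.0})
```

The design issue noted above is visible even in this toy: very different measurement profiles can collapse to the same composite value, erasing exactly the user-specific variation a discriminative metric would need.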
Robotic minimally invasive surgery, and the engendered computer-integration, offers unique opportunities for quantitative computer-based surgical-performance evaluation. In our work, we examine an alternate method of manipulative skill evaluation using micro-motion studies, which have deep roots in performance evaluation in the manufacturing industries. The well-established micro-motion methodology, which originated in the early twentieth century, emphasizes: (1) a top-down segmentation of a primary task into basic motion elements (‘Therbligs’); (2) recording of elements and key subtask performance in process charts; and (3) obtaining metrics of performance for skill evaluation. Any of the performance metrics of macro-motions (from motion economy and tool-motion measurements to handed symmetry) can now be extended over the micro-motion temporal segments.
In addition to representative manipulation exercises from the da Vinci surgical (SKILLS) simulator, real surgical videos were also analyzed with a list of predefined ‘Therbligs’ in order to validate the clinical relevance of this method. This affords relatively controlled and standardized test scenarios for surgeons with varied experience levels. The resulting performance metrics over each sub-procedure enabled intra- and inter-user comparative studies.
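As a rough illustration of the micro-motion idea (not the exact segmentation procedure used in our studies), the sketch below splits a tool-tip trajectory into motion elements at low-speed dwell points, a crude stand-in for Therblig boundaries such as ‘grasp’ or ‘release’, and then computes per-segment duration and path length: the familiar aggregate measures, applied at finer temporal granularity. The velocity threshold and the toy trajectory are arbitrary.

```python
import math

def segment_by_dwell(positions, times, v_thresh=0.5):
    # Split a tool-tip trajectory into motion elements at dwell points,
    # i.e., samples where instantaneous speed drops below v_thresh.
    segments, current = [], [0]
    for i in range(1, len(positions)):
        dt = times[i] - times[i - 1]
        speed = math.dist(positions[i], positions[i - 1]) / dt
        current.append(i)
        if speed < v_thresh:
            segments.append(current)
            current = [i]
    if len(current) > 1:
        segments.append(current)
    return segments

def segment_metrics(positions, times, segment):
    # Per-segment analogues of the aggregate measures: duration and path length.
    duration = times[segment[-1]] - times[segment[0]]
    path = sum(math.dist(positions[j], positions[j - 1]) for j in segment[1:])
    return {"duration_s": duration, "path_len": path}

# Toy 2-D trajectory with a dwell (zero motion) between samples 2 and 3.
positions = [(0, 0), (1, 0), (2, 0), (2, 0), (3, 0), (4, 0)]
times = [0, 1, 2, 3, 4, 5]
segs = segment_by_dwell(positions, times)
```

Intra- and inter-user comparisons then operate on the per-segment metric vectors rather than on one whole-task aggregate.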
Language Of Surgery
Colleagues in the Computational Interaction and Robotics Lab (CIRL) [3,5] have studied skill assessment and gesture detection in both training and live-patient surgical motions, focusing on minimally invasive surgeries (such as robotic hysterectomy, functional endoscopic sinus surgery (FESS), and septoplasty). Their idea is based on the observation that humans performing dexterous tasks follow a sequence of identifiable recurring motions (motifs) with some variability. To extract the surgical motifs, they designed a new technique that first translates the raw motion data into a domain that highlights the similarities and suppresses many factors of variability, and then builds a dictionary of important motifs (weighted statistically based on their appearance in a particular surgery). The transformation function can be applied to streaming data, does not require manual processing, and is invariant to rigid transformation, cropping, and sampling frequency.
They also designed a similarity function that measures the similarity between two motion trajectories by comparing them against a dictionary of motifs. They report accuracies of about 80 to 90% for different surgical tasks. Besides learning surgical skill from demonstrations of expert trajectories, they built a robotic planning system to generate an optimal expert trajectory based on a cost function and anatomical constraints for FESS. They showed that the optimal trajectory is more similar to those demonstrated by experts than by novices, an indication that experts are probably optimizing their motions against the constraints of the environment.
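The CIRL transformation and dictionary construction are more sophisticated than can be reproduced here, but the flavor of motif-based comparison can be sketched as follows: z-normalize sliding windows of a 1-D kinematic signal (suppressing offset and scale variability), assign each window to its nearest motif in a small dictionary, and compare two trajectories by the overlap of their motif-usage histograms. The toy ‘rise’/‘fall’ dictionary and the signals below are invented for illustration.

```python
import math

def znorm(window):
    # Z-normalization suppresses offset/scale variability between executions.
    m = sum(window) / len(window)
    sd = math.sqrt(sum((x - m) ** 2 for x in window) / len(window)) or 1.0
    return [(x - m) / sd for x in window]

def motif_histogram(signal, motifs, w):
    # Assign each sliding window to its nearest motif (squared Euclidean
    # distance) and return the normalized frequency of each motif.
    counts = [0] * len(motifs)
    n_windows = len(signal) - w + 1
    for i in range(n_windows):
        win = znorm(signal[i:i + w])
        best = min(range(len(motifs)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(win, motifs[k])))
        counts[best] += 1
    return [c / n_windows for c in counts]

def similarity(sig_a, sig_b, motifs, w):
    # Histogram-intersection similarity in [0, 1]: 1.0 means identical motif usage.
    ha, hb = motif_histogram(sig_a, motifs, w), motif_histogram(sig_b, motifs, w)
    return sum(min(a, b) for a, b in zip(ha, hb))

motifs = [znorm([0, 1, 2]), znorm([2, 1, 0])]  # toy 'rise' and 'fall' motifs
s = similarity([0, 1, 2, 3, 4], [4, 3, 2, 1, 0], motifs, w=3)
```

Because the windows are z-normalized, a monotone climb and its mirror image land on opposite motifs and score zero similarity, while two climbs of different amplitude score 1.0.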
Video Based Semantic Understanding
The skill evaluation metrics discussed in the previous sub-sections need a variety of sensory data, which limits their application to very specific robotic devices. The appeal of using monocular/stereo video data for assessment and training stems from the widespread availability of, and the quintessential requirement for, such a modality in tele-operation. Such systems rely solely on the surgeon-in-the-loop to ensure safe operation amidst a host of real-world uncertainties and complexities; e.g., the finite life of, and slack in, the cables of passive robotic surgical instruments lead to tool-positioning inaccuracies, requiring surgeons to compensate for the error.
However, the rich information content of the video stream can also be used to automatically assess surgeries. Specifically, Kumar et al. leverage real-time video-based understanding for improved situational awareness and context-based decision support in robotic surgeries. Efforts at video understanding of real and virtual surgeries are pursued with a two-fold objective: (i) better understanding of the operator's skill and (ii) a cascaded framework that is useful from multiple perspectives: surgical guidance, safety, tracking and skill assessment. Though this preliminary study is restrictive (it considers only two tools and two attributes), it can be easily extended to multiple tools and attributes.
Ultimately, any automated skill assessment algorithm needs to rate and classify surgeons, for which the algorithms need ground-truth data. Jun et al. classified surgeons into three categories (expert, intermediate and novice) that were pre-assigned based on experience; however, this approach did not produce any continuous scores. Ahmidi et al. used scores generated by human experts, such as OSATS ratings, with surgeons then classified as expert, intermediate or novice based on the resulting score. Neither work, however, provided a fine-grained continuous measure of skill.
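Neither of the cited approaches is reproducible from the description above, but the general expert/intermediate/novice classification step can be sketched with a minimal nearest-centroid classifier over aggregate kinematic features. The feature names and training values below are synthetic, not from any cited study.

```python
import math

# Synthetic labeled trials: (time-to-completion s, path length cm) per class.
# These numbers are illustrative only.
TRAINING = {
    "expert":       [(45.0, 90.0), (50.0, 95.0), (48.0, 88.0)],
    "intermediate": [(70.0, 140.0), (75.0, 150.0), (68.0, 135.0)],
    "novice":       [(110.0, 220.0), (120.0, 240.0), (105.0, 210.0)],
}

def centroids(training):
    # Mean feature vector per class label.
    return {label: tuple(sum(v) / len(trials) for v in zip(*trials))
            for label, trials in training.items()}

def classify(features, cents):
    # Nearest-centroid label: the discrete expert/intermediate/novice
    # output discussed above, with no continuous score attached.
    return min(cents, key=lambda lbl: math.dist(features, cents[lbl]))

CENTS = centroids(TRAINING)
label = classify((52.0, 100.0), CENTS)
```

One natural extension toward a continuous measure is to report the distance to the expert centroid alongside the discrete label, which is exactly the kind of gradated output the Skill Metrics section calls for.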
To our knowledge, published validity studies that benchmark against clinician skill levels do not exist, although many ipso-facto studies are underway. Nonetheless, many surgical residency programs and hospital administrations are proposing to use such measures on training simulators to help pre-qualify trainees prior to actual wet-lab usage. As the Intuitive Surgical white paper notes, “universally-accepted and validated” metrics are key to the deployment of a staged and calibrated robotic-surgery training curriculum.
The field is replete with open problems that remain to be tackled. A series of workshops organized by the authors ([ICRA2013 https://sites.google.com/site/ieeerassurgrobstandards/], [IROS2014 https://sites.google.com/site/ieeerasmedicalrobotics/]) have sought to highlight the critical gaps in both fundamental research and technology development efforts, especially as pertains to benchmarking the performance of human users; training and accreditation; and safety and risk-assessment. A few of these open problems are listed below in no particular order:
Benchmarked data with standardized metrics for better algorithm development: Surgical robotics hardware is prohibitively expensive to acquire, operate and maintain. To enable better algorithm development, the community needs to ensure open-source standardized information acquisition across various devices.
Relating metrics to surgical outcomes and using these metrics for certification: Human anatomy shows wide variations and any skill metric should be related to surgical outcomes. The consensus in the community suggests “universally-accepted and validated” metrics are key to deployment of a staged and calibrated robotic-surgery training curriculum.
Specific feedback, such as “Your motion is not efficient while suturing” instead of “Sorry, you need to practice that motion again”: Thin-slice skill assessment is necessary for person- and operation-specific skill assessment and feedback. This would allow one to focus on a particular area of concern, such as manipulation or coordination.
Presenting feedback to the surgeon to improve safety based on skill: Can we provide real-time guidance during occlusions, or instructions to help the surgeon bring his or her tools back into view?
Success in understanding manifestation of human skill within the context of surgical tele-operated systems will have implications for a much broader arena of sensorimotor skill assessment and training, in particular ones assisted by robotic systems. Improved understanding of human manipulatory skills would be critical to designing a broad range of interactive robotic-manipulation systems, from telesurgical systems to various teleoperated vehicles and more generally to human user control of complex machinery.
From a broader scientific perspective, it will give us insights into the organization of neuro-musculoskeletal interactions within the brain, including applications involving improvements in sensorimotor performance. The ability to couple quantitative, validated and stable metrics to surgical performance would lead to improvements in assessment and, subsequently, training methods. Cognitive assessment can now be extended to include sensorimotor assessment, with the capacity to monitor and track skill across time. Such metrics would help usher in the next generation of virtual procedural simulators, with significant impact on patient safety by providing a ready means to learn, maintain and improve surgical procedural skills. Specifically for teleoperated surgical systems, skill understanding will also provide a quantitative method for surgical-education assessment.
This work was partially supported by the National Science Foundation (NSF) Awards IIS-1319084, CNS-1314484 and the UB Bruce-Holm Catalyst Fund.