Abstract
Empathic design research aims to gain a deep and accurate understanding of users. A designer's empathic ability can be measured as empathic accuracy (EA), i.e., how accurately the designer infers the user's thoughts and feelings during an interview. However, the EA measure currently relies on human raters and is therefore time-consuming, making large language models (LLMs) an attractive alternative. Two significant constraints must be considered when implementing LLMs as a solution: the choice of LLM and the impact of domain-specific datasets. Datasets of interactions between designers and users are not generally available. We present such a dataset, built around the EA task employed in user interviews to measure empathic understanding. It contains over 400 pairs, each matching a user's thought or feeling with a designer's guess of the same, together with human ratings of the accuracy of the guess. We compared the performance of six state-of-the-art (SOTA) sentence-embedding LLMs with different pooling techniques on the EA task, using the LLMs to extract semantic information both before and after fine-tuning. We conclude that selecting LLMs solely on the basis of their reported performance on general language tasks can lead to errors when judging a designer's empathic ability. We also found that fine-tuning the LLMs on our dataset improved their performance, but that performance also depended on how well the model fit the EA task and on the pooling method. These results provide insight for other LLM-based similarity analyses in design.
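To make the embedding-based similarity idea summarized above concrete, the following is a minimal sketch of scoring a single EA item by comparing a user's statement with a designer's guess. It is illustrative only: the model name, sentences, and library choice (the public sentence-transformers package) are assumptions for exposition, not the six SOTA models, pooling techniques, or fine-tuning setup evaluated in the paper.

```python
# Minimal sketch: embedding-based similarity for one EA item.
# Assumptions: the sentence-transformers library and the "all-MiniLM-L6-v2"
# model are placeholders, not the configurations studied in this work.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

user_statement = "I felt lost navigating the checkout page."     # placeholder user thought/feeling
designer_guess = "The user was confused by the checkout flow."   # placeholder designer inference

embeddings = model.encode([user_statement, designer_guess])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# Higher cosine similarity is treated as a proxy for higher empathic accuracy,
# to be compared against the human accuracy ratings in the dataset.
print(f"Similarity score: {similarity:.3f}")
```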