Understanding user satisfaction in conversational AI

In the rapidly evolving landscape of artificial intelligence, chatbots have become a cornerstone of customer service, support, and engagement across various industries. Despite their widespread adoption, one of the persistent challenges remains: accurately gauging user satisfaction. Understanding whether users are truly satisfied with their interactions is crucial for refining chatbot performance, enhancing user experience and trust, and ensuring business objectives are met.

However, measuring satisfaction in conversational AI systems is far from straightforward. Traditional metrics like response accuracy or completion rates often fail to capture the nuanced sentiments and perceptions of users. As a result, businesses struggle to determine whether their chatbots are genuinely meeting user needs or merely providing superficially acceptable interactions. This gap underscores the need for more reliable, comprehensive methods to assess user satisfaction and inform continuous improvement.

The Limits of Direct User Feedback

One straightforward approach is to ask users directly to report their satisfaction through surveys or feedback prompts. In practice, however, this method falls short: users typically skip these questions, give biased answers to please the system or avoid confrontation, or are simply reluctant to share honest opinions. On its own, direct feedback rarely provides a reliable signal.

A better approach is to gauge user satisfaction automatically from the conversation itself. This has been the subject of many previous efforts, most recently the work by Lin et al. [1].

The main idea there is to develop a set of criteria, or rubrics, by which to score user satisfaction and dissatisfaction directly from the conversations. If, during a conversation, the user asks the same question several times with various rephrasings, this is a good indication of dissatisfaction. On the other hand, a user who builds follow-up questions on details gleaned from previous chatbot responses is likely more satisfied. Rubrics for satisfaction might include clarity and relevance, while repetitions or inconsistencies could constitute rubrics for dissatisfaction.
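For illustration only, here is what such rubric lists might look like in practice; these example rubrics are hypothetical and not taken from the paper:

    # Hypothetical example rubrics for a generic assistant; the actual rubrics
    # are learned per system, as described in the following sections.
    SATISFACTION_RUBRICS = [
        "User builds follow-up questions on details from earlier responses",
        "User confirms that an answer was helpful or thanks the assistant",
        "Responses are clear and directly relevant to the user's request",
    ]
    DISSATISFACTION_RUBRICS = [
        "User repeats or rephrases the same question multiple times",
        "User points out errors or inconsistencies in the responses",
        "User abandons the conversation without acknowledging a resolution",
    ]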

SPUR: A supervised approach to satisfaction scoring

However, the rubrics are task-specific and usually vary from system to system. An example provided in the paper is a travel-agency chatbot: a good measure of user satisfaction in that case is whether the user ended up booking a flight. That same measure is a very poor fit for other systems, e.g., general-purpose chatbots like ChatGPT. Given a new chatbot (perhaps one under development by our organization), how do we select the right rubrics for its particular use case?

Lin et al. describe a methodology they term SPUR (Supervised Prompting for User satisfaction Rubrics), which, as its name suggests, is based on a labeled dataset of conversations. In all the cases they describe, more than 1,000 such labeled conversations were used. Training proceeds in two steps, each using an LLM with an appropriate prompt (a simplified code sketch of both steps follows the list):

  • In step 1, Supervised Extraction, three (a configurable parameter) candidate satisfaction rubrics are extracted from each conversation labeled as satisfactory, and similarly, three candidate dissatisfaction rubrics are extracted from each conversation labeled as dissatisfactory.
  • In step 2, Rubric Summarization, the candidate satisfaction rubrics from all conversations are clustered, variations in naming and phrasing are merged, and the ten (another configurable parameter) most prevalent rubrics emerge as the final satisfaction rubrics. A similar process over the dissatisfactory conversations yields the ten dissatisfaction rubrics.
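A minimal sketch of these two training steps in Python, assuming a generic call_llm helper and heavily simplified prompts (the paper's actual prompts, parsing, and parameters differ):

    import json

    def call_llm(prompt: str) -> str:
        """Placeholder for a call to the LLM of your choice (assumed helper, not a real API)."""
        raise NotImplementedError

    # Step 1: Supervised Extraction - mine candidate rubrics from labeled conversations.
    def extract_candidates(conversation: str, label: str, k: int = 3) -> list[str]:
        prompt = (
            f"The following conversation was labeled '{label}'.\n"
            f"List up to {k} concise rubrics (criteria) explaining why the user was {label}.\n"
            f"Return a JSON list of strings.\n\n{conversation}"
        )
        return json.loads(call_llm(prompt))

    # Step 2: Rubric Summarization - cluster candidates, merge naming variations,
    # and keep the n most prevalent rubrics.
    def summarize_rubrics(candidates: list[str], n: int = 10) -> list[str]:
        prompt = (
            f"Cluster the following candidate rubrics, merge naming variations, and "
            f"return the {n} most prevalent ones as a JSON list of strings:\n"
            + "\n".join(f"- {c}" for c in candidates)
        )
        return json.loads(call_llm(prompt))

    def train_spur(sat_convs: list[str], dissat_convs: list[str]) -> tuple[list[str], list[str]]:
        sat_candidates = [r for conv in sat_convs for r in extract_candidates(conv, "satisfied")]
        dissat_candidates = [r for conv in dissat_convs for r in extract_candidates(conv, "dissatisfied")]
        return summarize_rubrics(sat_candidates), summarize_rubrics(dissat_candidates)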

Once training is done, the ten satisfaction and ten dissatisfaction rubrics can be used for scoring at inference time. A third (inference) prompt assigns a score between 1 and 10 per rubric for any new conversation. The ten satisfaction scores are summed, the sum of the ten dissatisfaction scores is subtracted from that, and the result is a final score per conversation in the range [-100, 100].
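A rough sketch of this inference-time aggregation, continuing the hypothetical call_llm helper above; in practice the scoring prompt may cover all rubrics in a single call rather than one call per rubric:

    def score_conversation(conversation: str,
                           sat_rubrics: list[str],
                           dissat_rubrics: list[str]) -> int:
        # Ask the LLM for a 1-10 score per rubric (simplified: one call per rubric).
        def score(rubric: str) -> int:
            prompt = (
                f"On a scale of 1 to 10, how strongly does the conversation below exhibit "
                f"this rubric?\nRubric: {rubric}\nReturn only the number.\n\n{conversation}"
            )
            return int(call_llm(prompt))

        sat_total = sum(score(r) for r in sat_rubrics)        # sum over the ten satisfaction rubrics
        dissat_total = sum(score(r) for r in dissat_rubrics)  # sum over the ten dissatisfaction rubrics
        return sat_total - dissat_total                       # final per-conversation score as described above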

This algorithm works well but relies heavily on approximately 1,000 labeled conversations, which is a significant burden. Moreover, if the system is modified or improved and we would like to retrain the rubrics on the new version, we inevitably need to label another batch of 1,000 conversations to do so. This is not very scalable.

Bootstrapping SPUR without labels

At Check Point, we found a way around this issue that allows us to bootstrap the training process without any labeled data. The idea is simple: we modify the Supervised Extraction step and “lie to the LLM”. We run it on each unlabeled conversation twice, once with the claim that the conversation is satisfactory and once with the opposite claim.

The lie is not so blatant in any case: in each run, the LLM is asked to generate up to three candidate rubrics. Since most conversations contain both satisfactory and dissatisfactory elements, the LLM usually succeeds in both runs, but we also allow it to return fewer than three candidate rubrics per run (or none at all). Moreover, the candidate rubrics are then clustered and sifted during the Rubric Summarization step, so any outliers generated in the first step are discarded in the second.
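In code, the bootstrapped training might look as follows, reusing the hypothetical extract_candidates and summarize_rubrics helpers sketched above:

    # Bootstrapped extraction: run the extraction prompt twice per unlabeled
    # conversation, once under each assumed label; each run may return up to
    # three candidate rubrics, or none at all.
    def train_spur_unlabeled(conversations: list[str]) -> tuple[list[str], list[str]]:
        sat_candidates, dissat_candidates = [], []
        for conv in conversations:
            sat_candidates.extend(extract_candidates(conv, "satisfied"))
            dissat_candidates.extend(extract_candidates(conv, "dissatisfied"))
        # Rubric Summarization then clusters the candidates and discards outliers.
        return summarize_rubrics(sat_candidates), summarize_rubrics(dissat_candidates)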

The diagram below depicts the entire training cycle for the improved SPUR:

Real-World Results from Check Point Infinity Copilot

This improved SPUR algorithm has been used successfully on conversations from Infinity Copilot, Check Point’s generative AI security assistant, which is used to accelerate cyber security operations. A sample of its satisfaction scoring results was validated against human annotation, and good correlation was observed between the automatically generated scores and the human ones. Since that validation, the generated scores have been used to monitor all Infinity Copilot conversations, and maximizing them has become one of the important objectives for ongoing development of the system. Indeed, using this input, overall satisfaction with the system has improved continuously over time. Following these results, a patent application for the improved SPUR algorithm was recently submitted.

Conclusion

As conversational AI becomes a core part of cyber security operations, understanding user satisfaction is no longer a nice-to-have. The improved SPUR method developed at Check Point offers a scalable, label-free way to measure satisfaction directly from conversations, enabling smarter, faster, and more user-aligned AI systems. For security practitioners, this means better adoption, more trust in AI tools, and ultimately, stronger operational outcomes. Learn more about Infinity AI Copilot.

[1] Lin et al., “Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models,” ACL 2024.
