Scorebuddy Labs is our innovation center, a team that experiments with new technologies to build a better platform. It’s a great place to be, discovering new possibilities, validating them with key clients and partners, and establishing these innovations as part of our product roadmap.
While the implementation of generative AI for contact centers is still in its infancy, the potential of tools like ChatGPT is clear to see. At Scorebuddy Labs we are focused on experimentation, innovation, and, importantly, validation. We want to deliver more than promises to our customers—we want to deliver tried and tested solutions.
Emmanuel Doubinsky, Head of Innovation, Scorebuddy Labs
So we experimented with ChatGPT for a little while. We were genuinely impressed with the early GPT-3 model, although we could see some shortcomings in its understanding of our prompts, and in the accuracy and relevance of its answers. As a side note, if you’re interested in learning more about AI prompting, we recommend this free, open source course.
Then GPT-4 arrived, and things got better—much better. The AI understood more complex prompts and the quality and accuracy of answers improved.
We tested some common use cases relevant to the QA programs of our large contact center customers:
- Checking spelling and grammar in agent responses
- Summarizing complex conversations
- Detecting customer anger and agent empathy
- Monitoring correct business practices and regulatory compliance
- Generating coaching recommendations for agents
The first three appeared to work well at a sample level. ChatGPT does a great job of catching spelling and grammatical mistakes, it provides good summaries of complex conversations, and it sometimes performs even better than human evaluators when it comes to detecting customer anger and appropriate agent empathy.
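To make this concrete, here is a minimal sketch of how one of these sample-level checks could be run through the OpenAI Python client. It is an illustration, not our production setup; the transcript, prompt wording, and model name are placeholders chosen for the example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# A short, invented agent-customer exchange for illustration.
# (The misspelling "realy" is deliberate, to give the spelling check something to find.)
transcript = """Customer: I've been waiting two weeks for my refund and nobody has helped me!
Agent: I'm realy sorry about that, let me check your account right now.
Customer: Please do, this has been so frustrating.
Agent: I understand completely. I've escalated the refund and you'll have it within 48 hours."""

prompt = (
    "You are a contact center QA assistant. For the transcript below:\n"
    "1. List any spelling or grammar mistakes made by the agent.\n"
    "2. Summarize the conversation in two sentences.\n"
    "3. State whether the customer expressed anger, and whether the agent "
    "responded with empathy.\n\n"
    f"Transcript:\n{transcript}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the evaluation output as repeatable as possible
)

print(response.choices[0].message.content)
```

Note that nothing in this sketch is trained on our data; the model works straight from the prompt.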
At Scorebuddy Labs, we’re undertaking extensive testing with customers and partners in order to determine, on a larger scale, whether or not ChatGPT is truly a revolutionary tool for contact center quality assurance.
Perhaps the most impressive feature was the quasi “ready to use” nature of the AI. ChatGPT processed all of these agent-customer interactions without any prior training.
This training requirement has killed many AI projects in the past: it can take weeks or months of configuration and testing before an initial deployment, and even then further issues can arise that require additional training.
This “ready to go” capability of ChatGPT is a total game changer in AI.
ChatGPT is not a cure-all for contact center quality assurance. Some other use cases didn’t work as well, and it’s worth understanding what ChatGPT can and cannot do, so everyone wins when adopting it.
Monitoring correct business practices and regulatory compliance is not simple—for a human or a robot. Sometimes you have to check if an agent took the correct action on another platform, or if the agent mentioned the correct disclaimer for the specific context of the conversation in question.
Here, we hit some roadblocks. GPT-3 would answer all of these questions without hesitation, even though it sometimes had absolutely no context or background information to give such answers. This would be a major “no-no” for compliance checks, so we began to dismiss the technology for these use cases, relying more on our in-house AI and NLP capabilities.
GPT-4 improved things somewhat. It began to realize when it didn’t have the necessary context, which in itself is really impressive, and the new-ish model-tuning capabilities might just be able to bridge that gap.
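To illustrate the kind of check we mean, here is a hypothetical sketch of a disclaimer audit in which the required wording is supplied in the prompt and the model is explicitly told to say when it cannot verify compliance. The disclaimer, transcript, and prompt are invented for the example and are not the exact prompts we used in our tests.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical compliance rule and transcript, for illustration only.
required_disclaimer = "This call may be recorded for training and quality purposes."
transcript = """Agent: Good afternoon, you're through to support.
Customer: Hi, I'd like to close my account, please.
Agent: No problem, I can take care of that for you today."""

prompt = (
    "You are auditing a contact center call for regulatory compliance.\n"
    f'Required disclaimer: "{required_disclaimer}"\n\n'
    f"Transcript:\n{transcript}\n\n"
    "Did the agent read the required disclaimer? Answer YES or NO and quote the "
    "relevant line. If the transcript does not contain enough information to "
    "decide, answer INSUFFICIENT CONTEXT rather than guessing."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

print(response.choices[0].message.content)
```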
The question is: how much tuning is needed? And for how long? This will be the focus of our next experiment alongside our customers and partners, and we’re hopeful that we can come to a positive conclusion.
One last challenge—and it’s a big one—is that ChatGPT is prone to “hallucinations”. What do we mean by this? Well, sometimes ChatGPT invents facts. Furthermore, it won’t tell you that these facts are invented, nor is it even aware that it made them up. This problem was a significant factor in the failure of our compliance use case, as this area requires total accuracy in every answer.
It also led to the failure of our final use case, wherein we asked ChatGPT to give us coaching recommendations that would help a particular agent improve on a specific weakness.
Coaching is a key focus of contact center QA. It supports the idea that QA is not here to micromanage agents, but instead to help them become better at what they already do well. Unfortunately, this message doesn’t quite ring true when the coaching recommendations given by ChatGPT sound auto-generated and unrelated to the issue you want to address.
It’s difficult to blame ChatGPT for this, but it’s concerning that the AI is totally unaware when it spits out fabricated information. This being said, it is possible that fine-tuning a model will help close this gap. For now, however, the risk remains that ChatGPT does not know when it’s having one of these “hallucinations”.
It is capable of giving sincere answers, very well formulated, that appear as credible as those given by a human expert. Unfortunately, these answers are sometimes total gibberish.
I asked ChatGPT the following question about my old car (which I have since upgraded to a newer model, thanks to my excellent work and contribution to Scorebuddy):
“How do I change the battery of my Mercedes E Class 2004?”
I was served with a complete answer detailing each step involved in this operation. Feel free to skip to the end (unless you drive a 2004 Mercedes E Class of course), but I want to share the deep level of credible details ChatGPT generated for me:
“Here are the general steps to change the battery on a Mercedes E Class 2004:
Please note that these steps are general and may vary slightly depending on the specific model and year of your Mercedes E Class. It is also important to take safety precautions, such as wearing gloves and safety glasses, when handling a car battery.”
I was impressed by the level of detail. “Use a 10 mm wrench”—so precise. “Clean the battery tray with a wire brush”—I never thought of doing that; this ChatGPT guy must work in a great garage. And I really loved the safety precaution at the end; it made the answer appear genuine and trustworthy.
But wait: if I open the hood of my beloved 2004 Mercedes E Class, there’s no sign of a battery. As it turns out, step 2 was wrong. The battery is not under the hood, it’s in the trunk (or the “boot” for us Europeans).
Asking the same question on Google instantly returned a video describing the correct steps, beginning with “open the trunk”.
So, can we trust what ChatGPT tells us? I asked ChatGPT that very question, and its long, honest answer contained this interesting section:
“It's important to always verify information that you receive, whether it's from an AI language model like myself or from any other source. Double-checking the facts with reliable sources and seeking out multiple perspectives can help you make informed decisions and avoid spreading misinformation.”
This is nice to say, but while my Google search gave me a verifiable source for the information, ChatGPT does not give you any sources—because it doesn’t know what its sources are.
Fine-tuning, testing, more fine-tuning, and more testing is probably the answer. Not everything in ChatGPT is ready to go right out of the box, but there’s more than enough there to set it apart as a potential game changer for contact center quality assurance.
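For anyone curious what that loop might involve mechanically, here is a hedged sketch of kicking off a fine-tuning job through OpenAI’s fine-tuning API, assuming you have already prepared a JSONL file of example evaluations. The file name and model choice are illustrative assumptions, not a recommendation or a description of our own setup.

```python
from openai import OpenAI

client = OpenAI()

# qa_examples.jsonl is a hypothetical file where each line looks like:
# {"messages": [{"role": "user", "content": "<transcript + question>"},
#               {"role": "assistant", "content": "<ideal evaluation>"}]}
training_file = client.files.create(
    file=open("qa_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job on a model that supports fine-tuning.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

print(job.id, job.status)  # poll later with client.fine_tuning.jobs.retrieve(job.id)
```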
Scorebuddy Labs will continue experimenting with this promising technology and let you know how our real-world deployments go. Unless we’re all replaced by robots—in which case, I’m sure they’ll keep you posted.