In my latest personal research and development project, I embarked on a fascinating journey to explore the potential of Voice-to-Voice communication with Large Language Models (LLMs). The goal was to create a system that takes voice input, converts it into text, feeds that text to an LLM, and then translates the model's text response back into voice. This blog post discusses my experiences, methods, and the results of exploring this concept.
Voice to Text Conversion
The initial challenge of the project was converting voice into text. This was accomplished using the browser's SpeechRecognition API, part of the free Web Speech API, which allowed me to capture spoken words and transform them into textual form. By utilizing this API, I was able to create a seamless bridge between the spoken word and the LLM.
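A minimal sketch of this step looks roughly like the following. It assumes a browser context (SpeechRecognition is not available in Node), and the `collectTranscript` helper that flattens the recognizer's results is my own illustrative addition, not part of the API:

```typescript
// Pure helper: flatten a SpeechRecognitionResultList-like structure
// into a single transcript string, keeping only final results.
export function collectTranscript(results: ArrayLike<any>): string {
  const parts: string[] = [];
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    if (result.isFinal !== false) {
      parts.push(result[0].transcript.trim());
    }
  }
  return parts.join(" ");
}

// Browser-only: start the recognizer once and resolve with the text.
// Chromium exposes the API under the webkit prefix.
export function listenOnce(lang = "en-US"): Promise<string> {
  const g: any = globalThis as any;
  const Recognition = g.SpeechRecognition ?? g.webkitSpeechRecognition;
  if (!Recognition) {
    return Promise.reject(new Error("SpeechRecognition is not supported here"));
  }
  return new Promise((resolve, reject) => {
    const recognizer = new Recognition();
    recognizer.lang = lang;
    recognizer.interimResults = false;
    recognizer.onresult = (event: any) => resolve(collectTranscript(event.results));
    recognizer.onerror = (event: any) => reject(new Error(event.error));
    recognizer.start();
  });
}
```

The resolved string can then be sent straight to the LLM as the user's prompt.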
Text to Voice Translation
The next step was to convert the text output from the LLM back into a human-like voice response. For this purpose, I leveraged the local voices installed with the operating system, such as the ones Windows and most other devices ship with.
I achieved high-quality results without incurring any cost by utilizing the SpeechSynthesis API. This process was a significant success, providing realistic vocal responses. However, this approach did come with a notable downside.
Every operating system ships with different voices, and there is no common default. This meant that a mapping step was necessary, where the user pairs each AI model with the specific voice they want to use. While not overly complex, this step introduced an additional layer of setup and customization.
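The mapping step can be sketched as a small lookup plus a fallback. The model names and voice names in the table below are illustrative placeholders; the actual names depend on the models you run and the voices installed on the machine:

```typescript
export interface VoiceLike {
  name: string;
  lang: string;
}

// Hypothetical model-to-voice pairing the user configures once.
// These entries are examples only; real voice names vary per OS.
const voiceByModel: Record<string, string> = {
  "gpt-4": "Microsoft Zira - English (United States)",
  "llama-2": "Microsoft David - English (United States)",
};

// Pick the mapped voice, falling back to any voice in the right
// language, then to the first available voice.
export function pickVoice<T extends VoiceLike>(
  voices: T[],
  model: string,
  lang = "en-US"
): T | undefined {
  const preferred = voiceByModel[model];
  return (
    voices.find((v) => v.name === preferred) ??
    voices.find((v) => v.lang === lang) ??
    voices[0]
  );
}

// Browser-only: speak the LLM's reply with the mapped voice.
export function speak(text: string, model: string): void {
  const g: any = globalThis as any;
  if (!g.speechSynthesis) throw new Error("SpeechSynthesis is not supported here");
  const utterance = new g.SpeechSynthesisUtterance(text);
  utterance.voice = pickVoice(g.speechSynthesis.getVoices(), model);
  g.speechSynthesis.speak(utterance);
}
```

Note that `getVoices()` can return an empty list until the browser fires `voiceschanged`, which is another small wrinkle of relying on local voices.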
Google's Text-to-Speech Service
I also experimented with Google's Cloud Text-to-Speech service, particularly with the female voice en-US-Studio-O. The results were astonishing. The tone, emotional depth, and overall quality were indistinguishable from a real phone call. The synthesized voice was lifelike and genuinely impressive.
However, the main drawback of this service was the cost. While the results were breathtakingly realistic, using Google’s service was prohibitively expensive for daily use. Therefore, it was not a viable solution for continual experimentation or production.
Exploring Azure Speech Services
In my relentless pursuit of perfection and innovation, I turned my attention to Microsoft's Azure Speech Services. I was particularly drawn to its REST API, which allowed me to integrate it seamlessly with my existing TypeScript functions. I also built an Angular service to consume it, much as I had done with Google's service.
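The shape of that integration is roughly the following sketch. The regional endpoint, headers, and SSML envelope follow Azure's documented text-to-speech REST API; the region, key handling, and output format are placeholders you would swap for your own:

```typescript
// Wrap plain text in a minimal SSML document for a given neural voice.
export function wrapInSsml(text: string, voice = "en-US-SaraNeural"): string {
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">` +
    `<voice name="${voice}">${text}</voice>` +
    `</speak>`
  );
}

// POST SSML to the regional Azure endpoint and get raw audio back.
export async function synthesizeWithAzure(
  text: string,
  subscriptionKey: string,
  region = "eastus" // placeholder region
): Promise<ArrayBuffer> {
  const response = await fetch(
    `https://${region}.tts.speech.microsoft.com/cognitiveservices/v1`,
    {
      method: "POST",
      headers: {
        "Ocp-Apim-Subscription-Key": subscriptionKey,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-24khz-48kbitrate-mono-mp3",
      },
      body: wrapInSsml(text),
    }
  );
  if (!response.ok) throw new Error(`Azure TTS failed: ${response.status}`);
  return response.arrayBuffer(); // raw MP3 bytes, ready to play
}
```

Because the API takes SSML rather than plain text, it is also the natural entry point for the tone control described next.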
The results were not merely comparable to what I achieved with Google; they were, in many ways, even more impressive.
Dynamic Tones and Expressions
What truly sets Azure apart is the ability to add various tones to the voices. Some voices offer up to 17 different tones, ranging from sad and happy to unfriendly and even whispering. This allowed for an extraordinary level of customization, enhancing the overall realism and emotional resonance of the synthesized speech.
By leveraging the Speech Synthesis Markup Language (SSML) in the API requests, I could construct sentences with varied tones, pauses, and other subtleties. This feature has brought an additional layer of sophistication to the Voice-to-Voice communication channel, making it even more engaging.
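A sketch of composing such a request: Azure expresses speaking styles through its `mstts:express-as` SSML extension and pauses through the standard `<break>` element. The `Segment` structure below is my own illustrative abstraction, not part of the API:

```typescript
export interface Segment {
  text: string;
  style?: string;   // an Azure speaking style, e.g. "whispering" or "sad"
  pauseMs?: number; // optional pause inserted after the segment
}

// Compose a single SSML document from styled segments for one voice.
export function buildStyledSsml(segments: Segment[], voice = "en-US-SaraNeural"): string {
  const body = segments
    .map((s) => {
      const spoken = s.style
        ? `<mstts:express-as style="${s.style}">${s.text}</mstts:express-as>`
        : s.text;
      const pause = s.pauseMs ? `<break time="${s.pauseMs}ms"/>` : "";
      return spoken + pause;
    })
    .join("");
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">` +
    `<voice name="${voice}">${body}</voice></speak>`
  );
}
```

One SSML body can therefore shift between, say, a cheerful greeting and a whispered aside within a single synthesized reply.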
Implementing the en-US-SaraNeural Voice
One of the standout experiences was working with the en-US-SaraNeural voice. Through experimentation with different tones, including whispering, unfriendly, and friendly, I found that the results outperformed even Google’s service. The refined quality and flexibility of Azure’s offering have enabled a richer, more interactive experience.
Cost and Scalability
While Azure’s service provides fantastic quality, it is worth noting that the pricing is comparable to Google’s service. The balance between cost and quality will continue to be a vital consideration as I seek scalable solutions.
Conclusion
This research project has opened up a new avenue of communication that might reshape how we interact with AI models in the future. By combining freely available tools with the power of LLMs, I was able to create a seamless Voice-to-Voice communication channel.
The blend of technology, creativity, and innovation has enabled me to achieve impressive results. Despite some challenges, including mapping voices and the prohibitive cost of some third-party services, the outcome has been inspiring.
The work continues as I explore more efficient, affordable, and scalable solutions. The learnings from this project can be applied across various domains, making Voice-to-Voice communication with Large Language Models a promising frontier in human-AI interaction.
If you’re interested in exploring this further or have any thoughts on this project, feel free to reach out. Let’s shape the future together!