Amazon researchers have unveiled BASE TTS, a large language model (LLM) for text-to-speech that shows what they describe as "emergent" capabilities. At 980 million parameters, it is the largest text-to-speech model to date. In their study, the team trained models of several sizes on up to 100,000 hours of public domain speech data, looking for the kind of performance jumps that natural language processing models often show as they grow.
Surprisingly, the researchers found that their medium-sized model, with 400 million parameters and trained on 10,000 hours of audio, was markedly more versatile and robust on challenging test sentences. These sentences were built around lexical, syntactic, and paralinguistic features that are known to trip up traditional text-to-speech systems, such as compound nouns, emotions, foreign words, and punctuation. On them, BASE TTS made significantly fewer errors in stress, intonation, and pronunciation than existing models.
BASE TTS was not explicitly trained on the kinds of tasks posed by these test sentences, which points to an ability to handle cases it has not seen before. Interestingly, the larger 980 million parameter version, trained on 100,000 hours of audio, did not demonstrate capabilities beyond those of the 400 million parameter version.
While the development of BASE TTS remains an experimental process, it underscores the potential for these models to achieve newfound versatility as they scale—a promising prospect for the field of conversational AI. The researchers are committed to further exploration, aiming to identify the optimal model size for unlocking emergent abilities.
BASE TTS is also designed to be lightweight and streamable, with emotional and prosodic data packaged separately. This design could allow natural-sounding speech to be transmitted over low-bandwidth connections, broadening the practical applications of the technology.
With its emergent capabilities, BASE TTS marks a notable step for text-to-speech technology and for conversational AI more broadly. Further results are expected as the researchers continue to investigate which model size best brings out these abilities.