VALL-E 2 is a text-to-speech (TTS) model developed by Microsoft that can closely mimic the way humans talk. It is an upgraded version of Microsoft's earlier VALL-E model, designed to address that model's drawbacks and shortcomings.
What is a Text to Speech Model?
Virtual assistants such as Siri and Alexa are well-known examples of products that rely on text-to-speech technology. A text-to-speech model, commonly abbreviated as TTS, is a type of artificial intelligence technology that converts written text into speech. A text-to-speech system is also frequently called a “read aloud” system.
The first general English text-to-speech system was developed by Noriko Umeda and colleagues in 1968, and the technology has long been used to help blind and visually impaired people read written text. Today it has advanced to the point where, given only a small sample of a person's voice, a model can speak in that person's language and tone. It has become genuinely difficult to tell a human voice from a machine voice.
VALL-E 2 Model: An Overview
VALL-E 2 is Microsoft's latest upgrade to its text-to-speech technology. Microsoft's researchers claim to have achieved remarkable accuracy in mimicking human speech, and to have done so from only a short sample of the target speaker's voice.
Microsoft researchers announced that “VALL-E 2 is the first voice AI to reach human parity in speech robustness, naturalness, and speaker similarity”. “Human parity” here means that, on the benchmarks used, speech generated by VALL-E 2 was judged equal in quality to speech recorded from humans.
Two key features that make the system more realistic:
- Repetition Aware Sampling: A decoding technique that makes the generated speech more stable. Earlier models such as VALL-E could get stuck repeating the same sound, or even fall into an endless loop, while generating audio. Repetition aware sampling refines the usual sampling step by tracking how often a token has appeared in the recent decoding history and, when the repetition is excessive, drawing the next token in a different way. This keeps the output robust and natural.
- Grouped Code Modeling: This technique organizes the audio codec codes that represent speech into groups, so the model works on a much shorter sequence. That speeds up inference and makes long sequences easier to model.
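The two techniques can be illustrated with a small sketch. It assumes the behaviour described in the VALL-E 2 paper: repetition aware sampling draws a token with nucleus sampling and, if that token already fills too much of the recent decoding history, redraws it from the full distribution; grouped code modeling partitions the codec code sequence into fixed-size groups so the model handles fewer steps. The function names, parameters and default values below are hypothetical choices for illustration, not Microsoft's implementation.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token id from the smallest set of most likely tokens
    whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, max_ratio=0.1):
    """Nucleus sampling with a fallback when the drawn token repeats too often.

    window and max_ratio are illustrative values (assumptions), not the
    settings used by VALL-E 2.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        # Share of the recent history already occupied by this token.
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: redraw from the full, untruncated distribution.
            token = rng.choices(list(range(len(probs))), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Split a code sequence into consecutive groups of group_size.
    The last group is shorter if the length is not a multiple of group_size."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]
```

For example, `group_codes(list(range(8)), 2)` turns eight codes into four groups, so a model that predicts one group per step needs half as many steps.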
Understanding the Risk:
Advanced text-to-speech technologies such as Microsoft's VALL-E 2 and OpenAI's Voice Engine have great potential to benefit the education, health and entertainment industries. At the same time, they carry a serious risk of misuse. For instance, many businesses use voice identification to authorize transactions, and such checks could be deceived by an advanced AI system like VALL-E 2.
A major risk is the impersonation of a particular person's voice to commit fraud and misrepresentation. In a recent case in the United States, an AI-cloned voice of President Joe Biden was used in spam calls.
Conclusion:
VALL-E 2 has the potential to transform industries such as entertainment, health and education, but its deployment may open loopholes that increase fraud and misrepresentation in society. Developers and governments should therefore define checks, safeguards and tracking procedures to regulate how text-to-speech technology is deployed.
References:
- https://www.businesstoday.in/technology/news/story/microsoft-develops-eerily-realistic-ai-voice-generator-but-keeps-it-under-wraps-437439-2024-07-17
- https://www.theverge.com/2024/3/29/24115701/openai-voice-generation-ai-model