Among the diverse languages seeking integration into speech recognition technology, Thai holds a significant place. Thai speech recognition has been a focal point of research and development, driven by the growing demand for localized and personalized user experiences.
Over the past few years, Thai speech recognition technology has witnessed remarkable advancements, largely due to the availability of extensive linguistic data. The foundation of any speech recognition system lies in its dataset, and Thai is no exception. The abundance of voice data from various sources, including social media, podcasts, and recorded conversations, has played a pivotal role in training machine learning algorithms. As a result, Thai speech recognition systems have achieved unprecedented accuracy and fluency.
However, this progress is not devoid of challenges. The linguistic complexity of Thai poses hurdles in developing accurate recognition models. The language is tonal and features a unique script, demanding a deep understanding of its phonetics and syntax. Acquiring and annotating precise data for Thai speech recognition remains an ongoing challenge. Moreover, ensuring the inclusivity of regional accents and dialects further complicates the data collection process.
Datatang Thai Speech Datasets
The Thai speech data (reading) was collected from 498 native Thai speakers and recorded in a quiet environment. The recordings are rich in content, covering categories such as economics, entertainment, news, figures, and oral speech, with around 400 sentences per speaker. The valid data volume is 203 hours, and all texts are manually transcribed with high accuracy.
The 1,077 Hours - Thai Conversational Speech Data involved 1,986 native speakers and was developed with a balanced gender ratio. Speakers chose a few familiar topics from a given list and carried out conversations, ensuring the dialogues' fluency and naturalness. The recordings were made on a variety of mobile phones in quiet indoor environments, in 8 kHz, 8-bit audio format. All the speech audio was manually transcribed with the text content, the start and end time of each effective sentence, and speaker identification.
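To make the annotation format concrete, the sketch below parses one utterance record carrying the three annotated fields (transcribed text, start/end times, speaker identification). The tab-separated layout and the field names are illustrative assumptions, not the dataset's actual delivery format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # speaker identification
    start: float   # start time of the effective sentence, in seconds
    end: float     # end time of the effective sentence, in seconds
    text: str      # manually transcribed content

def parse_segment(line: str) -> Segment:
    """Parse one hypothetical annotation line: speaker<TAB>start<TAB>end<TAB>text."""
    speaker, start, end, text = line.rstrip("\n").split("\t", 3)
    return Segment(speaker, float(start), float(end), text)

# Hypothetical annotation line for one utterance
seg = parse_segment("SPK001\t12.35\t15.80\tสวัสดีครับ")
print(seg.speaker, round(seg.end - seg.start, 2))
```

A real loader would differ in field order and file layout, but the key point survives: each effective sentence carries its own time span and speaker label, which is what makes the data usable for diarization-aware training.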
With the rise of deep learning, emotion recognition methods based on deep neural networks have been widely adopted. Speech Emotion Recognition (SER) is a computer simulation of how humans perceive and understand emotion: a computer analyzes the speech signal, extracts emotional feature values, and uses these parameters to model the mapping between feature values and emotions, finally classifying the emotion.
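The pipeline described above — extract feature values, map them to emotions, classify — can be sketched minimally as follows. The features here (short-time energy and zero-crossing rate) and the nearest-centroid classifier are illustrative stand-ins; a real SER system would use richer features such as MFCCs and a trained neural network.

```python
import math

def extract_features(samples):
    """Map a waveform (list of floats) to a small emotional feature vector."""
    n = len(samples)
    energy = sum(s * s for s in samples) / n          # loudness proxy
    zcr = sum(                                        # roughness/pitch proxy
        1 for a, b in zip(samples, samples[1:]) if a * b < 0
    ) / (n - 1)
    return (energy, zcr)

def nearest_centroid(features, centroids):
    """Classify by the closest emotion centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

# Hypothetical centroids, standing in for a model learned from labeled data
centroids = {
    "neutral": (0.05, 0.10),
    "angry":   (0.60, 0.45),
}

loud_harsh = [0.9 * (-1) ** i for i in range(100)]  # high energy, high ZCR
print(nearest_centroid(extract_features(loud_harsh), centroids))
```

The structure mirrors the three stages in the paragraph: `extract_features` produces the feature values, `centroids` encodes the learned mapping from features to emotions, and `nearest_centroid` performs the final classification.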
In the rapidly evolving landscape of artificial intelligence, one of the most promising frontiers is multimodal machine learning, where algorithms learn from and make decisions based on a combination of different data types such as text, images, audio, and more. At the heart of this innovation lies a fundamental truth: the power of multimodal machine learning is intricately woven with the quality, diversity, and abundance of data.