Build Personalized Speech Synthesis With High-quality Training Data

From: Datatang Date: 2022-06-24

Speech synthesis, commonly known as Text To Speech (TTS), is a technology that can convert any input text into corresponding speech, and is one of the indispensable modules in human-computer voice interaction.

Traditional Speech Synthesis

Traditional speech synthesis systems usually include two modules: front-end and back-end. The front-end module mainly analyzes the input text and extracts the linguistic information required by the back-end module. For Chinese synthesis systems, the front-end module generally includes sub-modules such as Text Normalization (TN), polyphonic word disambiguation, and prosody prediction. The back-end module generates a speech waveform through a certain method according to the front-end analysis results.
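The front-end steps described above can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not a production front end: the digit table, the single polyphone rule for "行", the pinyin labels, and the punctuation-based phrase splitter are all hypothetical stand-ins for the statistical models a real system would train on annotated data.

```python
import re

# Toy digit table: reads numbers digit by digit (an assumption; real TN also
# handles dates, currency, units, and context-dependent readings).
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))

# Hypothetical polyphone rules: character -> (context rules, default reading).
# "行" is read "hang2" inside "银行" and "xing2" otherwise in this toy table.
POLYPHONES = {"行": ([("银行", "hang2")], "xing2")}


def normalize(text: str) -> str:
    """Text normalization: spell out each Arabic digit as a Chinese character."""
    return re.sub(r"[0-9]", lambda m: DIGITS[m.group()], text)


def disambiguate(text: str) -> dict:
    """Choose a reading for each known polyphonic character from its context."""
    readings = {}
    for i, ch in enumerate(text):
        if ch not in POLYPHONES:
            continue
        rules, default = POLYPHONES[ch]
        pinyin = default
        for word, reading in rules:
            # Align the context word so that it covers position i.
            start = i - word.index(ch)
            if start >= 0 and text[start:start + len(word)] == word:
                pinyin = reading
                break
        readings[i] = (ch, pinyin)
    return readings


def prosody_phrases(text: str) -> list:
    """Crude prosody prediction: treat punctuation as phrase boundaries."""
    return [p for p in re.split(r"[，。！？]", text) if p]


def front_end(text: str) -> dict:
    """Run the three sub-modules and collect features for the back end."""
    normalized = normalize(text)
    return {
        "text": normalized,
        "polyphones": disambiguate(normalized),
        "phrases": prosody_phrases(normalized),
    }


result = front_end("银行在3楼，步行5分钟。")
print(result["text"])        # 银行在三楼，步行五分钟。
print(result["polyphones"])  # {1: ('行', 'hang2'), 7: ('行', 'xing2')}
print(result["phrases"])     # ['银行在三楼', '步行五分钟']
```

Each stub marks the place where annotated data would come in: a larger normalization rule set or model, a polyphone classifier, and a prosody boundary predictor.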

The front-end module depends on a large amount of annotated data, such as TN annotation, polyphonic word annotation, and prosody annotation, to produce accurate results.

The back-end module requires high-quality voice libraries recorded by professional speakers. To serve a variety of scenarios, many voice libraries covering diverse timbres and languages are required.

Personalized Speech Synthesis

Personalized speech synthesis usually takes a small amount of speech from the target speaker, possibly of low quality, and uses methods such as transfer learning to train a model capable of synthesizing that speaker’s voice. The usual approach is to train a general speech synthesis model on a large amount of speech from many different speakers, and then fine-tune it with a small amount of speech from the target speaker.
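The pre-train then fine-tune pattern can be illustrated with a deliberately tiny analogy. The sketch below is not a speech model: it assumes a linear model y = w*x + b in which the slope w stands in for what is shared across speakers and the offset b for what is speaker-specific. The data values, learning rate, and the choice to freeze w during fine-tuning are all assumptions made for illustration; real systems fine-tune neural acoustic models and may update all parameters.

```python
def sgd(params, data, lr, epochs, trainable=("w", "b")):
    """Fit y ≈ w*x + b by full-batch gradient descent on squared error.

    Only the parameters named in `trainable` are updated.
    """
    w, b = params["w"], params["b"]
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        if "w" in trainable:
            w -= lr * grad_w
        if "b" in trainable:
            b -= lr * grad_b
    return {"w": w, "b": b}


# Step 1: "general model" trained on pooled data from several speakers.
# Each hypothetical speaker shares the slope 2.0 but has a different offset.
xs = (0.0, 0.5, 1.0, 1.5, 2.0)
multi_speaker = [(x, 2.0 * x + offset) for offset in (-1.0, 0.0, 1.0) for x in xs]
general = sgd({"w": 0.0, "b": 0.0}, multi_speaker, lr=0.1, epochs=500)

# Step 2: fine-tune on only three samples from a new target speaker
# (offset 0.7), keeping the shared slope frozen.
target = [(0.5, 1.7), (1.0, 2.7), (1.5, 3.7)]
tuned = sgd(general, target, lr=0.1, epochs=100, trainable=("b",))

print(round(general["w"], 3), round(general["b"], 3))
print(round(tuned["w"], 3), round(tuned["b"], 3))
```

Because the general model has already learned the shared structure, three target samples are enough to recover the new speaker's offset, which mirrors why a multi-speaker corpus reduces the amount of target-speaker data needed.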

Personalized speech synthesis is maturing in practice. For example, Baidu Maps lets users record 9 sentences to generate a complete personal voice package that can be used in all scenarios of the map.

Personalized speech synthesis relies on a multi-speaker average model library as key data support. Datatang’s speech synthesis data for general scenarios is divided into three categories:

Monophonic Human Synthesis Library

A sound library recorded in a professional recording studio by a single speaker.

•American English Speech Synthesis Corpus-Female

Female audio data of American English. It is recorded by a native American English speaker with an authentic accent and a sweet voice. The phoneme coverage is balanced. Professional phoneticians participate in the annotation.

•American English Speech Synthesis Corpus-Male

Male audio data of American English. It is recorded by native American English speakers with an authentic accent. The phoneme coverage is balanced. Professional phoneticians participate in the annotation.

•Japanese Synthesis Corpus-Female

Female audio data of Japanese. It is recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced. Professional phoneticians participate in the annotation.

Multi-speaker Average Model Library

A sound library recorded in a professional recording studio by multiple speakers.

•Chinese Mandarin Average Tone Speech Synthesis Corpus, General

It is recorded by native Chinese speakers. It covers news, dialogue, audio books, poetry, advertising, news broadcasting, and entertainment, and the phonemes and tones are balanced. Professional phoneticians participate in the annotation.

•Chinese Average Tone Speech Synthesis Corpus-Three Styles

It is recorded by native Chinese speakers. The corpus includes three styles: customer service, news, and story. The syllables, phonemes, and tones are balanced. Professional phoneticians participate in the annotation.

Frontend Text

•200,475 Sentences — Chinese Text Normalization Data

The dataset covers novels, articles, news, and other categories. The special symbols and Arabic numerals in the sentences are annotated with their Chinese-character readings, for a total of 199,652 sentences and 454,638 annotations.
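To make the idea of a text normalization annotation concrete, here is a minimal sketch of how span-level annotations could be applied to a raw sentence. The annotation format used here, a list of (start, end, spoken form) tuples with character offsets, is a hypothetical one chosen for illustration and is not the dataset's actual schema.

```python
def apply_annotations(sentence: str, annotations: list) -> str:
    """Replace each annotated span with its spoken Chinese-character form.

    Spans are applied from right to left so that earlier offsets stay valid
    after a replacement changes the string length.
    """
    out = sentence
    for start, end, spoken in sorted(annotations, reverse=True):
        out = out[:start] + spoken + out[end:]
    return out


# Hypothetical example: one sentence with two annotated spans.
raw = "2022年增长了15%"
spans = [
    (0, 4, "二零二二"),     # the year is read digit by digit
    (8, 11, "百分之十五"),  # "15%" is read as a percentage
]
print(apply_annotations(raw, spans))  # 二零二二年增长了百分之十五
```

The same numeral can need different readings depending on context (digit by digit for a year, as a quantity for a percentage), which is why sentence-level annotation is useful for training a normalizer.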

•319,977 Sentences — Mandarin Polyphone Corpus Data

The dataset covers news, spoken language, and other categories. It includes 603 pronunciations of 266 polyphonic words, for a total of 319,977 sentences.

•200,955 Sentences — Mandarin Prosodic Corpus Data

Texts from news and daily chats, annotated with four levels of prosody.

As the world’s leading AI data service provider, Datatang has rich voice sample resources, strong technical capabilities, and extensive data processing experience. It supports personalized collection services for a designated language, timbre, age, and gender. Datatang also supports data customization services such as audio segmentation, phoneme boundary segmentation (segmentation accuracy of 0.01 seconds), phonetic tagging, prosody tagging, part-of-speech tagging, pitch proofreading, rhythm tagging, and musical score production to meet customers’ diverse requirements.

If you need data services, please feel free to contact us: info@datatang.com.
