Audio samples from "Automatic Stress Annotation and Prediction For Expressive Mandarin Speech Synthesis"

Abstract: The current text-to-speech technique has developed to a close-to-human state, and more research interest has been paid to highly expressive and more controllable speech synthesis. Stress detection and modeling in the Mandarin TTS system have been verified to be an efficient and direct way to enhance the rhythm and prosody performance in previous studies. But labeling stress in training data manually needs linguistic knowledge and is also time-consuming. In this paper, an automatic syllable-level stress annotation mechanism is proposed. Then based on the automatically annotated stress labels, a transformer-based ALBERT front-end module is built for stress label prediction from the text. In the experiment part, a DurIAN based expressive text-to-speech system is built with the proposed automatic stress annotation and prediction module. Experiments show the proposed method can consistently predict stress from linguistic context input, and speech synthesis systems with proposed stress annotation and prediction components outperform baseline systems.

Stress position

Different positions and degrees of stress in a sentence often convey different semantic meanings. The samples below are generated by our controllable stress-modeled TTS system, with stress specified through SSML input.
张小花中午去「射箭」了。[emphasis on 'shooting'] 张小花「中午」去射箭了。 [emphasis on 'at noon']
Sample 1
「这」副护具掉下楼了。[emphasis on 'which'] 这副「护具」掉下楼了。[emphasis on 'what']
Sample 2
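As an illustration of what SSML stress input could look like, the sketch below converts the 「…」 notation used in the sample texts into the standard W3C SSML `<emphasis>` element. Both the 「…」-to-SSML mapping and the use of `<emphasis level="strong">` are assumptions made for this example; the page does not state which SSML tag or attributes the authors' system actually accepts, and `to_ssml` is a hypothetical helper.

```python
import re
from xml.sax.saxutils import escape

# Assumed convention: 「...」 marks the stressed span, as in the sample texts.
STRESS_SPAN = re.compile(r"「(.+?)」")


def to_ssml(text: str, level: str = "strong") -> str:
    """Wrap every 「...」 span in an SSML <emphasis> element.

    `level` is passed through to the `level` attribute; the tag choice is an
    assumption, not the documented input format of the authors' system.
    """
    parts = []
    last = 0
    for match in STRESS_SPAN.finditer(text):
        # Copy the unstressed text before the span, escaping XML metacharacters.
        parts.append(escape(text[last:match.start()]))
        parts.append(
            f'<emphasis level="{level}">{escape(match.group(1))}</emphasis>'
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("张小花「中午」去射箭了。"))
    print(to_ssml("张小花中午去「射箭」了。"))
```

Moving the brackets to a different word changes only which span is wrapped, which mirrors how the paired samples differ only in stress position.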

Q&A context

In this part, there are three systems:
  • Baseline (baseline acoustic model without stress input)
  • Front-end_model (stress is predicted from text, turned into a condition embedding, and concatenated with the other embeddings)
  • Manual (stress is labeled by a human annotator, turned into a condition embedding, and concatenated with the other embeddings)
We use question-and-answer contexts from our Q&A datasets to verify whether the stress obtained by Baseline, Front-end_model, and Manual leads to different acoustic-model performance. The MOS results for this part are reported in Section 3.3 (Stress modeling and controllability) of our paper.
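The "condition embedding" used by Front-end_model and Manual can be pictured as a lookup followed by a concatenation. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the table sizes, the binary stress label set, and the names `make_table` and `encode` are all assumptions, and a real acoustic model would use learned embeddings in a tensor library.

```python
import random


def make_table(num_rows: int, dim: int, seed: int) -> list:
    """Build a toy embedding table with small random values (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


PHONEME_DIM = 8   # assumed size of the existing linguistic embedding
STRESS_DIM = 4    # assumed size of the stress condition embedding

phoneme_table = make_table(num_rows=50, dim=PHONEME_DIM, seed=0)
# Assumption: two stress classes (0 = unstressed, 1 = stressed).
stress_table = make_table(num_rows=2, dim=STRESS_DIM, seed=1)


def encode(phoneme_ids: list, stress_labels: list) -> list:
    """Look up both embeddings per position and concatenate them."""
    if len(phoneme_ids) != len(stress_labels):
        raise ValueError("phoneme_ids and stress_labels must have the same length")
    return [
        phoneme_table[p] + stress_table[s]  # list concatenation, not addition
        for p, s in zip(phoneme_ids, stress_labels)
    ]


if __name__ == "__main__":
    frames = encode([3, 7, 7], [0, 1, 0])
    print(len(frames), len(frames[0]))
```

Under this picture, Front-end_model and Manual differ only in where `stress_labels` comes from (a text-based predictor versus a human annotator), while Baseline omits the stress part entirely.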
Question_1: 张三哪天中午吃了粽子?(Which day did Zhang San eat dumplings at noon?)
Baseline | Front-end_model | Manual
Answer_1: 张三「今天」中午吃了粽子。(Today, Zhang San had dumplings for lunch.)
Question_2: 我闻到了了糊味,是什么东西烧糊了?(I smell something burning. What got burnt?)
Answer_2: 是他把「面条」烧糊了。(It's the noodle he scorched.)
Question_3: 是哪个花瓶碎了?(Which vase is broken?)
Answer_3: 是「这个」花瓶碎了。(It was this vase that broke.)

The Ablation Studies

Stress, as a high-level feature, is strongly correlated with f0 and energy and can be seen as a refined expression of prosody. One concern is that explicitly predicting f0 and energy may conflict with, or otherwise weaken, the stress information. The ablation study therefore explores which configuration preserves more stress information. We randomly pick 50 single long sentences and 10 paragraphs of the same genre. The chapter-level test explores human fatigue in real usage scenarios: when people listen to long passages generated by a TTS system and the audio has no rise and fall in pitch, they often lose interest in what follows. There are three systems here: Baseline, Stress, and Stress + f0&energy modeling. The MOS results for this part are reported in Section 3.3 (Ablation Studies) of our paper.
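For readers unfamiliar with MOS, the sketch below shows one common way to summarize listener ratings: the mean score with a normal-approximation 95% confidence interval. The ratings in the usage example are made-up placeholders, not results from the paper, and `mos_with_ci` is a hypothetical helper; the paper's own evaluation protocol may differ.

```python
import math
import statistics


def mos_with_ci(ratings: list, z: float = 1.96) -> tuple:
    """Return (mean opinion score, half-width of the confidence interval).

    Uses the sample standard deviation and a normal approximation
    (z = 1.96 for roughly 95% coverage).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Placeholder 1-5 ratings for illustration only.
    example_ratings = [4, 5, 3, 4, 4]
    mos, ci = mos_with_ci(example_ratings)
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```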
Sentence-level | Baseline | Stress | Stress + f0&energy modeling
Sentence1
Sentence2
Sentence3
Sentence4
Sentence5
Chapter-level | Baseline | Stress | Stress + f0&energy modeling
Chapter1
Chapter2
Chapter3
Chapter4