Audio samples from "Automatic Stress Annotation and Prediction For Expressive Mandarin Speech Synthesis"
Abstract:
The current text-to-speech technique has developed to a close-to-human state, and more research interest has been paid
to highly expressive and more controllable speech synthesis. Stress detection and modeling in the Mandarin TTS system
have been verified to be an efficient and direct way to enhance the rhythm and prosody performance in previous studies.
But labeling stress in training data manually needs linguistic knowledge and is also time-consuming. In this paper, an
automatic syllable-level stress annotation mechanism is proposed. Then based on the automatically annotated stress
labels, a transformer-based ALBERT front-end module is built for stress label prediction from the text. In the
experiment part, a DurIAN based expressive text-to-speech system is built with the proposed automatic stress annotation
and prediction module. Experiments show the proposed method can consistently predict stress from linguistic context
input, and speech synthesis systems with proposed stress annotation and prediction components outperform baseline
systems.
Different positions and degrees of stress in a sentence often convey different semantic meanings.
The samples are generated by our controllable stress-modeled TTS system, with stress specified
through SSML input.
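To make the idea of SSML stress input concrete, here is a minimal sketch of how stress marks could be attached to a sentence. The tag schema actually used by the system is not given on this page; the sketch assumes the standard W3C SSML `<emphasis>` element with its `level` attribute, and the helper name `build_stress_ssml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_stress_ssml(tokens):
    """Build an SSML string from (text, level) pairs.

    level is None for unstressed text, or an SSML emphasis level such as
    "strong" or "moderate". ASSUMPTION: the W3C <emphasis> element is used
    here for illustration; the demo system's real markup may differ.
    """
    speak = ET.Element("speak")
    last = None  # most recently appended <emphasis> element, if any
    for text, level in tokens:
        if level is None:
            # Plain text lives in speak.text (before any child element)
            # or in the tail of the preceding child element.
            if last is None:
                speak.text = (speak.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(speak, "emphasis", {"level": level})
            last.text = text
    return ET.tostring(speak, encoding="unicode")


# Same sentence, stress on a different word each time.
stress_on_time = build_stress_ssml([("我", None), ("今天", "strong"), ("去学校", None)])
stress_on_place = build_stress_ssml([("我今天去", None), ("学校", "strong")])
print(stress_on_time)
print(stress_on_place)
```

The two calls mark the same sentence with stress on a different word, which is the kind of contrast the samples are meant to demonstrate.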
Stress, as a high-level feature, is strongly correlated with f0 and energy
and can be viewed as a refined expression of prosody.
Explicitly predicting f0 and energy alongside stress may therefore
introduce conflicts or other negative effects, so we use an ablation study
to explore which configuration better preserves stress information.
We randomly picked 50 long single sentences and 10 paragraphs of the same genre.
The chapter-level samples are meant to explore human fatigue in real usage scenarios:
when people listen to long TTS-generated passages in which the pitch does not rise and fall,
they often lose interest in what follows.
Three systems are compared here: Baseline, Stress, and Stress + f0&energy modeling.
The MOS results for this part are reported in Section 3.3 (Ablation Studies) of our paper.
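For readers unfamiliar with MOS tests, the sketch below shows one common way to summarize listener ratings for a system: the mean opinion score with a normal-approximation 95% confidence interval. This is an illustration only. The paper's exact evaluation procedure is not described on this page, the function name `mos_with_ci` is hypothetical, and the ratings in the usage lines are placeholders rather than real results.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval).

    ratings: listener scores, typically on a 1-5 scale.
    z: normal quantile; 1.96 corresponds to a 95% interval (ASSUMPTION:
    a normal approximation is acceptable for the number of ratings).
    """
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one rating is required")
    mean = statistics.fmean(ratings)
    if n < 2:
        return mean, 0.0  # no spread can be estimated from one rating
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


# Placeholder ratings for illustration only, not results from the paper.
placeholder_ratings = [4, 5, 3, 4, 4]
mos, ci = mos_with_ci(placeholder_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Running the same function over the ratings of each system gives scores that can be compared side by side.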
Sentence-level
Systems: Baseline, Stress, Stress + f0&energy modeling
Samples: Sentence1, Sentence2, Sentence3, Sentence4, Sentence5
Chapter-level
Systems: Baseline, Stress, Stress + f0&energy modeling
Samples: Chapter1, Chapter2, Chapter3, Chapter4