Abstract
This paper proposes a one-shot voice conversion (VC) solution. In many one-shot voice conversion systems (e.g., autoencoder-based VC methods), performance has improved dramatically thanks to instance normalization and adaptive instance normalization. However, the fluency of one-shot voice conversion is still lacking, and the speaker similarity is not good enough. This paper introduces a weight adaptive instance normalization strategy to improve the naturalness and similarity of one-shot voice conversion. Experimental results show that on the VCTK dataset, our proposed model, weight adaptive instance normalization voice conversion (WINVC), reaches a MOS of 3.97 on a five-point scale and an SMOS of 3.31 on a four-point scale. Moreover, WINVC achieves a MOS of 3.44 and an SMOS of 3.11 for one-shot voice conversion on a small dataset of 80 speakers with 5 utterances per speaker.
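As background for the normalization strategies the abstract refers to, here is a minimal NumPy sketch of standard adaptive instance normalization (AdaIN): content features are normalized per channel and then rescaled with the target speaker's channel statistics. This illustrates the baseline AdaIN operation only, not the paper's weight adaptive (WIN) variant; the function name and tensor shapes are illustrative assumptions.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization over (channels, time) feature maps.

    Normalizes each channel of `content` to zero mean / unit variance,
    then rescales it with the per-channel mean and std of `style`.
    """
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True)
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True)
    normalized = (content - c_mean) / (c_std + eps)
    return s_std * normalized + s_mean

# Toy example: transfer the "style" channel statistics onto content features.
rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(4, 128))  # e.g. source-speaker content features
style = rng.normal(2.0, 3.0, size=(4, 128))    # e.g. target-speaker reference features
out = adain(content, style)
```

After the call, each channel of `out` matches the target's mean and standard deviation while keeping the content's temporal structure.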
Converted speech samples
NOTE: Recommended browsers are Apple Safari, Google Chrome, or Mozilla Firefox.
Experimental conditions
- We evaluate our model on the VCTK [1] dataset.
- We use the remaining utterances of the 10 one-shot speakers to carry out the one-shot voice conversion experiments.
- All the converted utterances are generated by the ParallelWaveGAN vocoder [2].
Compared models
- WINVC: Proposed model trained with (80 speakers × all utterances), and finetuned with another (10 speakers × 1 utterance).
- WINVC5: Proposed model trained with (80 speakers × 5 utterances), and finetuned with another (10 speakers × 1 utterance).
- AdaINVC: Baseline model (AdaINVC) trained with (80 speakers × all utterances + 10 speakers × 1 utterance).
- AdaINVC_W: Baseline model (AdaINVC) with the AdaIN strategy replaced by the WIN strategy, trained with (80 speakers × all utterances + 10 speakers × 1 utterance).
Results
Female (P312) → Female (P318)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
Male (P316) → Male (P345)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
Female (P312) → Male (P334)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
Male (P334) → Female (P318)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
References
[1] Veaux, Christophe, Junichi Yamagishi, and Kirsten MacDonald. "Superseded — CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit." 2016. [Project]
[2] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram." ICASSP 2020 — 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020. [Paper] [Project]