Abstract
This paper proposes a one-shot voice conversion (VC) solution. In many one-shot voice conversion systems (e.g., autoencoder-based VC methods), performance has improved dramatically thanks to instance normalization and adaptive instance normalization. However, the fluency of one-shot voice conversion is still lacking, and the speaker similarity is not good enough. This paper introduces a weight adaptive instance normalization strategy to improve the naturalness and similarity of one-shot voice conversion. Experimental results show that on the VCTK dataset, our proposed model, weight adaptive instance normalization voice conversion (WINVC), reaches a MOS of 3.97 on a five-point scale and an SMOS of 3.31 on a four-point scale. Moreover, WINVC achieves a MOS of 3.44 and an SMOS of 3.11 for one-shot voice conversion on a small dataset of 80 speakers with 5 utterances per speaker.
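As background for the normalization strategies the abstract refers to, here is a minimal NumPy sketch of standard adaptive instance normalization (AdaIN): content features are normalized per channel and then rescaled with the target speaker's channel statistics. This illustrates the baseline AdaIN operation only, not the paper's weight adaptive (WIN) variant; the function name and tensor shapes are illustrative assumptions.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization over (channels, time) feature maps.

    Normalizes each channel of `content` to zero mean / unit variance,
    then rescales it with the per-channel mean and std of `style`.
    """
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True)
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True)
    normalized = (content - c_mean) / (c_std + eps)
    return s_std * normalized + s_mean

# Toy example: transfer the "style" channel statistics onto content features.
rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(4, 128))  # e.g. source-speaker content features
style = rng.normal(2.0, 3.0, size=(4, 128))    # e.g. target-speaker reference features
out = adain(content, style)
```

After the call, each channel of `out` matches the target's mean and standard deviation while keeping the content's temporal structure.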
Converted speech samples
NOTE: Recommended browsers are Apple Safari, Google Chrome, or Mozilla Firefox.
Experimental conditions
- We evaluate our model on the VCTK [1] dataset.
- We use the remaining utterances of the 10 one-shot speakers to carry out the one-shot voice conversion experiments.
- All the converted utterances are generated by the ParallelWaveGAN vocoder [2].
Compared models
- WINVC: Proposed model trained with (80 speakers × all utterances), and finetuned with another (10 speakers × 1 utterance).
- WINVC5: Proposed model trained with (80 speakers × 5 utterances), and finetuned with another (10 speakers × 1 utterance).
- AdaINVC: Baseline model (AdaINVC) trained with (80 speakers × all utterances + 10 speakers × 1 utterance).
- AdaINVC_W: Baseline model (AdaINVC) with the AdaIN strategy replaced by the WIN strategy, trained with (80 speakers × all utterances + 10 speakers × 1 utterance).
Results
Female (P312) → Female (P318)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
Male (P316) → Male (P345)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
Female (P312) → Male (P334)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
Male (P334) → Female (P318)
| | Source | Target | AdaINVC | AdaINVC_W | WINVC | WINVC5 |
|---|---|---|---|---|---|---|
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
References
[1] Veaux, Christophe, Junichi Yamagishi, and Kirsten MacDonald. "Superseded — CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit." 2016. [Project]
[2] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram." ICASSP 2020 — 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020. [Paper] [Project]