Reproducing user-specified text input in a user-selected celebrity voice style

  • Tech Stack: Python, Keras, OpenCV, Scikit-learn, Librosa, CNN, VGG-16, Dall-E
  • GitHub URL: Project Link
  • Medium Article URL: Article Link
  • Step-1: Audio to Spectrogram
  • For our project’s purposes, instead of dealing with audio as a two-dimensional signal of amplitude over time, we first convert each audio sample into a spectrogram: a three-dimensional representation of the same signal that shows frequency over time, along with each frequency’s amplitude at any given point in time.
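As a concrete illustration, below is a minimal sketch of this conversion using Librosa and OpenCV from our stack. The STFT parameters, the sample rate, and the resize to the 128 × 171 input size used in later steps are illustrative assumptions, not the project’s exact settings.

```python
import numpy as np
import librosa
import cv2

def audio_to_spectrogram(path, sr=22050, out_shape=(128, 171)):
    """Load an audio clip and convert it to a fixed-size, dB-scaled spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    # Short-time Fourier transform: rows = frequency bins, columns = time frames,
    # cell values = amplitude of that frequency at that moment
    stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    # Convert amplitude to decibels so quieter frequencies stay visible
    spec_db = librosa.amplitude_to_db(stft, ref=np.max)
    # Resize to the 128 x 171 image the networks expect (cv2 takes (width, height))
    spec_db = cv2.resize(spec_db, (out_shape[1], out_shape[0]))
    # Normalize to [0, 1] before feeding the network
    return (spec_db - spec_db.min()) / (spec_db.max() - spec_db.min() + 1e-8)
```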

  • Step-2: Training a Neural Network to Identify the Speaker
  • The spectrogram images obtained in Step-1 are passed through a stack of convolutions before being fed into a neural network that establishes each speaker’s signature. This becomes the speaker-recognition module, which learns the voice of every celebrity in the training data. Google Assistant’s voice is included as one of these speakers.
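Below is a minimal Keras sketch of such a network. The project uses a VGG-16-style architecture; the exact block and filter counts here are illustrative assumptions. The important piece is the 4096-dimensional dense layer, whose activations are reused as the speaker encoding in later steps.

```python
from tensorflow.keras import layers, models

def build_speaker_recognizer(num_speakers, input_shape=(128, 171, 1)):
    """VGG-style encoder plus classifier; the 4096-d dense layer is the speaker code."""
    inp = layers.Input(shape=input_shape)
    x = inp
    # Stacked convolution blocks, VGG-style: conv -> conv -> pool
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    # 4096-dimensional bottleneck: this activation is the encoding reused later
    encoding = layers.Dense(4096, activation="relu", name="encoding")(x)
    out = layers.Dense(num_speakers, activation="softmax")(encoding)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Once trained as a classifier, the encoder can be cut out with `models.Model(model.input, model.get_layer("encoding").output)` and reused in Steps 4 and 5.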

  • Step-3: Building a Decoder for the Convolutions
  • From the previous stage, we have a 4096-dimensional convolution output as the input to the decoder, and we want to recreate spectrogram images of size 128 × 171 (21,888 pixels) as the output; in other words, the decoder learns to reconstruct each training spectrogram from its 4096-dimensional encoding.
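A minimal sketch of one possible decoder in Keras, assuming a simple fully connected architecture (the hidden-layer size is an illustrative choice): it maps the 4096-dimensional encoding to 21,888 output units and reshapes them into a 128 × 171 spectrogram. It can be trained with the encoder’s outputs as inputs and the corresponding normalized spectrograms as targets.

```python
from tensorflow.keras import layers, models

def build_decoder(code_dim=4096, out_shape=(128, 171)):
    """Map a 4096-d encoding back to a 128 x 171 spectrogram image."""
    inp = layers.Input(shape=(code_dim,))
    x = layers.Dense(8192, activation="relu")(inp)
    # One output unit per pixel: 128 * 171 = 21,888
    x = layers.Dense(out_shape[0] * out_shape[1], activation="sigmoid")(x)
    out = layers.Reshape(out_shape)(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```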

  • Step-4: Speaker Embedding (Capturing Voice Signatures)
  • In this stage, we use the convolutions from the first network (Step-2) to obtain a 4096-dimensional vector representation of each speaker relative to Google Assistant. Our expectation is that when Google Assistant speaks the same phrases as the celebrity, the only difference between the two recordings will be the voice (or style) of the speaker. We therefore capture this difference as a 4096-dimensional representation for each speaker.
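Assuming paired recordings of the same phrases from the celebrity and from Google Assistant, the relative embedding can be computed as the average difference of their encodings. A sketch (`encoder` is the truncated network from Step-2; the variable names are hypothetical):

```python
import numpy as np

def relative_embedding(encoder, celeb_specs, assistant_specs):
    """Average difference between a celebrity's encodings and Google Assistant's
    encodings of the same phrases: the speaker's 4096-d style vector."""
    celeb_codes = encoder.predict(np.array(celeb_specs)[..., np.newaxis])
    assistant_codes = encoder.predict(np.array(assistant_specs)[..., np.newaxis])
    # Pairwise difference over the shared phrases, averaged into one vector
    return (celeb_codes - assistant_codes).mean(axis=0)
```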

  • Step-5: Reconstruction
    • From the user’s text input, we first generate a Google Assistant voice sample

    • The voice sample is then converted to a spectrogram

    • This spectrogram is passed through the convolutions of the speaker recognition network (from Step-2) to get a 4096-dimensional vector representation of Google Assistant’s speech

    • Based on the user’s choice of voice style, we add that speaker’s relative embedding (from Step-4) and feed the modified encoded vector to the decoder from Step-3

    • The decoder then recreates a spectrogram from the 4096-dimensional input

    • This estimated spectrogram is then converted back to audio (see the end-to-end sketch after this list)
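Below is a sketch tying the steps together at inference time. It assumes the Google Assistant rendering of the user’s text already exists as a WAV file (the text-to-speech call itself is omitted). Griffin-Lim phase estimation and the fixed −80 to 0 dB range used to undo normalization are assumptions for illustration.

```python
import numpy as np
import librosa
import cv2

def restyle(assistant_wav, encoder, decoder, style_vec, sr=22050):
    """End-to-end Step-5: Assistant clip -> encoding -> + style -> spectrogram -> audio."""
    # Step-1: spectrogram of the Google Assistant rendering of the user's text
    y, _ = librosa.load(assistant_wav, sr=sr)
    stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    spec_db = librosa.amplitude_to_db(stft, ref=np.max)
    spec = cv2.resize(spec_db, (171, 128))                    # 128 x 171 network input
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)

    # Step-2 convolutions: 4096-d encoding of the Assistant's speech
    code = encoder.predict(spec[np.newaxis, ..., np.newaxis])[0]

    # Step-4: shift the encoding by the chosen speaker's relative embedding,
    # then decode it back to a 128 x 171 spectrogram with the Step-3 decoder
    styled = decoder.predict((code + style_vec)[np.newaxis, :])[0]

    # Back to audio: undo normalization (assumed -80..0 dB range), restore the
    # original STFT size, then estimate phase with Griffin-Lim
    mag = librosa.db_to_amplitude(styled * 80.0 - 80.0)
    mag = cv2.resize(mag, (stft.shape[1], stft.shape[0]))
    return librosa.griffinlim(mag, n_fft=1024, hop_length=256)
```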