Reproducing user-specified text input in a user-selected celebrity voice style
- Tech Stack: Python, Keras, OpenCV, Scikit-learn, Librosa, CNN, VGG-16, Dall-E
- GitHub URL: Project Link
- Medium Article URL: Article Link
For our project’s purposes, instead of dealing with audio as a two-dimensional signal (amplitude over time), we first convert each audio sample into a spectrogram: a three-dimensional representation of the same signal that captures frequency over time, with each point also carrying the amplitude of that frequency at that instant.
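A minimal sketch of this conversion using Librosa is shown below. The file name and the parameters (sample rate, number of mel bands) are illustrative, not the project’s exact settings:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the raw audio: an array of amplitude samples plus its sample rate
# ("celebrity_sample.wav" is a placeholder path)
signal, sr = librosa.load("celebrity_sample.wav", sr=22050)

# Mel-scaled spectrogram: rows are frequency bins, columns are time frames,
# and each cell holds the power of that frequency at that moment.
# 128 mel bands is an assumption, chosen to match the 128-row images
# mentioned later in this write-up.
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale for visibility

# Render as an image so it can be fed to the convolutional network
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.savefig("celebrity_sample_spectrogram.png")
```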
The spectrogram images obtained in Step-1 are passed through a set of convolutions and then fed into a neural network that learns a signature for each speaker. This becomes the speaker-recognition module, trained on the voice of every celebrity in the training data as well as the voice of Google Assistant.
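A hedged sketch of what such a module could look like in Keras, assuming a VGG-16 backbone over the spectrogram images and a softmax head over the speaker identities. The layer sizes, input shape, and `N_SPEAKERS` are placeholders, not the project’s exact configuration:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

N_SPEAKERS = 10  # celebrities in the training set + Google Assistant (assumed count)

# Convolutional base: reuse VGG-16 convolutions on spectrogram "images"
# (assumes spectrograms are saved as 3-channel 128 x 171 images)
base = VGG16(include_top=False, weights="imagenet",
             input_shape=(128, 171, 3))

model = models.Sequential([
    base,
    layers.Flatten(),
    # 4096-dimensional layer that serves as the speaker signature
    layers.Dense(4096, activation="relu", name="speaker_signature"),
    layers.Dense(N_SPEAKERS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```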
From the previous stage, we have a 4096-dimensional convolution output as the input to the decoder, and we want the decoder to recreate spectrogram images of size 128 x 171 (21,888 pixels) as its output.
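One minimal way to realize such a decoder in Keras is a dense up-projection followed by a reshape. The hidden width below is illustrative, and the actual network may well use transposed convolutions instead:

```python
from tensorflow.keras import layers, models

decoder = models.Sequential([
    # 4096-dimensional encoding in, intermediate width is an assumption
    layers.Dense(8192, activation="relu", input_shape=(4096,)),
    # Expand to all 21,888 pixels of the target spectrogram
    layers.Dense(128 * 171, activation="sigmoid"),
    # Reshape the flat output back into a 128 x 171 spectrogram image
    layers.Reshape((128, 171, 1)),
])
decoder.compile(optimizer="adam", loss="mse")  # pixel-wise reconstruction
```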
In this stage, we use the convolutions from the first network (Step-2) to obtain a 4096-dimensional vector representation of each speaker relative to Google Assistant. Our expectation is that when Google Assistant speaks the same phrases as the celebrity, the only difference between the two recordings is the voice (or style) of the speaker. We therefore capture this difference as a 4096-dimensional style representation for each speaker.
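As an illustration of how this style vector could be extracted, the sketch below reuses the `model` from the Step-2 sketch above, truncated at its 4096-unit layer, and subtracts the Google Assistant embedding from the celebrity embedding for the same phrases. The file names and the `encoder` construction are assumptions for the sketch, not the project’s exact code:

```python
import numpy as np
from tensorflow.keras import models

# Reuse the recognition network up to its 4096-dimensional signature layer
encoder = models.Model(inputs=model.input,
                       outputs=model.get_layer("speaker_signature").output)

# Precomputed spectrogram batches of the SAME phrases spoken by the
# celebrity and by Google Assistant (placeholder files)
celebrity_spectrograms = np.load("celebrity_phrases.npy")  # (n, 128, 171, 3)
google_spectrograms = np.load("google_phrases.npy")        # (n, 128, 171, 3)

celebrity_vec = encoder.predict(celebrity_spectrograms)  # (n, 4096)
google_vec = encoder.predict(google_spectrograms)        # (n, 4096)

# Same phrases, same content: what remains after subtraction is the
# speaker's style relative to Google Assistant
style_vector = np.mean(celebrity_vec - google_vec, axis=0)  # (4096,)
```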