Wav2Vec and other audio embeddings

Reading the recent Facebook paper on audio embeddings, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, I wonder how accurate the embeddings for speech recognition would be if they were learned from a large collection of music instead of a large collection of speech. And whether they would turn out to be simply wavelets.