A group of researchers from the Massachusetts Institute of Technology (MIT), the MIT-IBM Watson AI Lab, and IBM Research, among others, has developed a new technique for analyzing unlabeled video and audio data that can improve the performance of machine learning models used in applications such as speech recognition and object detection. To do this, they combined two self-supervised learning approaches: contrastive learning and masked data modeling.
The technique, called the Contrastive Audio-Visual Masked Autoencoder (CAV-MAE), is a type of neural network that learns to extract and map meaningful latent representations in a high-dimensional space from audio and visual data by training on large datasets of 10-second audio and video clips from YouTube. The researchers believe the technique is more effective than previous approaches because it explicitly models the correspondences between audio and visual data in a way that other methods do not.
CAV-MAE works through learning by prediction and learning by comparison. Masked data modeling, the prediction method, takes a video together with its coordinated audio waveform, converts the audio into a spectrogram, and masks 75% of both.
The unmasked data is tokenized and passed through separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference (reconstruction loss) between the reconstructed prediction and the original audio-visual combination is then used to train the model for better performance.
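To make the masked-prediction idea concrete, here is a minimal, hypothetical PyTorch sketch of this style of masked audio-visual modeling: roughly 75% of the patches in each modality are hidden, the visible patches pass through modality-specific encoders and a joint encoder, and the loss is computed on the reconstruction of the masked patches. The module names, dimensions, and the simplification of feeding mask tokens straight into the joint encoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAVSketch(nn.Module):
    """Toy masked audio-visual prediction: mask ~75% of patches in each modality,
    encode the visible rest with separate encoders, fuse in a joint encoder,
    and reconstruct the masked patches."""
    def __init__(self, patch_dim=256, embed_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.audio_encoder = nn.Linear(patch_dim, embed_dim)    # modality-specific encoders
        self.video_encoder = nn.Linear(patch_dim, embed_dim)
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))  # placeholder for masked patches
        self.joint = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.Linear(embed_dim, patch_dim)

    def _encode_masked(self, patches, encoder):
        b, n, _ = patches.shape
        keep = int(n * (1 - self.mask_ratio))                   # keep ~25% of patches
        perm = torch.randperm(n, device=patches.device)
        visible_idx, masked_idx = perm[:keep], perm[keep:]
        tokens = self.mask_token.expand(b, n, -1).clone()
        tokens[:, visible_idx] = encoder(patches[:, visible_idx])
        return tokens, masked_idx

    def forward(self, audio_patches, video_patches):
        # audio_patches, video_patches: (batch, num_patches, patch_dim)
        a_tok, a_masked = self._encode_masked(audio_patches, self.audio_encoder)
        v_tok, v_masked = self._encode_masked(video_patches, self.video_encoder)
        joint = self.joint(torch.cat([a_tok, v_tok], dim=1))    # joint audio-visual encoder
        recon = self.decoder(joint)
        n_a = audio_patches.shape[1]
        a_recon, v_recon = recon[:, :n_a], recon[:, n_a:]
        # The reconstruction loss is computed only on the masked patches.
        return (F.mse_loss(a_recon[:, a_masked], audio_patches[:, a_masked]) +
                F.mse_loss(v_recon[:, v_masked], video_patches[:, v_masked]))
```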
Audio-visual retrieval with CAV-MAE
The researchers tested CAV-MAE against other state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks, using the standard AudioSet (20K and 2M) and VGGSound datasets, which contain short, realistically labeled clips that can include multiple sounds. Audio-visual retrieval means the model sees either the audio or the video component of a query pair and searches for the missing one; event classification involves identifying actions or sounds within the data, such as a person singing or a car driving.
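As a simple illustration of how such retrieval can be scored, the snippet below ranks candidate videos for each audio query by cosine similarity between their learned embeddings; the random tensors stand in for the model's actual representations and are purely illustrative.

```python
import torch
import torch.nn.functional as F

num_clips, embed_dim = 5, 128
# Placeholder embeddings: in practice these would come from the trained audio/video encoders.
audio_emb = F.normalize(torch.randn(num_clips, embed_dim), dim=-1)
video_emb = F.normalize(torch.randn(num_clips, embed_dim), dim=-1)

# Similarity matrix: row i scores how well audio clip i matches each candidate video.
similarity = audio_emb @ video_emb.T

# Audio-to-video retrieval: for each audio query, rank the candidate videos.
ranked_videos = similarity.argsort(dim=-1, descending=True)
print(ranked_videos[0])  # indices of the best-matching videos for audio query 0
```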
In general, they found contrastive learning and masked data modeling to be complementary methods. CAV-MAE outperformed previous techniques by approximately 2% in event classification performance relative to models with comparable computation, and matched or outperformed models with industry-level computational resources.
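One way to read "complementary" is that the two objectives can be combined into a single training loss. The sketch below, assuming an InfoNCE-style contrastive term and an arbitrary weighting, shows the general shape of such a combination; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style loss: paired audio/video clips should be more similar to
    each other than to the other clips in the batch."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.T / temperature
    targets = torch.arange(len(a), device=a.device)  # i-th audio pairs with i-th video
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def total_loss(recon_loss, audio_emb, video_emb, lambda_c=0.01):
    # Complementary objectives: masked reconstruction plus contrastive alignment.
    return recon_loss + lambda_c * contrastive_loss(audio_emb, video_emb)
```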
The team's model also ranked similarly to models trained with only a contrastive loss. In addition, incorporating multimodal data into CAV-MAE pre-training improves the fine-tuning of single-modality representations through supervised learning, as well as performance on audio-only event classification tasks.
The researchers view their contrastive audio-visual masked autoencoder as an important milestone and a step forward for applications, which are increasingly moving from a single modality to multiple modalities and which require or benefit from audio-visual fusion. They hypothesize that it could one day be used for action recognition in areas such as sports, education, entertainment, motor vehicles, and public safety, and that it could be extended to other modalities.