Key Points

1. Multimodal learning poses challenges in combining heterogeneous modalities such as video, audio, and text due to differences in sampling rates, temporal alignment, and data volume.

2. The Mirasol3B model proposed in the paper addresses these challenges by decoupling multimodal modeling into separate autoregressive components for time-aligned and non-time-aligned modalities.

3. The proposed model partitions video and audio inputs into consecutive snippets, processes their representations per snippet, and introduces a Combiner mechanism for joint feature learning (a sketch follows this list).

4. Mirasol3B achieves state-of-the-art performance on multimodal benchmarks and effectively addresses the high computational demand of media inputs by learning compact representations and controlling the sequence length of audio-video feature representations.

5. The paper discusses in detail the use of autoregressive modeling in time for video and audio inputs, as well as the incorporation of contextual text information using cross-attention mechanisms.

6. The proposed model outperforms much larger models on various benchmarks, including video question-answering, long-video datasets, and audio-video benchmarks.
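
To make the snippet partitioning and Combiner ideas concrete, here is a minimal PyTorch-style sketch. The names and sizes (`partition_into_snippets`, `Combiner`, `snippet_len`, the latent count) are illustrative assumptions, not the paper's implementation; the sketch only shows the general mechanism of fusing per-snippet audio and video features into a small set of joint latent tokens.

```python
import torch
import torch.nn as nn

def partition_into_snippets(features, snippet_len):
    """Split a (batch, time, dim) feature sequence into consecutive snippets
    of length snippet_len, dropping any trailing remainder."""
    b, t, d = features.shape
    t_trim = (t // snippet_len) * snippet_len
    return features[:, :t_trim].reshape(b, t_trim // snippet_len, snippet_len, d)

class Combiner(nn.Module):
    """Illustrative combiner: fuses the audio and video features of one snippet
    into a small, fixed number of joint latent tokens via self-attention."""

    def __init__(self, dim=512, num_latents=8, num_layers=2, num_heads=8):
        super().__init__()
        self.num_latents = num_latents
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, T_audio, dim); video_feats: (batch, T_video, dim)
        b = audio_feats.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Let attention mix the learnable latents with both modalities' tokens.
        mixed = self.encoder(torch.cat([latents, audio_feats, video_feats], dim=1))
        # Keep only the latent positions as the compact joint representation.
        return mixed[:, : self.num_latents]
```

In the full model, it is these compact per-snippet representations that the autoregressive components consume, which is how the sequence length of the audio-video features is kept under control.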

Summary

The paper proposes Mirasol3B, a multimodal autoregressive model that addresses the challenge of combining time-aligned media modalities with non-time-aligned contextual modalities. The model is divided into separate autoregressive components: one for the time-synchronized media modalities (audio and video) and one for contextual modalities such as text. A key contribution is the Combiner mechanism, which learns a joint feature representation for the media modalities and helps model long-range dependencies. Trained on publicly available datasets, the model achieves state-of-the-art results and outperforms much larger models, particularly on audio-video-text and long-video benchmarks. The paper also compares the approach to existing work and presents ablations that isolate the effect of each model component.
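
The decoupling described above can be sketched as two pieces: an autoregressive model over the compact per-snippet media features, and a text decoder that cross-attends to them. The class names, layer counts, and dimensions below are assumptions for illustration only, not the paper's architecture details.

```python
import torch
import torch.nn as nn

def causal_mask(length):
    """Boolean mask that prevents attending to future positions."""
    return torch.triu(torch.ones(length, length), diagonal=1).bool()

class MediaAutoregressiveComponent(nn.Module):
    """Sketch of the time-aligned component: models the sequence of compact
    per-snippet audio-video features autoregressively in time."""

    def __init__(self, dim=512, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, snippet_feats):
        # snippet_feats: (batch, num_snippets, dim), simplified here to one
        # combiner vector per snippet.
        return self.transformer(snippet_feats, mask=causal_mask(snippet_feats.size(1)))

class TextAutoregressiveComponent(nn.Module):
    """Sketch of the contextual component: an autoregressive text decoder that
    cross-attends to the media features produced by the component above."""

    def __init__(self, vocab_size=32000, dim=512, num_layers=4, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, text_tokens, media_feats):
        x = self.embed(text_tokens)  # (batch, text_len, dim)
        x = self.decoder(tgt=x, memory=media_feats,
                         tgt_mask=causal_mask(text_tokens.size(1)))
        return self.lm_head(x)       # next-token logits over the vocabulary
```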

Evaluation uses widely adopted video question-answering benchmarks, including MSRVTT-QA, ActivityNet-QA, and NExT-QA. On these tasks the proposed model outperforms larger models, with especially strong results on long-video datasets.

Beyond improving accuracy, the Combiner improves how parameters are distributed across the model and reduces its overall size. The authors visualize different combiner variants and find that their main combiner outperforms the others.

The proposed Mirasol3B model has 3B parameters in total, with a smaller 1.15B-parameter model used for ablations. Pretraining is done on the Video-Text Pairs (VTP) dataset, and fine-tuning follows a comprehensive recipe that includes audio pretraining and regularization techniques such as mixup, SpecAugment, dropout, and label smoothing. Ablation experiments conducted with the smaller model demonstrate the effectiveness of the proposed approach. The paper also reports results on audio-video benchmarks such as VGG-Sound, Epic-Sound, and Kinetics-Sound, all formulated as audio-video-text classification tasks.
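
As an illustration only, the regularization techniques mentioned above (mixup, SpecAugment-style masking, label smoothing) could be wired up as follows; the hyperparameter values and helper names are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
import torchaudio.transforms as T

# SpecAugment-style masking applied to log-mel spectrograms
# (the mask sizes here are placeholders, not the paper's values).
spec_augment = nn.Sequential(
    T.FrequencyMasking(freq_mask_param=24),
    T.TimeMasking(time_mask_param=48),
)

# Label smoothing as part of the classification loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def mixup(inputs, labels, num_classes, alpha=0.3):
    """Standard mixup: convex combinations of examples and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(inputs.size(0))
    mixed_inputs = lam * inputs + (1.0 - lam) * inputs[perm]
    one_hot = nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_inputs, mixed_labels
```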

Reference: https://arxiv.org/abs/2311.05698