UniMVU: Not All Modalities Are Equal

Instruction-Aware Gating for Multimodal Videos

1 Mohamed bin Zayed University of Artificial Intelligence  ·  2 Tianjin University  ·  3 Chongqing University  ·  4 Linköping University

[Figure: UniMVU teaser showing video, audio, depth, and long-video inputs balanced before reasoning in a shared VideoLLM.]
UniMVU uses instruction-aware gating to rebalance video, audio, depth, and long-video evidence before unified multimodal reasoning.

Abstract

Pre-trained video large language models excel at visual reasoning, but they struggle when videos arrive with auxiliary streams such as audio, depth maps, or high-frame-rate inputs. In these settings, uniform fusion can introduce modality interference and let irrelevant channels distract the model. UniMVU addresses this with instruction-aware fusion across video, audio, depth, and long-video evidence via two dynamic gating stages: feature-level gates emphasize salient regions within each modality, while modality-level gates reweight whole streams based on the input instruction. Across six benchmarks, UniMVU delivers consistent gains over static-fusion baselines, including up to +13.5 CIDEr on AVSD.

[Figure: qualitative teaser comparing UniMVU and a baseline with query-specific modality weights.]
UniMVU changes modality weights with the question instead of applying a fixed fusion recipe to every example.

Method

UniMVU Architecture

The instruction is used to guide both within-modality feature selection and cross-modality balancing before the final VideoLLM reasoning stage.

01  Modality encoders and projectors

Video, audio, and depth streams are aligned into a shared token space for unified multimodal reasoning.

02  Instruction-driven inner-modality gating

Salient evidence is emphasized within each modality so irrelevant local features do not dominate the answer.

03  Instruction-driven modality balancing

Whole modalities are reweighted before fusion, allowing the model to trust the right evidence source for each question.
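
The two gating stages can be illustrated with a toy sketch. This is not the UniMVU implementation: the function names (inner_modality_gate, modality_weights, fuse) are hypothetical, and a plain dot product with the instruction embedding stands in for the learned gate networks, whose exact form is not given on this page. The sketch only assumes that feature-level gates are sigmoid-like values in (0, 1) and that modality-level weights are normalized with a softmax.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def inner_modality_gate(tokens, instruction):
    """Stage 1 (assumed form): scale each token by a gate in (0, 1)
    derived from its similarity to the instruction embedding."""
    return [[sigmoid(dot(tok, instruction)) * v for v in tok] for tok in tokens]

def modality_weights(gated, instruction):
    """Stage 2 (assumed form): softmax over one relevance score per modality."""
    names = list(gated)
    scores = [dot(mean_pool(gated[name]), instruction) for name in names]
    return dict(zip(names, softmax(scores)))

def fuse(streams, instruction):
    """Gate tokens within each modality, then reweight whole modalities."""
    gated = {name: inner_modality_gate(toks, instruction)
             for name, toks in streams.items()}
    weights = modality_weights(gated, instruction)
    fused = [[weights[name] * v for v in tok]
             for name, toks in gated.items() for tok in toks]
    return fused, weights

# Toy 2-dimensional embeddings; the instruction aligns best with the audio tokens.
instruction = [1.0, 0.0]
streams = {
    "video": [[0.2, 0.9], [0.1, 0.5]],
    "audio": [[2.0, 0.1], [1.5, 0.3]],
    "depth": [[-1.0, 0.4]],
}
fused, weights = fuse(streams, instruction)
print(weights)
```

In this toy setup the audio stream receives the largest modality weight because its tokens align most strongly with the instruction vector, which mirrors the idea of trusting the evidence source that matches the question.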

[Figure: UniMVU architecture and training strategy.]
UniMVU framework and modality-balanced fusion. The instruction conditions both intra-modality and inter-modality gating before LLM reasoning.

Results

Main Results

Evaluation across audio-video QA, 3D QA, and long-video QA. PAVE* denotes our reproduction using the public PAVE code. UniMVU refers to the jointly trained multi-task model reported in the paper. For ScanQA and SQA3D, refined scores are shown in parentheses where reported.

Music-AVQA

Scale  Method                             Audio Avg.  Visual Avg.  AV Avg.  Overall Avg.
7B     CAT-FT                             84.9        86.1         83.2     84.3
7B     LLaVA-OV-FT (video-only)           75.4        89.3         72.3     77.4
7B     PAVE*                              79.1        92.7         77.8     81.9
7B     UniMVU                             78.9        92.8         77.2     81.6
7B     UniMVU                             81.7        93.5         79.8     83.7
0.5B   VAST                               -           -            -        80.7
0.5B   AVAF-Net                           78.1        82.3         72.1     75.9
0.5B   AV-Master                          79.9        86.5         74.2     78.5
0.5B   LLaVA-OV-FT (video-only)           69.6        76.3         62.8     67.6
0.5B   LLaVA-OV-FT* (video-audio concat)  76.2        89.1         72.4     77.5
0.5B   PAVE*                              75.9        88.6         72.4     77.3
0.5B   UniMVU                             77.2        90.2         74.8     79.3
0.5B   UniMVU                             79.5        91.8         76.7     81.9

AVQA and AVSD

Scale  Method                             AVQA ACC (%)  AVSD ROUGE-L  AVSD CIDEr
7B     LLaVA-OV-FT (video-only)           90.8          -             124.9
7B     PAVE*                              93.4          38.5          151.6
7B     UniMVU                             92.2          39.5          162.7
7B     UniMVU (AVQA / AVSD)               94.3          39.8          165.1
0.5B   PSTP-Net                           90.2          -             -
0.5B   AV-Master                          91.4          -             -
0.5B   LLaVA-OV-FT (video-only)           86.4          -             117.6
0.5B   LLaVA-OV-FT* (video-audio concat)  89.9          35.7          127.8
0.5B   PAVE*                              89.6          36.5          134.9
0.5B   UniMVU                             91.1          37.8          145.9
0.5B   UniMVU (AVQA / AVSD)               92.3          38.2          147.1
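
As a quick check on the AVSD numbers, the short script below recomputes the CIDEr gains over the PAVE* reproduction from the rows above. Treating PAVE* as the reference baseline is an assumption made for this illustration; under that assumption, the 7B UniMVU (AVQA / AVSD) row appears to be the source of the "+13.5 CIDEr" figure quoted in the abstract. The helper name gains_over_baseline is hypothetical.

```python
# AVSD CIDEr values copied from the AVQA and AVSD table.
avsd_cider = {
    "7B": {"PAVE*": 151.6, "UniMVU": 162.7, "UniMVU (AVQA / AVSD)": 165.1},
    "0.5B": {"PAVE*": 134.9, "UniMVU": 145.9, "UniMVU (AVQA / AVSD)": 147.1},
}

def gains_over_baseline(scores, baseline="PAVE*"):
    """Return each method's CIDEr difference to the baseline, rounded to 0.1."""
    base = scores[baseline]
    return {method: round(value - base, 1)
            for method, value in scores.items() if method != baseline}

gains = {scale: gains_over_baseline(scores) for scale, scores in avsd_cider.items()}
for scale, per_method in gains.items():
    for method, gain in per_method.items():
        print(f"{scale} {method}: {gain:+.1f} CIDEr vs PAVE*")
```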

ScanQA

Scale  Method                         EM@1         BLEU-4  METEOR  ROUGE-L  CIDEr
7B     LLaVA-3D-7B                    27.0 (45.0)  14.5    20.7    50.1     91.7
7B     Scene-LLM-7B                   27.2         12.0    16.6    40.0     80.0
7B     LLaVA-OV-FT (video-only)       27.4 (46.3)  13.5    -       47.4     95.1
7B     PAVE*                          28.9 (48.2)  16.0    19.8    48.8     102.4
7B     UniMVU                         29.2 (48.8)  17.83   20.1    49.01    104.2
7B     UniMVU                         29.6 (48.8)  16.0    19.8    49.0     102.7
0.5B   SceSU                          25.1         13.2    14.9    35.5     69.6
0.5B   DSPNet                         26.5         15.4    15.7    39.3     78.1
0.5B   LLaVA-OV-FT (video-only)       20.5 (36.3)  6.5     14.3    36.9     70.5
0.5B   LLaVA-OV-FT (video-3d concat)  10.2 (24.3)  4.9     7.4     20.2     34.9
0.5B   PAVE*                          23.5 (40.4)  12.7    17.1    42.7     84.9
0.5B   UniMVU                         24.7 (42.0)  14.5    17.9    44.2     89.7
0.5B   UniMVU                         25.9 (43.2)  13.5    18.0    44.7     90.9

SQA3D

Scale  Method                    EM@1         What         Is           How
7B     LLaVA-3D-7B               55.6 (57.6)  -            -            -
7B     Scene-LLM-7B              54.2         -            -            -
7B     LLaVA-OV-FT (video-only)  55.8 (58.1)  -            -            -
7B     PAVE*                     57.6 (59.9)  52.3 (56.9)  69.2 (69.9)  56.3 (57.4)
7B     UniMVU                    58.1 (60.4)  52.3 (56.6)  70.7 (71.8)  60.4 (61.1)
7B     UniMVU                    59.4 (61.6)  53.4 (57.7)  75.9 (76.8)  55.9 (56.1)
0.5B   SceSU                     46.8         32.2         64.9         46.2
0.5B   DSPNet                    50.4         38.2         66.0         51.2
0.5B   LLaVA-OV-FT (video-only)  44.1 (45.7)  -            -            -
0.5B   PAVE*                     48.5 (50.6)  37.5 (41.6)  61.0 (62.1)  50.1 (50.3)
0.5B   UniMVU                    50.8 (52.6)  40.9 (44.4)  63.7 (64.7)  52.5 (52.7)
0.5B   UniMVU                    55.2 (57.1)  46.5 (50.6)  67.6 (68.4)  57.4 (57.6)
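
For readers who want to process the ScanQA and SQA3D numbers programmatically, the helper below is one possible way to split a cell such as "58.1 (60.4)" into the plain score and the refined score in parentheses. The name parse_cell and the regular expression are illustrative choices, not part of any released UniMVU tooling.

```python
import re

# A number, optionally followed by a second number in parentheses.
CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:\(\s*(\d+(?:\.\d+)?)\s*\))?\s*$")

def parse_cell(cell):
    """Return (score, refined); refined is None when no parenthesized value is present."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    score, refined = match.groups()
    return float(score), (float(refined) if refined is not None else None)

print(parse_cell("58.1 (60.4)"))
print(parse_cell("54.2"))
```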

MVBench

Scale  Method                  SC    FGP   OS    AP    AS    Avg.
0.5B   LLaVA-OV                37.5  49.0  33.0  -     -     45.5
0.5B   PAVE*                   41.0  50.0  32.0  43.0  46.5  44.5
0.5B   UniMVU                  37.5  49.0  30.5  64.0  63.0  48.6
0.5B   UniMVU                  43.0  50.5  30.0  53.5  52.0  46.7
7B     VideoChat2-7B           44.0  49.0  42.5  47.5  66.0  51.1
7B     VideoLLaMA2.1-7B        -     -     -     -     -     57.3
7B     PAVE-7B*                51.0  53.5  39.5  70.5  70.7  57.1
7B     LLaVA-OV-7B (Baseline)  52.0  53.0  35.5  -     -     56.7
7B     UniMVU                  51.5  58.0  39.5  76.5  76.1  59.5
7B     UniMVU                  51.0  54.5  38.5  71.0  77.0  58.0

Qualitative Results

The qualitative examples show that UniMVU shifts attention toward the modality that actually resolves the question, rather than relying on a fixed multimodal weighting.

[Figure: qualitative UniMVU examples across audio-video, 3D, and long-video tasks.]
Qualitative examples across task families. The answer quality tracks the modality weights implied by the question.

[Figure: additional qualitative UniMVU examples across challenging multimodal cases.]
Additional qualitative cases highlighting instruction-aware modality reweighting.

Citation

If you find UniMVU useful in your research, please cite the paper below.

@article{ding2026unimvu,
  title   = {Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos},
  author  = {Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz},
  journal = {arXiv preprint arXiv:2605.26232},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.26232}
}

Acknowledgement

We gratefully acknowledge the open-source projects that UniMVU builds upon: PAVE, Qwen2, LLaVA-OneVision, and LMMS-Eval.
