VidMsg: A Benchmark for Implicit Message Inference in Short Videos

OriginAI, Israel

400 clips, 9 topic areas, 52 target messages, 2 evaluation protocols

VidMsg evaluates whether models can identify and retrieve videos by what they communicate.

Example clips with their associated target messages (VidMsg retrieval setting).
Clip duration distribution.
Topic and message distribution. The inner ring shows topics and the outer ring shows their individual messages.

Abstract

Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle.

VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message–clip retrieval, which targets scalable applications such as video search and recommendation, where systems must understand a video holistically. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives.

Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.

Dataset Overview

VidMsg covers nine recurring communicative domains: career & finance, community & culture, cultural trends, education, health & well-being, lifestyle & creativity, mobility & work, safety & empowerment, and sustainability. These topics were selected because clips in such domains often convey abstract advice, values, emotions, social meanings, or implicit viewpoints rather than merely documenting visible events.

The dataset is relatively balanced across topics: the number of clips per topic ranges from 35 to 49. Across 52 target messages, each message is represented by 5 to 9 clips, with most messages having 8 or 9 clips. This structure supports both broad topic coverage and fine-grained message-level evaluation.

Importantly, within each topic the benchmark includes semantically related but distinct messages—such as “food as self-care” versus “lifestyle medicine beyond symptom relief,” or “work-life integration” versus “performance over work hours.” Models must therefore distinguish fine-grained communicative meanings rather than simply identify the general domain.

Total number of clips per topic.

Message-First Data Collection

Constructing a benchmark for implicit message understanding is challenging because standard web search is optimized for explicit objects, events, and keywords. Directly searching for a target message often retrieves clips where the message is stated verbatim in the title, speech, or text overlay. To address this, we adopt a message-first collection pipeline.

Overview of the message-first video dataset construction pipeline. (1) Starting from a target message, (2) an LLM generates a visual storyline in which the message is embedded implicitly; (3) an LLM derives indirect search keywords from the storyline, which (4) an LLM judge verifies and which (5) are used to retrieve candidate videos. (6) After duration filtering, (7) a second LLM judge removes unrelated clips. The remaining clips are then annotated by three AMT workers and conservatively filtered, yielding 400 clips from 520 candidates (77% retention).

The process starts with a target message. An LLM is prompted to generate short, story-like scenarios that naturally incorporate the target message into the narrative without stating it directly. A second LLM generates indirect search keywords from these stories, and a separate LLM judge verifies that the keywords remain related to the original message without direct lexical disclosure. The resulting keywords are used to retrieve candidate public clips, which are restricted to a duration of three minutes or less. A second LLM judge then removes clips that are unrelated to the target message.

Candidate clips are validated through human annotation on Amazon Mechanical Turk. Each clip is assigned to three independent annotators, who rate on a five-point scale how strongly it conveys the target message, and flag clips where the message is too explicit. A clip is retained only if a majority of annotators assign a high score (4 or 5), no annotator assigns a score of 1, and no annotator marks the clip as “too obvious.”
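The retention rule above can be written as a small predicate. This is a minimal sketch, not the authors' released code: the function name `retain_clip` and the input layout (one list of scores and one list of "too obvious" flags per clip) are assumptions made for illustration.

```python
from typing import List


def retain_clip(scores: List[int], too_obvious: List[bool]) -> bool:
    """Apply the conservative retention rule to one clip's annotations.

    scores: five-point ratings (1-5), one per annotator (hypothetical layout).
    too_obvious: one flag per annotator, True if the message was judged too explicit.
    """
    # A majority of annotators must assign a high score (4 or 5).
    high_votes = sum(1 for s in scores if s >= 4)
    majority_high = high_votes > len(scores) / 2
    # No annotator may assign the lowest score.
    no_lowest_score = all(s != 1 for s in scores)
    # No annotator may flag the clip as too obvious.
    no_explicit_flag = not any(too_obvious)
    return majority_high and no_lowest_score and no_explicit_flag


# Toy annotations: two high scores, no score of 1, no explicitness flag.
kept = retain_clip([5, 4, 3], [False, False, False])
# Same scores, but one annotator flags the clip as too obvious.
dropped = retain_clip([5, 4, 3], [False, True, False])
```

All three conditions must hold at once, so a single score of 1 or a single "too obvious" flag is enough to reject a clip that the other two annotators rated highly.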

Annotation Agreement

We evaluate inter-rater reliability using average pairwise linear-weighted κ among the three annotators assigned to each clip. Agreement is substantial across all annotated clips (κ = 0.681), and higher among the 400 retained clips (κ = 0.795), as expected given the conservative filtering. These results indicate that annotators generally agree on whether a clip conveys the target message, while some variation reflects the inherent nuance of implicit message understanding.
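For readers unfamiliar with the statistic, the following sketch computes linear-weighted Cohen's κ for one pair of raters and averages it over all rater pairs. It assumes that ratings are stored as one tuple of three scores per clip and that each position in the tuple can be treated as a consistent "rater slot"; the page does not specify how annotator identities were handled, so this layout and the function names are illustrative only.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def linear_weighted_kappa(a: Sequence[int], b: Sequence[int],
                          categories: Sequence[int] = (1, 2, 3, 4, 5)) -> float:
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Observed joint proportions of (rater a, rater b) scores.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]
    if weighted_expected == 0.0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


def average_pairwise_kappa(ratings: List[Tuple[int, ...]]) -> float:
    """Average kappa over all pairs of rater slots (assumed layout)."""
    raters = list(zip(*ratings))  # one score sequence per rater slot
    pairs = list(combinations(range(len(raters)), 2))
    return sum(linear_weighted_kappa(raters[i], raters[j]) for i, j in pairs) / len(pairs)


# Toy ratings for four clips, three annotators each (made-up values).
toy_ratings = [(5, 4, 5), (2, 2, 3), (4, 4, 4), (1, 2, 1)]
toy_kappa = average_pairwise_kappa(toy_ratings)
```

Linear weights penalize a 4-versus-5 disagreement far less than a 1-versus-5 disagreement, which suits an ordinal five-point scale.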

Evaluation Protocols

Bidirectional Retrieval

Bidirectional retrieval is the primary evaluation protocol. In text-to-video (T2V) retrieval, each of the 52 target messages is used as a query against all 400 clips. In video-to-text (V2T) retrieval, each clip is used to retrieve its corresponding target message from the 52 candidates. This setting reflects scalable applications such as search, recommendation, and content analysis. We report Recall@10 and mAP for text-to-video, and Recall@1 and mAP for video-to-text.
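The two retrieval metrics can be sketched from a query-by-candidate similarity matrix as follows. This is an illustration under stated assumptions, not the benchmark's official scorer: the function names are hypothetical, and Recall@K is taken here to mean the fraction of queries with at least one relevant candidate in the top K, which is one common convention that the page does not confirm.

```python
from typing import Dict, List, Sequence, Set


def rank_candidates(scores: Sequence[float]) -> List[int]:
    """Return candidate indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def average_precision(ranked: Sequence[int], relevant: Set[int]) -> float:
    """Mean of the precision values at ranks where relevant items appear."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def hit_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Assumed Recall@K variant: 1.0 if any relevant item is in the top k."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def evaluate(similarity: List[List[float]], relevant: List[Set[int]],
             k: int) -> Dict[str, float]:
    """similarity[q][c] scores candidate c for query q; relevant[q] holds its positives."""
    rankings = [rank_candidates(row) for row in similarity]
    n = len(rankings)
    recall = sum(hit_at_k(r, rel, k) for r, rel in zip(rankings, relevant)) / n
    mean_ap = sum(average_precision(r, rel) for r, rel in zip(rankings, relevant)) / n
    return {"recall@k": 100.0 * recall, "mAP": 100.0 * mean_ap}


# Toy example: 2 queries, 3 candidates (made-up similarity scores).
toy_similarity = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
toy_relevant = [{0, 2}, {2}]
toy_metrics = evaluate(toy_similarity, toy_relevant, k=1)
```

For text-to-video, each message query has several relevant clips, so `relevant[q]` contains multiple indices. For video-to-text, the similarity matrix is transposed and each clip has exactly one relevant message, in which case average precision reduces to the reciprocal rank of that message.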

Multiple-Choice QA

Multiple-choice QA is a diagnostic protocol. For each clip, the model selects the main message from 5 candidate messages sampled from the same topic, so random guessing yields 20% accuracy. The same-topic design makes the task more challenging than coarse topic classification, since distractors are semantically related to the correct message. This setting evaluates MLLMs in their native instruction-following mode, exposing gaps in direct message reasoning.
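A same-topic question of this kind could be assembled roughly as shown below. The sketch assumes uniform sampling of four distractors from the other messages in the clip's topic; the actual sampling procedure, the function names, and the placeholder message strings are not taken from the benchmark.

```python
import random
from typing import Dict, List


def build_mcq(target: str, topic_messages: List[str], rng: random.Random,
              num_options: int = 5) -> Dict[str, object]:
    """Build one question: the target message plus same-topic distractors."""
    distractor_pool = [m for m in topic_messages if m != target]
    if len(distractor_pool) < num_options - 1:
        raise ValueError("topic has too few messages for the requested options")
    # Sample distractors without replacement, then shuffle the answer position.
    options = rng.sample(distractor_pool, num_options - 1) + [target]
    rng.shuffle(options)
    return {"options": options, "answer_index": options.index(target)}


def accuracy(predictions: List[int], answers: List[int]) -> float:
    """Percentage of questions where the predicted option index is correct."""
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return 100.0 * correct / len(answers)


# Placeholder messages standing in for one topic's target messages.
toy_topic = ["message A", "message B", "message C",
             "message D", "message E", "message F"]
toy_question = build_mcq("message C", toy_topic, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampled options reproducible across runs, which matters when the same questions are posed to many models.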

Example multiple-choice questions from three topic areas (Mobility & Work, Lifestyle & Creativity, and Career). Each clip is paired with 5 same-topic candidate messages.

Results

Retrieval Results

VidMsg is challenging for standard retrieval models, especially those trained for caption-like video–text alignment. Among off-the-shelf baselines, Qwen3-VL-Emb performs best, reaching 39.7 R@10 and 36.1 mAP (T2V) without transcripts. Our baseline, VidVec-Msg, outperforms all baselines in the setting without transcripts, achieving 47.5 R@10 and 45.6 mAP (T2V) and 48.9 R@1 and 63.2 mAP (V2T). Providing transcripts yields further gains, although Qwen3-VL-Emb attains a slightly higher V2T R@1 in that setting (50.6 vs. 49.1).

Method              Text-to-Video                       Video-to-Text
                    w/o transcripts   w/ transcripts    w/o transcripts   w/ transcripts
                    R@10    mAP       R@10    mAP       R@1     mAP       R@1     mAP
Clip4Clip           21.8    19.3      -       -         24.1    37.3      -       -
ImageBind           24.6    23.6      -       -         20.0    34.8      -       -
PE-Core             10.3    10.0      -       -         8.1     16.8      -       -
InternVideo2-1B     19.5    17.4      -       -         17.7    29.4      -       -
InternVideo2-6B     25.5    22.6      -       -         23.0    36.1      -       -
VideoPrism          28.4    24.8      -       -         29.9    42.8      -       -
VLM2Vec-2.0         29.0    27.6      32.3    31.0      27.9    43.6      34.4    49.2
Qwen3-VL-Emb        39.7    36.1      42.7    42.8      45.8    50.6      50.6    63.1
VidVec              28.4    25.3      32.1    28.6      36.0    51.0      39.8    54.5
VidVec-Msg (Ours)   47.5    45.6      48.9    46.7      48.9    63.2      49.1    65.0

(A dash marks a setting with no reported result.)

VidMsg retrieval performance. Recall@10 and mAP for Text-to-Video; Recall@1 and mAP for Video-to-Text; with and without ASR transcripts. Bold indicates best result per column.

Multiple-Choice QA Results

Among open-source models, Qwen2.5-VL-7B achieves the best overall accuracy at 71.9%, outperforming larger variants such as Qwen2.5-VL-32B (67.6%) and Qwen3-VL-32B (68.9%). Commercial models obtain the strongest results: Gemini-3-Flash and Gemini-3.1-Pro both reach 76.5%. At the low end, MiniCPM-V-4.5-8B scores 22.3%, barely above the 20% random-choice baseline, and VideoLLaMA3-7B reaches only 40.5%, confirming that VidMsg-QA is a challenging diagnostic.

Model               Car.   Comm.  Cult.  Edu.   Hlth.  Life.  Mob.   Safe.  Sust.  Overall
Commercial Large Multimodal Models
GPT-5.4-Mini        46.5   73.5   60.0   65.2   73.3   71.8   65.9   76.6   97.9   70.6
GPT-5.4             55.8   71.4   54.3   67.4   75.6   69.2   65.9   72.3   95.7   70.4
Gemini-3-Flash      67.4   75.5   65.7   73.9   73.3   74.4   81.8   76.6   95.7   76.5
Gemini-3.1-Pro      65.1   67.3   65.7   80.4   75.6   71.8   84.1   78.7   95.7   76.5
Open-Source Vision-Language Models
Qwen3-VL-2B         41.9   67.3   65.7   54.4   57.8   61.5   65.9   66.0   95.7   64.3
VideoLLaMA3-7B      27.9   40.8   31.4   39.1   33.3   46.1   31.8   48.9   61.7   40.5
Qwen2.5-VL-7B       55.8   77.5   68.6   63.0   68.9   69.2   79.5   70.2   91.5   71.9
MiniCPM-V-4.5-8B    20.9   22.4   28.6   17.4   22.2   25.6   20.4   23.4   21.3   22.3
Molmo2-8B           44.2   55.1   40.0   56.5   48.9   61.5   45.5   72.3   89.4   57.7
NVILA-8B            41.9   55.1   57.1   45.6   48.9   46.1   40.9   66.0   78.7   53.7
LLaVA-1.5-13B       46.5   57.1   60.0   47.8   46.7   59.0   50.0   48.9   80.8   55.2
Qwen3-VL-8B         53.5   61.2   60.0   47.8   64.4   69.2   61.4   68.1   95.7   64.8
Qwen2.5-VL-32B      46.5   79.6   48.6   50.0   77.8   66.7   65.9   70.2   95.7   67.6
Qwen3-VL-32B        53.5   75.5   57.1   58.7   66.7   76.9   65.9   63.8   97.9   68.9

VidMsg MCQ accuracy (%) by topic. Topics: Car. = Career & Finance, Comm. = Community & Culture, Cult. = Cultural Trends, Edu. = Education, Hlth. = Health & Well-Being, Life. = Lifestyle & Creativity, Mob. = Mobility & Work, Safe. = Safety & Empowerment, Sust. = Sustainability. Bold = best within model group per column.

Per-topic results reveal substantial variation in difficulty. Sustainability is the easiest topic for 13 of the 14 models, and 8 models exceed 95% accuracy on it, suggesting its messages are more visually distinctive. In contrast, topics such as career and finance, cultural trends, and lifestyle are more challenging, requiring finer pragmatic distinctions and stronger contextual interpretation.

These results complement the retrieval evaluation: VidMsg is not merely a representation-learning challenge. It also exposes gaps in direct message reasoning, especially when the correct answer must be selected among closely related alternatives.

Key Contributions

  • VidMsg benchmark: A benchmark for implicit message understanding in short, real-world, internet-native video clips — 400 clips across 9 topic areas and 52 fine-grained target messages.
  • Message-first pipeline: A data construction pipeline using LLM-generated storylines, indirect search, and conservative human relevance and explicitness filtering.
  • Two evaluation protocols: Bidirectional message–clip retrieval (T2V and V2T) and multiple-choice QA, enabling assessment of both scalable retrieval and direct message reasoning.
  • Extensive model evaluation: Retrieval results for 10 methods and MCQ results for 14 models, showing that current systems remain far from reliable on VidMsg.
  • VidVec-Msg baseline: A lightweight text-only adaptation method that substantially improves message-oriented retrieval without requiring video-based training data.

BibTeX

@inproceedings{tzachor2026vidmsg,
  title     = {VidMsg: A Benchmark for Implicit Message Inference in Short Videos},
  author    = {Issar Tzachor and Michael Green and Rami Ben-Ari},
  year      = {2026},
}