Abstract

Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, a new benchmark suite for interleaved cross-modality comprehension and generation. Our benchmark consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-verified evaluation benchmark with 100 tasks based on 600 images, featuring an agent-based, expert-verified evaluation framework that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains such as science, creation, logic, and games. Experiments demonstrate that training on WEAVE-100k enables unified multimodal models to develop emergent visual-memory capabilities, while evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation. We believe WEAVE provides a foundation for studying long-context reasoning and editing in interleaved vision-language models.

Data Overview

Figure 1: Compared to previous works that focus on single-turn edits (a), WEAVE uniquely enables multi-turn editing with visual memory recall (b), allowing models to reference and restore previously edited elements across multiple turns.
| Statistic | Number |
| --- | --- |
| Total Chats | 100,750 |
| Chats with ≥4 Images | 100,584 |
| Chats with ≥5 Images | 60,361 |
| Chats with ≥6 Images | 31,571 |
| Average Turns per Chat | 3.79 |
| Average Question Length | 195.49 |
| Total Images | 505,186 |
| Maximum Images per Chat | 8 |
| Average Images per Chat | 5.01 |
Figure 2: Summary of domain distributions and evaluation methods for WEAVE.

For each instance in WEAVE, we provide a text prompt, one or more initial images, and ground-truth example outputs. The test set additionally includes the key points that a correct output image must satisfy.
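
To make the instance format concrete, here is a minimal sketch (in Python) of how a multi-turn WEAVE-style sample could be represented. The field names and file names are illustrative placeholders, not the released dataset's actual schema.

```python
# Illustrative sketch of a multi-turn WEAVE-style instance.
# All field names and file names are hypothetical, chosen for readability;
# they are not taken from the released dataset schema.
sample = {
    "chat_id": "weave-000123",
    "domain": "science",
    "initial_images": ["turn0_input.png"],       # one or more starting images
    "turns": [
        {
            "instruction": "Highlight the mitochondria in the cell diagram.",
            "output_image": "turn1_gt.png",      # ground-truth example for this turn
        },
        {
            "instruction": "Undo the highlight and label the nucleus instead.",
            "output_image": "turn2_gt.png",
        },
    ],
    # Test-set only: key points a correct final output image must satisfy.
    "key_points": [
        "The nucleus is labeled.",
        "The mitochondria highlight from the previous turn is removed.",
    ],
}
```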

Our evaluation framework adopts a VLM-as-judge approach with a key-point-based scoring system that focuses on four metrics:

  • Key Point Correctness (KP): Measures whether the edited image satisfies the specified editing requirements.
  • Visual Consistency (VC): Checks that non-target elements remain unchanged and consistent with the original image.
  • Image Quality (IQ): Evaluates the overall quality of the generated image.
  • Accuracy (Acc): Measures the correctness of the reasoning result for comprehension tasks.
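
As a rough illustration of how such a key-point-based, VLM-as-judge scoring pass could be wired up, the sketch below queries a judge model once per check and averages the key-point verdicts. The prompt wording, the stubbed `query_vlm_judge` helper, and the unweighted aggregation are assumptions made for illustration; they are not the benchmark's actual implementation.

```python
from statistics import mean

def query_vlm_judge(prompt: str, images: list[str]) -> float:
    """Placeholder for a call to a VLM judge (e.g., GPT-4.1): send the prompt
    plus the listed images and parse a score in [0, 1] from the reply.
    Returns a fixed value here so the sketch runs without an API key."""
    return 0.5

def score_edit_turn(original: str, edited: str, key_points: list[str]) -> dict:
    """Key-point-based scoring sketch for a single editing turn."""
    # Key Point Correctness (KP): check each required key point independently.
    kp = mean(
        query_vlm_judge(
            f"Does the edited image satisfy this requirement: '{point}'? "
            "Reply with a score between 0 and 1.",
            [original, edited],
        )
        for point in key_points
    )
    # Visual Consistency (VC): non-target elements should remain unchanged.
    vc = query_vlm_judge(
        "Ignoring the requested edit, how well does the edited image preserve "
        "everything else from the original? Score from 0 to 1.",
        [original, edited],
    )
    # Image Quality (IQ): overall quality of the generated image.
    iq = query_vlm_judge("Rate the overall visual quality from 0 to 1.", [edited])
    # Unweighted average; the benchmark's actual aggregation may differ.
    return {"KP": kp, "VC": vc, "IQ": iq, "avg": mean([kp, vc, iq])}

print(score_edit_turn("turn2.png", "turn3.png", ["The nucleus is labeled."]))
```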

Leaderboard

| Model | Size | Science | Creation | Logic | Game | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| AnyEdit | 1B | 0.445 | 0.514 | 0.351 | 0.419 | 0.472 |
| UltraEdit (SD3) | 2B | 0.493 | 0.561 | 0.491 | 0.440 | 0.522 |
| VAREdit-8B | 8B | 0.536 | 0.636 | 0.584 | 0.580 | 0.603 |
| Step1X-Edit v1.1 | 12B | 0.574 | 0.714 | 0.700 | 0.625 | 0.669 |
| Step1X-Edit v1.2 | 12B | 0.560 | 0.644 | 0.530 | 0.562 | 0.605 |
| FLUX.1 Kontext | 12B | 0.589 | 0.756 | 0.639 | 0.610 | 0.689 |
| Qwen-Image-Edit | 20B | 0.586 | 0.715 | 0.589 | 0.628 | 0.665 |
| OmniGen | 4B | 0.398 | 0.474 | 0.401 | 0.177 | 0.404 |
| OmniGen2 | 7B | 0.511 | 0.682 | 0.551 | 0.511 | 0.609 |
| Ovis-U1 | 3B | 0.402 | 0.557 | 0.364 | 0.357 | 0.422 |
| UniPic | 1.5B | 0.472 | 0.590 | 0.463 | 0.316 | 0.511 |
| UniPic2-SD3.5M | 2B | 0.477 | 0.625 | 0.543 | 0.497 | 0.568 |
| UniPic2-Metaquery | 9B | 0.493 | 0.666 | 0.507 | 0.444 | 0.582 |
| NextStep-1-Large | 15B | 0.519 | 0.620 | 0.437 | 0.309 | 0.534 |
| Seedream 4.0 | - | 0.683 | 0.847 | 0.679 | 0.635 | 0.765 |
| Seedream 4.0 | - | 0.667 | 0.830 | 0.646 | 0.599 | 0.746 |
| Nano Banana | - | 0.715 | 0.823 | 0.666 | 0.666 | 0.764 |
| Nano Banana | - | 0.710 | 0.843 | 0.730 | 0.613 | 0.767 |
| Bagel | 14B | 0.378 | 0.475 | 0.406 | 0.365 | 0.446 |
| Bagel-Zebra | 14B | 0.399 | 0.456 | 0.393 | 0.396 | 0.449 |

Dataset Details

Data Visualization

Samples from the WEAVE dataset can be explored by domain; each domain includes one or more example cases.

Data Collection

Figure 3: Data Annotation Pipeline for WEAVE. Our methodology ensures data diversity and quality through a multi-round image generation process, supplemented by two rounds of validation and refinement.

Findings and Insights

In-context image generation remains challenging. Among the models tested, the best-performing editing and UMM approaches achieved maximum average scores of only 0.68 and 0.767, respectively. Furthermore, we observed pronounced domain biases: performance on creative imagery consistently surpassed performance in the scientific and logical domains. This suggests substantial room for improvement in generative models' ability to integrate world knowledge.

In-context history usage matters

(a) For comprehension tasks, we observed significant performance improvements when models used in-context information compared to a baseline without historical context. The effect was particularly pronounced for QwenVL, which improved by a remarkable 163%, indicating that WEAVEBench successfully incorporates historical information into its evaluation.

(b) For generation tasks, increasing history length produced divergent effects across model types. Open-source models degraded progressively as more historical context was added; Qwen-Image-Edit, for example, dropped by 5.3% and 8.6% as the history grew. This suggests that open-source models, constrained by single-round editing capabilities, lose localization accuracy when processing expanded context and therefore fail to exploit in-context information. Conversely, proprietary models such as Nano Banana improved as history length increased, indicating successful use of contextual information.

(c) WEAVEBench provides high-quality reference images. Supplying WEAVEBench's ground-truth images as in-context examples improved performance across all models. Notably, Qwen-Image-Edit improved by 7.1%, potentially because its generative capabilities are inherently weaker than Nano Banana's.

Sequential input superiority. Sequential image input yields significant performance advantages over concatenated input. The effect is especially pronounced for Bagel, where concatenation causes a 10.3% performance drop. These findings highlight the potential of UMMs as effective editing models, especially since traditional editing models cannot directly take multiple images and historical context as input.
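
To make the two input modes concrete, the sketch below contrasts them under a generic chat-style multimodal interface. The message format, file names, and instruction are placeholders rather than any particular model's API: sequential mode passes each historical image as its own entry, while concatenated mode stitches the history into a single composite image.

```python
from PIL import Image

def concatenate_horizontally(images: list[Image.Image]) -> Image.Image:
    """Concatenated mode: stitch the visual history into one wide canvas,
    so the model receives a single composite image."""
    height = max(im.height for im in images)
    width = sum(im.width for im in images)
    canvas = Image.new("RGB", (width, height), "white")
    x = 0
    for im in images:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas

def build_sequential_request(image_paths: list[str], instruction: str) -> list[dict]:
    """Sequential mode: pass each historical image as its own entry,
    preserving turn order, followed by the text instruction."""
    content = [{"type": "image", "path": p} for p in image_paths]
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

if __name__ == "__main__":
    # Dummy three-turn history for illustration.
    history_images = [Image.new("RGB", (256, 256), c) for c in ("red", "green", "blue")]
    composite = concatenate_horizontally(history_images)        # concatenated mode
    request = build_sequential_request(
        ["turn1.png", "turn2.png", "turn3.png"],                 # sequential mode
        "Restore the background from the first turn while keeping the current subject.",
    )
    print(composite.size, len(request[0]["content"]))
```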

Ablation Studies
Figure 5: (a) Impact of different in-context modes on performance. (b) Reasoning performance using ground truth as in-context examples. (c) Performance variations when concatenating sequential images. (d) Evaluation reliability of the GPT-4.1 judge.

BibTeX