Abstract

Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, a new benchmark suite for interleaved cross-modality comprehension and generation. Our benchmark consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-verified evaluation benchmark with 100 tasks based on 600 images, featuring an agent-based, expert-verified evaluation framework that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains such as science, creation, logic, and games. Experiments demonstrate that training on WEAVE-100k enables unified multimodal models to develop emergent visual-memory capabilities, while evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation. We believe WEAVE provides a foundation for studying long-context reasoning and editing in interleaved vision-language models.

Data Overview

Figure 1: Compared to previous works that focus on single-turn edits (a), WEAVE uniquely enables multi-turn editing with visual memory recall (b), allowing models to reference and restore previously edited elements across multiple turns.
| Statistic | Number |
| --- | --- |
| Total Chats | 100,750 |
| Chats with ≥4 Images | 100,584 |
| Chats with ≥5 Images | 60,361 |
| Chats with ≥6 Images | 31,571 |
| Average Turns per Chat | 3.79 |
| Average Question Length | 195.49 |
| Total Images | 505,186 |
| Maximum Images per Chat | 8 |
| Average Images per Chat | 5.01 |
Figure 2: Summary of domain distributions and evaluation methods for WEAVE.

For each instance in WEAVE, we provide a text prompt, one or more initial images, and ground-truth example outputs. The test set additionally includes the key points that a correct output image must satisfy.
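
To make the instance format concrete, here is a minimal sketch (in Python) of how a multi-turn WEAVE-style sample could be represented. The field names and file names are illustrative placeholders, not the released dataset's actual schema.

```python
# Illustrative sketch of a multi-turn WEAVE-style instance.
# All field names and file names are hypothetical, chosen for readability;
# they are not taken from the released dataset schema.
sample = {
    "chat_id": "weave-000123",
    "domain": "science",
    "initial_images": ["turn0_input.png"],       # one or more starting images
    "turns": [
        {
            "instruction": "Highlight the mitochondria in the cell diagram.",
            "output_image": "turn1_gt.png",      # ground-truth example for this turn
        },
        {
            "instruction": "Undo the highlight and label the nucleus instead.",
            "output_image": "turn2_gt.png",
        },
    ],
    # Test-set only: key points a correct final output image must satisfy.
    "key_points": [
        "The nucleus is labeled.",
        "The mitochondria highlight from the previous turn is removed.",
    ],
}
```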

Our evaluation framework adopts a VLM-as-judge approach with a key-point-based scoring system that focuses on four metrics:

  • Key Point Correctness (KP): Measures whether the edited image satisfies the specified editing requirements.
  • Visual Consistency (VC): Checks that non-target elements remain unchanged and consistent with the original image.
  • Image Quality (IQ): Evaluates the overall quality of the generated image.
  • Accuracy (Acc): Measures the correctness of the reasoning result for comprehension tasks.
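
As a rough illustration of how such a key-point-based, VLM-as-judge scoring pass could be wired up, the sketch below queries a judge model once per check and averages the key-point verdicts. The prompt wording, the stubbed `query_vlm_judge` helper, and the unweighted aggregation are assumptions made for illustration; they are not the benchmark's actual implementation.

```python
from statistics import mean

def query_vlm_judge(prompt: str, images: list[str]) -> float:
    """Placeholder for a call to a VLM judge (e.g., GPT-4.1): send the prompt
    plus the listed images and parse a score in [0, 1] from the reply.
    Returns a fixed value here so the sketch runs without an API key."""
    return 0.5

def score_edit_turn(original: str, edited: str, key_points: list[str]) -> dict:
    """Key-point-based scoring sketch for a single editing turn."""
    # Key Point Correctness (KP): check each required key point independently.
    kp = mean(
        query_vlm_judge(
            f"Does the edited image satisfy this requirement: '{point}'? "
            "Reply with a score between 0 and 1.",
            [original, edited],
        )
        for point in key_points
    )
    # Visual Consistency (VC): non-target elements should remain unchanged.
    vc = query_vlm_judge(
        "Ignoring the requested edit, how well does the edited image preserve "
        "everything else from the original? Score from 0 to 1.",
        [original, edited],
    )
    # Image Quality (IQ): overall quality of the generated image.
    iq = query_vlm_judge("Rate the overall visual quality from 0 to 1.", [edited])
    # Unweighted average; the benchmark's actual aggregation may differ.
    return {"KP": kp, "VC": vc, "IQ": iq, "avg": mean([kp, vc, iq])}

print(score_edit_turn("turn2.png", "turn3.png", ["The nucleus is labeled."]))
```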

Leaderboard

| Model | Size | Science | Creation | Logic | Game | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| AnyEdit | 1B | 0.445 | 0.514 | 0.351 | 0.419 | 0.472 |
| UltraEdit (SD3) | 2B | 0.493 | 0.561 | 0.491 | 0.440 | 0.522 |
| VAREdit-8B | 8B | 0.536 | 0.636 | 0.584 | 0.580 | 0.603 |
| Step1X-Edit v1.1 | 12B | 0.574 | 0.714 | 0.700 | 0.625 | 0.669 |
| Step1X-Edit v1.2 | 12B | 0.560 | 0.644 | 0.530 | 0.562 | 0.605 |
| FLUX.1 Kontext | 12B | 0.589 | 0.756 | 0.639 | 0.610 | 0.689 |
| Qwen-Image-Edit | 20B | 0.586 | 0.715 | 0.589 | 0.628 | 0.665 |
| OmniGen | 4B | 0.398 | 0.474 | 0.401 | 0.177 | 0.404 |
| OmniGen2 | 7B | 0.511 | 0.682 | 0.551 | 0.511 | 0.609 |
| Ovis-U1 | 3B | 0.402 | 0.557 | 0.364 | 0.357 | 0.422 |
| UniPic | 1.5B | 0.472 | 0.590 | 0.463 | 0.316 | 0.511 |
| UniPic2-SD3.5M | 2B | 0.477 | 0.625 | 0.543 | 0.497 | 0.568 |
| UniPic2-Metaquery | 9B | 0.493 | 0.666 | 0.507 | 0.444 | 0.582 |
| NextStep-1-Large | 15B | 0.519 | 0.620 | 0.437 | 0.309 | 0.534 |
| Seedream 4.0 | - | 0.683 | 0.847 | 0.679 | 0.635 | 0.765 |
| Seedream 4.0 | - | 0.667 | 0.830 | 0.646 | 0.599 | 0.746 |
| Nano Banana | - | 0.715 | 0.823 | 0.666 | 0.666 | 0.764 |
| Nano Banana | - | 0.710 | 0.843 | 0.730 | 0.613 | 0.767 |
| Bagel | 14B | 0.378 | 0.475 | 0.406 | 0.365 | 0.446 |
| Bagel-Zebra | 14B | 0.399 | 0.456 | 0.393 | 0.396 | 0.449 |

Dataset Details

Data Visualization

Samples from the WEAVE dataset can be explored by domain; each domain includes one or more example cases.

Data Collection

Figure 3: Data Annotation Pipeline for WEAVE. Our methodology ensures data diversity and quality through a multi-round image generation process, supplemented by two rounds of validation and refinement.

Findings and Insights

In-context image generation remains challenging. Among the models tested, the best-performing editing and UMM approaches achieved maximum average scores of only 0.68 and 0.767, respectively. Furthermore, we observed pronounced domain biases: performance on creative imagery consistently surpassed performance in the scientific and logical domains. This suggests substantial room for improvement in generative models' ability to integrate world knowledge.

In-context history usage matters

(a) For comprehension tasks, we observed significant performance improvements when models used in-context information compared to a baseline without historical context. The effect was particularly pronounced for QwenVL, which improved by a remarkable 163%, indicating that WEAVEBench successfully incorporates historical information into its evaluation.

(b) For generation tasks, increasing history length produced divergent effects across model types. Open-source models degraded progressively as more historical context was added; Qwen-Image-Edit, for example, dropped by 5.3% and 8.6% as the history grew. This suggests that open-source models, constrained by single-round editing capabilities, lose localization accuracy when processing expanded context and therefore fail to exploit in-context information. Conversely, proprietary models such as Nano Banana improved as history length increased, indicating successful use of contextual information.

(c) WEAVEBench provides high-quality reference images. Supplying WEAVEBench's ground-truth images as in-context examples improved performance across all models. Notably, Qwen-Image-Edit improved by 7.1%, potentially because its generative capabilities are inherently weaker than Nano Banana's.

Sequential input superiority. Sequential image input yields significant performance advantages over concatenated input. The effect is especially pronounced for Bagel, where concatenation causes a 10.3% performance drop. These findings highlight the potential of UMMs as effective editing models, especially since traditional editing models cannot directly take multiple images and historical context as input.
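
To make the two input modes concrete, the sketch below contrasts them under a generic chat-style multimodal interface. The message format, file names, and instruction are placeholders rather than any particular model's API: sequential mode passes each historical image as its own entry, while concatenated mode stitches the history into a single composite image.

```python
from PIL import Image

def concatenate_horizontally(images: list[Image.Image]) -> Image.Image:
    """Concatenated mode: stitch the visual history into one wide canvas,
    so the model receives a single composite image."""
    height = max(im.height for im in images)
    width = sum(im.width for im in images)
    canvas = Image.new("RGB", (width, height), "white")
    x = 0
    for im in images:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas

def build_sequential_request(image_paths: list[str], instruction: str) -> list[dict]:
    """Sequential mode: pass each historical image as its own entry,
    preserving turn order, followed by the text instruction."""
    content = [{"type": "image", "path": p} for p in image_paths]
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

if __name__ == "__main__":
    # Dummy three-turn history for illustration.
    history_images = [Image.new("RGB", (256, 256), c) for c in ("red", "green", "blue")]
    composite = concatenate_horizontally(history_images)        # concatenated mode
    request = build_sequential_request(
        ["turn1.png", "turn2.png", "turn3.png"],                 # sequential mode
        "Restore the background from the first turn while keeping the current subject.",
    )
    print(composite.size, len(request[0]["content"]))
```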

Ablation Studies
Figure 5: (a) Impact of different in-context modes on performance. (b) Reasoning performance using ground truth as in-context examples. (c) Performance variations when concatenating sequential images. (d) Evaluation reliability of the GPT-4.1 judge.

BibTeX