Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, a new benchmark suite for interleaved cross-modality comprehension and generation. Our benchmark consists of two complementary parts. WEAVE-100k is a large-scale dataset of 102K interleaved samples spanning over 600K dialogue turns and 700K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-verified evaluation benchmark with 100 tasks based on 600 images, featuring an agent-based and expert-verified evaluation framework that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains such as science, creation, logic, and games. Experiments demonstrate that training on WEAVE-100k enables unified multimodal models to develop emergent visual-memory capabilities, while evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation. We believe WEAVE provides a foundation for studying long-context reasoning and editing in interleaved vision-language models.
| Statistic | Number |
|---|---|
| • Total Chats | 100,750 |
| - Chats with ≥4 Images | 100,584 |
| - Chats with ≥5 Images | 60,361 |
| - Chats with ≥6 Images | 31,571 |
| • Average Chat Turns | 3.79 |
| - Average Question Length | 195.49 |
| • Total Images | 505,186 |
| - Maximum Images Per Chat | 8 |
| - Average Images Per Chat | 5.01 |
For each instance in WEAVE, we provide a text prompt, one or more initial images, and ground truth examples. The test set additionally includes key information that the correct output images must satisfy.
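To make this concrete, a single WEAVE instance could be represented with a schema along the following lines; the field names (`prompt`, `input_images`, `ground_truth`, `key_points`) and the example values are illustrative assumptions on our part, not the dataset's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WeaveInstance:
    """Illustrative schema for one WEAVE sample (field names are hypothetical)."""
    prompt: str                       # text instruction for the current turn
    input_images: List[str]           # path(s) to the initial image(s)
    ground_truth: List[str]           # reference output image(s)
    key_points: List[str] = field(default_factory=list)  # test set only: facts the output must satisfy

# A made-up example, purely for illustration:
sample = WeaveInstance(
    prompt="Add a red bridge over the river created in the previous turn.",
    input_images=["turn1_output.png"],
    ground_truth=["turn2_reference.png"],
    key_points=["a red bridge spans the river", "the river from turn 1 is unchanged"],
)
```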
Our evaluation framework adopts a VLM-as-judge approach with a key-point-based scoring system built around four metrics.
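As a rough sketch of how such key-point-based judging could work, a judge VLM can be asked whether each required key point holds in the generated image, with the fraction of satisfied points used as the score. The `judge_vlm` callable and its prompt below are hypothetical; WEAVEBench's actual judge prompts and metric aggregation may differ.

```python
from typing import List

def judge_key_points(judge_vlm, generated_image: str, key_points: List[str]) -> float:
    """Score one output by asking a judge VLM about each required key point.

    `judge_vlm` is assumed to be a callable that takes an image path and a text
    question and returns a short answer string (a hypothetical interface).
    Returns the fraction of key points the judge says are satisfied.
    """
    if not key_points:
        return 0.0
    hits = 0
    for point in key_points:
        question = (
            "Answer strictly 'yes' or 'no': does the image satisfy the "
            f"following requirement? {point}"
        )
        answer = judge_vlm(generated_image, question)
        hits += answer.strip().lower().startswith("yes")
    return hits / len(key_points)
```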
| Model | Size | In-context | Modality | Science | Creation | Logic | Game | Avg ▼ |
|---|---|---|---|---|---|---|---|---|
| AnyEdit | 1B | | | 0.445 | 0.514 | 0.351 | 0.419 | 0.472 |
| UltraEdit (SD3) | 2B | | | 0.493 | 0.561 | 0.491 | 0.440 | 0.522 |
| VAREdit-8B | 8B | | | 0.536 | 0.636 | 0.584 | 0.580 | 0.603 |
| Step1X-Edit v1.1 | 12B | | | 0.574 | 0.714 | 0.700 | 0.625 | 0.669 |
| Step1X-Edit v1.2 | 12B | | | 0.560 | 0.644 | 0.530 | 0.562 | 0.605 |
| FLUX.1 Kontext | 12B | | | 0.589 | 0.756 | 0.639 | 0.610 | 0.689 |
| Qwen-Image-Edit | 20B | | | 0.586 | 0.715 | 0.589 | 0.628 | 0.665 |
| OmniGen | 4B | | | 0.398 | 0.474 | 0.401 | 0.177 | 0.404 |
| OmniGen2 | 7B | | | 0.511 | 0.682 | 0.551 | 0.511 | 0.609 |
| Ovis-U1 | 3B | | | 0.402 | 0.557 | 0.364 | 0.357 | 0.422 |
| UniPic | 1.5B | | | 0.472 | 0.590 | 0.463 | 0.316 | 0.511 |
| UniPic2-SD3.5M | 2B | | | 0.477 | 0.625 | 0.543 | 0.497 | 0.568 |
| UniPic2-Metaquery | 9B | | | 0.493 | 0.666 | 0.507 | 0.444 | 0.582 |
| NextStep-1-Large | 15B | | | 0.519 | 0.620 | 0.437 | 0.309 | 0.534 |
| Seedream 4.0 | - | | | 0.683 | 0.847 | 0.679 | 0.635 | 0.765 |
| Seedream 4.0 | - | | | 0.667 | 0.830 | 0.646 | 0.599 | 0.746 |
| Nano Banana | - | | | 0.715 | 0.823 | 0.666 | 0.666 | 0.764 |
| Nano Banana | - | | | 0.710 | 0.843 | 0.730 | 0.613 | 0.767 |
| Bagel | 14B | | | 0.378 | 0.475 | 0.406 | 0.365 | 0.446 |
| Bagel-Zebra | 14B | | | 0.399 | 0.456 | 0.393 | 0.396 | 0.449 |
In-context image generation remains challenging. Among the models tested, the best editing model and the best unified multimodal model reach scores of only 0.68 and 0.767, respectively. We also observe pronounced domain biases: performance on creative imagery consistently surpasses performance in the scientific and logic domains, suggesting substantial room for improvement in how generative models integrate world knowledge.
In-context history usage matters.
(a) For comprehension tasks, we observed significant performance improvements when utilizing in-context information compared to baseline conditions without historical context. This effect was particularly pronounced in QwenVL, which demonstrated a remarkable 163% improvement, indicating that WEAVEBench successfully incorporated historical information into the model evaluation.
(b) For generation tasks, increasing the history length produced divergent effects across model types (see the sketch after this list for how such history-length conditions can be constructed). Open-source models degraded progressively as more historical context was added; Qwen-Image-Edit, for example, dropped by 5.3% and then 8.6% as the history grew. This suggests that open-source models, constrained by single-round editing capabilities, lose localization accuracy when processing expanded context and thus fail to effectively use in-context information. Conversely, proprietary models such as Nano Banana improved with longer history, indicating successful use of contextual information.
(c) WEAVEBench provides high-quality ground-truth images. Supplying WEAVEBench's ground-truth images as in-context examples improved performance across all models. Qwen-Image-Edit benefited most, gaining 7.1%, which is potentially attributable to Qwen's weaker generative capabilities relative to Nano Banana.
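To illustrate the history-length conditions referenced in (b), here is a minimal sketch of how truncated contexts could be built from a multi-turn chat; the turn structure and the commented `run_edit_model` call are hypothetical, not WEAVEBench's actual harness.

```python
from typing import Dict, List

def build_context(chat: List[Dict], history_turns: int) -> Dict:
    """Keep only the last `history_turns` completed turns before the current request.

    Each turn is assumed to look like {"prompt": str, "image": str}; the final
    element of `chat` is the current editing request.
    """
    *history, current = chat
    kept = history[-history_turns:] if history_turns > 0 else []
    return {
        "history_prompts": [turn["prompt"] for turn in kept],
        "history_images": [turn["image"] for turn in kept],
        "prompt": current["prompt"],
    }

# Evaluate the same request under growing amounts of context, e.g. 0, 1, or 2 prior turns:
# for k in (0, 1, 2):
#     ctx = build_context(chat, history_turns=k)
#     output = run_edit_model(**ctx)   # hypothetical model call
```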
Sequential input is superior. Sequential image input shows significant performance advantages over concatenated input; the effect is particularly pronounced for Bagel, where concatenation degrades performance by 10.3%. These findings highlight the potential of UMMs as effective editing models, especially since traditional editing models cannot directly take multiple images and historical information as input.
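For reference, the two input regimes compared above could be realized roughly as follows, assuming "concatenated input" means stitching the history into a single composite image; the `model.generate` call is a hypothetical UMM interface, not a specific model's API.

```python
from PIL import Image

def concat_horizontally(paths):
    """Stitch several images side by side into one canvas (concatenated input)."""
    images = [Image.open(p).convert("RGB") for p in paths]
    height = max(im.height for im in images)
    canvas = Image.new("RGB", (sum(im.width for im in images), height))
    x = 0
    for im in images:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas

history = ["turn1.png", "turn2.png", "turn3.png"]
prompt = "Continue editing: apply the requested change to the latest image."

# Sequential input: the model receives the images as separate, ordered inputs.
# out_seq = model.generate(prompt, images=[Image.open(p) for p in history])   # hypothetical API

# Concatenated input: the history is flattened into a single composite image.
# out_cat = model.generate(prompt, images=[concat_horizontally(history)])     # hypothetical API
```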