R3CD: Scene Graph to Image Generation with Relation-aware Compositional Contrastive Control Diffusion

South China University of Technology

Abstract


Image generation tasks have achieved remarkable performance with large-scale diffusion models. However, these models are limited in capturing the abstract relations (i.e., interactions other than positional relations) among the multiple entities of complex scene graphs. Two main problems exist: (1) they fail to depict the concise and accurate interactions conveyed by abstract relations; (2) they fail to generate complete entities. To address this, we propose a novel Relation-aware Compositional Contrastive Control Diffusion method, dubbed R3CD, that leverages large-scale diffusion models to learn abstract interactions from scene graphs. A scene graph transformer based on node and edge encoding is first designed to perceive both local and global information from the input scene graph, whose embeddings are initialized by a T5 model. A joint contrastive loss over attention maps and denoising steps is then developed to control the diffusion model so that the generated images have spatial structures and interaction features consistent with the prior relations. Extensive experiments on two datasets, Visual Genome and COCO-Stuff, demonstrate that the proposed method outperforms existing models in both quantitative and qualitative metrics, generating more realistic and diverse images under different scene graph specifications.
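To make the embedding initialization mentioned above more concrete, the following is a minimal sketch (not the authors' code) of how scene-graph node and edge labels could be embedded with a pretrained T5 encoder; the checkpoint name, the mean-pooling step, and the example labels are assumptions.

```python
# Hedged sketch: T5-initialized node/edge embeddings for a scene graph.
# "t5-base" and mean pooling are assumptions, not the paper's exact setup.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

@torch.no_grad()
def embed_labels(texts):
    """Mean-pool T5 encoder states to get one vector per node/edge label."""
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, L, D)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, D)

# Entities become node embeddings; abstract relations become edge embeddings.
node_emb = embed_labels(["man", "horse"])
edge_emb = embed_labels(["riding"])
```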

Results


Visualization results of relation features.

The attention maps show the similarity between image pixels and scene graph nodes. Our method assigns higher attention weights to regions that correspond to the key interaction features of the relations, such as handshakes, hugs, and riding poses. It also generates more realistic and detailed images that reflect the abstract relations, while SGDiff fails to capture the interaction features and generates isolated entities.

Visual examples of graph-to-image generation in complex scenes.

The figure shows the attention maps and generated images from scene graphs with multiple entities and relations. Our method captures the semantic and spatial information of the scene graphs better than SGDiff, such as the relative positions, orientations, colors, and shapes of different entities. It also generates more realistic and detailed images that respect the scene graph specifications, while SGDiff fails to generate complete entities and produces blurry images.

Method

The whole pipeline of R3CD


The whole pipeline of R3CD: node and edge embeddings are encoded by the proposed SGFormer and then fed to the compositional generation module under the guidance of the relation-aware contrastive control loss.

The figure illustrates how R3CD generates an image from a scene graph in two stages. First, the input scene graph is encoded by SGFormer, which uses a T5 model to initialize the node and edge embeddings and then applies a graph attention layer and a graph update layer to refine them with both local and global information. Second, the scene graph embeddings are fed to the compositional generation module, which uses a denoising UNet to generate and fuse each component of the image. The relation-aware contrastive control loss aligns the abstract relation features of the scene graph with the attention maps and diffusion steps of the image generation process. The attention-map contrastive loss minimizes the cosine similarity between attention maps that correspond to different relations and maximizes the similarity between those that correspond to the same relation. The diffusion-step contrastive loss minimizes the energy function between noise distributions that correspond to different relations and maximizes it between those that correspond to the same relation. The final output is a realistic and diverse image that respects the scene graph specification.
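As a rough illustration of the attention-map branch of the contrastive control described above, the sketch below implements an InfoNCE-style loss over per-relation cross-attention maps. The map extraction, the temperature, and the batching scheme (same relation label = positive pair) are assumptions, not the authors' implementation.

```python
# Hedged sketch of an attention-map contrastive loss over relation edges.
import torch
import torch.nn.functional as F

def attn_map_contrastive_loss(attn_maps, relation_ids, tau=0.1):
    """attn_maps: (N, H*W) flattened cross-attention maps, one per relation edge.
    relation_ids: (N,) integer relation labels; same label = positive pair."""
    a = F.normalize(attn_maps, dim=-1)
    sim = a @ a.t() / tau                                    # pairwise cosine similarity
    eye = torch.eye(len(a), dtype=torch.bool, device=a.device)
    pos = (relation_ids.unsqueeze(0) == relation_ids.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))             # ignore self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Pull maps of the same relation together, push different relations apart.
    per_row = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return per_row.mean()
```

In this sketch the positive pairs would come from attention maps of the same relation appearing across images in a training batch; rows without a positive simply contribute zero loss.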

The architecture of SGFormer


As shown in the figure above, SGFormer comprises two components: (1) a graph attention layer, which computes attention scores between node and edge features and aggregates information from neighboring nodes and edges; and (2) a graph update layer, which updates the node and edge features based on the aggregated information.
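The sketch below gives one plausible PyTorch rendering of such a block, attending over the concatenated node and edge tokens and then updating them; the hidden size, head count, residual MLP update, and the absence of an adjacency mask (which a faithful implementation would likely add to respect graph connectivity) are all assumptions.

```python
# Illustrative sketch (not the official SGFormer) of one attention + update block.
import torch
import torch.nn as nn

class SGFormerBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # (1) Graph attention layer: attention scores between node and edge features.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        # (2) Graph update layer: refine each node/edge with the aggregated information.
        self.update = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                    nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, node_emb, edge_emb):
        # Concatenate nodes and edges into one token sequence so each token can
        # aggregate local (neighboring) and global (whole-graph) information.
        x = torch.cat([node_emb, edge_emb], dim=1)           # (B, N+E, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.update(self.norm2(x))
        n = node_emb.shape[1]
        return x[:, :n], x[:, n:]                            # updated nodes, edges
```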

Conclusion

In this paper, we have proposed R3CD, a novel framework for image generation from scene graphs that leverages large-scale diffusion models and contrastive control mechanisms to capture the interactions between entity regions and abstract relations in the scene graph. Our method consists of two main components: (1) SGFormer, a transformer-based node and edge encoding scheme that captures both local and global information from scene graphs; and (2) relation-aware diffusion contrastive control, a contrastive learning module that aligns abstract relation features and image features across different levels of abstraction and encourages the model to generate images that reflect the abstract relations. We have conducted extensive experiments on two datasets, Visual Genome and COCO-Stuff, and demonstrated that our method outperforms existing methods in both quantitative and qualitative metrics. We have also shown that our method generates more realistic and diverse images that respect the scene graph specifications, especially for abstract relations that are hard to express with entity stitching.