Review of “Compositional Visual Generation with Composable Diffusion Models”

date: May 24, 2023
slug: paper-review-5
author:
status: Public
tags: DL, Paper Review
summary:
type: Post
thumbnail:
updatedAt: May 24, 2023 04:09 AM

Key Takeaways

  • Points out that diffusion models struggle to understand the composition of certain concepts.
  • Proposes two composition operators (concept conjunction and concept negation) that work with any pre-trained text-guided diffusion model.

Concept Conjunction and Concept Negation

Surprisingly, rather than retraining the network, the authors modify only the sampling procedure. Compared with DDPM sampling (on the left), the only differences lie in steps 5 to 7 and 13 to 15, which correspond to concept conjunction and concept negation, respectively. In steps 5 and 13, the diffusion model generates outputs conditioned on the given concepts. Steps 6 and 14 produce the model's unconditional output. Finally, in steps 7 and 15, the signals are combined: the unconditional output serves as a foundation, and the difference between each conditional output and the unconditional output is added (conjunction) or subtracted (negation).
 
[Figure: sampling algorithms compared — standard DDPM (left) vs. composable diffusion sampling with concept conjunction and concept negation (right)]
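
Written out, my reading of the combination in steps 7 and 15 is the following. The notation here is mine, not lifted from the paper: $\epsilon_\theta$ is the model's noise prediction, $c_i$ are the concepts, and $w_i$ are per-concept weights; see the paper for the exact formulation.

```latex
% Concept conjunction: unconditional output as the foundation, plus the
% weighted gap between each conditional and the unconditional prediction.
\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t)
    + \sum_i w_i \,\bigl(\epsilon_\theta(x_t, t \mid c_i) - \epsilon_\theta(x_t, t)\bigr)

% Concept negation: subtract the gap for the concept to be removed.
\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t)
    - w \,\bigl(\epsilon_\theta(x_t, t \mid \tilde{c}) - \epsilon_\theta(x_t, t)\bigr)
```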
From a personal perspective, I think of each conditional output as guiding the image toward a different location or variation. However, these outputs must not conflict with one another and destabilize generation. This is where the unconditional output acts as a balancer, coordinating the various conditions and maintaining overall coherence in the generated output.
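
To make that balancing act concrete, here is a minimal PyTorch-style sketch of the combination step. `composed_eps`, `eps_model`, `cond_embs`, and `weights` are names I made up for illustration; this is not the authors' implementation.

```python
import torch

def composed_eps(eps_model, x_t, t, cond_embs, weights):
    """Compose noise predictions from multiple concepts at one sampling step.

    eps_model(x, t, cond) returns the predicted noise; cond=None gives the
    unconditional prediction. A positive weight implements conjunction (AND),
    a negative weight implements negation (NOT).
    """
    eps_uncond = eps_model(x_t, t, None)  # unconditional foundation
    eps_hat = eps_uncond.clone()
    for cond, w in zip(cond_embs, weights):
        eps_cond = eps_model(x_t, t, cond)  # concept-conditioned prediction
        # add (w > 0) or subtract (w < 0) the conditional/unconditional gap
        eps_hat = eps_hat + w * (eps_cond - eps_uncond)
    return eps_hat

# Toy usage with a stand-in model: one concept conjoined, one negated.
if __name__ == "__main__":
    def toy_eps(x, t, cond):
        return x * 0.1 + (0.0 if cond is None else float(cond))

    x_t = torch.randn(1, 3, 8, 8)
    eps_hat = composed_eps(toy_eps, x_t, t=10,
                           cond_embs=[1.0, 2.0], weights=[1.0, -0.5])
    print(eps_hat.shape)  # torch.Size([1, 3, 8, 8])
```

At each denoising step, `eps_hat` simply replaces the single conditional prediction in the usual DDPM update, which is why no retraining is needed.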

Reference

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. "Compositional Visual Generation with Composable Diffusion Models." ECCV 2022. arXiv:2206.01714.