Limitation of the Diffusion Model - Research Record

Date: Sep 19, 2023
Slug: Research_record_1
Status: Public
Tags: DL
Type: Post
Thumbnail: https://prod-files-secure.s3.us-west-2.amazonaws.com/f7245a49-869d-4bb9-bbd4-6b828ef68397/7f98f068-b056-4137-92b2-aa5c7f8342dd/Untitled.png
Updated: Oct 5, 2023 12:49 AM

Introduction

In recent years, text-to-image generation models such as Stable Diffusion, DALL·E, and Midjourney have amazed the world with their exceptional quality. However, they remain difficult to use in everyday business because of their limited controllability. In particular, users find it hard to make a generative model follow specific instructions, especially instructions that control where objects appear in the image. This raises the question I am interested in:
Can the model generate complex images ("multiple objects") based on region-based instructions?
To answer this question, I experimented with diffusion models and made the following observations.

Stable Diffusion Generates Multiple Objects

The images in this figure were generated using this weight.
As shown in the two figures on the right, Stable Diffusion can sometimes generate perfect images that correctly contain two objects, but it also fails in other cases. The likely reason is the training data: during training, scenes containing both a cat and a dog are common, whereas a dog appearing together with a canary is rare. Faced with such an unusual combination, the model struggles to keep the two concepts separate and ends up blending features from different objects into one. This phenomenon is not limited to Stable Diffusion; it also appears in DALL·E and even Midjourney.
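For reference, this kind of two-object experiment can be reproduced with a few lines of the diffusers library. This is only a minimal sketch: the checkpoint name, prompts, seed, and step count below are my own illustrative choices, not the exact settings behind the figure above.

```python
# Minimal sketch: sampling two-object prompts with Stable Diffusion via diffusers.
# Checkpoint, prompts, and seed are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A "common" object pair vs. an "uncommon" one, to compare how often concepts blend.
prompts = ["a photo of a cat and a dog", "a photo of a dog and a canary"]
generator = torch.Generator("cuda").manual_seed(0)  # fix the seed for repeatability

for prompt in prompts:
    image = pipe(prompt, generator=generator, num_inference_steps=50).images[0]
    image.save(prompt.replace(" ", "_") + ".png")
```

Running the uncommon pair with several seeds makes the feature-blending failure easy to spot.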

Stable Diffusion Inpainting Model

Initially, I thought the inpainting model had the potential to build complex images iteratively: select different parts of the image, generate a different object in each, and gradually compose a complex scene. However, it did not work as expected. As the figure below shows, different choices of inpainting region can lead to completely different outputs. In particular, the third row shows that even though the top-left corner was selected for inpainting a cat, no cat appears in the output image.
The Stable Diffusion inpainting model responds differently when only the region is changed. (All results are based on this weight.)
If you want to try my interface locally, follow this link: https://github.com/Mao718/dash-vusualization/tree/main/Inpainting_stable_diffusion
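The experiment above can also be reproduced with the standard diffusers inpainting pipeline. The sketch below is a minimal example under my own assumptions: the checkpoint name, input image, and the top-left-quadrant mask are placeholders, not the exact settings used for the figure.

```python
# Minimal sketch of region-conditioned inpainting with diffusers.
# Checkpoint name, input image, and mask region are illustrative assumptions.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("scene.png").convert("RGB").resize((512, 512))

# White pixels mark the region to repaint: here the top-left quadrant.
mask = Image.new("L", (512, 512), 0)
ImageDraw.Draw(mask).rectangle([0, 0, 256, 256], fill=255)

result = pipe(
    prompt="a cat",
    image=init_image,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
result.save("inpainted.png")
```

Moving the white rectangle to a different quadrant while keeping the prompt fixed is enough to see the region-dependent behavior described above.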
 
The reason lies in how the inpainting model is fine-tuned: part of the image is randomly erased, and the model is trained to recover the missing area conditioned on the corresponding image caption. Because the caption describes the whole image rather than the erased region, there can be a semantic misalignment between the missing area (local content) and the global text description, and the model often fills the masked region with background instead of precisely following the text prompt. In other words, the inpainting model is more inclined to restore the image than to create new objects from text guidance. SmartBrush [1] also demonstrates this phenomenon and refers to it as "text misalignment".
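To make this training setup concrete, here is a rough simplification (my own sketch, not the actual Stable Diffusion training code) of how an inpainting training example is assembled: a random region is erased, and the model is asked to reconstruct it conditioned only on the global caption, which may or may not describe the erased content.

```python
# Rough simplification of inpainting fine-tuning data: a random region is
# erased and the model must reconstruct it from the global caption alone.
import torch

def make_inpainting_example(image: torch.Tensor, caption: str):
    """image: (3, H, W) tensor in [0, 1]; caption describes the whole scene."""
    _, h, w = image.shape

    # Sample a random rectangular hole (real training uses more varied mask shapes).
    mh = torch.randint(h // 4, h // 2, (1,)).item()
    mw = torch.randint(w // 4, w // 2, (1,)).item()
    top = torch.randint(0, h - mh, (1,)).item()
    left = torch.randint(0, w - mw, (1,)).item()

    mask = torch.zeros(1, h, w)
    mask[:, top:top + mh, left:left + mw] = 1.0   # 1 = erased region
    masked_image = image * (1.0 - mask)           # hole filled with zeros

    # The model sees (masked_image, mask, caption) and learns to denoise the
    # full image. Since the caption is global, nothing ties the text to the
    # erased region, which is the "text misalignment" described above.
    return masked_image, mask, caption
```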

Upcoming

Several branches of research are addressing this problem. I will update this post after I organize the relevant papers.

Reference

[1] SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model (Xie et al., CVPR 2023)
[2] High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., CVPR 2022)