CS180 Project 5: Fun With Diffusion Models

Name: Shuai Victor Zhou

SID: 3035912996


Part A: The Power of Diffusion Models


Overview

In this part of the project, we use pretrained diffusion models to generate our own custom images.


Part 0: Setup

For this project, I used seed = 180.

Generated images for num_inference_steps = 5 and num_inference_steps = 20, for each of the prompts “an oil painting of a snowy mountain village,” “a man wearing a hat,” and “a rocket ship.”

Fewer inference steps gave us faster but less detailed results, whereas more inference steps gave us more detailed images at the cost of taking longer. With a higher number of inference steps, we see the following effects:

  1. Our oil painting of the snowy mountain village is much clearer and of higher quality: the trees are more detailed and coated with snow, and the houses are much more house-like.
  2. The man wearing the hat is much more distinguishably a person, with clear facial features and distinct facial hair. The hat is also very obvious, and cast shadows can be seen on the man.
  3. The rocket ship is now clearly depicted as what we’d think of when we hear “rocket ship.” The colors are more diverse, the fire behind it has a nice gradient, and it’s now obviously headed towards space.

Part 1: Sampling Loops

We use pretrained DeepFloyd denoisers to create new high-quality images. From a clean image x0, we can add noise to get noisy images xt; this process is repeated iteratively until we reach pure noise xT (at t = T). In the reverse direction, we can predict the noise at each time t and remove it from the noisy image to eventually recover the original image.


Part 1.1: Implementing the Forward Process

Given a clean image x0, we can calculate the noisy image at time t as

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1).
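As a concrete reference, here is a minimal sketch of this forward step in PyTorch; the helper and tensor names (forward, alphas_cumprod) are my own shorthand, not DeepFloyd’s API.

```python
import torch

def forward(x0, t, alphas_cumprod):
    # alphas_cumprod: precomputed 1-D tensor of alpha-bar values (assumed)
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps
    return xt, eps
```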
The Campanile at t = 0, 250, 500, and 750.

Part 1.2: Classical Denoising

With the generated noisy images, we can use Gaussian blur filtering to attempt to remove the noise.
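A minimal sketch of this classical baseline, with illustrative (not tuned) blur parameters:

```python
import torch
import torchvision.transforms.functional as TF

noisy = torch.rand(3, 64, 64)  # stand-in for a noisy image from Part 1.1
# Classical "denoising" is just a Gaussian blur of the noisy image.
blurred = TF.gaussian_blur(noisy, kernel_size=5, sigma=2.0)
```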

Noisy images at t = 250, 500, and 750, with their Gaussian-blurred counterparts.

Part 1.3: One-Step Denoising

Once again using the three noisy images from before, we use the UNet to estimate the noise from the images. We can then remove the noise from the noisy image to estimate the original image.
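Concretely, this just solves the Part 1.1 equation for x0 given the UNet’s noise estimate; a sketch with my own variable names:

```python
import torch

def one_step_denoise(xt, t, eps_hat, alphas_cumprod):
    # Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps for x0,
    # using the UNet's noise estimate eps_hat in place of eps.
    a_bar = alphas_cumprod[t]
    return (xt - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)
```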

For t = 250, 500, and 750: the original image, the noisy image, the estimated noise, and the resulting one-step estimate of the original.

Part 1.4: Iterative Denoising

Building on the previous part, we can denoise iteratively by running the step from part 1.3 multiple times. To step from time t to an earlier, less-noisy time t' < t, we use the formula

x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\, x_t + \nu_\sigma,

where

\alpha_t = \frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}, \qquad \beta_t = 1 - \alpha_t.
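A sketch of one such update step, reusing the alphas_cumprod tensor from Part 1.1 (the names, and the ν_σ variance term passed in as v_sigma, are my own shorthand):

```python
import torch

def denoise_step(xt, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
    # One iterative-denoising step from t to the earlier time t' < t,
    # following the formula above; x0_hat is the current clean estimate.
    a_bar_t = alphas_cumprod[t]
    a_bar_tp = alphas_cumprod[t_prime]
    alpha_t = a_bar_t / a_bar_tp
    beta_t = 1 - alpha_t
    return (torch.sqrt(a_bar_tp) * beta_t / (1 - a_bar_t)) * x0_hat \
        + (torch.sqrt(alpha_t) * (1 - a_bar_tp) / (1 - a_bar_t)) * xt \
        + v_sigma
```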
The denoising process at t = 690, 540, 390, 240, and 90, followed by the final iteratively denoised image compared against the one-step denoised and Gaussian-blurred results.

Part 1.5: Diffusion Model Sampling

Doing the same thing but starting from pure random noise instead of a noisy Campanile, we get the following 5 sampled images using the prompt “a high quality photo.”


Part 1.6: Classifier Free Guidance

The images from part 1.5 weren’t of the highest quality; to improve this, we use Classifier-Free Guidance (CFG). We compute two noise estimates, one conditioned on the text prompt (ε_c) and one unconditional (ε_u). Our combined noise estimate is then

\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u),

where γ controls the strength of the CFG. Using γ = 7, we generate the following 5 sampled images, again with the prompt “a high quality photo.”
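A sketch of the CFG combination; unet and the embedding arguments here are stand-ins for the real DeepFloyd call, which has a different signature:

```python
def cfg_noise_estimate(unet, xt, t, cond_emb, uncond_emb, gamma=7.0):
    eps_c = unet(xt, t, cond_emb)    # conditioned on the text prompt
    eps_u = unet(xt, t, uncond_emb)  # conditioned on the empty prompt
    return eps_u + gamma * (eps_c - eps_u)
```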


Part 1.7: Image-to-Image Translation

Once again returning to our image of the Campanile, we call our iterative_denoise_cfg function, this time with starting indices i_start = 1, 3, 5, 7, 10, and 20, as sketched below. Larger starting indices produce images that look closer and closer to the original Campanile.
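A hypothetical driver loop, assuming the forward helper from Part 1.1, a strided timestep schedule strided_timesteps, and the iterative_denoise_cfg function named above:

```python
# Noise the Campanile to each starting level, then denoise it back with CFG.
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    xt, _ = forward(campanile, strided_timesteps[i_start], alphas_cumprod)
    edits.append(iterative_denoise_cfg(xt, i_start=i_start))
```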

Edits at i_start = 1, 3, 5, 7, 10, and 20.

Part 1.7.1: Editing Hand-Drawn and Web Images

We can also perform the image-to-image translation with not just the Campanile, but also our own images.

A web image and two hand-drawn images, each followed by edits at i_start = 1, 3, 5, 7, 10, and 20.

Part 1.7.2: Inpainting

Using a binary mask m, we can generate images that keep the original content outside the mask while generating new content inside it. This is accomplished by running the diffusion denoising loop, but at each step forcing x_t to match the noised original wherever the mask is 0:

x_t \leftarrow m\, x_t + (1 - m)\, \text{forward}(x_{\text{orig}}, t)
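In code, the projection is a one-liner inside the denoising loop (a sketch; m, x_orig, and the forward helper are as defined earlier):

```python
# After each denoising update of xt, re-impose the original image
# outside the mask (m == 0) while letting the masked region evolve.
xt = m * xt + (1 - m) * forward(x_orig, t, alphas_cumprod)[0]
```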
Three examples, each showing the original image, the mask, the region to replace, and the inpainted result.

Part 1.7.3: Text-Conditional Image-to-Image Translation

Now, instead of using our prompt of “a high quality photo,” we can use our own custom prompts to generate image-to-image translations.

Using prompt “a rocket ship”:

Edits at noise levels 1, 3, 5, 7, 10, and 20, followed by the original image.

Using prompt “a lithograph of a skull”:

Edits at noise levels 1, 3, 5, 7, 10, and 20, followed by the original image.

Using prompt “a photo of a hipster barista”:

Edits at noise levels 1, 3, 5, 7, 10, and 20, followed by the original image.

Part 1.8: Visual Anagrams

Using similar methods as before, we can generate images that look like one thing when upright and another when upside down, by combining two different prompts. The first prompt denoises the image x_t, giving noise estimate ε_1; the second prompt denoises an upside-down version of x_t, giving ε_2. We then flip ε_2 back upright and take a weighted average of the two estimates as our final noise estimate.

\epsilon_1 = \text{UNet}(x_t, t, p_1) \\
\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\
\epsilon = w\, \epsilon_1 + (1 - w)\, \epsilon_2

Usually, we’d use weight w = 0.5, but tuning it per anagram gave better results; for the first anagram below, my best result came with w = 0.7.
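A sketch of the anagram noise estimate; unet(x, t, p) stands in for the prompt-conditioned (and CFG’d) DeepFloyd call, which is more involved in practice:

```python
import torch

def anagram_noise_estimate(unet, xt, t, p1, p2, w=0.5):
    # Denoise upright with prompt p1; denoise the vertically flipped image
    # with prompt p2, then flip that estimate back before averaging.
    eps1 = unet(xt, t, p1)
    eps2 = torch.flip(unet(torch.flip(xt, dims=[-2]), t, p2), dims=[-2])
    return w * eps1 + (1 - w) * eps2
```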

“an oil painting of people around a campfire” / “a photo of a dog” (w = 0.7)
“an oil painting of an old man” / “a man wearing a hat” (w = 0.78)
“a rocket ship” / “a photo of a man” (w = 0.637)

Part 1.9: Hybrid Images

To create images that seem like one thing up close and another from far away, we use low-pass and high-pass filters. Again, two noise estimates are created using separate prompts, but this time they’re combined like so:

\epsilon_1 = \text{UNet}(x_t, t, p_1) \\
\epsilon_2 = \text{UNet}(x_t, t, p_2) \\
\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2).

The low-pass filter is a Gaussian blur with kernel size 33 and sigma 2, and the high-pass filter is its complement (the estimate minus its blurred version).
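A sketch of the combination; the blur parameters match those stated above, and unet is again a stand-in for the prompt-conditioned model call:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, xt, t, p1, p2):
    eps1 = unet(xt, t, p1)  # "far away" prompt
    eps2 = unet(xt, t, p2)  # "close up" prompt
    low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)
    return low + high
```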

Far: “a lithograph of a skull”
Close: “a lithograph of waterfalls”

Far: “a photo of a dog”
Close: “an oil painting of people around a campfire”

Far: “a rocket ship”
Close: “a man wearing a hat”


Part B: Diffusion Models from Scratch


Overview

In this second part of the project, we train our own diffusion models on the MNIST dataset.


Part 1: Training a Single-Step Denoising UNet

We first build a one-step denoiser.


Part 1.1: Implementing the UNet

We implement the unconditional UNet, built from downsampling and upsampling blocks with skip connections:
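As one plausible minimal layout (a sketch only; the actual project spec prescribes specific Conv/DownConv/UpConv blocks that I elide here):

```python
import torch
import torch.nn as nn

class SimpleUNet(nn.Module):
    """Sketch of a one-level UNet for 28x28 MNIST images."""

    def __init__(self, in_ch=1, hidden=128):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.GELU())
        self.down2 = nn.Sequential(nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.GELU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.GELU())
        self.out = nn.Conv2d(2 * hidden, in_ch, 3, padding=1)  # project after skip-concat

    def forward(self, x):
        h1 = self.down1(x)  # full-resolution features
        h2 = self.down2(h1)  # downsample by 2
        u1 = self.up1(h2)    # upsample back to full resolution
        return self.out(torch.cat([h1, u1], dim=1))  # skip connection
```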


Part 1.2: Using the UNet to Train a Denoiser

In order to train our denoiser, we need to generate noisy images with

z = x + \sigma\epsilon, \quad \epsilon \sim \mathcal{N}(0, I).

For σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0], we get the following images in order.
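A sketch of how these noisy versions are generated (the batch x here is a random stand-in for MNIST digits in [0, 1]):

```python
import torch

x = torch.rand(4, 1, 28, 28)  # stand-in for a batch of MNIST digits
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
# z = x + sigma * eps for each noise level
noisy_versions = [x + s * torch.randn_like(x) for s in sigmas]
```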


Part 1.2.1: Training

We now train a denoiser to recover the clean image from images noised with σ = 0.5. We use batch size 256 and train over our dataset for 5 epochs, with hidden dimension 128 for our UNet and the Adam optimizer with learning rate 1e-4.
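A minimal training-loop sketch under these hyperparameters; denoiser and train_loader are assumed to be the UNet above and a batch-256 MNIST DataLoader:

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:
        z = x + 0.5 * torch.randn_like(x)  # noise with sigma = 0.5
        loss = F.mse_loss(denoiser(z), x)  # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```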

Below is our loss curve.

Training losses over steps

Below are the results on a test set after the first and fifth epochs.

Epoch 1: original, noisy, denoised
Epoch 5: original, noisy, denoised

Part 1.2.2: Out-of-Distribution Testing

When we test on out-of-distribution noise levels, we get the following results for σ = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0, alternating between the noisy input and the denoised output.