CS180 Project 5: Fun With Diffusion Models

Name: Shuai Victor Zhou

SID: 3035912996


Part A: The Power of Diffusion Models


Overview

In this part of the project, we use pretrained diffusion models to generate our own custom images.


Part 0: Setup

For this project, I used seed = 180.

Generated images for num_inference_steps = 5 and num_inference_steps = 20, for each of the prompts “an oil painting of a snowy mountain village,” “a man wearing a hat,” and “a rocket ship.”

Fewer inference steps gave us faster but less detailed results, whereas more inference steps gave us more detailed images at the cost of taking longer. With a higher number of inference steps, we see the following effects:

  1. Our oil painting of the snowy mountain village is much clearer and of higher quality: the trees are more detailed and coated with snow, and the houses are much more house-like.
  2. The man wearing the hat is much more distinguishably a person, with clear facial features and distinct facial hair. The hat is also very obvious, and cast shadows can be seen on the man.
  3. The rocket ship is now clearly depicted as what we’d think of when we hear “rocket ship.” The colors are more diverse, the fire behind it has a nice gradient, and it’s now obviously headed towards space.

Part 1: Sampling Loops

We use pretrained DeepFloyd denoisers to create new high-quality images. From a clean image x0, we can add noise to get noisy images xt; this process is repeated iteratively until we reach pure noise xT (at t = T). In the reverse direction, we can predict the noise at each time t and remove it from the noisy image to eventually recover the original image.


Part 1.1: Implementing the Forward Process

Given a clean image x0, we can calculate the noisy image at time t as

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1).
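As a concrete reference, here is a minimal sketch of this forward step in PyTorch; the helper and tensor names (forward, alphas_cumprod) are my own shorthand, not DeepFloyd’s API.

```python
import torch

def forward(x0, t, alphas_cumprod):
    # alphas_cumprod: precomputed 1-D tensor of alpha-bar values (assumed)
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps
    return xt, eps
```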
The Campanile at t = 0, 250, 500, and 750.

Part 1.2: Classical Denoising

With the generated noisy images, we can use Gaussian blur filtering to attempt to remove the noise.
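A minimal sketch of this classical baseline, with illustrative (not tuned) blur parameters:

```python
import torch
import torchvision.transforms.functional as TF

noisy = torch.rand(3, 64, 64)  # stand-in for a noisy image from Part 1.1
# Classical "denoising" is just a Gaussian blur of the noisy image.
blurred = TF.gaussian_blur(noisy, kernel_size=5, sigma=2.0)
```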

Noisy images at t = 250, 500, and 750, with their Gaussian-blurred counterparts.

Part 1.3: One-Step Denoising

Once again using the three noisy images from before, we use the UNet to estimate the noise from the images. We can then remove the noise from the noisy image to estimate the original image.
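Concretely, this just solves the Part 1.1 equation for x0 given the UNet’s noise estimate; a sketch with my own variable names:

```python
import torch

def one_step_denoise(xt, t, eps_hat, alphas_cumprod):
    # Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps for x0,
    # using the UNet's noise estimate eps_hat in place of eps.
    a_bar = alphas_cumprod[t]
    return (xt - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)
```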

For t = 250, 500, and 750: the original image, the noisy image, the estimated noise, and the resulting one-step estimate of the original.

Part 1.4: Iterative Denoising

Building on the previous part, we can denoise iteratively by running the step from part 1.3 multiple times. To step from time t to an earlier, less-noisy time t' < t, we use the formula

x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\, x_t + \nu_\sigma,

where

\alpha_t = \frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}, \qquad \beta_t = 1 - \alpha_t.
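A sketch of one such update step, reusing the alphas_cumprod tensor from Part 1.1 (the names, and the ν_σ variance term passed in as v_sigma, are my own shorthand):

```python
import torch

def denoise_step(xt, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
    # One iterative-denoising step from t to the earlier time t' < t,
    # following the formula above; x0_hat is the current clean estimate.
    a_bar_t = alphas_cumprod[t]
    a_bar_tp = alphas_cumprod[t_prime]
    alpha_t = a_bar_t / a_bar_tp
    beta_t = 1 - alpha_t
    return (torch.sqrt(a_bar_tp) * beta_t / (1 - a_bar_t)) * x0_hat \
        + (torch.sqrt(alpha_t) * (1 - a_bar_tp) / (1 - a_bar_t)) * xt \
        + v_sigma
```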
The denoising process at t = 690, 540, 390, 240, and 90, followed by the final iteratively denoised image compared against the one-step denoised and Gaussian-blurred results.

Part 1.5: Diffusion Model Sampling

Doing the same thing but starting from pure random noise instead of a noisy Campanile, we get the following 5 sampled images using the prompt “a high quality photo.”


Part 1.6: Classifier Free Guidance

The images from part 1.5 weren’t of the highest quality; to improve this, we use Classifier-Free Guidance (CFG). We compute two noise estimates, one conditioned on the text prompt (ε_c) and one unconditional (ε_u). Our combined noise estimate is then

\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u),

where γ controls the strength of the CFG. Using γ = 7, we generate the following 5 sampled images, again with the prompt “a high quality photo.”
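A sketch of the CFG combination; unet and the embedding arguments here are stand-ins for the real DeepFloyd call, which has a different signature:

```python
def cfg_noise_estimate(unet, xt, t, cond_emb, uncond_emb, gamma=7.0):
    eps_c = unet(xt, t, cond_emb)    # conditioned on the text prompt
    eps_u = unet(xt, t, uncond_emb)  # conditioned on the empty prompt
    return eps_u + gamma * (eps_c - eps_u)
```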


Part 1.7: Image-to-Image Translation

Once again returning to our image of the Campanile, we call our iterative_denoise_cfg function, this time with starting indices i_start = 1, 3, 5, 7, 10, and 20, as sketched below. Larger starting indices produce images that look closer and closer to the original Campanile.
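A hypothetical driver loop, assuming the forward helper from Part 1.1, a strided timestep schedule strided_timesteps, and the iterative_denoise_cfg function named above:

```python
# Noise the Campanile to each starting level, then denoise it back with CFG.
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    xt, _ = forward(campanile, strided_timesteps[i_start], alphas_cumprod)
    edits.append(iterative_denoise_cfg(xt, i_start=i_start))
```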

Edits at i_start = 1, 3, 5, 7, 10, and 20.

Part 1.7.1: Editing Hand-Drawn and Web Images

We can also perform the image-to-image translation with not just the Campanile, but also our own images.

A web image and two hand-drawn images, each followed by edits at i_start = 1, 3, 5, 7, 10, and 20.

Part 1.7.2: Inpainting

Using a binary mask m, we can generate images that keep the original content outside the mask while generating new content inside it. This is accomplished by running the diffusion denoising loop, but at each step forcing x_t to match the noised original wherever the mask is 0:

x_t \leftarrow m\, x_t + (1 - m)\, \text{forward}(x_{\text{orig}}, t)
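In code, the projection is a one-liner inside the denoising loop (a sketch; m, x_orig, and the forward helper are as defined earlier):

```python
# After each denoising update of xt, re-impose the original image
# outside the mask (m == 0) while letting the masked region evolve.
xt = m * xt + (1 - m) * forward(x_orig, t, alphas_cumprod)[0]
```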
Three examples, each showing the original image, the mask, the region to replace, and the inpainted result.

Part 1.7.3: Text-Conditional Image-to-Image Translation

Now, instead of using our prompt of “a high quality photo,” we can use our own custom prompts to generate image-to-image translations.

Using prompt “a rocket ship”:

Edits at noise levels 1, 3, 5, 7, 10, and 20, followed by the original image.

Using prompt “a lithograph of a skull”:

Edits at noise levels 1, 3, 5, 7, 10, and 20, followed by the original image.

Using prompt “a photo of a hipster barista”:

Edits at noise levels 1, 3, 5, 7, 10, and 20, followed by the original image.

Part 1.8: Visual Anagrams

Using similar methods as before, we can generate images that look like one thing when upright and another when upside down, by combining two different prompts. The first prompt denoises the image x_t, giving noise estimate ε_1; the second prompt denoises an upside-down version of x_t, giving ε_2. We then flip ε_2 back upright and take a weighted average of the two estimates as our final noise estimate.

\epsilon_1 = \text{UNet}(x_t, t, p_1) \\
\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\
\epsilon = w\, \epsilon_1 + (1 - w)\, \epsilon_2

Usually, we’d use weight w = 0.5, but tuning it per anagram gave better results; for the first anagram below, my best result came with w = 0.7.
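A sketch of the anagram noise estimate; unet(x, t, p) stands in for the prompt-conditioned (and CFG’d) DeepFloyd call, which is more involved in practice:

```python
import torch

def anagram_noise_estimate(unet, xt, t, p1, p2, w=0.5):
    # Denoise upright with prompt p1; denoise the vertically flipped image
    # with prompt p2, then flip that estimate back before averaging.
    eps1 = unet(xt, t, p1)
    eps2 = torch.flip(unet(torch.flip(xt, dims=[-2]), t, p2), dims=[-2])
    return w * eps1 + (1 - w) * eps2
```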

“an oil painting of people around a campfire” / “a photo of a dog” (w = 0.7)
“an oil painting of an old man” / “a man wearing a hat” (w = 0.78)
“a rocket ship” / “a photo of a man” (w = 0.637)

Part 1.9: Hybrid Images

To create images that seem like one thing up close and another from far away, we use low-pass and high-pass filters. Again, two noise estimates are created using separate prompts, but this time they’re combined like so:

\epsilon_1 = \text{UNet}(x_t, t, p_1) \\
\epsilon_2 = \text{UNet}(x_t, t, p_2) \\
\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2).

The low-pass filter is a Gaussian blur with kernel size 33 and sigma 2, and the high-pass filter is its complement (the estimate minus its blurred version).
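A sketch of the combination; the blur parameters match those stated above, and unet is again a stand-in for the prompt-conditioned model call:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, xt, t, p1, p2):
    eps1 = unet(xt, t, p1)  # "far away" prompt
    eps2 = unet(xt, t, p2)  # "close up" prompt
    low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)
    return low + high
```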

Far: “a lithograph of a skull”
Close: “a lithograph of waterfalls”

Far: “a photo of a dog”
Close: “an oil painting of people around a campfire”

Far: “a rocket ship”
Close: “a man wearing a hat”


Part B: Diffusion Models from Scratch


Overview

In this second part of the project, we train our own diffusion models on the MNIST dataset.


Part 1: Training a Single-Step Denoising UNet

We first build a one-step denoiser.


Part 1.1: Implementing the UNet

We implement the unconditional UNet, built from downsampling and upsampling blocks with skip connections:
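As one plausible minimal layout (a sketch only; the actual project spec prescribes specific Conv/DownConv/UpConv blocks that I elide here):

```python
import torch
import torch.nn as nn

class SimpleUNet(nn.Module):
    """Sketch of a one-level UNet for 28x28 MNIST images."""

    def __init__(self, in_ch=1, hidden=128):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.GELU())
        self.down2 = nn.Sequential(nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.GELU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.GELU())
        self.out = nn.Conv2d(2 * hidden, in_ch, 3, padding=1)  # project after skip-concat

    def forward(self, x):
        h1 = self.down1(x)  # full-resolution features
        h2 = self.down2(h1)  # downsample by 2
        u1 = self.up1(h2)    # upsample back to full resolution
        return self.out(torch.cat([h1, u1], dim=1))  # skip connection
```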


Part 1.2: Using the UNet to Train a Denoiser

In order to train our denoiser, we need to generate noisy images with

z = x + \sigma\epsilon, \quad \epsilon \sim \mathcal{N}(0, I).

For σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0], we get the following images in order.
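A sketch of how these noisy versions are generated (the batch x here is a random stand-in for MNIST digits in [0, 1]):

```python
import torch

x = torch.rand(4, 1, 28, 28)  # stand-in for a batch of MNIST digits
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
# z = x + sigma * eps for each noise level
noisy_versions = [x + s * torch.randn_like(x) for s in sigmas]
```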


Part 1.2.1: Training

We now train a denoiser to recover the clean image from images noised with σ = 0.5. We use batch size 256 and train over our dataset for 5 epochs, with hidden dimension 128 for our UNet and the Adam optimizer with learning rate 1e-4.
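A minimal training-loop sketch under these hyperparameters; denoiser and train_loader are assumed to be the UNet above and a batch-256 MNIST DataLoader:

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:
        z = x + 0.5 * torch.randn_like(x)  # noise with sigma = 0.5
        loss = F.mse_loss(denoiser(z), x)  # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```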

Below is our loss curve.

Training losses over steps

Below are the results on a test set after the first and fifth epochs.

Epoch 1: original, noisy, denoised
Epoch 5: original, noisy, denoised

Part 1.2.2: Out-of-Distribution Testing

When we test on out-of-distribution noise levels, we get the following results for σ = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0, alternating between the noisy input and the denoised output.