The first part tests image generation with the DeepFloyd IF diffusion model. I set the random seed in the code to 180, the value originally used in the notebook. Below are the images generated with 10 and 20 inference steps in the first stage. With fewer steps, the rocket looks a bit strange and the portrait comes out in black and white.
In theory, the forward process adds noise step by step according to a formula. However, the noisy image at any time step can also be sampled directly from the original image using a closed-form formula. The results for time steps 250, 500, and 750 are given below.
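The closed-form shortcut can be sketched as follows. This is a minimal NumPy illustration, not the notebook's code: the linear beta schedule, the name `alphas_cumprod`, and the helper `forward` are assumptions made for the example (the real model supplies its own schedule), and a random array stands in for the image.

```python
import numpy as np

# Assumed schedule: linear betas over 1000 steps (illustrative only;
# the actual model ships its own cumulative-alpha table).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def forward(x0, t, rng):
    """Sample x_t directly from x0: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_t, eps

rng = np.random.default_rng(180)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for a real image
noisy = {t: forward(x0, t, rng)[0] for t in (250, 500, 750)}
```

Because the cumulative product shrinks as t grows, the image term fades and the noise term dominates at later time steps.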
Gaussian blurring can be used to reduce the noise in an image, but it clearly does not work well when the noise level is high.
Inverting the previous formula gives a formula for one-step denoising: given the model's noise estimate, we can solve directly for an estimate of the original image. The noisy image, the image denoised by Gaussian blur, and the image denoised in one step are as follows.
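As a sketch of that inversion, assume the forward formula is x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps. The helper name `estimate_x0` and the value of `abar_t` are illustrative, and the true noise stands in for the model's prediction so the result can be checked.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Solve x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps for x0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

# Sanity check: the true noise plays the role of the model's estimate.
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)
abar_t = 0.3  # assumed cumulative alpha at some time step
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x0_hat = estimate_x0(x_t, eps, abar_t)
```

With a perfect noise estimate the original image is recovered exactly. In practice the estimate is imperfect, and the error is amplified by the division by sqrt(abar_t) when the noise level is high.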
One-step denoising can produce a good image, but the edges are still blurry. Every inference carries some error, so denoising gradually makes each inference easier and lets later steps correct earlier mistakes, which should perform better. To reduce the time cost, a compromise is adopted: denoising with strided time steps, skipping several steps at a time. The following are the outputs at time steps 690, 540, 390, 240, and 90.
If we start from a pure random-noise image and denoise it iteratively, we can generate images with the diffusion model. The generated images look very much like photographs, but the content is sometimes strange and seems to have no clear meaning.
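One strided update can be sketched with the standard DDPM posterior mean, generalized to a jump from t to an earlier t'. The stride of 30 starting at t = 990 is an assumption (it is one grid that contains the time steps listed above), as are the linear schedule and the helper name `strided_step`.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

# Assumed grid: stride 30 from t = 990 down to 0.
strided_timesteps = list(range(990, -1, -30))

def strided_step(x_t, x0_hat, t, t_prev, noise=None):
    """Move from time step t to the earlier t_prev, given the current x0 estimate."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev               # effective alpha for the jump
    beta = 1.0 - alpha
    mean = (np.sqrt(abar_prev) * beta / (1.0 - abar_t)) * x0_hat \
         + (np.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t
    # Optional extra noise term (e.g. a predicted variance), omitted by default.
    return mean if noise is None else mean + noise
```

Each update blends the current clean-image estimate with the current noisy image, so a poor estimate at one step only partly affects the next one.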
Classifier-free guidance (CFG) addresses these problems. It computes two noise estimates, one conditioned on an empty text prompt and one on the chosen text prompt, and then extrapolates from the unconditional estimate past the conditional one. In practice, the images obtained this way are brighter (even flashy), the content is usually a clear and meaningful object, and in many later experiments there was a strong tendency to output photos of people.
If we add noise to an existing image and then run the procedure above, we get a new image that resembles the original but is also steered by the text prompt. The amount of noise controls the trade-off: with little noise the result stays close to the original image, and with more noise it moves closer to the text prompt.
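The extrapolation is a one-line formula. In the sketch below the function name `cfg_noise` and the scale of 7 are illustrative assumptions, and small arrays stand in for the two noise estimates.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, scale):
    """eps = eps_u + scale * (eps_c - eps_u); scale > 1 extrapolates past eps_c."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0, 2.0])   # stand-in for the empty-prompt estimate
eps_c = np.array([1.0, 1.0, 0.0])   # stand-in for the text-prompt estimate
guided = cfg_noise(eps_u, eps_c, scale=7.0)  # assumed guidance scale
```

A scale of 0 gives the unconditional estimate, 1 gives the conditional estimate, and larger values push further in the direction the prompt suggests.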
The homework website asked us to do this for hand-drawn images and web images. For the hand-drawn images I chose my past sketches and doodles on exam papers, and for the web images I chose library photos from my high school's website. Although the generated images change with the noise level as described above, the results are not as good as I expected. At low noise levels, my sketches turned into three-dimensional shapes with no real meaning. My doodles did yield pictures with a similar color distribution at low noise levels, but the content is not what I originally meant to draw. This is understandable given the low resolution. At high noise levels, the output almost always becomes a portrait. This may be caused by the training data, and it caused me some trouble later.
I initially ran into some problems with the code for the image restoration part. We need to keep part of the image's pixels unchanged, and for the two parts to blend well, the kept region must have the same noise level as the region being restored. If, after each model inference, the predicted image is spliced with the original image noised to the current time step, the two parts end up at different noise levels in the last few denoising steps, which leaves noise specks in the repaired area. When the bell tower photo is restored this way, upsampling turns those specks into small colored lights, which looks interesting but is actually wrong. In other words, inside the loop we should noise the original image to the current time step and splice it with the current image before passing the result to the model.
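The corrected splice can be sketched as below. The names `blend_known_region`, `mask`, and `abar_t` are illustrative; the mask is assumed to be 1 where new content is generated and 0 where the original pixels are kept.

```python
import numpy as np

def blend_known_region(x_t, x_orig, mask, abar_t, rng):
    """Keep x_t where mask == 1; elsewhere use x_orig noised to the same level."""
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_orig_t

rng = np.random.default_rng(0)
x_orig = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # stand-in for the photo
x_t = rng.standard_normal(x_orig.shape)           # current noisy sample
mask = np.zeros_like(x_orig)
mask[:, 2:6, 2:6] = 1.0                           # region to regenerate
# Call this at the top of each loop iteration, before the model sees x_t.
x_t_blended = blend_known_region(x_t, x_orig, mask, abar_t=0.4, rng=rng)
```

Because both regions carry the noise level of the same time step when the model sees them, the kept pixels and the regenerated pixels are denoised consistently.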
During the restoration process, CFG has a strong tendency to insert overly conspicuous objects into the image, especially portraits. It is likely that repeated attempts will be required to get a satisfactory result. In this sense, CFG is not so suitable for "restoration".
The following are the results of image-to-image translation guided by a text prompt. The prompt in every case is a rocket.
In this section, the algorithm is modified to implement visual anagrams. The noise estimate is the mean of two estimates: one for the image in its upright orientation with the first prompt, and one for the flipped image with the second prompt, flipped back before averaging. The result is a single image that shows different content in the two orientations. The weights of the mean may need to be adjusted depending on the two prompts. The face prompt usually dominates, and increasing the weight of the other prompt often leads to better results.
The prompts used in the following images are:
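A minimal sketch of the combined noise estimate follows. It assumes the second orientation is a vertical flip, uses a hypothetical `predict_noise(x, prompt)` callable in place of the model, and exposes the weighting as `w_up`; the toy predictor exists only to make the example runnable.

```python
import numpy as np

def anagram_noise(x_t, predict_noise, prompt_up, prompt_down, w_up=0.5):
    """Weighted mean of the upright estimate and the flipped-back estimate."""
    eps_up = predict_noise(x_t, prompt_up)
    flipped = np.flip(x_t, axis=-2)                  # vertical flip of (C, H, W)
    eps_down = np.flip(predict_noise(flipped, prompt_down), axis=-2)
    return w_up * eps_up + (1.0 - w_up) * eps_down

# Toy stand-in for the model: adds a vertical ramp so the flips are observable.
ramp = np.arange(4, dtype=float).reshape(1, 4, 1) * np.ones((1, 4, 4))
toy_predict = lambda x, prompt: x + ramp

x_t = np.random.default_rng(0).standard_normal((1, 4, 4))
eps = anagram_noise(x_t, toy_predict, "a rocket ship",
                    "an oil painting of an old man", w_up=0.6)
```

Setting `w_up` above 0.5 gives more weight to the upright prompt, which is one way to counter a dominant face prompt on the flipped side.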
"an oil painting of people around a campfire" & "an oil painting of an old man"
"an oil painting of a snowy mountain village" & "an oil painting of an old man"
"a rocket ship" & "an oil painting of an old man"
In this section, the algorithm is also modified to implement hybrid images. The noise estimate is the sum of the low-frequency component of the estimate for one prompt and the high-frequency component of the estimate for another prompt. There is no good way to balance the strength of the two signals here, so more attempts are needed to get relatively good results.
The prompts used in the following images are:
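The frequency split can be sketched with a Gaussian low-pass filter, taking the high-pass part as the residual. The separable NumPy blur, the value of `sigma`, and the function names are assumptions for illustration, and random arrays stand in for the two noise estimates.

```python
import numpy as np

def gaussian_lowpass(img, sigma=2.0):
    """Separable Gaussian blur over the last two axes (zero-padded borders).

    Assumes each image side is at least as long as the kernel (6 * sigma + 1).
    """
    radius = int(3 * sigma)
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-xs**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    blur = lambda v: np.convolve(v, kernel, mode="same")
    out = np.apply_along_axis(blur, -1, img)
    return np.apply_along_axis(blur, -2, out)

def hybrid_noise(eps_low, eps_high, sigma=2.0):
    """Low frequencies from eps_low plus high frequencies from eps_high."""
    high = eps_high - gaussian_lowpass(eps_high, sigma)
    return gaussian_lowpass(eps_low, sigma) + high

rng = np.random.default_rng(0)
eps_a = rng.standard_normal((3, 32, 32))  # stand-in estimate for prompt A
eps_b = rng.standard_normal((3, 32, 32))  # stand-in estimate for prompt B
eps = hybrid_noise(eps_a, eps_b)
```

The blur width `sigma` sets the cutoff between the two prompts. It does not provide a direct way to rebalance their strengths, which is consistent with the need for repeated attempts.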
"a lithograph of a skull" & "a lithograph of waterfalls"
"a pencil" & "a rocket ship"
"an oil painting of an old man" & "a lithograph of waterfalls"
The structure of the UNet is shown in the figure. In the experiment, I used the L2 loss as the loss function.
The loss curve of the model training is as follows.
The results of single-step denoising on the test set after epochs 1 and 5 are as follows.
The model was trained at a noise level of 0.5. To check how well it generalizes, the model after epoch 5 was also tested on single-step denoising at noise levels of 0, 0.2, 0.4, 0.5, 0.6, 0.8, and 1.0.
To turn the denoiser into a diffusion model, time conditioning is added to the neural network. The modified network structure is shown in the figure.
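The noise levels and the L2 loss can be illustrated as follows. This sketch assumes the noising is additive Gaussian, z = x + sigma * eps, which the text does not state explicitly. A random 28x28 array stands in for a test image, and the helper names are hypothetical.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Assumed noising process: z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * rng.standard_normal(x.shape)

def l2_loss(pred, target):
    """Mean squared error between a prediction and the clean image."""
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(28, 28))   # stand-in for a test image
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
# Loss of the untouched noisy input: a "do nothing" baseline per noise level.
baseline = [l2_loss(add_noise(x, s, rng), x) for s in sigmas]
```

The baseline grows roughly as sigma squared, which gives a reference point when judging how much a denoiser trained only at 0.5 helps at the other levels.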
The training loss curve is as follows.
The model starts by generating images from random noise. Below is the process of generating images at epochs 1, 5, 10, 15, and 20.
The model does produce valid output, but it is not good enough. Building on the previous part, I added class conditioning to the neural network. The training loss curve is as follows.
Now the model can accurately generate images of the requested digit class. Below is the process of generating images at epochs 1, 5, 10, 15, and 20. Even at early epochs, the results are already very good.