{"id":1,"title":"Rapid High Resolution Latent Space Interpolation in Diffusion Models","independent":true,"description":"The goal of this project is to achieve 4K image generation with Stable Diffusion while traversing its latent space at 24 frames per second, a process sometimes called AI \"dreaming\". The entire pipeline runs locally, with no cloud compute. Using the base Stable Diffusion model, text-to-image generation takes about 2 seconds for a 512x512 image with the DDIM sampler at 50 steps, and about 4 seconds for a 576x1024 image. I decided to move forward with the 576x1024 size, as it is well suited for subsequent upscaling to 4K. Changing the sampler does not dramatically affect these times or, for my purposes, the resulting images.\r\n\r\nAlthough 4-second generation times are quite respectable, they are far from my target of 24fps. Thankfully, it is significantly cheaper to explore the model's latent space. To do so, I take a text prompt, run it through the transformer text encoder, then pass the result into the sampler to get an output matrix. This matrix represents a point in a high-dimensional space of shape [4, 72, 128]. Decoding this latent matrix into an image takes only around 5ms, demonstrating significant potential for rapid traversal. By generating two latent matrices from two different starter prompts, I can interpolate between them at a chosen step size with a simple linear calculation, then decode the intermediate latents to traverse the images between the two outputs. Unfortunately, when I tried to rapidly decode these latent matrices I encountered significant GPU bottlenecks: after a few generations the decode time would increase from 5ms to 1000ms or more. After some investigation with Python's cProfile I was able to optimize certain portions of the code, specifically the transfer of memory from GPU to CPU, which was a bottleneck (simply making the transfer non-blocking cut times in half). However, this did not fix the issue entirely. 
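The interpolation itself is just a linear blend of the two latent matrices. A minimal NumPy sketch (the array names and step count are illustrative, and random arrays stand in for the two sampler outputs, which in the real pipeline come from the two prompts):

```python
import numpy as np

# Illustrative stand-ins for the two sampler outputs; in the real pipeline
# these come from running two different prompts through the encoder and sampler.
rng = np.random.default_rng(0)
latent_a = rng.standard_normal((4, 72, 128)).astype(np.float32)
latent_b = rng.standard_normal((4, 72, 128)).astype(np.float32)

def interpolate_latents(a, b, steps):
    # Linear blend between two latent matrices, inclusive of both endpoints.
    return [(1.0 - t) * a + t * b for t in np.linspace(0.0, 1.0, steps)]

# Each intermediate latent can then be decoded into a 576x1024 image.
path = interpolate_latents(latent_a, latent_b, steps=24)
```
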
After experimentation, I believe the issue stems primarily from GPU memory management and memory switching. My current solution is to clear the GPU cache every other frame, which lets me consistently operate at around 8 frames per second.\r\n\r\nWith the above optimizations I am able to generate 576x1024 images at 8 frames per second. To increase the resolution to 4K (3840x2160) I needed an upscaling technique. ESRGAN (Enhanced Super-Resolution Generative Adversarial Networks) is a deep learning technique for upscaling low-resolution images into higher-resolution versions, in which two neural networks (a generator and a discriminator) compete against each other to produce more realistic images. Initial experimentation showed that ESRGAN can upscale an image in a fraction of a second (about 500ms), but unfortunately this is still too slow for my target of 24fps. Alternative options, which rely on simpler statistical methods, include nearest-neighbor, bilinear interpolation, bicubic interpolation, and Lanczos resampling, all of which run in under 100ms. Bicubic interpolation produced the most favorable outputs (with bilinear the worst), so I have opted to use it for now.\r\n\r\nSince I can now traverse the latent space at 4K resolution at about 7fps, the next step was to use frame interpolation to reach 24fps. Because transitions between individual frames are unlikely to resemble the motion found in traditional video, and performance is a top priority, I opted for simple linear frame interpolation. Three frames are generated between each decoded dream sample; however, an issue arose in which type conversions accumulated and became very costly. To fix this, it was necessary to perform frame interpolation before upscaling. Additionally, I batched frame processing so that all 4 frames in each pass are processed in parallel (e.g. transferred from GPU to CPU together). 
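The linear frame interpolation step can be sketched in a few lines. This is an illustrative NumPy version (function and variable names are my own, and constant arrays stand in for decoded dream samples): each pass blends the previous and next samples into three in-between frames, batched together with the new sample so all four can be processed at once.

```python
import numpy as np

def interpolate_between_samples(prev_frame, next_frame, n_between=3):
    # Produce the n_between blended frames plus the new sample itself,
    # stacked into one batch so all 4 frames can be processed together
    # (upscaled, transferred GPU-to-CPU, etc.).
    ts = np.linspace(0.0, 1.0, n_between + 2)[1:]  # drop t=0 (already displayed)
    return np.stack([(1.0 - t) * prev_frame + t * next_frame for t in ts])

# Stand-ins for two consecutive decoded 576x1024 dream samples.
prev_frame = np.zeros((576, 1024, 3), dtype=np.float32)
next_frame = np.ones((576, 1024, 3), dtype=np.float32)
batch = interpolate_between_samples(prev_frame, next_frame)  # shape (4, 576, 1024, 3)
```
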
To handle the upscaling of these frames I used multithreading through the concurrent.futures library, so that all four frames are upscaled in parallel.\r\n\r\nTo display the output I opted for PyQt5 for its performance and powerful capabilities. Originally I updated the image every time a dream sample was generated, but with frame interpolation this resulted in sporadic bursts of frames. To solve this I implemented multithreading with a dream_generation_thread and a display_thread. Both threads access a shared frame_buffer: the dream_generation_thread continuously appends to it, while the display_thread pops and displays frames at a regular frequency (the desired framerate). A simple lock protects the buffer from race conditions. With these changes implemented I am now achieving around 30fps latent space traversal (with 8fps directly from the diffusion model) at 4K resolution, run entirely on my local machine.","start":"2023-12-21","end":"2024-01-30","img":"https://imgur.com/K2vGt1h.gif","link":"https://github.com/SevanBrodjian/Rapid_Diffusion_Dreamer","slug":"rapid-hi-res-diffusion-latent-space-traversal","topic":[1,2,3],"association":[]}