LucidDreamer: High-Fidelity Text-to-3D Generation via Interval Score Matching

The recent advancements in text-to-3D generative AI frameworks have marked a significant milestone in generative models. They pave the way for new possibilities in creating 3D assets across numerous real-world scenarios. Digital 3D assets now hold an indispensable place in our digital presence, enabling comprehensive visualization and interaction with complex environments and objects that mirror our real-world experiences. These 3D generative AI frameworks are applied in various domains, including animation, architecture, gaming, augmented and virtual reality, and much more. They are also being used extensively in online conferences, retail, education, and marketing.

However, despite the promise of these advancements in text-to-3D generative frameworks, the extensive use of 3D technologies comes with a major issue. Generating high-quality 3D images and media content still requires significant time, effort, resources, and skilled expertise. Even with these requirements met, text-to-3D generation often fails to render detailed and high-quality 3D models. This issue of rendering and low-quality 3D generation is more prevalent in frameworks that use the Score Distillation Sampling (SDS) method. This article will discuss the notable deficiencies observed in models using the SDS method, which introduce inconsistencies and low-quality updating directions, resulting in an over-smoothing effect on the generated output. We will also introduce the LucidDreamer framework, a novel approach that uses the Interval Score Matching (ISM) method to overcome the over-smoothing issue. We’ll explore the model’s architecture and its performance against state-of-the-art text-to-3D generative frameworks. So, let’s get started.

A major reason 3D generation models have become a talking point in the generative AI industry is their wide range of applications across domains and industries, and their ability to produce 3D content in real time. Owing to these practical applications, developers have proposed numerous 3D content generation approaches, among which text-to-3D generation frameworks stand out for their ability to generate imaginative 3D models from nothing but text descriptions. Text-to-3D generative frameworks achieve this by using a pre-trained text-to-image diffusion model as a strong image prior to supervise the training of a neural parameterized 3D model, so that the rendered images are consistent and align with the text. This capability is grounded in Score Distillation Sampling (SDS), which acts as the core mechanism for lifting 2D results from diffusion models into 3D, and thus enables training 3D models without any training images. Despite their effectiveness, 3D generative AI frameworks that rely on SDS often suffer from distortion and over-smoothing issues that hamper the practical implementation of high-fidelity 3D generation.

To tackle the over-smoothing issue, the LucidDreamer framework implements Interval Score Matching (ISM), a novel approach that relies on two mechanisms. First, ISM employs DDIM inversion to produce an invertible diffusion trajectory, which mitigates the averaging effect caused by inconsistent pseudo-ground truths. Second, rather than matching the images rendered by the 3D model against the pseudo-ground truths, ISM matches between two interval steps in the diffusion trajectory, which avoids the high reconstruction error of one-step reconstruction. Using ISM instead of SDS results in consistently high performance with highly realistic and detailed outputs.

Overall, the LucidDreamer framework aims to make the following contributions to 3D generative AI:

It provides an in-depth analysis of SDS, the fundamental concept in text-to-3D generative frameworks, identifies its key limitation of low-quality pseudo-ground truths, and explains the over-smoothing effect seen in these frameworks.
It introduces Interval Score Matching to counter the limitations of SDS, a novel approach that uses interval-based matching and invertible diffusion trajectories to outperform SDS by producing highly realistic and detailed output.
It achieves state-of-the-art performance by integrating ISM with 3D Gaussian Splatting, surpassing existing methods for 3D content generation at low training cost.

SDS Limitations

As mentioned earlier, SDS is one of the most popular approaches for text-to-3D generation models, and it seeks modes for the conditional posterior in the latent space of a DDPM. SDS adopts a pretrained DDPM to model the conditional posterior, and distills the 3D representation by minimizing a KL divergence between the distribution of noised renderings and that posterior. In practice, SDS reuses the weighted denoising score matching objective used for DDPM training. The SDS objective can also be viewed as matching the view of the 3D model with a pseudo-ground truth that the DDPM estimates in a single step. However, developers have observed that this distillation process overlooks key aspects of the DDPM: the following figure demonstrates how a pre-trained DDPM tends to predict pseudo-ground truths with inconsistent features, and produces low-quality output during the distillation process.
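For reference, the SDS objective described above is usually written along the following lines. This is a sketch in notation assumed here rather than taken from this article: \theta are the parameters of the 3D model, g(\theta, c) renders a view x_0 from camera pose c, \epsilon_\phi is the pretrained DDPM noise predictor, y is the text prompt, \bar\alpha_t is the cumulative noise schedule, and \omega(t) is a time-dependent weight.

```latex
\begin{align}
x_t &= \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
  \qquad x_0 = g(\theta, c),\quad \epsilon \sim \mathcal{N}(0, I) \\
\min_\theta\ & D_{\mathrm{KL}}\big(q_t^\theta(x_t)\,\|\,p_t(x_t \mid y)\big) \\
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) &\approx
  \mathbb{E}_{t,\epsilon,c}\Big[\omega(t)\,\big(\epsilon_\phi(x_t, t, y) - \epsilon\big)\,
  \frac{\partial g(\theta, c)}{\partial \theta}\Big] \\
\hat{x}_0^t &= \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\phi(x_t, t, y)}{\sqrt{\bar\alpha_t}},
  \qquad
  \epsilon_\phi(x_t, t, y) - \epsilon = \frac{x_0 - \hat{x}_0^t}{\gamma(t)},
  \quad \gamma(t) = \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}
\end{align}
```

The last line follows by substituting the first equation into the definition of \hat{x}_0^t. It makes the "pseudo-ground truth" reading explicit: up to the scale 1/\gamma(t), the SDS update pulls the rendered view x_0 towards the one-step estimate \hat{x}_0^t.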

When such inconsistent, low-quality update directions are applied to the 3D representation, they ultimately lead to over-smoothed results. The DDPM is also input-sensitive: the features of the pseudo-ground truth change significantly even with the slightest change in the input. Randomness in both the camera pose and the noise component of the inputs adds further fluctuation, which is unavoidable during distillation. Optimizing towards inconsistent pseudo-ground truths therefore yields feature-averaged outcomes. Moreover, SDS obtains pseudo-ground truths with a single-step prediction for all time intervals and does not account for the fact that a single DDPM step is often unable to produce high-quality output. Together, these points indicate that distilling 3D assets with SDS might not be the ideal approach.
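The feature-averaging effect can be illustrated with a toy example. The sketch below is not taken from LucidDreamer: it simply assumes that two equally likely pseudo-ground truths contain the same sharp pattern shifted by one position, and that the optimizer minimizes the average squared error to both. The names `pseudo_gts` and `distill` are hypothetical.

```python
def mean_squared_error_grad(x, targets):
    """Gradient of the average squared error between x and every target."""
    n = len(targets)
    return [sum(2.0 * (xi - t[i]) for t in targets) / n for i, xi in enumerate(x)]


def distill(targets, steps=100, lr=0.25):
    """Plain gradient descent towards a set of (possibly inconsistent) targets."""
    x = [0.0] * len(targets[0])
    for _ in range(steps):
        grad = mean_squared_error_grad(x, targets)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


# Two sharp but mutually inconsistent "pseudo-ground truths" (toy 1D signals).
pseudo_gts = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
result = distill(pseudo_gts)

# Each target has full contrast (1.0); the optimum has almost none.
contrast = max(result) - min(result)
print(result, contrast)
```

Each target is sharp, but the minimizer of the averaged loss is their mean, a flat signal with no detail. This is the same mechanism, in miniature, by which inconsistent pseudo-ground truths produce over-smoothed 3D results.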

LucidDreamer : Methodology and Working

The LucidDreamer framework does introduce the ISM approach, but it also builds on the learnings from other frameworks including text to 3D generative models, diffusion models, and differentiable 3D representation frameworks. With that being said, let’s have a detailed look at the architecture and methodology of the LucidDreamer framework.

Interval Score Matching or ISM

The over-smoothing and low-quality output issues faced by a majority of text-to-3D generation frameworks can be attributed to their use of SDS, which matches the 3D representation against pseudo-ground truths that are inconsistent and often of sub-par quality. To counter these issues, the LucidDreamer framework introduces Interval Score Matching (ISM), a novel approach that works in two stages. In the first stage, ISM obtains more consistent pseudo-ground truths during distillation regardless of the randomness in camera poses and noise. In the second stage, the framework generates pseudo-ground truths of better quality.

Another major limitation of SDS is that it generates pseudo-ground truths with a single-step prediction for all time intervals, which makes it difficult to guarantee their quality. Addressing this is the basis for improving the visual quality of the pseudo-ground truths. Recall that the SDS objective can be seen as matching the view of the 3D model with the pseudo-ground truth estimated by the DDPM in a single step. The distillation process therefore overlooks a critical aspect of the DDPM: a single step yields low-quality pseudo-ground truths with inconsistent features.

Overall, ISM promises several advantages over previous methods used in text-to-3D generation models. First, because ISM consistently provides high-quality pseudo-ground truths, it produces high-fidelity distillation outputs with finer structures and richer details. This removes the need for a large guidance scale and enhances the flexibility of 3D content creation. Second, moving from SDS to ISM adds only marginal computational overhead: although ISM requires additional computation for the DDIM inversions, it does not compromise overall efficiency.
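As a point of comparison with SDS, the interval score that drives the ISM update can be sketched as follows. The notation is an assumption made here for illustration: x_s and x_t are two noisy latents on the same DDIM inversion trajectory with s = t - \delta_T, \emptyset denotes the unconditional (empty) prompt, and the remaining symbols are the usual ones (renderer g(\theta, c), noise predictor \epsilon_\phi, prompt y, weight \omega(t)).

```latex
\begin{align}
\nabla_\theta \mathcal{L}_{\mathrm{ISM}}(\theta) &=
  \mathbb{E}_{t,c}\Big[\omega(t)\,\big(\epsilon_\phi(x_t, t, y) - \epsilon_\phi(x_s, s, \emptyset)\big)\,
  \frac{\partial g(\theta, c)}{\partial \theta}\Big],
  \qquad s = t - \delta_T
\end{align}
```

Compared with the SDS update, the randomly sampled noise \epsilon is replaced by the unconditional prediction at the earlier step s, so the update direction compares two points on one deterministic trajectory instead of comparing a one-step estimate with fresh random noise.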

The above figure illustrates how ISM works and gives an overview of the LucidDreamer architecture. The framework first initializes the 3D representation, i.e. the 3D Gaussian Splatting, from a prompt using a pretrained text-to-3D generator. It then uses a pretrained 2D DDPM to perturb randomly rendered views into noisy unconditional latent trajectories via DDIM inversion, and updates the 3D representation with the interval score. The core idea of ISM optimization is to update the 3D representation towards pseudo-ground truths that are high-quality and feature-consistent while remaining computationally friendly. This principle lets ISM align with the fundamental objective of SDS while refining the existing method.

DDIM Inversion

The LucidDreamer framework aims to produce more consistent pseudo-ground truths that align with the 3D representation. Therefore, instead of producing the noisy latent stochastically by adding random noise, the framework employs DDIM inversion to predict the noisy latent, building an invertible noisy latent trajectory in an iterative manner. It is because of this invertibility that the LucidDreamer framework can significantly increase the consistency of the pseudo-ground truth for all time intervals.
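The following is a minimal sketch of how a DDIM inversion trajectory and an interval score could be computed. It is illustrative only: `toy_noise_predictor` is a hypothetical stand-in for a pretrained noise predictor (with `prompt_shift=0.0` playing the role of the unconditional prompt), the linear beta schedule and the stride values `delta_t` and `delta_s` are assumptions, and latents are plain Python lists.

```python
import math


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative noise schedule; index 0 is the clean image (alpha_bar = 1)."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def toy_noise_predictor(x, t, prompt_shift=0.0):
    """Hypothetical deterministic stand-in for eps_phi(x, t, y)."""
    return [math.tanh(xi) * t / 1000.0 + prompt_shift for xi in x]


def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM transition between two noise levels using eps."""
    x0_hat = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
              for xi, ei in zip(x, eps)]
    return [math.sqrt(a_to) * x0 + math.sqrt(1.0 - a_to) * ei
            for x0, ei in zip(x0_hat, eps)]


def ism_direction(x0, t, delta_t, delta_s, alpha_bars, prompt_shift):
    """Invert x0 to x_s (stride delta_s), step to x_t, return the interval score."""
    s = t - delta_t
    x, cur = list(x0), 0
    while cur < s:  # unconditional DDIM inversion, no random noise is sampled
        nxt = min(cur + delta_s, s)
        eps = toy_noise_predictor(x, cur)
        x = ddim_step(x, eps, alpha_bars[cur], alpha_bars[nxt])
        cur = nxt
    x_s = x
    eps_s = toy_noise_predictor(x_s, s)               # unconditional score at s
    x_t = ddim_step(x_s, eps_s, alpha_bars[s], alpha_bars[t])
    eps_t = toy_noise_predictor(x_t, t, prompt_shift)  # conditional score at t
    direction = [et - es for et, es in zip(eps_t, eps_s)]
    return {"direction": direction, "x_s": x_s, "x_t": x_t, "eps_s": eps_s}


alpha_bars = make_alpha_bars()
view = [0.5, -1.0, 2.0]  # toy "rendered view"
out = ism_direction(view, t=500, delta_t=50, delta_s=200,
                    alpha_bars=alpha_bars, prompt_shift=0.3)
print(out["direction"])
```

Because no random noise is sampled, the same rendered view always yields the same update direction, and the step from `x_s` to `x_t` can be undone exactly with the same noise estimate. In a real pipeline the direction would be backpropagated through the renderer; that part is omitted here.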

Advanced Generation Pipeline

In addition to ISM, the LucidDreamer framework introduces an advanced pipeline to explore the factors that affect the visual quality of text-to-3D generation. It adopts 3D Gaussian Splatting (3DGS) as its 3D representation and uses 3D point cloud generation models for initialization.

3D Gaussian Splatting

Existing works have indicated that increasing the batch size and rendering resolution during training significantly improves visual quality. However, most learnable 3D representations adopted for text-to-3D generation are time- and memory-consuming. 3D Gaussian Splatting, by contrast, is efficient in both optimization and rendering, which allows the Advanced Generation Pipeline in the LucidDreamer framework to use large batch sizes and high-resolution rendering even with limited computational resources.

Initialization

A majority of state-of-the-art text-to-3D generation frameworks initialize their 3D representations with simple geometries such as a sphere, box, or cylinder, which often leads to undesired outputs on objects that are not axially symmetric. Because the LucidDreamer framework uses 3D Gaussian Splatting as its 3D representation, it can naturally adopt several text-to-point generative frameworks to produce a coarse initialization guided by human input. This initialization strategy significantly boosts the convergence speed.
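One way to picture this initialization is to turn each point of a coarse point cloud into one Gaussian. The sketch below is a simplified assumption and not LucidDreamer's actual code: the function `init_gaussians`, the isotropic scale derived from the mean distance to the `k` nearest neighbours, and the starting opacity of 0.1 are all illustrative choices.

```python
import math


def init_gaussians(points, colors, k=3, opacity=0.1, default_scale=1.0):
    """Create one isotropic Gaussian per point of a coarse point cloud."""
    gaussians = []
    for i, p in enumerate(points):
        # Distances to all other points, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        # Dense regions get small Gaussians, sparse regions get large ones.
        scale = sum(nearest) / len(nearest) if nearest else default_scale
        gaussians.append({
            "mean": tuple(p),
            "scale": scale,
            "color": tuple(colors[i]),
            "opacity": opacity,
        })
    return gaussians


# Toy point cloud: the 8 corners of a unit cube, all coloured grey.
corners = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
greys = [(0.5, 0.5, 0.5)] * len(corners)
cloud = init_gaussians(corners, greys)
print(len(cloud), cloud[0])
```

In an actual pipeline the point cloud would come from a text-to-point generator, and the Gaussians' positions, scales, colors, and opacities would then be refined by the distillation loss.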

LucidDreamer : Experiments and Results

Text-to-3D Generation

The above figure shows results generated by the LucidDreamer model with the original Stable Diffusion, whereas the following figure shows results generated with different fine-tuned checkpoints.

As can be seen, the LucidDreamer framework is capable of generating highly consistent 3D content from the input text and semantic cues. Furthermore, with the use of ISM, the framework generates more intricate and realistic images while avoiding common issues like over-saturation and over-smoothing. It excels at generating common objects and also supports more creative prompts.

ISM Generalizability

To evaluate ISM generalizability, a comparison is conducted between the ISM and the SDS methods in both explicit and implicit representations, and the results are demonstrated in the following image.

Qualitative Comparison

To evaluate the LucidDreamer framework qualitatively, it is compared against current SoTA baseline models. To ensure a fair comparison, Stable Diffusion 2.1 is used for distillation, and the results are shown in the following image. As can be seen, the framework delivers high-fidelity and geometrically accurate results while consuming fewer resources and less time.

Furthermore, to provide a more comprehensive evaluation, the developers also conducted a user study. The study selects 28 prompts and generates an object for each prompt with each of the different text-to-3D generation approaches. Users then ranked the results by fidelity and by their degree of alignment with the input prompt.

LucidDreamer : Applications

Owing to its exceptional performance on a wide array of text-to-3D generation tasks, the LucidDreamer framework has several potential applications, including zero-shot avatar generation, personalized text-to-3D generation, and zero-shot 2D and 3D editing.

The top-left image demonstrates LucidDreamer's potential in zero-shot 2D and 3D editing tasks. The bottom-left images demonstrate the framework's ability to generate personalized text-to-3D outputs with LoRA, and the image on the right showcases its ability to generate 3D avatars.

Final Thoughts

In this article, we have talked about LucidDreamer, a novel approach that uses Interval Score Matching (ISM) to overcome the over-smoothing issue, and discussed the model's architecture and its performance against state-of-the-art text-to-3D generative frameworks. We have also seen how Score Distillation Sampling (SDS), a common approach in a majority of state-of-the-art text-to-3D generation models, often results in over-smoothed outputs, and how the LucidDreamer framework counters this issue by introducing ISM to generate high-fidelity and more realistic 3D content. The results and evaluation indicate the effectiveness of the LucidDreamer framework on a wide array of 3D generation tasks, and show that it already performs better than current state-of-the-art 3D generative models. This performance makes way for a wide range of practical applications, as discussed above.