Paint3D : Lighting-Less Diffusion Model for Image Generation

The rapid development of generative AI models, especially deep generative models, has significantly advanced capabilities in natural language generation, 3D generation, image generation, and speech synthesis. These models have revolutionized 3D production across various industries. However, many face a challenge: their complex wiring and generated meshes often aren’t compatible with traditional rendering pipelines such as Physically Based Rendering (PBR). Diffusion-based models, notably those that produce textures free of baked-in lighting, demonstrate impressive and diverse 3D asset generation, enhancing 3D frameworks in filmmaking, gaming, and AR/VR.

This article introduces Paint3D, a novel framework for producing diverse, high-resolution 2K UV texture maps for untextured 3D meshes, conditioned on visual or textual inputs. Paint3D’s main challenge is generating high-quality textures without embedded illumination, so that users can re-edit or re-light them within modern graphics pipelines. It employs a pre-trained 2D diffusion model for multi-view texture fusion, generating initial coarse texture maps. However, these maps often show illumination artifacts and incomplete areas due to the 2D model’s limitations in disabling lighting effects and fully representing 3D shapes, so a second stage refines them in UV space. We will delve into Paint3D’s workings, architecture, and comparisons with other deep generative frameworks. Let’s begin.

The capabilities of deep generative AI models in natural language generation, 3D generation, and image synthesis are well known and already used in real-life applications, revolutionizing the 3D generation industry. Despite these capabilities, modern deep generative AI frameworks produce meshes characterized by complex wiring and chaotic lighting textures that are often incompatible with conventional rendering pipelines, including Physically Based Rendering (PBR). Like deep generative AI models, texture synthesis has also advanced rapidly, especially through the use of 2D diffusion models. Texture synthesis models employ pre-trained depth-to-image diffusion models to generate high-quality textures from text conditions. However, these approaches struggle with pre-illuminated textures, which can significantly degrade the final 3D environment renderings and introduce lighting errors when the lights are changed within common workflows, as demonstrated in the following image.


As can be observed, the illumination-free texture map works in sync with traditional rendering pipelines and delivers accurate results, whereas the pre-illuminated texture map shows inappropriate shadows when relighting is applied. On the other hand, texture generation frameworks trained on 3D data offer an alternative approach in which the framework generates textures by comprehending a specific 3D object’s entire geometry. Although they might deliver better results, these frameworks lack generalization capabilities, which hinders applying them to 3D objects outside their training data.
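The relighting problem can be illustrated with a toy calculation. The sketch below is not Paint3D code: it assumes a deliberately simple multiplicative shading model (pixel = texture × light intensity) and made-up values, only to show how shading baked into a texture gets applied a second time when the scene lighting changes.

```python
# Toy illustration, assuming pixel = texture * light (a simplification).

def render(texture, light):
    """Shade each texel by multiplying it with a per-texel light intensity."""
    return [round(t * l, 4) for t, l in zip(texture, light)]

albedo = [0.8, 0.8, 0.8, 0.8]      # uniform surface colour, no lighting inside
old_light = [1.0, 1.0, 0.5, 0.5]   # lighting present when the texture was made
new_light = [0.5, 0.5, 1.0, 1.0]   # lighting chosen later in the render engine

# A pre-illuminated texture already contains the old shading.
baked = render(albedo, old_light)

relit_clean = render(albedo, new_light)  # follows the new light only
relit_baked = render(baked, new_light)   # old shadow is multiplied in again

print(relit_clean)
print(relit_baked)
```

With the lighting-free texture, the bright side follows the new light. With the baked texture, the old shadow stays in the texels that the new light should brighten, which is the kind of inappropriate shadow described above.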

Current texture generation models face two critical challenges. The first is using image guidance or diverse prompts to achieve a broader degree of generalization across different objects. The second is eliminating the coupled illumination that results inherit from pre-training. Pre-illuminated textures can interfere with the final appearance of textured objects within rendering engines. In addition, since pre-trained 2D diffusion models provide results only in the 2D view domain, they lack a comprehensive understanding of shape and therefore cannot maintain view consistency for 3D objects.

Owing to the challenges mentioned above, the Paint3D framework attempts to develop a dual-stage texture diffusion model for 3D objects that generalizes to different pre-trained generative models and preserves view consistency while learning lighting-less texture generation.

Paint3D is a dual-stage, coarse-to-fine texture generation model that aims to leverage the strong prompt guidance and image generation capabilities of pre-trained generative AI models to texture 3D objects. In the first stage, the Paint3D framework progressively samples multi-view images from a pre-trained depth-aware 2D image diffusion model, which lets it produce high-quality and rich texture results from diverse prompts. The model then generates an initial texture map by back-projecting these images onto the 3D mesh surface. In the second stage, the model focuses on generating lighting-less textures by using diffusion models specialized in removing lighting influences and in shape-aware refinement of incomplete regions. Throughout the process, the Paint3D framework consistently generates semantically consistent, high-quality 2K textures and eliminates intrinsic illumination effects.

To sum it up, Paint3D is a novel coarse-to-fine generative AI model that aims to produce diverse, lighting-less, and high-resolution 2K UV texture maps for untextured 3D meshes. It targets state-of-the-art performance in texturing 3D objects with different conditional inputs, including text and images, and offers a significant advantage for synthesis and graphics editing tasks.

Methodology and Architecture

The Paint3D framework generates and refines texture maps progressively to produce diverse, high-quality textures for 3D models from the desired conditional inputs, including images and prompts, as demonstrated in the following image.

In the coarse stage, the Paint3D model uses pre-trained 2D image diffusion models to sample multi-view images, and then creates the initial texture maps by back-projecting these images onto the surface of the mesh. In the second stage, i.e. the refinement stage, the Paint3D model uses a diffusion process in the UV space to enhance the coarse texture maps, providing high-definition, inpainting, and lighting-removal functions that ensure the visual appeal and completeness of the final texture.

Stage 1: Progressive Coarse Texture Generation

In the progressive coarse texture generation stage, the Paint3D model generates a coarse UV texture map for the 3D mesh using a pre-trained depth-aware 2D diffusion model. To be more specific, the model first renders a depth map from different camera views, then uses depth conditions to sample images from the image diffusion model, and then back-projects these images onto the mesh surface. The framework alternates between rendering, sampling, and back-projection to improve the consistency of the texture across views, which ultimately helps in the progressive generation of the texture map.
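The alternating render, sample, and back-project loop can be sketched in a few lines. This is a minimal illustration of the control flow only, not the Paint3D implementation: the mesh is reduced to six numbered faces, the view-to-face visibility table is invented, and `render_depth`, `sample_image`, and `back_project` are hypothetical stand-ins for the real renderer and the depth-conditioned diffusion sampler.

```python
# Hypothetical stand-ins; only the loop structure mirrors the description.

VIEWS = {                      # faces each camera view can see (assumed)
    "front": {0, 1, 2},
    "back": {2, 3, 4},
    "top": {1, 4, 5},
}

def render_depth(view):
    """Stand-in for depth rendering: return the faces visible from `view`."""
    return VIEWS[view]

def sample_image(view, visible):
    """Stand-in for depth-conditioned sampling: one 'colour' per visible face."""
    return {face: f"{view}-colour" for face in sorted(visible)}

def back_project(texture, image):
    """Write sampled colours onto faces that are still untextured."""
    for face, colour in image.items():
        texture.setdefault(face, colour)

def coarse_texture(view_order):
    """Alternate rendering, sampling and back-projection over the views."""
    texture = {}
    for view in view_order:
        visible = render_depth(view)
        image = sample_image(view, visible)
        back_project(texture, image)
    return texture

texture = coarse_texture(["front", "back", "top"])
print(texture)
```

In this toy version, faces that were textured from an earlier view keep their colour, so each new view only contributes to regions that are still empty.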

The model starts by generating the texture of the visible region, with the camera views focused on the 3D mesh, and renders the 3D mesh to a depth map from the first view. The model then samples a texture image given an appearance condition and a depth condition, and back-projects the image onto the 3D mesh. For the subsequent viewpoints, the Paint3D model follows a similar approach with one change: it performs the texture sampling process using an image inpainting approach. The model takes the textured regions from previous viewpoints into account, so the rendering process outputs not only a depth image but also a partially colored RGB image with an uncolored mask in the current view.
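To make the role of the uncolored mask concrete, here is a small sketch under simplifying assumptions: a view is a flat list of pixels, each pixel maps to one mesh face, and `inpaint` is a hypothetical stand-in for the depth-aware inpainting model that simply writes a fixed fill value.

```python
def render_partial(pixel_faces, texture):
    """Render a view from an incomplete texture.

    Returns (rgb, mask), where mask is True for pixels whose face has
    not been textured by any previous view.
    """
    rgb, mask = [], []
    for face in pixel_faces:
        colour = texture.get(face)
        rgb.append(colour)
        mask.append(colour is None)
    return rgb, mask

def inpaint(rgb, mask, fill):
    """Stand-in for the inpainting model: change masked pixels only."""
    return [fill if m else c for c, m in zip(rgb, mask)]

# Faces 0-2 were textured from the first view; the new view sees faces 2-4.
texture = {0: "red", 1: "red", 2: "red"}
rgb, mask = render_partial([2, 3, 4], texture)
completed = inpaint(rgb, mask, fill="blue")
print(mask, completed)
```

Pixels that were already colored are passed through untouched, which is what keeps the new view consistent with the texture produced so far.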

The model then uses a depth-aware image inpainting model with an inpainting encoder to fill the uncolored area within the RGB image. It generates the texture map for that view by back-projecting the inpainted image onto the 3D mesh under the current view, which lets the model build the texture map progressively and arrive at the entire coarse texture map. Finally, the model extends the texture sampling process to a scene or object with multiple views. To be more specific, the model uses a pair of cameras to capture two depth maps from symmetric viewpoints during the initial texture sampling. The model then combines the two depth maps into a depth grid, and replaces the single depth image with the depth grid to perform multi-view depth-aware texture sampling.
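A depth grid can be pictured as two depth maps tiled into one image. The sketch below assumes a side-by-side layout and equally sized maps; the text does not specify the layout, so treat `compose_depth_grid` and `split_grid` as illustrative helpers rather than the framework's own functions.

```python
def compose_depth_grid(depth_a, depth_b):
    """Place two equally sized depth maps side by side in one grid."""
    if len(depth_a) != len(depth_b):
        raise ValueError("depth maps must have the same height")
    return [row_a + row_b for row_a, row_b in zip(depth_a, depth_b)]

def split_grid(grid):
    """Split a side-by-side grid back into its left and right halves."""
    width = len(grid[0]) // 2
    left = [row[:width] for row in grid]
    right = [row[width:] for row in grid]
    return left, right

front = [[1, 2], [3, 4]]   # toy depth map from one camera
back = [[5, 6], [7, 8]]    # toy depth map from the symmetric camera

grid = compose_depth_grid(front, back)
left, right = split_grid(grid)
print(grid)
```

Sampling one image for the whole grid means both views come from the same denoising pass, and the sampled grid can be split the same way to recover one image per camera.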

Stage 2: Texture Refinement in UV Space

Although the coarse texture maps look plausible, they still have problems, such as texture holes caused by self-occlusion during rendering and lighting shadows introduced by the 2D image diffusion models. During texture refinement, the Paint3D model performs a diffusion process in the UV space on the basis of the coarse texture map, trying to mitigate these issues and further enhance the visual appeal of the texture map. However, refining texture maps in the UV space with a mainstream image diffusion model introduces texture discontinuity, because UV mapping cuts the continuous texture of the 3D surface into a series of individual fragments in the UV space. As a result of this fragmentation, the model finds it difficult to learn the 3D adjacency relationships among the fragments, which leads to texture discontinuity issues.

The model refines the texture map in the UV space by performing the diffusion process under the guidance of the texture fragments’ adjacency information. In the UV space, it is the position map that represents the 3D adjacency information of texture fragments, with the model treating each non-background element as a 3D point coordinate. During the diffusion process, the model fuses the 3D adjacency information by adding an individual position map encoder to the pre-trained image diffusion model. The new encoder resembles the design of the ControlNet framework and has the same architecture as the encoder implemented in the image diffusion model, with a zero-convolution layer connecting the two. The texture diffusion model is trained on a dataset comprising texture and position maps, and it learns to predict the noise added to the noisy latent. During this training, only the position encoder is optimized, while the trained denoiser is kept frozen for its image diffusion task.
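The two ideas in this paragraph, a position map and a zero-initialized connection, can be shown with a toy sketch. Everything here is an assumption made for illustration: positions are normalized into [0, 1] using an assumed bounding range, background texels are set to zero, and `ZeroScale` is a scalar stand-in for a zero-initialized convolution rather than a real network layer.

```python
def normalise_positions(points, lo, hi):
    """Map 3D coordinates into [0, 1]; background texels (None) become zeros."""
    out = []
    for p in points:
        if p is None:
            out.append((0.0, 0.0, 0.0))
        else:
            out.append(tuple((c - lo) / (hi - lo) for c in p))
    return out

class ZeroScale:
    """Scalar stand-in for a zero-initialised convolution layer."""
    def __init__(self):
        self.weight = 0.0          # starts at zero and is learned later

    def __call__(self, features):
        return [self.weight * f for f in features]

def fuse(base_features, position_features, zero_layer):
    """Add the position branch to the frozen denoiser's features."""
    branch = zero_layer(position_features)
    return [b + z for b, z in zip(base_features, branch)]

texels = [(-1.0, 0.0, 1.0), None]        # one surface texel, one background
pos_map = normalise_positions(texels, lo=-1.0, hi=1.0)

layer = ZeroScale()
base = [0.2, -0.7, 1.5]                  # toy features of the frozen denoiser
cond = [3.0, 3.0, 3.0]                   # toy features of the position encoder
before_training = fuse(base, cond, layer)
layer.weight = 0.1                       # pretend training updated the weight
after_training = fuse(base, cond, layer)
print(pos_map, before_training, after_training)
```

Because the connecting weight starts at zero, the added encoder initially leaves the pre-trained denoiser's behavior unchanged, and the position signal only takes effect as that weight is learned.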

The model then uses the position encoder together with other conditional encoders to perform refinement tasks in the UV space. In this respect, the model has two refinement capabilities: UVHD (UV High Definition) and UV inpainting. The UVHD method is designed to enhance the visual appeal and aesthetics of the texture map. To achieve UVHD, the model uses an image enhancement encoder and a position encoder with the diffusion model. The UV inpainting method fills the texture holes directly within the UV plane, which avoids the self-occlusion issues that arise during rendering. In the refinement stage, the Paint3D model first performs UV inpainting and then performs UVHD to generate the final refined texture map. By integrating the two refinement methods, the Paint3D framework is able to produce complete, diverse, high-resolution, and lighting-less UV texture maps.
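The order of the two refinement steps can be sketched as a simple composition. Both functions below are crude stand-ins chosen only so the example runs: `uv_inpaint` fills holes with the mean of the known texels, and `uvhd` doubles the resolution by repeating texels. The real steps are diffusion-based, and treating UVHD as upsampling is an assumption of this sketch.

```python
def uv_inpaint(texture):
    """Stand-in for UV inpainting: fill holes (None) with the mean texel."""
    known = [t for t in texture if t is not None]
    mean = sum(known) / len(known)
    return [mean if t is None else t for t in texture]

def uvhd(texture):
    """Stand-in for UVHD: repeat each texel to double the resolution."""
    return [t for t in texture for _ in range(2)]

def refine(coarse):
    """Fill holes first, then enhance, matching the order described above."""
    return uvhd(uv_inpaint(coarse))

coarse = [50, None, 150, None]   # toy 1D texture with two holes
refined = refine(coarse)
print(refined)
```

Running inpainting first means the enhancement step always receives a complete texture, so no hole is carried into the final map.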

Paint3D : Experiments and Results

The Paint3D model employs the Stable Diffusion text2image model for texture generation tasks, and uses its image encoder component to handle image conditions. To further strengthen conditional controls such as image inpainting, depth, and image high definition, the Paint3D framework employs ControlNet domain encoders. The model is implemented in PyTorch, with rendering and texture projection implemented in Kaolin.

Text to Textures Comparison

To analyze its performance, we start by evaluating Paint3D’s texture generation when conditioned on textual prompts, and compare it against state-of-the-art frameworks including Text2Tex, TEXTure, and Latent-Paint. As can be observed in the following image, the Paint3D framework not only excels at generating high-quality texture details, but also synthesizes an illumination-free texture map reasonably well.

In comparison, the Latent-Paint framework is prone to generating blurry textures, which results in suboptimal visual effects. The TEXTure framework generates clear textures, but they lack smoothness and exhibit noticeable splicing and seams. Finally, the Text2Tex framework generates smooth textures remarkably well, but it does not match that performance when generating fine textures with intricate detailing.

The following image compares the Paint3D framework with state of the art frameworks quantitatively. 

As can be observed, the Paint3D framework outperforms all the existing models by a significant margin, with nearly 30% improvement in FID and approximately 40% improvement in KID over the baselines. These improvements in FID and KID scores demonstrate Paint3D’s ability to generate high-quality textures across diverse objects and categories.

Image to Texture Comparison

To evaluate Paint3D’s generative capabilities using visual prompts, we use the TEXTure model as the baseline. As mentioned earlier, the Paint3D model employs an image encoder sourced from the text2image model from Stable Diffusion. As can be seen in the following image, the Paint3D framework synthesizes exquisite textures remarkably well while maintaining high fidelity to the image condition.

On the other hand, the TEXTure framework is able to generate a texture similar to Paint3D’s, but it falls short of accurately representing the texture details in the image condition. Furthermore, as demonstrated in the following image, the Paint3D framework delivers better FID and KID scores than the TEXTure framework, with FID decreasing from 40.83 to 26.86 and KID dropping from 9.76 to 4.94.
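For a sense of scale, the reported scores can be turned into relative reductions. The short calculation below uses only the four numbers quoted above; FID and KID are lower-is-better metrics, so a reduction is an improvement. The helper name `relative_reduction` is ours.

```python
def relative_reduction(baseline, ours):
    """Fractional drop of a lower-is-better metric relative to the baseline."""
    return (baseline - ours) / baseline

fid_drop = relative_reduction(40.83, 26.86)
kid_drop = relative_reduction(9.76, 4.94)

print(f"FID reduced by {fid_drop:.1%}")   # about 34.2%
print(f"KID reduced by {kid_drop:.1%}")   # about 49.4%
```

In other words, in the image-conditioned setting the FID falls by roughly a third and the KID by roughly half relative to TEXTure.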

Final Thoughts

In this article, we have talked about Paint3D, a novel coarse-to-fine framework capable of producing lighting-less, diverse, and high-resolution 2K UV texture maps for untextured 3D meshes conditioned on either visual or textual inputs. The main highlight of the Paint3D framework is that it generates lighting-less, high-resolution 2K UV textures that remain semantically consistent with the image or text inputs they are conditioned on. Owing to its coarse-to-fine approach, the Paint3D framework produces lighting-less, diverse, and high-resolution texture maps, and delivers better performance than current state-of-the-art frameworks.