
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation 

Appreciation
9
Importance
8
Date Added
11.16.25
TLDR
Very impressive single 2D image -> 3D mesh generation (plus texturing). At its core is a diffusion transformer operating in the latent space of Hunyuan3D-ShapeVAE, which encodes and decodes 3D meshes. Generation starts from noisy latent tokens, and at each denoising step the transformer cross-attends to a DINO embedding of the input image as conditioning.
2 Cents
This was a very useful paper for going down different rabbit holes: how do VAEs really work compared to standard autoencoders? What are FPS, SDFs, and Marching Cubes? The diagrams are nice, and the results from the demo are shockingly good to me.
Tags

(Notes by Alex Huang)

  • Hunyuan3D-ShapeVAE: a transformer-based variational autoencoder over 3D mesh models. 1) Sample a surface point cloud both uniformly and by importance (at high-curvature regions and edges). 2) Run FPS (farthest point sampling) to find a small subset of points, each representing its neighbourhood. 3) Encode with cross-attention: the sampled subset forms the queries (Q), the entire sample set the keys/values (KV). 4) Train the encoder and decoder as stacks of self-attention blocks, with the variational latent space in the middle. 5) Decode with cross-attention: points of a voxelized 3D grid form the queries (Q), the latent representation in the residual stream the keys/values (KV). 6) Classify grid points as on the surface / inside based on the decoded field, and form the mesh with the Marching Cubes algorithm.
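Step 2's farthest point sampling is easy to sketch. A minimal NumPy version on toy data (sizes are hypothetical, not from the paper), greedily picking points that are maximally spread out:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: pick k indices such that chosen points are maximally
    spread out; each chosen point then 'represents' its neighbourhood."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]
    # dist[i] = distance from point i to its nearest already-chosen point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest point from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.asarray(chosen)

pts = np.random.default_rng(1).random((1000, 3))  # toy surface point cloud
idx = farthest_point_sampling(pts, 32)
print(idx.shape)  # (32,)
```

This greedy O(n·k) version is the textbook algorithm; production pipelines typically use a GPU implementation, but the selection logic is the same.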

(figure: Hunyuan3D-ShapeVAE architecture diagram)
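Step 6 hinges on a field that tells you which side of the surface each grid point is on. A toy NumPy example (a hand-written sphere SDF standing in for the decoder's output) showing the inside/outside classification that Marching Cubes then turns into a mesh:

```python
import numpy as np

# Toy signed distance field: a unit sphere sampled on a 32^3 voxel grid.
lin = np.linspace(-1.5, 1.5, 32)
x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 1.0  # < 0 inside, > 0 outside, 0 on surface

inside = sdf < 0  # boolean occupancy grid derived from the field
print(inside.sum(), "of", sdf.size, "voxels are inside")
```

Marching Cubes then extracts the zero level set of such a grid as a triangle mesh (e.g. via `skimage.measure.marching_cubes`); in the actual pipeline the field values come from the decoder's cross-attention output rather than an analytic formula.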

  • Hunyuan3D-DiT: a diffusion transformer that denoises in the latent representation space of Hunyuan3D-ShapeVAE, conditioned on the input image tokens.
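The conditioning mechanism is standard cross-attention: latent tokens are the queries, image tokens the keys/values. A minimal NumPy sketch of one such block (random weights, and all token counts/dimensions are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, image_tokens, d_k=64, seed=0):
    """Latent tokens (Q) attend to image tokens (K, V); residual update."""
    rng = np.random.default_rng(seed)
    d_lat, d_img = latents.shape[1], image_tokens.shape[1]
    Wq = rng.standard_normal((d_lat, d_k)) / np.sqrt(d_lat)
    Wk = rng.standard_normal((d_img, d_k)) / np.sqrt(d_img)
    Wv = rng.standard_normal((d_img, d_lat)) / np.sqrt(d_img)
    Q, K, V = latents @ Wq, image_tokens @ Wk, image_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_latent, n_image) weights
    return latents + attn @ V               # residual add, as in a DiT block

rng = np.random.default_rng(1)
latents = rng.standard_normal((256, 64))       # hypothetical noisy latent tokens
image_tokens = rng.standard_normal((257, 768)) # hypothetical DINO patch tokens
out = cross_attention(latents, image_tokens)
print(out.shape)  # (256, 64)
```

Each denoising step runs the latents through a stack of such blocks (interleaved with self-attention and MLPs) to predict the update toward a clean latent, which the ShapeVAE decoder finally turns into a mesh.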