
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation 

Appreciation
9
Importance
8
Date Added
11.16.25
TLDR
Very impressive single 2D image -> 3D mesh generation (plus texturing). At its core is a diffusion transformer operating in the latent space of Hunyuan3D-ShapeVAE, which encodes and decodes 3D meshes. Generation starts from noisy latent tokens, and at each denoising step the transformer cross-attends to a DINO embedding of the input image as conditioning.
2 Cents
This was a very useful paper for going down different rabbit holes: how do VAEs really work compared to standard autoencoders? What are FPS, SDFs, and Marching Cubes? The diagrams are nice, and the results from the demo are shockingly good to me.
Tags

(Notes by Alex Huang)

  • Hunyuan3D-ShapeVAE: a transformer-based variational autoencoder over 3D mesh models. 1) Sample a surface point cloud both uniformly and by importance (at high-curvature regions and edges). 2) Run FPS (farthest point sampling) to find a small subset of points, each representing its neighbourhood. 3) Encode with cross-attention: the sampled subset forms the queries (Q), the entire sample set the keys/values (KV). 4) Train the encoder and decoder as stacks of self-attention blocks, with the variational latent space in the middle. 5) Decode with cross-attention: points of a voxelized 3D grid form the queries (Q), the latent representation in the residual stream the keys/values (KV). 6) Classify grid points as on the surface / inside based on the decoded field, and form the mesh with the Marching Cubes algorithm.
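Step 2's farthest point sampling is easy to sketch. A minimal NumPy version on toy data (sizes are hypothetical, not from the paper), greedily picking points that are maximally spread out:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: pick k indices such that chosen points are maximally
    spread out; each chosen point then 'represents' its neighbourhood."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]
    # dist[i] = distance from point i to its nearest already-chosen point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest point from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.asarray(chosen)

pts = np.random.default_rng(1).random((1000, 3))  # toy surface point cloud
idx = farthest_point_sampling(pts, 32)
print(idx.shape)  # (32,)
```

This greedy O(n·k) version is the textbook algorithm; production pipelines typically use a GPU implementation, but the selection logic is the same.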

(figure: Hunyuan3D-ShapeVAE architecture diagram)
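Step 6 hinges on a field that tells you which side of the surface each grid point is on. A toy NumPy example (a hand-written sphere SDF standing in for the decoder's output) showing the inside/outside classification that Marching Cubes then turns into a mesh:

```python
import numpy as np

# Toy signed distance field: a unit sphere sampled on a 32^3 voxel grid.
lin = np.linspace(-1.5, 1.5, 32)
x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 1.0  # < 0 inside, > 0 outside, 0 on surface

inside = sdf < 0  # boolean occupancy grid derived from the field
print(inside.sum(), "of", sdf.size, "voxels are inside")
```

Marching Cubes then extracts the zero level set of such a grid as a triangle mesh (e.g. via `skimage.measure.marching_cubes`); in the actual pipeline the field values come from the decoder's cross-attention output rather than an analytic formula.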

  • Hunyuan3D-DiT: a diffusion transformer that denoises in the latent representation space of Hunyuan3D-ShapeVAE, conditioned on the input image tokens.
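The conditioning mechanism is standard cross-attention: latent tokens are the queries, image tokens the keys/values. A minimal NumPy sketch of one such block (random weights, and all token counts/dimensions are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, image_tokens, d_k=64, seed=0):
    """Latent tokens (Q) attend to image tokens (K, V); residual update."""
    rng = np.random.default_rng(seed)
    d_lat, d_img = latents.shape[1], image_tokens.shape[1]
    Wq = rng.standard_normal((d_lat, d_k)) / np.sqrt(d_lat)
    Wk = rng.standard_normal((d_img, d_k)) / np.sqrt(d_img)
    Wv = rng.standard_normal((d_img, d_lat)) / np.sqrt(d_img)
    Q, K, V = latents @ Wq, image_tokens @ Wk, image_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_latent, n_image) weights
    return latents + attn @ V               # residual add, as in a DiT block

rng = np.random.default_rng(1)
latents = rng.standard_normal((256, 64))       # hypothetical noisy latent tokens
image_tokens = rng.standard_normal((257, 768)) # hypothetical DINO patch tokens
out = cross_attention(latents, image_tokens)
print(out.shape)  # (256, 64)
```

Each denoising step runs the latents through a stack of such blocks (interleaved with self-attention and MLPs) to predict the update toward a clean latent, which the ShapeVAE decoder finally turns into a mesh.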