2025
Oishi Deb1,
Anjun Hu1,
Ashkan Khakzar1,2,
Philip Torr1,
Christian Rupprecht1
1University of Oxford,
2Google DeepMind
We propose Articulate3D, a method that reposes a 3D asset through language control. Despite advances in vision and language understanding, reposing a 3D asset from a text instruction remains a surprisingly difficult task.
To achieve this goal, we decompose the problem into two steps. First, we modify a powerful image generator to create target images conditioned on the original pose and a text instruction. We then align the mesh to the target images through a multi-view pose optimization step.
We introduce a self-attention rewiring mechanism RSActrl that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses.
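The sketch below is a minimal, hypothetical illustration of what such rewiring can look like in practice, assuming the target denoising pass augments its self-attention keys and values with hidden states cached from a source pass, so the target can copy structure from the source while the prompt drives the new pose. The class name and the concatenation strategy are illustrative assumptions, not our released implementation.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class RewiredSelfAttention(nn.Module):
    """Self-attention whose keys/values can be augmented with source-pass features (illustrative)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, source: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x:      (batch, tokens, dim) hidden states of the target (reposed) pass
        # source: (batch, tokens, dim) hidden states cached from the source pass
        q = self.to_q(x)
        # Rewiring: keys/values are drawn from both the target and the source tokens,
        # so the target pass can copy structure and appearance from the source image.
        kv = x if source is None else torch.cat([x, source], dim=1)
        k, v = self.to_k(kv), self.to_v(kv)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            b, n, d = t.shape
            return t.view(b, n, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        b, h, n, dh = out.shape
        return self.to_out(out.transpose(1, 2).reshape(b, n, h * dh))
```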
We can then apply this modification to a multi-view diffusion model such as MVDream without the need for retraining. We observe that differentiable rendering provides an unreliable signal for articulation optimization; instead, we use keypoints to establish correspondences between input and target images.
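As a toy, hypothetical illustration of the keypoint-driven optimization, the snippet below fits a single one-degree-of-freedom joint angle so that 3D keypoints on the mesh, rotated about a joint pivot, reproject onto 2D keypoints taken from the target views across multiple cameras. The random cameras, keypoints, and single joint are placeholders; the point is that the objective is a multi-view keypoint reprojection error rather than a photometric differentiable-rendering loss.

```python
import torch


def project(points_3d: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
    """Pinhole projection: (N, 3) points with a (3, 4) camera matrix -> (N, 2) pixels."""
    homog = torch.cat([points_3d, torch.ones_like(points_3d[:, :1])], dim=1)  # (N, 4)
    proj = homog @ cam.T                                                      # (N, 3)
    return proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)


def rotate_about_pivot(points: torch.Tensor, angle: torch.Tensor, pivot: torch.Tensor) -> torch.Tensor:
    """Rotate points around the z-axis at a pivot (toy one-DoF joint)."""
    c, s = torch.cos(angle), torch.sin(angle)
    zero, one = torch.zeros_like(c), torch.ones_like(c)
    rot = torch.stack([torch.stack([c, -s, zero]),
                       torch.stack([s, c, zero]),
                       torch.stack([zero, zero, one])])
    return (points - pivot) @ rot.T + pivot


# Toy setup: 4 keypoints on one articulated part, 2 camera views (all placeholders).
rest_keypoints = torch.randn(4, 3)                    # 3D keypoints on the input mesh
pivot = torch.zeros(3)                                # joint location
cams = [torch.randn(3, 4), torch.randn(3, 4)]         # per-view camera matrices
target_2d = [torch.randn(4, 2), torch.randn(4, 2)]    # keypoints from generated target views

angle = torch.zeros((), requires_grad=True)           # articulation parameter to optimize
opt = torch.optim.Adam([angle], lr=0.05)

for step in range(200):
    opt.zero_grad()
    posed = rotate_about_pivot(rest_keypoints, angle, pivot)
    # Multi-view reprojection loss over the keypoint correspondences.
    loss = sum(torch.mean((project(posed, cam) - tgt) ** 2) for cam, tgt in zip(cams, target_2d))
    loss.backward()
    opt.step()
```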
Our method works on a variety of 3D meshes of different shapes and sizes.
Our method consists of two main steps: (1) generating text-conditioned target views of the reposed object with the RSActrl-modified multi-view diffusion model, and (2) multi-view pose optimization that aligns the mesh to these targets via keypoint correspondences. An illustrative sketch of this flow is shown below.
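Below is a purely illustrative end-to-end sketch of this flow; every function in it is a hypothetical placeholder returning dummy data, standing in for the components described above rather than the released Articulate3D code.

```python
import torch


def render_views(mesh: dict, n_views: int = 4) -> torch.Tensor:
    # Placeholder: render the input mesh in its original pose from several cameras.
    return torch.zeros(n_views, 3, 256, 256)


def generate_target_views(source_views: torch.Tensor, prompt: str) -> torch.Tensor:
    # Placeholder: RSActrl-modified multi-view diffusion (e.g. MVDream) producing
    # reposed target views that keep the source structure.
    return torch.zeros_like(source_views)


def match_keypoints(source_views: torch.Tensor, target_views: torch.Tensor) -> torch.Tensor:
    # Placeholder: 2D keypoint correspondences between source and target views,
    # shaped (views, keypoints, source/target, xy).
    return torch.zeros(source_views.shape[0], 16, 2, 2)


def optimize_articulation(mesh: dict, correspondences: torch.Tensor) -> dict:
    # Placeholder: multi-view pose optimization driven by the correspondences.
    return mesh


def articulate3d(mesh: dict, prompt: str) -> dict:
    source_views = render_views(mesh)                            # step 1a: render the input pose
    target_views = generate_target_views(source_views, prompt)   # step 1b: text-driven target views
    correspondences = match_keypoints(source_views, target_views)
    return optimize_articulation(mesh, correspondences)          # step 2: align the mesh to the targets


posed_mesh = articulate3d({"vertices": torch.zeros(0, 3)}, "raise the front leg")
```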
Below we show the posed output produced for the following user prompt.
Below we show results for user text prompts on various meshes. The left side of each arrow shows the input mesh, and the right side shows the final posed output.
Here, we showcase the animations generated based on user prompts.
We would like to thank Hirokatsu Kataoka, Minghao Chen, Orest Kupyn, Paul Engstler, David Fan and Zheng Xing for insightful and technical discussions.
@InProceedings{deb2025articulate3d,
  author    = {Oishi Deb and Anjun Hu and Ashkan Khakzar and Philip Torr and Christian Rupprecht},
  title     = {Articulate3D: Zero-Shot Text-Driven 3D Object Posing},
  booktitle = {xx},
  month     = {xx},
  year      = {2025},
  pages     = {xx}
}