2025
Oishi Deb1,
Anjun Hu1,
Ashkan Khakzar1,2,
Philip Torr1,
Christian Rupprecht1
1University of Oxford,
2Google DeepMind
We propose Articulate3D, a method that reposes a 3D asset through language control. Despite advances in vision and language understanding, reposing a 3D asset from a text instruction remains a surprisingly difficult task.
To achieve this goal, we decompose the problem into two steps. First, we modify a powerful image generator to create target images conditioned on the original pose and a text instruction. We then align the mesh to the target images through a multi-view pose optimization step.
We introduce a self-attention rewiring mechanism RSActrl that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses.
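The sketch below is a minimal, hypothetical illustration of what such rewiring can look like in practice, assuming the target denoising pass augments its self-attention keys and values with hidden states cached from a source pass, so the target can copy structure from the source while the prompt drives the new pose. The class name and the concatenation strategy are illustrative assumptions, not our released implementation.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class RewiredSelfAttention(nn.Module):
    """Self-attention whose keys/values can be augmented with source-pass features (illustrative)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, source: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x:      (batch, tokens, dim) hidden states of the target (reposed) pass
        # source: (batch, tokens, dim) hidden states cached from the source pass
        q = self.to_q(x)
        # Rewiring: keys/values are drawn from both the target and the source tokens,
        # so the target pass can copy structure and appearance from the source image.
        kv = x if source is None else torch.cat([x, source], dim=1)
        k, v = self.to_k(kv), self.to_v(kv)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            b, n, d = t.shape
            return t.view(b, n, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        b, h, n, dh = out.shape
        return self.to_out(out.transpose(1, 2).reshape(b, n, h * dh))
```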
We can then apply this modification to a multi-view diffusion model such as MVDream without the need for retraining. We observe that differentiable rendering provides an unreliable signal for articulation optimization; instead, we use keypoints to establish correspondences between input and target images.
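As a toy, hypothetical illustration of the keypoint-driven optimization, the snippet below fits a single one-degree-of-freedom joint angle so that 3D keypoints on the mesh, rotated about a joint pivot, reproject onto 2D keypoints taken from the target views across multiple cameras. The random cameras, keypoints, and single joint are placeholders; the point is that the objective is a multi-view keypoint reprojection error rather than a photometric differentiable-rendering loss.

```python
import torch


def project(points_3d: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
    """Pinhole projection: (N, 3) points with a (3, 4) camera matrix -> (N, 2) pixels."""
    homog = torch.cat([points_3d, torch.ones_like(points_3d[:, :1])], dim=1)  # (N, 4)
    proj = homog @ cam.T                                                      # (N, 3)
    return proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)


def rotate_about_pivot(points: torch.Tensor, angle: torch.Tensor, pivot: torch.Tensor) -> torch.Tensor:
    """Rotate points around the z-axis at a pivot (toy one-DoF joint)."""
    c, s = torch.cos(angle), torch.sin(angle)
    zero, one = torch.zeros_like(c), torch.ones_like(c)
    rot = torch.stack([torch.stack([c, -s, zero]),
                       torch.stack([s, c, zero]),
                       torch.stack([zero, zero, one])])
    return (points - pivot) @ rot.T + pivot


# Toy setup: 4 keypoints on one articulated part, 2 camera views (all placeholders).
rest_keypoints = torch.randn(4, 3)                    # 3D keypoints on the input mesh
pivot = torch.zeros(3)                                # joint location
cams = [torch.randn(3, 4), torch.randn(3, 4)]         # per-view camera matrices
target_2d = [torch.randn(4, 2), torch.randn(4, 2)]    # keypoints from generated target views

angle = torch.zeros((), requires_grad=True)           # articulation parameter to optimize
opt = torch.optim.Adam([angle], lr=0.05)

for step in range(200):
    opt.zero_grad()
    posed = rotate_about_pivot(rest_keypoints, angle, pivot)
    # Multi-view reprojection loss over the keypoint correspondences.
    loss = sum(torch.mean((project(posed, cam) - tgt) ** 2) for cam, tgt in zip(cams, target_2d))
    loss.backward()
    opt.step()
```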
Our method works on a variety of 3D meshes of different shapes and sizes.
Our method consists of two main steps: (1) generating text-conditioned target views of the reposed object with the RSActrl-modified multi-view diffusion model, and (2) multi-view pose optimization that aligns the mesh to these targets via keypoint correspondences. An illustrative sketch of this flow is shown below.
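Below is a purely illustrative end-to-end sketch of this flow; every function in it is a hypothetical placeholder returning dummy data, standing in for the components described above rather than the released Articulate3D code.

```python
import torch


def render_views(mesh: dict, n_views: int = 4) -> torch.Tensor:
    # Placeholder: render the input mesh in its original pose from several cameras.
    return torch.zeros(n_views, 3, 256, 256)


def generate_target_views(source_views: torch.Tensor, prompt: str) -> torch.Tensor:
    # Placeholder: RSActrl-modified multi-view diffusion (e.g. MVDream) producing
    # reposed target views that keep the source structure.
    return torch.zeros_like(source_views)


def match_keypoints(source_views: torch.Tensor, target_views: torch.Tensor) -> torch.Tensor:
    # Placeholder: 2D keypoint correspondences between source and target views,
    # shaped (views, keypoints, source/target, xy).
    return torch.zeros(source_views.shape[0], 16, 2, 2)


def optimize_articulation(mesh: dict, correspondences: torch.Tensor) -> dict:
    # Placeholder: multi-view pose optimization driven by the correspondences.
    return mesh


def articulate3d(mesh: dict, prompt: str) -> dict:
    source_views = render_views(mesh)                            # step 1a: render the input pose
    target_views = generate_target_views(source_views, prompt)   # step 1b: text-driven target views
    correspondences = match_keypoints(source_views, target_views)
    return optimize_articulation(mesh, correspondences)          # step 2: align the mesh to the targets


posed_mesh = articulate3d({"vertices": torch.zeros(0, 3)}, "raise the front leg")
```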
Below we show the posed output produced for the following user prompt.
Below we show results for user text prompts on various meshes. The left side of each arrow shows the input mesh, and the right side shows the final posed output.
Here, we showcase the animations generated based on user prompts.
We would like to thank Hirokatsu Kataoka, Minghao Chen, Orest Kupyn, Paul Engstler, David Fan and Zheng Xing for insightful and technical discussions.
@InProceedings{deb2025articulate3d,
  author    = {Oishi Deb and Anjun Hu and Ashkan Khakzar and Philip Torr and Christian Rupprecht},
  title     = {Articulate3D: Zero-Shot Text-Driven 3D Object Posing},
  booktitle = {xx},
  month     = {xx},
  year      = {2025},
  pages     = {xx}
}