CoFi: Coarse-to-Fine Compositional Diffusion
for Long-Horizon Planning

Byoungwoo Park1,2, Utkarsh A. Mishra2, Jaemoo Choi2, Juho Lee1, Yongxin Chen2

1KAIST    2Georgia Institute of Technology

TL;DR

CoFi is a simple, training-free, inference-time compositional diffusion sampler that scales short-horizon diffusion models to long-horizon planning, panoramas, and videos, using 2–8× fewer denoiser evaluations than the CDGS baseline.

Long-Horizon Results

Long Video Generation

CogVideoX-2B generates 49-frame clips by default. CoFi composes 9 temporal chunks into a 273-frame 720p video, extending the temporal horizon by roughly 5.6× (273 / 49) without retraining the base model.
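To make the chunk arithmetic concrete, here is a minimal sketch of how 9 windows of 49 frames could tile 273 frames. It assumes a uniform stride between chunks; the page does not state the actual overlap, nor whether composition happens in pixel or latent space, and `chunk_windows` is a hypothetical helper, not part of CoFi.

```python
def chunk_windows(num_chunks, chunk_len, total_len):
    """Return (start, end) frame windows with a uniform stride.

    Assumption: chunks are evenly spaced and the last one ends exactly
    at total_len. The real chunk layout may differ.
    """
    if num_chunks == 1:
        return [(0, chunk_len)]
    stride, remainder = divmod(total_len - chunk_len, num_chunks - 1)
    if remainder:
        raise ValueError("total_len is not reachable with a uniform stride")
    return [(i * stride, i * stride + chunk_len) for i in range(num_chunks)]


windows = chunk_windows(num_chunks=9, chunk_len=49, total_len=273)
stride = windows[1][0] - windows[0][0]
overlap = 49 - stride
print(stride, overlap, windows[-1])  # 28 21 (224, 273)
```

Under this assumption, neighboring chunks would share 21 frames, which is the region where their local predictions have to agree.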

A cute happy panda, dressed in a small red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest, strumming a miniature acoustic guitar...
The camera follows behind a white vintage SUV speeding up a steep dirt road on a mountain slope...
A group of colorful hot air balloons taking off in Cappadocia...
A young woman with beautiful and clear eyes in a forest wearing a crown of flowers...
At sunset a modified Ford F-150 Raptor races through a desert landscape...
A detailed wooden toy ship with intricate carvings sails on a carpet sea...

Long-Horizon Robotic Planning

CoFi composes short local plans into long-horizon trajectories on OGBench. Each colored segment corresponds to one local plan, and the final trajectory is obtained by aligning and refining these segments into a coherent task-level global plan.

Maze-Stitch

Task 1
Task 2
Task 3
Task 4
Task 5

Scene-Play

Legend: Start, Goal

Task 1
Task 2
Task 3
Task 4

Panoramic Image Generation

CoFi composes 9 overlapping 512×512 patches into a 512×4608 panorama while maintaining consistent style, texture, and layout.

Method

CoFi separates long-horizon composition into global structure formation and local detail recovery.

Scaffold

Generated scaffold: globally coherent but locally blurred

Stage 1: Coarse Scaffold Construction

CoFi denoises all local plans in parallel and pulls their clean estimates toward a shared scaffold. This stage resolves long-range structure first, producing a coarse but globally aligned plan.

Refinement

Final output: globally coherent with fine local detail

Stage 2: Structure-Preserving Refinement

CoFi re-noises the coarse scaffold to an intermediate timestep and denoises it again with the same pretrained local prior. This second pass restores local detail while keeping the global arrangement fixed.

Key insight: CoFi first fixes the global scaffold, then spends a short second pass on local refinement. This gives both global coherence and local quality with only T + t* denoiser evaluations (T for scaffold construction plus t* for refinement), which is 2–8× fewer than the CDGS baseline.
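The two stages can be sketched as a toy one-dimensional sampler. This is an illustration of the control flow only, not the authors' implementation: the overlap-averaging rule used to form the scaffold, the linear noise schedule, the `pull` weight, and the names `cofi_toy` and `overlap_average` are all assumptions, and `denoise` stands in for a pretrained local prior. One batched pass over all chunks is counted as one denoiser evaluation.

```python
import random


def overlap_average(windows, pieces, length):
    """Merge per-chunk values into one sequence, averaging where windows overlap."""
    total = [0.0] * length
    count = [0] * length
    for (start, _), piece in zip(windows, pieces):
        for i, value in enumerate(piece):
            total[start + i] += value
            count[start + i] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


def cofi_toy(denoise, windows, length, T, t_star, pull=0.5, seed=0):
    """Toy coarse-to-fine sampler; returns (scaffold, final, nfe)."""
    rng = random.Random(seed)
    nfe = 0

    # Stage 1: denoise all chunks together and pull their clean
    # estimates toward a shared scaffold (assumed: overlap average).
    chunks = [[rng.gauss(0.0, 1.0) for _ in range(end - start)]
              for start, end in windows]
    scaffold = [0.0] * length
    for t in range(T, 0, -1):
        estimates = [denoise(chunk, t / T) for chunk in chunks]
        nfe += 1  # counted as one batched evaluation
        scaffold = overlap_average(windows, estimates, length)
        sigma = (t - 1) / T  # assumed linear noise schedule
        chunks = [
            [(1 - pull) * x + pull * scaffold[start + i]
             + sigma * rng.gauss(0.0, 1.0)
             for i, x in enumerate(estimate)]
            for (start, _), estimate in zip(windows, estimates)
        ]

    # Stage 2: re-noise the scaffold to level t_star, then denoise
    # again with the same local denoiser for t_star steps.
    sigma = t_star / T
    current = [x + sigma * rng.gauss(0.0, 1.0) for x in scaffold]
    for t in range(t_star, 0, -1):
        chunks = [current[start:end] for start, end in windows]
        estimates = [denoise(chunk, t / T) for chunk in chunks]
        nfe += 1
        clean = overlap_average(windows, estimates, length)
        sigma = (t - 1) / T
        current = [x + sigma * rng.gauss(0.0, 1.0) for x in clean]
    return scaffold, current, nfe


# Stand-in denoiser that simply shrinks its input (hypothetical).
toy_denoise = lambda chunk, level: [0.5 * x for x in chunk]
scaffold, final, nfe = cofi_toy(toy_denoise, [(0, 4), (2, 6)], 6, T=5, t_star=2)
print(nfe)  # 7, i.e. T + t_star
```

The evaluation counter makes the cost structure visible: the first loop contributes T evaluations and the second contributes t*, regardless of how many chunks are composed, as long as the chunks are evaluated as one batch.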

NFE–Performance Comparison

We compare performance against the number of function evaluations (NFE) across the three domains. CoFi improves the main coherence metrics in every domain while using 2–8× fewer NFE than CDGS. The scaffold construction stage uses the same number of denoiser evaluations as GSC, and the refinement stage adds only t* extra steps, giving a total cost of T + t*.

NFE vs Performance

Left: Robotic planning — CoFi achieves 96% success with 2.1× fewer NFE than CDGS. Center: Panoramic images — CoFi reduces Intra-LPIPS to 0.45 with 8× fewer NFE. Right: Long videos — CoFi reaches 94.1% subject consistency with 8× fewer NFE.

Faster and more coherent

Side-by-side comparison of 273-frame long video generation. CoFi better preserves subject appearance and scene layout over long temporal horizons while using 8× fewer denoiser evaluations.

CDGS (6516 NFE)
CoFi (810 NFE)

"A cute happy panda, dressed in a small red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest, strumming a miniature acoustic guitar..."
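As a quick check of the "8× fewer" figure, the ratio can be computed directly from the two NFE counts reported for this comparison. The helper name `nfe_speedup` is hypothetical.

```python
def nfe_speedup(baseline_nfe, method_nfe):
    """Ratio of baseline to method denoiser evaluations."""
    return baseline_nfe / method_nfe


ratio = nfe_speedup(baseline_nfe=6516, method_nfe=810)
print(f"{ratio:.2f}x fewer evaluations")  # 8.04x fewer evaluations
```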