CoFi: Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning

Long-Horizon Results

Long Video Generation

CogVideoX-2B generates 49-frame clips by default. CoFi composes 9 temporal chunks into a 273-frame 720p video, extending the temporal horizon by 5.6× without retraining the base model.
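The frame counts above can be checked with a little arithmetic. The sketch below assumes the 9 chunks are tiled with one uniform overlap between neighbours; the source does not state the overlap, so `implied_overlap` is a hypothetical helper and its result is an inference, not a reported setting.

```python
def implied_overlap(num_chunks: int, chunk_len: int, total_len: int) -> float:
    """Overlap per adjacent chunk pair, ASSUMING one uniform overlap.

    Solves total_len = num_chunks * chunk_len - (num_chunks - 1) * overlap.
    """
    return (num_chunks * chunk_len - total_len) / (num_chunks - 1)


# Numbers taken from the text: 9 chunks of 49 frames, 273 frames in total.
num_chunks, clip_frames, video_frames = 9, 49, 273

overlap = implied_overlap(num_chunks, clip_frames, video_frames)
extension = video_frames / clip_frames

print(f"implied overlap: {overlap:g} frames")
print(f"horizon extension: {extension:.1f}x")
```

Under the uniform-overlap assumption, an integer result indicates that the three reported numbers are mutually consistent.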

A cute happy panda, dressed in a small red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest, strumming a miniature acoustic guitar...

The camera follows behind a white vintage SUV speeding up a steep dirt road on a mountain slope...

A group of colorful hot air balloons taking off in Cappadocia...

A young woman with beautiful and clear eyes in a forest wearing a crown of flowers...

At sunset a modified Ford F-150 Raptor races through a desert landscape...

A detailed wooden toy ship with intricate carvings sails on a carpet sea...

Long-Horizon Robotic Planning

CoFi composes short local plans into long-horizon trajectories on OGBench. Each colored segment corresponds to one local plan, and the final trajectory is obtained by aligning and refining these segments into a coherent task-level global plan.

Maze-Stitch

Tasks 1–5

Scene-Play

Tasks 1–4 (start and goal states shown for each)

Panoramic Image Generation

CoFi composes 9 overlapping 512×512 patches into a 512×4608 panorama while maintaining consistent style, texture, and layout.

Last supper with cute corgis

A cinematic view of a castle in the sunset

A snowy mountain peak with skiers

Skyline of a futuristic city with flying cars

A lake under the northern lights

A beautiful ocean with coral reef

Natural landscape in anime style illustration

A beautiful landscape with mountains and a river

A forest with a misty fog

A city skyline at night

A beach with palm trees

A rock concert

Lush forest with a babbling brook

Mountain range at twilight

The Dolomites with red lava flowing through the valley

Cartoon panorama of spring summer beautiful nature

A beachside street under the sunset

A beach in La La Land style

Silhouette of a dreamy scene with shooting stars

A grassland with animals

Method

CoFi separates long-horizon composition into global structure formation and local detail recovery.

Generated scaffold: globally coherent but locally blurred

Stage 1: Coarse Scaffold Construction

CoFi denoises all local plans in parallel and pulls their clean estimates toward a shared scaffold. This stage resolves long-range structure first, producing a coarse but globally aligned plan.
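A minimal sketch of what such a stage could look like on a toy 1-D sequence follows. Everything in it is an assumption made for illustration: the scaffold is taken to be the average of overlapping clean estimates, the `pull` weight and the noise schedule are hypothetical, and `toy_denoiser` stands in for the pretrained local prior. The source does not specify any of these choices.

```python
import math
import random


def windows(total_len, chunk_len, stride):
    """Start/end indices of overlapping chunks (assumes they tile the sequence exactly)."""
    return [(s, s + chunk_len) for s in range(0, total_len - chunk_len + 1, stride)]


def toy_denoiser(x_t, alpha_bar):
    """Hypothetical local prior: E[x0 | x_t] when the data are unit Gaussian."""
    return [math.sqrt(alpha_bar) * v for v in x_t]


def build_scaffold(estimates, spans, total_len):
    """Average the per-chunk clean estimates wherever chunks overlap."""
    total = [0.0] * total_len
    count = [0] * total_len
    for est, (start, _end) in zip(estimates, spans):
        for i, v in enumerate(est):
            total[start + i] += v
            count[start + i] += 1
    return [t / c for t, c in zip(total, count)]


def coarse_scaffold(total_len=20, chunk_len=8, stride=4, steps=10, pull=0.5, seed=0):
    rng = random.Random(seed)
    spans = windows(total_len, chunk_len, stride)
    # Every chunk starts from its own independent noise.
    chunks = [[rng.gauss(0.0, 1.0) for _ in range(chunk_len)] for _ in spans]
    # Hypothetical schedule, increasing from noisy to nearly clean.
    alpha_bars = [(k + 1) / (steps + 1) for k in range(steps)]
    scaffold = [0.0] * total_len
    for k, ab in enumerate(alpha_bars):
        # One denoiser call per chunk; these calls are independent of each other.
        estimates = [toy_denoiser(c, ab) for c in chunks]
        scaffold = build_scaffold(estimates, spans, total_len)
        if k == steps - 1:
            break
        ab_next = alpha_bars[k + 1]
        new_chunks = []
        for x_t, est, (start, _end) in zip(chunks, estimates, spans):
            # Pull the chunk's clean estimate toward the shared scaffold.
            pulled = [(1 - pull) * e + pull * scaffold[start + i] for i, e in enumerate(est)]
            # Deterministic DDIM-style move to the next noise level.
            eps = [(x - math.sqrt(ab) * p) / math.sqrt(1 - ab) for x, p in zip(x_t, pulled)]
            new_chunks.append(
                [math.sqrt(ab_next) * p + math.sqrt(1 - ab_next) * e for p, e in zip(pulled, eps)]
            )
        chunks = new_chunks
    return scaffold


scaffold = coarse_scaffold()
print(len(scaffold))
```

In this toy version the chunks agree in their overlaps only because they are pulled toward the shared average at every step; with `pull=0.0` each chunk would be denoised independently.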

Final output: globally coherent with fine local detail

Stage 2: Structure-Preserving Refinement

CoFi re-noises the coarse scaffold to an intermediate timestep and denoises it again with the same pretrained local prior. This second pass restores local detail while keeping the global arrangement fixed.
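The re-noise-then-denoise pass can be sketched on a toy 1-D signal as below. This is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the values of `T` and `t_star`, and `toy_denoiser` (a stand-in for the pretrained local prior) are all hypothetical. Because the toy denoiser acts elementwise, the sketch applies it to the whole sequence rather than chunk by chunk.

```python
import math
import random


def toy_denoiser(x_t, alpha_bar):
    """Hypothetical local prior: E[x0 | x_t] when the data are unit Gaussian."""
    return [math.sqrt(alpha_bar) * v for v in x_t]


def refine(scaffold, T=50, t_star=20, seed=0):
    """Re-noise `scaffold` to timestep t_star, then denoise back to t = 0."""
    rng = random.Random(seed)
    # Hypothetical linear schedule: alpha_bar[0] = 1 (clean), decreasing with t.
    alpha_bar = [1.0 - t / (T + 1) for t in range(T + 1)]

    # Forward-diffuse the scaffold to the intermediate noise level.
    ab = alpha_bar[t_star]
    x = [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for v in scaffold]

    nfe = 0  # number of denoiser evaluations spent in this pass
    for t in range(t_star, 0, -1):
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        x0_hat = toy_denoiser(x, ab_t)
        nfe += 1
        # Deterministic DDIM-style step from t to t - 1.
        eps = [(xi - math.sqrt(ab_t) * x0) / math.sqrt(1.0 - ab_t) for xi, x0 in zip(x, x0_hat)]
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e for x0, e in zip(x0_hat, eps)]
    return x, nfe


refined, nfe = refine([0.5] * 16)
print(len(refined), nfe)
```

The pass costs exactly `t_star` denoiser evaluations. In a sketch like this, a smaller `t_star` adds less noise and so leaves more of the scaffold untouched, while a larger one gives the denoiser more freedom to rewrite it.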

Key insight: CoFi first fixes the global scaffold, then spends a short second pass on local refinement. This yields both global coherence and local quality with only T + t* denoiser evaluations, 2–8× fewer than the CDGS baseline.

NFE–Performance Comparison

We compare performance against the number of function evaluations (NFE) across the three domains. CoFi improves the main coherence metrics in all three while using 2–8× fewer NFE than CDGS, substantially reducing inference cost. The scaffold construction stage uses the same number of denoiser evaluations as GSC, and the refinement stage adds only t* extra steps, giving a total cost of T + t*.

Left: Robotic planning — CoFi achieves 96% success with 2.1× fewer NFE than CDGS. Center: Panoramic images — CoFi reduces Intra-LPIPS to 0.45 with 8× fewer NFE. Right: Long videos — CoFi reaches 94.1% subject consistency with 8× fewer NFE.
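The cost accounting reduces to two small formulas, sketched below. The function names are hypothetical, and the `T` and `t_star` values in the last line are illustrative only, since the source does not report them. The 6516 and 810 NFE figures are the counts reported for the 273-frame video comparison.

```python
def cofi_nfe(T: int, t_star: int) -> int:
    """Total denoiser evaluations: T for the scaffold pass plus t_star for refinement."""
    return T + t_star


def nfe_reduction(baseline_nfe: int, method_nfe: int) -> float:
    """How many times fewer evaluations the method uses than the baseline."""
    return baseline_nfe / method_nfe


# Reported NFE for the long-video comparison: CDGS vs. CoFi.
video_reduction = nfe_reduction(6516, 810)
print(f"{video_reduction:.2f}x fewer evaluations")

# Illustrative values only; T and t_star are NOT taken from the source.
print(cofi_nfe(T=50, t_star=20))
```

The ratio of the two reported video NFE counts comes out slightly above 8, which matches the 8× figure quoted for long videos.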

Faster and more coherent

Side-by-side comparison of 273-frame long video generation. CoFi better preserves subject appearance and scene layout over long temporal horizons while using 8× fewer denoiser evaluations.

CDGS (6516 NFE)

CoFi (810 NFE)

"A cute happy panda, dressed in a small red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest, strumming a miniature acoustic guitar..."
