DEV Community

Furkan Gözükara

NVIDIA Labs developed SANA model weights and Gradio demo app published —Check out this amazing new Text to Image model by NVIDIA

Official repo : https://github.com/NVlabs/Sana

1-Click Windows, RunPod, Massed Compute installers and free Kaggle notebook : https://www.patreon.com/posts/116474081

You can follow the instructions in the repository to install and use it locally. I tested it on my Windows machines with RTX 3060 and RTX 3090 GPUs.

I also measured generation speed and VRAM usage.

Generation uses about 9.5 GB of VRAM, though one user reported it also works well on 8 GB GPUs.

Per-image generation speeds at default settings:

  • Free Kaggle account notebook on a T4 GPU: 15 seconds

  • RTX 3060 (12 GB): 9.5 seconds

  • RTX 3090: 4 seconds

  • RTX 4090: 2 seconds

More info : https://nvlabs.github.io/Sana/

It also works great on RunPod and Massed Compute (cloud).

Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

About Sana — taken from the official repo

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU. Core designs include:

  • Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.

  • Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.

  • Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance image-text alignment.

  • Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20× smaller and 100+× faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16 GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image. Sana enables content creation at low cost.
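The "32× compression, fewer latent tokens" claim above can be sanity-checked with simple arithmetic. The sketch below assumes one latent token per spatial position in the latent grid, which ignores any extra patchification the real model may apply:

```python
# Rough token-count arithmetic for the deep-compression autoencoder.
# Assumption: one latent token per spatial position of the latent grid;
# the actual model's tokenization details may differ.

def latent_tokens(image_size: int, scale_factor: int) -> int:
    """Number of latent positions after downscaling by `scale_factor`."""
    side = image_size // scale_factor
    return side * side

f8 = latent_tokens(1024, 8)    # traditional AE-F8: 128 * 128 = 16384 tokens
f32 = latent_tokens(1024, 32)  # Sana's AE-F32:      32 *  32 =  1024 tokens

print(f8, f32, f8 // f32)  # -> 16384 1024 16, i.e. 16x fewer tokens
```

The same ratio holds at any resolution, which is why the gap matters most for 4K: at 4096 × 4096, F8 would produce 262,144 latent tokens while F32 produces 16,384.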

Several Core Design Details for Efficiency

  • Deep Compression Autoencoder: We introduce a new Deep Compression Autoencoder (DC-AE) that aggressively increases the scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens, which is crucial for efficient training and generating ultra-high-resolution images, such as 4K resolution.

  • Efficient Linear DiT: We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N). Mix-FFN, with a 3×3 depth-wise convolution in the MLP, enhances the local information of tokens. Linear attention achieves results comparable to vanilla attention while improving 4K generation latency by 1.7×. Mix-FFN also removes the need for positional encoding (NoPE) without quality loss, making this the first DiT without positional embedding.
  • Decoder-only Small LLM as Text Encoder: We use Gemma, a decoder-only LLM, as the text encoder to enhance understanding and reasoning in prompts. Unlike CLIP or T5, Gemma offers superior text comprehension and instruction-following. We address training instability and design complex human instructions (CHI) to leverage Gemma's in-context learning, improving image-text alignment.

  • Efficient Training and Inference Strategy: We propose automatic labeling and training strategies to improve text-image consistency. Multiple VLMs generate diverse re-captions, and a CLIPScore-based strategy selects high-CLIPScore captions to enhance convergence and alignment. Additionally, our Flow-DPM-Solver reduces inference steps from 28–50 to 14–20 compared to the Flow-Euler-Solver, with better performance.
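The O(N²) → O(N) trick behind the Linear DiT bullet can be illustrated with a toy NumPy sketch (illustrative only, not Sana's actual kernel): with a non-negative feature map φ, attention softmax-free style can be reassociated from (φ(Q)φ(K)ᵀ)V into φ(Q)(φ(K)ᵀV), so cost grows linearly in sequence length N instead of quadratically:

```python
# Toy linear attention: reassociate (Q K^T) V as Q (K^T V) using a
# non-negative feature map (here ReLU), so cost is O(N * d^2), not O(N^2 * d).
# Illustrative sketch only -- not the actual kernel in Sana's Linear DiT.
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    phi_q, phi_k = np.maximum(Q, 0), np.maximum(K, 0)  # feature map
    kv = phi_k.T @ V                     # (d, d_v): summarizes all keys/values
    z = phi_q @ phi_k.sum(axis=0)        # (N,): per-query normalizer
    return (phi_q @ kv) / (z[:, None] + eps)

def quadratic_attention(Q, K, V, eps=1e-6):
    phi_q, phi_k = np.maximum(Q, 0), np.maximum(K, 0)
    scores = phi_q @ phi_k.T             # (N, N): the quadratic-cost matrix
    return (scores @ V) / (scores.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
N, d = 64, 16                            # sequence length, head dimension
Q, K, V = rng.normal(size=(3, N, d))
print(np.allclose(linear_attention(Q, K, V), quadratic_attention(Q, K, V)))
# -> True: same output, but the linear form never builds the N x N matrix
```

At 4K resolution the latent sequence is long enough that avoiding the N × N score matrix is what makes the 1.7× latency improvement possible.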

Overall Performance

We compare Sana with the most advanced text-to-image diffusion models in Table 1. For 512 × 512 resolution, Sana-0.6B demonstrates a throughput that is 5× faster than PixArt-Σ, which has a similar model size, and significantly outperforms it in FID, CLIP Score, GenEval, and DPG-Bench. For 1024 × 1024 resolution, Sana is considerably stronger than most models with <3B parameters and excels in inference latency. Our models achieve competitive performance even when compared to the most advanced large model, FLUX-dev. For instance, while the accuracy on DPG-Bench is equivalent and slightly lower on GenEval, Sana-0.6B's throughput is 39× faster, and Sana-1.6B's is 23× faster.
