DEV Community

Furkan Gözükara

NVIDIA Labs developed SANA model weights and Gradio demo app published —Check out this amazing new Text to Image model by NVIDIA

Official repo : https://github.com/NVlabs/Sana

1-Click Windows, RunPod, Massed Compute installers and free Kaggle notebook : https://www.patreon.com/posts/116474081

You can follow the instructions in the repository to install and use it locally. I tested it on my Windows machines with RTX 3060 and RTX 3090 GPUs.

I also measured generation speed and VRAM usage.

Generation uses about 9.5 GB of VRAM, though one user reported it also works well on 8 GB GPUs.

Per-image generation speeds at default settings:

  • Free Kaggle account notebook on a T4 GPU: 15 seconds

  • RTX 3060 (12 GB): 9.5 seconds

  • RTX 3090: 4 seconds

  • RTX 4090: 2 seconds

More info : https://nvlabs.github.io/Sana/

It also works great on RunPod and Massed Compute (cloud).

Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

About Sana — taken from the official repo

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU. Core designs include:

  • Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.

  • Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.

  • Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance image-text alignment.

  • Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20× smaller and 100+× faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16 GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image. Sana enables content creation at low cost.
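The "32× compression, fewer latent tokens" claim above can be sanity-checked with simple arithmetic. The sketch below assumes one latent token per spatial position in the latent grid, which ignores any extra patchification the real model may apply:

```python
# Rough token-count arithmetic for the deep-compression autoencoder.
# Assumption: one latent token per spatial position of the latent grid;
# the actual model's tokenization details may differ.

def latent_tokens(image_size: int, scale_factor: int) -> int:
    """Number of latent positions after downscaling by `scale_factor`."""
    side = image_size // scale_factor
    return side * side

f8 = latent_tokens(1024, 8)    # traditional AE-F8: 128 * 128 = 16384 tokens
f32 = latent_tokens(1024, 32)  # Sana's AE-F32:      32 *  32 =  1024 tokens

print(f8, f32, f8 // f32)  # -> 16384 1024 16, i.e. 16x fewer tokens
```

The same ratio holds at any resolution, which is why the gap matters most for 4K: at 4096 × 4096, F8 would produce 262,144 latent tokens while F32 produces 16,384.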

Several Core Design Details for Efficiency

  • Deep Compression Autoencoder: We introduce a new Deep Compression Autoencoder (DC-AE) that aggressively increases the scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens, which is crucial for efficient training and generating ultra-high-resolution images, such as 4K resolution.

  • Efficient Linear DiT: We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N). Mix-FFN, with a 3×3 depth-wise convolution in the MLP, enhances the local information of tokens. Linear attention achieves results comparable to vanilla attention while improving 4K generation latency by 1.7×. Mix-FFN also removes the need for positional encoding (NoPE) without quality loss, making this the first DiT without positional embedding.
  • Decoder-only Small LLM as Text Encoder: We use Gemma, a decoder-only LLM, as the text encoder to enhance understanding and reasoning in prompts. Unlike CLIP or T5, Gemma offers superior text comprehension and instruction-following. We address training instability and design complex human instructions (CHI) to leverage Gemma's in-context learning, improving image-text alignment.

  • Efficient Training and Inference Strategy: We propose automatic labeling and training strategies to improve text-image consistency. Multiple VLMs generate diverse re-captions, and a CLIPScore-based strategy selects high-CLIPScore captions to enhance convergence and alignment. Additionally, our Flow-DPM-Solver reduces inference steps from 28–50 to 14–20 compared to the Flow-Euler-Solver, with better performance.
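The O(N²) → O(N) trick behind the Linear DiT bullet can be illustrated with a toy NumPy sketch (illustrative only, not Sana's actual kernel): with a non-negative feature map φ, attention softmax-free style can be reassociated from (φ(Q)φ(K)ᵀ)V into φ(Q)(φ(K)ᵀV), so cost grows linearly in sequence length N instead of quadratically:

```python
# Toy linear attention: reassociate (Q K^T) V as Q (K^T V) using a
# non-negative feature map (here ReLU), so cost is O(N * d^2), not O(N^2 * d).
# Illustrative sketch only -- not the actual kernel in Sana's Linear DiT.
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    phi_q, phi_k = np.maximum(Q, 0), np.maximum(K, 0)  # feature map
    kv = phi_k.T @ V                     # (d, d_v): summarizes all keys/values
    z = phi_q @ phi_k.sum(axis=0)        # (N,): per-query normalizer
    return (phi_q @ kv) / (z[:, None] + eps)

def quadratic_attention(Q, K, V, eps=1e-6):
    phi_q, phi_k = np.maximum(Q, 0), np.maximum(K, 0)
    scores = phi_q @ phi_k.T             # (N, N): the quadratic-cost matrix
    return (scores @ V) / (scores.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
N, d = 64, 16                            # sequence length, head dimension
Q, K, V = rng.normal(size=(3, N, d))
print(np.allclose(linear_attention(Q, K, V), quadratic_attention(Q, K, V)))
# -> True: same output, but the linear form never builds the N x N matrix
```

At 4K resolution the latent sequence is long enough that avoiding the N × N score matrix is what makes the 1.7× latency improvement possible.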

Overall Performance

We compare Sana with the most advanced text-to-image diffusion models in Table 1. For 512 × 512 resolution, Sana-0.6B demonstrates a throughput that is 5× faster than PixArt-Σ, which has a similar model size, and significantly outperforms it in FID, CLIP Score, GenEval, and DPG-Bench. For 1024 × 1024 resolution, Sana is considerably stronger than most models with <3B parameters and excels in inference latency. Our models achieve competitive performance even when compared to the most advanced large model, FLUX-dev. For instance, while the accuracy on DPG-Bench is equivalent and slightly lower on GenEval, Sana-0.6B's throughput is 39× faster, and Sana-1.6B's is 23× faster.
