
Furkan Gözükara

Compared Effect Of Image Captioning For SDXL Fine-tuning / DreamBooth Training for a Single Person, 10.3 GB VRAM via OneTrainer

Check all the images below to see the training dataset and the captions used.

All training runs were done with OneTrainer on Windows 10, using the newest 10.3 GB configuration :

A quick tutorial on how to use concepts in OneTrainer :


The training dataset is deliberately a poor one, because most people cannot collect even this level of quality. I run my tests on a weak dataset to find settings that work well for the general public. If you improve the dataset by adding images with more varied backgrounds and clothing, you will get better results.

I used SG161222/RealVisXL_V4.0 as the base model and OneTrainer to train on Windows 10 :

The posted example x/y/z checkpoint comparison images are not cherry-picked, so with multiple tries I can get far better images.

Trained for 150 epochs on 15 images, using my ground truth set of 5200 regularization images :

In each epoch only 15 of the regularization images are used, which produces the DreamBooth training effect.

For the 10_3_GB config, the caption “ohwx man” is used for the training images and just “man” for the regularization images.

For WD_caption, I used Kohya GUI WD14 captioning and added the prefix ohwx,man, to every caption.

For the regularization images concept of WD_caption and kosmos_caption, just “man” is used.

For Kosmos-2 batch captioning I used our SOTA script collection. Kosmos-2 uses as little as 2 GB VRAM in 4-bit mode. You can download it and install it with 1 click here :

After Kosmos-2 batch captioning, I added the prefix photo of ohwx man, to all captions via Kohya GUI.

SOTA Image Captioning Scripts For Stable Diffusion: CogVLM, LLaVA, BLIP-2, Clip-Interrogator (115 Clip Vision Models + 5 Caption Models) :

You can download configs and full instructions of this OneTrainer training configuration here :

We have a slower and a faster configuration. Both give the same quality, and the slower configuration uses 10.3 GB VRAM.

Hopefully a full public tutorial is coming within 2 weeks. I will show the full configuration as well.

The tutorial will be on our channel :

Training speeds and the resulting durations are as follows:

RTX 3060, slow preset : 3.72 seconds / it. With 15 training images, 150 epochs, and * 2 for the regularization images concept, that is 15 * 150 * 2 = 4500 steps, so 4500 * 3.72 / 3600 = 4.65 hours

RTX 3090 TI, slow preset : 1.58 seconds / it, so 4500 * 1.58 / 3600 ≈ 2 hours

RTX 3090 TI, fast preset : 1.45 seconds / it, so 4500 * 1.45 / 3600 ≈ 1.8 hours
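To estimate the duration for your own GPU or dataset size, the same arithmetic can be written as a small Python sketch. The function name `training_hours` is made up for illustration, and the code assumes one image per step, which is what the step counts above imply.

```python
def training_hours(train_images, epochs, seconds_per_step, concepts=2):
    """Return (total_steps, hours) for a run.

    Assumes one image per step. `concepts=2` accounts for the
    regularization images concept, which doubles the steps per epoch.
    """
    steps = train_images * epochs * concepts
    hours = steps * seconds_per_step / 3600
    return steps, hours


# Speeds reported in this post, in seconds per iteration.
presets = [
    ("RTX 3060, slow preset", 3.72),
    ("RTX 3090 TI, slow preset", 1.58),
    ("RTX 3090 TI, fast preset", 1.45),
]

for name, seconds_per_step in presets:
    steps, hours = training_hours(15, 150, seconds_per_step)
    print(f"{name}: {steps} steps, about {hours:.1f} hours")
```

Swap in your own measured seconds per iteration to get an estimate before starting a run.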


Captioning reduces likeness and brings almost no benefit when training a person with a medium-quality dataset like this one. However, if you train an object or a style, captioning can be very beneficial, so it depends on your purpose.
