This is really exciting to see. I applaud Stability AI's commitment to open sour...

ollin · on March 5, 2024

They use three text encoders to encode the caption:

1. CLIP-G/14 (OpenCLIP)

2. CLIP-L/14 (OpenAI)

3. T5-v1.1-XXL (Google)

They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".

MrCheeze · on March 5, 2024

One of the diagrams says they're using CLIP-G/14 and CLIP-L/14, which are the names of two OpenCLIP models - meaning they're not using OpenAI's CLIP.

MrCheeze · on March 5, 2024

I have just been informed that my above comment is false, the CLIP-L is in fact referring to OpenAI's, despite that also being the name of an OpenCLIP model.