This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible.
There was one thing I was curious about... I skimmed through the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use OpenAI's CLIP for tokenization and text embeddings? I would naively assume they'd try to upgrade this part of the architecture to get better adherence to text and image prompts.
They use three text encoders to encode the caption:
1. CLIP-G/14 (OpenCLIP)
2. CLIP-L/14 (OpenAI)
3. T5-v1.1-XXL (Google)
They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
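For anyone wondering what "randomly disable encoders" could look like in practice, here is a rough sketch. It is not Stability's code. The names `sample_encoder_mask` and `apply_mask`, the `drop_prob` parameter, and the choice to zero out a dropped encoder's embedding are all my own assumptions for illustration.

```python
import random

# Placeholder labels for the three text encoders listed above.
ENCODERS = ("clip_g", "clip_l", "t5_xxl")


def sample_encoder_mask(drop_prob, rng):
    """Decide independently, per encoder, whether it stays active for one sample.

    drop_prob is a hypothetical hyperparameter, not a value quoted from the paper.
    """
    return {name: rng.random() >= drop_prob for name in ENCODERS}


def apply_mask(embeddings, mask):
    """Zero out the embedding of every dropped encoder (assumed behaviour).

    embeddings maps encoder name -> list of floats; shapes are left unchanged
    so the downstream model always sees the same input layout.
    """
    return {
        name: (vec if mask[name] else [0.0] * len(vec))
        for name, vec in embeddings.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)
    fake = {name: [1.0, 2.0, 3.0] for name in ENCODERS}  # stand-in embeddings
    mask = sample_encoder_mask(drop_prob=0.5, rng=rng)
    print(mask)
    print(apply_mask(fake, mask))
```

Because the model sees every combination of present and missing encoders during training, at inference you could pass a fixed mask (say, T5 off) and it should still cope, which is the "any subset" property described above.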
I have just been informed that my comment above is wrong: the CLIP-L here does in fact refer to OpenAI's model, even though an OpenCLIP model shares that name.