Stability AI has just announced the early preview of its next-generation image tool, Stable Diffusion 3 (SD3), calling it its “most capable text-to-image model” to date. The announcement is a solid follow-up to the company’s release of Stable Diffusion XL (SDXL) last year, which quickly established itself as the most advanced open-source image generator.
The headline improvements in SD3 are better text generation, stronger prompt adherence, and resistance to prompt leaking, all of which help ensure that generated images match what was requested. Stability AI also highlighted SD3's support for multimodal input, which it promised to demonstrate in a forthcoming technical report.
The AI community has responded with enthusiasm to the SD3 news.
“This AI image generator is the best we’ve ever seen in terms of prompt understanding and text generation,” said MattVidPro, a prominent AI-focused YouTuber. “It is leaps above the rest, and it’s truly mind blowing.”
Similarly, machine learning engineer Ralph Brooks said the model’s text generation capabilities were “amazing.”
Side-by-side showdown
Although Stable Diffusion 3 is only available to select partners right now, Stability AI and AI enthusiasts are sharing comparisons between its output and results from the same prompts run through SDXL, MidJourney, and DALL-E 3. By all appearances, SD3 outperforms its competitors in overall quality, and Decrypt ran some of its own tests to verify this. The results speak for themselves:
SD3 vs MidJourney
Prompt: “Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says ‘Stable Diffusion 3’ made out of colorful energy.”
In our first comparison, SD3 followed the prompt very closely. MidJourney failed to render the requested text, didn't generate a mountain, and its wizard was not casting a cosmic spell.
SD3 vs ImageFX
Prompt: “Photo of a 90’s desktop computer on a work desk. On the computer screen it says ‘welcome.’ On the wall in the background we see beautiful graffiti with the text ‘SD3’ very large on the wall.”
In our second comparison, SD3 followed the prompt with remarkable adherence. Google's top AI image generator, ImageFX, suffered from prompt leaking: it placed the text "SD3" on the computer screen instead of in the background, ignored the request for graffiti styling, and omitted the word "welcome" entirely.
The SD3 output also looks more like a genuine photograph and less like an obvious "photorealistic" render. Note, however, the artifacts around the pencil holder and other items, which appear to blend into the background.
SD3 vs SDXL
Prompt: “Resting on the kitchen table is an embroidered cloth with the text ‘good night’ and an embroidered baby tiger. Next to the cloth there is a lit candle. The lighting is dim and dramatic.”
In our third comparison, both Stable Diffusion 3 and Stable Diffusion XL captured the essence of the prompt, but SDXL failed to generate the text and suffered from prompt leaking, producing two cloths, one of which morphed into something else. SD3 also rendered the embroidered baby tiger more convincingly.
SD3 vs DALL-E 3
Prompt: “A painting of an astronaut riding a pig wearing a tutu holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, in the corner are the words ‘stable diffusion.’”
Stable Diffusion 3 generated what the prompt requested, whereas DALL-E 3 failed to generate the text, produced a 3D render instead of a painting, and added a galaxy background simply because the prompt mentioned an astronaut.
Beneath the hood
In theory, Stable Diffusion 3 has the architecture to back up its claims of power and prowess.
“[SD3] uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements,” Emad Mostaque, CEO of Stability AI, said on Twitter. Sora is the text-to-video generator OpenAI announced a few days earlier. Flow matching, meanwhile, is a generative modeling technique that offers faster, more stable training and inference than alternatives such as generative adversarial networks (GANs).
Some notes:
– This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.
– This takes advantage of transformer improvements & can not only scale further but accept multimodal inputs.
– More technical details soon.
Emad Mostaque (@EMostaque), February 22, 2024
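Stability AI hasn't published SD3's actual training objective yet, but the flow matching idea Mostaque references can be illustrated in a few lines. In this minimal NumPy sketch, `toy_velocity_model` is a hypothetical stand-in for the diffusion transformer; the loss regresses the model's predicted velocity onto the velocity of a straight-line path from noise to data, which is what makes training simpler and more stable than adversarial setups:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_velocity_model(x_t, t):
    # Hypothetical placeholder: in SD3 this would be a large diffusion
    # transformer predicting a velocity field from the noisy sample and time.
    return np.zeros_like(x_t)

def flow_matching_loss(x1, model, rng):
    """One flow matching training step on a batch of data samples x1."""
    x0 = rng.standard_normal(x1.shape)      # pure-noise starting points
    t = rng.uniform(size=(x1.shape[0], 1))  # random time in [0, 1] per sample
    x_t = (1.0 - t) * x0 + t * x1           # point on the straight noise->data path
    target_v = x1 - x0                      # constant velocity of that path
    pred_v = model(x_t, t)
    return np.mean((pred_v - target_v) ** 2)  # simple regression, no adversary

batch = rng.standard_normal((16, 8))  # toy "images" as 8-dimensional vectors
loss = flow_matching_loss(batch, toy_velocity_model, rng)
print(loss)
```

In a real system the model would be trained by gradient descent on this loss, and generation would integrate the learned velocity field from noise to an image; the sketch only shows why the objective is a plain regression.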
Stability AI claims these improvements boost the model's scalability and its ability to accept multimodal inputs, and pave the way for applications in video, 3D, and more. Mostaque tweeted that his vision for SD3 includes a comprehensive ecosystem of tools designed to leverage the latest hardware advancements while remaining accessible and adaptable across various creative domains.
A week prior to the SD3 announcement, Stability AI released Stable Cascade. Unlike its predecessors, Stable Cascade is based on the Würstchen architecture, known for its modularity and record-setting compression. Despite having more parameters than Stable Diffusion XL, Stable Cascade boasts faster inference times and superior prompt alignment, showcasing the innovative strides Stability AI continues to make in AI development.
Although Stable Diffusion 3 is not yet publicly available, Stability AI emphasized it would be free, open source, and available to all under a non-commercial license. However, enthusiasts can apply for preview access as part of Stability AI’s membership program.
Edited by Ryan Ozawa.