HOUSEKEEPING NOTE: Yes, I know, I know—I’m off on yet another tangent on this blog! Please know that I will continue to post “news and views on social VR, virtual worlds, and the metaverse” (as the tagline of the RyanSchultz.com blog states) in the coming months! However, over the next few weeks, I will be focusing a bit on the exciting new world of AI-generated art. Patience! 😉
Artificial intelligence (AI) tools that can create art from a natural-language text prompt are evolving at a pace that makes me a bit dizzy. Two years ago, if somebody had told me that you would be able to generate a convincing photograph or a detailed painting from a text description alone, I would have scoffed! Many felt that the realm of the artist or photographer would be among the last holdouts where a human being was necessary to produce good work. And yet, here we are, in mid-2022, with any number of public and private AI initiatives which can be used by both amateurs and professionals to generate stunning art!
In a recent interview of David Holz (co-founder of augmented reality hardware firm Magic Leap, and founder of Midjourney) by The Register‘s Thomas Claburn, there’s a brief explanation of how this burst of research and development activity got started:
The ability to create high-quality images from AI models using text input became a popular activity last year following the release of OpenAI’s CLIP (Contrastive Language–Image Pre-training), which was designed to evaluate how well generated images align with text descriptions. After its release, artist Ryan Murdock…found the process could be reversed – by providing text input, you could get image output with the help of other AI models.
After that, the generative art community embarked on a period of feverish exploration, publishing Python code to create images using a variety of models and techniques.
“Sometime last year, we saw that there were certain areas of AI that were progressing in really interesting ways,” Holz explained in an interview with The Register. “One of them was AI’s ability to understand language.”
Holz pointed to developments like transformers, a deep learning model that informs CLIP, and diffusion models, an alternative to GANs [models using Generative Adversarial Networks]. “The one that really struck my eye personally was the CLIP-guided diffusion,” he said, developed by Katherine Crowson…
If you need a (relatively) easy-to-understand explainer on how this new diffusion model works, well then, YouTube comes to your rescue with this video with 4 explanations at various levels of difficulty!
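For readers who would like a feel for what a diffusion model actually does, here is a minimal, illustrative Python sketch of the core idea (my own toy example on a single number, not any of these tools’ actual code): noise is added to the data step by step, and generation runs that process in reverse. In the real systems, a trained neural network predicts the noise; here I simply reuse the known noise to show the reconstruction math.

```python
# Toy sketch of the diffusion idea (illustrative only, not any tool's real code):
# add Gaussian noise over T steps, then invert the closed-form forward equation.
import numpy as np

rng = np.random.default_rng(0)

# A simple linear noise schedule: beta_t grows, so alpha-bar shrinks toward 0.
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x0 = 1.5  # our "clean image" is just a single scalar in this demo

def forward(x0, t, eps):
    # Forward (noising) process: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

eps = rng.standard_normal()
x_T = forward(x0, T - 1, eps)  # heavily noised version of x0

# A real model would *predict* eps from (x_T, T) with a neural network;
# here we cheat and reuse the true eps to invert the equation exactly.
x0_hat = (x_T - np.sqrt(1 - alpha_bars[T - 1]) * eps) / np.sqrt(alpha_bars[T - 1])
print(x0_hat)  # recovers x0 (up to float precision)
```

The entire trick of these tools is that last step: because the network has learned to estimate the noise at every step, it can start from pure random noise and gradually "denoise" its way to a brand-new image, with CLIP (or a similar text model) steering each step toward the text prompt.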
Before we get started, a few updates since my last blog post on A.I.-generated art: After using up my free Midjourney credits, I decided to purchase a US$10-a-month subscription to continue to play around with it. This is enough credit to generate approximately 200 images per month. Also, as a thank you for being among the early beta testers of DALL-E 2, the AI art-generation tool by OpenAI, they have awarded me 100 free credits to use. You can buy additional credits in 115-generation increments for US$15, but given the hit-or-miss nature of the results returned, this makes DALL-E 2 among the most expensive of the artificial intelligence art generators. It will be interesting to see if and how OpenAI will adjust their pricing as the newer competitors start to nip at their heels in this race!
And I can hardly believe my good fortune, because I have been accepted into the relatively small beta test group for a third AI text-to-art generation program! This new one is called Stable Diffusion, by Stability AI. Please note that if you were to try to get into the beta now, it’s probably too late; they have already announced that they have all the testers they need. I submitted my name 2-3 weeks ago, when I first heard about the project. Stable Diffusion is still available for researcher use, however.
Like Midjourney, Stable Diffusion uses a special Discord server with commands (instead of Midjourney’s /imagine, you use the command !dream, followed by a text description of what you want to see, plus you can add optional parameters to set the aspect ratio, the number of images returned, etc.). However, the Stable Diffusion team has already announced that they plan to move from Discord to a web-based interface like DALL-E 2 (we will be beta-testing that, too). Here’s a brief video glimpse of what the web interface could look like:
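For illustration, a request on the Stable Diffusion beta Discord looks something like the line below. (The flag names shown are illustrative assumptions on my part; check the server’s pinned help text for the exact current syntax, which may change during the beta.)

```
!dream a thatched cottage with lit windows by a lake, golden hour, highly detailed painting -W 1024 -H 512 -n 4
```

Here the optional arguments would set the output width and height in pixels and the number of images returned.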
Given that I am among the relatively few people who currently have access to all three of the top publicly-available AI art-generation tools, I thought it would be interesting to create a chart comparing and contrasting all three programs. Please note that I am neither an artist nor an expert in artificial intelligence, just a novice user of all three tools! Almost all of the information in this chart has been gleaned from the projects’ websites, online news reports, and the active subreddit communities for all three programs, where users post pictures and ask questions. Also, all three tools are constantly being updated, so this chart might go out-of-date very quickly (although I will make an attempt to update it).
| Name of Tool | DALL-E 2 | Midjourney | Stable Diffusion |
|---|---|---|---|
| AI Model Used | Diffusion | Diffusion | Diffusion |
| # Images Used to Train the AI | 400 million | “tens of millions” | 2 billion |
| User Interface | website | Discord | Discord (moving to website) |
| Cost to Use | credit system (115 for US$15) | subscription (US$10–30 per month) | currently free (beta) |
| Uses Text Prompts | yes | yes | yes |
| Can Add Optional Arguments | no | yes | yes |
| Generate Variations? | yes | yes | yes (using seeds) |
I have already shared a few images from my previous testing of DALL-E 2 and Midjourney here, here, and here, so I am not going to repost those images, but I wanted to share a couple of the first images I was able to create using Stable Diffusion (SD). To make these, I used the text prompt “a thatched cottage with lit windows by a lake in a lush green forest golden hour peaceful calm serene very highly detailed painting by thomas kinkade and albrecht bierstadt”:
I must admit that I am quite impressed by these pictures! I had asked SD for images with a height of 512 pixels and a width of 1024 pixels, but to my surprise, the second image was a wider one presented neatly in a white frame, which I cropped using my trusty SnagIt image editor! Also, it was not until after I submitted my prompt that I realized that the second artist’s name is actually ALBERT Bierstadt, not Albrecht! It doesn’t appear as if my typo made a big difference in the final output; perhaps for well-known artists, the last name alone is enough to indicate a desired art style?
Here are a few more samples of the kind of art which Stable Diffusion can create, taken from the pod-submissions thread on the SD Discord server:
You can see many more examples over at the r/StableDiffusion subreddit. Enjoy!
If you are curious about Stable Diffusion and want to learn more, there is a 1-1/2 hour podcast interview with Emad Mostaque, the founder of Stability AI (highly recommended!). You can also visit the Stability AI website, or follow them on social media: Twitter or LinkedIn.
I also wanted to submit the same text prompt to each of DALL-E 2, Midjourney, and Stable Diffusion, to see how the AI models in each would respond. Under each prompt you will see three square images: the first from DALL-E 2, the second from Midjourney, and the third from Stable Diffusion. (Click on each thumbnail image to see it in its full size on-screen.)
Text prompt: “the crowds at the Black Friday sales at Walmart, a masterpiece painting by Rembrandt van Rijn”
Note that none of the AI models are very good at getting the facial details correct for large crowds of people (all work better with just one face in the picture, like a portrait, although sometimes they struggle with matching eyes or hands). I would say that Midjourney is the clear winner here, although a longer, much more detailed prompt in DALL-E 2 or Stable Diffusion might have created an excellent picture.
Text prompt: “stunning breathtaking photo of a wood nymph with green hair and elf ears in a hazy forest at dusk. dark, moody, eerie lighting, brilliant use of glowing light and shadow. sigma 8.5mm f/1.4”
When I tried to generate a 1024-by-1024 image in Stable Diffusion, it kept giving me more than one wood nymph, even when I added words like “single” or “alone”, which is a known bug in the current early state of the program. I finally gave up and used a 512×512 image. The clear winner here is DALL-E 2, which has a truly impressive ability to mimic various camera styles and settings!
Text prompt: “a very highly detailed portrait of an African samurai by Tim Okamura”
In this case, the clear winner is Stable Diffusion with its incredible detail, even though, once again, I could not generate a 1024×1024 image because it kept giving me multiple heads! The DALL-E 2 image is a bit too stylized for my taste, and the Midjourney image, while nice, has eyes that don’t match (a common problem with all three tools).
And, if you enjoy this kind of thing, here’s a 15-minute YouTube video with 21 more head-to-head comparisons between Stable Diffusion, DALL-E 2, and Midjourney:
As I have said, all of this is happening so quickly that it is making my head spin! If anything, the research and development of these tools is only going to accelerate over time. And we are going to see this technology applied to more than still images! Witness a video shared on Twitter by Patrick Esser, an AI research scientist at Runway, where the entire scene around a tennis player is changed simply by editing a text prompt, in real time:
I expect I will be posting more later about these and other new AI art generation tools as they arise; stay tuned for updates!