Today I am going to talk about an AI model I have been enjoying lately: Dall-E mini. OpenAI has since released a successor called Dall-E 2, which is currently invitation-only, and there is apparently a long waitlist to get a chance to try it.
History
In early 2021, OpenAI released an AI model called Dall-E that generates images from text. You can find more details about the project here. It is recognized as one of the most groundbreaking projects in computer vision. However, a key problem with the model was that it needed a very powerful GPU with significant memory to run. The Craiyon team, led by Boris Dayma in collaboration with Hugging Face, built a much lighter model and named it Dall-E mini. They built the entire model in a couple of months and demoed it in July 2021. Since it is not an OpenAI project, it was later renamed Craiyon to eliminate confusion.
Currently there are two versions of the model. One, also called the mega version, is trained on a much larger dataset.
How does it work
Dall-E mini is built on JAX, which is itself a Google research project for high-performance numerical computing. JAX does just-in-time compilation, so it can optimize code for better performance. You can find a detailed discussion of how Dall-E mini works here.
I will not go into the details of how the model works, as the link above explains it fully.
Implementation
As mentioned above, Dall-E mini uses JAX. Since I already had torch installed, I looked for an implementation that uses the torch framework, and used the one from the GitHub page here. The only third-party dependencies are numpy, requests, pillow and torch. I installed all of these in a Python virtual environment using pip. All installed packages are listed below:
```
Package            Version
------------------ ---------
certifi            2022.6.15
charset-normalizer 2.1.0
idna               3.3
min-dalle          0.2.10
numpy              1.23.0
Pillow             9.2.0
pip                21.2.4
requests           2.28.1
setuptools         58.1.0
torch              1.12.0
typing_extensions  4.3.0
urllib3            1.26.9
```
This implementation lets you use both the mini and the mega versions of the model. The following code shows how to initialize the model.
```python
from min_dalle import MinDalle

class ImageGenerator:
    def __init__(self) -> None:
        self.model = MinDalle(is_mega=True, models_root='./data-mega')
```
In this case, I have set is_mega to True, which means I am using the mega version of the model. The library downloads the model into the specified directory if it does not already exist there.
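As an aside, that download-on-first-use behavior boils down to a cache-directory check. Here is a minimal sketch of the pattern; the function name and paths are illustrative, not part of min-dalle's API, and the sketch only mirrors the directory check, not the actual weight download.

```python
from pathlib import Path

def ensure_models_root(root: str) -> Path:
    """Create the model cache directory if it does not already exist.

    Illustrative only: min-dalle stores downloaded weights under
    models_root; this sketch covers just the existence check.
    """
    path = Path(root)
    path.mkdir(parents=True, exist_ok=True)  # no-op if already present
    return path
```

On the first call the directory is created (and the weights would be fetched into it); later calls find everything already in place, which is why subsequent runs start much faster.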
```python
from PIL import ImageDraw

def show_image(self, text, outfile, seed=10, grid_size=1):
    image = self.model.generate_image(text, seed=seed, grid_size=grid_size)
    width, height = image.size
    draw = ImageDraw.Draw(image)
    # Stamp the prompt text in the bottom-right corner of the image
    textwidth, textheight = draw.textsize(text)
    x = width - textwidth - 10
    y = height - textheight - 10
    draw.text((x, y), text, (255, 255, 255))
    image.save(outfile)
    image.show()
```
In this code, I am using the model to generate an image and stamp the prompt text onto it. The library also allows generating more than one image at a time.
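For context on that last point: passing grid_size=n to generate_image returns a single composite image containing n×n generations tiled into a grid. Assuming each generated tile is 256×256 pixels (an assumption about Dall-E mini's output resolution, not something the library documents in this post), the layout works out as:

```python
def grid_layout(grid_size: int, tile: int = 256):
    """Image count and composite pixel size for a given grid_size.

    Assumes each generated tile is `tile` pixels square; 256 here is an
    assumption about Dall-E mini's native output resolution.
    """
    count = grid_size * grid_size  # images tiled into the grid
    side = grid_size * tile        # composite width == height
    return count, (side, side)
```

So grid_size=1 yields one 256×256 image, while grid_size=3 would pack nine generations into a single 768×768 composite.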
Example Generations
I generated the following images using the mini version of the model.



1. A tree in an ocean
2. Eiffel tower flying in space
3. A monkey sitting on a banana couch
However, for most prompts, the mini version does not generate a convincing image. I found that the mega version works much better. See the samples below for images generated by the mega model.






1. Cat working on a computer
2. Dog on Himalayan summit
3. Gorilla reading a book
4. Monkey wearing a funny hat
5. Horse grazing in a beach
6. Elephant playing soccer
I found that faces are not a strong point of the model. Here is one for the laughs.

Marilyn Monroe eating a hot dog
Conclusion
This is a fun AI model, even for building your own emojis. The project itself stands as a monumental step forward in computer vision. I am now waiting for Dall-E 2, as it promises many more features than Dall-E. Hope you found this blog helpful. Ciao for now!