Artificial intelligence research group OpenAI has created a new version of DALL-E, its text-to-image generation program. DALL-E 2 features a higher-resolution, lower-latency version of the original system, which generates images from user-written descriptions. It also includes new capabilities, such as editing an existing image. As with previous OpenAI work, the tool isn't being released directly to the public. But researchers can sign up online to preview the system, and OpenAI hopes to make it available for use in third-party apps at a later date.
The original DALL-E, a portmanteau of the artist “Salvador Dalí” and the robot “WALL-E”, debuted in January 2021. It was a limited but fascinating test of AI's ability to visually represent concepts, from everyday images of a mannequin in a flannel shirt to “a giraffe made of turtle” or an illustration of a radish walking a dog. At the time, OpenAI said it would continue to build on the system while investigating potential dangers, such as bias in image generation or the production of misinformation. OpenAI is now attempting to address these issues using technical safeguards and new content policies, while also reducing the model's computing load and improving its basic capabilities.
One of the new DALL-E 2 features, inpainting, applies DALL-E's text-to-image capabilities at a more granular level. Users can start with an existing image, select an area, and tell the model to edit it. They can select a painting on a living room wall and replace it with a different picture, for example, or add a vase of flowers to a coffee table. The model can fill in (or remove) objects while taking into account details such as the direction of shadows in a room. Another feature, variations, is a sort of search tool for images that don't exist. Users can upload a starting image and then create a series of similar variations. They can also merge two images, generating images that contain elements of both. The generated images are 1,024 x 1,024 pixels, a jump from the 256 x 256 pixels provided by the original model.
DALL-E 2 builds on CLIP, a computer vision system that OpenAI also announced last year. “DALL-E 1 just took our GPT-3 approach from language and applied it to produce an image: we compressed images into a series of words and we just learned to predict what comes next,” says OpenAI research scientist Prafulla Dhariwal, referring to the GPT model used by many text AI apps. But the word combination didn't necessarily capture the qualities people thought were most important, and the predictive process limited the realism of the images. CLIP was designed to look at images and summarize their content the way a human would, and OpenAI iterated on this process to create “unCLIP”, an inverted version that starts with the description and works its way toward an image. DALL-E 2 generates the image using a process called diffusion, which Dhariwal describes as starting with a “bag of dots” and then filling in a pattern with progressively greater detail.
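To make the “bag of dots” description a little more concrete, here is a minimal toy sketch of refining random noise into a pattern over many small steps. It is an illustration only, not OpenAI's method: the function name `toy_refine`, the fixed `target` pattern, and the step schedule are all assumptions made for this example. In a real diffusion model, a trained neural network conditioned on the text description would predict each refinement step instead of being handed the answer.

```python
import random


def toy_refine(target, steps=50, seed=0):
    """Refine random noise toward `target` in small steps.

    Toy stand-in for diffusion-style sampling. The "denoiser" here
    cheats by knowing the target; a real model would have to predict
    the update from the noisy input and the text prompt.
    """
    rng = random.Random(seed)
    # Start from pure noise: the "bag of dots".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for t in range(steps):
        # Hypothetical schedule: move a growing fraction of the
        # remaining distance, reaching the target on the last step.
        alpha = 1.0 / (steps - t)
        x = [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]
        # Track the largest remaining deviation from the pattern.
        errors.append(max(abs(xi - ti) for xi, ti in zip(x, target)))
    return x, errors


if __name__ == "__main__":
    pattern = [0.0, 0.5, 1.0, 0.5, 0.0]  # stand-in for pixel values
    result, errors = toy_refine(pattern)
    print("final values:", [round(v, 3) for v in result])
    print("first-step error:", errors[0], "last-step error:", errors[-1])
```

The point of the sketch is only the shape of the process: the output begins as noise and the remaining error shrinks at every step, which mirrors the idea of adding detail progressively.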
Interestingly, a draft paper on unCLIP says it is partly resistant to a very funny weakness of CLIP: the fact that people can fool the model's identification abilities by labeling an object (like a Granny Smith apple) with a word that indicates something else (such as an iPod). The variation tool, the authors say, “still generates images of apples with high probability,” even when using a mislabeled image that CLIP cannot identify as a Granny Smith. Conversely, “the model never produces images of iPods, despite the very high relative predicted probability of this caption.”
DALL-E's full model was never released publicly, but other developers have honed their own tools over the past year that mimic some of its features. One of the most popular mainstream applications is Wombo's Dream mobile app, which generates images of whatever users describe in a variety of art styles. OpenAI isn't releasing any new models today, but developers could use its technical findings to update their own work.
OpenAI has implemented some built-in safeguards. The model was trained on data from which objectionable material had been removed, ideally limiting its ability to produce objectionable content. There is a watermark indicating the AI-generated nature of the work, although in theory it could be cropped out. Also, as a preventive anti-abuse feature, the model can't generate recognizable faces from a name: even asking for something like the Mona Lisa would apparently return only a variation on the actual face from the painting.
DALL-E 2 can be tested by vetted partners, with some caveats. Users are prohibited from uploading or generating images that are “not G rated” or that may “cause harm”, including anything containing hate symbols, nudity, obscene gestures or “major conspiracies or events related to major ongoing geopolitical events”. They must also disclose AI's role in generating the images, and they can't offer generated images to other people through an app or website, so you won't see a DALL-E-powered version of something like Dream initially. But OpenAI hopes to add it to the group's API toolset later, so it can power third-party apps. “We hope to continue doing a phased process here so that we can continue to evaluate based on the feedback we get on how to safely release this technology,” says Dhariwal.
Additional reporting from James Vincent.