{"id":2185,"date":"2021-01-10T11:00:59","date_gmt":"2021-01-10T11:00:59","guid":{"rendered":"https:\/\/thenextweb.com\/?p=1333481"},"modified":"2021-01-10T11:00:59","modified_gmt":"2021-01-10T11:00:59","slug":"heres-how-openais-magical-dall-e-image-generator-works","status":"publish","type":"post","link":"https:\/\/www.londonchiropracter.com\/?p=2185","title":{"rendered":"Here\u2019s how OpenAI\u2019s magical DALL-E image generator works"},"content":{"rendered":"\n<p>It seems like every few months, someone publishes a machine learning paper or demo that makes my jaw drop. This month, it\u2019s OpenAI\u2019s new image-generating model,<span>&nbsp;<\/span><a href=\"https:\/\/openai.com\/blog\/dall-e\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">DALL\u00b7E<\/a>.<\/p>\n<p>This behemoth 12-billion-parameter neural network takes a text caption (e.g. \u201can armchair in the shape of an avocado\u201d) and generates images to match it:<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" title=\"Generated images of avocado chairs\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-06-at-1.37.37-pm.png\" alt=\"Generated images of avocado chairs\" width=\"1432\" height=\"1428\" class=\" lazy\" data-lazy=\"true\"><figcaption>From https:\/\/openai.com\/blog\/dall-e\/<\/figcaption><\/figure>\n<p>I think its pictures are pretty inspiring (I\u2019d buy one of those avocado chairs), but what\u2019s even more impressive is DALL\u00b7E\u2019s ability to understand and render concepts of space, time, and even logic (more on that in a second).<\/p>\n<p>In this post, I\u2019ll give you a quick overview of what DALL\u00b7E can do, how it works, how it fits in with recent trends in ML, and why it\u2019s significant. Away we go!<\/p>\n<h2 id=\"what-is-dalle-and-what-can-it-do\">What is DALL\u00b7E and what can it do?<\/h2>\n<p>In July, DALL\u00b7E\u2019s creator, the company OpenAI, released a similarly huge model called GPT-3 that wowed the world with<span>&nbsp;<\/span><a href=\"https:\/\/daleonai.com\/gpt3-explained-fast\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">its ability to generate human-like text<\/a>, including op-eds, poems, sonnets, and even computer code. DALL\u00b7E is a natural extension of GPT-3 that parses text prompts and then responds not with words but in pictures. In one example from OpenAI\u2019s blog, the model renders images from the prompt \u201ca living room with two white armchairs and a painting of the colosseum. 
The painting is mounted above a modern fireplace\u201d:<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" title=\"DALLE generated images\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-06-at-2.39.07-pm.png\" alt=\"DALLE generated images\" width=\"1424\" height=\"1428\" class=\" lazy\" data-lazy=\"true\"><figcaption>From https:\/\/openai.com\/blog\/dall-e\/.<\/figcaption><\/figure>\n<p>Pretty slick, right? You can probably already see how this might be useful for designers. Notice that DALL\u00b7E can generate a large set of images from a prompt. The pictures are then ranked by a second OpenAI model, called<span>&nbsp;<\/span><a href=\"https:\/\/openai.com\/blog\/clip\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">CLIP<\/a>, that tries to determine which pictures match best.<\/p>\n<h2 id=\"how-was-dalle-built\">How was DALL\u00b7E built?<\/h2>\n<p>Unfortunately, we don\u2019t have a ton of details on this yet because OpenAI has yet to publish a full paper. 
But at its core, DALL\u00b7E uses the same new neural network architecture that\u2019s responsible for tons of recent advances in ML: the<span>&nbsp;<\/span><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Transformer<\/a>. Transformers, introduced in 2017, are an easy-to-parallelize type of neural network that can be scaled up and trained on huge datasets. They\u2019ve been particularly revolutionary in natural language processing (they\u2019re the basis of models like BERT, T5, GPT-3, and others), improving the quality of<span>&nbsp;<\/span><a href=\"https:\/\/blog.google\/products\/search\/search-language-understanding-bert\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Google Search<\/a><span>&nbsp;<\/span>results and machine translation, and even<span>&nbsp;<\/span><a href=\"https:\/\/daleonai.com\/how-alphafold-works\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">predicting the structures of proteins<\/a>.<\/p>\n<p>Most of these big language models are trained on enormous text datasets (like all of Wikipedia or<span>&nbsp;<\/span><a href=\"https:\/\/commoncrawl.org\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">crawls of the web<\/a>). What makes DALL\u00b7E unique, though, is that it was trained on sequences that were a combination of words and pixels. 
We don\u2019t yet know exactly what the dataset was (it probably contained images and captions), but it was almost certainly massive.<\/p>\n<h2 id=\"how-smart-is-dalle\">How \u201csmart\u201d is DALL\u00b7E?<\/h2>\n<p>While these results are impressive, whenever we train a model on a huge dataset, the skeptical machine learning engineer is right to ask whether the results are merely high-quality because they\u2019ve been copied or memorized from the source material.<\/p>\n<p>To prove DALL\u00b7E isn\u2019t just regurgitating images, the OpenAI authors forced it to render some pretty unusual prompts:<\/p>\n<p>\u201cA professional high quality illustration of a giraffe turtle chimera.\u201d<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-06-at-1.39.04-pm.png\" alt width=\"1436\" height=\"1140\" class=\" lazy\" data-lazy=\"true\"><figcaption>From https:\/\/openai.com\/blog\/dall-e\/.<\/figcaption><\/figure>\n<p>\u201cA snail made of a harp.\u201d<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-06-at-1.39.12-pm.png\" alt width=\"1438\" height=\"1434\" class=\" lazy\" data-lazy=\"true\"><figcaption>From https:\/\/openai.com\/blog\/dall-e\/<\/figcaption><\/figure>\n<p>It\u2019s hard to imagine the model came across many giraffe-turtle hybrids in its training data set, which makes the results all the more impressive.<\/p>\n<p>What\u2019s more, these weird prompts hint at something even more fascinating about DALL\u00b7E: its ability to perform \u201czero-shot visual reasoning.\u201d<\/p>\n<h2 id=\"zero-shot-visual-reasoning\">Zero-Shot Visual Reasoning<\/h2>\n<p>Typically, in machine learning, we train models by giving them thousands or millions of examples of tasks we want them to perform and hope they pick up on the pattern.<\/p>\n<p>To train a model that identifies dog breeds, for example, we might show a neural network thousands of pictures of dogs labeled by breed and then test its ability to tag new pictures of dogs. 
It\u2019s a task with limited scope that seems almost quaint compared to OpenAI\u2019s latest feats.<\/p>\n<p>Zero-shot learning, on the other hand, is the ability of models to perform tasks that they weren\u2019t specifically trained to do. For example, DALL\u00b7E was trained to generate images from captions. But with the right text prompt, it can also transform images into sketches:<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-06-at-1.41.02-pm.png\" alt width=\"1436\" height=\"1438\" class=\" lazy\" data-lazy=\"true\"><figcaption>Results from the prompt, \u201cthe exact same cat on the top as a sketch on the bottom\u201d. 
From https:\/\/openai.com\/blog\/dall-e\/<\/figcaption><\/figure>\n<p>DALL\u00b7E can also render custom text on street signs:<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-06-at-2.51.53-pm.png\" alt width=\"1174\" height=\"1172\" class=\" lazy\" data-lazy=\"true\"><figcaption>Results from the prompt&nbsp;\u201cA store front that has the word \u2018openai\u2019 written on it.\u201d From https:\/\/openai.com\/blog\/dall-e\/.<\/figcaption><\/figure>\n<p>In this way, DALL\u00b7E can act almost like a Photoshop filter, even though it wasn\u2019t specifically designed to behave this way.<\/p>\n<p>The model even shows an \u201cunderstanding\u201d of visual concepts (e.g. \u201cmacroscopic\u201d or \u201ccross-section\u201d pictures), places (e.g. 
\u201ca photo of the food of china\u201d), and time (\u201ca photo of alamo square, san francisco, from a street at night\u201d; \u201ca photo of a phone from the 20s\u201d). For example, here\u2019s what it spit out in response to the prompt \u201ca photo of the food of china\u201d:<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-06-at-1.42.22-pm.png\" alt width=\"1444\" height=\"860\" class=\" lazy\" data-lazy=\"true\"><figcaption>\u201ca photo of the food of china\u201d from https:\/\/openai.com\/blog\/dall-e\/.<\/figcaption><\/figure>\n<p>In other words, DALL\u00b7E can do more than just paint a pretty picture for a caption; it can also, in a sense, answer questions visually.<\/p>\n<p>To test DALL\u00b7E\u2019s visual reasoning ability, the authors had it take a visual IQ test. 
In the examples below, the model had to complete the lower right corner of the grid, following the test\u2019s hidden pattern.<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/daleonai.com\/images\/screen-shot-2021-01-07-at-1.22.26-pm.png\" alt width=\"1464\" height=\"896\" class=\" lazy\" data-lazy=\"true\"><figcaption>A screenshot of the visual IQ test OpenAI used to test DALL\u00b7E&nbsp;from https:\/\/openai.com\/blog\/dall-e\/.<\/figcaption><\/figure>\n<p>\u201cDALL\u00b7E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning,\u201d write the authors, but it did better at some problems than others. 
When the puzzles\u2019 colors were inverted, DALL\u00b7E did worse, \u201csuggesting its capabilities may be brittle in unexpected ways.\u201d<\/p>\n<h2 id=\"what-does-it-mean\">What does it mean?<\/h2>\n<p>What strikes me the most about DALL\u00b7E is its ability to perform surprisingly well on so many different tasks, ones the authors didn\u2019t even anticipate:<\/p>\n<p>\u201cWe find that DALL\u00b7E [\u2026] is able to perform several kinds of image-to-image translation tasks when prompted in the right&nbsp;way.<\/p>\n<p>We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it.\u201d<\/p>\n<p>It\u2019s amazing, but not wholly unexpected; DALL\u00b7E and GPT-3 are two examples of a greater theme in deep learning: that extraordinarily big neural networks trained on unlabeled internet data (an example of \u201cself-supervised learning\u201d) can be highly versatile, able to do lots of things they weren\u2019t specifically designed for.<\/p>\n<p>Of course, don\u2019t mistake this for general intelligence. It\u2019s<span>&nbsp;<\/span><a href=\"https:\/\/lacker.io\/ai\/2020\/07\/06\/giving-gpt-3-a-turing-test.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">not hard<\/a><span>&nbsp;<\/span>to trick these types of models into looking pretty dumb. We\u2019ll know more when they\u2019re openly accessible and we can start playing around with them. But that doesn\u2019t mean I can\u2019t be excited in the meantime.<\/p>\n<p><i><span>This <a href=\"https:\/\/daleonai.com\/dalle-5-mins\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">article<\/a> was written by <\/span><\/i><a href=\"https:\/\/daleonai.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\"><i><span>Dale Markowitz<\/span><\/i><\/a><i><span>, an Applied AI Engineer at Google based in Austin, Texas, where she works on applying machine learning to new fields and industries. 
She also likes solving her own life problems with AI, and talks about it on YouTube.<\/span><\/i><\/p>\n<p class=\"c-post-pubDate\"> Published January 10, 2021 \u2014 11:00 UTC <\/p>\n<p> <a href=\"https:\/\/thenextweb.com\/neural\/2021\/01\/10\/heres-how-openais-magical-dall-e-generates-images-from-text-syndication\/\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>It seems like every few months, someone publishes a machine learning paper or demo that makes my jaw drop. This month, it\u2019s OpenAI\u2019s new image-generating model,&nbsp;DALL\u00b7E. This behemoth 12-billion-parameter neural network takes&#8230;<\/p>\n","protected":false},"author":1,"featured_media":2186,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts\/2185"}],"collection":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2185"}],"version-history":[{"count":0,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/posts\/2185\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=\/wp\/v2\/media\/2186"}],"wp:attachment":[{"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2185"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.londonchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2185"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lon
donchiropracter.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2185"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}