
AI Text to Image generators and what they tell us about the future

Type and ye shall receive. Pic: Dall-E 2

AI text-to-image software such as Dall-E 2, craiyon, and Midjourney is hot right now, even occupying this week's John Oliver show. And, as David Shapton writes, there's a lot going on behind the pixels.

I've been - literally - playing with Dall-E 2. Just two years ago, if I'd told you that we'd be able to do what Dall-E 2 can do today, you wouldn't have believed me. So rather than explain it to you, I'm just going to show you this sentence:

A vase of yellow tulips on a chrome-coloured coffee table in a mid-century-modern living room with marble flooring and a vast garden visible through sliding glass doors with lots of bokeh and neon lighting

And this image:

Pic: Dall-E 2's output for the tulip prompt above, generated 11 August 2022.

 The text caused the image. There were no other clues or hints. Dall-E 2 read the text and made the image.

And what an image! In its own right, it's an OK picture. I've seen better compositions, and it's a bit heavy on bokeh and bright reflections - but that's because I asked it to be.
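For the curious, the mechanics of asking are almost comically simple. Here's a rough sketch of what the same experiment looks like from code rather than the web app. To be clear, this is illustrative only: it assumes OpenAI's pre-1.0 openai Python package and its image endpoint, plus an API key in an OPENAI_API_KEY environment variable - none of which you need in order to play with Dall-E 2 in a browser.

```python
# Illustrative sketch only: assumes the pre-1.0 "openai" Python package and an
# API key in the OPENAI_API_KEY environment variable. Dall-E 2 was a web app
# when this piece was written, so treat this as the shape of the idea.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = ("A vase of yellow tulips on a chrome-coloured coffee table in a "
          "mid-century-modern living room with marble flooring and a vast "
          "garden visible through sliding glass doors with lots of bokeh "
          "and neon lighting")

# Ask for one 1024x1024 image; the response contains a temporary URL to it.
response = openai.Image.create(prompt=prompt, n=1, size="1024x1024")
print(response["data"][0]["url"])
```

A sentence goes in; a link to a freshly made picture comes out. That's the whole interface.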

You can't get away from AI any more than you could have got away from computers in the last thirty years. In those three decades, anyone denying that computers were "the future" would have been seen as wilfully ignorant or deluded. It's the same with the internet. Very, very few people now proudly say "I don't 'do' computers", because that's becoming kind of equivalent to saying "I can't read".

And so it is with AI and, in a different but related way, with the metaverse.

But AI is in a class all of its own, and what's motivated me to write this piece is not that I think AI is going to control us or destroy us (it may do, and we should treat such possibilities as serious, if not inevitable), but that it can surprise us.

The surprise gap

It should be hard to surprise an expert. If you know enough about a subject, you probably know enough to fend off surprises. At the very least, you ought to know enough to be able to predict what's sort of likely to happen in your field of knowledge.

But what if you are still surprised despite your qualifications and experience? That can only mean that whatever it is that's surprised you is so massively different and unexpected that it represents a giant leap forward in technology.

Let's imagine that Tesla announces its next car, the distinguishing feature of which is that it has no wheels because it has an anti-gravity motor. It just hovers in front of you until you command it to move forward, silently, floating on a mysterious and invisible force.

Or maybe Airbus says its next product is not an aircraft but involves high-speed travel. Almost instant, in fact, because it's a matter transporter, just like in Star Trek.

Either of those would surprise us, and with good reason: these things are supposed to be impossible. So it's reasonable to be surprised when confronted with something that was supposed to be impossible.

For a few years, we've seen examples of text-to-image. Ask a computer to draw a purple dinosaur eating a banana, and you'd get a rather shabby, naive-looking image of precisely that. The images weren't much good, but they didn't need to be to prove a point.

Try to figure out, then, what it would take for that cartoonish demonstration to become so advanced, so adept, and so damn clever that it makes an image indistinguishable from real life (notwithstanding purple dinosaurs, etc.).

But this has now happened. How about this:

"A MacBook pro connected to an oscilloscope on a gigantic hardwood table with an orange and teal aesthetic and lots of bokeh"


Pic: Dall-E 2's output for the oscilloscope prompt above, generated 11 August 2022.

That image never existed before. Nobody had seen it until I typed that instruction.

How it works. No, really...

What made me want to write this piece is that I have a lot of friends in the moving image industry. I respect all of them, and their expertise next to mine is like a planet compared to a golf ball.

But most of them got this wrong. They didn't understand how Dall-E 2 works.

And I don't blame them for a minute because it's human nature to explain new things in terms that we're familiar with. But that strategy doesn't work with a phenomenon as powerful and challenging as AI.

One common reaction to these pictures is: "The AI has scanned the web, looking for images with descriptions, and..." Just jumping in here: that part is entirely correct. The next part - "...pastes them into the new image" - is a reasonable guess, but it's wrong.

It's wrong because this is not cutting and pasting. It's not compositing. It's not even about pixels.

See that table in the first image? You could probably do a reverse image search and find something like it. But you wouldn't find it from the same angle, with the same lighting and the same reflections. What's more, you could have asked for the table from any angle - the same table, but "reimagined" to be seen from any direction. You can't do that with cutting and pasting. To get credible reflections like that, you'd need at least some understanding of the 3D essence of the table. So there's something deeper at work here.

It's easier to understand the process if you replace "pixels" with "concepts". What Dall-E 2 has done is look at millions - or even billions - of images, each matched with a text description. From these text-image pairs, it learns not just what these objects are but what they are like.

Why am I putting it like that? Because if you ask someone what a red rose is "like", they'll describe it. And they can describe a big red rose, a small one, a deep-coloured one or a lighter-coloured one. A red rose in a vase, a red rose in the garden, and a red rose in a Monet painting. And they can describe the same rose from any angle, in any lighting. You can do that if you know what something is like.

So while it's vanishingly unlikely that Dall-E 2 is sentient, or knows it has knowledge, what it does know - in some sense that I'm not sure I can define at this stage - is what things are like.

And it's knowing what things are like that makes it possible for it to create entirely novel images that are credible and realistic.
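If you want a feel for what "knowing what things are like" might mean in machine terms, here's a deliberately toy sketch. None of this is Dall-E 2's actual code - the encoders below are made-up stand-ins - but the shape of the idea is real: text and images get mapped into one shared "concept" space, training on caption/image pairs pulls matching pairs close together, and generation starts from a concept rather than from borrowed pixels.

```python
import numpy as np

# Toy illustration, not Dall-E 2's real code. The encoders are hypothetical
# stand-ins; the real system uses large neural networks trained on an enormous
# pile of captioned images. The point is the shape of the idea: text and images
# land in one shared "concept" space, where a good caption sits close to the
# images it describes.

rng = np.random.default_rng(0)
DIM = 64  # size of the shared concept space

def toy_text_encoder(caption: str) -> np.ndarray:
    """Stand-in text encoder: deterministically map a caption to a unit vector."""
    local = np.random.default_rng(abs(hash(caption)) % (2**32))
    v = local.standard_normal(DIM)
    return v / np.linalg.norm(v)

def toy_image_encoder(pixels: np.ndarray) -> np.ndarray:
    """Stand-in image encoder: randomly project flattened pixels to a unit vector."""
    proj = rng.standard_normal((DIM, pixels.size))
    v = proj @ pixels.ravel()
    return v / np.linalg.norm(v)

def concept_similarity(caption: str, pixels: np.ndarray) -> float:
    """Cosine similarity in the shared space. Training pushes this up for true
    caption/image pairs and down for mismatched ones."""
    return float(toy_text_encoder(caption) @ toy_image_encoder(pixels))

# Generation runs the idea in reverse: start from the caption's concept vector
# and synthesise pixels whose embedding lands near it. That's why the result is
# a brand-new image rather than a collage of existing ones.
fake_photo = rng.random((32, 32, 3))
print(concept_similarity("a vase of yellow tulips on a chrome coffee table", fake_photo))
```

Swap the stand-ins for very large neural networks trained on captioned images at web scale and you're in roughly the right neighbourhood.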

And this is just the start. So, I wonder: is this a case of "I, for one, welcome our new AI design overlords"? Or is it simply a wonderful time to be alive? 
