Today we're going to be looking at yet another aspect of the ways in which human minds, or animal minds, and what we might think of as computational minds can lend insight into each other, and this is a particularly interesting aspect of that discussion as we talk about vision. So what we're going to be looking at today, and over the next couple of lectures really, will be the ways in which we can treat human vision, again with occasional forays perhaps into animal vision but primarily human vision, as a computational problem. This perspective, that is, viewing vision itself as a computational problem, is a relatively new idea. You might say that it actually started with interest in machine vision in the 1970s or thereabouts, and it has grown into an extraordinarily fruitful enterprise. That is, the study of getting machines to see objects and interpret scenes, sometimes roughly the way that we imagine people do. Sometimes the intent is to get machines to go further than people are able to. But for now, we'll start with what we know about human vision and how that can be treated in a computational way. It's a really fascinating story. For our purposes, we'll begin with a very rough picture of what human vision does. The reason we're going to do that is that it simplifies the discussion of the computational treatment of vision. That is, if we start with a rather simple view of what human vision does, we can get some foothold on how to treat it as a computational problem. So we'll begin with this rather simplified portrait of vision, and really with this diagram, which I got off the web. This is a diagram of the eye, and it includes much more detail than we really need for our discussion. For our purposes, we can think of the human eye as being something like a pinhole camera. Now, in a sense, that's already a kind of machine metaphor for the eye, if you want to think of it that way. 
The eye has this small hole through which light comes; it is refracted by the lens of the eye, and the focused picture is then projected onto the surface of the retina. So you see that in this diagram here, the retina is that, my hand disappears behind the picture, but you see the yellow layer there. That's the retina of the eye. So to a first approximation, think of it this way. The light comes in from the scene outside our heads, and it is projected onto this essentially two-dimensional screen at the back of the eye, the retina. Now, for our purposes again, there's much more to say about this and there's a lot more detail. But the first thing to notice about this very rough portrait is that what we have to do as people is then interpret the patterns of light falling on the retina, to translate them into beliefs about three-dimensional objects out in the world. That's an extraordinarily difficult task when you think about it. Again, it feels so automatic. We just open our eyes and see objects out in the world. But when you view this as a true computational problem, you see that it's anything but automatic. In fact, it's fiendishly difficult. What we're trying to do is take patterns of light falling on an essentially two-dimensional surface at the back of our eye, and from those patterns of light, generate beliefs, well-founded beliefs we hope, about where objects are in the world. One thing to say about this right off the bat is that, from a mathematical standpoint, recovering three dimensions from a two-dimensional projection is impossible in the most general statement of the problem. That is to say, if you are looking at a two-dimensional projection, like a scene on a movie screen, if you are looking at a two-dimensional projection of 3D objects, the three-dimensional objects that could give rise to that two-dimensional projection are infinite in number. 
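To make the pinhole-camera idea concrete, here is a minimal sketch in Python (not part of the lecture, just an illustration): under the ideal pinhole model, a 3D point (x, y, z) lands on the image plane at (f·x/z, f·y/z), and the depth z is discarded by the projection. The function name and focal length are my own choices for the example.

```python
import numpy as np

def pinhole_project(points_3d, focal_length=1.0):
    """Project 3D points (x, y, z) onto a 2D image plane.

    Ideal pinhole model: a point at depth z lands at (f*x/z, f*y/z).
    Depth itself is lost in the projection.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([focal_length * x / z, focal_length * y / z], axis=1)

# Two different 3D points that project to the SAME image location:
# the second is twice as far away but twice as large in x and y.
near = pinhole_project([[1.0, 1.0, 2.0]])
far = pinhole_project([[2.0, 2.0, 4.0]])
print(near)  # [[0.5 0.5]]
print(far)   # [[0.5 0.5]]
```

The two printed points are identical, which is exactly why depth cannot be read directly off the retinal image.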
You have to do a great deal of interpretation to choose a particular set of 3D structures that is generating the particular two-dimensional projection that you're dealing with. This example, a visual illusion by the Japanese artist Shigeo Fukuda, gives you a suggestion of what I'm talking about. This is a photograph. What you're looking at is a sculpture with a whole bunch of weirdly placed pieces in front of a mirror. The mirror is reflecting what looks to be a grand piano. It isn't reflecting a grand piano; it's reflecting that jumble in the front part of the photo. In other words, if you were standing behind the mirror, if you were looking at this object from the position of the mirror, you would see a pattern of light that you would justifiably interpret as a grand piano. 99 times out of 100, 999 times out of 1,000, or even more often than that, if you're seeing something that looks like a grand piano, it is one. But it doesn't have to be; it might in fact be that jumble of pieces. This is what I mean by saying that the general computational vision problem is mathematically impossible. That is to say, let's go back to the earlier slide. What you're trying to do is take this pattern of light on the two-dimensional retina and infer from it where the 3D objects are. In point of fact, an infinite number of possible arrangements of 3D objects or pieces could have given rise to that very same image. What does that mean computationally? It means that in order to solve the vision problem for ourselves when we open our eyes, we're going to have to make a lot of guesses. Those are educated guesses. They've been honed into us through millions of years of evolution. So we're very good at making these kinds of guesses, and generally, we don't encounter scenes like this. So we don't have to worry about being fooled by a collection of 3D objects out in the world that are arranged to look like something familiar. 
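The claim that infinitely many 3D arrangements yield the same image can be demonstrated directly by inverting the pinhole projection: a single image point does not determine a 3D point, only a whole viewing ray of candidates, one per depth. A small illustrative sketch (the helper name `backproject` is mine, not the lecture's):

```python
import numpy as np

def backproject(u, v, depths, focal_length=1.0):
    """All 3D points along the viewing ray through image point (u, v).

    For any depth z, the point (u*z/f, v*z/f, z) projects back to
    exactly (u, v) -- so one pixel is consistent with an entire ray
    of 3D candidates.
    """
    depths = np.asarray(depths, dtype=float)
    return np.stack([u * depths / focal_length,
                     v * depths / focal_length,
                     depths], axis=1)

# Five wildly different 3D points, all producing the same image point:
candidates = backproject(0.5, 0.5, depths=[1, 2, 5, 10, 100])
for x, y, z in candidates:
    print((x, y, z), "->", (x / z, y / z))  # always projects to (0.5, 0.5)
```

Fukuda's sculpture is, in effect, an adversarially chosen point on one such ray: a 3D arrangement that happens to produce the image a piano would.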
However, just as a general rule, it's good to think about illusions like this. Illusions, optical illusions, visual illusions, are often very good hints as to the computational nature of our own vision. That is to say, they tell us something when we go wrong. When we look at something that is an illusion, or we're confused by it, that often gives us a clue as to what algorithms we're running to interpret scenes. And here is an example. This is a photo by the wonderful photographer Walter Wick showing an impossible dog house. So you might look at this photo and try to imagine how something like this could be. Again, this is a photograph, so this is a real object in the world. You look at this thing and you have to decide for yourself: what's wrong here? I'm having trouble interpreting this as a true three-dimensional object. The trouble that I'm having should be a clue, again, to the algorithms that I'm using to interpret two-dimensional scenes as three-dimensional objects. In this case, you might look at some of those boards in the dog house: when we see a portrait like that, we interpret things that look solid or connected as in fact solid or connected. It's rare that things that appear to us as one piece aren't, but it can happen. This is another view of the dog house from a slightly different angle, and you can see how this is a much clearer version to us of what the three-dimensional dog house looks like. But when seen from a particular angle, it looks like things are crossing in ways that would be impossible in three dimensions. So there are lists of wonderful optical illusions. There are books and books of them. They're so amazing and often beautiful to look at, and they teach us something. Many of them teach us something about our own, I will say, algorithms in vision. That is, they show us the limitations of our own strategies for interpreting two-dimensional scenes. 
Now, in our ensuing discussions of vision, we're going to begin with a very simplified version of the vision problem. So for those of you watching who are computer programmers, I'll put this to you as a computer program, an assignment of a computer program to write. This would clearly be a very difficult program to write. The idea is, you are given a photograph, say a grayscale photograph like the one that you see here on the left, okay? You're given a grayscale photograph. Now, that's translated into a large array of pixel values. A pixel should be a familiar term; it's a little element of a two-dimensional graphical picture. In other words, think of this photograph at the left as being, say, 1,000 by 1,000 pixels, 1,000 rows, 1,000 columns, and therefore one million pixels. Each pixel is in this case a number, we'll make it an 8-bit number, which can range from 0 to 255, where 0 corresponds to black and 255 corresponds to bright white, and everything in between corresponds to some shade of gray. So what you're looking at in this diagram is a photo which would be perhaps a million pixels, 1,000 by 1,000. I think I can just point to it. This little box here at the side of this little black region, this rectangle that's been drawn at the side of this cube, is expanded into the set of numerical values that you see at the right. The particular numerical values are not so important here. But if you look carefully at them, you'll see that they range in value from 0 to 255. So here's our computer program problem. We're given a 1,000 by 1,000 array of numbers, just like the little subset that you see at the right there, all ranging from 0 to 255. That's your input. Your output should be something like "hand holding a cube." 
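As a sketch of what that input looks like to a program, here is a 1,000 by 1,000 8-bit grayscale array in Python with NumPy. The random values are purely a stand-in for the photograph; the small slice plays the role of the rectangle "expanded" into its numerical values on the slide.

```python
import numpy as np

# A toy stand-in for the 1,000 x 1,000 grayscale photo:
# an 8-bit array where 0 is black and 255 is bright white.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(1000, 1000), dtype=np.uint8)

# "Expanding" a small rectangle of the image into its numerical
# values, as in the slide:
patch = image[500:505, 500:508]
print(patch)
print(image.shape, image.dtype, image.min(), image.max())
```

Everything the program will ever know about the scene is in this array: one million numbers between 0 and 255, and nothing else.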
In other words, the program that you're going to write is one that takes as its input pixel values which, as our first approximation, are the intensity values received on the retina. So your job is going to be taking that very large number of intensity values, in this case we're not even dealing with color, we're just dealing with grayscale, so you're going to get these numbers, and you have to interpret them as a particular three-dimensional scene, a hand holding a cube. That is a very difficult computational problem, and you have to make lots and lots of assumptions about it. I should say off the bat that there are lots and lots of simplifications in this statement of the problem relative to human vision. We do have many more cues than simply an array of numbers like this. For one thing, we have two eyes. So a better statement of the problem would be that we have two corresponding scenes, two corresponding arrays, coming from slightly different vantage points, and that would help us interpret the three-dimensional structure behind these numbers. It should also be mentioned that our retinas don't return arrays of pixels corresponding to an evenly distributed sampling of the scene. In fact, the resolution of our retina is far greater in a central area called the fovea. So if you were representing the human eye with this array more accurately, you would show a very high-resolution chunk of numbers toward the center and a lower-resolution chunk of numbers toward the periphery. That's another approximation that we're making. Here we're just saying, okay, we're going to assume that we've got a 1,000 by 1,000 array, with the same resolution throughout the scene. Of course we can do other things. We can move around. We might integrate our understanding of the scene by reaching out to touch certain objects. We have color. Things are moving. We can move. The objects might move. 
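The fovea point can be caricatured in code: keep full resolution in a central patch and subsample everything else. This is only an illustrative sketch under made-up parameters (a 200-pixel square "fovea", every 4th pixel in the periphery), not a model of the actual retina.

```python
import numpy as np

def foveate(image, fovea_size=200, peripheral_step=4):
    """Crude sketch of foveated sampling (illustrative only).

    Keeps full resolution in a central square 'fovea' and samples
    the rest of the image at every `peripheral_step`-th pixel.
    """
    h, w = image.shape
    cy, cx = h // 2, w // 2
    half = fovea_size // 2
    fovea = image[cy - half:cy + half, cx - half:cx + half]  # full res
    periphery = image[::peripheral_step, ::peripheral_step]  # coarse
    return fovea, periphery

image = np.zeros((1000, 1000), dtype=np.uint8)
fovea, periphery = foveate(image)
print(fovea.shape)      # (200, 200)
print(periphery.shape)  # (250, 250)
```

The uniform 1,000 by 1,000 array in our problem statement simply ignores this nonuniformity, which is one of the simplifications the lecture flags.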
There are all sorts of other cues we might pick up. Still, this is not the most unfair portrait of the vision problem, because when you think about it, you were able to interpret that two-dimensional photo on the left. You didn't have color. It didn't matter if you moved about; the photograph is not moving. In other words, you're able to look at that array of pixels on the left and generate the response that this is a picture of a hand holding a cube. We would like, again as a first approximation to the vision problem, to get a computer program to perform the way that we do as people. We'd like to get a computer to look at this array and make a well-justified guess, which is all it could be, as to what the three-dimensional objects are that are being represented. So that will be our starting version of the computational vision problem, and that's what we're going to use as the basis for our discussion. We'll expand things a little bit as we go along, but that's what we'll use as the basis of our discussion as we continue talking about vision.