Friday, August 30, 2013

My PhD dissertation: Understanding Images

Next month I will be defending my PhD dissertation. I thought you might like to hear what it is about.
There is a difference between knowing a thing and understanding it. This is illustrated by a story related by the educator John Dewey. He was visiting the class of another teacher and asked, "What would you find if you dug a deep hole in the earth?" When none of the children were able to answer after repeated questioning, the teacher told Dewey, "You're asking the wrong question," and asked the class, "What is the state of the center of the earth?" The class replied in unison, "Igneous fusion."
The students knew the answer to the teacher's question, but they didn't understand the answer. The facts in the students' heads weren't connected to related facts in a way that would let them answer a question phrased differently.
So can a computer really understand something? It's easy to make a computer program that will answer "Albany" when you ask it to name the capital of New York. But if that fact isn't connected to anything else, the answer is essentially meaningless.
My field is computer vision: writing software that takes images or video as input and tells you something about the scene as output. Normally, the output is just a label for part of the image, like "person" or "sky." In fact, to save effort, I usually just output a particular color and keep a key off to the side saying "red means person, yellow means road," and so on. As far as the program is concerned, that color is a meaningless symbol.
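The color-key output can be sketched in a few lines. This is only an illustration of the idea, not the dissertation's actual code, and the label names and colors are made up:

```python
# Map each label the system knows to a display color (illustrative palette).
COLOR_KEY = {
    "person": (255, 0, 0),      # red means person
    "road":   (255, 255, 0),    # yellow means road
    "sky":    (135, 206, 235),
}

def colorize(label_grid):
    """Replace each cell's label with its key color; unknown labels become black."""
    return [[COLOR_KEY.get(name, (0, 0, 0)) for name in row]
            for row in label_grid]

grid = [["sky", "sky"], ["person", "road"]]
image = colorize(grid)
```

The point the paragraph makes holds here too: nothing in `COLOR_KEY` connects "person" to anything else the program knows, so the color is just a symbol.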
After working with some colleagues in Europe, I started to think about what you would need to do in order to go beyond that, to make a program that really understood what it was seeing.
Now, in a way, this is clearly impossible. A computer can't experience visual sensations the way we do. It isn't conscious, it can't know what it is really like to see the color red. At the most fundamental level, the color red is just [255 0 0] for the computer. That's a weird, deep philosophical issue. So I changed the question a little. I asked, "How can I build a machine that passes tests of understanding?"
I used to be a teacher, and one of the things they taught us was that there are different kinds of questions you can ask on a test. One kind of question tests for knowledge, and another kind tests for understanding. If you want to test whether a student knows what is in a picture, you ask them to label the items depicted. If you want to know whether a student understands a picture, you ask them questions that require the use of knowledge they already have from another source. So in order to make a program that can answer this second type of question, I needed to give it a source of knowledge about the world and the ability to connect what it was recognizing in the image to what was already stored in its memory.
First, though, I needed the program to be able to recognize what was in the image. Imagine you are asked to describe a scene. You might say, "There is a girl running on a gravel road eating blue ice cream." This description contains:

  • concrete nouns: girl, road, ice cream
  • concrete verbs: running, eating
  • concrete adjectives: gravel, blue
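One minimal way to hold such a description in memory is a couple of small record types, one for things and one for activities. The field names here are my own sketch, not the dissertation's data structures:

```python
from dataclasses import dataclass, field

@dataclass
class SceneEntity:
    noun: str                                        # concrete noun, e.g. "girl"
    adjectives: list = field(default_factory=list)   # concrete adjectives, e.g. ["blue"]

@dataclass
class SceneEvent:
    verb: str          # concrete verb, e.g. "running"
    subject: str       # which entity is doing it
    location: str = "" # where, if known

# The example description, broken into its parts:
girl = SceneEntity("girl")
road = SceneEntity("road", ["gravel"])
ice_cream = SceneEntity("ice cream", ["blue"])
running = SceneEvent("running", subject="girl", location="road")
```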
So I wrote programs to do each of these things: a program to label objects (like person and road), a program to label materials (like gravel) and colors, and a detector for activities. (Actually, the activity detector still needs some work: I can get it to work with a Kinect sensor, but not with ordinary video.) Other people had written programs to do these things before, but I invented my own, new ways of doing each of them.
Now, you shouldn't overestimate what these programs can do. I can't just hand the system an arbitrary picture and expect it to tell me everything in it. Instead, I pick maybe a dozen nouns, give it examples of each, and it can then recognize those things pretty well in a new image. But that's okay, I think. It's the job of someone like Google or Microsoft to do the work of scaling up to lots of words, and other researchers can figure out ways to make the detectors more accurate. I'm just trying to show it can be done at all.
Once it could produce these descriptions of the scene, I connected it to something called "ResearchCyc." Cyc is one of the longest-running projects in computer science; its goal is to encode everything we know as common sense. I wrote a small program to take the scene descriptions and express them in a language Cyc can work with.
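The translation step can be pictured as a tiny function that turns detector output into Cyc-style assertions. The constant names below (#$Fence, #$Wood, #$objectMadeOf) are illustrative stand-ins; the real CycL vocabulary and the dissertation's actual mapping are not shown here:

```python
# Hypothetical translator from detector output to CycL-style assertions.
def to_cycl(object_id, label, material=None):
    """Build assertion strings for one detected object."""
    assertions = [f"(#$isa #$Obj{object_id} #${label})"]
    if material:
        assertions.append(f"(#$objectMadeOf #$Obj{object_id} #${material})")
    return assertions

facts = to_cycl(89, "Fence", material="Wood")
```

Once facts are phrased this way, Cyc's existing common-sense knowledge can chain off them.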
With all these parts combined, you can ask questions like, "What is there in this scene that might catch on fire if exposed to a flame?" The program will then reason something like this:
  • There is an object 89% likely to be a fence in the scene.
  • The fence is 77% likely to be made of wood.
  • There are objects 93% likely to be trees in the scene. These objects are 62% likely to be concrete [because they happen to look smooth and grey in this particular picture] and 35% likely to be wood.
  • All trees are made of wood. [It has never heard of fake concrete trees and thinks that things are what they appear to be. It's kind of like a little kid that way.]
  • Therefore, these trees are made of wood.
  • Wood things are flammable.
  • Flammable things catch on fire if exposed to a flame.
  • Therefore, the fence at location (50, 127) and the trees at location (156, 145) might catch fire if exposed to flame.
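The chain above can be mimicked with a toy rule engine: per-object beliefs from the vision system, one definitional rule ("all trees are made of wood") that overrides appearance, and one rule about flammability. The numbers come from the example; the structure and threshold are my own sketch, not the actual Cyc reasoner:

```python
# Beliefs the vision system might report for the example scene.
detections = {
    "obj_fence": {"class": ("fence", 0.89), "material": ("wood", 0.77)},
    "obj_trees": {"class": ("tree", 0.93), "material": ("concrete", 0.62)},
}

MADE_OF_BY_DEFINITION = {"tree": "wood"}   # ontology overrides appearance
FLAMMABLE_MATERIALS = {"wood"}

def flammable_objects(dets, threshold=0.5):
    """Return objects that the rules say might catch fire."""
    out = []
    for obj, info in dets.items():
        cls, p_cls = info["class"]
        if p_cls < threshold:
            continue
        # Definitional knowledge wins over the per-image material guess.
        material = MADE_OF_BY_DEFINITION.get(cls) or info["material"][0]
        if material in FLAMMABLE_MATERIALS:
            out.append(obj)
    return out

flammable = flammable_objects(detections)  # the fence and the trees
```

Notice how the trees end up flammable even though they *look* like concrete: the definitional rule wins, exactly as in the chain above.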
In fact, it can report that chain of reasoning and answer in (pretty good) English, because it has a canned English translation for each of the concepts it knows. Unfortunately, it doesn't yet go the other way: you have to write your questions in the language Cyc understands, instead of just writing them in English.
There's still a lot of work to do. Despite all the effort that has gone into it, Cyc still has huge gaps: things it simply doesn't know. What we would really like is some way to build a resource like Cyc automatically, by analyzing web pages or by having a robot interact with the world. That's still pretty hard, but programs like IBM's Watson, the Jeopardy champion, make it seem reachable in the next decade or two. One thing my program can't do at all is guess why someone is doing something. It has no model of what is going on in people's heads, and can't perform that kind of reasoning at the moment. My hope is that other researchers will see what I've done and say, "Okay, technically this can answer some understanding questions, but it's really quite lousy. I can do better." If enough people do that, we might get something useful.

Here are some other cool things the program can do:

  • draw like a child (well, in one way)
  • recognize that there is probably a window in the scene even though it doesn't have a window detector. How? It knows that buildings and cars usually have windows, and it can detect buildings and cars.
  • recognize that it is probably looking at an urban, rural, or indoor scene based on what kinds of things it can see.
  • guess that something it can't quite identify, something that might be a dog or might be a cow, is a mammal, a quadruped, an animal, a living thing, able to move, and so forth. This works by finding the least common superclass of the candidate labels.
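The least-common-superclass trick in the last bullet can be shown on a toy taxonomy. The real system walks Cyc's ontology, which is vastly larger; the chain below just mirrors the categories named in the bullet:

```python
# Toy taxonomy: each class points to its parent (None at the top).
PARENT = {
    "dog": "mammal", "cow": "mammal",
    "mammal": "quadruped", "quadruped": "animal",
    "animal": "living thing", "living thing": None,
}

def ancestors(cls):
    """The class itself followed by all of its superclasses, in order."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT.get(cls)
    return chain

def least_common_superclass(a, b):
    """First ancestor of `a` that is also an ancestor of `b`."""
    b_ancestors = set(ancestors(b))
    for cls in ancestors(a):
        if cls in b_ancestors:
            return cls
    return None

lcs = least_common_superclass("dog", "cow")  # "mammal"
```

Even when the detector can't decide between "dog" and "cow", everything true of the least common superclass ("mammal", and everything above it) is safe to assert.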


  1. Wow! So cool, Doug. Does this mean you are finished and have submitted everything?

  2. I have turned it in. All I have to do now is give a final presentation, called a defense.


  4. Ha Ha, exactly. Defend it to the death!

  5. Google and Stanford have developed image recognition software that can describe what's happening in an image.