New research from Google’s UK-based DeepMind subsidiary demonstrates that deep neural networks have a remarkable capacity to understand a scene, represent it in a compact format, and then “imagine” what the same scene would look like from a perspective the network hasn’t seen before.
Human beings are good at this. If shown a picture of a table with only the front three legs visible, most people know intuitively that the table probably has a fourth leg on the opposite side and that the wall behind the table is probably the same color as the parts they can see.
A DeepMind team led by Ali Eslami and Danilo Rezende has developed software based on deep neural networks with these same capabilities—at least for simplified geometric scenes. Given a handful of “snapshots” of a virtual scene, the software—known as a generative query network (GQN)—uses a neural network to build a compact mathematical representation of that scene. It then uses that representation to render images of the scene from new perspectives—perspectives the network hasn’t seen before.
The researchers didn’t hard-code into the GQN any prior knowledge about the kinds of environments it would be rendering. Human beings are aided by years of experience looking at real-world objects. The DeepMind network develops a similar intuition of its own simply by examining a bunch of images of similar scenes.
“One of the most surprising results [was] when we saw it could do things like perspective and occlusion and lighting and shadows,” Eslami told us in a Wednesday phone interview. “We know how to write renderers and graphics engines,” he said. What’s remarkable about DeepMind’s software, however, is that the programmers didn’t try to hard-code these laws of physics into it. Instead, Eslami said, the software started as a blank slate and was able to “effectively discover these rules by looking at images.”
It’s the latest demonstration of the incredible versatility of deep neural networks. We already know how to use deep learning to classify images, win at Go, and even play Atari 2600 games. Now we know they have a remarkable capacity for reasoning about three-dimensional spaces.
How DeepMind’s generative query network works
Here’s a simple schematic from DeepMind that helps provide an intuition about how the GQN is put together:
Under the hood, the GQN is really two different deep neural networks connected together. On the left, the representation network takes in a collection of images representing a scene (together with data about the camera location for each image) and condenses these images down to a compact mathematical representation (essentially a vector of numbers) of the scene as a whole.
Then it’s the job of the generation network to reverse this process: starting with the vector representing the scene, accepting a camera location as input, and generating an image showing what the scene would look like from that angle.
Obviously, if the generation network is given a camera location corresponding to one of the input images, it should be able to reproduce the original input image. But this network can also be provided with other camera positions—positions for which the network has never seen a corresponding image. The GQN is able to produce images from these locations that closely match the “real” image that would be taken from the same location.
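The two-network pipeline can be sketched in a few lines of code. This is a deliberately simplified toy, not DeepMind's implementation: the real GQN uses convolutional and recurrent networks on 64×64 images, while here both networks are single linear layers with made-up dimensions, chosen only to show the data flow from observations to scene vector to rendered image.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 64      # flattened image size (toy; the real GQN uses 64x64 RGB images)
POSE_DIM = 7      # camera position and orientation parameters
SCENE_DIM = 16    # size of the compact scene representation (the "bottleneck")

# Toy "representation network": a single linear layer applied to each
# (image, camera pose) pair.
W_enc = rng.normal(scale=0.1, size=(SCENE_DIM, IMG_DIM + POSE_DIM))

def represent(images, poses):
    """Encode each observation and sum the results into one scene vector."""
    r = np.zeros(SCENE_DIM)
    for img, pose in zip(images, poses):
        r += W_enc @ np.concatenate([img, pose])
    return r

# Toy "generation network": maps (scene vector, query pose) to an image.
W_dec = rng.normal(scale=0.1, size=(IMG_DIM, SCENE_DIM + POSE_DIM))

def generate(r, query_pose):
    return W_dec @ np.concatenate([r, query_pose])

# Three observed snapshots of one scene, then a render from a new viewpoint.
images = [rng.normal(size=IMG_DIM) for _ in range(3)]
poses = [rng.normal(size=POSE_DIM) for _ in range(3)]
scene_vector = represent(images, poses)
prediction = generate(scene_vector, rng.normal(size=POSE_DIM))
print(scene_vector.shape, prediction.shape)  # (16,) (64,)
```

Note that however many snapshots go in, the scene vector stays the same small size—that fixed-size bottleneck is what forces the representation to be compact.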
“The two networks are trained jointly, in an end-to-end fashion,” the DeepMind paper says.
The team used the standard machine learning technique of stochastic gradient descent to iteratively improve the two networks. The software feeds some training images into the network, generates an output image, and then observes how much this image diverges from the expected result. Whereas a conventional neural network uses externally supplied labels to judge whether its output is correct, the GQN’s training algorithm uses scene images both as inputs to the representation network and as a way to judge whether the output of the generation network is correct.
If the output doesn’t match the desired image, then the software back-propagates the errors, updating the numerical weights on the thousands of neurons to improve the network’s performance. The software then repeats the process many times, and on each pass the network gets a little bit better at getting the input and output images to match.
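That training loop can be illustrated with a minimal sketch. Again, this is a hypothetical stand-in, not DeepMind's code: both networks are shrunk to single linear layers with tiny made-up dimensions, there is just one training example, and the gradients are written out by hand. What it does show is the key point from the paper—the reconstruction error flows backward through the generation network into the representation network, so both are updated jointly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny toy dimensions (hypothetical; the real GQN works on 64x64 images).
IMG_DIM, POSE_DIM, SCENE_DIM = 8, 2, 4

# One training example: an observed (image, pose) pair, a query pose,
# and the "real" image taken from that query pose.
obs = rng.normal(size=IMG_DIM + POSE_DIM)
query = rng.normal(size=POSE_DIM)
target = rng.normal(size=IMG_DIM)

# Both networks are single linear layers here, purely for illustration.
W_enc = rng.normal(scale=0.1, size=(SCENE_DIM, IMG_DIM + POSE_DIM))
W_dec = rng.normal(scale=0.1, size=(IMG_DIM, SCENE_DIM + POSE_DIM))

losses = []
lr = 0.05
for step in range(300):
    # Forward pass through both networks.
    r = W_enc @ obs                       # compact scene representation
    z = np.concatenate([r, query])
    pred = W_dec @ z                      # rendered image for the query pose
    err = pred - target
    losses.append(0.5 * err @ err)

    # Backward pass: the error propagates through the generation network
    # into the representation network, so both are updated jointly.
    dW_dec = np.outer(err, z)
    dr = (W_dec.T @ err)[:SCENE_DIM]      # gradient with respect to r
    dW_enc = np.outer(dr, obs)
    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each pass nudges the weights so the rendered image matches the target a little better; over many passes the loss shrinks toward zero, which is the "getting a little bit better each time" described above.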
“You can think of it like two funnels connected to each other so that the bottlenecks connect in the middle,” Eslami told us. “Because that bottleneck is tight, the two networks learn to work together to make sure that the scene’s contents are communicated compactly.”
The network can make educated guesses about regions it can’t see
During the training process, the neural network is provided with multiple images from each of many different “rooms” with similar characteristics. In one experiment, the team generated a set of stylized square “rooms” containing multiple geometric shapes like spheres, cubes, and cones. Each room also has light sources and wall colors and textures chosen at random. Because the network is trained on data from multiple “rooms,” it must find ways to represent a room’s contents in a way that generalizes well.
Once the GQN is trained, it can then be provided with one or more images from a new “room” that it hasn’t seen before. Having trained on a bunch of other rooms with similar characteristics, the network has a good intuition about what rooms normally look like and so is able to make educated guesses about portions of the room that are not directly visible.
For example, a GQN can predict that a repeating pattern on the wall is likely to continue on the part of the wall obscured by other objects. It can predict how objects in the scene will cast shadows on the wall, floor, and other objects. And it does all this without the researchers hard-coding any explicit rules about either the physics of light or the characteristics of the scenes being analyzed.
“It can learn things that we don’t know how to learn by hand,” Eslami told us. “The fact that tables are usually situated next to chairs—that’s something we intuitively know, but hard to quantify and code up. The neural network can learn that, in the same way it learns that objects cast shadows.”
In other words, suppose that a GQN was trained with a bunch of images of home interiors and was then given images from a house it hadn’t seen before. If the available images showed only one half of the dining room table, the network could likely figure out what the other half of the table looked like—and that there are probably chairs next to the table. If there’s a bedroom-sized room upstairs whose interior doesn’t appear in any of the images, the network might guess that it has a bed and a dresser in it.
And this wouldn’t be because the network has a conceptual understanding of what tables and chairs or beds are. It would simply have observed that, statistically speaking, table-shaped objects tend to have chair-shaped objects next to them while bedroom-shaped rooms tend to have bed-shaped objects inside of them.
Generative query networks are highly versatile
The networks the DeepMind team built can draw remarkably rich inferences from a very limited amount of data. In another experiment, the researchers trained the network by showing it a bunch of randomly-generated shapes that look like three-dimensional Tetris pieces. During the training process, the network was shown a sequence of different randomly-generated pieces, with several images of each piece.
Once the network was trained, the researchers gave the network a single image of a new Tetris shape it hadn’t seen before. From this single image, the network was often able to generate realistic three-dimensional images of the piece from any other angle:
Of course, this isn’t always possible. If the single example image is taken from an angle where some segments of a piece are hidden, then there’s no way the network can know what the obscured segments look like. In this case, the network will randomly generate one of the many shapes that are consistent with the observed portion of the image. But if all segments are visible in the example image, the network is remarkably good at inferring the shape of the piece and rendering images of it from any angle.
And the GQN can handle surprisingly complex scenes. In another experiment, the researchers constructed three-dimensional mazes that look a bit like miniature Doom levels. Because these virtual environments had multiple rooms and passages, no single image could show more than a small portion of the overall environment. But if it’s given a half-dozen snapshots of a new maze, a GQN is able to assemble an accurate model of the entire maze—or at least of those portions of it that are shown in at least one image.