Online video is a huge untapped source of training data. OpenAI has now found a new way to use it.

OpenAI’s Minecraft bot is considered to be the best of its kind to date. And there’s a reason for that: During the training phase, the system was able to view 70,000 hours of video material of people playing the popular computer game. The system is an example of a powerful new technique that could be used to train machines for a wide range of tasks—simply by turning to video sites like YouTube, which represent a huge source of training data that hasn’t been exploited until now.
By watching YouTube, the Minecraft AI learned to perform intricate sequences of keystrokes and mouse clicks to accomplish in-game tasks like chopping down trees and crafting tools. It’s the first Minecraft bot that can make so-called diamond tools, a task that usually takes a good human player 20 minutes of high-speed clicking, or about 24,000 actions.
The Minecraft bot is the next breakthrough in a technique that has come to be known as imitation learning, in which neural networks are trained to perform tasks by watching humans complete them. With the help of imitation learning, artificial intelligence can already be trained to control robotic arms, drive cars or navigate through websites.
The internet is a rich resource
There is a huge number of videos on the internet showing people performing various tasks. By tapping into this resource, the researchers hope to achieve for imitation learning what GPT-3 achieved for large language models. “In recent years we have seen the rise of the GPT-3 paradigm, where large models trained on huge amounts of text found on the web have developed amazing capabilities,” says Bowen Baker of OpenAI, who is part of the team behind the new Minecraft bot. “A big part of that is because we’re modeling what people do when they’re online.”
The problem with existing approaches to imitation learning is that video demonstrations have to be hand-labeled at each step: if you do this action, this happens; if you do that action, that happens; and so on. Such manual labeling is very labor-intensive, so these datasets are usually small. Baker and his colleagues wanted to find a way to turn the millions of videos available online into a new dataset.
The team’s approach, called video pre-training (VPT), bypasses the bottleneck in imitation learning by training another neural network to label videos automatically. The researchers first hired crowd workers to play Minecraft and recorded their keystrokes and mouse clicks along with the video from their screens. This gave them 2,000 hours of “annotated” Minecraft play, which they used to train a model that matches actions to their on-screen results. For example, clicking a mouse button in a certain situation makes the character swing its axe.
AI describes itself
The next step was to use this model to generate action labels for 70,000 hours of unlabeled video from around the web and then train the Minecraft bot on this larger dataset. “Video is a training resource with great potential,” says Peter Stone, executive director of Sony AI America, who has researched imitation learning.
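The label-then-train pipeline described above can be illustrated with a toy sketch. It is not OpenAI's implementation: the real system uses neural networks on video frames, whereas here a "video" is just a list of positions in a one-dimensional world, and the function names (`train_labeler`, `label_video`, `clone_behavior`) and action names are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_labeler(labeled_clips):
    """Step 1: learn from a small labeled set which action explains each change.

    labeled_clips is a list of (frames, actions) pairs, where actions[i]
    is the recorded input between frames[i] and frames[i + 1].
    """
    votes = defaultdict(Counter)
    for frames, actions in labeled_clips:
        for before, after, action in zip(frames, frames[1:], actions):
            votes[after - before][action] += 1
    # Keep the most frequently recorded action for each observed change.
    return {change: counts.most_common(1)[0][0] for change, counts in votes.items()}


def label_video(labeler, frames):
    """Step 2: infer an action for every transition in an unlabeled clip."""
    labeled = []
    for before, after in zip(frames, frames[1:]):
        action = labeler.get(after - before)
        if action is not None:  # skip changes the labeler has never seen
            labeled.append((before, action))
    return labeled


def clone_behavior(pseudo_labeled):
    """Step 3: imitation learning on the automatically labeled data.

    The 'policy' is a lookup table: observation -> most common action.
    """
    votes = defaultdict(Counter)
    for observation, action in pseudo_labeled:
        votes[observation][action] += 1
    return {obs: counts.most_common(1)[0][0] for obs, counts in votes.items()}


# Small recorded set with actions (stand-in for the crowd workers' play).
recorded = [
    ([0, 1, 2, 2], ["right", "right", "stay"]),
    ([2, 1, 0], ["left", "left"]),
]
# Larger set without actions (stand-in for video from the web).
web_videos = [[0, 1, 2, 3], [5, 4, 3, 3], [1, 2, 3, 3]]

labeler = train_labeler(recorded)
pseudo_labeled = [pair for clip in web_videos for pair in label_video(labeler, clip)]
policy = clone_behavior(pseudo_labeled)
print(policy)
```

The point of the design is that only the small recorded set needs human input; everything the final policy learns comes from clips that were labeled automatically.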
Imitation learning is an alternative to reinforcement learning, in which a neural network learns through trial and error to solve a task from scratch. Reinforcement learning is behind many of the biggest AI breakthroughs of recent years. It has been used to train models that can beat humans at games, control a fusion reactor, and find a faster way to do basic math.
The problem is that reinforcement learning works best on tasks that have a clear goal and where random actions can lead to accidental success. Reinforcement learning algorithms reward these accidental successes to make them more likely to be repeated. Minecraft, however, is a game without a clear goal. Players can do whatever they want: wander through a computer-generated world, mine different materials and combine them into different objects.
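The idea of rewarding accidental successes can be shown in a few lines. The following is a deliberately simplified sketch, not any particular published algorithm: the function name `train_by_trial_and_error`, the action names and the update rule (add the reward to the chosen action's preference) are all assumptions made for illustration.

```python
import random


def train_by_trial_and_error(actions, goal, steps=200, seed=0):
    """Pick actions at random, weighted by preference, and reinforce successes."""
    rng = random.Random(seed)
    preferences = {action: 1.0 for action in actions}
    for _ in range(steps):
        weights = [preferences[action] for action in actions]
        action = rng.choices(actions, weights=weights)[0]
        # A reward arrives only when the action happens to hit the goal.
        reward = 1.0 if action == goal else 0.0
        preferences[action] += reward  # rewarded actions become more likely
    return preferences


actions = ["jump", "dig", "chop", "walk"]

# With a clear goal, the accidental hits on "chop" are reinforced.
print(train_by_trial_and_error(actions, goal="chop"))

# With no goal there is never a reward, so nothing is learned.
print(train_by_trial_and_error(actions, goal=None))
```

In the second call every preference stays at its starting value, which mirrors the difficulty with an open-ended game: without a reward signal, trial and error has nothing to reinforce.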
Minecraft world is just big enough
Minecraft’s openness makes it a good environment for training AI. Baker was one of the researchers behind Hide & Seek, a project that released bots into a virtual playground where they used reinforcement learning to figure out how to cooperate and use tools to win simple games. But the bots soon outgrew their surroundings. “The agents sort of took over the universe; there was nothing else for them to do,” says Baker. “We wanted to expand it, and we thought Minecraft would be a great area to work on.”
They are not alone in this. Minecraft is increasingly becoming an important testing ground for new AI techniques. MineDojo, a Minecraft environment with dozens of pre-built tasks, won an award at this year’s NeurIPS, one of the largest AI conferences. Using VPT, OpenAI’s bot is able to perform tasks that would have been impossible with reinforcement learning alone, such as making planks and turning them into a table, which takes about 970 consecutive actions. Still, the team found that the best results were obtained when imitation learning and reinforcement learning were used together. A VPT-trained bot fine-tuned using reinforcement learning was able to complete tasks with more than 20.
The researchers believe their approach could be used to train AIs for other tasks. For example, it could be applied to bots that use a keyboard and mouse to navigate websites, book flights or buy groceries online. In theory, it could also be used to train robots to perform physical tasks in the real world by copying first-person videos of people doing those tasks. “That’s plausible,” says Stone.
Real world too complicated?
But Matthew Guzdial of the University of Alberta in Canada, who used video to teach AI the rules of games like Super Mario Bros., doesn’t think that’s going to happen anytime soon. Actions in games like Minecraft and Super Mario Bros. are performed by pressing buttons. Actions in the real world are far more complicated and harder for a machine to learn. “This brings with it a whole range of new research problems,” says Guzdial.
“This work is further evidence that it is possible to scale and train models on large datasets to perform well,” says Natasha Jaques, who works on multi-agent reinforcement learning at Google and the University of California, Berkeley. Large internet-scale datasets will certainly open up new possibilities for AI, Jaques says: “We’ve seen this again and again, and it’s a great approach.” But OpenAI puts a lot of faith in the power of large datasets alone, she says: “Personally, I’m a bit more skeptical that data can solve any problem.”
Still, Baker and his colleagues believe that collecting more than a million hours of Minecraft videos will make their AI even better. It’s probably the best Minecraft-playing bot yet, says Baker: “But with more data and bigger models, I’d expect it to feel like you’re watching a human play, rather than a baby AI trying to imitate a human.”