OpenAI Introduces Video PreTraining (VPT), a New Semi-Supervised Imitation Learning Technique


A vast number of videos on the internet are freely available to learn from. However, these videos, such as one of a digital artist drawing a beautiful sunset, show only what happened, not the exact sequence of mouse movements and key presses that produced it. In other words, the lack of action labels poses a new challenge: the videos provide no record of how things were done.

The OpenAI team presents Video PreTraining (VPT), a new yet simple semi-supervised imitation learning technique for making use of the abundance of unlabeled video data readily available on the internet.

The researchers started by collecting a small dataset from contractors, recording not only their video but also the actions they took, which in this case are key presses and mouse movements. They then use this data to train an Inverse Dynamics Model (IDM), which predicts the action taken at each step in the video.

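The researchers say this task is much simpler than behavioral cloning, because the IDM can use both past and future frames to infer each action, and it therefore requires far less data. The trained IDM can then be used to label a much larger dataset of internet videos, from which a model learns to act via behavioral cloning.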
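The three stages described so far (train an IDM on a small labeled set, use it to label action-free video, then clone the labeled behavior) can be sketched in a toy setting. This is only an illustration under loud assumptions: frames are integer positions on a line, the "models" are lookup tables built by majority vote rather than the neural networks used in the paper, and every name here (`train_idm`, `pseudo_label`, `behavioral_cloning`, the sample data) is hypothetical.

```python
from collections import Counter, defaultdict

def train_idm(labeled_steps):
    """Toy inverse dynamics model: learn frame change -> action.

    labeled_steps holds (frame_before, frame_after, action) triples, standing
    in for the small contractor dataset. Seeing frame_after is what makes the
    model non-causal: it may look at what happens next.
    """
    votes = defaultdict(Counter)
    for before, after, action in labeled_steps:
        votes[after - before][action] += 1
    return {delta: counts.most_common(1)[0][0] for delta, counts in votes.items()}

def pseudo_label(idm, frames):
    """Attach an inferred action to every frame of an action-free video."""
    return [(frames[t], idm[frames[t + 1] - frames[t]])
            for t in range(len(frames) - 1)]

def behavioral_cloning(labeled_pairs):
    """Toy causal policy: pick the most common action seen for each frame."""
    votes = defaultdict(Counter)
    for frame, action in labeled_pairs:
        votes[frame][action] += 1
    return {frame: counts.most_common(1)[0][0] for frame, counts in votes.items()}

# Small labeled set (assumed data) and a longer unlabeled "web video".
contractor_data = [(0, 1, "right"), (1, 0, "left"), (2, 2, "stay")]
web_video = [0, 1, 2, 3, 3]

idm = train_idm(contractor_data)
policy = behavioral_cloning(pseudo_label(idm, web_video))
print(policy)  # {0: 'right', 1: 'right', 2: 'right', 3: 'stay'}
```

Note that the policy ends up with an action for frame 3, a state that never appears in the labeled contractor data. That is the point of the pipeline: the small labeled set only has to teach the easier inverse task, and the large unlabeled set supplies the behavior.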


The researchers chose Minecraft to validate their method because a wealth of freely available video data exists for it, and because it is open-ended and offers a wide variety of activities, much like real-world applications such as using a computer. Unlike previous Minecraft work that uses simplified action spaces to ease exploration, their model uses the native human interface of mouse and keyboard, which is harder but much more broadly applicable.

Their behavioral cloning model (the “VPT foundation model”), trained on 70,000 hours of IDM-labeled web video, accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; for a human Minecraft player, this sequence takes around 50 seconds, or 1,000 consecutive game actions. The model also exhibits other complex behaviors that players perform frequently, such as swimming, hunting animals for food, and eating that food.

Foundation models are designed to have a broad behavior profile and to be generally capable across a wide range of tasks. It is common to fine-tune these models on smaller, more specific datasets to incorporate new knowledge or to let them specialize on a narrower distribution of tasks. The researchers observed a significant improvement in the foundation model’s ability to reliably perform early-game skills after fine-tuning.

They suggest that using the labeled contractor data to train an IDM (as one step of the VPT pipeline) is significantly more effective than training a BC foundation model directly on that same small contractor dataset.


Reinforcement learning (RL) is a powerful tool for eliciting high, potentially even superhuman, performance when a reward function can be specified. However, many tasks require overcoming hard exploration challenges, and most RL methods tackle these with random exploration priors. For example, models are often incentivized to act randomly via entropy bonuses. Since emulating human behavior is likely to be much more helpful than taking random actions, the VPT model should be a far better prior for RL.
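To make the entropy bonus concrete, here is a minimal sketch of the generic form of entropy regularization, not the specific objective used in the VPT paper. The function names and the coefficient `beta` are assumptions chosen for illustration.

```python
import math

def entropy(action_probs):
    """Shannon entropy (in nats) of a categorical action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0)

def entropy_regularized_objective(expected_return, action_probs, beta=0.01):
    """Expected return plus a bonus that grows as the policy gets more random.

    beta is a hypothetical weighting coefficient for this sketch.
    """
    return expected_return + beta * entropy(action_probs)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally random over four actions
decisive = [1.0, 0.0, 0.0, 0.0]      # always picks the same action

# With equal returns, the bonus favors the more random policy.
print(entropy(uniform) > entropy(decisive))  # True
```

The bonus pushes an untrained policy toward uniform random actions, which is the "random exploration prior" mentioned above. A policy pretrained on human behavior starts from a far more structured distribution.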

The team set their model the challenging task of obtaining a diamond pickaxe, an unprecedented capability in Minecraft that is made even harder by using the native human interface. Crafting a diamond pickaxe requires a long and complicated sequence of subtasks. To make the task tractable, they reward agents for each item in the sequence.
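Rewarding each item in the sequence can be pictured as a milestone reward that pays out the first time an item shows up in the agent's inventory. The sketch below is an assumption-laden illustration: the class name, the shortened milestone list, and the flat reward of 1.0 per item are all hypothetical and are not taken from the paper.

```python
# Shortened, illustrative milestone list; the real sequence is longer.
MILESTONES = ["log", "planks", "crafting_table", "stick", "diamond_pickaxe"]

class MilestoneReward:
    """Pays a one-time reward for each milestone item the agent obtains."""

    def __init__(self, milestones, reward_per_item=1.0):
        self.remaining = set(milestones)
        self.reward_per_item = reward_per_item  # hypothetical flat value

    def step(self, inventory):
        """Return the reward for items newly present in the inventory dict."""
        newly_obtained = {item for item in self.remaining
                          if inventory.get(item, 0) > 0}
        self.remaining -= newly_obtained
        return self.reward_per_item * len(newly_obtained)

shaper = MilestoneReward(MILESTONES)
first = shaper.step({"log": 1})                            # first log
repeat = shaper.step({"log": 3})                           # more logs, no new item
later = shaper.step({"log": 3, "planks": 4, "stick": 2})   # two new items
print(first, repeat, later)  # 1.0 0.0 2.0
```

Paying only once per item keeps the agent from farming an early milestone such as logs, while still giving it a learning signal long before the final pickaxe is crafted.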

The researchers found that an RL policy trained from a random initialization (the standard RL method) achieves almost no reward, never learning to collect logs and only rarely collecting sticks. In stark contrast, fine-tuning from a VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but even reaches a human-level success rate at collecting all the items leading up to the diamond pickaxe. This is the first time a computer agent has been shown to be able to craft diamond tools in Minecraft, an activity that typically takes humans over 20 minutes (24,000 actions).
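The two figures quoted for human play (1,000 actions in about 50 seconds for the crafting table, 24,000 actions in 20 minutes for diamond tools) imply the same action rate, as a quick check shows. The variable names are illustrative.

```python
# Figures quoted in the article for human play.
crafting_table_actions = 1_000
crafting_table_seconds = 50
diamond_tool_actions = 24_000
diamond_tool_minutes = 20

rate_short = crafting_table_actions / crafting_table_seconds
rate_long = diamond_tool_actions / (diamond_tool_minutes * 60)
print(rate_short, rate_long)  # 20.0 20.0
```

Both work out to 20 actions per second, so the diamond pickaxe task is roughly 24 times longer than the crafting table sequence in terms of consecutive decisions.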

VPT paves the way for agents to learn to act by watching the vast number of videos on the internet. Compared with generative video modeling or contrastive methods, which would only yield representational priors, VPT offers the exciting possibility of directly learning large-scale behavioral priors in domains beyond just language.

This article is written as a summary by Marktechpost Staff based on the paper 'Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos'. All credit for this research goes to the researchers on this project. Check out the paper, GitHub and blog post.


About Norman Griggs
