Minecraft Reinforcement Learning

This is a loosely organized crosspost of a project I did in Doina Precup's excellent course on Reinforcement Learning at McGill University.

Research into the use of Minecraft as a platform for building custom reinforcement learning environments. Detailed blog post coming soon.Can Minecraft be a powerful alternative to classic control, Nvidia Isaac, and many other RL frameworks due to its open ended nature and possibilities for custom environment crafting?

successful RL contests IE basalt, etc.
mini challenges with MineRL and Minedojo

Challenges:

Spotty documentation
Complex implementations without creature comforts like Stable baselines support
Network errors
Docker requires running via vglrun, no easy way to preview it
Restrictions on numpy and gym versions block some cutting edge features
Docker env doesn’t have MPS / m1 support (big for students developing at home)
Example notebooks often broken on Colab due to now incompatible versions
Breaking changes in Gradle forge plugin seem to have broken everything

apply plugin: 'net.minecraftforge.gradle.forge'

THE natural liberty of man is to be free from any superior power on earth, and not to be under the will or legislative authority of man, but to have only the law of nature for his rule. The liberty of man, in society, is to be under no other legislative power, but that established, by consent, in the commonwealth;

As an alternative to Atari (DQN, PPO using CNNs)

It’s practical, check minerl-parkour repo (replicated study)

As an alternative to Cartpole, FrozenLake

Theoretically possible via custom action and observation space in Malmo, but difficult to setup due to lack of support for Malmo with current environments.

Needs custom action and observation spaces
Can be made practical for frozen lake
No easy way to implement CartPole due to continuous action space and lack of physics sim.

Abstract

This paper introduces Minecraft as a reinforcement learning environment, and aims to provide insights as to whether or not it is a useful platform for introductory RL education. Through doing a meta-analysis of challenges and frameworks using Minecraft as an environment, as well as a study on reducing the complexity of the game to allow for experimentation on lower end hardware, the practicality of using Minecraft is analyzed. Due to training time limitations, this paper does not have concrete data on algorithm performance, for this please see the also submitted comparison of deep RL algorithms on Atari games which re-uses much of the code from this exploration.

Intro

At 238,000,000 copies sold, Minecraft is by far the most popular video game in the world. It has been subject to cutting edge reinforcement learning research from top firms including OpenAI, Nvidia, Microsoft, DeepMind, and more. The top papers regarding Minecraft as an environment for reinforcement learning have focused on training models to play the game based on human priors. This paper aims to evaluate the state of Minecraft as an environment for simple, yet customizable RL experiments. For as long as Minecraft has been around, players have been constructing challenges for eachother like SkyBlock, parkour maps, mazes, and more. This paper provides a meta-analysis of existing works in Minecraft based reinforcement learning, as well as the results of various experiments in reduction of observation and action spaces, reward engineering, environment building, and developer experience.

Prior Works & Existing Solutions

2016’s Project Malmo

A project by Microsoft to enable reinforcement learning in Minecraft. Provides a gym-like environment for Minecraft, but has fallen far out of maintenance and lacks compatibility with modern Python versions and libraries. Requires complex additional setup including separately running a modified version of Minecraft and manually installing out of data libraries.

2018’s marLo

An extension of Malmo for multi-agent reinforcement learning made by CrowdAI. Provides a set of environments for multi agent RL. Based closely on Malmo requiring a similar setup procedure.

2018’s MalmoEnv

A Microsoft built wrapper of Malmo which allows for easier setup and installation, making it more similar to conventional Gym environments.

2019-current MineRL:

Provides a gym-like interface for Malmo, but maintained more recently with better library support, python version support, active challenges, and included datasets for imitation learning.

2021’s IGLU

A minerl derivative built for an RL/NLP challenge

2022’s MineDojo

Built on top of MineRL, Minedojo provides a massive library of pre-made challenges and datasets from Reddit, Youtube, and Minecraft Wikis to allow experimentation in open ended Minecraft models. An available Docker image makes setup and installation much easier than alternative frameworks.

This paper focuses on implementations using MineRL and Minedojo, as they are currently maintained and provide a robust set of environments and mostly current documentation.

Defining What to Build

The goal of this exploration is to follow in the spirit of Minecraft players in creating a challenge for the agent to solve. The challenge is meant to have the following properties

Can give results within a reasonable amount of iterations on consumer grade hardware
Is narrow enough in scope to allow useful results without imitation learning
Can be run without excessively complex setup

Hunt Cow Dense Environment

As a naïve first attempt, MineDojo’s hunt cow environment was used.

env = minedojo.make(task_id="hunt_cow", image_size=(160, 256))

This default environment spawns the player in a plains biome, with a cow nearby. The player gets a sparse reward for killing the cow.

The Model

Based on existing solutions to Minecraft challenges, there seemed to be a consensus that PPO (proximal policy optimization) with a CNN as a function approximatior is a robust solution. The first experiments were run using a custom implemented PPO algorithm, with later tests using Stable Baselines implementations to eliminate the variable of an incorrectly implemented algorithm. 4 frame frame stacking was used for temporal awareness and multi instance parallel learning was used to speed up training time when using Stable Baselines based models.

With the sparse reward environment, quite predictably the model failed to converge to any sort of useful result. Because the agent never got a reward early on, there was no signal to guide the policy so the agent was essentially acting with no awareness of its task.

To fix this, some reward engineering was necessary. Based on example code, a custom environment was written which calculates a dense reward based on some in game signals as well as some human information about how the game is meant to be played.

Reduction of action space, removing inventory and crafting related actions, keeping only movement, camera movement, and attack, allows for more effective training by not needing to train the model to not take unnecessary actions. Naive of the action space like this caused the model to start making some meaningful progress.

The first iteration of the dense reward environment provides rewards as follows:

A small reward calculated based on the distance between the player and the nearest entity of type Cow.
A reward of 5 every time the player damages a cow
A reward of 100 for killing a cow

After a day of training, this reward system converged to an interesting strategy — it would immediately look at the ground wherever it was and begin to dig a hole. The hole would cause entities nearby to fall in, maximizing the distance score, and while the player stares at the ground, it will hit any entity that is standing in the same spot, giving more reward if it is a cow.

This strategy was sometimes effective, but different from the choices a human may make when playing the game. This demonstrates Minecraft’s open ended nature, and how we can’t necessarily expect humanlike behaviour. To make the agent behave more as expected, an additional reward signal was added to penalize looking straight at the sky (which is all that some iterations of training would do), and to penalize looking down at the ground (which is only useful for the hole digging strategy).

Unfortunately, the model based on the augmented reward signal did not show sings of convergence before the timeline for submitting this project prevented further training. In an effort to boost training speed, the code was re-implemented to use stable-baselines3 based PPO with multiprocessing using a custom wrapper around Minedojo.

This approach did not have sufficient training time to converge.

To improve performance, several approaches were taken:

Reduction of observation space to a smaller image
More frame stacking (8)
The addition of privileged observations such as the distance to the nearest cow.

Future Improvements for Hunting Cow

Future work for this approach would be to further wrap the environment, for an observation space of the player coordinates (3 float), 9x9 nearby voxel info (for terrain awareness), current camera angle, and the coordinates of the nearest cow. Using an MLP, a model could be trained based on these privileged observations, and then its actions could be transferred to a CNN based model with the screen as input via imitation learning. Another approach which would allow this model to converge would be to bias it towards human-like actions by imitation learning on the MineRL dataset.

Working with a more abstract goal - Action Space Reduction: Clustering Common Human Actions via K-Means & Imitation Learning

Using vectorized environments from MineRL, which represent the states and actions as vectors, we can more intelligently reduce the action space, so that the agent takes actions from a set which “make sense”.

The way we define this set of actions is by taking the human generated data from the MineRL dataset. This dataset provides data collected from human players over several months of running a Minecraft server. A dataset is provided containing 2gb of vectorized states and actions which, when clustered, represent a set of actions which make sense in the context of the given task. In this case, the task under study was chopping trees. The player spawns in a forest with an axe and must collect as much wood as possible.

In this task, our naive PPO agent failed to do anything, never receiving its sparse reward. On the other hand, an agent trained in the vectorized k-mean’s’d action space took actions which make sense in the context of the challenge, and performed better. However, with limited training time it was also not able to converge, though it is more likely that it might given more iterations.

An agent trained using imitation learning via behavioural cloning on the vectorized observation space performed even better. Comparing it to an equivalent behavioural cloning agent trained on a non-vectorized environment, it is likely to be far faster to train. However, due to limited time in submitting this assignment, these tests did not have time to run to conclusion. What is known is that after less than an hour of training, the vectorized agent is seeing similar rewards to the vanilla behavioural cloning agent after several hours on the same hardware.

It is possible that due to the reduced action space of the vectorized agent, it will may underperform the traditionally trained agent given very long training time.

For this reason, the vectorized environment can be used to provide a quick expert baseline to new challenges which can then be imitated by a model with a full action and observation space access.

Meta-analysis of challenges using Minecraft

Most notable among Minecraft RL challenges are those hosted by MineRL, such as the MineRL Basalt challenge. These challenges rely on the MineRL dataset and environment and have users train models to perform tasks from navigating simple terrain to obtaining diamonds. The most recent such challenge was in 2022 with MineRL Basalt. As time has gone on, these challenges have gotten increasingly complex, relying on advanced model architectures to provide results.

Earlier MineRL challenges are the focus of this paper, as they more closely provide an avenue for exploration of deep RL algorithms in a way that may be more interesting than traditional environments.

Future work

This paper leaves much of its learning objectives incomplete due to training time limitations. The extremely slow nature of Minecraft as an environment (training at around 30fps with 8 concurrent envs) makes it a challenge unless techniques are carefully chosen prior to experimentation. Because of this, also submitted for this assignment is the paper on deep RL algorithms in Atari games with Meilin Lyu. That paper has more concrete comparison of algorithms, as well as a deeper analysis of techniques not seen in class.

One particularly interesting aspect of Minecraft as an environment is its customizability. Since it has such a range of observations, actions, reward functions, and scenarios, it can be used to run ablation studies on which aspects of a game can correlate to performance of various RL approaches. An exploration of the same challenge, with dense vs sparse reward, rgb vs voxel vs vectorized observations, reduced vs full vs clustered action spaces, and more can yield useful information about the specific aspects of a game which make certain techniques more applicable.