The Embodied Communication Game:
A Task for Reinforcement-Learning Agents
Evan McCormick
December 15, 2024
Abstract
An important question in linguistics, psychology, and now computer science is how
communication systems are created and modified by the agents who use them. Evolutionary
linguistics tries to answer this question in the context of human language, while emergent-language research tries to answer it for both humans and artificial agents. One well-known experiment
in emergent language is the Embodied Communication Game (ECG). To succeed in the
game, human players must generate and use a novel system of communication, but they are
not explicitly told this, and are given no explicit channels for communication. Instead, they
must communicate via actions in the game which could have other uses. Most human players
were able to establish a system of communication in the ECG. So far, no machine learning
agent has been able to match human performance in the ECG. For this study, we pitted an
array of modern reinforcement learning agents against a virtual version of the ECG. None
of the agents found a communicative solution to the game. Instead, the most successful
agents settled on the optimal non-communicative strategy. This study highlights humans’
unique propensity to create, send, recognize, and receive signals without explicit instruction.
Reinforcement learning agents struggle to use communication to their advantage, unless they
are explicitly rewarded for doing so.
Related Works
Evolutionary linguistics has itself evolved as a field over the last 50 years, with new
approaches supplanting old ones. Early research was historical analysis: charting the evolu-
tion of human languages throughout history. Later research involved modern case studies,
where researchers observed the creation and evolution of novel languages in isolated modern communities (Neidle, Kegl, MacLaughlin, et al. [1]). More recently, the field has seen a
boom in experimental research, where subjects participate in tasks and games designed to
elicit the creation of novel communication systems. Over time, these controlled experiments
(emergent-language games) have become a popular approach in evolutionary linguistics.
Generally, the rules of an emergent-language game forbid communication via talking, gesturing, etc., but participants are allowed to communicate via their actions in the game. The games are designed such that the only way for participants to achieve the maximum score is to communicate via the limited channels available to them. This usually requires them to
signal their intentions or their game state to their partners via novel symbols or signs. One
such emergent-language game conducted by Scott-Phillips, Kirby, and Ritchie [2] is called
the Embodied Communication Game (ECG), and tests the ability of participants not only
to create an artificial language to communicate their game’s state to their partners, but to
realize that a channel for communication even exists. In the ECG, players can only communicate via movements that are also necessary to navigate the environment. Thus, any movement in the game could be interpreted either as communicative or simply as a move toward the goal. In addition, the game’s instructions give no indication that communication is possible or necessary for success.
The Embodied Communication Game
The purpose of the original ECG was to study how humans could “signal signalhood”: how they could show that they are trying to communicate. The game also studied how humans might recognize an attempt at communication. The ECG was run with 2 human participants, each playing remotely from a computer. On each player’s computer screen were two 2x2 grids, each with a stick figure occupying one of the squares. For the first player,
the grid on the left was colored, so that each square was a different color, while the grid on
the right was grey. For the second player, the grid on the right had colors, while the grid
on the left was grey. Each player could control the movements of the stick figure in their
colored grid, and could see the movements of the other player’s figure on the other grid. The
game was run in short trials, where the goal was to be on the same colored square as the
other player at the end of the trial. In order to succeed at this task above chance rates,
players had to agree upon a color to choose. However, the game was designed such that the
two colored grids would only ever have one color in common, and that color would change
randomly between trials. Thus, the players could not hope to succeed in the game simply by always going to the same color (this method, known as the ‘default color’ strategy, was nevertheless the first step toward success, and is the optimal solution sans communication). The only way to
achieve a 100% success rate was for players to somehow communicate their grid’s colors to
their partner using their actions in the game, and then agree on a color to both go to.
The Evolution of Human Signaling Systems in the ECG
The original ECG was run on twelve pairs of human participants, who each played
the game for roughly 200 rounds with the same partner. Remarkably, seven out of the
twelve pairs achieved performance at levels only possible via communication. Within these
pairs, both participants were able to describe in detail the communication system they had
invented, and the method by which it had been invented. There were two main ways that
communication systems emerged: either the players gradually built up a symbolic system
from simple to more complex grammar (an organic language), or one player quickly realized
the possibility of communication, and formed a complete symbolic system, which the other
player eventually caught onto (a forced language). Of the seven pairs which successfully
communicated, five developed an organic language, while two developed a forced language.
The organic languages each developed in roughly the same way. They went through similar
phases of development, with new symbols being added and new meaning being added to
symbols in roughly the same order by each successful pair. The first step in organic language
development was for the pair to come upon the ‘default color strategy’, wherein both players chose the same color (e.g. blue) whenever it was available. Once one default color was
established, the players then agreed on a secondary color to go to in the event that neither
player had the default color. At some point (usually shortly after the secondary default color was established), one player was faced with a grid which did not contain the default color and, realizing their partner would go for the default color if they had it, started frantically moving around the grid (human players sometimes described this as an attempt to communicate to their partner “I don’t have it!”, while others simply stated they did this out of frustration). The other player then interpreted this frantic movement as a signal: “I don’t have the default color, go for the secondary color!”. Once this first signal was established, it became shorter and more distinct, and came to refer to the secondary color. Then, when the players were faced with a grid containing neither the primary nor the secondary default color, they created a
novel signal to indicate a 3rd color. Over time, these signals decreased in length and each
became associated with one of the square colors. The forced languages involved one player
realizing early on the effectiveness of communicating the color they intended to go to, and
committing to a particular symbolic language signifying their color choice. These languages
actually performed worse than the organic languages, as it sometimes took a very long time
for the communicator’s partner to recognize the significance of their movements.
Emergent Language in Reinforcement Learning Agents
The study of emergent language in reinforcement learning (RL) agents is a growing
field of research in linguistics and computer science. Researchers have successfully taught
RL models to communicate via artificial languages in an explicit communication task. For
example, Havrylov and Titov [3] trained a pair of Long Short-Term Memory (LSTM) models
to communicate the contents of images via a tokenized grammatical language. In their
experiment, one model was tasked with communicating to the other which of a set of images
to choose. This study differs significantly from the ECG, however, in that a clear sender and receiver of communication were established from the start, and a clear channel and form for communication were given. In other words, this study demonstrated the capacity of
RL agents to learn to communicate explicitly via a pre-ordained communication channel.
While many studies have shown the effectiveness of emergent language generation in RL models (Bullard, Meier, Kiela, et al. [4]; Lazaridou, Peysakhovich, and Baroni [5]; Lazaridou, Hermann, Tuyls, et al. [6]), these studies either relied on an explicit communication channel,
or explicitly rewarded their agents for successfully communicating. The ECG tests the ability
of its players to recognize that a channel for communication exists in the first place, and to
then use that channel effectively. This is a difficult yet possible task for human players, but in
the literature reviewed, no study exists demonstrating the ability of RL agents to complete
this task. de Bie [7] used computer agents to replicate the forced communication strategies of some human pairs in the ECG, but with limited success. Hughes, Gupta, Tolstaya, et al. [8] implemented the ECG as a multi-agent reinforcement learning environment in which they trained state-of-the-art (SOTA) reinforcement learning agents, but the agents did not produce communicative strategies. Instead, the reinforcement learning agents settled on the optimal non-communicative ‘default color’ strategy.
Experimental Methods: Testing Versions of the ECG with Reinforcement
Learning Agents
The objective of this study was to design and build versions of the ECG within which
the latest reinforcement learning agents could be tested. To accomplish this, we used Gymnasium (Gym), a Python library currently maintained by the Farama Foundation, which allows users to easily create and run reinforcement learning environments (Towers, Kwiatkowski, Terry,
et al. [9]). The reinforcement learning models were taken from Stable-Baselines3 (SB3), a package containing Python implementations of state-of-the-art reinforcement learning algorithms, designed for compatibility with Gym environments (Raffin, Hill, Gleave, et al. [10]). Finally,
since Gym and SB3 did not directly support multi-agent environments, we used Farama’s
PettingZoo API to convert our Gym environments into multi-agent environments and the
third-party SuperSuit package (SS) to allow our SB3 RL models to train in multi-agent
settings (Terry, Black, and Hari [11]).
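As a rough sketch of how these pieces fit together (package names are the standard PyPI names; the versions shown are assumptions, not necessarily those used in this study):

```python
# Assumed installation: pip install gymnasium stable-baselines3 pettingzoo supersuit
import gymnasium as gym                      # single-agent environment API
from pettingzoo import ParallelEnv           # multi-agent environment API
import supersuit as ss                       # wrappers bridging PettingZoo and SB3
from stable_baselines3 import PPO, A2C, DQN  # RL algorithms tested in this study
```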
The Simple Color Game
Firstly, we designed a simplified version of the Embodied Communication Game called
the Simple Color Game (SCG) to become familiar with building and running custom envi-
ronments and training the RL models on them. The SCG is a single-agent environment in
which the agent is placed within a 2x2 grid with 4 ‘colored’ squares (colors were represented
by a 2x2 array of integers) and is tasked with finding a ’target’ color (integer). The agent’s
observation space included three parts:
1. An int[] of size (2,2) representing the square colors on the 2x2 grid.
2. An int[] of size (2,) representing the agent’s coordinates within the 2x2 grid.
3. An int representing the target color.
The action space was a discrete space of size 4, with each number representing a
movement in a cardinal direction (up, left, down, right). The agent was given a reward
of 1 when the ‘color’ of the square it occupied within the 2x2 grid matched the target
color. We tested three SB3 models in the SCG:
1. Deep Q-Network (DQN) - A Q-learning model in which the Q-table is replaced by a function approximator (i.e. a neural network) used to estimate the Q-value of any given state-action pair (S, A).
2. Proximal Policy Optimization (PPO) - A policy-optimization model in which the size of each update to the policy network is clipped according to how significantly its choice of actions in a given state would change. PPO is designed to prevent the model from making a precipitous change in behavior and ‘falling off a cliff’ (losing its prior performance).
3. Advantage Actor Critic (A2C) - A model that uses two function approximators: an actor, which decides which action to take in a given state, and a critic, which estimates the expected value (effective Q-value) of the action taken. As the model trains, both networks are updated.
Both PPO and A2C were able to score well on the SCG after around 1,000,000 training timesteps, while DQN performed inconsistently and learned the game more slowly (Figure 1). We also tested the models on the Timed Color Game (TCG), a version of the SCG in which the reward for reaching the target square was time-discounted. Again, PPO and A2C significantly out-performed DQN, quickly achieving optimal performance (Figure 2).
From this point on, we stopped testing with DQN models.
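To make the setup concrete, below is a minimal sketch of how an SCG-like environment could be written against the Gymnasium API and trained with an SB3 model. The class name, episode length, and the choice to end the episode on success are illustrative assumptions, not the exact implementation used in this study.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class SimpleColorGame(gym.Env):
    """Illustrative 2x2 Simple Color Game: reach the square whose color matches the target."""

    def __init__(self, n_colors=4, max_steps=10):
        self.n_colors, self.max_steps = n_colors, max_steps
        self.observation_space = spaces.Dict({
            "grid":   spaces.Box(0, n_colors - 1, shape=(2, 2), dtype=np.int64),  # square colors
            "pos":    spaces.Box(0, 1, shape=(2,), dtype=np.int64),               # agent coordinates
            "target": spaces.Discrete(n_colors),                                  # target color
        })
        self.action_space = spaces.Discrete(4)  # up, left, down, right

    def _obs(self):
        return {"grid": self.grid, "pos": self.pos.copy(), "target": self.target}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.grid = self.np_random.permutation(self.n_colors).reshape(2, 2)
        self.pos = np.array([0, 0], dtype=np.int64)
        self.target = int(self.np_random.integers(self.n_colors))
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        moves = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}
        self.pos = np.clip(self.pos + np.array(moves[int(action)]), 0, 1)
        self.steps += 1
        reward = 1.0 if self.grid[tuple(self.pos)] == self.target else 0.0
        terminated = reward > 0                 # assumption: the episode ends once the target is reached
        truncated = self.steps >= self.max_steps
        return self._obs(), reward, terminated, truncated, {}

model = PPO("MultiInputPolicy", SimpleColorGame(), verbose=0)
model.learn(total_timesteps=100_000)  # the study trained for roughly 1,000,000 timesteps
```

An A2C or DQN model can be substituted simply by swapping the SB3 class in the last two lines.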
Figure 1. Learning Curve of PPO, A2C, and DQN models in the Simple Color Game
The Stop on Color Game
After successfully training the A2C and PPO models in the SCG, we designed and
tested a version of the game that required the agent to take a ‘Commit’ action when it reached the square of the target color in order to receive a reward. Thus, in this version of the game, the action space included 5 actions: Up, Left, Right, Down, and Commit.
We trained the PPO and A2C models in this Stop On Color Game (SOCG), resulting in the learning curves shown in Figure 3. Interestingly, PPO quickly improved its mean reward in the SOCG, while A2C struggled to improve.
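Relative to the SCG sketch above, the only substantive change is that the reward is gated behind the Commit action. A hypothetical step() for the SOCG might look like the following (the assumption that committing ends the episode, successful or not, is ours):

```python
COMMIT = 4  # actions 0-3 remain the movement actions; action 4 is the new Commit action
# (and self.action_space becomes spaces.Discrete(5))

def step(self, action):
    action, reward, terminated = int(action), 0.0, False
    if action == COMMIT:
        # The reward is only granted if the agent commits while standing on the target color.
        if self.grid[tuple(self.pos)] == self.target:
            reward = 1.0
        terminated = True  # committing ends the episode whether or not it was correct
    else:
        moves = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}
        self.pos = np.clip(self.pos + np.array(moves[action]), 0, 1)
    self.steps += 1
    truncated = self.steps >= self.max_steps
    return self._obs(), reward, terminated, truncated, {}
```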
Figure 2. Learning Curve of PPO, A2C, and DQN models in the Timed Color Game
The Embodied Communication Game
We then used the PettingZoo library to implement the ECG as originally conceived by Scott-Phillips et al. [2]. The game utilized the ParallelEnv class to function as a simultaneous-action game. The observation space of each agent was as follows:
1. An int[] of size (2,2) representing the square colors of their 2x2 grid.
2. An int[] of size (2,) representing the agent’s coordinates within their 2x2 grid.
3. An int[] of size (2,) representing the coordinates of the other agent within their
respective 2x2 grid.
Each grid contained 4 distinct ‘colors’ (integers chosen from within the range [0, 6]),
so that by the pigeonhole principle, at least one color would be shared between the two grids.
However, it was possible for the grids to share more than one color.
Figure 3. Learning Curve of PPO and A2C models in the Stop On Color Game
We used the SuperSuit package to convert the ParallelEnvs into SB3 VecEnvs (Vectorized Environments), such that
each Env within the VecEnv corresponded to a single agent within the ParallelEnv. This
trick allowed our SB3 models, which are normally only capable of training in single-agent
environments, to train in the ParallelEnvs. This approach has one major limitation: the
models will not find asymmetric strategies, as each agent must follow the same policy.
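A sketch of this wrapping chain, using SuperSuit's standard conversion helpers (ECGParallelEnv is a placeholder name for our ParallelEnv implementation, and the policy class would depend on how the observation space is declared):

```python
import supersuit as ss
from stable_baselines3 import PPO

# ECGParallelEnv is a placeholder for the PettingZoo ParallelEnv implementation of the ECG.
parallel_env = ECGParallelEnv()

# Flatten the two-agent ParallelEnv into a vectorized environment in which each
# sub-environment corresponds to one agent of the ParallelEnv.
vec_env = ss.pettingzoo_env_to_vec_env_v1(parallel_env)

# Expose the result through SB3's VecEnv interface. Because a single policy serves
# every sub-environment, both agents necessarily follow the same (symmetric) policy.
vec_env = ss.concat_vec_envs_v1(vec_env, 1, base_class="stable_baselines3")

model = PPO("MlpPolicy", vec_env, verbose=0)  # or MultiInputPolicy for a Dict observation space
model.learn(total_timesteps=1_000_000)
```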
Optimal Non-Communicative Performance in the ECG
The optimal non-communicative strategy in the ECG is a hierarchical default color
strategy. In this strategy, the agents come to a consensus on a primary color to commit
to whenever it is available to them. Thus, whenever both agents’ grids contain this color,
the agents will succeed at matching on it. This color (e.g. ‘yellow’) becomes the primary
default color. The agents’ success rate can be further increased by choosing 2nd, 3rd, and 4th default colors, each used only when no higher-ranked color is available. The expected reward
of this strategy is the probability of matching on the primary default color (which occurs
whenever both grids contain it), $m(C_1) = P(C_1)^2 = (4/7)^2 \approx 0.32653$, plus the probability of matching on the 2nd color, $m(C_2) = P(\neg C_1)^2 P(C_2)^2 = (3/7)^2 (4/6)^2 \approx 0.08163$, 3rd color, $m(C_3) = P(\neg C_1)^2 P(\neg C_2)^2 P(C_3)^2 = (3/7)^2 (2/6)^2 (4/5)^2 \approx 0.01306$, and 4th color, $m(C_4) = P(\neg C_1)^2 P(\neg C_2)^2 P(\neg C_3)^2 P(C_4)^2 = (3/7)^2 (2/6)^2 (1/5)^2 \cdot 1^2 \approx 0.00082$. No fifth color is necessary for this strategy, as by the pigeonhole principle both grids must share at least 1 of the 4 primary colors. Thus, the expected mean reward of the optimal non-communicative strategy is $\bar{R} = m(C_1) + m(C_2) + m(C_3) + m(C_4) = \frac{517}{1225} \approx 0.42204$. In
our implementation of the ECG, the PPO model successfully converged on a default-color strategy, but did not find a communicative strategy (Figure 4).
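The arithmetic above can be checked mechanically. A small sketch using exact rational arithmetic (the conditional probabilities follow from each grid holding 4 distinct colors drawn from 7 possibilities):

```python
from fractions import Fraction

# P(a grid contains the k-th default color | it lacks defaults 1..k-1) and
# P(a grid lacks the k-th default color | it lacks defaults 1..k-1), for grids
# of 4 distinct colors drawn from 7 possibilities.
contains = [Fraction(4, 7), Fraction(4, 6), Fraction(4, 5), Fraction(4, 4)]
lacks    = [Fraction(3, 7), Fraction(2, 6), Fraction(1, 5)]

def match_prob(k):
    """Probability that both agents match on the k-th default color (1-indexed)."""
    p = Fraction(1)
    for j in range(k - 1):
        p *= lacks[j] ** 2           # both grids lack the higher-ranked defaults
    return p * contains[k - 1] ** 2  # both grids contain (and choose) default k

expected = sum(match_prob(k) for k in range(1, 5))
print(expected, float(expected))  # 517/1225 ≈ 0.42204
```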
Adding Memory to the Models
One feature that is likely necessary, but not sufficient, for a machine-learning agent to achieve a communicative solution to the ECG is memory. PPO and A2C models
are, at their core, function approximators of the form $f(S_i, A_l) \rightarrow A_i$ (where $S_i$, $A_l$, and $A_i \in A_l$ are the current state, the action space, and an action within that action space, respectively).
These models have no way to account for the prior actions of other agents, unless those
actions are somehow encoded in the current state of the environment. Unfortunately, this
version of the ECG has no such way of encoding agents’ previous actions in the current state.
A PPO or A2C model observing that its partner is at the coordinates [0,1] in the current
state has no knowledge of whether the agent got there from [0,0], [1,1], or was in [0,1] the
turn prior. For this reason, we attempted to train a Long Short-Term Memory (LSTM) PPO
model in the ECG in addition to the standard PPO model, to see if the improved memory
could allow it to learn the communicative solution to the ECG. Unfortunately, the model did
not converge on even the non-communicative optimum which PPO was able to find (Figure
4).
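For reference, the recurrent model used here is available in the sb3-contrib package as RecurrentPPO; a minimal usage sketch (default hyperparameters, not necessarily the settings used in this study):

```python
from sb3_contrib import RecurrentPPO

# vec_env: the SuperSuit-wrapped ECG environment from the earlier sketch.
# The LSTM policy carries a hidden state across timesteps, so the model can in
# principle condition on its partner's earlier movements rather than only on
# the partner's current coordinates.
model = RecurrentPPO("MlpLstmPolicy", vec_env, verbose=0)
model.learn(total_timesteps=1_000_000)
```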
Figure 4. Learning Curve of PPO and Recurrent PPO models in the Embodied Communication
Game
Adding Memory to the Game
Finally, we designed a simplified version of the ECG in which the prior location of
each agent was encoded in the current state of the environment. This game was also much
simpler than the ECG. Rather than a 2x2 grid, each agent occupied a 2x1 grid, in which
each cell was one of three possible colors. The grids were designed to have exactly one color
in common. The observation space of each agent contained the following:
1. An int in the range [0,1] representing the agent’s current location.
2. An int[] of size (2,) representing the colors of the agent’s grid.
3. An int[] of size (10,) representing the location of the agent’s partner during the
ten previous turns of the game.
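One way such a history could be maintained inside the environment is with a fixed-length buffer that is zero-padded at the start of each episode; a small sketch (buffer length as described above, padding value and helper name are hypothetical):

```python
import numpy as np
from collections import deque

HISTORY_LEN = 10  # number of previous partner positions exposed in the observation

# Inside the environment's reset(): start each episode with a zero-padded history.
partner_history = deque([0] * HISTORY_LEN, maxlen=HISTORY_LEN)

def record_partner_position(pos: int) -> np.ndarray:
    """Append the partner's latest cell index (0 or 1) and return the history observation."""
    partner_history.append(pos)                       # the oldest entry is dropped automatically
    return np.array(partner_history, dtype=np.int64)  # shape (10,), newest position last
```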
Figure 5. Learning Curve of PPO and A2C models in the Simplified Embodied Communication
Game
The PPO and A2C models trained in this environment for 1,000,000 training timesteps.
Unfortunately, no communication occurred, and the models did not converge on the non-
communicative optimal strategy (Figure 5). The fact that a PPO model was able to succeed at finding the optimal strategy in the ECG, but was unable to do so in the SECG, highlights the unpredictable and volatile performance of reinforcement learning algorithms.¹
Conclusion
In conclusion, the ECG remains a vexing task for reinforcement learning agents, and a
stark counterexample to the trend of machine-learning algorithms replicating more and more
human behavior. It sheds light on a unique feature of human learning: our natural propensity
to create, send, recognize and receive linguistic signals. Emergent communication is a very
difficult strategy for a policy-optimizer to find, as it necessitates a significant change in
policy with no immediate associated increase in reward. Rather, communication is entirely
dependent on another agent recognizing, understanding, and acting upon the signal, to
increase both agents’ respective rewards. It is unclear how current RL algorithms could be incentivized to find emergent communicative solutions. If they could be, their performance would improve drastically in many real-world scenarios requiring multi-agent cooperation and coordination. Emergent
communication is an incredibly powerful and unique human tool, and one of the few which
artificial agents still fail to replicate.
1. During an earlier iteration of training, a PPO model was able to find the optimal strategy in the SECG after about 1,000,000 training timesteps. Unfortunately, we lost the data from that session while attempting to fix an issue with our graphing functions.
References
[1] C. Neidle, J. Kegl, D. MacLaughlin, B. Bahan, and R. Lee, The syntax of American Sign Language: functional categories and hierarchical structure. MIT Press, 1999. [Online]. Available: https://digitalcommons.usm.maine.edu/facbooks/476.
[2] T. Scott-Phillips, S. Kirby, and G. Ritchie, “Signalling signalhood and the emergence of communication,” Cognition, vol. 113, pp. 226–233, 2009.
[3] S. Havrylov and I. Titov, “Emergence of language with multi-agent games: Learning to
communicate with sequences of symbols,” Advances in Neural Information Processing
Systems 30, pp. 2149–2159, 2017.
[4] K. Bullard, F. Meier, D. Kiela, J. Pineau, and J. Foerster, “Exploring zero-shot emergent communication in embodied multi-agent populations,” arXiv preprint arXiv:2010.15896, 2020.
[5] A. Lazaridou, A. Peysakhovich, and M. Baroni, “Multi-agent cooperation and the
emergence of (natural) language,” arXiv preprint arXiv:1612.07182, 2016.
[6] A. Lazaridou, K. M. Hermann, K. Tuyls, and S. Clark, “Emergence of linguistic com-
munication from referential games with symbolic pixel input,” International Confer-
ence on Learning Representations, 2018.
[7] P. de Bie, Computational agents in the embodied communication game, 2009.
[8] E. Hughes, A. Gupta, E. Tolstaya, and T. Scott-Phillips, Signalling signalhood in machine learning agents, Abstract for workshop, Machine Learning and the Evolution of Language, 2022. [Online]. Available: https://www.guabhinav.com/docs/10_signalling_signalhood_in_machi.pdf.
[9] M. Towers, A. Kwiatkowski, J. Terry, et al., “Gymnasium: A standard interface for
reinforcement learning environments,” arXiv preprint arXiv:2407.17032, 2024.
[10] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-
baselines3: Reliable reinforcement learning implementations,” Journal of Machine
Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.
org/papers/v22/20-1364.html.
[11] J. K. Terry, B. Black, and A. Hari, “Supersuit: Simple microwrappers for reinforcement
learning environments,” arXiv preprint arXiv:2008.08932, 2020.