The Embodied Communication Game:
A Task for Reinforcement-Learning Agents
Evan McCormick
December 15, 2024
Abstract
An important question in linguistics, psychology, and now computer science is how
communication systems are created and modified by the agents who use them. Evolutionary
linguistics tries to answer this question in the context of human language, while emergent-language research tries to answer it for both humans and artificial agents. One well-known experiment
in emergent language is the Embodied Communication Game (ECG). To succeed in the
game, human players must generate and use a novel system of communication, but they are
not explicitly told this, and are given no explicit channels for communication. Instead, they
must communicate via actions in the game which could have other uses. Most human players
were able to establish a system of communication in the ECG. So far, no machine learning
agent has been able to match human performance in the ECG. For this study, we pitted an
array of modern reinforcement learning agents against a virtual version of the ECG. None
of the agents found a communicative solution to the game. Instead, the most successful
agents settled on the optimal non-communicative strategy. This study highlights humans’
unique propensity to create, send, recognize, and receive signals without explicit instruction.
Reinforcement learning agents struggle to use communication to their advantage, unless they
are explicitly rewarded for doing so.
Related Works
Evolutionary linguistics has itself evolved as a field over the last 50 years, with new
approaches supplanting old ones. Early research was historical analysis: charting the evolu-
tion of human languages throughout history. Later research involved modern case studies,
where researchers observed the creation and evolution of novel languages in isolated modern communities (Neidle, Kegl, MacLaughlin, et al. [1]). More recently, the field has seen a
boom in experimental research, where subjects participate in tasks and games designed to
elicit the creation of novel communication systems. Over time, these controlled experiments
(emergent-language games) have become a popular approach in evolutionary linguistics.
Generally, the rules of an emergent-language game forbid communication via talking, gesturing, etc., but participants are allowed to communicate via their actions in the game. The games are designed such that the only way for participants to achieve the maximum score is to communicate via the limited channels available to them. This usually requires them to
signal their intentions or their game state to their partners via novel symbols or signs. One
such emergent-language game conducted by Scott-Phillips, Kirby, and Ritchie [2] is called
the Embodied Communication Game (ECG), and tests the ability of participants not only
to create an artificial language to communicate their game’s state to their partners, but to
realize that a channel for communication even exists. In the ECG, players can only communicate via movements that are also necessary to navigate the environment. Thus, any movement in the game could be interpreted either as communicative or simply as a move toward the goal. In addition, the game’s instructions give no indication that communication is possible or necessary for success.
The Embodied Communication Game
The purpose of the original ECG was to study how humans could “signal signalhood”: how they could show that they are trying to communicate. The game also studied how humans might recognize an attempt at communication. The ECG was run with 2 human participants, each playing remotely from a computer. On each player’s computer screen were two 2x2 grids, each with a stick figure occupying one of the squares. For the first player,
the grid on the left was colored, so that each square was a different color, while the grid on
the right was grey. For the second player, the grid on the right had colors, while the grid
on the left was grey. Each player could control the movements of the stick figure in their
colored grid, and could see the movements of the other player’s figure on the other grid. The
game was run in short trials, where the goal was to be on the same colored square as the
other player at the end of the trial. In order to succeed at this task above chance rates,
players had to agree upon a color to choose. However, the game was designed such that the
two colored grids would only ever have one color in common, and that color would change
randomly between trials. Thus, the players could not hope to succeed in the game simply by always going to the same color (this method, known as the ‘default color’ strategy, was nevertheless the first step toward success, and is the optimal solution sans communication). The only way to
achieve a 100% success rate was for players to somehow communicate their grid’s colors to
their partner using their actions in the game, and then agree on a color to both go to.
The Evolution of Human Signaling Systems in the ECG
The original ECG was run on twelve pairs of human participants, who each played
the game for roughly 200 rounds with the same partner. Remarkably, seven out of the
twelve pairs achieved performance at levels only possible via communication. Within these
pairs, both participants were able to describe in detail the communication system they had
invented, and the method by which it had been invented. There were two main ways that
communication systems emerged: either the players gradually built up a symbolic system
from simple to more complex grammar (an organic language), or one player quickly realized
the possibility of communication, and formed a complete symbolic system, which the other
player eventually caught onto (a forced language). Of the seven pairs which successfully
communicated, five developed an organic language, while two developed a forced language.
The organic languages each developed in roughly the same way. They went through similar
phases of development, with new symbols being added and new meaning being added to
symbols in roughly the same order by each successful pair. The first step in organic language
development was for the pair to come upon the ‘default color strategy’, wherein both players chose the same color (e.g. blue) whenever it was available. Once one default color was
established, the players then agreed on a secondary color to go to in the event that neither
player had the default color. At some point (usually shortly after the secondary default color was established), one player was faced with a grid which did not contain the default color and, realizing their partner would go for the default color if they had it, started frantically moving around the grid (human players sometimes described this as an attempt to communicate to their partner “I don’t have it!”, while others simply stated they did this out of frustration). The other player then interpreted this frantic movement as a signal: “I don’t have the default color, go for the secondary color!”. Once this first signal was established, it became shorter and more distinct, and came to refer to the secondary color. Then, when the players were faced with a grid containing neither the primary nor the secondary default color, they created a
novel signal to indicate a 3rd color. Over time, these signals decreased in length and each
became associated with one of the square colors. The forced languages involved one player
realizing early on the effectiveness of communicating the color they intended to go to, and
committing to a particular symbolic language signifying their color choice. These languages
actually performed worse than the organic languages, as it sometimes took a very long time
for the communicator’s partner to recognize the significance of their movements.
Emergent Language in Reinforcement Learning Agents
The study of emergent language in reinforcement learning (RL) agents is a growing
field of research in linguistics and computer science. Researchers have successfully taught
RL models to communicate via artificial languages in an explicit communication task. For
example, Havrylov and Titov [3] trained a pair of Long Short-Term Memory (LSTM) models
to communicate the contents of images via a tokenized grammatical language. In their
experiment, one model was tasked with communicating to the other which of a set of images
to choose. This study differs significantly from the ECG, however, in that a clear sender and receiver of communication were established from the start, and a clear channel and form for communication were given. In other words, this study demonstrated the capacity of
RL agents to learn to communicate explicitly via a pre-ordained communication channel.
While many studies have shown the effectiveness of emergent language generation in RL models (Bullard, Meier, Kiela, et al. [4]; Lazaridou, Peysakhovich, and Baroni [5]; Lazaridou, Hermann, Tuyls, et al. [6]), these studies either relied on an explicit communication channel,
or explicitly rewarded their agents for successfully communicating. The ECG tests the ability
of its players to recognize that a channel for communication exists in the first place, and to
then use that channel effectively. This is a difficult yet possible task for human players, but in
the literature reviewed, no study exists demonstrating the ability of RL agents to complete
this task. de Bie [7] used computer agents to replicate the forced communication strategies of some human pairs in the ECG, but with limited success. Hughes, Gupta, Tolstaya, et al. [8] implemented the ECG as a multi-agent reinforcement learning environment in which they trained state-of-the-art (SOTA) reinforcement learning agents, but the agents did not produce communicative strategies. Instead, the reinforcement learning agents settled on the optimal non-communicative ‘default color’ strategy.
Experimental Methods: Testing Versions of the ECG with Reinforcement
Learning Agents
The objective of this study was to design and build versions of the ECG within which
the latest reinforcement learning agents could be tested. To accomplish this, we used Gymnasium (Gym), a Python library currently maintained by the Farama Foundation, which allows users to easily create and run reinforcement learning environments (Towers, Kwiatkowski, Terry,
et al. [9]). The reinforcement learning models were taken from Stable-Baselines3 (SB3), a package containing Python implementations of state-of-the-art reinforcement learning algorithms, designed for compatibility with Gym environments (Raffin, Hill, Gleave, et al. [10]). Finally,
since Gym and SB3 did not directly support multi-agent environments, we used Farama’s
PettingZoo API to convert our Gym environments into multi-agent environments and the
third-party SuperSuit package (SS) to allow our SB3 RL models to train in multi-agent
settings (Terry, Black, and Hari [11]).
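As a rough sketch of how these pieces fit together (package names are the standard PyPI names; the versions shown are assumptions, not necessarily those used in this study):

```python
# Assumed installation: pip install gymnasium stable-baselines3 pettingzoo supersuit
import gymnasium as gym                      # single-agent environment API
from pettingzoo import ParallelEnv           # multi-agent environment API
import supersuit as ss                       # wrappers bridging PettingZoo and SB3
from stable_baselines3 import PPO, A2C, DQN  # RL algorithms tested in this study
```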
The Simple Color Game
Firstly, we designed a simplified version of the Embodied Communication Game called
the Simple Color Game (SCG) to become familiar with building and running custom envi-
ronments and training the RL models on them. The SCG is a single-agent environment in
which the agent is placed within a 2x2 grid with 4 ‘colored’ squares (colors were represented
by a 2x2 array of integers) and is tasked with finding a ’target’ color (integer). The agent’s
observation space included three parts:
1. An int[] of size (2,2) representing the square colors on the 2x2 grid.
2. An int[] of size (2,) representing the agent’s coordinates within the 2x2 grid.
3. An int representing the target color.
The action space was a discrete space of size 4, with each number representing a
movement in a cardinal direction (up, left, down, right). The agent was given a reward
of 1 when the ‘color’ of the square it occupied within the 2x2 grid matched the target
color. We tested three SB3 models in the SCG:
1. Deep Q-Network (DQN) - A Q-learning model in which the Q-table is replaced by a function approximator (i.e. a neural network) used to estimate the Q-value of any given state-action pair (S, A).
2. Proximal Policy Optimization (PPO) - A policy-optimization model in which the size of each update to the policy network is clipped according to how significantly its choice of actions in a given state would change. PPO is designed to prevent the model from making a precipitous change in behavior and ‘falling off a cliff’ (losing its prior performance).
3. Advantage Actor Critic (A2C) - A model that uses two function approximators: an actor, which decides which action to take in a given state, and a critic, which estimates the expected value (effective Q-value) of the action taken. As the model trains, both networks are updated.
Both PPO and A2C were able to score well on the SCG after around 1,000,000 training timesteps, while DQN performed inconsistently and learned the game more slowly (Figure 1). We also tested the models on the Timed Color Game (TCG), a version of the SCG in which the reward for reaching the target square was time-discounted. Again, PPO and A2C significantly out-performed DQN, quickly achieving optimal performance (Figure 2).
From this point on, we stopped testing with DQN models.
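To make the setup concrete, below is a minimal sketch of how an SCG-like environment could be written against the Gymnasium API and trained with an SB3 model. The class name, episode length, and the choice to end the episode on success are illustrative assumptions, not the exact implementation used in this study.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class SimpleColorGame(gym.Env):
    """Illustrative 2x2 Simple Color Game: reach the square whose color matches the target."""

    def __init__(self, n_colors=4, max_steps=10):
        self.n_colors, self.max_steps = n_colors, max_steps
        self.observation_space = spaces.Dict({
            "grid":   spaces.Box(0, n_colors - 1, shape=(2, 2), dtype=np.int64),  # square colors
            "pos":    spaces.Box(0, 1, shape=(2,), dtype=np.int64),               # agent coordinates
            "target": spaces.Discrete(n_colors),                                  # target color
        })
        self.action_space = spaces.Discrete(4)  # up, left, down, right

    def _obs(self):
        return {"grid": self.grid, "pos": self.pos.copy(), "target": self.target}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.grid = self.np_random.permutation(self.n_colors).reshape(2, 2)
        self.pos = np.array([0, 0], dtype=np.int64)
        self.target = int(self.np_random.integers(self.n_colors))
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        moves = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}
        self.pos = np.clip(self.pos + np.array(moves[int(action)]), 0, 1)
        self.steps += 1
        reward = 1.0 if self.grid[tuple(self.pos)] == self.target else 0.0
        terminated = reward > 0                 # assumption: the episode ends once the target is reached
        truncated = self.steps >= self.max_steps
        return self._obs(), reward, terminated, truncated, {}

model = PPO("MultiInputPolicy", SimpleColorGame(), verbose=0)
model.learn(total_timesteps=100_000)  # the study trained for roughly 1,000,000 timesteps
```

An A2C or DQN model can be substituted simply by swapping the SB3 class in the last two lines.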
Figure 1. Learning Curve of PPO, A2C, and DQN models in the Simple Color Game
The Stop on Color Game
After successfully training the A2C and PPO models in the SCG, we designed and
tested a version of the game that required the agent to take a ‘Commit’ action when it reached the square of the target color in order to receive a reward. Thus, in this version of the game, the action space included 5 actions: Up, Left, Right, Down, and Commit.
We trained the PPO and A2C models in this Stop On Color Game (SOCG), resulting in the learning curves shown in Figure 3. Interestingly, PPO quickly improved its mean reward in the SOCG, while A2C struggled to improve.
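Relative to the SCG sketch above, the only substantive change is that the reward is gated behind the Commit action. A hypothetical step() for the SOCG might look like the following (the assumption that committing ends the episode, successful or not, is ours):

```python
COMMIT = 4  # actions 0-3 remain the movement actions; action 4 is the new Commit action
# (and self.action_space becomes spaces.Discrete(5))

def step(self, action):
    action, reward, terminated = int(action), 0.0, False
    if action == COMMIT:
        # The reward is only granted if the agent commits while standing on the target color.
        if self.grid[tuple(self.pos)] == self.target:
            reward = 1.0
        terminated = True  # committing ends the episode whether or not it was correct
    else:
        moves = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}
        self.pos = np.clip(self.pos + np.array(moves[action]), 0, 1)
    self.steps += 1
    truncated = self.steps >= self.max_steps
    return self._obs(), reward, terminated, truncated, {}
```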
Figure 2. Learning Curve of PPO, A2C, and DQN models in the Timed Color Game
The Embodied Communication Game
We then used the PettingZoo library to implement the ECG as originally conceived by Scott-Phillips et al. [2]. The game utilized the ParallelEnv class to function as a simultaneous-action game. The observation space of each agent was as follows:
1. An int[] of size (2,2) representing the square colors of their 2x2 grid.
2. An int[] of size (2,) representing the agent’s coordinates within their 2x2 grid.
3. An int[] of size (2,) representing the coordinates of the other agent within their
respective 2x2 grid.
Each grid contained 4 distinct ‘colors’ (integers chosen from within the range [0, 6]),
so that by the pigeonhole principle, at least one color would be shared between the two grids.
However, it was possible for the grids to share more than one color.
Figure 3. Learning Curve of PPO and A2C models in the Stop On Color Game
We used the SuperSuit package to convert the ParallelEnvs into SB3 VecEnvs (Vectorized Environments), such that
each Env within the VecEnv corresponded to a single agent within the ParallelEnv. This
trick allowed our SB3 models, which are normally only capable of training in single-agent
environments, to train in the ParallelEnvs. This approach has one major limitation: the
models will not find asymmetric strategies, as each agent must follow the same policy.
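A sketch of this wrapping chain, using SuperSuit's standard conversion helpers (ECGParallelEnv is a placeholder name for our ParallelEnv implementation, and the policy class would depend on how the observation space is declared):

```python
import supersuit as ss
from stable_baselines3 import PPO

# ECGParallelEnv is a placeholder for the PettingZoo ParallelEnv implementation of the ECG.
parallel_env = ECGParallelEnv()

# Flatten the two-agent ParallelEnv into a vectorized environment in which each
# sub-environment corresponds to one agent of the ParallelEnv.
vec_env = ss.pettingzoo_env_to_vec_env_v1(parallel_env)

# Expose the result through SB3's VecEnv interface. Because a single policy serves
# every sub-environment, both agents necessarily follow the same (symmetric) policy.
vec_env = ss.concat_vec_envs_v1(vec_env, 1, base_class="stable_baselines3")

model = PPO("MlpPolicy", vec_env, verbose=0)  # or MultiInputPolicy for a Dict observation space
model.learn(total_timesteps=1_000_000)
```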
Optimal Non-Communicative Performance in the ECG
The optimal non-communicative strategy in the ECG is a hierarchical default color
strategy. In this strategy, the agents come to a consensus on a primary color to commit
to whenever it is available to them. Thus, whenever both agents’ grids contain this color,
the agents will succeed at matching on it. This color (e.g. ‘yellow’) becomes the primary
default color. The agents’ success rate can be further increased by choosing 2nd, 3rd, and 4th default colors, each used only when no higher-ranked color is available. The expected reward
of this strategy is the probability of matching on the primary default color (which occurs
whenever both grids contain it), $m(C_1) = P(C_1)^2 = (4/7)^2 \approx 0.32653$, plus the probability of matching on the 2nd color, $m(C_2) = P(\neg C_1)^2 P(C_2)^2 = (3/7)^2 (4/6)^2 \approx 0.08163$, 3rd color, $m(C_3) = P(\neg C_1)^2 P(\neg C_2)^2 P(C_3)^2 = (3/7)^2 (2/6)^2 (4/5)^2 \approx 0.01306$, and 4th color, $m(C_4) = P(\neg C_1)^2 P(\neg C_2)^2 P(\neg C_3)^2 P(C_4)^2 = (3/7)^2 (2/6)^2 (1/5)^2 \cdot 1^2 \approx 0.00082$. No fifth color is necessary for this strategy, as by the pigeonhole principle both grids must share at least 1 of the 4 primary colors. Thus, the expected mean reward of the optimal non-communicative strategy is $\bar{R} = m(C_1) + m(C_2) + m(C_3) + m(C_4) = \frac{517}{1225} \approx 0.42204$. In
our implementation of the ECG, the PPO model successfully converged on a default-color strategy, but did not find a communicative strategy (Figure 4).
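The arithmetic above can be checked mechanically. A small sketch using exact rational arithmetic (the conditional probabilities follow from each grid holding 4 distinct colors drawn from 7 possibilities):

```python
from fractions import Fraction

# P(a grid contains the k-th default color | it lacks defaults 1..k-1) and
# P(a grid lacks the k-th default color | it lacks defaults 1..k-1), for grids
# of 4 distinct colors drawn from 7 possibilities.
contains = [Fraction(4, 7), Fraction(4, 6), Fraction(4, 5), Fraction(4, 4)]
lacks    = [Fraction(3, 7), Fraction(2, 6), Fraction(1, 5)]

def match_prob(k):
    """Probability that both agents match on the k-th default color (1-indexed)."""
    p = Fraction(1)
    for j in range(k - 1):
        p *= lacks[j] ** 2           # both grids lack the higher-ranked defaults
    return p * contains[k - 1] ** 2  # both grids contain (and choose) default k

expected = sum(match_prob(k) for k in range(1, 5))
print(expected, float(expected))  # 517/1225 ≈ 0.42204
```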
Adding Memory to the Models
One feature that is likely necessary, but not sufficient, for a machine-learning agent to achieve a communicative solution to the ECG is memory. PPO and A2C models
are, at their core, function approximators of the form $f(S_i, A_l) \rightarrow A_i$ (where $S_i$, $A_l$, and $A_i \in A_l$ are the current state, the action space, and an action within that action space, respectively).
These models have no way to account for the prior actions of other agents, unless those
actions are somehow encoded in the current state of the environment. Unfortunately, this
version of the ECG has no such way of encoding agents’ previous actions in the current state.
A PPO or A2C model observing that its partner is at the coordinates [0,1] in the current
state has no knowledge of whether the agent got there from [0,0], [1,1], or was in [0,1] the
turn prior. For this reason, we attempted to train a Long Short-Term Memory (LSTM) PPO
model in the ECG in addition to the standard PPO model, to see if the improved memory
could allow it to learn the communicative solution to the ECG. Unfortunately, the model did
not converge on even the non-communicative optimum which PPO was able to find (Figure
4).
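For reference, the recurrent model used here is available in the sb3-contrib package as RecurrentPPO; a minimal usage sketch (default hyperparameters, not necessarily the settings used in this study):

```python
from sb3_contrib import RecurrentPPO

# vec_env: the SuperSuit-wrapped ECG environment from the earlier sketch.
# The LSTM policy carries a hidden state across timesteps, so the model can in
# principle condition on its partner's earlier movements rather than only on
# the partner's current coordinates.
model = RecurrentPPO("MlpLstmPolicy", vec_env, verbose=0)
model.learn(total_timesteps=1_000_000)
```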
Figure 4. Learning Curve of PPO and Recurrent PPO models in the Embodied Communication
Game
Adding Memory to the Game
Finally, we designed a simplified version of the ECG in which the prior location of
each agent was encoded in the current state of the environment. This game was also much
simpler than the ECG. Rather than a 2x2 grid, each agent occupied a 2x1 grid, in which
each cell was one of three possible colors. The grids were designed to have exactly one color
in common. The observation space of each agent contained the following:
1. An int in the range [0,1] representing the agent’s current location.
2. An int[] of size (2,) representing the colors of the agent’s grid.
3. An int[] of size (10,) representing the location of the agent’s partner during the
ten previous turns of the game.
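One way such a history could be maintained inside the environment is with a fixed-length buffer that is zero-padded at the start of each episode; a small sketch (buffer length as described above, padding value and helper name are hypothetical):

```python
import numpy as np
from collections import deque

HISTORY_LEN = 10  # number of previous partner positions exposed in the observation

# Inside the environment's reset(): start each episode with a zero-padded history.
partner_history = deque([0] * HISTORY_LEN, maxlen=HISTORY_LEN)

def record_partner_position(pos: int) -> np.ndarray:
    """Append the partner's latest cell index (0 or 1) and return the history observation."""
    partner_history.append(pos)                       # the oldest entry is dropped automatically
    return np.array(partner_history, dtype=np.int64)  # shape (10,), newest position last
```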
Figure 5. Learning Curve of PPO and A2C models in the Simplified Embodied Communication
Game
The PPO and A2C models trained in this environment for 1,000,000 training timesteps.
Unfortunately, no communication occurred, and the models did not converge on the non-
communicative optimal strategy (Figure 5). The fact that a PPO model was able to succeed at finding the optimal strategy in the ECG, but was unable to do so in the SECG, highlights the unpredictable and volatile performance of reinforcement learning algorithms.¹
Conclusion
In conclusion, the ECG remains a vexing task for reinforcement learning agents, and a
stark counterexample to the trend of machine-learning algorithms replicating more and more
human behavior. It sheds light on a unique feature of human learning: our natural propensity
to create, send, recognize and receive linguistic signals. Emergent communication is a very
difficult strategy for a policy-optimizer to find, as it necessitates a significant change in
policy with no immediate associated increase in reward. Rather, communication is entirely
dependent on another agent recognizing, understanding, and acting upon the signal, to
increase both agents’ respective rewards. It is unclear how current RL algorithms could be incentivized to find emergent communicative solutions. If they could be, their performance would improve drastically in many real-world scenarios requiring multi-agent cooperation and coordination. Emergent
communication is an incredibly powerful and unique human tool, and one of the few which
artificial agents still fail to replicate.
1. During an earlier iteration of training, a PPO model was able to find the optimal strategy in the SECG after about 1,000,000 training timesteps. Unfortunately, we lost the data from that session while attempting to fix an issue with our graphing functions.
References
[1] C. Neidle, J. Kegl, D. MacLaughlin, B. Bahan, and R. Lee, The syntax of American Sign Language: functional categories and hierarchical structure. MIT Press, 1999. [Online]. Available: https://digitalcommons.usm.maine.edu/facbooks/476.
[2] T. Scott-Phillips, S. Kirby, and G. Ritchie, “Signalling signalhood and the emergence of communication,” Cognition, vol. 113, pp. 226–233, 2009.
[3] S. Havrylov and I. Titov, “Emergence of language with multi-agent games: Learning to
communicate with sequences of symbols,” Advances in Neural Information Processing
Systems 30, pp. 2149–2159, 2017.
[4] K. Bullard, F. Meier, D. Kiela, J. Pineau, and J. Foerster, “Exploring zero-shot emergent communication in embodied multi-agent populations,” arXiv preprint arXiv:2010.15896, 2020.
[5] A. Lazaridou, A. Peysakhovich, and M. Baroni, “Multi-agent cooperation and the
emergence of (natural) language,” arXiv preprint arXiv:1612.07182, 2016.
[6] A. Lazaridou, K. M. Hermann, K. Tuyls, and S. Clark, “Emergence of linguistic com-
munication from referential games with symbolic pixel input,” International Confer-
ence on Learning Representations, 2018.
[7] P. de Bie, Computational agents in the embodied communication game, 2009.
[8] E. Hughes, A. Gupta, E. Tolstaya, and T. Scott-Phillips, Signalling signalhood in machine learning agents, Abstract for workshop, Machine Learning and the Evolution of Language, 2022. [Online]. Available: https://www.guabhinav.com/docs/10_signalling_signalhood_in_machi.pdf.
[9] M. Towers, A. Kwiatkowski, J. Terry, et al., “Gymnasium: A standard interface for
reinforcement learning environments,” arXiv preprint arXiv:2407.17032, 2024.
[10] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-
baselines3: Reliable reinforcement learning implementations,” Journal of Machine
Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.
org/papers/v22/20-1364.html.
[11] J. K. Terry, B. Black, and A. Hari, “Supersuit: Simple microwrappers for reinforcement
learning environments,” arXiv preprint arXiv:2008.08932, 2020.