Abstract
Multi-Agent Reinforcement Learning (MARL) has emerged as a key
paradigm for solving complex real-world problems involving multiple
agents interacting in dynamic environments. However, training MARL
models, especially for cooperative reasoning tasks, remains
computationally intensive and sample-inefficient due to nonstationarity, credit assignment, and policy coupling issues.
Conventional policy gradient methods struggle with convergence and
scalability in multi-agent settings. Centralized training frameworks
suffer from bottlenecks and synchronization overheads. Evolutionary
algorithms, while more robust to non-differentiable objectives, are
often too slow when applied in single-node environments. To address
these challenges, we propose Distributed Co-evolutionary Policy
Optimization (DCPO), a hybrid learning framework that distributes
evolutionary computation across multiple nodes. DCPO decomposes
the global policy search into sub-population-based parallel
explorations, with each node evolving a subset of agent policies using
fitness-driven mutation, crossover, and local policy gradient updates. A
global coordinator periodically aggregates top-performing policies to promote convergence toward cooperative behavior. DCPO was evaluated on standard cooperative MARL benchmarks, including StarCraft II micromanagement and the Multi-Agent Particle Environment (MPE).
Compared with established baselines such as MADDPG, QMIX, MAPPO, COMA, and EPOpt, DCPO achieved up to 37% faster convergence, 25% higher final cumulative rewards, and improved generalization to unseen environments.
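To make the co-evolutionary loop described above concrete, the sketch below simulates it in plain NumPy: each "node" evolves its own sub-population of policy parameter vectors through fitness-driven selection, uniform crossover, and Gaussian mutation, and a coordinator periodically reinserts the per-node elites into every sub-population. The toy fitness function, population sizes, and aggregation interval are illustrative assumptions only; the actual method additionally interleaves local policy gradient updates and computes fitness from rollouts in the MARL benchmarks rather than from a synthetic objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an environment rollout: fitness of a joint policy
# parameter vector (higher is better). A real DCPO setup would instead roll out
# the agents' policies in the benchmark environment.
def fitness(theta):
    return -np.sum((theta - 1.0) ** 2)

def evolve_subpopulation(pop, n_elite=4, sigma=0.1):
    """One local generation: select elites, recombine, and mutate."""
    scores = np.array([fitness(p) for p in pop])
    elites = pop[np.argsort(scores)[-n_elite:]]
    children = []
    for _ in range(len(pop) - n_elite):
        a, b = elites[rng.integers(n_elite, size=2)]
        child = np.where(rng.random(a.shape) < 0.5, a, b)   # uniform crossover
        child += sigma * rng.standard_normal(a.shape)       # Gaussian mutation
        children.append(child)
    return np.vstack([elites] + children)

# --- distributed co-evolution, simulated here as a loop over nodes ---
n_nodes, pop_size, dim = 4, 16, 8
subpops = [rng.standard_normal((pop_size, dim)) for _ in range(n_nodes)]

for generation in range(50):
    # Each node evolves its sub-population independently (parallel in practice).
    subpops = [evolve_subpopulation(pop) for pop in subpops]

    # Periodic aggregation: the coordinator gathers each node's best policy
    # and broadcasts those elites back into every sub-population.
    if generation % 10 == 9:
        global_elites = np.vstack(
            [pop[np.argmax([fitness(p) for p in pop])] for pop in subpops])
        subpops = [np.vstack([global_elites, pop[len(global_elites):]])
                   for pop in subpops]

best = max((p for pop in subpops for p in pop), key=fitness)
print("best fitness:", fitness(best))
```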
Authors
A. Rajavel¹, P.T. Kalaivaani²
¹Kamaraj College of Engineering and Technology, India; ²Vivekanandha College of Engineering for Women, India
Keywords
Multi-Agent Reinforcement Learning, Evolutionary Algorithms, Distributed Learning, Policy Optimization, Cooperative Reasoning