Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots. Experiment code is available at https://github.com/JBLanier/pipeline-psro

Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games

The deep policy gradient method has demonstrated promising results in many large-scale games, where the agent learns purely from its own experience. Yet, policy gradient methods with self-play struggle to converge to a Nash Equilibrium (NE) in multi-agent situations. Counterfactual regret minimization (CFR) has a convergence guarantee to a NE in 2-player zero-sum games, but it usually needs domain-specific abstractions to deal with large-scale games. Inheriting merits from both methods, in this paper we extend the actor-critic algorithm framework in deep reinforcement learning to tackle a large-scale 2-player zero-sum imperfect-information game, 1-on-1 Mahjong, whose information set size and game length are much larger than those of poker. The proposed algorithm, named Actor-Critic Hedge (ACH), modifies the policy optimization objective from maximizing the discounted returns to minimizing a type of weighted cumulative counterfactual regret. This modification is achieved by approximating the regret via a deep neural network and minimizing the regret by generating self-play policies using Hedge. ACH is theoretically justified as it is derived from a neural-based weighted CFR, for which we prove the convergence to a NE under certain conditions. Experimental results on the proposed 1-on-1 Mahjong benchmark and benchmarks from the literature demonstrate that ACH outperforms related state-of-the-art methods. Also, the agent obtained

Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game
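
The abstract above says ACH generates self-play policies using Hedge. As a rough illustration only, the following Python sketch shows the generic Hedge (exponential weights) step that turns a vector of cumulative regrets into a policy. The function name `hedge_policy`, the step size `eta`, and the example regret values are assumptions made for this sketch; it is not the paper's neural, weighted variant.

```python
import math


def hedge_policy(cumulative_regrets, eta=1.0):
    """Map cumulative regrets to a policy with the generic Hedge rule.

    Assumed names: `cumulative_regrets` is one number per action and
    `eta` is a positive step size. Both are hypothetical inputs here.
    """
    # Subtract the maximum before exponentiating for numerical stability;
    # this shift cancels out in the normalization.
    shift = max(cumulative_regrets)
    weights = [math.exp(eta * (r - shift)) for r in cumulative_regrets]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical regrets for three actions.
policy = hedge_policy([1.0, 0.0, -1.0], eta=0.5)
```

Actions with larger cumulative regret receive more probability, and every action keeps a strictly positive probability, which is the main behavioral difference from regret matching.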

Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability. We show that the variance of the estimated regret of ESCHER is orders of magnitude lower than DREAM and other baselines. We then show that ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- on a number of games and the difference becomes dramatic as game size increases. In the very large game of dark chess, ESCHER is able to beat DREAM and NFSP in a head-to-head competition over $90\%$ of the time.

ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret

Game theory has by now found numerous applications in various fields, including economics, industry, jurisprudence, and artificial intelligence, where each player only cares about its own interest in a noncooperative or cooperative manner, but without obvious malice to other players. However, in many practical applications, such as poker, chess, evader pursuing, drug interdiction, coast guard, cyber-security, and national defense, players often have apparently adversarial stances, that is, selfish actions of each player inevitably or intentionally inflict loss or wreak havoc on other players. Along this line, this paper provides a systematic survey on three main game models widely employed in adversarial games, i.e., zero-sum normal-form and extensive-form games, Stackelberg (security) games, and zero-sum differential games, from an array of perspectives, including basic knowledge of game models, (approximate) equilibrium concepts, problem classifications, research frontiers, (approximate) optimal strategy seeking techniques, prevailing algorithms, and practical applications. Finally, promising future research directions are also discussed for relevant adversarial games.

A Survey of Decision Making in Adversarial Games

Counterfactual Regret Minimization (CFR) has achieved many fascinating results in solving large-scale Imperfect Information Games (IIGs). Neural CFR is one of the promising techniques that can effectively reduce computation and memory consumption by generalizing decision information between similar states. However, current neural CFR algorithms have to approximate the cumulative regrets with neural networks. This usually results in high-variance approximation because regrets from different iterations can be very different. The problem can be even worse when importance sampling is used, which is required for model-free algorithms. In this paper, a new CFR variant, Recursive CFR, is proposed, in which the cumulative regrets are recovered by Recursive Substitute Values (RSVs) that are recursively defined and independently calculated between iterations. It is proved that the new Recursive CFR converges to a Nash equilibrium. Based on Recursive CFR, a new model-free neural CFR algorithm with bootstrap learning is proposed. Experimental results show that the new algorithm can match the state-of-the-art neural CFR algorithms but with less training overhead.

Model-free Neural Counterfactual Regret Minimization with Bootstrap Learning

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior: Advantage Regret-Matching Actor-Critic (ARMAC). Rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the application of importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially-observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.

The Advantage Regret-Matching Actor-Critic
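
The ARMAC abstract above describes combining predicted conditional advantages with regret matching to produce a new policy. The Python sketch below shows only the standard regret-matching step on a plain list of numbers; in ARMAC these values would come from a learned estimator, which is not modeled here. The function name `regret_matching` and the example values are assumptions for illustration.

```python
def regret_matching(regrets):
    """Return a policy proportional to the positive part of each regret.

    `regrets` is a hypothetical list with one (estimated) regret or
    advantage per action.
    """
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        # No action has positive regret: fall back to the uniform policy.
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


# Two actions share all of the positive regret, so they split the probability.
policy = regret_matching([2.0, -1.0, 2.0])
```

Unlike an exponential-weights update, regret matching assigns zero probability to any action whose regret is not positive.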

A fundamental ability of an intelligent web-based agent is seeking out and acquiring new information. Internet search engines reliably find the correct vicinity but the top results may be a few links away from the desired target. A complementary approach is navigation via hyperlinks, employing a policy that comprehends local content and selects a link that moves it closer to the target. In this paper, we show that behavioral cloning of randomly sampled trajectories is sufficient to learn an effective link selection policy. We demonstrate the approach on a graph version of Wikipedia with 38M nodes and 387M edges. The model is able to efficiently navigate between nodes 5 and 20 steps apart 96% and 92% of the time, respectively. We then use the resulting embeddings and policy in downstream fact verification and question answering tasks where, in combination with basic TF-IDF search and ranking methods, they achieve results competitive with state-of-the-art methods.

Learning to Navigate Wikipedia by Taking Random Walks

Progress in fields of machine learning and adversarial planning has benefited significantly from benchmark domains, from checkers and the classic UCI data sets to Go and Diplomacy. In sequential decision-making, agent evaluation has largely been restricted to few interactions against experts, with the aim to reach some desired level of performance (e.g. beating a human professional player). We propose a benchmark for multiagent learning based on repeated play of the simple game Rock, Paper, Scissors along with a population of forty-three tournament entries, some of which are intentionally sub-optimal. We describe metrics to measure the quality of agents based both on average returns and exploitability. We then show that several RL, online learning, and language model approaches can learn good counter-strategies and generalize well, but ultimately lose to the top-performing bots, creating an opportunity for research in multiagent learning.

Population-based Evaluation in Repeated Rock-Paper-Scissors as a Benchmark for Multiagent Reinforcement Learning
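
The benchmark abstract above mentions exploitability as one of its agent-quality metrics. As a minimal sketch, assuming the usual single-shot definition (the payoff a best-responding opponent can secure against a fixed mixed strategy), exploitability in one round of Rock, Paper, Scissors can be computed as follows. The metric used in the paper for repeated play may be defined differently; the names `PAYOFF` and `exploitability` are illustrative.

```python
# Payoff to the player choosing the row action against the column action.
# Actions are ordered rock, paper, scissors; a win is +1 and a loss is -1.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def exploitability(strategy):
    """Best-response payoff an opponent obtains against `strategy`.

    `strategy` is a list of three probabilities (rock, paper, scissors).
    """
    return max(
        sum(PAYOFF[reply][mine] * prob for mine, prob in enumerate(strategy))
        for reply in range(3)
    )


always_rock = exploitability([1.0, 0.0, 0.0])
uniform = exploitability([1 / 3, 1 / 3, 1 / 3])
```

Under this definition the uniform strategy, which is the Nash equilibrium of the single-shot game, cannot be exploited, while always playing rock loses every round to an opponent that plays paper.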

John Schultz

Papers

The Advantage Regret-Matching Actor-Critic

Learning to Navigate Wikipedia by Taking Random Walks

Population-based Evaluation in Repeated Rock-Paper-Scissors as a Benchmark for Multiagent Reinforcement Learning