Key Points

1. The paper discusses the use of large language models (LLMs) as backbones for general-purpose agents and proposes competitive games as an evaluation method that incorporates multiple players and makes the evaluation environment dynamic.

2. The authors studied the rationality, strategic reasoning ability, and instruction-following capability of various LLMs in competitive games, finding that most LLMs play rationally but fall short of the fully rational strategies prescribed by Nash Equilibria (NEs).

3. Certain LLMs, such as GPT-4, demonstrated greater rationality and strategic reasoning ability than other models when game history was available, converging faster to NE strategies (see the note after this list) and achieving higher winning rates.

4. The paper introduces an Economics Arena (EconArena), a dynamic simulation for testing LLMs' abilities in competitive games, specifically beauty contests and private-value second-price auctions, together with quantitative metrics for measuring their performance.

5. It highlights LLM-based agents' potential for artificial general intelligence and their applications in various real-world scenarios, such as human behavior simulation, industrial automation, software development, and scientific research.

6. The paper reviews previous studies on the ability of LLMs to replicate human behavior in game-theoretic settings, including their performance in strategic games such as the Prisoner's Dilemma and the Dictator game, as well as exploratory applications to Werewolf games.

7. The study addresses the limitations of evaluating LLMs in static environments and proposes single-round competitive games within EconArena that evaluate LLMs on their rationality and strategic reasoning ability.

8. Several metrics for evaluating LLMs in EconArena are listed, including changes in payoffs over successive games, changes in strategies across game configurations and player configurations, and changes in strategies depending on whether game history is available.

9. Experimental findings show that LLM behavior in EconArena does not always align with the maximally rational strategies in beauty contests and second-price auctions, yielding insights into the models' rationality and strategic reasoning capabilities.
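
For reference, the equilibrium benchmarks referred to above follow the standard textbook results; the sketch below assumes the usual formulations of the two games and may differ from the paper's exact parameters.

```latex
% Standard equilibrium benchmarks (assumed formulations, not the paper's exact setup).
\textbf{p-beauty contest} (guesses $x_i \in [0,100]$, target $p\,\bar{x}$ with $p<1$):
iterated elimination of dominated strategies gives $x^{(k)} \le 100\,p^{k} \to 0$,
so the unique Nash Equilibrium is $x_i^{*} = 0$ for every player $i$.

\textbf{Private-value second-price auction} (value $v_i$, bid $b_i$): truthful bidding
$b_i^{*} = v_i$ is a weakly dominant strategy and serves as the NE benchmark.
```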

Summary

Introduction and Methodology
The paper proposes evaluating large language models (LLMs) with simulated number-based competitive games, such as beauty contests and private-value second-price auctions. The authors introduce an Economics Arena (EconArena) to run these evaluations and show through experiments that certain LLMs adapt to game configurations and opponent strategies, exhibit stronger strategic reasoning ability, and display in-context learning capacity. The paper also discusses the models' degree of rationality and their propensity for rule-breaking behavior, and provides a set of quantitative metrics for measuring LLMs' performance.
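
To make the game mechanics concrete, here is a minimal Python sketch of a single beauty-contest round under the standard textbook formulation; the guess range, the target multiplier p, and the agent guesses are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of one round of a p-beauty contest (assumed setup, not the paper's).
# Guesses lie in [0, 100]; the target is p * mean of all guesses; the closest guess wins.

def beauty_contest_round(guesses: dict[str, float], p: float = 2 / 3) -> tuple[str, float]:
    """Return the winning player and the target value for one round."""
    target = p * sum(guesses.values()) / len(guesses)
    winner = min(guesses, key=lambda player: abs(guesses[player] - target))
    return winner, target

if __name__ == "__main__":
    # Hypothetical guesses from three agents; with p < 1, fully rational play
    # converges to the Nash Equilibrium in which every guess is 0.
    round_guesses = {"agent_a": 50.0, "agent_b": 33.0, "agent_c": 10.0}
    winner, target = beauty_contest_round(round_guesses, p=2 / 3)
    print(f"target = {target:.2f}, winner = {winner}")
```

Repeating such rounds and tracking how each model's guesses move toward (or away from) the equilibrium value gives a simple picture of the adaptability the paper attributes to stronger models.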

Evaluation of LLMs' Capabilities
The paper focuses on evaluating LLMs' rationality, strategic reasoning ability, and instruction-following capability in dynamic environments. It provides metrics for assessing how payoffs change over successive games, how strategies change across game configurations and player configurations, and how strategies change depending on whether game history is available.
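
As an illustration of how such behavior can be scored, the sketch below implements a private-value second-price auction and a simple distance-from-truthful-bidding measure. The auction rule (highest bidder wins and pays the second-highest bid, with truthful bidding as the weakly dominant strategy) is the standard one; the deviation measure and the example values are illustrative assumptions, not the metrics defined in the paper.

```python
# Sketch of a private-value second-price auction plus an illustrative
# deviation-from-benchmark score (assumed, not the paper's exact metric).

def second_price_auction(bids: dict[str, float]) -> tuple[str, float]:
    """The highest bidder wins and pays the second-highest bid."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    return ranked[0], bids[ranked[1]]

def truthful_bidding_deviation(bids: dict[str, float], values: dict[str, float]) -> dict[str, float]:
    """Normalized distance of each bid from the truthful (weakly dominant) bid b* = v."""
    return {p: abs(bids[p] - values[p]) / max(values[p], 1e-9) for p in bids}

if __name__ == "__main__":
    # Hypothetical private values and bids for three agents.
    values = {"agent_a": 80.0, "agent_b": 60.0, "agent_c": 40.0}
    bids = {"agent_a": 75.0, "agent_b": 60.0, "agent_c": 55.0}
    winner, price = second_price_auction(bids)
    print(f"winner = {winner}, price paid = {price}")
    print("deviation from truthful bidding:", truthful_bidding_deviation(bids, values))
```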

Experimental Findings
In experiments with nine LLMs playing beauty contests and second-price auctions, the study finds that certain models demonstrate stronger rationality, adaptability to game configurations, and strategic reasoning ability. However, it also observes non-maximally rational behaviors, suggesting room for improvement in the models' degree of rationality.

Contribution to the Literature
The paper contributes to the existing literature by proposing a novel evaluation method for LLMs that allows their performance to be compared directly, and by providing a set of quantitative metrics for measuring their capabilities.

Reference: https://arxiv.org/abs/2309.16039