Key Points

1. The paper introduces Codex, a GPT language model fine-tuned on publicly available code from GitHub, and evaluates its Python code-writing capabilities. The model solves 28.8% of the problems on HumanEval, a new evaluation set for synthesizing programs from docstrings, while GPT-3 solves 0% and GPT-J solves 11.4%.

2. The study finds that repeated sampling from the model is an effective strategy for producing working solutions to difficult prompts: with 100 samples per problem, the model solves 70.2% of the problems. The researchers also investigate the model's limitations, including difficulty with long chains of operations and with binding operations to variables.

3. The paper discusses the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

4. The evaluation framework focuses on the task of generating standalone Python functions from docstrings and evaluates the correctness of code samples automatically through unit tests. The researchers have created a dataset of 164 original programming problems with unit tests, assessing language comprehension, algorithms, and simple mathematics.

5. When drawing many samples per problem, Codex generates at least one correct function for 77.5% of the problems. Because fully evaluating every sample is not always practical, the study shows that accurate samples can instead be selected via heuristic ranking: the sample with the highest mean log-probability passes the unit tests on 44.5% of the problems.

6. The paper includes a comparison of pass rates of Codex models vs. GPT-3 on the HumanEval dataset as a function of model size, demonstrating that Codex significantly outperforms GPT-3, and describes the potential benefits of generating multiple samples per problem.

7. The study evaluates the models for functional correctness, noting that match-based metrics are poorly suited to code generation and emphasizing that correctness should instead be judged by whether generated samples pass a set of unit tests.

8. The paper discusses the creation of a dataset of synthetic problems assembled from basic building blocks to illustrate model performance degradation as docstring length increases.

9. The researchers conducted a hazard analysis focusing on identifying risk factors with potential to cause harm and discussed legal implications, economic and labor market impacts, and other challenges associated with using code generation models. They also highlighted safety and security concerns associated with the deployment of such models.
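Several of the points above report pass rates from repeated sampling (pass@k: the probability that at least one of k samples solves a problem). A numerically stable estimator for pass@k, given n samples of which c pass the unit tests, can be sketched in plain Python (the function name `pass_at_k` is illustrative):

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n total samples, c of which
    are correct: 1 - C(n - c, k) / C(n, k), computed in product form
    to avoid overflow with large n.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every k-subset
        # must contain at least one correct sample.
        return 1.0
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail
```

For example, with n = 200 samples and c = 2 correct, pass@1 is 2/200 = 0.01, while pass@100 is far higher; this gap is why the paper reports large gains from drawing many samples per problem.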

Summary

The paper introduces Codex, a GPT language model fine-tuned on publicly available code from GitHub, and evaluates its capabilities in generating Python code from docstrings. The study measures the performance of Codex models in solving programming problems and compares them with GPT-3 and GPT-J. The evaluation shows that Codex solves a significant percentage of the problems, especially when multiple samples are generated from the model and evaluated.
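The generate-and-evaluate loop described above can be sketched with a toy harness. The candidate strings below stand in for model samples; a real harness, such as the one the paper describes, would sandbox execution and enforce timeouts, which this sketch omits:

```python
def passes_unit_tests(candidate_src: str, test_src: str) -> bool:
    """Run a candidate function definition against unit tests.

    Returns True if the tests raise no exception. A production
    harness would isolate and time-limit this execution.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        exec(test_src, namespace)        # run assertion-based tests
        return True
    except Exception:
        return False

# Toy "samples" standing in for completions drawn from the model.
candidates = [
    "def incr(x):\n    return x - 1",   # incorrect sample
    "def incr(x):\n    return x + 1",   # correct sample
]
tests = "assert incr(1) == 2\nassert incr(-1) == 0"

working = [c for c in candidates if passes_unit_tests(c, tests)]
```

Drawing many candidates and keeping any that pass is the repeated-sampling strategy behind the 70.2% figure; when unit tests are unavailable, the paper instead ranks samples by mean token log-probability.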

It also discusses the potential broader impacts of deploying powerful code generation technologies, addressing safety, security, economic implications, and the limitations of these code-generating models. The paper provides insights into the risks, potential uses, and future opportunities associated with large language models trained on code, examines the legal implications and potential hazards of using such models, and suggests policy measures to mitigate their risks and negative impacts.

Overall, the paper explores the capabilities of large language models in the domain of code generation and evaluates the potential impacts and limitations of these models.

Reference: https://arxiv.org/abs/2107.03374