Key Points

1. The paper introduces LLM360, an initiative advocating for fully open-sourced Large Language Models (LLMs) to enhance transparency and reproducibility in AI research.

2. It addresses challenges in LLM research, such as the need for data provenance, reproducibility, and open collaboration, and aims to resolve them by releasing all training and model details for LLMs.

3. LLM360 releases two 7B parameter LLMs, Amber and CrystalCoder, pre-trained from scratch, and provides training code, data, intermediate checkpoints, and analyses, promoting open and collaborative AI research.

4. It compares LLM360 with related projects such as Pythia, emphasizing LLM360's focus on transparency and reproducibility and its release of up-to-date models.

5. The paper outlines the release artifacts of LLM360, including training datasets, code and configurations, model checkpoints, and training metrics, promoting the comprehensive and transparent release of LLMs.

6. It presents the landscape of open-source LLMs, noting varying levels of transparency and reproducibility in the release artifacts over time, with recent LLMs exhibiting progressively less disclosure of important pretraining details.

7. It discusses the pre-training datasets, architectures, hyperparameters, and infrastructure for the Amber and CrystalCoder models, along with their benchmark results and evaluations.

8. The paper presents an analysis of memorization in LLMs, showing the distribution of memorization scores and how memorized sequences correlate across model checkpoints (see the sketch after this list).

9. It highlights potential use cases of LLM360, lessons learned from initial model training, and the initiative's commitment to the responsible use of LLMs, emphasizing the importance of mitigating potential risks and inviting community collaboration.
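
To make key point 8 concrete, below is a minimal sketch of how such a memorization score could be computed, assuming the score is the fraction of greedily generated tokens that match the ground-truth continuation after a fixed-length prompt. The model id, prompt length, and continuation length shown here are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Sketch of a memorization-score check (assumed definition: fraction of
# greedily generated tokens matching the ground-truth continuation).
# MODEL_ID, PROMPT_LEN, and CONT_LEN are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "LLM360/Amber"   # assumed Hugging Face repo id
PROMPT_LEN = 32             # assumed prompt length in tokens
CONT_LEN = 32               # assumed continuation length in tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def memorization_score(text: str) -> float:
    """Score one training sequence: 1.0 means the model reproduces the
    ground-truth continuation exactly under greedy decoding."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.size(0) < PROMPT_LEN + CONT_LEN:
        raise ValueError("sequence too short for this prompt/continuation split")
    prompt = ids[:PROMPT_LEN].unsqueeze(0)
    target = ids[PROMPT_LEN:PROMPT_LEN + CONT_LEN]
    with torch.no_grad():
        out = model.generate(
            prompt,
            max_new_tokens=CONT_LEN,
            min_new_tokens=CONT_LEN,  # force a full-length continuation
            do_sample=False,          # greedy decoding
        )
    generated = out[0, PROMPT_LEN:PROMPT_LEN + CONT_LEN]
    return (generated == target).float().mean().item()
```

Repeating this over a sample of training sequences at several checkpoints yields the kind of score distributions and cross-checkpoint correlations the paper analyzes.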

Summary

The paper introduces LLM360, an initiative for fully open-sourcing Large Language Models (LLMs). It addresses the challenges associated with the increasing popularity and capability of LLMs and the limited visibility and access to their training, fine-tuning, and evaluation processes. The LLM360 framework advocates for the open-sourcing of all training code and data, model checkpoints, and intermediate results to promote transparency and reproducibility in LLM research. As part of the LLM360 initiative, two 7B parameter LLMs, Amber and CrystalCoder, have been released, including their training code, data, intermediate checkpoints, and analyses. The authors plan to continually release larger and more powerful LLMs in the future as part of the LLM360 project.

The paper addresses specific challenges in LLM research, such as data provenance, reproducibility, and open collaboration, and provides a summary of the technical details of Amber and CrystalCoder, including their training datasets, model architectures, hyperparameters, and pre-training procedures. Additionally, benchmark results for these models on various datasets are provided.

The paper also discusses the LLM360 framework's approach to releasing artifacts, including data chunks, model checkpoints, metrics, and analyses, to support future research and collaboration in the open-source community. The authors outline potential use cases of LLM360, such as experimental studies of model training, building domain-specific LLMs, and providing model initializations for algorithmic frameworks, while highlighting the importance of responsible usage of LLMs.
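
As an illustration of these use cases, the following sketch loads one intermediate pre-training checkpoint from the Hugging Face Hub, for example to probe training dynamics or to initialize domain-specific continued pre-training. The repository id and the revision naming scheme are assumptions; consult the LLM360 release for the actual identifiers.

```python
# Sketch of pulling an intermediate pre-training checkpoint for analysis or
# as an initialization. REPO_ID and REVISION are assumptions about how the
# checkpoints are hosted, not confirmed identifiers.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "LLM360/Amber"   # assumed repo id for the Amber release
REVISION = "ckpt_100"      # hypothetical name of one intermediate checkpoint

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision=REVISION)

# From here the checkpoint can be evaluated on benchmarks, compared against
# neighboring checkpoints, or used as the starting point for further training.
```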

The paper concludes with a commitment to developing the LLM360 framework and invites the community to contribute to this initiative.

Reference: https://arxiv.org/abs/2312.06550