Key Points

1. LLaMA introduces a collection of large foundation language models ranging from 7B to 65B parameters, trained on trillions of tokens drawn exclusively from publicly available datasets, without relying on proprietary or inaccessible data.

2. LLaMA-13B outperforms GPT-3 on most benchmarks, while LLaMA-65B competes with the best existing large language models such as Chinchilla-70B and PaLM-540B.

3. The research focuses on achieving the best possible performance at various inference budgets by training on more tokens than is typically used, so that smaller models can match larger ones while being cheaper to run at inference.

4. LLaMA's training dataset is a mixture of publicly available sources: English CommonCrawl, C4, GitHub, Wikipedia, ArXiv, Stack Exchange, and two book corpora, totaling approximately 1.4T tokens after tokenization (a sketch of the sampling mixture follows this list).

5. The models are based on the transformer architecture with several improvements: pre-normalization of sub-layer inputs, the SwiGLU activation function, and rotary positional embeddings. They are trained with the AdamW optimizer and a cosine learning-rate schedule (see the architecture sketch after this list).

6. Several optimizations improve training efficiency, including an efficient implementation of causal multi-head attention that avoids materializing the attention weights, and checkpointing to limit the activations recomputed during the backward pass (an analogous PyTorch sketch follows this list).

7. The models are evaluated on a range of benchmarks, including common sense reasoning, closed-book question answering, reading comprehension, mathematical reasoning, and code generation, using both free-form generation and multiple-choice formats.

8. LLaMA models are also evaluated for biases and toxicity: they perform comparably to other large models on these benchmarks but still exhibit biases related to religion, age, gender, and other categories, and can generate toxic content and misinformation.

9. Training the LLaMA models is estimated to have consumed a substantial amount of energy, resulting in a significant carbon footprint, which is an important environmental consideration for large-scale model training.
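
As a small illustration of point 4, the sketch below shows how training batches could be drawn from the different sources with fixed sampling weights. The weights are roughly the proportions reported in the paper; the code itself is a hypothetical illustration, not the authors' data pipeline.

```python
import random

# Approximate sampling proportions of the pretraining mixture (as reported in the paper);
# the loader logic below is purely illustrative.
SAMPLING_WEIGHTS = {
    "CommonCrawl": 0.670,
    "C4": 0.150,
    "GitHub": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.020,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mixture weights."""
    sources = list(SAMPLING_WEIGHTS)
    return rng.choices(sources, weights=[SAMPLING_WEIGHTS[s] for s in sources], k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in SAMPLING_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # counts come out roughly proportional to the weights above
```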
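
For point 5, here is a minimal PyTorch sketch of the three architectural components mentioned there: RMSNorm-based pre-normalization, the SwiGLU feed-forward block, and rotary position embeddings, plus an AdamW optimizer with a cosine schedule. Dimensions and hyperparameter values are illustrative assumptions, and the rotary variant shown is the common "rotate-half" formulation rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization: each sub-layer input is RMS-normalized and rescaled."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x W1) gated by (x W3), projected back with W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to queries or keys of shape
    (batch, seq, heads, head_dim), using the rotate-half formulation."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Illustrative usage: rotate query/key tensors, then set up AdamW with a cosine schedule.
q = rotary_embedding(torch.randn(2, 16, 8, 64))        # (batch, seq, heads, head_dim)
ffn = SwiGLU(dim=512, hidden_dim=1376)                  # hidden size ~ 2/3 * 4 * dim (illustrative)
opt = torch.optim.AdamW(ffn.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)
```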
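
For point 6, the sketch below illustrates the two efficiency ideas in standard PyTorch terms: a fused causal attention kernel that avoids materializing the full attention matrix (torch.nn.functional.scaled_dot_product_attention) and activation checkpointing (torch.utils.checkpoint). The paper relies on its own optimized implementation, which additionally saves expensive activations to limit recomputation; this is only an analogous sketch.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def causal_attention(q, k, v):
    """Fused causal attention: the full (seq x seq) score matrix is never
    materialized, and masked positions are skipped by the kernel."""
    # q, k, v: (batch, heads, seq, head_dim)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

class Block(torch.nn.Module):
    """Toy attention block whose activations are recomputed during backward."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads = heads
        self.qkv = torch.nn.Linear(dim, 3 * dim, bias=False)
        self.proj = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, s, self.heads, d // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        out = causal_attention(q, k, v).transpose(1, 2).reshape(b, s, d)
        return x + self.proj(out)

block = Block(dim=512, heads=8)
x = torch.randn(2, 128, 512, requires_grad=True)
# Checkpointing: intermediate activations are not stored; they are recomputed in
# the backward pass, trading extra compute for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```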

Summary

The paper introduces LLaMA, a series of foundation language models ranging from 7B to 65B parameters, trained on trillions of tokens using publicly available datasets exclusively. LLaMA-13B outperforms GPT-3 on most benchmarks, while LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. The study challenges the assumption that more parameters necessarily lead to better performance, showing that, for a given compute budget, smaller models trained on more data can reach better performance while also being cheaper to run at inference. The training dataset is a mixture of publicly available sources, covering a diverse set of domains, and contains roughly 1.4T tokens after tokenization.

The paper describes the modifications made to the transformer architecture and the training method, and details the performance of LLaMA models on various benchmarks, including zero-shot and few-shot tasks, natural language questions, mathematical reasoning, and code generation. The models achieve competitive results in domains such as common sense reasoning, reading comprehension, and mathematical reasoning. The paper also evaluates biases and toxicity, demonstrating that LLaMA models capture societal biases related to gender and occupation and highlighting their potential to generate toxic content.

Furthermore, the study assesses the carbon footprint of training the LLaMA models, reporting estimates of the energy consumed and the carbon dioxide emitted. The paper concludes that releasing these models to the research community should accelerate the development of large language models and support efforts to improve their robustness and mitigate known issues. The authors plan to release larger models trained on larger pretraining corpora in the future, as they have observed consistent improvements in performance with scale. Finally, the paper acknowledges the contributions and support of the individuals involved in developing and evaluating the LLaMA models.
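
As a rough illustration of how such an estimate is formed, the snippet below follows the accounting described in the paper: total energy is GPU-hours times average per-GPU power draw times the data-center PUE, and emissions are energy times a carbon-intensity factor. The input numbers here are illustrative placeholders, not the paper's reported figures.

```python
# Illustrative inputs -- not the paper's reported figures.
gpu_hours = 1_000_000      # total GPU-hours for a hypothetical training run
gpu_power_w = 400          # average power draw per GPU, in watts (assumed)
pue = 1.1                  # data-center power usage effectiveness
carbon_intensity = 0.385   # kg CO2eq per kWh (US national average factor)

energy_kwh = gpu_hours * gpu_power_w / 1000 * pue
emissions_tco2eq = energy_kwh * carbon_intensity / 1000

print(f"energy: {energy_kwh:,.0f} kWh, emissions: {emissions_tco2eq:,.1f} tCO2eq")
```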

Reference: https://arxiv.org/abs/2302.13971