Key Points

1. The paper introduces INDUS, a suite of large language models (LLMs) specifically tailored for Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics domains. These models are trained using curated scientific corpora drawn from diverse sources and include an encoder model, a contrastive-learning-based general text embedding model, and smaller versions of these models created using knowledge distillation techniques.

2. The research demonstrates that the INDUS models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on new benchmark tasks as well as existing ones in the domains of interest.

3. The paper motivates the work with the limitations of existing large language models trained on open-domain corpora, arguing that domain-specific models are needed to achieve improved accuracy on in-domain natural language processing tasks.

4. The authors detail the meticulous curation of scientific corpora from diverse sources, including the SAO/NASA Astrophysics Data System (ADS), PubMed Central (PMC), the American Meteorological Society (AMS), the American Geophysical Union (AGU), and NASA's Common Metadata Repository (CMR). They also describe the development of a custom byte-pair encoding (BPE) tokenizer and compare it with the RoBERTa tokenizer.

5. The paper explains the training of encoder-only models using the curated scientific corpora and the INDUS BPE tokenizer, as well as the creation of sentence-embedding models through fine-tuning the encoder-only models with a contrastive learning objective.

6. New benchmark datasets were introduced to evaluate the language understanding capabilities of the proposed models in these multidisciplinary fields: CLIMATE-CHANGE NER for entity recognition, NASA-QA for extractive question answering, and NASA-IR for information retrieval.

7. Performance evaluation on various benchmark tasks demonstrated the superiority of the INDUS models over existing models of similar size: RoBERTa-BASE and SciBERT among base-sized models, and TinyBERT and MiniLM among the smaller models. Both INDUS-BASE and INDUS-SMALL showed strong performance across tasks.

8. The paper describes how each of these benchmark datasets was constructed and annotated, addressing the scarcity of evaluation resources tailored to such diverse, multidisciplinary fields.

9. The authors conclude by highlighting the effectiveness of the custom tokenizer, in-domain data, and knowledge distillation in producing high-quality encoder models, along with smaller variants suited to applications with latency or resource constraints. They also announce the release of the models and benchmark datasets for the scientific community on Hugging Face.

Summary

The research paper "INDUS: Effective and Efficient Language Models for Scientific Applications" introduces INDUS, a suite of large language models tailored for Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics domains. The authors highlight the importance of domain-specific language models and the potential limitations of general-purpose language models on specialized tasks. They aim to address these limitations by developing INDUS, which consists of an encoder model trained using domain-specific vocabulary and corpora, a contrastive-learning-based general text embedding model, and smaller versions of these models using knowledge distillation techniques.

To train the suite of models, the authors curated scientific corpora from a variety of sources, including the SAO/NASA Astrophysics Data System, PubMed Central, the American Meteorological Society, the American Geophysical Union, and NASA's Common Metadata Repository. They also developed a custom tokenizer, INDUS BPE, built with the byte-pair encoding algorithm.
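As a rough illustration of this step, the sketch below trains a byte-level BPE tokenizer with the Hugging Face tokenizers library. The corpus file, vocabulary size, and output directory are illustrative assumptions, not the paper's actual configuration.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a domain-specific corpus.
# "science_corpus.txt" and the hyperparameters are placeholders; the
# paper trains INDUS BPE on its full curated multi-source corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["science_corpus.txt"],
    vocab_size=50265,  # RoBERTa-sized vocabulary, for comparability
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("indus_bpe", exist_ok=True)
tokenizer.save_model("indus_bpe")  # writes vocab.json and merges.txt

# A domain term should now split into fewer subwords than it would
# under the general-purpose RoBERTa tokenizer.
print(tokenizer.encode("magnetospheric reconnection").tokens)
```

Fewer subword splits for domain terminology means shorter input sequences and, typically, better downstream representations for in-domain text, which is the motivation for comparing against the RoBERTa tokenizer.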

New Benchmark Datasets
The paper introduces three new scientific benchmark datasets: CLIMATE-CHANGE NER, NASA-QA, and NASA-IR. CLIMATE-CHANGE NER is an entity-recognition task over annotated abstracts drawn from climate-related literature; NASA-QA is an extractive question-answering task built from Earth science papers; and NASA-IR is a domain-specific information-retrieval benchmark.
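To make the extractive-QA setting concrete, here is a minimal sketch using the Hugging Face transformers pipeline. The model path, question, and context are placeholders rather than actual NASA-QA content.

```python
from transformers import pipeline

# Extractive QA in the style of NASA-QA: the model selects an answer
# span from the supplied paragraph rather than generating free text.
# The model path is a placeholder; substitute a QA-fine-tuned checkpoint.
qa = pipeline("question-answering", model="path/to/indus-qa-checkpoint")

result = qa(
    question="Which satellite instrument provided the aerosol measurements?",
    context=(
        "The aerosol optical depth over the region was retrieved from "
        "MODIS observations collected between 2001 and 2010."
    ),
)
print(result["answer"], result["score"])  # an answer span plus a confidence score
```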

Model Performance Evaluation
The INDUS models were evaluated against existing general-purpose and domain-specific encoders on a range of tasks in the targeted scientific domains. The results showed that they outperformed general-purpose encoders such as RoBERTa and existing domain-specific encoders such as SciBERT on both the new benchmarks and existing benchmark tasks in these domains.

Training Details and Performance Comparison
The paper also discusses specific training details, including the contrastive learning objective used for the sentence-embedding models and the knowledge distillation procedure used for the smaller models. It compares the INDUS models to baselines on CLIMATE-CHANGE NER, NASA-QA, and NASA-IR, showing significant improvements across these scientific benchmarks.
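For the sentence-embedding training, a minimal sketch of the standard in-batch-negatives (InfoNCE-style) contrastive loss is shown below, assuming paired query and document embeddings from the encoder. The temperature and batch size are illustrative, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives contrastive loss for text-embedding training.

    Each query's positive document sits at the same batch index; every
    other document in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                 # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Illustrative usage with random vectors standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```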

Smaller Models and Potential Applications
Additionally, the paper details the creation and training of smaller, more efficient models through knowledge distillation, and announces the release of the models and benchmark datasets on the Hugging Face platform for the benefit of the scientific community. The results demonstrated the effectiveness of the custom tokenizer and in-domain data for training high-quality encoder and sentence-embedding models. The paper concludes by highlighting potential applications of the INDUS models for research organizations and enterprises working across Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics.
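As a simplified illustration of the distillation idea, a generic soft-target loss trains the student to match the teacher's temperature-softened output distribution. This is a sketch only; the paper's actual recipe (e.g., any attention- or layer-matching terms) may be more involved.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-target knowledge distillation via KL divergence."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

# Illustrative shapes: a batch of 4 predictions over a 50k-token vocabulary.
loss = distillation_loss(torch.randn(4, 50265), torch.randn(4, 50265))
print(loss.item())
```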

Reference: https://arxiv.org/abs/2405.10725