Key Points

1. The paper introduces the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI’s ChatGPT or Google’s PaLM-2. The attack recovers the entire projection matrix of OpenAI’s ada and babbage language models, confirming for the first time that these black-box models have hidden dimensions of 1024 and 2048, respectively.

2. The attack departs from prior approaches that reconstruct a model bottom-up, starting from the input layer. Instead, it operates top-down and directly extracts the model’s last layer (a sketch of the underlying rank argument appears after this list).

3. The attack is useful because it reveals the width of the transformer model, which often correlates with its total parameter count. It also slightly reduces the degree to which the model is a complete "black box," information that may aid future attacks.

4. The paper confirms that certain popular large language models have hidden dimensions of 1024 and 2048 and estimates that retrieving the entire projection matrix of the gpt-3.5-turbo model would cost under $2,000 in queries.

5. The attack is effective and efficient, and it is applicable to production models whose APIs expose logprobs or a "logit bias."

6. The authors followed responsible disclosure, sharing the attack with all services vulnerable to it before publication. OpenAI and Google have both modified their APIs to introduce mitigations and defenses.

7. The paper presents attack variants tailored to APIs with varying capabilities, along with motivations for performing model-extraction attacks efficiently.

8. The paper discusses potential defenses, such as removing logit bias from the API, adding noise to the output logits, rate-limiting logit-bias queries, and making architectural changes to the model.

9. The paper concludes by noting that although these models' weights and internal details are not publicly accessible, the models themselves are exposed through APIs, which motivates studying whether model-stealing attacks are practical against them.
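
For intuition behind points 2 and 3, the paper's top-down extraction rests on a simple rank argument; the notation below is illustrative rather than quoted from the paper. The final logits are a linear function of the last hidden state:

$$\ell(p) = W\,g(p), \qquad W \in \mathbb{R}^{|V| \times h},\; g(p) \in \mathbb{R}^{h},$$

so stacking logit vectors for many prompts gives a matrix whose rank is bounded by the hidden dimension:

$$Q = \bigl[\,\ell(p_1)\ \cdots\ \ell(p_n)\,\bigr] = W\,\bigl[\,g(p_1)\ \cdots\ g(p_n)\,\bigr] \;\Rightarrow\; \operatorname{rank}(Q) \le h.$$

With n > h queries, the number of significant singular values of Q reveals h, and the column space of Q recovers W up to an invertible h-by-h transformation.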

Summary

Model-Stealing Attack on Production Language Models
In a study titled "Stealing Part of a Production Language Model," the authors investigate the potential for adversaries to extract model weights by making queries to the API of a language model. The focus is on production language models such as OpenAI's ChatGPT and Google's PaLM-2, and the study introduces the first model-stealing attack that extracts precise, nontrivial information from black-box production language models. The attack is capable of recovering the embedding projection layer of a transformer model, given typical API access. By making targeted queries to the model's API, the attack can extract the model's embedding dimension or its final weight matrix.
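
As a rough illustration of the dimension-extraction step, the sketch below assumes a hypothetical helper query_logits(prompt) that returns a full logit vector per query; production APIs only expose a handful of logprobs plus a logit-bias parameter, which the paper's actual attacks work around. This is not the authors' code, only a minimal sketch of the idea.

```python
import numpy as np

def estimate_hidden_dim(query_logits, vocab_size, n_queries=4096, tol=1e-3):
    """Estimate a model's hidden dimension from observed logit vectors.

    Every logit vector equals W @ (final hidden state), so it lies in an
    h-dimensional subspace of R^vocab_size. Stacking more than h of them
    and counting the significant singular values reveals h.
    """
    # Collect one full logit vector per distinct prompt (hypothetical API).
    Q = np.stack([query_logits(f"prompt {i}") for i in range(n_queries)], axis=1)
    assert Q.shape == (vocab_size, n_queries)

    # Singular values past the hidden dimension drop to noise level.
    s = np.linalg.svd(Q, compute_uv=False)
    return int(np.sum(s > tol * s[0]))
```

The same stacked logit matrix also yields the projection matrix itself, up to an invertible linear transformation, which is the "part" of the model the title refers to.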

The study confirms for the first time that black-box models like OpenAI's ada and babbage have hidden dimensions of 1024 and 2048, respectively. Additionally, the study recovers the exact hidden dimension size of the gpt-3.5-turbo model and estimates the cost to recover its entire projection matrix. The authors present potential defenses and mitigations against such attacks, including the removal of logit bias from the API, the addition of noise to the output logits, the implementation of rate limits on logit-bias queries, and more.
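
To make the noise-addition defense concrete, here is a minimal server-side sketch (our illustration, not any vendor's implementation): perturbing the logits before converting them to logprobs limits how precisely an attacker can reconstruct the projection matrix, at some cost to output fidelity.

```python
import numpy as np

def noisy_logprobs(logits, sigma=0.1, rng=None):
    """Return log-probabilities computed from Gaussian-perturbed logits.

    Noise of standard deviation sigma caps the precision of any weights
    reconstructed from the returned values; larger sigma gives stronger
    protection but noisier API output.
    """
    rng = rng or np.random.default_rng()
    noisy = logits + rng.normal(scale=sigma, size=logits.shape)
    # Numerically stable log-softmax.
    shifted = noisy - noisy.max()
    return shifted - np.log(np.exp(shifted).sum())
```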

The study also demonstrates the attack in practice on five different black-box models and confirms that it successfully steals parts of each. The authors discuss how such attacks could be prevented or mitigated and outline future directions for improving on the attack.
The results of the study suggest that, although potential defenses exist, model providers should reevaluate their security measures and remain vigilant against emerging threats.

Reference: https://arxiv.org/abs/2403.066...