Performance and Behavior of GPT-3.5 and GPT-4 on Different Tasks

The article evaluates the performance and behavior of the GPT-3.5 and GPT-4 language models on four tasks: solving math problems, answering sensitive/dangerous questions, generating code, and visual reasoning. The study finds that the performance and behavior of both models can drift substantially between releases. For example, GPT-4 (March 2023) identified prime numbers with high accuracy, but the June 2023 version's accuracy on the same task dropped sharply. GPT-3.5 moved in the opposite direction: its June 2023 version solved these math problems much better than its March 2023 version.
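
As an illustration, the kind of measurement behind the prime-number result can be reproduced with a short script: ask the model a yes/no question for each number and score it against a deterministic primality check. The sketch below is a minimal version of such an evaluation; `query_model`, the prompt wording, and the yes/no parsing are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch of a prime-identification accuracy evaluation.
# `query_model` is a hypothetical stand-in for a chat-completion API call.

def is_prime(n: int) -> bool:
    """Deterministic ground-truth primality check (trial division)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

def prime_accuracy(numbers: list[int]) -> float:
    """Fraction of numbers the model classifies correctly."""
    correct = 0
    for n in numbers:
        answer = query_model(f"Is {n} a prime number? Answer yes or no.")
        predicted_prime = "yes" in answer.lower()
        correct += predicted_prime == is_prime(n)
    return correct / len(numbers)
```

Running the same script against the March and June model snapshots is what makes the accuracy drop directly comparable across versions.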

Changes in Answering Sensitive Questions

In terms of answering sensitive questions, both GPT-3.5 and GPT-4 changed behavior between March and June. GPT-4 became less willing to answer sensitive questions in June than in March, while GPT-3.5 answered a larger share of them in June. At the same time, both models' June versions gave terser refusals, offering less explanation when declining to answer.
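
The headline metric here is an answer rate: the fraction of sensitive prompts that receive a direct answer rather than a refusal. A minimal sketch follows, assuming refusals can be detected with simple phrase matching; `REFUSAL_MARKERS` and `query_model` are illustrative assumptions, not the paper's actual classifier.

```python
# Minimal sketch of an answer-rate measurement over sensitive questions.
# Phrase-matching refusal detection is a crude heuristic for illustration.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

def answer_rate(questions: list[str]) -> float:
    """Fraction of questions answered directly rather than refused."""
    answered = 0
    for question in questions:
        reply = query_model(question).lower()
        if not any(marker in reply for marker in REFUSAL_MARKERS):
            answered += 1
    return answered / len(questions)
```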

Code Generation Capability of the Models

The study also examined the models' code generation capability. It found that the percentage of generated code that was directly executable decreased for both GPT-4 and GPT-3.5 from March to June. Both models' June versions also became more verbose in their code generation, adding extra non-code text around the snippets, which works against direct executability. A minimal executability check in the spirit of this evaluation is sketched at the end of this section.

Visual Reasoning and Overall Findings

For the visual reasoning tasks, the performance of both models improved slightly from March to June. Even so, there were individual queries where the models performed worse in June than in March, indicating the need for fine-grained monitoring of their behavior.

Overall, the study highlights the need for continuous monitoring and evaluation of language models like GPT-3.5 and GPT-4, as their behavior can change significantly over time. Such monitoring is important for ensuring the quality and reliability of these models in real-world applications.
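
As promised above, here is a minimal sketch of a "directly executable" check for generated Python snippets. It assumes the model may wrap code in markdown fences and that each snippet is run as a standalone script; note that executing untrusted model output is unsafe outside a sandbox, so the subprocess call with a timeout below is only an illustration.

```python
# Minimal sketch: what fraction of generated snippets run without error?
import subprocess
import sys

def strip_markdown_fences(text: str) -> str:
    """Remove ``` fence lines, which would break execution as-is."""
    lines = [ln for ln in text.strip().splitlines()
             if not ln.strip().startswith("```")]
    return "\n".join(lines)

def is_directly_executable(snippet: str, timeout: float = 10.0) -> bool:
    """True if the snippet runs to completion with exit code 0."""
    code = strip_markdown_fences(snippet)
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def executable_fraction(snippets: list[str]) -> float:
    """Fraction of generated snippets that are directly executable."""
    return sum(is_directly_executable(s) for s in snippets) / len(snippets)
```

Stripping the fences before running is what separates "the code itself is broken" from "the surrounding formatting made the output non-executable", which is exactly the distinction the verbosity finding turns on.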

Reference: https://arxiv.org/abs/2307.090...