Key Points

1. This study evaluates the performance of the ChatGPT variants GPT-3.5 and GPT-4 against student work on university-level physics coding assignments written in Python.

2. Students averaged 91.9%, while the highest-performing AI category, GPT-4 with prompt engineering, scored 81.1%, so student work outperformed every AI submission.

3. Blinded markers identified the authorship of submissions with an average accuracy of 85.3%, suggesting that AI-generated work is often detectable by human evaluators.

4. The study focuses on the potential impact of AI on university coding courses and aims to assess whether coding assignments remain a valid and reliable measure of student performance.

5. The assignments were structured as weekly online Jupyter notebooks containing marked tasks, and submissions from both students and AI were evaluated by independent markers.

6. Prompt engineering significantly improved scores for both GPT-4 and GPT-3.5, indicating clear benefits to careful prompt design.

7. The study found that GPT-4 showed strict superiority over GPT-3.5, and that the latest LLMs have not yet surpassed human proficiency in physics coding assignments.

8. Evaluation revealed that plots generated by LLMs are distinguishable from student-created ones, underscoring the potential for human markers to identify AI-generated content.

9. These findings prompt a reevaluation of how AI performance is measured and of the role of human collaboration in harnessing AI's full potential, while the study also considers whether AI could surpass traditional teaching methods in coding education.

Summary

Performance comparison of ChatGPT variants with and without prompt engineering
The study evaluates the performance of the ChatGPT variants GPT-3.5 and GPT-4, with and without prompt engineering, against student work in a university-level physics coding course. A total of 50 student and 50 AI-generated submissions were compared, and blinded markers judged the authorship of each submission. The students averaged 91.9% and outperformed the highest-performing AI submission, GPT-4 with prompt engineering, which scored 81.1%. Prompt engineering significantly improved scores for both GPT-4 and GPT-3.5. Markers identified authorship with an average accuracy of 85.3% when categorizing work as AI- or human-authored, and 92.1% of the work they categorized as 'Definitely Human' was indeed human-authored. The findings suggest that AI-generated work closely approaches the quality of university students' work but is often detectable by human evaluators.
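
To make the reported benefit of prompt engineering concrete, the sketch below contrasts a bare task prompt with a more structured, engineered one. The projectile task and both prompt texts are illustrative assumptions, not the actual assignment or prompts used in the paper.

```python
# Illustrative sketch only: the projectile task and both prompts are
# assumptions for demonstration, not the assignment text or the prompts
# used in the study.

BARE_PROMPT = (
    "Write Python code that plots the trajectory of a projectile "
    "launched at 45 degrees with an initial speed of 20 m/s."
)

ENGINEERED_PROMPT = (
    "You are completing a university physics coding assignment in a "
    "Jupyter notebook.\n"
    "Task: plot the trajectory of a projectile launched at 45 degrees "
    "with an initial speed of 20 m/s.\n"
    "Requirements:\n"
    "- Use numpy and matplotlib.\n"
    "- Label both axes with the physical quantity and SI unit.\n"
    "- Add a descriptive title and a legend.\n"
    "- Comment the code to explain the physics at each step."
)

if __name__ == "__main__":
    # Either string would be sent to the chat model; the study reports
    # higher marks when the more structured version is used.
    for name, prompt in (("bare", BARE_PROMPT), ("engineered", ENGINEERED_PROMPT)):
        print(f"--- {name} prompt ---\n{prompt}\n")
```

The engineered version adds role, context, and explicit output requirements, which is the kind of structure associated with the higher AI scores reported above.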

Impact of ChatGPT on a 10-week physics coding course at Durham University
The study explores the impact of AI, specifically ChatGPT, on a 10-week physics coding course at Durham University. The coding assignments involved creating clear, well-labeled plots that elucidate the underlying physics of a scenario, and 14 such plots were evaluated against a specific marking scheme for both AI- and student-authored submissions. GPT-4 with prompt engineering scored the highest among the AI categories. Blinded markers accurately identified the authorship of the submissions, suggesting that human markers can successfully distinguish AI-generated content from student-created work.
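
As an illustration of the kind of clear, well-labeled plot such a marking scheme rewards, a minimal Python sketch follows. The damped harmonic oscillator scenario, parameter values, and labels are assumptions chosen for demonstration; this is not one of the 14 assessed plots.

```python
# Minimal sketch of a clearly labeled physics plot in a Jupyter-style
# workflow. The damped oscillator scenario and parameter values are
# assumed for illustration, not taken from the study's assignments.
import numpy as np
import matplotlib.pyplot as plt

# Physical parameters (illustrative values)
omega0 = 2.0 * np.pi          # natural angular frequency [rad/s]
gamma = 0.3                   # damping rate [1/s]
x0 = 1.0                      # initial displacement [m]

# Underdamped solution, neglecting the small phase offset from the
# initial-velocity condition.
omega_d = np.sqrt(omega0**2 - gamma**2)               # damped frequency [rad/s]
t = np.linspace(0.0, 10.0, 1000)                      # time [s]
envelope = x0 * np.exp(-gamma * t)                    # exponential decay envelope
x = envelope * np.cos(omega_d * t)                    # displacement [m]

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.plot(t, x, label="displacement x(t)")
ax.plot(t, envelope, "--", color="grey", label="decay envelope")
ax.plot(t, -envelope, "--", color="grey")

# Axis labels, units, title, and legend: the features a marking scheme
# for clear, well-labeled plots typically rewards.
ax.set_xlabel("time t [s]")
ax.set_ylabel("displacement x [m]")
ax.set_title("Underdamped harmonic oscillator")
ax.legend()
ax.grid(alpha=0.3)

fig.tight_layout()
plt.show()
```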

Importance of prompt engineering and human intervention in assessing AI performance
The study highlights the importance of prompt engineering in improving AI performance and of accounting for human intervention when evaluating AI effectiveness. It also points out that while the latest large language models (LLMs) have not surpassed human proficiency in physics coding assignments, GPT-4 shows strict superiority over GPT-3.5 and prompt engineering significantly enhances performance. The findings prompt a reevaluation of how AI performance is measured and of the role of human collaboration in harnessing AI's full potential.

Reference: https://arxiv.org/abs/2403.16977