By Anna Freeman
Recently, social media platforms such as X (formerly Twitter) and Reddit have been abuzz with anecdotal accusations that ChatGPT, OpenAI’s groundbreaking AI chatbot, has been performing worse across a variety of tasks. Frustrated users claim that the model, now an indispensable aide for writing, editing, planning, and instruction, is producing consistently less accurate and effective output than it did in the past. This has sparked concern and skepticism among those who were once awed by the model’s capabilities.
In response to this disillusionment, OpenAI’s VP of Product, Peter Welinder, took to X to assure users that, “No, we have not made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one.” Welinder instead argued that the perceived decline is a result of frequent users simply noticing issues that they did not notice before.
This has been the state of the debate for the past year: users consistently report declining performance, and OpenAI pushes back, attributing the perception to heavier usage surfacing existing flaws and pointing to its commitment to continuous, linear improvement. However, a recent study by researchers from Stanford and Berkeley provides evidence to support user claims of varying and declining performance in the model. The study, which has yet to be peer reviewed, tracks the performance of GPT-3.5 (OpenAI’s current free version of the model) and GPT-4 (the most recent version, available only through a paid subscription) on a number of tasks between March 2023 and June 2023.
We have reviewed the study and summarized three of the task areas in which ChatGPT demonstrated declining performance, highlighting applications where caution about its output is warranted.
In June 2023, GPT-4 exhibited a noteworthy decline in accuracy across two basic math tasks compared to its performance three months earlier. Task 1 involved asking the model to determine whether a given integer was prime or composite. This task was chosen as a benchmark because it is easy for humans to understand but still requires reasoning. Unexpectedly, GPT-4’s accuracy on this task dropped significantly, from 84.0% in March to 51.4% in June. The June version of GPT-4 also demonstrated a stronger bias toward labeling a given integer as composite, identifying composite numbers correctly at a much higher rate than prime numbers.
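For readers unfamiliar with the task, the check itself is easy to state in code. Below is a minimal trial-division sketch of the kind of primality test the benchmark asks the model to reason through; the example integers are ours, not the study’s prompts.

```python
def is_prime(n: int) -> bool:
    """Return True if n is prime, False if it is composite (or less than 2)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:  # only divisors up to sqrt(n) need checking
        if n % divisor == 0:
            return False
        divisor += 2
    return True

print(is_prime(97))  # True: 97 is prime
print(is_prime(91))  # False: 91 = 7 x 13
```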
Task 2 involved asking the model to count the number of “happy” numbers in a given interval. A number qualifies as happy if repeatedly summing the squares of its digits eventually produces 1. This task is used alongside prime-number identification because it demands a quantitative response rather than the binary response prompted by Task 1. GPT-4’s performance declined here as well, from an 83.6% accuracy rate in March to a 35.2% accuracy rate in June. Together, these benchmarks illustrate GPT-4’s declining performance on basic math tasks that are both binary and quantitative in nature.
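The happy-number rule is just as easy to express directly. The sketch below is our own illustration of the check and the interval count described above, not the study’s prompt or evaluation code.

```python
def is_happy(n: int) -> bool:
    """Return True if repeatedly summing the squares of n's digits reaches 1."""
    seen = set()
    while n != 1 and n not in seen:  # stop once we reach 1 or enter a cycle
        seen.add(n)
        n = sum(int(d) ** 2 for d in str(n))
    return n == 1

def count_happy(lo: int, hi: int) -> int:
    """Count the happy numbers in the inclusive interval [lo, hi]."""
    return sum(is_happy(n) for n in range(lo, hi + 1))

print(is_happy(19))         # True: 1+81=82 -> 64+4=68 -> 36+64=100 -> 1
print(count_happy(1, 100))  # 20
```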
Importantly, verbosity, which refers to the number of characters in a generated response, also changed substantially from March to June. GPT-4’s Task 2 responses to the same prompt dropped from an average verbosity of 2,163.5 characters in March to just 10.0 in June. This indicates that, over this period, the model began producing answers with significantly less accompanying context and explanation, or none at all.
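Verbosity is the simplest of the study’s metrics to make concrete: a character count over the generated text. The snippet below is our own illustration, and the two example responses are hypothetical.

```python
def verbosity(response: str) -> int:
    """Verbosity as the study defines it: the number of characters in a response."""
    return len(response)

# Hypothetical responses: an explained answer versus a bare numeric answer.
explained = "There are 4 happy numbers in this interval. Here is the reasoning: ..."
bare = "4"
print(verbosity(explained), verbosity(bare))  # 70 1
```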
Writing and revising code is one of the most popular applications of ChatGPT, especially for students and professionals. The study evaluated performance on coding tasks using a metric of direct executability, which counts an output as successful only if it accomplishes the directed task and runs ‘as generated’, without error. In both GPT-3.5 and GPT-4, the number of directly executable generations dropped significantly from March to June. While over 50% of GPT-4’s responses qualified as directly executable in March, only 10% did in June.
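As a rough illustration of what such a metric involves, the sketch below runs a generated Python snippet exactly as produced and reports whether it exits cleanly. This is our own stand-in for the idea, not a reproduction of the study’s evaluation harness.

```python
import os
import subprocess
import sys
import tempfile

def is_directly_executable(generated_code: str, timeout: float = 10.0) -> bool:
    """Run a model-generated Python snippet as-is and report whether it exits
    without error -- a rough stand-in for a 'direct executability' check."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Plain code runs as generated; code wrapped in explanatory text does not.
print(is_directly_executable("print('hello')"))                     # True
print(is_directly_executable("Here is the code:\nprint('hello')"))  # False: the label is a syntax error
```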
A significant increase in verbosity observed from March to June led the researchers to hypothesize that this shift might be due to an increase in labeling and comments within the generations. It may be that the June versions provide better explanations of the output at the cost of lower rates of direct executability. If so, ChatGPT’s coding capabilities have become significantly less useful for individuals who want the model to produce code that runs without additional human alteration.
Essentially, the model seems to have shifted from execution-oriented to instruction-oriented generations, which might benefit those who have a baseline of coding knowledge but leaves behind those who do not. Because of this shift, users without substantial prior knowledge of the programming language they are working in should exercise increased caution when using the model for coding tasks.
The study also found divergent trends between GPT-3.5 and GPT-4 in their willingness to answer questions deemed sensitive or opinion-based. Such questions should be avoided for safe and reasonable use of the tool, which does not possess the ethical, context-informed reasoning skills of humans. While GPT-4 was 16% less likely to answer a sensitive question in June than it was in March, GPT-3.5 was 6% more likely to do so. A similar pattern was observed for opinion-based questions, where GPT-4’s response rate dropped by 75.5% and GPT-3.5’s response rate increased by about 2% in June. In short, GPT-4 became more conservative and avoidant of sensitive or opinion-based questions, while GPT-3.5 became slightly more likely to respond to sensitive topics and offer opinions. Furthermore, for the opinion-based questions that GPT-3.5 did answer, 27% of the opinions generated changed from March to June. GPT-4 demonstrated negligibly low rates of opinion change, likely because it answered so few opinion questions.
While these specific findings align with Welinder’s claim of improvement with each version, they do not address the ethical implications of placing the safer version of the model behind a paywall (the subscription price of GPT-4), or of not being explicit about the increased harms to which users of the free GPT-3.5 version are vulnerable. Users of GPT-3.5 should exercise extreme caution with prompts that could lead to sensitive or opinion-based generations, given the model’s declining conservatism.
While the study effectively identifies where shifts and declines in performance are occurring, it does not delve deeply into why these changes are happening. One likely explanation is model drift: the phenomenon in which machine learning models such as ChatGPT degrade in performance over time as they are exposed to new input that they were not initially trained to handle.
This is an unavoidable problem for highly general machine learning models like ChatGPT, which are used in wide-ranging and constantly changing contexts. Such models are guaranteed to encounter input data they were not designed to handle, and over time this input erodes their accuracy. For example, if ChatGPT receives a significant amount of input containing inaccurate or misaligned information about basic math, the base knowledge of math it constructed during training is degraded, and the model might, as a result, produce less consistently accurate answers to basic math questions.
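To make the idea concrete, one simple way teams watch for drift is to score successive model snapshots against a fixed benchmark and flag any drop beyond a tolerance. The sketch below is a hypothetical illustration; only the March and June accuracy figures are taken from the study’s prime-number task.

```python
def flag_drift(history: dict[str, float], tolerance: float = 0.05) -> list[str]:
    """Flag any snapshot whose benchmark accuracy fell more than `tolerance`
    below the previous snapshot -- a simple drift alarm on a fixed test set."""
    snapshots = list(history.items())
    flagged = []
    for (_, prev_acc), (name, acc) in zip(snapshots, snapshots[1:]):
        if prev_acc - acc > tolerance:
            flagged.append(name)
    return flagged

# Accuracy of each snapshot on the same prime-identification benchmark.
history = {"2023-03": 0.840, "2023-06": 0.514}
print(flag_drift(history))  # ['2023-06']
```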
Furthermore, the ideal of linear development is complicated by the degree of complexity and integration in machine learning models. Often, attempts to fix one part of a system can lead to unintended and unpredictable consequences elsewhere. This explains why even the most well-intentioned and detail-oriented model updates can end up performing worse in some aspects.
At the development level, model drift is a highly complex and challenging issue. Identifying drift requires controlled testing over time, and addressing it requires retraining massive models, which is expensive and time-consuming. Though varying performance is an inevitable part of the development process, users must continue to voice their concerns about declining performance and advocate for transparency. Such accountability plays a critical role in the development of safe and human-centered AI.
As users, the most important thing we can do is stay informed about, and critical of, the processes through which these models generate information. Exercise caution and discernment in deciding which tasks to hand to tools like ChatGPT. Research the strengths and weaknesses of the various models, and, for now, just Google which numbers are prime.