Researchers from Stanford University and UC Berkeley recently conducted a study of how the performance of OpenAI’s ChatGPT models has changed over time. ChatGPT is a generative AI chatbot whose underlying models OpenAI updates over time, with user data and feedback among the inputs. The study compared GPT-3.5, the model behind ChatGPT, and GPT-4, the model behind ChatGPT Plus and Bing Chat, on four tasks: solving math problems, answering sensitive questions, generating code, and visual reasoning.
The results were surprising. GPT-4, considered OpenAI’s most advanced language model, performed markedly worse in June than in March on several tasks. On math problems, its accuracy dropped from 97.6% to 2.4%. For example, when asked whether 17077 is a prime number and to explain its reasoning step by step, the June version of GPT-4 gave the wrong answer and offered no explanation. GPT-3.5, in contrast, improved on the same question: it answered incorrectly in March but correctly in June.
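The prime-number question has a definite answer that is easy to check mechanically. As a minimal sketch (the function name is_prime is illustrative and does not come from the study), trial division up to the square root confirms that 17077 is prime:

```python
import math

def is_prime(n: int) -> bool:
    """Return True if n is prime, using trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    # Only odd candidate divisors up to floor(sqrt(n)) need checking.
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True
```

Since the square root of 17077 is just under 131, the loop only has to try odd divisors up to 130 before concluding that none divides the number.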
GPT-4’s code generation also declined. Using a new dataset, the researchers measured how often each model’s generated code was directly executable, meaning it could be run as generated. For GPT-4, that share dropped from 52% in March to 10% in June. The June output included extra quotes that made the code not executable, whereas the March output ran as generated.
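To illustrate how extra wrapping characters can make otherwise valid code fail, here is a rough sketch. It rests on two assumptions: that the extra quotes are Markdown-style fence markers wrapped around the code (the article does not say which characters were added), and that a syntax check is a fair stand-in for "directly executable". The helper names are hypothetical and do not reproduce the researchers' evaluation harness.

```python
# A Markdown code fence marker, built from single characters for readability.
FENCE = "`" * 3

def is_directly_executable(source: str) -> bool:
    """Return True if the text parses as Python exactly as given."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def strip_wrapping(source: str) -> str:
    """Drop a leading and trailing fence line, if present (hypothetical cleanup)."""
    lines = source.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].startswith(FENCE):
        lines = lines[:-1]
    return "\n".join(lines)

# The same function, once as plain code and once wrapped in fence markers.
plain = "def add(a, b):\n    return a + b\n"
wrapped = f"{FENCE}python\n{plain}{FENCE}\n"

print(is_directly_executable(plain))                    # True
print(is_directly_executable(wrapped))                  # False
print(is_directly_executable(strip_wrapping(wrapped)))  # True
```

The check only verifies that the text parses. It says nothing about whether the code is correct, so a full evaluation would also need to run the code against test cases.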
GPT-4 also answered far fewer sensitive questions in June. Given 100 sensitive queries, it answered 5% of them in June, down from 21% in March. GPT-3.5 moved in the opposite direction, answering 8% of the questions in June compared with 2% in March.
These findings raise concerns about the quality of GPT-4 and the training process behind it. Companies and individuals relying on GPT-3.5 and GPT-4 should regularly check that the models still produce accurate responses, as the study shows that their performance can fluctuate and does not always improve. Until more is known about why GPT-4’s quality declined and how the models are trained, users may want to weigh alternatives in light of these results.
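One simple way to run such a regular check is to score each model version against a fixed set of questions with known answers and flag any large drop. The sketch below assumes that setup. The function names, the sample answers, and the 0.05 tolerance are all invented for illustration and are not taken from the study.

```python
def accuracy(answers, expected):
    """Fraction of answers that match the expected labels."""
    if len(answers) != len(expected):
        raise ValueError("answers and expected must have the same length")
    if not expected:
        raise ValueError("need at least one test case")
    correct = sum(a == e for a, e in zip(answers, expected))
    return correct / len(expected)

def has_regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a drop larger than `tolerance` (an arbitrary threshold for this sketch)."""
    return baseline - current > tolerance

# Hypothetical answers from two snapshots of the same model on a fixed test set.
expected = ["yes", "no", "yes", "yes"]
march = ["yes", "no", "yes", "no"]
june = ["no", "no", "no", "no"]

baseline = accuracy(march, expected)  # 0.75
current = accuracy(june, expected)    # 0.25
print(has_regressed(baseline, current))  # True
```

Keeping the test set fixed between runs matters here: any change in score then reflects a change in the model rather than in the questions.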
In conclusion, the study highlights the need for ongoing evaluation and scrutiny of AI models like ChatGPT to ensure their reliability and accuracy.