Artificial intelligence (AI) has gained significant popularity in recent months, but its use in education has sparked controversy. One particular concern is students’ use of generative AI tools, such as ChatGPT, to complete assignments like essays or coding exercises. Professors take varying stances on the technology in their classrooms: some allow it, some forbid it, and others rely on GPT detectors to scrutinize students’ work. However, a peer-reviewed paper recently published in the journal Patterns suggests that these detectors may not be as reliable as previously thought.
The researchers tested the performance of seven commonly used GPT detectors on two sets of human-written essays: 91 essays written by Chinese speakers for the Test of English as a Foreign Language (TOEFL) and 88 essays written by U.S. eighth-graders, obtained from the Hewlett Foundation’s Automated Student Assessment Prize (ASAP). The detectors classified the U.S. student essays accurately, but falsely labeled an average of 61% of the TOEFL essays as AI-generated. One detector flagged 97.8% of the TOEFL essays as AI-generated.
Furthermore, the research revealed that these GPT detectors are not as effective at identifying AI-generated text as advertised. Many detectors claim 99% accuracy without providing evidence to support the figure. When the researchers generated essays using ChatGPT, the detectors identified only 70% of them as AI-generated. Simply prompting ChatGPT to “elevate the provided text by employing literary language” reworded the essays enough to drop the detection rate to 3%; in other words, the detectors misclassified those AI-generated essays as human-written 97% of the time.
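To make that rewriting step concrete, here is a rough sketch of how such a prompt could be issued programmatically. It assumes the openai Python SDK (v1.x) with an API key set in the environment; the model name and the essay variable are illustrative placeholders, not the study’s exact setup.

```python
# Illustrative sketch only: reword a draft with the prompt quoted in the study.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment;
# "gpt-3.5-turbo" is a placeholder model choice.
from openai import OpenAI

client = OpenAI()

essay = "..."  # an AI-generated draft to be reworded

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user",
         "content": "Elevate the provided text by employing literary language:\n\n" + essay},
    ],
)

print(response.choices[0].message.content)
```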
Senior author James Zou from Stanford University cautioned against relying too heavily on these detectors, stating, “Our current recommendation is that we should be extremely careful about and maybe try to avoid using these detectors as much as possible.” The researchers attributed the errors to the detectors’ preference for complex language and their penalization of the simpler word choices common among non-native English writers. The TOEFL essays exhibited lower text perplexity, meaning their word choices were easier for the underlying language model to predict. If the next word in an essay is hard for the detector’s model to predict, the detector is more likely to assume a human wrote the text; if the next word is easy to predict, it leans toward labeling the text AI-generated.
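To illustrate that heuristic, here is a minimal sketch of perplexity-based scoring, assuming the Hugging Face transformers library with GPT-2 as the scoring model. The threshold is an arbitrary placeholder for illustration; real detectors are calibrated differently, and this is not the study’s implementation.

```python
# Minimal sketch of the perplexity heuristic described above.
# GPT-2 and the threshold value are illustrative assumptions, not the
# actual detectors evaluated in the study.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How 'surprised' the model is, on average, by each next token."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy
        # of predicting each token from the ones before it.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

def naive_ai_label(text: str, threshold: float = 40.0) -> str:
    """Low perplexity (predictable wording) -> flag as AI; high -> human."""
    return "AI-generated?" if perplexity(text) < threshold else "human?"

print(naive_ai_label("The cat sat on the mat because it was warm."))
```

Lower-perplexity writing, such as the simpler phrasing typical of non-native English essays, falls below a threshold like this more often, which is precisely the bias the study documents.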
Detecting AI-generated content is difficult in general, which is why third-party detection tools have become popular for this purpose. However, this research suggests that such tools may marginalize non-native English writers in evaluative and educational settings. Zou explained that if these detectors are used to screen job applications, college entrance essays, or high school assignments, the consequences can be significant, potentially leading to discrimination, harassment, and restricted visibility for non-native English speakers.
Interestingly, the study also notes that GPT detectors could inadvertently push non-native English speakers to lean more heavily on generative AI tools, using them to polish their language so that their writing evades detection and the discrimination that comes with being falsely flagged.
In conclusion, while AI has gained popularity in education, the use of generative AI tools and the reliability of GPT detectors remain controversial. The study emphasizes the limitations and potential biases of these detectors, particularly when assessing the work of non-native English writers. It calls for caution and suggests exploring alternative approaches to evaluating students’ work.