Researchers at Binghamton University have developed a machine learning algorithm called xFakeSci that is specifically designed to detect pseudoscience content generated by generative AI such as ChatGPT.
The technique proved particularly effective at identifying academic articles written by generative AI, achieving a success rate nearly twice that of conventional data mining techniques.
Generative AI such as ChatGPT can produce scientific articles that appear genuine but are actually fake, posing serious credibility problems for the academic community. Traditional detection tools struggle to identify these AI-generated articles accurately, especially when they are mixed in with real ones.
Traditional data mining algorithms (such as support vector machines and regression models) perform poorly at detecting AI-generated content, with F1 scores of only 38% to 52%. xFakeSci significantly improves detection accuracy through a new calibration and distance-approximation algorithm, achieving F1 scores of 80% to 94%.
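For context, the F1 score used throughout is the harmonic mean of precision (the fraction of articles flagged as fake that really are fake) and recall (the fraction of fake articles that get flagged):

$$
F_1 = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
$$

A high F1 therefore means both false alarms and missed fakes are rare.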
The researchers designed xFakeSci to analyze two main features of paper writing:
Bigrams
- Definition: A bigram is a pair of words that frequently appear together, such as "climate change" or "clinical trials."
- Comparative analysis: xFakeSci found that real scientific articles contain a relatively large number and high diversity of bigrams. Fake articles contain significantly fewer bigrams, but those few are closely connected to other concepts and words throughout the article (see the sketch below).
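As a rough illustration (not the authors' code), here is a minimal Python sketch of the bigram features described above; the sample texts and the diversity ratio are assumptions chosen for demonstration:

```python
# Minimal sketch: count bigrams and measure their diversity.
from collections import Counter
import re

def bigram_stats(text: str):
    """Return total bigram count, distinct bigrams, and their ratio."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = list(zip(words, words[1:]))
    counts = Counter(bigrams)
    total = sum(counts.values())
    distinct = len(counts)
    diversity = distinct / total if total else 0.0  # higher = more varied
    return total, distinct, diversity

real = "the clinical trials measured outcomes while separate cohort studies tracked survival"
fake = "climate change matters because climate change shows climate change is real"
print(bigram_stats(real))  # many distinct bigrams, high diversity
print(bigram_stats(fake))  # repeated bigrams, lower diversity
```

In this toy example the repetitive "fake" text reuses the same bigram, mirroring the paper's observation that forged articles lean on a few prominent phrases.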
Association patterns between words and concepts
- Characteristics of fake articles: Although fake articles contain fewer bigrams, those bigrams connect widely to other parts of the content. This reflects a limitation of AI-generated text: the AI tends to boost an article's apparent "credibility" by repeating a few prominent words, rather than showing the breadth and rigor of real scientific research.
- Characteristics of real articles: Human researchers aim to accurately report the research process and results, not to deliberately "convince" readers, so the association patterns between words and concepts in genuine papers are more complex and diverse.
Differences in writing style
- When generating academic articles, generative AI often tries to "convince" readers by emphasizing prominent keywords and key concepts, creating a seemingly "in-depth" impression. Real scientific articles, by contrast, focus on detailed descriptions of the research process and data analysis, and are broader and more objective.
- AI-generated articles tend to give overly focused, simplistic treatments of a topic, while real articles provide richer background information and discussion from multiple angles.
Technical Principle of xFakeSci
xFakeSci uses an algorithm called TF-IDF (term frequency-inverse document frequency) to find the most important words in an article. It measures how often a word appears in an article (term frequency) and weighs that against how common the word is across the entire dataset (inverse document frequency). Words that appear often in one article but rarely elsewhere score highest, which helps xFakeSci pick out the "keywords" it uses to make judgments.
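A minimal TF-IDF sketch using scikit-learn follows; the three-document corpus is invented for illustration and is not from the xFakeSci experiments:

```python
# Score terms by TF-IDF and show the top-weighted terms of one document.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "clinical trials show the treatment reduced depression symptoms",
    "climate change drives climate change in every climate change model",
    "the cohort study reported cancer incidence and survival data",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
matrix = vectorizer.fit_transform(corpus)

terms = vectorizer.get_feature_names_out()
weights = matrix[0].toarray().ravel()
top = sorted(zip(terms, weights), key=lambda t: -t[1])[:5]
print(top)  # highest-scoring terms in the first document
```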
Two training modes
xFakeSci can be trained on two different kinds of data:
- Single-mode training: uses only one source of data, such as all real scientific literature or all AI-generated articles. This method suits simple cases.
- Multimodal training: uses both real literature and AI-generated articles, which suits complex mixed content and distinguishes the two more accurately (a sketch of both setups follows).
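The following sketch shows how the two setups might be assembled; the corpus names and the loader are hypothetical placeholders, not the authors' code:

```python
# Hypothetical loader standing in for PubMed extraction or a ChatGPT dump.
def load_corpus(source: str) -> list[str]:
    samples = {
        "pubmed_real": ["clinical trial results ...", "cohort study data ..."],
        "chatgpt_generated": ["this clearly proves ...", "in conclusion ..."],
    }
    return samples[source]

# Single-mode training: the model sees only one source.
single_mode_docs = load_corpus("pubmed_real")

# Multimodal training: combine both sources with labels.
real = load_corpus("pubmed_real")
fake = load_corpus("chatgpt_generated")
docs = real + fake
labels = [1] * len(real) + [0] * len(fake)  # 1 = real, 0 = fake
```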
Calibration steps
To make the results more accurate, xFakeSci performs a calibration process after training. It divides the data into several groups (k-fold cross-validation) and repeatedly verifies that the model stays stable across different data subsets. This step helps it avoid becoming too "biased" toward one type of data and improves accuracy.
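Here is a minimal k-fold calibration sketch using scikit-learn; the TF-IDF-plus-logistic-regression pipeline is an illustrative stand-in, not the paper's actual calibration algorithm:

```python
# Check that detection scores stay stable across folds of the data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

docs = [
    "clinical trial data showed a modest effect",
    "this clearly proves the treatment works",
    "cohort study results varied across sites",
    "in conclusion it is obvious that the drug cures",
]
labels = [1, 0, 1, 0]  # 1 = real, 0 = fake

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# k-fold cross-validation: train and test on rotating subsets.
scores = cross_val_score(model, docs, labels, cv=2, scoring="f1")
print(scores.mean(), scores.std())  # stable scores suggest no bias
```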
Distance Algorithm
xFakeSci also uses a distance algorithm to help decide whether an article is genuine. In simple terms, it computes the "distance" between an article and AI-generated content on one hand and real scientific articles on the other, then assigns the article to whichever category is closer. This lets it make a reasonably accurate judgment even when an article's features are not especially pronounced.
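A hedged sketch of distance-based classification using nearest centroids follows; this is a generic stand-in for the paper's distance-approximation step, with toy two-dimensional feature vectors:

```python
# Assign a document to whichever class centroid is closer.
import numpy as np

def nearest_class(doc_vec, real_vecs, fake_vecs):
    real_centroid = real_vecs.mean(axis=0)
    fake_centroid = fake_vecs.mean(axis=0)
    d_real = np.linalg.norm(doc_vec - real_centroid)
    d_fake = np.linalg.norm(doc_vec - fake_centroid)
    return "real" if d_real < d_fake else "fake"

# Toy 2-D features (e.g., bigram diversity, bigram connectivity).
real_vecs = np.array([[0.9, 0.2], [0.8, 0.3]])
fake_vecs = np.array([[0.2, 0.9], [0.3, 0.8]])
print(nearest_class(np.array([0.7, 0.4]), real_vecs, fake_vecs))  # real
```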
Experimental Results of xFakeSci
Detection accuracy (F1 score)
xFakeSci’s F1 score is between 80% and 94%, which means it can distinguish AI-generated pseudoscience content from real scientific literature with high accuracy. Specifically:
- Depression dataset: F1 score of 80%
- Cancer dataset: F1 score of 91%
- Alzheimer’s disease dataset: F1 score of 89%
These scores are much higher than those of traditional algorithms such as support vector machines (SVM), logistic regression, and naive Bayes, whose F1 scores fall between only 38% and 52%.
Outperforming traditional algorithms
xFakeSci significantly outperforms traditional algorithms in multiple experiments:
- Traditional algorithms often misclassify real scientific literature, especially on mixed datasets. xFakeSci handles these challenges better, correctly identifying genuine literature while correctly classifying AI-generated pseudoscience.
- Its calibration step and distance-approximation algorithm are the keys to its success: they effectively avoid overfitting and let the model perform well across different datasets and time ranges.
In a new paper published in the journal Scientific Reports, Binghamton researcher Ahmed Abdeen Hamed teamed up with Xindong Wu, a professor at Hefei University of Technology in China, to create 50 fake articles on each of three popular medical topics (Alzheimer’s disease, cancer, and depression) and compare them with an equal number of real articles on the same topics.
Hamed says that when he asked ChatGPT to generate the fake articles, “I tried to use the same keywords that I used when extracting literature from the National Institutes of Health’s PubMed database, so that we had a common basis for comparison. My hunch was that there must be some pattern between the fake world and the real world, but I didn’t know what that pattern was.”
After some experimentation, he programmed xFakeSci to analyze two main features of paper writing. One was the number of bigrams, which are two words that frequently appear together, such as “climate change,” “clinical trials,” or “biomedical literature.” The other was how those bigrams were related to other words and concepts in the text.
“The first striking finding is that the number of bigrams in the forged papers is very small, while the bigrams in the authentic papers are much more abundant,” Hamed said. “Also, although the bigrams in the forged papers are few in number, they are highly correlated with other content.”
Hamed and Wu speculate that the differences in writing styles are because the goals of human researchers are different from those of an AI that is asked to write about a specific topic.
“Because ChatGPT is still intellectually limited, it will try to convince you by using the most important words,” Hamed said. “A scientist’s job is not to convince you, but to honestly report what happened during the experiment and the methods used. ChatGPT emphasizes the depth of a single point, while real science emphasizes breadth.”
To further develop xFakeSci, Hamed plans to expand the range of research topics to see whether this striking vocabulary pattern holds in fields beyond medicine, such as engineering, other scientific disciplines, and the humanities. He also predicts that as AI technology continues to improve, it will become increasingly difficult to distinguish real content from fake.
“If we don’t design comprehensive solutions, we’ll always be playing catch-up,” he said. “We still have a lot of work to do to find a general pattern or universal algorithm that doesn’t depend on which version of generative AI is used.”
Although their algorithm detected 94% of AI-generated papers, he added, that meant six fakes still escaped detection: “We need to be humble about our achievements. We have done something very important by raising awareness.”