Detecting AI-generated text through statistical analysis, linguistic patterns, and machine learning
AI-generated text can be detected through statistical analysis, linguistic pattern recognition, and ML classifiers, but no method is foolproof. The best tools achieve 85-99% accuracy on unmodified AI text while maintaining 1-5% false positive rates, though detection degrades sharply against paraphrased content, short texts, and non-English languages. This research synthesizes findings from 50+ academic papers and commercial benchmarks.
Measures perplexity, burstiness, and entropy to identify the statistical fingerprint of AI-generated text
Identifies vocabulary overuse, sentence uniformity, and structural rigidity characteristic of LLM output
Transformer-based deep learning models trained on millions of human and AI text samples
Uses reference language models to score text without task-specific training data
PPL(X) = exp(-(1/T) × Σₜ log P(xₜ | x<ₜ))
Measures how "surprised" a language model is by text. AI produces low-perplexity text because LLMs minimize it during training — they select statistically probable tokens at each step.
B = (σ / μ) × 100
Captures variation in complexity across a document. Humans naturally create "bursty" text — mixing registers, varying vocabulary. AI maintains a monotonous tempo through uniform next-token prediction.
H(X) = -Σ P(x) log P(x)
Measures the confidence of token predictions. AI-generated text has lower predictive entropy because the model selects from a narrower, more concentrated probability distribution.
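The three metrics above can be sketched in a few lines of Python. This is a minimal illustration over toy inputs, not a production detector: perplexity here takes per-token log-probabilities (as a real LM would emit), burstiness takes per-sentence complexity scores (e.g. sentence lengths), and entropy takes one next-token probability distribution. Note that T in the perplexity formula is the token count, not a sampling temperature.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/T) * sum of log P(token | context)), T = number of tokens.
    Lower values mean the text was predictable to the model (AI-like)."""
    T = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / T)

def burstiness(sentence_scores):
    """B = (sigma / mu) * 100: coefficient of variation across sentences.
    Near-zero variation (monotonous tempo) is an AI-like signal."""
    mu = sum(sentence_scores) / len(sentence_scores)
    var = sum((x - mu) ** 2 for x in sentence_scores) / len(sentence_scores)
    return (math.sqrt(var) / mu) * 100

def entropy(probs):
    """H = -sum p log p over a next-token distribution.
    A concentrated distribution yields low entropy (AI-like)."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For example, a sequence where every token had probability 0.5 gives a perplexity of exactly 2, and identical sentence scores give a burstiness of 0.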
Kobak et al. (2024) analyzed 14.2M PubMed abstracts and identified 280 excess style words. Lexical diversity is ~15% lower in AI text with ~20% higher repetition rates.
GPT-4o uses ~10x more em dashes (—) than GPT-3.5. Most humans use regular hyphens since keyboards lack a dedicated em dash key. Overuse frequency in context is the key signal.
Headers, bullet points, numbered lists, and bold text even when continuous prose would be more natural. Lists with redundant items and near-every-sentence bolding.
Neutral, encyclopedic, emotionally detached tone. Perpetual hedging (“it's important to note”), multiple perspectives even when unnecessary, and restating rather than synthesizing.
A ten-signal heuristic framework for human-readable AI text detection. When multiple signals co-occur, confidence in AI authorship increases significantly.
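One way to operationalize "confidence increases when signals co-occur" is a simple signal counter. The thresholds below are illustrative assumptions, not values from the framework itself, and the output is a qualitative band rather than a verdict:

```python
def ai_confidence(signals: dict) -> str:
    """signals: mapping of signal name -> bool (fired or not).
    More co-occurring signals -> higher qualitative confidence.
    Thresholds are illustrative; never treat the result as proof."""
    hits = sum(signals.values())
    if hits >= 7:
        return "high"
    if hits >= 4:
        return "moderate"
    if hits >= 2:
        return "weak"
    return "insufficient"
```

A single fired signal (say, heavy em-dash use) stays "insufficient" on its own; it takes several independent signals to move the needle.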
Real-world accuracy is typically 15-25 percentage points below vendor claims. Vendors test on controlled datasets, longer texts, unmodified AI output, and English-only content.
Instead of post-hoc detection, watermarking embeds detectable signals during text generation. Requires model provider cooperation.
“The distribution of what AI-generated text and human-generated text looks like are converging on each other.” — Soheil Feizi, University of Maryland
Simpler vocabulary, memorized phrases, and limited sentence complexity trigger detection signals
Source: Stanford (Liang et al., 2023)
Standardized terminology, consistent voice, and rigid structure mimic AI patterns
Source: 250 pre-ChatGPT scientific articles
Repeated phrases, high structure, and unusual syntax patterns in autism/ADHD/dyslexia
Source: Multiple studies
The US Constitution was flagged as 92-98% AI-generated; the Declaration of Independence scored 98.51%
Source: Multiple detector tests
Black students approximately twice as likely as white/Latino peers to be falsely accused
Source: Common Sense Media
UC Davis
Senior William Quarterman received a failing grade after GPTZero flagged his history exam — suffered panic attacks before being exonerated.
Australian Catholic University
~6,000 cheating allegations in 2024 (~90% AI-related); ~25% dismissed. One nursing student waited 6 months, losing a graduate position.
Vanderbilt University
Disabled Turnitin AI detector entirely — calculated 1% FPR means ~750 incorrectly labeled papers per year from 75,000 submissions.
Doe v. Yale (2025)
French-native EMBA student suspended after GPTZero flagged exam — first major legal test of AI detection, alleging Title VI discrimination.
Paraphrasing is the dominant attack vector. No current detector is robust against determined adversarial manipulation.
| Attack Method | Impact on Detection |
|---|---|
| Paraphrasing (DIPPER) | 70.3% → 4.6% detection at 1% FPR |
| Adversarial Paraphrasing | 87.88% avg TPR reduction |
| Homoglyph Substitution | 40.6% avg accuracy loss |
| Back-Translation | Significant degradation |
| Prompt Engineering | 100% → 13% detection |
| Repetition Penalty Tuning | Up to 95% accuracy loss |
| GradEscape (Black-box) | 90% evasion effectiveness |
| High Temperature Decoding | Disrupts distributional patterns |
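The homoglyph attack from the table is easy to demonstrate: swapping Latin characters for visually identical Cyrillic ones leaves the text looking unchanged to a human while altering every byte a detector tokenizes. The sketch below uses a small hand-picked lookalike map (an illustrative subset, not a complete homoglyph inventory) and shows a matching defense, flagging text that mixes Unicode scripts:

```python
import unicodedata

# Latin -> visually near-identical Cyrillic lookalikes (illustrative subset)
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
              "p": "\u0440", "c": "\u0441"}

def homoglyph_attack(text):
    """Replace every mapped Latin character with its Cyrillic twin.
    The result renders the same but tokenizes completely differently."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def mixed_script(text):
    """Simple countermeasure: flag text whose letters span multiple
    Unicode scripts (e.g. LATIN and CYRILLIC in one word)."""
    scripts = {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}
    return len(scripts) > 1
```

Normal text uses one script, so the mixed-script check cleanly separates attacked from clean input, which is one reason homoglyph attacks are among the easier ones to defend against.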
Mitchell et al., Stanford — ICML 2023
AI text occupies negative curvature regions of log probability function; 0.95 AUROC for GPT-NeoX
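The core DetectGPT idea fits in one function: AI text tends to sit near a local maximum of the model's log-probability, so small perturbations (in the paper, T5 mask-fill rewrites) lower its log-probability more than they would for human text. The sketch below takes `log_prob` and `perturb` as caller-supplied callables, since the real components are full models; it captures only the scoring logic:

```python
def detectgpt_score(text, log_prob, perturb, n=20):
    """DetectGPT-style curvature score: log-prob of the original text
    minus the mean log-prob of n perturbed variants.
    A large positive score suggests the text sits at a local maximum
    of log P (negative curvature), i.e. machine-generated."""
    perturbed = [perturb(text) for _ in range(n)]
    return log_prob(text) - sum(map(log_prob, perturbed)) / n
```

In practice the gap is normalized by the standard deviation of the perturbed scores and thresholded; that step is omitted here for brevity.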
Bao et al. — ICLR 2024
75% improvement and 340x speedup over DetectGPT via conditional probability estimation
Hans et al., UMD — ICML 2024
>90% TPR at 0.01% FPR — strongest low-FPR performance; minimal ESL bias (99.67% accuracy)
Dugan et al. — ACL 2024
6.2M+ generations, 11 models, 8 domains — "detectors not yet robust enough for widespread deployment"
University of Tübingen — 2024
14.2M abstracts analyzed; "delves" 25.2x increase; 10% of 2024 abstracts processed with LLMs
University of Maryland — ICML 2024
Detection remains feasible except when distributions become indistinguishable across entire support
Indonesian training data represents only 0.6% of web content. Most detectors lose up to 30% accuracy on non-English text. No major academic paper specifically on Indonesian AI detection has been published — an underserved research area.
Ratio of baku (standard) vs. non-baku vocabulary — AI defaults to extremely formal bahasa baku
Natural Indonesian frequently mixes with English or local languages; AI rarely does this
Check if prefixes (me-, ber-, di-) are used naturally or formulaically
Local idioms, personal experiences, and references to Indonesian culture signal human authorship
AI uses safe defaults ("penting", "menarik") where humans vary ("krusial", "seru", "greget")
Near-perfect spelling/grammar signals AI — natural Indonesian writing includes small imperfections
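Two of the heuristics above, the baku/non-baku formality ratio and code-mixing frequency, are straightforward to compute over tokenized text. The word lists below are tiny illustrative samples chosen for this sketch, not curated lexicons; a real implementation would need proper dictionaries:

```python
# Illustrative samples only -- real lexicons would be far larger
BAKU = {"tidak", "sudah", "saja", "sangat", "penting"}        # standard forms
NON_BAKU = {"nggak", "udah", "aja", "banget", "gue", "seru"}  # colloquial forms
ENGLISH = {"update", "meeting", "deadline", "weekend"}         # mixed-in English

def formality_ratio(tokens):
    """Share of baku among recognized formality-marked words.
    A ratio near 1.0 (pure bahasa baku) is an AI-like signal."""
    baku = sum(t in BAKU for t in tokens)
    non_baku = sum(t in NON_BAKU for t in tokens)
    total = baku + non_baku
    return baku / total if total else None

def code_mixing_rate(tokens):
    """Fraction of tokens that are mixed-in English words.
    Natural Indonesian often code-mixes; AI output rarely does."""
    return sum(t in ENGLISH for t in tokens) / len(tokens)
```

On a colloquial sentence like "nggak udah banget" the formality ratio is 0.0, while uniformly formal AI-style output trends toward 1.0.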
| Tool | Notes |
|---|---|
| Originality.ai | 97.8% claimed accuracy |
| Winston AI | Multi-language |
| Copyleaks | 30+ languages |
| Smodin | Indonesian interface |
| Isgen.ai | Indonesian-focused |
Recommendation: Use minimum 2-3 different tools for Indonesian detection. The M4 dataset (EACL 2024 Best Resource Paper) includes ~6,000 Indonesian news samples as a training starting point.
P(xᵢ) = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)
Low temperature (T < 1): the distribution sharpens and the model becomes deterministic. Text is more detectable — perplexity signals become extremely pronounced.
T = 1: standard probability distribution, balanced between quality and diversity. Most commercial models use T=0.7-1.0 with top-p sampling for production output.
High temperature (T > 1): the distribution flattens and more randomness is introduced. Text is harder to detect — disrupts the distributional patterns detectors rely on.
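Temperature scaling is just the softmax formula above applied to logits divided by T. A minimal implementation, with the usual max-subtraction for numerical stability:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """P(x_i) = exp(z_i / T) / sum_j exp(z_j / T).
    T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With logits [2.0, 0.0], the top token's probability rises from about 0.73 at T=2.0 to about 0.98 at T=0.5, which is exactly the sharpening that makes low-temperature output easier to detect.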
Larger LLMs are harder to detect. Base models are more difficult to detect than chat-fine-tuned counterparts — instruction tuning introduces more detectable patterns. Different models have distinct “aidiolects.”
| Model | Turnitin Detection Rate |
|---|---|
| GPT-4 | 98-100% |
| Gemini | 98-100% |
| ChatGPT o1 | Struggling |
| Claude 3.5 | 53-60% |
Long-form, single-author. Drops to 20-63% for hybrid/paraphrased content.
Abstract structure confuses classifiers. Fewer than 20% correctly identified.
Too short (<300 words). Professional email tone naturally mimics AI patterns.
Inherently structured, formal conventions make human/AI separation very difficult.
Text length is the single strongest predictor of detection reliability. Under 100 words: unreliable. 300-700 words: acceptable. 500+ words: trustworthy results.
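The length thresholds above translate directly into a gating function that a detection pipeline could apply before reporting any score. The band names and the cut between "acceptable" and "trustworthy" are one reading of the overlapping ranges in the text (300-700 acceptable, 500+ trustworthy), so treat the exact boundaries as an interpretation:

```python
def length_reliability(word_count):
    """Map text length to a reliability band before scoring.
    Bands follow the thresholds stated in this document; the
    100-300 'marginal' band is interpolated between them."""
    if word_count < 100:
        return "unreliable"
    if word_count < 300:
        return "marginal"
    if word_count < 500:
        return "acceptable"
    return "trustworthy"
```

A pipeline would refuse to emit a verdict-like score for "unreliable" inputs rather than report a meaningless number.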
“No AI detector is 100% reliable. They make probabilistic guesses, not deterministic judgments.”
— Arvind Narayanan, Princeton University
“We are opposed to the use of detectors in any sort of disciplinary or punitive context.”
— RAID Study Ethics Statement, ACL 2024
“Our tool is intended to assist educators — not to serve as a definitive judgment.”
— Turnitin, Official Position
“We do not believe that AI detection scores alone should be used for academic honesty purposes.”
— Originality.ai, Official Position
No single method is sufficient. Best commercial systems use 7+ components. Aggregating multiple detectors reduces false positives to near zero while maintaining reasonable true positive rates.
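A simple way to see why aggregation suppresses false positives: require several detectors to independently agree before flagging anything. The threshold and agreement count below are illustrative defaults, not values from any cited system:

```python
def ensemble_flag(scores, threshold=0.85, min_agree=3):
    """Flag text only when at least min_agree detector scores exceed a
    conservative threshold. Independent detectors rarely all err on the
    same human text, so requiring agreement drives false positives down
    while unmodified AI text still trips most detectors at once."""
    return sum(s >= threshold for s in scores) >= min_agree
```

Four detectors where three score above 0.85 yields a flag; a single outlier detector firing alone does not.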
Conservative thresholds (85-90%), 300-word minimums, domain-specific calibration, ESL awareness. Results should never be presented as verdicts — only as probability indicators requiring human judgment.
XLM-RoBERTa or mDeBERTa-v3-base as backbone, fine-tuned on Indonesian data. Indonesian-specific features: formality analysis, code-mixing detection, cultural reference assessment. M4 dataset provides 6,000 samples.
Paraphrasing attacks reduce detection by 88% on average. Paraphrase-aware training, retrieval-based defenses, and regular model updates are essential. There is no permanent solution, only an ongoing process.
Detection scores are probabilistic assessments, not proof. The lab-to-real gap is 15-25%. Short texts, creative writing, and non-English content have lower reliability. Human judgment is an irreplaceable component.