Detecting AI-generated text through statistical analysis, linguistic patterns, and machine learning
AI-generated text can be detected through statistical analysis, linguistic pattern recognition, and ML classifiers, but no method is foolproof. The best tools achieve 85-99% accuracy on unmodified AI text while maintaining 1-5% false positive rates, though detection degrades sharply against paraphrased content, short texts, and non-English languages. This research synthesizes findings from 50+ academic papers and commercial benchmarks.
Measures perplexity, burstiness, and entropy to identify the statistical fingerprint of AI-generated text
Identifies vocabulary overuse, sentence uniformity, and structural rigidity characteristic of LLM output
Transformer-based deep learning models trained on millions of human and AI text samples
Uses reference language models to score text without task-specific training data
PPL(X) = exp(-(1/T) × Σₜ log P(xₜ | x<ₜ))
Measures how "surprised" a language model is by text. AI produces low-perplexity text because LLMs minimize it during training — they select statistically probable tokens at each step.
B = (σ / μ) × 100
Captures variation in complexity across a document. Humans naturally create "bursty" text — mixing registers, varying vocabulary. AI maintains a monotonous tempo through uniform next-token prediction.
H(X) = -Σ P(x) log P(x)
Measures the confidence of token predictions. AI-generated text has lower predictive entropy because the model selects from a narrower, more concentrated probability distribution.
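The three metrics above can be sketched in a few lines of Python. This is a minimal illustration over toy inputs, not a production detector: perplexity here takes per-token log-probabilities (as a real LM would emit), burstiness takes per-sentence complexity scores (e.g. sentence lengths), and entropy takes one next-token probability distribution. Note that T in the perplexity formula is the token count, not a sampling temperature.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/T) * sum of log P(token | context)), T = number of tokens.
    Lower values mean the text was predictable to the model (AI-like)."""
    T = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / T)

def burstiness(sentence_scores):
    """B = (sigma / mu) * 100: coefficient of variation across sentences.
    Near-zero variation (monotonous tempo) is an AI-like signal."""
    mu = sum(sentence_scores) / len(sentence_scores)
    var = sum((x - mu) ** 2 for x in sentence_scores) / len(sentence_scores)
    return (math.sqrt(var) / mu) * 100

def entropy(probs):
    """H = -sum p log p over a next-token distribution.
    A concentrated distribution yields low entropy (AI-like)."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For example, a sequence where every token had probability 0.5 gives a perplexity of exactly 2, and identical sentence scores give a burstiness of 0.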
Kobak et al. (2024) analyzed 14.2M PubMed abstracts and identified 280 excess style words. Lexical diversity is ~15% lower in AI text with ~20% higher repetition rates.
GPT-4o uses ~10x more em dashes (—) than GPT-3.5. Most humans use regular hyphens since keyboards lack a dedicated em dash key. Overuse frequency in context is the key signal.
Headers, bullet points, numbered lists, and bold text even when continuous prose would be more natural. Lists with redundant items and near-every-sentence bolding.
Neutral, encyclopedic, emotionally detached tone. Perpetual hedging (“it's important to note”), multiple perspectives even when unnecessary, and restating rather than synthesizing.
A ten-signal heuristic framework for human-readable AI text detection. When multiple signals co-occur, confidence in AI authorship increases significantly.
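One way to operationalize "confidence increases when signals co-occur" is a simple signal counter. The thresholds below are illustrative assumptions, not values from the framework itself, and the output is a qualitative band rather than a verdict:

```python
def ai_confidence(signals: dict) -> str:
    """signals: mapping of signal name -> bool (fired or not).
    More co-occurring signals -> higher qualitative confidence.
    Thresholds are illustrative; never treat the result as proof."""
    hits = sum(signals.values())
    if hits >= 7:
        return "high"
    if hits >= 4:
        return "moderate"
    if hits >= 2:
        return "weak"
    return "insufficient"
```

A single fired signal (say, heavy em-dash use) stays "insufficient" on its own; it takes several independent signals to move the needle.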
Real-world accuracy is typically 15-25 percentage points below vendor claims. Vendors test on controlled datasets, longer texts, unmodified AI output, and English-only content.
Instead of post-hoc detection, watermarking embeds detectable signals during text generation. Requires model provider cooperation.
“The distribution of what AI-generated text and human-generated text looks like are converging on each other.” — Soheil Feizi, University of Maryland
Simpler vocabulary, memorized phrases, and limited sentence complexity trigger detection signals
Source: Stanford (Liang et al., 2023)
Standardized terminology, consistent voice, and rigid structure mimic AI patterns
Source: 250 pre-ChatGPT scientific articles
Repeated phrases, high structure, and unusual syntax patterns in autism/ADHD/dyslexia
Source: Multiple studies
The US Constitution was flagged as 92-98% AI-generated; the Declaration of Independence scored 98.51%
Source: Multiple detector tests
Black students approximately twice as likely as white/Latino peers to be falsely accused
Source: Common Sense Media
UC Davis
Senior William Quarterman received a failing grade after GPTZero flagged his history exam — suffered panic attacks before being exonerated.
Australian Catholic University
~6,000 cheating allegations in 2024 (~90% AI-related); ~25% dismissed. One nursing student waited 6 months, losing a graduate position.
Vanderbilt University
Disabled Turnitin AI detector entirely — calculated 1% FPR means ~750 incorrectly labeled papers per year from 75,000 submissions.
Doe v. Yale (2025)
French-native EMBA student suspended after GPTZero flagged exam — first major legal test of AI detection, alleging Title VI discrimination.
Paraphrasing is the dominant attack vector. No current detector is robust against determined adversarial manipulation.
| Attack Method | Impact on Detection |
|---|---|
| Paraphrasing (DIPPER) | 70.3% → 4.6% detection at 1% FPR |
| Adversarial Paraphrasing | 87.88% avg TPR reduction |
| Homoglyph Substitution | 40.6% avg accuracy loss |
| Back-Translation | Significant degradation |
| Prompt Engineering | 100% → 13% detection |
| Repetition Penalty Tuning | Up to 95% accuracy loss |
| GradEscape (Black-box) | 90% evasion effectiveness |
| High Temperature Decoding | Disrupts distributional patterns |
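The homoglyph attack from the table is easy to demonstrate: swapping Latin characters for visually identical Cyrillic ones leaves the text looking unchanged to a human while altering every byte a detector tokenizes. The sketch below uses a small hand-picked lookalike map (an illustrative subset, not a complete homoglyph inventory) and shows a matching defense, flagging text that mixes Unicode scripts:

```python
import unicodedata

# Latin -> visually near-identical Cyrillic lookalikes (illustrative subset)
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
              "p": "\u0440", "c": "\u0441"}

def homoglyph_attack(text):
    """Replace every mapped Latin character with its Cyrillic twin.
    The result renders the same but tokenizes completely differently."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def mixed_script(text):
    """Simple countermeasure: flag text whose letters span multiple
    Unicode scripts (e.g. LATIN and CYRILLIC in one word)."""
    scripts = {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}
    return len(scripts) > 1
```

Normal text uses one script, so the mixed-script check cleanly separates attacked from clean input, which is one reason homoglyph attacks are among the easier ones to defend against.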
Mitchell et al., Stanford — ICML 2023
AI text occupies negative curvature regions of log probability function; 0.95 AUROC for GPT-NeoX
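The core DetectGPT idea fits in one function: AI text tends to sit near a local maximum of the model's log-probability, so small perturbations (in the paper, T5 mask-fill rewrites) lower its log-probability more than they would for human text. The sketch below takes `log_prob` and `perturb` as caller-supplied callables, since the real components are full models; it captures only the scoring logic:

```python
def detectgpt_score(text, log_prob, perturb, n=20):
    """DetectGPT-style curvature score: log-prob of the original text
    minus the mean log-prob of n perturbed variants.
    A large positive score suggests the text sits at a local maximum
    of log P (negative curvature), i.e. machine-generated."""
    perturbed = [perturb(text) for _ in range(n)]
    return log_prob(text) - sum(map(log_prob, perturbed)) / n
```

In practice the gap is normalized by the standard deviation of the perturbed scores and thresholded; that step is omitted here for brevity.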
Bao et al. — ICLR 2024
75% improvement and 340x speedup over DetectGPT via conditional probability estimation
Hans et al., UMD — ICML 2024
>90% TPR at 0.01% FPR — strongest low-FPR performance; minimal ESL bias (99.67% accuracy)
Dugan et al. — ACL 2024
6.2M+ generations, 11 models, 8 domains — "detectors not yet robust enough for widespread deployment"
University of Tübingen — 2024
14.2M abstracts analyzed; "delves" 25.2x increase; 10% of 2024 abstracts processed with LLMs
University of Maryland — ICML 2024
Detection remains feasible except when distributions become indistinguishable across entire support
Indonesian training data represents only 0.6% of web content. Most detectors lose up to 30% accuracy on non-English text. No major academic paper specifically on Indonesian AI detection has been published — an underserved research area.
Ratio of baku (standard) vs. non-baku vocabulary — AI defaults to extremely formal bahasa baku
Natural Indonesian frequently mixes with English or local languages; AI rarely does this
Check if prefixes (me-, ber-, di-) are used naturally or formulaically
Local idioms, personal experiences, and references to Indonesian culture signal human authorship
AI uses safe defaults ("penting", "menarik") where humans vary ("krusial", "seru", "greget")
Near-perfect spelling/grammar signals AI — natural Indonesian writing includes small imperfections
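Two of the heuristics above, the baku/non-baku formality ratio and code-mixing frequency, are straightforward to compute over tokenized text. The word lists below are tiny illustrative samples chosen for this sketch, not curated lexicons; a real implementation would need proper dictionaries:

```python
# Illustrative samples only -- real lexicons would be far larger
BAKU = {"tidak", "sudah", "saja", "sangat", "penting"}        # standard forms
NON_BAKU = {"nggak", "udah", "aja", "banget", "gue", "seru"}  # colloquial forms
ENGLISH = {"update", "meeting", "deadline", "weekend"}         # mixed-in English

def formality_ratio(tokens):
    """Share of baku among recognized formality-marked words.
    A ratio near 1.0 (pure bahasa baku) is an AI-like signal."""
    baku = sum(t in BAKU for t in tokens)
    non_baku = sum(t in NON_BAKU for t in tokens)
    total = baku + non_baku
    return baku / total if total else None

def code_mixing_rate(tokens):
    """Fraction of tokens that are mixed-in English words.
    Natural Indonesian often code-mixes; AI output rarely does."""
    return sum(t in ENGLISH for t in tokens) / len(tokens)
```

On a colloquial sentence like "nggak udah banget" the formality ratio is 0.0, while uniformly formal AI-style output trends toward 1.0.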
| Tool | Notes |
|---|---|
| Originality.ai | 97.8% claimed accuracy |
| Winston AI | Multi-language |
| Copyleaks | 30+ languages |
| Smodin | Indonesian interface |
| Isgen.ai | Indonesian-focused |
Recommendation: Use minimum 2-3 different tools for Indonesian detection. The M4 dataset (EACL 2024 Best Resource Paper) includes ~6,000 Indonesian news samples as a training starting point.
P(xᵢ) = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)
Low temperature (T < 1): the distribution sharpens and the model becomes deterministic. Text is more detectable — perplexity signals become extremely pronounced.
T = 1: standard probability distribution, balanced between quality and diversity. Most commercial models use T=0.7-1.0 with top-p sampling for production output.
High temperature (T > 1): the distribution flattens and more randomness is introduced. Text is harder to detect — disrupts the distributional patterns detectors rely on.
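Temperature scaling is just the softmax formula above applied to logits divided by T. A minimal implementation, with the usual max-subtraction for numerical stability:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """P(x_i) = exp(z_i / T) / sum_j exp(z_j / T).
    T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With logits [2.0, 0.0], the top token's probability rises from about 0.73 at T=2.0 to about 0.98 at T=0.5, which is exactly the sharpening that makes low-temperature output easier to detect.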
Larger LLMs are harder to detect. Base models are more difficult to detect than chat-fine-tuned counterparts — instruction tuning introduces more detectable patterns. Different models have distinct “aidiolects.”
| Model | Turnitin Detection Rate |
|---|---|
| GPT-4 | 98-100% |
| Gemini | 98-100% |
| ChatGPT o1 | Struggling |
| Claude 3.5 | 53-60% |
Long-form, single-author. Drops to 20-63% for hybrid/paraphrased content.
Abstract structure confuses classifiers. Fewer than 20% correctly identified.
Too short (<300 words). Professional email tone naturally mimics AI patterns.
Inherently structured, formal conventions make human/AI separation very difficult.
Text length is the single strongest predictor of detection reliability. Under 100 words: unreliable. 300-700 words: acceptable. 500+ words: trustworthy results.
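The length thresholds above translate directly into a gating function that a detection pipeline could apply before reporting any score. The band names and the cut between "acceptable" and "trustworthy" are one reading of the overlapping ranges in the text (300-700 acceptable, 500+ trustworthy), so treat the exact boundaries as an interpretation:

```python
def length_reliability(word_count):
    """Map text length to a reliability band before scoring.
    Bands follow the thresholds stated in this document; the
    100-300 'marginal' band is interpolated between them."""
    if word_count < 100:
        return "unreliable"
    if word_count < 300:
        return "marginal"
    if word_count < 500:
        return "acceptable"
    return "trustworthy"
```

A pipeline would refuse to emit a verdict-like score for "unreliable" inputs rather than report a meaningless number.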
“No AI detector is 100% reliable. They make probabilistic guesses, not deterministic judgments.”
— Arvind Narayanan, Princeton University
“We are opposed to the use of detectors in any sort of disciplinary or punitive context.”
— RAID Study Ethics Statement, ACL 2024
“Our tool is intended to assist educators — not to serve as a definitive judgment.”
— Turnitin, Official Position
“We do not believe that AI detection scores alone should be used for academic honesty purposes.”
— Originality.ai, Official Position
No single method is sufficient. Best commercial systems use 7+ components. Aggregating multiple detectors reduces false positives to near zero while maintaining reasonable true positive rates.
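A simple way to see why aggregation suppresses false positives: require several detectors to independently agree before flagging anything. The threshold and agreement count below are illustrative defaults, not values from any cited system:

```python
def ensemble_flag(scores, threshold=0.85, min_agree=3):
    """Flag text only when at least min_agree detector scores exceed a
    conservative threshold. Independent detectors rarely all err on the
    same human text, so requiring agreement drives false positives down
    while unmodified AI text still trips most detectors at once."""
    return sum(s >= threshold for s in scores) >= min_agree
```

Four detectors where three score above 0.85 yields a flag; a single outlier detector firing alone does not.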
Conservative thresholds (85-90%), 300-word minimums, domain-specific calibration, ESL awareness. Results should never be presented as verdicts — only as probability indicators requiring human judgment.
XLM-RoBERTa or mDeBERTa-v3-base as backbone, fine-tuned on Indonesian data. Indonesian-specific features: formality analysis, code-mixing detection, cultural reference assessment. M4 dataset provides 6,000 samples.
Paraphrasing attacks reduce detection by 88% on average. Paraphrase-aware training, retrieval-based defenses, and regular model updates are essential. There is no permanent solution, only an ongoing process.
Detection scores are probabilistic assessments, not proof. The lab-to-real gap is 15-25%. Short texts, creative writing, and non-English content have lower reliability. Human judgment is an irreplaceable component.