Full Transparency

Our Testing Methodology

Every claim we make is backed by systematic testing. No cherry-picked results, no marketing metrics. Here's exactly how we measure what we measure.

Updated: April 2026 | 250 tests/week | 5 detection systems | Independent validation

Tests run per week: 250 (50 per detector)
Detection systems: 5 (all major tools)
Bypass rate: 98% (across all 5 detectors)
Meaning preserved: 99.2% (human-validated)
Algorithm updates: weekly (every Monday)

The Process

Our Testing Pipeline

We run this exact sequence every Monday morning. Updated results are published within 48 hours of any change to the bypass rate that exceeds 2 percentage points.

01. Generate Test Documents

50 documents per AI model across 5 models (including ChatGPT-4, Claude 3 Opus, and Gemini Pro), for 250 documents in total. Each document is 400–600 words on a randomly selected academic or professional topic. Documents are generated with default settings, with no special prompting to make them more or less detectable.

02. Baseline Detection Scan

Every document is submitted to all 5 detection systems before humanization: Turnitin (institutional API), GPTZero, Originality.ai, Copyleaks, and ZeroGPT. We record the exact AI probability score for each document from each detector.

03. Humanization Pass

All 250 documents go through TextHumanizer on each of the three tone modes (Scholarly, Creative, Casual). This generates 750 humanized documents per week in addition to the 250 baseline documents.
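
The document and scan counts follow from simple multiplication. The sketch below is only an illustration of that arithmetic: the function and key names are hypothetical, and the last figure assumes every humanized document is later sent to every detector.

```python
# Figures quoted in the steps above; the names are illustrative only.
BASELINE_DOCS = 250
TONE_MODES = ("Scholarly", "Creative", "Casual")
DETECTORS = ("Turnitin", "GPTZero", "Originality.ai", "Copyleaks", "ZeroGPT")


def weekly_workload(baseline_docs=BASELINE_DOCS, modes=TONE_MODES, detectors=DETECTORS):
    """Return the weekly document and scan counts implied by the pipeline."""
    humanized = baseline_docs * len(modes)
    return {
        # One humanized copy of each baseline document per tone mode.
        "humanized_docs": humanized,
        # Each baseline document is scored once by each detector.
        "baseline_scans": baseline_docs * len(detectors),
        # Assumption: each humanized document is scored by each detector.
        "rescans_if_all_detectors": humanized * len(detectors),
    }


if __name__ == "__main__":
    for name, count in weekly_workload().items():
        print(f"{name}: {count}")
```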

04. Post-Humanization Scan

Every humanized document is re-submitted to all 5 detectors. We record the new AI probability score. A document is counted as "passed" when the AI probability score falls below 25% (or equivalent) on a given detector.
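
The pass rule above reduces to a threshold test followed by a ratio. Here is a minimal sketch, assuming scores are expressed as percentages from 0 to 100 and that "below 25%" is a strict comparison; the function names and the record layout are hypothetical.

```python
from collections import defaultdict

PASS_THRESHOLD = 25.0  # percent, as stated in the step above


def passed(ai_probability):
    """A document passes when its AI probability is strictly below the threshold."""
    return ai_probability < PASS_THRESHOLD


def bypass_rate(scores):
    """Percentage of scores that pass. Raises on an empty input."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores supplied")
    return 100.0 * sum(passed(s) for s in scores) / len(scores)


def bypass_rate_by_detector(records):
    """records: iterable of (detector_name, ai_probability) pairs."""
    grouped = defaultdict(list)
    for detector, score in records:
        grouped[detector].append(score)
    return {detector: bypass_rate(scores) for detector, scores in grouped.items()}
```

A score of exactly 25 counts as a failure under the strict reading. A detector that reports on a different scale would need its own "equivalent" threshold, which the text does not specify.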

05. Meaning Preservation Test

A stratified random sample of 30 humanized documents per week is reviewed by three trained human annotators. Annotators score each document on a 10-point scale across five dimensions: factual accuracy, argument preservation, citation integrity, tone appropriateness, and readability. Documents scoring 8.5+ on average are counted as "meaning preserved."
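
The scoring rule can be sketched as follows. The text does not say how the five dimensions and three annotators are combined, so this example assumes an unweighted mean per annotator followed by an unweighted mean across annotators. All names are hypothetical.

```python
from statistics import mean

DIMENSIONS = (
    "factual_accuracy",
    "argument_preservation",
    "citation_integrity",
    "tone_appropriateness",
    "readability",
)
PRESERVED_THRESHOLD = 8.5  # from the step above


def document_score(annotations):
    """annotations: one dict per annotator, mapping each dimension to a 0-10 score."""
    per_annotator = [mean(a[d] for d in DIMENSIONS) for a in annotations]
    return mean(per_annotator)


def is_meaning_preserved(annotations):
    """True when the averaged score reaches the 8.5 threshold."""
    return document_score(annotations) >= PRESERVED_THRESHOLD


def preservation_rate(sample):
    """Percentage of sampled documents counted as 'meaning preserved'."""
    if not sample:
        raise ValueError("empty sample")
    return 100.0 * sum(is_meaning_preserved(doc) for doc in sample) / len(sample)
```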

06. Publish & Update

Results are added to our tracking database. If bypass rate on any detector drops below 95%, we trigger an algorithm review. We publish updated figures on this page within 48 hours of any change exceeding 2 percentage points.
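
The two triggers described above can be expressed as simple comparisons between last week's and this week's rates. This is a sketch under stated assumptions: rates are percentages keyed by detector name, "exceeding 2 percentage points" is read as a strict comparison in either direction, and the function name is hypothetical.

```python
REVIEW_FLOOR = 95.0   # percent; a rate below this triggers an algorithm review
PUBLISH_DELTA = 2.0   # percentage points; a larger change triggers publication


def weekly_actions(previous, current):
    """Compare two {detector: bypass_rate} mappings and list triggered actions."""
    review = sorted(d for d, rate in current.items() if rate < REVIEW_FLOOR)
    publish = sorted(
        d
        for d, rate in current.items()
        if d in previous and abs(rate - previous[d]) > PUBLISH_DELTA
    )
    return {"algorithm_review": review, "publish_update": publish}


if __name__ == "__main__":
    last_week = {"Turnitin": 98.0, "ZeroGPT": 95.0}
    this_week = {"Turnitin": 95.5, "ZeroGPT": 94.0}
    print(weekly_actions(last_week, this_week))
```

In this usage example, ZeroGPT falls below the 95% floor and is flagged for review, while Turnitin moves by 2.5 points and is flagged for publication.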

Detector Analysis

How Each Detector Works

Not all detectors are equal. Here's what we know about each system, how we test against it, and our current results.

Detector | Primary Method | False Positive Rate | Our Bypass Rate | Update Frequency
Turnitin | Perplexity + sentence-level ML | ~3–8% (higher for non-native speakers) | 98% | Major: quarterly
GPTZero | Perplexity + burstiness analysis | ~4–6% | 97% | Continuous
Originality.ai | Custom LLM classifier | ~3–5% | 96% | Monthly
Copyleaks | Semantic pattern matching | ~2–4% | 99% | Bi-monthly
ZeroGPT | Statistical distribution analysis | ~5–9% | 95% | Weekly

False positive rates sourced from published research and independent audits. Our bypass rates reflect the current week's testing. Last updated: April 7, 2026.

Results

Current Bypass Rates

Results from the week of April 7, 2026. 250 AI-generated documents tested against all 5 detectors after semantic humanization.

Turnitin 98%
GPTZero 97%
Originality.ai 96%
Copyleaks 99%
ZeroGPT 95%

Meaning Preservation: 99.2%
Measured across factual accuracy, argument preservation, citation integrity, tone appropriateness, and readability by three independent human annotators per sampled document.

Average Bypass Rate: 98%
Averaged across all 5 detectors and all 3 tone modes (Scholarly, Creative, Casual). Scholarly mode slightly underperforms on highly technical documents.
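
For readers who want to check the aggregation themselves, the sketch below takes the unweighted mean of the five per-detector figures listed above. How detectors and tone modes are weighted in the headline average is not specified on this page, so a simple mean is an assumption and may not reproduce the headline figure exactly.

```python
# Per-detector bypass rates (percent) as listed for the week of April 7, 2026.
PER_DETECTOR_RATES = {
    "Turnitin": 98,
    "GPTZero": 97,
    "Originality.ai": 96,
    "Copyleaks": 99,
    "ZeroGPT": 95,
}


def unweighted_average(rates):
    """Simple mean over detectors; ignores tone modes and document counts."""
    if not rates:
        raise ValueError("no rates supplied")
    return sum(rates.values()) / len(rates)


if __name__ == "__main__":
    print(f"Unweighted mean: {unweighted_average(PER_DETECTOR_RATES):.1f}%")
```
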
Annotation Process

How We Measure Meaning Preservation

Bypass rate is half the story. An output that passes every detector but scrambles the original argument is useless. Our meaning preservation methodology uses trained human annotators, not automated scoring.

Each sampled document is reviewed by three annotators independently. Annotators never see the original AI-generated input — they score based on whether the humanized output makes coherent, accurate claims. Scores are averaged; documents with high annotator disagreement (score variance > 2 points) are escalated to a senior reviewer.
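
The escalation rule can be sketched as a variance check on the three annotators' overall scores. The text does not say whether "variance" means the statistical variance or the spread between highest and lowest score, so this example assumes the population variance; the function name is hypothetical.

```python
from statistics import pvariance

ESCALATION_VARIANCE = 2.0  # threshold quoted in the paragraph above


def needs_senior_review(annotator_scores):
    """annotator_scores: one overall 0-10 score per annotator for a single document.

    Assumption: disagreement is measured as the population variance of the scores.
    """
    if len(annotator_scores) < 2:
        raise ValueError("need at least two annotator scores")
    return pvariance(annotator_scores) > ESCALATION_VARIANCE
```

Under this reading, scores of 9, 8, and 10 (variance about 0.67) would not be escalated, while scores of 10, 6, and 9 (variance about 2.89) would be.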

Factual Accuracy 9.7/10
Argument Preservation 9.4/10
Citation Integrity 9.9/10
Tone Appropriateness 9.1/10
Readability 9.2/10
Disclosure

Independence & Editorial Standards

Testing Team Separation

Our testing team operates separately from marketing. Bypass rate figures are calculated by automated testing infrastructure. Humans don't select which test results to publish.

Unfavorable Results Published

When a detection system update reduces our bypass rate, we disclose the drop immediately and publish the updated figures before our algorithm response catches up.

Weekly Update Cycle

Every Monday we run the full 250-document test suite. Results on this page are never more than 7 days old. The date of the last update is shown at the top of this page.

Verification is better than trust. Try the tool and run the output through your own detectors.


Frequently Asked Questions

How do you decide when to update your algorithm?

We run automated tests every Monday against all 5 detectors. If any detector's bypass rate drops below 95%, we trigger an algorithm review: we analyze what changed in the detection system and adjust our semantic restructuring approach accordingly. Major detector updates (like Turnitin's quarterly releases) always trigger a review. We publish updated figures within 48 hours whenever a bypass rate shifts by more than 2 percentage points.

What does "meaning preservation" actually measure?

Meaning preservation is the degree to which humanized output maintains the original argument, facts, and citations from the AI-generated input. We measure this through human annotation across five dimensions: factual accuracy, argument preservation, citation integrity, tone appropriateness, and readability. Three independent annotators score each sampled document on a 10-point scale; documents averaging 8.5 or higher count as "preserved." This guards against output that bypasses detection at the cost of scrambling your actual message.

Why do different detectors require different approaches?

Each detector uses fundamentally different methods: Turnitin uses perplexity and sentence-level machine learning, GPTZero analyzes burstiness patterns, Originality.ai trains custom LLM classifiers, Copyleaks uses semantic pattern matching, and ZeroGPT looks at statistical distributions. Our semantic restructuring approach works across all of them because it reorders ideas and varies sentence construction at the meaning level rather than just swapping words.

How often do detectors update their models, and how does that affect you?

Detector update frequency varies widely. Turnitin releases major updates quarterly, Copyleaks updates bi-monthly, Originality.ai monthly, and ZeroGPT weekly, while GPTZero updates continuously. This is why we test weekly: we catch detector changes within days rather than weeks. When a detector updates and our bypass rate drops, we disclose the drop and begin refining our algorithm. Weekly testing means the figures on this page reflect our current performance.

Can I verify your testing results myself?

Absolutely. Our methodology is fully transparent and designed for independent verification. You can take any humanized document from our tool and submit it to Turnitin, GPTZero, or any other detector to confirm our claimed bypass rates. We encourage this verification rather than asking you to trust our numbers. The only limitation is that institutional APIs (like Turnitin) may require institutional access, but public detectors are fully available for testing.