AI Boosted Test Scores 127%. Then Students Couldn't Think Without It.

Hands-on learning builds skills that survive beyond the AI assistant

The numbers seemed like a breakthrough: students using AI tutoring tools saw test scores rise by up to 127%. Schools celebrated. Headlines proclaimed the future of personalised learning had arrived.

Then researchers took the AI away.

In follow-up assessments without AI assistance, those same students scored 17% lower than peers who had never used AI tools at all. The OECD's new Digital Education Outlook 2026 report puts hard data behind what many educators have quietly suspected: when AI does the thinking, students may stop learning how to think.

The Study That Changed the Conversation

The OECD report highlights research conducted by the Wharton School at the University of Pennsylvania, following over 1,000 high school students in Türkiye across a full academic year. Students were divided into three groups:

  • Answer-based AI: A chatbot providing direct solutions to problems
  • Tutor-style AI: A chatbot offering step-by-step hints rather than answers
  • Control group: Traditional study methods without AI

During the AI-assisted period, results looked promising. Students using answer-based chatbots scored 48% higher than the control group. Those with tutor-style AI performed even better, with gains of up to 127%.

The reversal came when AI access was removed. Students who had relied on AI scored an average of 17% lower than those who never used it. Their ability to solve problems independently had weakened.

The OECD Calls It "False Mastery"

The report introduces a term that should concern every educator: false mastery. Students feel they understand concepts because they've seen polished AI explanations. Grades improve. Confidence rises. But the underlying cognitive work, the struggle that builds genuine understanding, has been outsourced.

As the OECD puts it: "The thinking happens elsewhere. What remains is a sense of understanding that collapses under pressure."

This isn't an argument against technology in education. It's a warning about the type of technology we deploy. Tools that do the thinking for students are fundamentally different from tools that help students think.

Why Struggle Matters in Science Education

Consider what happens when a student learns to titrate in a chemistry lab. They overshoot the endpoint. The solution turns too pink. They have to start again.

That failure is the learning.

The careful hand coordination required to control a burette. The visual attention needed to spot the colour change. The procedural memory built through repetition. None of this can be acquired by reading an AI's explanation of how titration works.

The OECD report includes a line that could serve as a manifesto for practical science education:

"The struggle, the confusion, and the slow progress are not flaws in education. They are the point."

Virtual Labs That Build Real Skills

At WhimsyLabs, this research validates what we've built into our platform from day one. Our virtual laboratories are designed around a simple principle: students must do the work.

When a student performs an experiment in WhimsyLabs:

  • They make decisions: Which reagents to use, how much, in what order
  • They make mistakes: Spill liquids, mix incorrectly, forget safety steps
  • They generate unique data: Our physics engine produces authentic results based on what they actually did, not predetermined outcomes
  • They interpret results: Drawing conclusions from their own experimental data, not copying AI-generated analysis

Our AI tutor, WhimsyCat, provides feedback and guidance, but never does the experiment for the student. There's no "show me the answer" button. The learning happens through doing.

AI-Proof Assessment by Design

Perhaps most importantly, WhimsyLabs' dynamic assessment approach makes using external AI tools for answering questions fundamentally ineffective. Here's why:

  • We grade the process, not just results: Our system tracks physical inputs within the virtual lab (equipment handling, reaction times, procedural accuracy) which AI cannot simulate or fake
  • Questions tied to unique data: Follow-up assessment questions are generated directly from each student's own experimental data, meaning generic AI-generated answers are useless
  • No two experiments are identical: Our physics engine introduces realistic variation (temperature perturbations, impurities, sample deviation) so every student's results are genuinely unique
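
The logic behind those bullets can be sketched in miniature. The snippet below is a hypothetical illustration only, not WhimsyLabs' actual engine: it seeds small per-student variation (impurity, temperature drift), simulates a titration endpoint, and then generates a follow-up question from that student's own numbers, so a generic AI answer cannot match.

```python
import random

def run_titration(student_id, acid_molarity=0.10, acid_volume_ml=25.0):
    """Simulate one student's titration with per-student variation.

    Hypothetical sketch: a real engine would be physics-based, but even
    this simple model shows why every student's results differ.
    """
    rng = random.Random(student_id)        # seed on the student for reproducibility
    impurity = rng.uniform(0.0, 0.05)      # fraction of inert impurity in the titrant
    temp_drift = rng.uniform(-0.02, 0.02)  # small temperature-driven deviation

    # Theoretical endpoint for a 1:1 monoprotic titration with 0.10 M NaOH
    theoretical_ml = acid_molarity * acid_volume_ml / 0.10
    # An impure titrant means more volume is needed to reach the endpoint
    measured_ml = theoretical_ml * (1 + impurity) * (1 + temp_drift)
    return theoretical_ml, round(measured_ml, 2)

def follow_up_question(theoretical_ml, measured_ml):
    """Generate an assessment question tied to this student's own data."""
    return (f"Your endpoint came at {measured_ml} mL, but theory predicts "
            f"{theoretical_ml} mL. What could explain the difference?")

theoretical, measured = run_titration(student_id=42)
print(follow_up_question(theoretical, measured))
```

Because the question embeds numbers only that student's run produced, pasting it into a chatbot yields, at best, a generic list of error sources rather than an explanation grounded in what actually happened at the bench.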

When a student asks ChatGPT "What pH did I measure?", the AI has no way to know. When asked "Why did your titration require more NaOH than the theoretical amount?", only the student who performed that specific experiment can answer meaningfully. This isn't a workaround for AI cheating; it's a fundamental reimagining of how assessment works.

The Difference Between Assistance and Replacement

Not all educational AI is problematic. The OECD's own research shows that well-designed AI tutoring, which provides hints rather than answers, can be genuinely beneficial. The key distinction is whether technology assists cognitive work or replaces it.

WhimsyLabs falls firmly in the assistance category:

  • We simulate reality: Students interact with physics-accurate equipment and materials
  • We provide feedback: WhimsyCat identifies where technique could improve, without doing the technique for the student
  • We enable practice: Unlimited attempts mean students can build genuine proficiency through repetition
  • We preserve struggle: Experiments can fail, and that failure is educational

What Schools Should Ask Before Adopting AI Tools

The OECD report prompts important questions for any school considering AI-enhanced learning tools:

  1. Does this tool require students to think, or does it think for them?
  2. What happens to learning outcomes when the tool is removed?
  3. Does this build skills that transfer to real-world contexts?
  4. Is there productive struggle, or just polished answers?

Virtual laboratories that provide scripted, click-through experiences fail these tests just as surely as AI chatbots that generate essay answers. The question isn't whether technology is involved. It's whether the student remains the one doing the cognitive work.

Preparing Students for a World With AI

Here's the irony: students will need to work alongside AI throughout their careers. But to use AI effectively, they need the foundational understanding to evaluate AI outputs, recognise errors, and know when human judgment is required.

You cannot critically assess an AI's chemistry analysis if you've never developed your own understanding of chemistry through hands-on practice. The OECD calls this the need for "hybrid human-AI skills": knowing when to use AI and when to step away.

Building those hybrid skills requires exactly what WhimsyLabs provides: authentic experiences that develop genuine understanding, which students can then apply whether or not AI tools are available.

The Path Forward

The OECD's findings shouldn't discourage technology use in education. They should sharpen our focus on the right technology. Tools that enhance human capability rather than replacing it. Platforms that preserve the productive struggle essential for deep learning.

In science education, that means virtual laboratories where students actually experiment, actually fail, and actually learn: the WhimsyLabs approach.

Grades are rising in AI-assisted classrooms. But as the research shows, grades aren't the same as learning.

Ready to see virtual labs that build real skills? Get in touch to experience the WhimsyLabs difference.

