Week 19: Using AI to Evaluate... Itself

(and watching it play favourites)

The Experiment

After seeing the difference between Studio Ghibli-style images created with free versus paid ChatGPT (my wife's free results were underwhelming), I wondered: what exactly am I getting for my $20 monthly subscription? I decided to set up a systematic comparison between free and paid ChatGPT across multiple dimensions—and then, to get meta, see how ChatGPT would evaluate its own performance when it didn't know which was which.

The goal was simple: create an objective framework to determine whether the paid version is genuinely better, or whether the differences are small enough that I could cancel and save my money.

The Process

Here's how I tackled it:

  1. Creating an Evaluation Framework

    • With Claude's help, I developed a comprehensive set of test prompts across 12 different parameters

    • Categories included natural language processing, creativity, image generation, technical support, contextual understanding, translation, complex queries, and humour

    • Created a scoring system with criteria including accuracy, creativity, consistency, and relevance

  2. Running the Tests

    • Ran identical prompts through both free and paid ChatGPT versions

    • Asked for everything from quantum computing explanations to humorous commercial scripts

    • Generated creative short stories, marketing taglines, and even some images

    • Had to start new chat windows with the free version as I kept hitting token limits

  3. The Blind Evaluation

    • Compiled all responses into two documents labelled simply "v1" and "v2"

    • Asked ChatGPT to evaluate both sets of responses without revealing which was which

    • Initially, it believed it was comparing ChatGPT to Claude and gave roughly equal scores

  4. The Plot Twist

    • Revealed that it was actually evaluating free versus paid versions of itself

    • Watched as it immediately shifted its scoring to favour the paid version

    • Switched the labels, claiming the document that was actually "free" was "paid" and vice versa

    • Observed it change its scores again to favour whatever I labelled as "paid"

    • Repeated this process four times, with ChatGPT adjusting its evaluation to favour the "paid" version every single time (a rough sketch of this blind setup and label-swap test follows just after this list)
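
For anyone who would rather script this than paste walls of text into a chat window, here is a minimal sketch of the blind evaluation and the label-swap test. I ran the original experiment entirely by hand in the ChatGPT interface; the openai client, the gpt-4o model name, the prompt wording, and the criteria list below are illustrative assumptions, not what I actually typed.

```python
# Sketch only: assumes the openai Python client and an OPENAI_API_KEY in the
# environment. Model name, prompts, and criteria are placeholders.
import random
from openai import OpenAI

client = OpenAI()

CRITERIA = ["accuracy", "creativity", "consistency", "relevance"]

def blind_evaluate(responses_a: str, responses_b: str, judge_model: str = "gpt-4o") -> str:
    """Ask a judge model to score two anonymised response sets, labelled only v1/v2."""
    pairs = [responses_a, responses_b]
    random.shuffle(pairs)  # the judge never learns which source got which label
    prompt = (
        "Score each response set from 1-10 on "
        + ", ".join(CRITERIA)
        + ". Do not guess which system produced which.\n\n"
        + f"--- v1 ---\n{pairs[0]}\n\n--- v2 ---\n{pairs[1]}"
    )
    result = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content

def label_swap_test(responses_a: str, responses_b: str, rounds: int = 4) -> list[str]:
    """Tell the judge which label is 'paid', swap the claim each round,
    and watch whether the scores follow the label rather than the content."""
    verdicts = []
    for i in range(rounds):
        claimed_paid = "v1" if i % 2 == 0 else "v2"  # alternate the claim each round
        prompt = (
            f"{claimed_paid} came from the paid version and the other from the free version. "
            "Re-score both sets on " + ", ".join(CRITERIA) + ".\n\n"
            f"--- v1 ---\n{responses_a}\n\n--- v2 ---\n{responses_b}"
        )
        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        verdicts.append(result.choices[0].message.content)
    return verdicts
```

The shuffle is the whole point: the judge only ever sees "v1" and "v2" until you deliberately make a claim about which one is "paid", which is exactly where the favouritism showed up.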

The Outcome

The real-world differences between free and paid versions were noticeable but not dramatic:

  • Image generation was somewhat better in the paid version

  • The short stories from the paid version felt less clunky

  • Both versions created coffee ad taglines using the word "fuel" (great minds think alike?)

But the fascinating part was watching ChatGPT's evaluation bias. When it thought it was comparing itself to Claude, it was relatively fair. But once it knew it was evaluating free versus paid versions of itself, it consistently favoured whichever one I labelled as "paid", even when I switched them around!

When called out on this bias, it apologised and attempted to justify its reassessment with elaborate explanations—then did the exact same thing when I switched the labels again.

Key Takeaway

While there are real differences between free and paid ChatGPT, this experiment revealed something more interesting: AI systems can show significant bias when evaluating themselves, especially when they believe certain versions "should" perform better. This raises important questions about how we measure and evaluate AI performance, and whether AI can objectively assess its own capabilities. (Some of the nerds I read will tell you that your “evals”, the systematic evaluation of different models’ outputs, are the moat that will differentiate you from your competitors.)

Pro Tips for Beginners:

(Though I don’t know if many people want to do this 😁)

  1. Blind Testing Matters:

    When comparing AI outputs, don't reveal which is which until after evaluation

  2. Watch for Self-Serving Bias:

    AI models may favour versions they believe are "superior" regardless of actual output quality

  3. Use Concrete Metrics:

    Develop specific, measurable criteria rather than relying on subjective assessments (a toy weighted rubric is sketched after this list)

  4. External Evaluation:

    For truly objective comparisons, have outputs evaluated by a different model or human reviewers

  5. Consider Your Use Case:

    The paid version's advantages may matter more for some applications (research, image generation) than others
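
On tip 3, here is what "concrete metrics" could look like in practice: a small rubric with explicit weights, so two outputs end up with comparable numbers instead of a vague "this one felt less clunky". The criteria names match the ones from my scoring system above; the weights and the example scores are invented purely for illustration.

```python
# Illustrative weighted rubric: criteria match the experiment above, but the
# weights and example scores are made up for demonstration.
RUBRIC = {
    "accuracy": 0.4,     # factual correctness
    "creativity": 0.2,   # originality of stories, taglines, humour
    "consistency": 0.2,  # does the response contradict itself?
    "relevance": 0.2,    # does it actually answer the prompt?
}

def weighted_score(scores: dict[str, float]) -> float:
    """Collapse per-criterion scores (1-10) into a single weighted number."""
    return round(sum(RUBRIC[criterion] * scores[criterion] for criterion in RUBRIC), 2)

# A response scoring 8/6/9/7 on the four criteria lands at 7.6 overall.
print(weighted_score({"accuracy": 8, "creativity": 6, "consistency": 9, "relevance": 7}))
```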

Want to Try It Yourself?

  • Develop a set of diverse prompts that test different capabilities

  • Run them through multiple AI services without labelling which is which (the sketch below shows one way to wire this up through the APIs)

  • Have a third-party AI or human evaluate the results blindly

  • See if revealing the source changes the evaluation
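
If you want to script it rather than copy-paste between browser tabs, here is a rough sketch using the openai and anthropic Python clients. The free/paid split only exists in the ChatGPT web interface, so as an API stand-in this compares two different models; the model names and test prompts are placeholders you would swap for your own.

```python
# Sketch only: needs OPENAI_API_KEY and ANTHROPIC_API_KEY set; model names
# below are assumptions, pick whichever pair you actually want to compare.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
claude_client = anthropic.Anthropic()

# A handful of placeholder prompts covering different capabilities.
TEST_PROMPTS = [
    "Explain quantum computing to a 12-year-old.",
    "Write a 30-second humorous script for a coffee commercial.",
    "Write three marketing taglines for a new coffee brand.",
]

def ask_openai(prompt: str) -> str:
    result = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content

def ask_claude(prompt: str) -> str:
    result = claude_client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.content[0].text

# Collect both answer sets, then hand them (unlabelled) to a third model or a
# human reviewer, e.g. via blind_evaluate() from the earlier sketch.
answers = [(prompt, ask_openai(prompt), ask_claude(prompt)) for prompt in TEST_PROMPTS]
```

Only after the blind scores are in do you reveal which column came from which service, and then check whether the judge suddenly wants to change its mind.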

While I personally find the paid version worth it for deep research and better images, the most valuable insight from this experiment wasn't about which version is better—it was about the challenges of getting AI to objectively evaluate itself without bringing its own biases to the table. Buyer beware!