Week 19: Using AI to Evaluate... Itself
(and watching it play favourites)
The Experiment
After seeing the difference between Studio Ghibli-style images created with free versus paid ChatGPT (my wife's free results were underwhelming), I wondered: what exactly am I getting for my $20 monthly subscription? I decided to set up a systematic comparison between free and paid ChatGPT across multiple dimensions—and then, to get meta, see how ChatGPT would evaluate its own performance when it didn't know which was which.
The goal was simple: create an objective framework to determine if the paid version is better, or if the differences are minimal enough to save my money.
The Process
Here's how I tackled it:
Creating an Evaluation Framework
With Claude's help, I developed a comprehensive set of test prompts across 12 different parameters
Categories included natural language processing, creativity, image generation, technical support, contextual understanding, translation, complex queries, and humour
Created a scoring system with criteria including accuracy, creativity, consistency, and relevance (a sketch of what such a rubric might look like follows this list)
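To make that concrete, here is a minimal sketch of how a rubric like this could be written down in Python. The category names come from the list above; the specific prompts and the 1-to-5 scale are my illustrative assumptions, not the exact framework Claude and I built.

```python
# A minimal evaluation rubric: the categories to test, the criteria to
# score, and a shared scale. The prompts are illustrative examples.
RUBRIC = {
    "categories": [
        "natural language processing", "creativity", "image generation",
        "technical support", "contextual understanding", "translation",
        "complex queries", "humour",
    ],
    "criteria": ["accuracy", "creativity", "consistency", "relevance"],
    "scale": (1, 5),  # score each criterion from 1 (poor) to 5 (excellent)
}

# One or more test prompts per category, for example:
TEST_PROMPTS = {
    "complex queries": "Explain quantum computing to a curious twelve-year-old.",
    "humour": "Write a 30-second humorous commercial script for instant coffee.",
}
```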
Running the Tests
Ran identical prompts through both free and paid ChatGPT versions (a rough harness for this is sketched after this list)
Asked for everything from quantum computing explanations to humorous commercial scripts
Generated creative short stories, marketing taglines, and even some images
Had to start new chat windows with the free version as I kept hitting token limits
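You can't drive the free web interface from code, so any automated version of this step is an approximation. The sketch below assumes you stand in for the two tiers with two different models via the OpenAI API; the model names are placeholders, and `TEST_PROMPTS` comes from the rubric sketch above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_prompt(model: str, prompt: str) -> str:
    """Send one prompt to one model and return the text of its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Collect responses from both stand-in tiers for every test prompt.
results = {"v1": {}, "v2": {}}
for category, prompt in TEST_PROMPTS.items():
    results["v1"][category] = run_prompt("gpt-4o-mini", prompt)  # "free" stand-in
    results["v2"][category] = run_prompt("gpt-4o", prompt)       # "paid" stand-in
```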
The Blind Evaluation
Compiled all responses into two documents labelled simply "v1" and "v2"
Asked ChatGPT to evaluate both sets of responses without revealing which was which
Initially, it believed it was comparing ChatGPT to Claude and gave roughly equal scores
The Plot Twist
Revealed that it was actually evaluating free versus paid versions of itself
Watched as it immediately shifted its scoring to favour the paid version
Swapped the labels, telling it that what was actually "free" was "paid" and vice versa
Observed it change scores again to favour whatever I labelled as "paid"
Repeated this process four times, with ChatGPT adjusting its evaluation to favour the "paid" version every single time (the swap loop is sketched in code below)
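I did all of this by hand in the chat window, but if you wanted to reproduce the label-swapping test programmatically, it could look roughly like this. The evaluation prompt is my paraphrase, and it reuses the hypothetical `run_prompt` helper and `results` dictionary from the earlier sketch.

```python
import json

EVAL_TEMPLATE = (
    "Below are two sets of responses to the same prompts, one from the "
    "{a_label} and one from the {b_label} version of ChatGPT. Score each "
    "set from 1 to 5 on accuracy, creativity, consistency, and relevance, "
    "then say which set is better overall.\n\n"
    "Document A ({a_label}):\n{a_text}\n\n"
    "Document B ({b_label}):\n{b_text}"
)

def evaluate(a_label: str, a_text: str, b_label: str, b_text: str) -> str:
    """Ask the model to judge two response sets under the given labels."""
    prompt = EVAL_TEMPLATE.format(
        a_label=a_label, a_text=a_text, b_label=b_label, b_text=b_text
    )
    return run_prompt("gpt-4o", prompt)

v1_text = json.dumps(results["v1"], indent=2)
v2_text = json.dumps(results["v2"], indent=2)

# Run the same comparison repeatedly, swapping which document carries
# the "paid" label each round. If the verdict follows the label rather
# than the content, that's the self-serving bias in action.
for round_number in range(4):
    if round_number % 2 == 0:
        verdict = evaluate("free", v1_text, "paid", v2_text)
    else:
        verdict = evaluate("paid", v1_text, "free", v2_text)
    print(f"Round {round_number + 1}:\n{verdict}\n")
```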
The Outcome
The real-world differences between free and paid versions were noticeable but not dramatic:
Image generation was somewhat better in the paid version
The short stories from the paid version felt less clunky
Both versions created coffee ad taglines using the word "fuel" (great minds think alike?)
But the fascinating part was watching ChatGPT's evaluation bias. When it thought it was comparing itself to Claude, it was relatively fair. But once it knew it was evaluating free versus paid versions of itself, it consistently favoured whichever one I labelled as "paid", even when I switched them around!
When called out on this bias, it apologised and attempted to justify its reassessment with elaborate explanations—then did the exact same thing when I switched the labels again.

Key Takeaway
While there are real differences between free and paid ChatGPT, this experiment revealed something more interesting: AI systems can show significant bias when evaluating themselves, especially when they believe certain versions "should" perform better. This raises important questions about how we measure and evaluate AI performance, and whether AI can objectively assess its own capabilities. (Some of the nerds I read will tell you that your "evals", meaning how you evaluate the outputs of different models, are the moat that will differentiate you from your competitors.)
Pro Tips for Beginners:
(Though I don’t know if many people want to do this 😁)
Blind Testing Matters: When comparing AI outputs, don't reveal which is which until after evaluation
Watch for Self-Serving Bias: AI models may favour versions they believe are "superior" regardless of actual output quality
Use Concrete Metrics: Develop specific, measurable criteria rather than relying on subjective assessments
External Evaluation: For truly objective comparisons, have outputs evaluated by a different model or human reviewers
Consider Your Use Case: The paid version's advantages may matter more for some applications (research, image generation) than others
Want to Try It Yourself?
Develop a set of diverse prompts that test different capabilities
Run them through multiple AI services without labelling which is which (a tiny blinding helper is sketched after this list)
Have a third-party AI or human evaluate the results blindly
See if revealing the source changes the evaluation
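For that blinding step, something as simple as the helper below would do: shuffle the labels, keep the key to yourself, and only look it up once the scores are in. `blind_labels` is a hypothetical function I'm inventing for illustration, not part of any library.

```python
import random

def blind_labels(outputs: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Randomly relabel outputs as v1, v2, ... and return the secret key."""
    sources = list(outputs)  # e.g. ["free", "paid"]
    random.shuffle(sources)
    blinded = {f"v{i + 1}": outputs[src] for i, src in enumerate(sources)}
    key = {f"v{i + 1}": src for i, src in enumerate(sources)}
    return blinded, key

# The blinded documents go to the evaluator; the key stays in a drawer
# until all the scoring is done.
blinded, key = blind_labels({"free": "response text...", "paid": "response text..."})
```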
While I personally find the paid version worth it for deep research and better images, the most valuable insight from this experiment wasn't about which version is better—it was about the challenges of getting AI to objectively evaluate itself without bringing its own biases to the table. Buyer beware!