There are many generative AI tools, but it’s fair to say that ChatGPT and Gemini are among the most well-known. Here’s how they compare.
There are tens of thousands of different AI products out there, although most of us have only heard of a handful of them. Comparing two of the biggest AI systems, ChatGPT and Gemini, isn’t a straightforward undertaking. For one thing, things can change overnight. Back in December 2025, people were speculating on whether OpenAI was losing the AI arms race; then, a couple of days later, it released ChatGPT-5.2 and started topping the leaderboards again.
So how can you tell which AI performs better? A few years ago, we could have run some side-by-side comparisons. Earlier generations of AI large language models (LLMs) could be quite noticeably different from one another. But the gaps are closing fast, especially when you’re talking about big-name brands like OpenAI and Google. Although you’ll still find recent articles where someone has put a single prompt into both systems and ranked which response they prefer, this method is hopelessly flawed. For one thing, LLM outputs are “stochastic,” meaning that responses include an element of randomness, so the same prompt can produce different responses. Also, there’s very little that ChatGPT and Gemini can’t do these days. Any preference between responses would really be a preference for one chatbot’s style, and that only reflects its out-of-the-box personality. A chatbot’s tone and conversational style can be customized to suit your preferences.
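To see why identical prompts can yield different answers, here’s a minimal Python sketch of temperature-based sampling, the standard way LLMs pick their next word. The tokens and scores are toy values invented for illustration, not output from any real model or API:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities.
    Higher temperature flattens the distribution (more randomness)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature=1.0, rng=random):
    """Pick one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Toy next-token candidates for the prompt "The sky is ..."
tokens = ["blue", "clear", "falling"]
logits = [2.0, 1.5, 0.1]

# Two runs with different random states can pick different tokens,
# even though the prompt and the scores are identical.
print(sample_token(tokens, logits, rng=random.Random(1)))
print(sample_token(tokens, logits, rng=random.Random(7)))
```

Because the final choice is a weighted draw rather than always taking the top-scoring token, a single prompt fed to each chatbot once tells you almost nothing; you’d need many trials to compare them fairly.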
So, given that we’re not going to undertake multiple trials using blind evaluations and aggregated results, we shall leave the rankings to the experts. There are a variety of benchmarks that test AI systems on things like reasoning, logic, and problem-solving. We’ll cover three of the significant ones where ChatGPT performs well. There’s an explanation of how we chose which benchmarks to include at the end of this article.

Answer difficult Google-proof science questions
The first benchmark we’ll look at is GPQA Diamond, which is designed to test PhD-level reasoning in physics, chemistry, and biology. GPQA stands for Google-Proof Questions and Answers. There’s a standard test and a ‘Diamond’ subset with particularly difficult questions. Being Google-proof means these aren’t just questions with one simple answer you can look up; they require complex reasoning skills.
To answer correctly, an AI would need to apply multiple scientific concepts, resist making assumptions or taking shortcuts, and ignore red herrings.