I Found a Test That Breaks AI, and It’s Not What You Think

How a notoriously difficult Japanese language test reveals the surprising strengths and weaknesses of today’s top AI.

You’ve probably seen AI do some incredible things, from writing code to creating stunning art. But how do we really know how smart these models are? We have standardized benchmarks, sure, but I recently stumbled upon a fascinating idea: a Kanken AI Benchmark, a test so difficult it pushes even the most advanced models to their absolute limits. It’s not about math or logic; it’s about understanding one of the most complex writing systems in the world.

It all revolves around the Japan Kanji Aptitude Test, usually just called the Kanken. And believe me, it’s not your average language quiz.

So, What Exactly Is the Kanken?

If you’ve ever studied Japanese, you know that kanji are the complex characters borrowed from Chinese. There are thousands of them, and the Kanken is a test designed to measure one’s mastery of them. The levels range from 10 (the easiest, for elementary students) all the way up to 1, which is notoriously difficult even for native Japanese speakers.

The test doesn’t just ask for definitions. It requires you to read obscure words, understand their nuanced usage in literary contexts, and write them correctly from memory. It’s a deep dive into the history and artistry of the Japanese language. You can learn more about its structure on the official Kanken website. It’s this multi-layered complexity that makes it such a perfect, and brutal, test for an AI.

Why the Kanken AI Benchmark Is So Tough

So, what happens when you put a model like Gemini or ChatGPT up against a high-level Kanken test? It turns out to be an incredible stress test that challenges AI in two very distinct ways.

1. The Vision Challenge (OCR)

First, the AI has to see the text. The test is on paper, written in vertical columns, just as traditional Japanese is. The AI needs to use Optical Character Recognition (OCR) to even begin. This isn’t like reading a clean line of Times New Roman. We’re talking about intricate, multi-stroke characters, some of which are rare and look incredibly similar to others.

This is the first major hurdle. If the AI misreads a single character, the entire meaning of a word can change, and the question becomes impossible to answer correctly. It’s a massive bottleneck.
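To make that step concrete, here’s a minimal sketch of what the OCR stage might look like in Python, assuming Tesseract is installed along with its jpn_vert (vertical Japanese) model; the image file name is hypothetical, and a real benchmark harness would be more involved.

```python
# Minimal OCR sketch for a vertically typeset Japanese test page.
# Assumes Tesseract plus its jpn_vert traineddata; the file is hypothetical.
from PIL import Image
import pytesseract

image = Image.open("kanken_page.png")

# --psm 5 asks Tesseract to treat the input as a single uniform block
# of vertically aligned text, matching traditional Japanese layout.
text = pytesseract.image_to_string(image, lang="jpn_vert", config="--psm 5")
print(text)
```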

2. The Understanding Challenge

Let’s say the AI’s vision system works perfectly. It still has to understand the question. High-level Kanken questions use classical and literary Japanese, which can feel worlds away from the modern language used in everyday conversation or on the internet. The AI needs a deep, contextual grasp of history, literature, and idiomatic expressions to choose the correct kanji or reading. It’s one thing to know a character’s meaning; it’s another to know how it behaves in a sentence written a century ago.

The Surprising Results of the Kanken AI Benchmark

When one of these tests was run on several major AI models, the results were pretty eye-opening.

When the models were given only the transcribed text (skipping the vision step entirely), they did okay: Gemini and Claude each scored 15 out of 20, showing a solid grasp of the language itself.

But when they had to read the questions from an image and understand them? The scores plummeted. Every model except one scored a flat zero. They couldn’t get past the vision challenge. The only one that could handle both tasks was Gemini, and even it only managed to score an 8 out of 20.
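If you wanted to reproduce this kind of scoring yourself, a toy grading harness might look like the sketch below. The questions list and the model_answer callable are hypothetical stand-ins, not any lab’s actual evaluation code; the two sample words come from the test discussed later in this post.

```python
# Toy grading harness for a 20-question Kanken-style quiz.
# `model_answer` is a hypothetical callable wrapping whichever model you test.

def grade(questions, model_answer):
    """Count exact-match answers under a simple all-or-nothing rubric."""
    correct = 0
    for prompt, expected in questions:
        if model_answer(prompt).strip() == expected:
            correct += 1
    return correct

# Example items (hypothetical prompt wording, real words from the sample test):
sample_questions = [
    ("Give the hiragana reading of 憂鬱.", "ゆううつ"),
    ("Write the kanji for ドヨメキ as used in the sentence.", "響めき"),
]
# grade(sample_questions, some_model_fn) would return 0, 1, or 2.
```

Exact match is deliberately unforgiving, which is in the spirit of the Kanken itself: a near-miss kanji is still wrong.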

This tells us something huge: AI is still struggling with the fundamental task of reading complex, real-world text accurately. The technology behind OCR has come a long way, but this benchmark shows it still has a long way to go.

Putting It to the Test Myself

Curious, I wanted to see it in action. I found a sample test page filled with 20 questions—10 asking for the hiragana reading of an underlined kanji word, and 10 asking for the correct kanji for an underlined katakana word.
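Before looking at the answers, it helps to pin down those two question types in data. Here’s one hypothetical way to model them; the sentence fields are placeholders, not the actual test sentences, and each item could be flattened into the (prompt, expected) pairs the toy grader above consumes.

```python
# One hypothetical way to represent the two Kanken question types.
from dataclasses import dataclass

@dataclass
class KankenQuestion:
    sentence: str  # full sentence containing the underlined word (placeholder here)
    target: str    # the underlined word as printed: kanji or katakana
    answer: str    # expected hiragana reading, or expected kanji
    kind: str      # "reading" (kanji -> hiragana) or "writing" (katakana -> kanji)

questions = [
    KankenQuestion("...", "憂鬱", "ゆううつ", "reading"),
    KankenQuestion("...", "ドヨメキ", "響めき", "writing"),
]
```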

The AI’s performance was impressive. It correctly identified and answered all 20 questions.

For example, it had to:
* Read complex words like 憂鬱 (yuuutsu, melancholy) and 枯淡 (kotan, refined simplicity).
* Correctly write the kanji for words written phonetically, like turning doyomeki into 響めき (to stir or resound) and hirugaeshi into 翻し (to flip or flutter).

To do this, it had to successfully OCR the vertical text, understand the literary context of the sentences, and draw on a massive well of linguistic knowledge. It was a clear demonstration of just how powerful these models can be, even if they aren’t perfect yet.

This whole experiment shows that while AI is getting scarily smart, we can still find its pressure points. The Kanken AI Benchmark is a beautiful example of how the richness and complexity of human culture—in this case, language—provides the ultimate challenge. It’s a reminder that true intelligence isn’t just about processing data; it’s about seeing, reading, and understanding nuance. And for now, that’s a test where humans still have the edge.