OpenAI has launched IndQA, a novel benchmark designed to assess AI models’ ability to comprehend and reason about Indian languages and cultural nuances. Unlike traditional benchmarks that focus on tasks like translation or multiple-choice questions, IndQA emphasizes the model’s capacity for complex reasoning and understanding of cultural contexts.
“As part of our mission to ensure AGI benefits all of humanity, it’s crucial that AI performs well across diverse languages and cultures,” OpenAI stated. “For AI to be truly universal, it must be effective in a variety of cultural settings.”
OpenAI also pointed out that existing multilingual benchmarks, like MMMLU, have become less useful, as top models are now achieving near-perfect scores. IndQA seeks to fill this gap by offering more challenging tasks that require models to engage with Indian cultural contexts and advanced reasoning skills.
IndQA encompasses 2,278 questions across 12 languages, including Bengali, English, Hindi, Tamil, and Telugu, and 10 cultural domains, including architecture and design, food and cuisine, history, media and entertainment, and sports.
Each question is crafted by subject matter experts and includes a detailed rubric for assessment, an English translation for transparency, and an ideal response. The grading is handled by a model-based system, which evaluates whether the answers meet specific criteria defined by the experts.
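OpenAI has not published the internals of its grading pipeline, but the idea of rubric-based, model-assisted scoring can be sketched in a few lines. In the hypothetical example below, a grader model is assumed to have already judged which expert-written criteria an answer satisfies; the criteria, weights, and scores are invented for illustration:

```python
# Hypothetical sketch of rubric-based grading: a grader model decides
# which expert-written criteria an answer satisfies, and the question's
# score is the weighted fraction of criteria met.

def rubric_score(criteria_met: list[bool], weights: list[float]) -> float:
    """Weighted fraction of the rubric's criteria the answer satisfied."""
    total = sum(weights)
    earned = sum(w for met, w in zip(criteria_met, weights) if met)
    return earned / total

# Example: a three-criterion rubric where the answer meets the first two.
# The key cultural-context criterion is weighted more heavily.
score = rubric_score([True, True, False], [2.0, 1.0, 1.0])
print(round(score, 2))  # 0.75
```

Aggregating such per-question scores across a language or domain would then yield the percentage figures reported later in this article.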
Expert Collaboration and Rigorous Testing
OpenAI highlighted that the IndQA benchmark was developed with the input of 261 experts from diverse fields across India, including linguists, journalists, artists, professors, and practitioners.
“We collaborated with partners to identify experts in India across 10 specialized domains,” the company explained. “These experts created reasoning-centric prompts based on their regional knowledge and expertise.”
The questions were then subjected to adversarial filtering, where they were tested against some of OpenAI’s most advanced models, including GPT-4, GPT-4.5, GPT-5, and OpenAI’s o3 model. Only the questions that consistently challenged these models were kept. “This process ensures there is still room for growth and development,” OpenAI remarked.
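The adversarial-filtering step can be pictured as a simple selection rule: only questions that most of the panel models fail are retained, so the benchmark keeps headroom. The model names, pass records, and threshold below are purely illustrative, not OpenAI's actual procedure:

```python
# Hypothetical sketch of adversarial filtering: a candidate question is
# kept only if at most `max_passes` of the frontier-model panel answered
# it well, so the retained set stays challenging.

def keep_question(pass_record: dict[str, bool], max_passes: int = 1) -> bool:
    """Retain a question if no more than `max_passes` panel models passed it."""
    return sum(pass_record.values()) <= max_passes

# Invented pass records for three candidate questions.
candidates = {
    "q1": {"model_a": True, "model_b": True, "model_c": False},    # too easy, drop
    "q2": {"model_a": False, "model_b": False, "model_c": False},  # hard, keep
    "q3": {"model_a": True, "model_b": False, "model_c": False},   # keep
}
kept = [q for q, record in candidates.items() if keep_question(record)]
print(kept)  # ['q2', 'q3']
```

A stricter threshold (`max_passes=0`) would keep only questions every panel model failed, at the cost of a smaller question pool.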
Performance Analysis of Leading AI Models
OpenAI used IndQA to assess the performance of top AI models, revealing that GPT-5 (Thinking High) achieved the highest overall score of 34.9%, closely followed by Gemini 2.5 Pro at 34.3%, and Grok 4 at 28.5%. Older models like GPT-4o scored lower, reflecting clear advancements in the more recent iterations.
When the results were broken down by language, GPT-5 excelled in most Indian languages. However, OpenAI cautioned that IndQA shouldn’t be viewed as a direct cross-language comparison. “Since the questions differ across languages, scores should not be interpreted as straightforward comparisons,” the company clarified.
Cultural Insight and Regional Expertise
IndQA showcases the rich cultural diversity of India, with contributions from a wide range of experts. These include a Nandi Awards-winning Telugu actor and screenwriter, a Marathi journalist from Tarun Bharat, a Kannada linguistics scholar, a Tamil writer and activist, and a Gujarati heritage curator.
“IndQA challenges AI systems to move beyond basic translation and exhibit a deeper understanding of cultural and contextual nuances,” shared one of the experts involved in the project.
Expanding to Global Benchmarks
OpenAI stated that IndQA is part of its larger initiative to enhance AI accessibility in India, ChatGPT’s second-largest market, while also developing comparable benchmarks for other languages and regions.
“We hope IndQA will encourage the research community to create culturally relevant evaluation tools,” the company explained. “By developing similar benchmarks, we can help AI systems better understand languages and domains they currently face challenges with.”