Researchers develop cybersecurity test for AI being used by Google

Rochester Institute of Technology experts have created a new tool that tests artificial intelligence (AI) to see how much it really knows about cybersecurity. And the AI will be graded.

The tool, called CTIBench, is a suite of benchmarking tasks and datasets used to assess large language models (LLMs) in Cyber Threat Intelligence (CTI). CTI is a crucial security process that enables security teams to proactively defend against evolving cyber threats.

The evaluation tool comes at a time when AI assistants claim to have security knowledge and companies are developing cybersecurity-specific LLMs. For example, Microsoft Copilot has an integrated security platform.

Until now, there has been no way to tell if an LLM has the capability to work as a security assistant.

Nidhi Rastogi, assistant professor in the Department of Software Engineering

“Is the LLM reliable and trustworthy?” asked Nidhi Rastogi, assistant professor in RIT’s Department of Software Engineering. “Can I ask it a question and expect a good answer? Will it hallucinate?”

CTIBench is the first and most comprehensive benchmark in the Cyber Threat Intelligence space. The tool is already being used by Google, Cisco, and Trend Micro.

“We should embrace using AI, but there should always be a human in the loop,” said Rastogi. “That’s why we are creating benchmarks—to see what these models are good at and what their capabilities are. We’re not blindly following AI but smartly integrating it into our lives.”

In her AI4Sec Research Lab, Rastogi works at the intersection of cybersecurity and AI. She developed CTIBench along with computing and information sciences Ph.D. students Md Tanvirul Alam, Dipkamal Bhusal, and Le Nguyen.

The RIT team began by working on SECURE, a benchmark focused on evaluating LLMs in the context of industrial control systems. A paper on SECURE was later accepted to the 2024 Annual Computer Security Applications Conference.

“That experience made us realize how critical it is to evaluate LLMs in other high-stakes domains,” said Bhusal. “Since there was no reliable benchmark for CTI, we felt it was the right time to build one.”

CTIBench is like a test of how much an LLM knows. Across its five benchmark tasks, the AI completes work as if it were a security analyst at a security operations center. Tasks include root cause mapping and calculating Common Vulnerability Scoring System (CVSS) scores.
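To give a sense of the CVSS-scoring task, here is a minimal sketch of the CVSS v3.1 base score calculation (for scope-unchanged vectors only), using the metric weights and equations published in the FIRST specification. This illustrates what the benchmark asks a model to compute; it is not the CTIBench evaluation code itself.

```python
import math

# CVSS v3.1 metric weights (FIRST specification), scope unchanged
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                         # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}              # Privileges Required
UI = {"N": 0.85, "R": 0.62}                         # User Interaction
CIA = {"N": 0.0, "L": 0.22, "H": 0.56}              # Confidentiality/Integrity/Availability

def roundup(x: float) -> float:
    """CVSS-style rounding: up to one decimal place."""
    return math.ceil(x * 10) / 10

def base_score(av: str, ac: str, pr: str, ui: str, c: str, i: str, a: str) -> float:
    """Base score for a scope-unchanged CVSS v3.1 vector."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))

# CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H scores 9.8 (Critical)
print(base_score("N", "L", "N", "N", "H", "H", "H"))  # → 9.8
```

A model that has genuinely absorbed the CVSS specification should reproduce this arithmetic from a vector string alone; small systematic deviations show up as the over- or underestimation the researchers measured.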

A graph from the paper illustrates the number of overestimations and underestimations made by different LLMs when predicting the severity score of security flaws in information systems. All models exhibit a higher frequency of overestimation compared to underestimation, which suggests that LLMs may need calibration to improve their accuracy in threat severity prediction.
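The over/underestimation tally behind a graph like this can be sketched in a few lines. The scores below are hypothetical, chosen only to show the bookkeeping; the paper's actual data is not reproduced here.

```python
def estimation_bias(predicted: list[float], actual: list[float]) -> dict:
    """Count how often predicted severity scores exceed, undershoot,
    or exactly match the ground-truth scores."""
    over = sum(p > a for p, a in zip(predicted, actual))
    under = sum(p < a for p, a in zip(predicted, actual))
    return {"overestimates": over,
            "underestimates": under,
            "exact": len(actual) - over - under}

# Hypothetical model predictions vs. ground-truth CVSS scores
pred = [9.8, 7.5, 6.1, 8.8, 5.3]
true = [9.8, 6.5, 6.1, 7.5, 5.9]
print(estimation_bias(pred, true))  # → {'overestimates': 2, 'underestimates': 1, 'exact': 2}
```

A model with more overestimates than underestimates, as all models in the study showed, is systematically biased toward higher severity and would benefit from calibration.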

The RIT team also created 2,000 cybersecurity questions using ChatGPT—with extensive trial and error in prompt engineering. All the questions were validated by practicing security professionals and cybersecurity graduate students. Questions in the evaluation range from basic security definitions to technical NIST specifications to determining the next steps in a threat scenario.

“One of the most challenging and rewarding aspects was designing appropriate tasks to quantitatively evaluate the capabilities of LLMs in the domain of Cyber Threat Intelligence,” said Alam.

While creating CTIBench over several months, the RIT team evaluated five different LLMs with the tool. In the end, the benchmark produces an evaluation of the LLM it is testing—reporting its accuracy on each task.

The researchers published “CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence” at NeurIPS 2024, the Conference on Neural Information Processing Systems. It was selected as a spotlight paper, placing it among the top 2 percent of papers accepted at NeurIPS.

Now, industry has taken notice of CTIBench. It is free and open access—available on Hugging Face and GitHub.

Google is using CTIBench to evaluate its new experimental cybersecurity model Sec-Gemini v1. Cisco and Trend Micro are using CTIBench to evaluate cybersecurity applications in their own LLMs. Chris Madden, a distinguished technical security engineer at Yahoo Security, has also brought attention to CTIBench in his Common Weakness Enumeration benchmark effort in collaboration with the MITRE Corporation.

“The quick adoption of CTIBench validates our research impact and positions us well in cybersecurity and LLM research,” said Rastogi. “This is opening doors to new collaborations, funding, and real-world industry impact.”
