Trust But Verify: Evaluating the Accuracy of LLMs in Normalizing Threat Data Feeds

This paper examines whether Large Language Models (LLMs) can be reliably applied to the normalization of Indicators of Compromise (IOCs) into Structured Threat Information Expression (STIX) format. Using benchmark datasets of 200 IOCs across three types (MD5 hashes, URLs, and IPv4 addresses), it evaluates the performance of Google’s Gemini 2.0 Flash and OpenAI’s ChatGPT-4o.

While both models achieved 100% validity in generating syntactically correct STIX outputs, their fidelity in accurately preserving IOC values varied significantly. Gemini outperformed ChatGPT overall, though both models struggled with hash values, exhibiting frequent omissions and erroneous pattern translations. The inconsistencies in these errors pose a major obstacle to the reliable use of LLMs in operational security and data engineering pipelines.
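
The distinction between validity and fidelity drawn above can be illustrated with a small check. The following Python is a minimal sketch, not the paper's evaluation harness: the function names (`build_pattern`, `check_fidelity`), the three result labels, and the sample values are all assumptions made for illustration. It builds the single-comparison STIX pattern one would expect for each IOC type, then classifies a model-produced pattern by whether the original value survives unchanged.

```python
import re
from typing import Optional

# STIX object paths for the three IOC types named in the abstract.
STIX_PATHS = {
    "md5": "file:hashes.MD5",
    "url": "url:value",
    "ipv4": "ipv4-addr:value",
}

# Matches a single-comparison pattern such as [url:value = 'http://example.com'].
_PATTERN_RE = re.compile(r"^\[\s*([\w\-:.']+)\s*=\s*'((?:[^'\\]|\\.)*)'\s*\]$")


def build_pattern(ioc_type: str, value: str) -> str:
    """Build the expected STIX pattern for one IOC (hypothetical helper)."""
    # STIX string literals escape backslashes and single quotes with a backslash.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"[{STIX_PATHS[ioc_type]} = '{escaped}']"


def check_fidelity(ioc_type: str, original: str, pattern: Optional[str]) -> str:
    """Classify a model-produced pattern as 'match', 'omitted', or 'altered'."""
    if not pattern:
        return "omitted"
    m = _PATTERN_RE.match(pattern.strip())
    if m is None:
        return "altered"  # not the simple one-comparison form expected here
    path, raw = m.group(1), m.group(2)
    value = re.sub(r"\\(.)", r"\1", raw)  # undo backslash escapes
    # Accept both hashes.MD5 and hashes.'MD5' spellings of the object path.
    if path.replace("'", "") != STIX_PATHS[ioc_type]:
        return "altered"
    # Strict, case-sensitive comparison: any change to the value counts as altered.
    return "match" if value == original else "altered"


if __name__ == "__main__":
    md5 = "d41d8cd98f00b204e9800998ecf8427e"
    print(check_fidelity("md5", md5, build_pattern("md5", md5)))
    # One character changed in the hash: syntactically fine, but not faithful.
    print(check_fidelity("md5", md5, "[file:hashes.MD5 = 'd41d8cd98f00b204e9800998ecf8427f']"))
    print(check_fidelity("md5", md5, None))
```

A pattern can pass a STIX syntax validator and still fail this check, which is the gap between validity and fidelity that the abstract describes. The comparison is deliberately strict; a real pipeline might choose to normalize hash case before comparing.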

16 Jul 2025

By Nicholas Peterson

All papers are copyrighted

No re-posting of papers is permitted

Related Content

From Alert to Evidence: Evaluating AI Agents for Cyber Forensic Triage

Research Paper

Cyber defense teams are beginning to experiment with large language models in security operations, but their usefulness in digital forensics and incident triage is still uncertain.

11 Jun 2026
Connor Blackard

Leveraging Large Language Models for Cross-Vendor Firewall Configuration Migration: A Comparative Case Study of Claude and ChatGPT

Research Paper

This paper investigates how two current-generation large language models (LLMs) perform on a single, representative firewall migration task.

12 May 2026
Omar Zaman

AI-Driven SecOps: Unifying Controls, Automating Response, and Advancing the Modern SOC Using Cortex XSIAM

Research Paper

New research from IDC reveals the tangible business value of rigorous, practitioner-led training from SANS: faster threat detection and response, reduced operational risk, stronger team cohesion, and millions in annual cost savings.

29 Jul 2025
Dave Shackleford

Do AI Coding Assistants Make Bad Coders Worse? A Security Evaluation of GitHub Copilot

Research Paper

This paper examines whether the overall security posture of a project affects the quality of the code produced by Copilot.

11 Jul 2025
Andrew Hannaford

Dropzone AI Can Make Internal SOC Teams More Effective

Research Paper

In this paper, SANS Certified Instructor Mark Jeanmougin examines how Dropzone AI can integrate into existing security stacks and help SOC teams stay focused on high-impact decisions.

17 Jun 2025
Mark Jeanmougin

Beneath the Mask: Can Contribution Data Unveil Malicious Personas in Open-Source Projects?

Research Paper

In February 2024, after building trust over two years with project maintainers by making a significant volume of legitimate contributions, GitHub user "JiaT75" self-merged a version of the XZ Utils project containing a highly sophisticated, well-disguised backdoor targeting sshd processes running on systems with the backdoored package installed.

13 May 2025
SANS Institute

AI-Driven Insecurity: Assessing Security Gaps in AI Generated IT Guidance

Research Paper

The increasing reliance on AI-generated technical guidance for IT system configuration introduces significant security risks. This study assesses these risks through a case study: setting up an Apache web server on a Rocky Linux system using instructions from seven AI models.

13 May 2025
Edward Abbott

Leveraging Large Language Models for Security-Focused Code Reviews

Research Paper

This study investigates the potential application of Large Language Models (LLMs) in enhancing software security through automated vulnerability detection during the code review process.

26 Mar 2025
Daniel McQuade

MITRE ATT&CK Labeling of Cyber Threat Intelligence via LLM

Research Paper

This paper explores the effectiveness of various online and locally hosted LLMs in classifying an arbitrary statement as containing a MITRE ATT&CK Framework (MAF) technique or not and then producing the technique number if it does.

7 Jan 2025
Terence O’Brien

AI Hunting with the Cybereason Platform: A SANS Review

Research Paper

SANS reviewed Cybereason's AI hunting platform, which offers a lightweight, behavior-focused model...

23 Jul 2018
Dave Shackleford

Applying Machine Learning Techniques to Measure Critical Security Controls

Research Paper

Implementing and measuring Critical Security Controls (CSC) requires analyzing all data types...

6 Sep 2016
Balaji Balakrishnan
