LLMs fail in alert triage without structured frameworks

Recent research from the University of Oslo and the Norwegian Defence Research Establishment shows that large language models (LLMs) on their own achieved zero percent accuracy in detecting malicious activity when given only alert descriptions and network log summaries. Even advanced models such as GPT-5-mini, Claude 3 Haiku, Qwen3:30B, and Gemma 3:27B failed to recognize real attacks.

Quick Answer

On their own, LLMs are not effective at detecting cyber threats. In the study, accuracy reached 93% only after the models were integrated into structured workflows with defined tools and guardrails. LLMs significantly improve alert triage only when they operate in a well-defined context, much as a junior analyst would.

Critical Tests: The Systematic Error of LLMs

The study tested four popular language models against a standard attack sequence taken from the AIT Log Data Set V1.1: reconnaissance, brute-force access attempts, and an initial access attempt against a web server. Although the signal was present in the logs, every model failed to detect the malicious activity when it received only a high-level summary. Gemma, in particular, classified every input as benign regardless of its content.

The Breakthrough: Structured Workflows Improve Accuracy to 93%

When the same models were integrated into a structured workflow, accuracy jumped to 93%. The framework used one model to plan the investigation, a second to summarize the evidence collected, and a third to issue a verdict. This let the LLMs operate more like a junior analyst, extracting specific evidence and deciding on the next steps.
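The three-stage arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' implementation: the `triage` function, the `Model` and `Tool` types, the tool names, and the stub models are all hypothetical, and the stubs stand in for real LLM calls so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any callable mapping a prompt to a text reply.
Model = Callable[[str], str]
# Assumption: a "tool" takes the alert text and returns raw evidence lines.
Tool = Callable[[str], List[str]]

VERDICTS = {"malicious", "benign", "uncertain"}


@dataclass
class TriageResult:
    plan: List[str]
    summary: str
    verdict: str


def triage(alert: str, planner: Model, summarizer: Model, judge: Model,
           tools: Dict[str, Tool]) -> TriageResult:
    # Stage 1: the planner names the tools it wants, one per line.
    requested = [ln.strip() for ln in planner(alert).splitlines() if ln.strip()]
    # Only tools that are actually defined get executed.
    plan = [name for name in requested if name in tools]
    evidence: List[str] = []
    for name in plan:
        evidence.extend(tools[name](alert))
    # Stage 2: a second model condenses the collected evidence.
    summary = summarizer("\n".join(evidence))
    # Stage 3: a third model issues the verdict from alert plus summary.
    verdict = judge(f"Alert: {alert}\nSummary: {summary}").strip().lower()
    if verdict not in VERDICTS:
        verdict = "uncertain"  # anything unparseable is treated as uncertain
    return TriageResult(plan, summary, verdict)


# Hypothetical stand-ins for LLM calls, used only to make the sketch runnable.
def stub_planner(prompt: str) -> str:
    return "auth_log\nweb_log\nshell_exec"


def stub_summarizer(evidence: str) -> str:
    hits = [ln for ln in evidence.splitlines() if "failed login" in ln]
    return f"{len(hits)} failed logins"


def stub_judge(prompt: str) -> str:
    return "benign" if "Summary: 0 failed" in prompt else "malicious"


# Hypothetical tools returning made-up log lines for illustration.
TOOLS: Dict[str, Tool] = {
    "auth_log": lambda alert: ["failed login admin", "failed login root"],
    "web_log": lambda alert: ["GET /admin 404"],
}

result = triage("Multiple login attempts on web server",
                stub_planner, stub_summarizer, stub_judge, TOOLS)
print(result)
```

In this sketch the planner's request for `shell_exec` is dropped because no such tool is defined, which is one simple way the surrounding system, rather than the model, decides what can actually be run.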

Implications for SOC Automation

The results have significant implications for Security Operations Centers (SOCs) seeking to automate alert triage. Integrating LLMs into structured workflows can turn linguistically capable models into effective tools for threat detection. However, practical constraints still apply, such as the need for human review of uncertain cases.

Study Limitations and Next Steps

The study is a proof-of-concept that covers a single attack scenario and a synthetic dataset. Researchers emphasize the need to further test against more diverse data and real outputs from intrusion detection systems. This approach could reveal additional challenges and opportunities for integrating LLMs into security processes.

How to Choose AI-Based Security Products

For any AI-based security product, the crucial question is what the system around the model can do. The research shows that a capable model without a structured framework tends to make inaccurate assumptions. The same model, given defined tools and a clear process, can reason effectively through security problems.

Download the Pentest Automation Guide

To further explore the automation of security processes, a comprehensive guide on automating pentest delivery is available. This resource offers practical insights into how to integrate AI into penetration testing and improve the efficiency of security operations.

The Crucial Role of Human Review in Security Workflows

One of the most interesting aspects of the study is the need to integrate human review into automated security workflows. Despite the high accuracy achieved within the structured workflow, the models tend to classify many cases as uncertain. GPT-5-mini in particular labeled every benign case as uncertain. This conservative behavior is preferable to producing false negatives, but it has significant implications for the operational efficiency of SOCs, because every uncertain case still needs an analyst.
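The operational cost of uncertain verdicts can be made concrete with a small routing sketch. The queue names and the `route` and `review_load` helpers below are hypothetical, and the verdict list is illustrative input, not data from the study.

```python
from collections import Counter
from typing import Iterable


def route(verdict: str) -> str:
    """Map a model verdict to a (hypothetical) SOC queue name."""
    if verdict == "malicious":
        return "incident_response"
    if verdict == "benign":
        return "auto_close"
    # Uncertain or unrecognized verdicts always go to a person.
    return "human_review"


def review_load(verdicts: Iterable[str]) -> float:
    """Fraction of alerts that still require a human analyst."""
    items = list(verdicts)
    if not items:
        return 0.0
    queues = Counter(route(v) for v in items)
    return queues["human_review"] / len(items)


# Illustrative input only: a model that marks every benign alert as uncertain.
sample = ["malicious", "uncertain", "uncertain", "uncertain"]
print(review_load(sample))
```

A model that never says "benign" avoids missed attacks, but in a layout like this it also means no alert is ever closed automatically.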

The Importance of Real Data and Diverse Scenarios

The current study represents a starting point, not a definitive conclusion. Researchers emphasize the need to further test LLMs against more diverse data and real outputs from intrusion detection systems. This step is essential to fully understand the capabilities and limitations of LLMs in real operational contexts.

Guardrails and Defined Tools: The Key to Success

The success of LLMs in alert triage does not lie in the models themselves, but in the structured framework that surrounds them. Well-defined guardrails and specific tools allow LLMs to operate more effectively, reducing the risk of inaccurate assumptions. This approach not only improves accuracy but also makes the security process more robust and reliable.
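One way to picture guardrails and defined tools is a check that sits between the model's requests and the systems it can touch. The sketch below assumes two simple rules, an allowlist of tool names and a cap on calls per alert. The `Guardrails` class, `checked_calls` function, and tool names are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


class GuardrailError(Exception):
    """Raised when a model request falls outside the defined limits."""


@dataclass(frozen=True)
class Guardrails:
    allowed_tools: FrozenSet[str]  # tools the model may invoke
    max_calls: int                 # budget of tool calls per alert


def checked_calls(requested: List[str], rails: Guardrails) -> List[str]:
    """Return the requested tool names if every one passes the guardrails."""
    accepted: List[str] = []
    for name in requested:
        if name not in rails.allowed_tools:
            raise GuardrailError(f"tool not allowed: {name}")
        if len(accepted) >= rails.max_calls:
            raise GuardrailError("call budget exceeded")
        accepted.append(name)
    return accepted


rails = Guardrails(allowed_tools=frozenset({"auth_log", "web_log"}), max_calls=2)
print(checked_calls(["auth_log", "web_log"], rails))

try:
    checked_calls(["auth_log", "shell_exec"], rails)
except GuardrailError as err:
    print("rejected:", err)
```

Failing loudly on an out-of-bounds request is a design choice in this sketch. A real deployment might instead log the request and continue with the permitted subset.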

Implications for the Future of SOC Automation

The results of the study open new perspectives for SOC automation. Integrating LLMs into structured workflows could substantially change how Security Operations Centers handle alerts. However, these technologies need further exploration and development before their effectiveness in complex operational contexts can be relied on.

Final Considerations

While LLMs alone are not sufficient to ensure high accuracy in detecting cyber threats, their integration into structured workflows can make a significant difference. This approach not only improves the effectiveness of LLMs but also makes the security process more robust and reliable. However, it is essential to continue exploring and developing these technologies to ensure their effectiveness in real operational contexts.

Editorial Note and Disclaimer

The guides and content published on GoYou are the result of independent research and analysis activities, for informational, educational, and in-depth purposes.

GoYou does not constitute a journalistic publication or an editorial product pursuant to Law No. 62/2001 and does not provide real-time information.

The GoYou project does not provide professional, technical, legal, or financial advice and disclaims all responsibility for the improper use of the information published.
