The use of Large Language Models (LLMs) is expanding rapidly, with a wide range of open-source and proprietary architectures now available. While platforms like ChatGPT are best known for generative text tasks, LLMs have shown utility in many other text-processing applications, including code-writing assistance and content categorization. SophosAI has explored several ways to apply LLMs to cybersecurity tasks, but researchers face the challenge of selecting the model best suited to a given machine learning problem. One approach to this selection process is to create benchmark tasks that can quickly and easily assess the capabilities of different models.
Currently, LLMs are evaluated on benchmarks that test their general abilities in basic natural language processing tasks. The Hugging Face Open LLM Leaderboard, for example, uses seven benchmarks to evaluate all open-source models on the platform. However, these benchmarks may not accurately reflect how well models perform in cybersecurity contexts, as they are generalized and may not surface security-specific expertise gained from training data. To address this gap, SophosAI developed three benchmarks focused on incident investigation assistance, incident summarization, and incident severity rating.
In testing 14 different models against these benchmarks, including variants of Meta's Llama 2 and CodeLlama models, OpenAI's GPT-4 demonstrated superior performance in incident investigation assistance. However, none of the models tested performed accurately enough in categorizing incident severity. The benchmarks provided insight into the models' out-of-the-box ability to handle specific cybersecurity tasks and highlighted areas for potential fine-tuning.
For the incident investigation assistant benchmark, models were tasked with converting natural language queries into SQL statements, a crucial skill for SOC analysts investigating security incidents. GPT-4 emerged as the top performer with an 88% accuracy rate, followed closely by other models like CodeLlama-34B-Instruct and the Claude models. These high accuracy scores suggest that LLMs could effectively support threat analysts in incident investigation tasks.
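The post does not spell out how generated SQL was scored. A common approach for text-to-SQL benchmarks is execution accuracy: run the model's query and an analyst-written reference query against the same database and compare the result sets, so that syntactically different but semantically equivalent SQL still counts as correct. A minimal sketch of that idea (the `events` schema and queries here are hypothetical stand-ins for real SOC telemetry, not the benchmark's actual data):

```python
import sqlite3

def execution_match(db: sqlite3.Connection, candidate_sql: str, reference_sql: str) -> bool:
    """Score a generated query by execution accuracy: it counts as correct
    if it returns the same rows as the reference query (order-insensitive)."""
    try:
        got = sorted(db.execute(candidate_sql).fetchall())
    except sqlite3.Error:
        return False  # unparseable or invalid SQL counts as a miss
    expected = sorted(db.execute(reference_sql).fetchall())
    return got == expected

# Toy telemetry table standing in for real SOC data (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (host TEXT, process TEXT, severity INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?, ?)",
               [("web01", "powershell.exe", 8), ("web02", "svchost.exe", 2)])

# "Which hosts ran PowerShell?" -- model output vs. analyst-written reference
print(execution_match(
    db,
    "SELECT host FROM events WHERE process = 'powershell.exe'",
    "SELECT host FROM events WHERE process LIKE 'powershell%'",
))  # True: different SQL text, same result set
```

Scoring by result set rather than string equality avoids penalizing a model for stylistic differences in the SQL it emits.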
In the incident summarization benchmark, models were challenged to organize and summarize data from security incidents to help analysts identify notable events efficiently. Large language models proved valuable in this task, offering a way to streamline the analysis of incident data and assist analysts in determining next steps. The benchmarks developed by SophosAI provide a valuable framework for evaluating LLMs in cybersecurity contexts and highlight the potential of these models in enhancing security operations.

# Evaluating Language Models for Incident Summarization and Severity Evaluation
As part of a benchmark study conducted by SophosAI, various large language models (LLMs) were evaluated for their performance in incident summarization and severity evaluation tasks. The study involved comparing the output of different LLMs against manually reviewed incident summaries to assess accuracy and effectiveness. Here are the key findings from the study:
## Incident Summarization Benchmark Results
- GPT-4 emerged as the top performer in incident summarization, outperforming other models in accuracy and detail extraction.
- Qualitative evaluation revealed that GPT-4 produced accurate summaries but was slightly verbose.
- Llama-70B and J2-Ultra also performed well at extracting details but struggled with the summarization format.
- MPT-30B and CodeLlama-34B faced challenges in generating organized summaries, with CodeLlama-34B regurgitating event data instead of summarizing it.
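The study compared model output against manually reviewed summaries. One simple automated proxy for "detail extraction" in such comparisons is unigram recall (a ROUGE-1-style score): the fraction of the reference summary's words that the candidate also contains. This metric is an illustration, not the study's actual evaluation method, and the example summaries below are invented:

```python
from collections import Counter

def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1-style recall: fraction of reference words (with multiplicity)
    that also appear in the candidate summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / sum(ref.values())

reference = "attacker used powershell to dump credentials from lsass"
concise = "powershell was used to dump credentials from lsass"
regurgitated = "event 4688 event 4688 process powershell lsass"

print(round(unigram_recall(concise, reference), 2))       # 0.88
print(round(unigram_recall(regurgitated, reference), 2))  # 0.25
```

A summary that merely regurgitates raw event data, as CodeLlama-34B tended to do, scores poorly against a reference written in an analyst's own words, while a genuine paraphrase scores well.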
## Incident Severity Evaluation Task
- The study also assessed the LLMs' ability to determine the severity of security events; none demonstrated sufficient out-of-the-box performance, with most models struggling even to adhere to the required evaluation format.
- By contrast, using GPT-3 embeddings as features for severity classification showed significant performance improvements over direct prompting.
- Across all benchmarks, GPT-4 and Claude v2 stood out as top performers, while CodeLlama-34B showed promise for deployment as a SOC assistant.
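The embeddings-based approach treats severity rating as a standard classification problem: embed each incident's text, then train a classifier on those vectors. The post does not say which classifier was used, so the nearest-centroid scheme below is purely illustrative, and the tiny 2-D vectors stand in for real, high-dimensional embedding output:

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def fit_centroids(labeled):
    """labeled: list of (embedding, severity_label) pairs."""
    groups = defaultdict(list)
    for vec, label in labeled:
        groups[label].append(vec)
    return {label: centroid(vecs) for label, vecs in groups.items()}

def classify(embedding, centroids):
    """Assign the label whose class centroid is most similar."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Toy 2-D vectors standing in for real embedding-model output (hypothetical).
train = [([0.9, 0.1], "high"), ([0.8, 0.2], "high"),
         ([0.1, 0.9], "low"), ([0.2, 0.8], "low")]
centroids = fit_centroids(train)
print(classify([0.7, 0.3], centroids))  # "high"
```

Because the classifier only consumes fixed-length vectors, it sidesteps the format-adherence problems the prompted LLMs showed: there is no free-form text output to parse.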
## Conclusion
While LLMs like GPT-4 show promise in aiding threat hunting and incident investigation, there is still room for improvement in fine-tuning models for specific cybersecurity tasks. Specialized LLMs trained on cybersecurity data may be necessary for more accurate artifact evaluation. Overall, the study highlights the potential of LLMs in security applications but underscores the importance of careful prompt engineering and model selection.
### Key Points:
- GPT-4 excelled in incident summarization, while GPT-3 embeddings showed performance improvements in severity evaluation.
- Most LLMs struggled with adhering to evaluation formats and producing organized summaries.
- Specialized LLMs may be required for accurate artifact evaluation in cybersecurity applications.
- CodeLlama-34B showed promise as a competitive model for deployment as a SOC assistant.
In conclusion, the benchmark study conducted by SophosAI sheds light on the capabilities and limitations of various LLMs in incident summarization and severity evaluation tasks. While models like GPT-4 show promise, further advancements and specialized training may be needed to enhance their performance in cybersecurity applications.