7th International Congress on Human-Computer Interaction, Optimization and Robotic Applications (ICHORA), Ankara, Türkiye, 23-24 May 2025 (Full Text Paper)
Open-source software is critical to modern digital infrastructure, yet security vulnerabilities remain a significant concern as attackers exploit unpatched systems. Large Language Models (LLMs) have shown promise in vulnerability detection, but their ability to detect security patches based solely on code changes remains underexplored. This capability is crucial for identifying security patches when commit messages lack explicit security labels. This study evaluates six LLMs, including GPT-4o, Claude 3.5 Haiku, and DeepSeek V3, using various prompting approaches to assess their capacity to distinguish security patches from non-security patches, and analyzes their effectiveness in classifying security patches into the corresponding CWE categories. The results show that LLMs can detect security patches, but performance varies across models and prompting strategies. DeepSeek V3 (Chain-of-Thought) and GPT-4o (Zero-Shot) demonstrate the most consistent performance across all evaluation metrics, each achieving over 70% in accuracy, precision, recall, and F1-score. More elaborate prompting techniques also yield notable improvements in certain areas, particularly precision. However, CWE classification remains a major challenge, with most models misclassifying over 70% of security patches. Even the best-performing model, Claude 3.5 Haiku (Few-Shot), achieves only 31.1% accuracy, with memory-related vulnerabilities such as out-of-bounds write and use-after-free being the most frequently misclassified. These findings highlight the potential of LLMs in security patch detection but emphasize the need for improved CWE classification. Source code available at: https://github.com/betulgkkaya/LLMs_Patch_Detection.git.
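The evaluation pipeline described above can be illustrated with a minimal sketch: wrapping a commit diff in a zero-shot classification prompt and scoring the resulting binary predictions with the four reported metrics. Function names and the prompt wording below are hypothetical illustrations, not the paper's released implementation.

```python
# Hypothetical sketch of zero-shot security-patch classification and scoring.
# The prompt text and helper names are illustrative assumptions, not drawn
# from the paper's repository.

def build_zero_shot_prompt(diff: str) -> str:
    """Wrap a commit diff in a zero-shot classification instruction for an LLM."""
    return (
        "You are a security analyst. Based only on the following code change, "
        "answer 'security' if it fixes a security vulnerability, otherwise "
        "answer 'non-security'.\n\n" + diff
    )

def binary_metrics(y_true, y_pred, positive="security"):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

In practice the prompt string would be sent to each model's API and the response parsed into one of the two labels; the metric helper then scores those predictions against ground-truth labels from a patch dataset.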