BENCHMARKING HALLUCINATION, VULNERABILITY EXPOSURE, AND STYLE DRIFT IN AI-ASSISTED CODE REVIEW SYSTEMS

Authors

  • Collins Okafor

DOI:

https://doi.org/10.46121/pspc.53.4.35

Keywords:

AI-assisted code reviews, Large Language Models, Hallucination, Vulnerability Exposure, Style Drift, Software Assurance, DevSecOps, Security Benchmarking, Automated Code Review.

Abstract

Artificial Intelligence (AI) and Large Language Models (LLMs) have transformed automated code review, making the process faster and more effective. AI review systems, however, introduce three significant failure modes that can undermine software reliability, security, and maintainability: hallucinated findings, vulnerability exposure, and style drift. Hallucination produces review comments that the code under review does not support; vulnerability exposure arises when the reviewer misses existing vulnerabilities or makes insecure suggestions; and style drift occurs when generated suggestions depart from the project's established coding style guides.

This paper proposes a benchmark framework for assessing these failure modes in AI-driven code review systems. The framework combines a curated repository of samples, security-focused review scenarios, ambiguous review scenarios, and longitudinal evaluation to measure review accuracy, security performance, and stylistic consistency. The experimental results show that AI reviewers are good at recognising common coding defects and common security problems, but that they can err in contextual reasoning, fail to provide a thorough threat analysis, and give inconsistent coding-style recommendations.

The study also underscores the importance of governance structures such as human oversight, output auditability, adversarial testing, and ongoing performance monitoring. The proposed benchmark offers a pragmatic way to evaluate AI-based code review tools before they are deployed in enterprise and security-sensitive environments, so that they can be treated as software-assurance systems rather than merely as productivity-enhancement tools.
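
The three failure modes named in the abstract can each be expressed as a simple per-review rate. The following Python sketch is illustrative only and is not the paper's metric definition: the `ReviewComment` fields, the `score_review` function, and the idea of labelling each comment against a set of seeded vulnerabilities are all assumptions introduced here to show how such a benchmark might score a single AI review.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class ReviewComment:
    # Hypothetical label: ID of the ground-truth issue this comment maps to,
    # or None when no real issue in the code supports the comment.
    issue_id: Optional[str]
    # Hypothetical label: True if the suggested change breaks the style guide.
    violates_style: bool


def score_review(comments: List[ReviewComment],
                 seeded_vulns: Set[str]) -> Dict[str, float]:
    """Return the three illustrative failure-mode rates for one AI review."""
    total = len(comments)
    # Hallucination: comments that map to no ground-truth issue.
    hallucinated = sum(1 for c in comments if c.issue_id is None)
    # Vulnerability exposure: seeded vulnerabilities that no comment flagged.
    flagged = {c.issue_id for c in comments if c.issue_id is not None}
    missed = seeded_vulns - flagged
    # Style drift: suggestions that depart from the project's style guide.
    drifted = sum(1 for c in comments if c.violates_style)
    return {
        "hallucination_rate": hallucinated / total if total else 0.0,
        "vulnerability_miss_rate":
            len(missed) / len(seeded_vulns) if seeded_vulns else 0.0,
        "style_drift_rate": drifted / total if total else 0.0,
    }


if __name__ == "__main__":
    example = [
        ReviewComment("VULN-1", False),    # correctly flags a seeded vulnerability
        ReviewComment(None, False),        # unsupported (hallucinated) comment
        ReviewComment("DEFECT-7", True),   # real defect, but the fix breaks style
        ReviewComment("DEFECT-8", True),   # real defect, but the fix breaks style
    ]
    print(score_review(example, seeded_vulns={"VULN-1", "VULN-2"}))
```

Tracking these rates over repeated runs on the same samples would be one way to approximate the longitudinal evaluation the abstract mentions, though the paper's own procedure may differ.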

Published

2025-11-30