Chapter 3 — Penetration Testing and Vulnerability Analysis with AI
Penetration testing, the authorised simulated attack against a system to identify exploitable weaknesses, has been an established security discipline for decades. Vulnerability analysis, the systematic identification and assessment of weaknesses in systems and software, is its closely related companion. Generative AI is transforming both practices: automating exploit research, accelerating vulnerability triage, augmenting penetration tester workflows, and changing the economics of testing. The same capabilities transform the defensive disciplines of patching, monitoring, and hardening. This chapter examines how AI affects offensive and defensive security testing, the tooling emerging in this space, and the operational practices that mature organisations are adopting.
3.1 Generative AI in automated exploit development and testing
Exploit development context
An exploit is a piece of software, data, or sequence of commands that takes advantage of a vulnerability in a system to cause unintended behaviour, typically arbitrary code execution, privilege escalation, denial of service, or information disclosure. It gives an attacker (or authorised tester) the means to demonstrate or operationalise a vulnerability.
Exploit development traditionally requires:
- Deep understanding of the target technology.
- Vulnerability analysis skills.
- Reverse engineering capability.
- Programming proficiency.
- Persistent debugging.
- Creative problem-solving.
Skilled exploit developers have historically been scarce. Generative AI is changing the economics.
Stages of exploit development
Generative AI affects each stage differently:
Vulnerability research. Analysing code or binaries to find weaknesses.
Proof of concept. Demonstrating a vulnerability is exploitable.
Reliable exploitation. Making the exploit work consistently across environments.
Weaponisation. Packaging for operational use, evading detection.
Delivery. Getting the exploit to the target.
AI assistance at each stage
Vulnerability research with AI.
LLMs can analyse code to identify potential security issues:
- Buffer overflows, off-by-one errors.
- SQL injection patterns.
- Cross-site scripting.
- Insecure deserialisation.
- Race conditions.
- Cryptographic weaknesses.
- Authentication bypasses.
Specialised tools combine LLMs with traditional analysis:
- SAST tools (SonarQube, Checkmarx, Snyk Code) increasingly use LLM-style analysis.
- AI code review (GitHub Copilot, Cursor) can catch some security issues during development.
- Specialised vulnerability research tools emerging from research labs.
Limitations: LLMs identify candidate vulnerabilities, not confirmed ones; false positives are common, so expert verification is required.
Proof-of-concept generation with AI.
Given an identified vulnerability, LLMs can suggest exploitation approaches. The depth of assistance varies — for well-documented vulnerability classes with extensive prior art, AI provides substantial help; for novel vulnerability classes, AI assistance is more limited.
Reliable exploitation with AI.
The engineering work of making exploits reliable — handling ASLR, dealing with version differences, surviving system noise — benefits from AI assistance in code generation, debugging, and pattern recognition.
Weaponisation considerations.
The line between research and operational use is drawn by policy and law, not by technology. Academic exploit research is valuable; weaponisation for unauthorised use is criminal.
Frontier model capabilities
Frontier LLMs (as of 2026) can:
- Identify many common vulnerability classes in source code.
- Generate exploitation code for known vulnerability patterns.
- Explain vulnerabilities in educational terms.
- Suggest mitigations alongside vulnerability identification.
What they generally cannot do well:
- Discover novel vulnerability classes without specific guidance.
- Develop exploits for entirely unfamiliar architectures.
- Reliably bypass modern mitigations (ASLR, CFI, DEP, sandboxing) without substantial human assistance.
- Operate fully autonomously for sophisticated targets.
The gap is narrowing. Defensive planning should assume increasing automation of attack capability.
Responsible disclosure context
Legitimate exploit research operates within responsible disclosure norms:
- Vendor notification. Discovered vulnerabilities reported to vendor.
- Coordinated timing. Public disclosure after the vendor has patched, or after an agreed deadline (commonly 90 days).
- Bug bounty programmes. Many vendors offer rewards for responsible disclosure.
- Legal frameworks. Some jurisdictions provide protection for good-faith research.
For Nepali context:
- No formal vulnerability-disclosure framework specific to Nepal.
- Researchers typically follow international norms.
- npCERT coordinates some disclosures.
- Several Nepali researchers participate in international bug bounty programmes.
AI in defensive vulnerability research
Defensive vulnerability research benefits from AI:
Code review. AI-assisted review of internal code finds issues before deployment.
Fuzzing enhancement. AI-generated test inputs improve fuzzing coverage.
Vulnerability pattern discovery. AI identifies new vulnerability patterns from data.
Triage acceleration. AI processes vulnerability reports for severity assessment.
For Nepali bank security teams developing internal applications, AI-augmented code review during development can substantially reduce the rate of vulnerabilities reaching production.
3.2 Vulnerability scanning and AI-driven prioritization
Vulnerability scanning
Vulnerability scanning is discussed in the Cloud Security subject, Chapter 3. A recap of the main scanner categories, which the AI extensions below build on:
Network vulnerability scanners. Nessus, OpenVAS, Qualys, Rapid7 InsightVM.
Web application scanners. Burp Suite, OWASP ZAP, Acunetix, AppScan.
Container/cloud scanners. Aqua, Wiz, Lacework, Orca.
SCA tools. Snyk, Dependabot, Black Duck.
IaC scanners. Checkov, tfsec, Snyk IaC.
The triage problem
Modern vulnerability scanning generates substantial output. A typical enterprise scan might identify:
- Thousands of CVEs across the environment.
- Hundreds of "critical" by severity.
- Many duplicates and contextually irrelevant findings.
- Mixed signal — real issues alongside theoretical risks.
Human analysts cannot triage all findings comprehensively. Prioritisation is essential and difficult.
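Removing duplicates is usually the first mechanical step before any prioritisation. The following is a minimal sketch, assuming each finding is a dictionary with hypothetical keys "asset", "cve", and "scanner"; real scanner exports use their own field names and the CVE identifiers shown in the usage are placeholders.

```python
from collections import defaultdict

def deduplicate(findings):
    """Collapse findings that share the same (asset, cve) pair.

    Each finding is a dict with the hypothetical keys "asset", "cve"
    and "scanner". Returns one record per pair, listing every scanner
    that reported it.
    """
    merged = defaultdict(set)
    for f in findings:
        merged[(f["asset"], f["cve"])].add(f["scanner"])
    return [
        {"asset": asset, "cve": cve, "scanners": sorted(scanners)}
        for (asset, cve), scanners in sorted(merged.items())
    ]

# Example with placeholder identifiers: two scanners report the same issue.
raw = [
    {"asset": "web-01", "cve": "CVE-0000-0001", "scanner": "scanner-a"},
    {"asset": "web-01", "cve": "CVE-0000-0001", "scanner": "scanner-b"},
    {"asset": "web-02", "cve": "CVE-0000-0001", "scanner": "scanner-a"},
]
unique = deduplicate(raw)
```

Keeping the list of reporting scanners, rather than discarding it, preserves a weak confidence signal: a finding confirmed by two independent tools is less likely to be a false positive.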
Traditional prioritisation factors
CVSS score. Common Vulnerability Scoring System; standardised severity score.
Exploitability metrics. Whether public exploit available; difficulty of exploitation.
Asset criticality. Importance of affected asset.
Network exposure. Internet-facing vs internal.
Existing controls. Mitigations reducing actual risk.
Patch availability. Whether fix available.
AI-enhanced prioritisation
AI brings several improvements:
Contextual analysis. AI understands the deployment context — what the vulnerability means in the specific environment.
Exploit prediction. ML models predict likelihood of exploitation based on multiple factors. EPSS (Exploit Prediction Scoring System) is an industry effort in this direction.
Threat-intelligence integration. Correlating vulnerabilities with active threat intelligence.
Business impact assessment. AI estimates business impact based on affected systems.
False positive reduction. AI identifies and filters false positives.
Patch prioritisation. AI suggests optimal patching sequence.
Tools and platforms
Tenable Vulnerability Management. Risk-based prioritisation.
Qualys VMDR. Vulnerability management with prioritisation.
Rapid7 InsightVM. Real-Risk score.
Wiz, Orca, Lacework. Cloud-native with prioritisation.
Vulcan Cyber. Specialised orchestration.
Snyk. Developer-focused with prioritisation.
Increasingly these platforms incorporate LLM features for contextualisation and explanation.
Specific prioritisation example
A Nepali bank scan produces 5,000 findings:
Naive approach. Patch all critical severity first.
Risk-based approach.
- Filter false positives (e.g., 1,000 findings rated critical reduced to 800 valid).
- Apply context — internet-facing vs internal (300 internet-facing critical).
- Check exploit availability and active campaigns (100 with active exploitation).
- Assess business impact (top 20 require immediate attention).
- Generate remediation plan with timing and ownership.
AI assists each step — filtering, contextualising, prioritising, planning.
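The risk-based approach above is a sequence of filters followed by a ranking. A minimal sketch of that funnel follows; the `Finding` fields, the 1 to 5 business-impact scale, and the function name are assumptions made for illustration, and in practice each flag would come from a validation step, an asset inventory, or a threat-intelligence feed.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve: str
    severity: str             # e.g. "critical", "high"
    false_positive: bool      # outcome of validation
    internet_facing: bool
    actively_exploited: bool  # e.g. seen in current campaigns
    business_impact: int      # assumed scale: 1 (low) to 5 (high)

def risk_funnel(findings, top_n=20):
    """Apply the filtering steps in order; return stage counts and the shortlist."""
    valid = [f for f in findings if not f.false_positive]
    exposed = [f for f in valid if f.internet_facing and f.severity == "critical"]
    exploited = [f for f in exposed if f.actively_exploited]
    shortlist = sorted(exploited, key=lambda f: f.business_impact, reverse=True)[:top_n]
    counts = {
        "total": len(findings),
        "valid": len(valid),
        "exposed_critical": len(exposed),
        "actively_exploited": len(exploited),
        "shortlist": len(shortlist),
    }
    return counts, shortlist
```

Returning the count at every stage makes the funnel auditable: a reviewer can see how many findings each filter removed and challenge any step that looks too aggressive.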
EPSS — Exploit Prediction Scoring System
EPSS is a data-driven scoring system that estimates the probability that a given vulnerability will be exploited in the wild within the next 30 days. It provides a probability score (0 to 1) and a percentile ranking to inform prioritisation decisions. It is distinct from CVSS, which measures severity rather than likelihood of exploitation.
EPSS uses ML on multiple data sources to predict exploitation. Vulnerabilities with high CVSS but low EPSS may warrant lower priority than those with high EPSS regardless of CVSS.
Integration of EPSS, CVSS, asset criticality, and threat intelligence produces a substantially better prioritisation than any single signal.
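One way to picture this integration is a weighted sum of the four signals. The sketch below is illustrative only: the weights are assumptions chosen for the example, not a published standard, and an organisation would tune them against its own remediation history.

```python
def priority_score(cvss: float, epss: float, asset_criticality: float,
                   threat_intel: bool) -> float:
    """Combine four signals into a single 0-1 priority score.

    cvss: base score, 0-10
    epss: exploitation probability, 0-1
    asset_criticality: 0-1 weight assigned by the organisation
    threat_intel: True if current intelligence mentions the vulnerability
    The weights are illustrative assumptions, not a standard.
    """
    score = (
        0.4 * epss
        + 0.3 * (cvss / 10.0)
        + 0.2 * asset_criticality
        + 0.1 * (1.0 if threat_intel else 0.0)
    )
    return round(score, 3)

# High severity but unlikely to be exploited, on an asset of medium criticality.
severe_but_quiet = priority_score(cvss=9.8, epss=0.02, asset_criticality=0.5, threat_intel=False)
# Lower severity but very likely to be exploited, with intelligence reporting.
moderate_but_hot = priority_score(cvss=7.5, epss=0.90, asset_criticality=0.5, threat_intel=True)
```

With these weights the second vulnerability outranks the first despite its lower CVSS score, which is the behaviour the EPSS discussion above describes.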
CISA Known Exploited Vulnerabilities (KEV) catalogue
CISA maintains a list of vulnerabilities known to be actively exploited. These warrant urgent attention regardless of CVSS score — known exploitation in the wild is direct evidence of risk.
The KEV list is widely used in prioritisation; AI tools integrate it as a key input.
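In a prioritisation pipeline, KEV membership typically acts as an override that moves a finding to the front of the queue. The sketch below assumes a locally saved copy of the catalogue in JSON form, with entries under a "vulnerabilities" key that each carry a "cveID" field; the sample data uses placeholder identifiers rather than real catalogue entries, and the finding fields are hypothetical.

```python
import json

# Assumed shape of a locally saved KEV export; placeholder IDs only.
SAMPLE_KEV_JSON = """
{"vulnerabilities": [
  {"cveID": "CVE-0000-0001"},
  {"cveID": "CVE-0000-0002"}
]}
"""

def load_kev_ids(raw_json: str) -> set:
    """Extract the set of CVE identifiers from the catalogue JSON."""
    data = json.loads(raw_json)
    return {entry["cveID"] for entry in data.get("vulnerabilities", [])}

def apply_kev_override(findings, kev_ids):
    """Order findings with KEV-listed items first, then by CVSS descending."""
    return sorted(findings, key=lambda f: (f["cve"] not in kev_ids, -f["cvss"]))

kev_ids = load_kev_ids(SAMPLE_KEV_JSON)
queue = apply_kev_override(
    [
        {"cve": "CVE-0000-0009", "cvss": 9.8},
        {"cve": "CVE-0000-0001", "cvss": 6.5},
    ],
    kev_ids,
)
```

Here the CVSS 6.5 finding is placed ahead of the CVSS 9.8 finding because only the former appears in the known-exploited set.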
3.3 AI tools for enhancing penetration testing workflows
Penetration testing context
Penetration testing is the authorised, simulated attack against a system, network, or application with the goal of identifying and demonstrating exploitable vulnerabilities, conducted by skilled testers ("pen testers" or "ethical hackers") following established methodologies, with results provided to the system owner for remediation.
Standard methodologies:
- PTES (Penetration Testing Execution Standard).
- OSSTMM (Open Source Security Testing Methodology Manual).
- OWASP Testing Guide for web applications.
- NIST SP 800-115 technical guide.
Standard phases:
- Pre-engagement.
- Intelligence gathering.
- Threat modelling.
- Vulnerability analysis.
- Exploitation.
- Post-exploitation.
- Reporting.
AI assists each phase.
LLM in pentest workflow
Pre-engagement. Defining scope, planning approach, generating documentation.
Intelligence gathering. As covered in Chapter 2 — AI-enhanced OSINT.
Vulnerability analysis. AI-assisted code review, configuration analysis, attack surface assessment.
Exploitation. AI-assisted exploit selection, customisation, troubleshooting, subject to the capability limits discussed later in this section.
Post-exploitation. AI-assisted lateral movement planning, persistence, data identification.
Reporting. This is where current AI adds the clearest value: drafting findings, generating executive summaries, producing tailored explanations.
Specific AI-pentest tools
PentestGPT. LLM-driven pentest assistant. Provides step-by-step guidance.
hackingBuddyGPT. Linux privilege escalation assistant.
Garak. LLM vulnerability scanner (for testing LLMs themselves).
WhiteRabbitNeo. Open-source security model.
Cybersec-eval frameworks. Testing LLM capabilities on security tasks.
Commercial offerings. Various vendors integrating LLMs into established pentest tools.
LLM with security tooling
Beyond standalone LLMs, the powerful pattern is LLMs orchestrating traditional tools:
LLM + Nmap. Network scanning interpreted by LLM.
LLM + Burp Suite. Web application testing with LLM analysis.
LLM + Metasploit. Exploit selection and customisation.
LLM + Cobalt Strike. Post-exploitation operations (red team / threat actor context).
LLM + custom scripts. Bespoke testing workflows.
The LLM provides the reasoning layer; traditional tools provide the technical capability.
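The division of labour can be sketched for the Nmap case: the scanner produces structured output, a small parser reduces it to the facts that matter, and the result becomes the text handed to the model. The example below assumes scan results saved in Nmap's XML output format from an authorised engagement; the sample XML is hand-written with a documentation-range address, and the model call itself is deliberately left out because it depends on the provider used.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of Nmap XML output (not a real scan).
SAMPLE_XML = """<nmaprun>
  <host>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22"><state state="open"/><service name="ssh"/></port>
      <port protocol="tcp" portid="443"><state state="open"/><service name="https"/></port>
      <port protocol="tcp" portid="8080"><state state="closed"/><service name="http-proxy"/></port>
    </ports>
  </host>
</nmaprun>"""

def open_services(xml_text):
    """Return (address, port, service) tuples for every open port."""
    root = ET.fromstring(xml_text)
    results = []
    for host in root.findall("host"):
        addr_el = host.find("address")
        addr = addr_el.get("addr") if addr_el is not None else "unknown"
        for port in host.findall("ports/port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue
            svc = port.find("service")
            name = svc.get("name") if svc is not None else "unknown"
            results.append((addr, int(port.get("portid")), name))
    return results

def build_prompt(services):
    """Format the parsed results as input text for a (hypothetical) LLM step."""
    lines = [f"- {addr} port {port}: {name}" for addr, port, name in services]
    return "Authorised scan results. Suggest what to review next:\n" + "\n".join(lines)

services = open_services(SAMPLE_XML)
prompt = build_prompt(services)
```

Parsing before prompting keeps the model's input small and factual, and it means the tester can inspect exactly what the model was told.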
Realistic capability assessment
What current frontier models can do well in pentest contexts:
- Suggest test approaches based on described environment.
- Identify common vulnerability patterns in provided code/configurations.
- Explain technical findings in ways appropriate to different audiences.
- Draft technical and executive reports.
- Recommend remediations.
- Generate test cases.
What they handle less well:
- Fully autonomous operation against sophisticated targets.
- Novel vulnerability discovery in complex systems.
- Long-duration multi-step operations without human guidance.
- Recognising when their suggestions are dangerous or counterproductive.
The MSc graduate working in security testing should expect AI to augment rather than replace human judgement.
Pentest reporting
AI substantially helps reporting:
Finding writeups. Standard format with context-specific details.
Risk ratings. Consistent CVSS or other scoring.
Remediation guidance. Specific, actionable steps.
Executive summary. Translation from technical detail to business impact.
Customisation by audience. Same findings presented for technical, management, and executive audiences.
For Nepali pentest practitioners, AI-assisted reporting saves substantial time and improves consistency.
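Consistency in reporting comes largely from keeping each finding as structured data and rendering it per audience. The sketch below shows that idea without any model involved; the `ReportFinding` fields and the two audience names are assumptions for illustration, while the rating bands follow the CVSS v3.x qualitative severity scale. An LLM would typically draft the free-text fields, with the structure enforcing uniform layout and scoring.

```python
from dataclasses import dataclass

@dataclass
class ReportFinding:
    title: str
    cvss: float
    asset: str
    technical_detail: str
    business_impact: str
    remediation: str

def rating(cvss: float) -> str:
    """Map a CVSS v3.x base score to its qualitative rating."""
    if cvss == 0.0:
        return "None"
    if cvss < 4.0:
        return "Low"
    if cvss < 7.0:
        return "Medium"
    if cvss < 9.0:
        return "High"
    return "Critical"

def render(finding: ReportFinding, audience: str) -> str:
    """Render one finding for a 'technical' or 'executive' audience."""
    header = f"{finding.title} [{rating(finding.cvss)}, CVSS {finding.cvss}]"
    if audience == "technical":
        body = [
            f"Asset: {finding.asset}",
            f"Detail: {finding.technical_detail}",
            f"Remediation: {finding.remediation}",
        ]
    elif audience == "executive":
        body = [
            f"Business impact: {finding.business_impact}",
            f"Action: {finding.remediation}",
        ]
    else:
        raise ValueError(f"unknown audience: {audience}")
    return "\n".join([header] + body)

# Hypothetical finding used only to exercise the renderer.
example = ReportFinding(
    title="Outdated TLS configuration",
    cvss=5.3,
    asset="portal.example",
    technical_detail="Server accepts TLS 1.0.",
    business_impact="Customer sessions could be downgraded.",
    remediation="Disable TLS 1.0 and 1.1.",
)
```

Because the rating is computed from the score rather than typed by hand, the same finding cannot be labelled "High" in one report section and "Medium" in another.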
Red team operations
Red teaming, a more extended form of adversarial simulation than pentesting, is adopting AI capabilities:
Planning. Multi-step attack planning.
Persona generation. Personas for social engineering operations.
Content generation. Phishing pretexts, fake documents.
Operational support. Real-time assistance during operations.
Reporting. Substantial reports describing complex operations.
In the Nepali banking context, red team exercises by international or local firms are increasingly common, and AI increasingly features in their tooling.
Pentest market in Nepal
The Nepali penetration-testing market:
- Major banks typically engage external firms for periodic testing.
- Local pentest firms with growing capability.
- International firms for higher-tier engagements.
- Independent consultants for smaller engagements.
- In-house teams at largest organisations.
The MSc graduate has opportunities both in-house at major organisations and at testing firms. AI skills are increasingly differentiating.
3.4 Defensive strategies — patching, monitoring, system hardening
Defensive testing focus
The other side of the testing coin: not just identifying weaknesses but addressing them effectively.
Patch management
Patch management is the systematic process of acquiring, testing, and installing software updates that address security vulnerabilities, bugs, and feature requests, ensuring systems remain protected against known threats while minimising operational disruption.
Patch management is covered in the Cloud Security subject, Chapter 3, in the cloud context. General principles:
Inventory accuracy. Know what is deployed.
Vulnerability awareness. Know what is vulnerable.
Prioritisation. AI-enhanced as discussed in Section 3.2.
Testing. Patches tested before production deployment.
Deployment automation. Patches deployed through pipelines.
Verification. Confirming successful application.
Tracking. Visibility into status.
Patch management challenges
Legacy systems. Some systems cannot be patched without significant work.
Third-party dependencies. Many vulnerabilities in libraries used by applications.
Operational windows. Production systems may have limited patching windows.
Reboot requirements. Some patches require reboots affecting availability.
Patch quality. Patches sometimes cause issues; testing essential.
Patch lag. The time between disclosure and deployment is the exposure window.
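Patch lag is straightforward to measure once disclosure and deployment dates are recorded per patch. A minimal sketch, assuming the dates are available as pairs; the function names and sample dates are illustrative.

```python
from datetime import date
from statistics import median

def exposure_days(disclosed: date, deployed: date) -> int:
    """Days between public disclosure and deployment of the fix."""
    return (deployed - disclosed).days

def summarise_lag(records):
    """records: iterable of (disclosed, deployed) date pairs."""
    lags = [exposure_days(d, p) for d, p in records]
    return {"median_days": median(lags), "max_days": max(lags)}

# Illustrative dates only.
summary = summarise_lag([
    (date(2026, 1, 1), date(2026, 1, 11)),
    (date(2026, 1, 1), date(2026, 1, 31)),
    (date(2026, 2, 1), date(2026, 2, 15)),
])
```

Tracking the maximum alongside the median matters because a single long-unpatched system can dominate actual exposure even when typical lag looks healthy.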
AI in patch management
Patch impact analysis. AI assesses patch documentation to identify potential impact.
Test automation. AI-generated test cases for patch validation.
Patch sequencing. Optimal order for patching dependent systems.
Rollback decision support. Automated rollback if metrics degrade.
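Two of these items reduce to small, well-understood computations: sequencing is a topological sort over a dependency graph, and a rollback recommendation is a threshold on a health metric. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the system names, the direction of the dependencies, and the 50% tolerance are assumptions for illustration.

```python
from graphlib import TopologicalSorter

def patch_order(patch_after):
    """Return a patching order that respects dependencies.

    patch_after maps each system to the set of systems that must be
    patched before it.
    """
    return list(TopologicalSorter(patch_after).static_order())

def should_roll_back(error_rate_before, error_rate_after, tolerance=0.5):
    """Recommend rollback when the error rate rises by more than
    `tolerance` relative to the pre-patch rate (0.5 means +50%)."""
    if error_rate_before == 0:
        return error_rate_after > 0
    return (error_rate_after - error_rate_before) / error_rate_before > tolerance

# Hypothetical three-tier application: database first, then app, then web.
order = patch_order({"app": {"db"}, "web": {"app"}})
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies are circular, which is itself a useful signal that the patch plan needs human review.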
Continuous monitoring
Monitoring is covered extensively in the Cloud Security subject Chapter 5 and Managing Secure Networks Chapter 4 (IDS/IPS). For penetration-testing defence specifically:
Detect reconnaissance. Some reconnaissance creates detectable signatures.
Detect exploitation attempts. IDS/IPS, WAF, application logging.
Detect post-exploitation. EDR, behavioural analytics, lateral-movement detection.
Detect data exfiltration. DLP, traffic analysis, anomaly detection.
AI in monitoring
Anomaly detection. Discussed previously — ML identifies unusual patterns.
Alert triage. LLMs summarise and contextualise alerts.
Threat hunting. Hypothesis generation for proactive hunting.
Pattern recognition. Detecting subtle attack patterns across multiple data sources.
Automated response. SOAR with AI-driven playbook selection.
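The simplest form of anomaly detection compares new observations against a statistical baseline. The sketch below uses a z-score test as a stand-in for the ML models mentioned above; the counts, the threshold of 3, and the function name are assumptions for illustration, and production systems use considerably richer models.

```python
from statistics import mean, pstdev

def zscore_anomalies(baseline, observed, threshold=3.0):
    """Return (index, value, z) for observed values far from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A flat baseline: anything different from it is anomalous.
        return [(i, v, float("inf")) for i, v in enumerate(observed) if v != mu]
    return [
        (i, v, round((v - mu) / sigma, 2))
        for i, v in enumerate(observed)
        if abs(v - mu) / sigma > threshold
    ]

# Illustrative hourly counts of failed logins.
baseline_counts = [10, 12, 11, 9, 10, 12, 11, 9]
new_counts = [11, 10, 48, 12]
flagged = zscore_anomalies(baseline_counts, new_counts)
```

Only the third observation is flagged here. An alert-triage step, whether human or LLM-assisted, would then add context such as which accounts and source addresses were involved.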
System hardening
System hardening is the systematic reduction of attack surface and increase in security posture of systems through configuration, removal of unnecessary components, application of security baselines, and ongoing maintenance of secure operational state.
Major hardening areas:
Operating system hardening. OS-level configurations per CIS Benchmarks or DISA STIGs.
Network hardening. Reducing exposed services; firewall rules; network segmentation.
Application hardening. Secure configuration of applications; removal of unnecessary features.
Authentication hardening. MFA, strong passwords, account policies.
Logging and audit. Comprehensive logging configured.
Privilege reduction. Services run with minimum necessary privileges.
Removal of unnecessary components. Less code means fewer vulnerabilities.
Hardening at scale
For organisations with many systems, hardening at scale requires automation:
Configuration management. Ansible, Puppet, Chef enforcing configurations.
Image management. Hardened golden images.
Cloud baselines. AWS Config rules, Azure Policy, GCP Organization Policy.
Continuous compliance. Automated checks against hardening standards.
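A continuous-compliance check is, at its core, a comparison between actual configuration and a baseline. The sketch below checks an `sshd_config`-style file; the three-entry baseline is an illustrative assumption and not an extract from CIS Benchmarks or DISA STIGs, and the parser deliberately ignores `Match` blocks and include files.

```python
# Illustrative baseline only; not taken from a published benchmark.
BASELINE = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
    "x11forwarding": "no",
}

def parse_sshd_config(text):
    """Parse 'Keyword value' lines. Keywords are case-insensitive and
    the first occurrence of a keyword wins."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings

def find_drift(text, baseline=BASELINE):
    """Return {keyword: actual_value} for every setting that differs from
    the baseline; settings absent from the file are reported as '<unset>'."""
    actual = parse_sshd_config(text)
    return {
        key: actual.get(key, "<unset>")
        for key, wanted in baseline.items()
        if actual.get(key) != wanted
    }

sample_config = """
PermitRootLogin yes
# PasswordAuthentication yes
PasswordAuthentication no
"""
drift = find_drift(sample_config)
```

Reporting unset keywords separately from wrong values is a deliberate choice: the effective default depends on the software version, so an absent setting calls for review rather than an automatic pass or fail.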
AI in hardening
Configuration review. LLM analyses configurations against best practices.
Drift remediation. LLM-generated remediation for detected configuration drift.
Custom hardening for specific environments. LLM tailors generic hardening guides to the specific environment.
Documentation. LLM-generated documentation of hardening decisions.
Defensive testing as continuous practice
Modern security operations integrate defensive testing continuously:
Continuous validation. Tools like AttackIQ, SafeBreach, Cymulate continuously test controls.
Purple team exercises. Red and blue teams collaborate on improvements.
Tabletop exercises. Scenario-based readiness assessment.
Crisis simulation. Major incident scenarios practised.
Lessons learned. Continuous improvement from real and simulated events.
The Nepali context for defensive testing
For Nepali enterprises:
Banks. NRB directives drive regular testing; major banks conduct multiple tests annually; some have established purple team practices.
Telecoms. Less regulated than banks; testing varies by institution.
Government. Limited formal testing; npCERT involvement in some cases.
Enterprises. Variable; mature organisations test regularly; many don't.
The MSc graduate building a security career in Nepal will likely participate in both testing programmes (initially as analyst or junior tester, later as test lead or programme manager) and in defensive operations (SOC, vulnerability management, incident response). AI literacy in both contexts is increasingly required rather than differentiating.
Synthesis — the testing/defence loop
Effective security is a continuous loop:
Testing identifies weaknesses
↓
Findings prioritised and assigned
↓
Remediation applied
↓
Verification confirms remediation
↓
Monitoring detects new issues
↓
Testing identifies weaknesses (again)
AI accelerates each step but doesn't change the fundamental loop. The discipline of testing, fixing, and verifying remains. AI helps practitioners cover more ground; the practitioners must still apply judgement, manage stakeholders, and ensure that the testing actually drives improvement.
The next chapter takes up the related discipline of threat intelligence and anomaly detection — where the focus shifts from finding weaknesses in our own systems to understanding threats targeting them and detecting their activity.