Jun 04, 2025

    AI Can Write Code. Who Makes Sure It Works?

    Jacob Conger
    Ram Raval

    Over the past year, AI-powered code generation has become commonplace—virtually all developers (97%) in a recent GitHub survey indicated they have used an AI tool at some point. Agentic products such as Cursor, Windsurf, and Claude Code have correspondingly seen rapid adoption, with Cursor alone now reportedly generating nearly 1 billion lines of code daily.

    As development velocity accelerates, attention is increasingly turning to quality assurance (QA). We met with an early-stage CTO who captured the current market sentiment well:

    “I can now ship 5x as much code as I could ship a few years ago, but I don’t have 5x as many people to review it. Humans are thus quickly becoming a bottleneck.”

    So, how do we ensure QA keeps pace with AI-driven code generation? The answer—fittingly—is more AI.

    We have begun to see meaningful adoption of AI tools that target subsequent phases of the software development life cycle (SDLC)—those that occur after code generation, including code review and various forms of testing. Leaders are actively seeking adversarial counterparts to their agentic code generation tools. These systems are purpose-built for finding issues, designed to rein in the systems that prioritize pushing code. Headline’s CTO, Conrad Chu, has felt this need for our engineering team:

    “We use Claude Code for vibecoding and this necessarily requires a different solution to evaluate and quality check the 50k lines of code we just generated.”

    Market Landscape

    We illustrate the players we see redefining software QA and testing with AI-native solutions in the graphic below:

    [Image: Market landscape of AI-native QA & testing software, by Headline]

    Defining Key Categories with AI-Native Players

    • Coding Agents: Agentic systems and integrated development environments (IDEs) with linting
    • Unit Testing: Testing individual functions and modules in isolation
    • Code Review: Inspection of submitted pull requests for code quality and potential bugs
    • Static Application Security Testing (SAST) & Remediation: Finding and fixing security vulnerabilities
    • Integration Testing: Ensuring proper interaction of multiple services
    • End-to-End Testing: Simulating real user flows across the full application
    • Resilience & Scalability Testing: Evaluating system stability under stress and failure conditions
    • Fuzz & Penetration Testing: Probing systems to uncover exploitable security flaws

    Today’s AI testing tools mostly improve existing workflows rather than reinvent them. They still align with the traditional spectrum—from code-focused, white-box testing (like static analysis and unit tests) to system-level, black-box testing (like end-to-end and load testing), with integration and penetration testing falling in the gray-box middle.

    Each testing type relies on different kinds of context: white-box requires code understanding, gray-box needs system-level knowledge, and black-box demands knowledge of expected behaviors. These contexts shape the kind of AI agents needed for the task—coding agents may suffice for unit tests, while end-to-end tests might call for browser-based agents.

    What This Looks Like in Practice Today

    Coding Agents: Today, many engineering teams (including our own) extensively leverage agentic code generation tools such as Cursor or Claude Code to not only write code, but to perform initial testing and QA steps as well. Coding agents can autonomously identify and resolve linting errors, and teams often instruct them to generate unit tests as well.
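
    To make the unit-testing piece concrete, below is a minimal sketch of the kind of test an agent might be instructed to generate. The slugify helper and its cases are hypothetical stand-ins for illustration, not output from any specific tool:

        # Hypothetical example of an agent-generated unit test (pytest).
        # slugify stands in for whatever function the agent was asked to cover.
        import re

        import pytest


        def slugify(title: str) -> str:
            """Lowercase a title and collapse non-alphanumeric runs into single hyphens."""
            return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


        @pytest.mark.parametrize(
            ("title", "expected"),
            [
                ("Hello World", "hello-world"),
                ("  Spaces  everywhere ", "spaces-everywhere"),
                ("Symbols & Punctuation!", "symbols-punctuation"),
                ("", ""),
            ],
        )
        def test_slugify(title, expected):
            assert slugify(title) == expected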

    Code Review: AI code review tools like Greptile and CodeRabbit are growing in popularity. These solutions act as first-pass reviewers, capable of catching significant bugs that could otherwise reach production after slipping past local linters and unit tests. Many engineering leaders suggest that these solutions have gotten “surprisingly good” and can be more effective than human reviewers, especially junior ones.

    Coding Agents + Code Review: Taking it a step further, the most forward-thinking teams are integrating their code review systems with their coding agent systems via MCP, creating an almost fully agentic closed-loop system. In this loop, an AI code review tool like Greptile can flag potential bugs, and a coding agent like Claude Code can then autonomously modify PRs to address them.
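
    As an illustration of that closed loop, here is a hedged sketch in Python: it pulls inline review comments from a pull request via the GitHub API and hands them to a coding agent running headlessly. The repository, PR number, and the `claude -p` invocation are assumptions for illustration; real integrations typically run through MCP servers or CI rather than a one-off script.

        # Hypothetical closed-loop sketch: feed review comments back to a coding agent.
        # Assumes a GitHub PR, a GITHUB_TOKEN env var, and a headless agent CLI
        # (here Claude Code's print mode, `claude -p`); adjust for your own tooling.
        import os
        import subprocess

        import requests

        OWNER, REPO, PR_NUMBER = "acme", "webapp", 1234  # hypothetical repo and PR


        def fetch_review_comments() -> list[str]:
            """Pull inline review comments (e.g., ones left by an AI reviewer) from the PR."""
            url = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}/comments"
            resp = requests.get(
                url, headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
            )
            resp.raise_for_status()
            return [f"{c['path']}:{c.get('line', '?')}: {c['body']}" for c in resp.json()]


        def main() -> None:
            comments = fetch_review_comments()
            if not comments:
                print("No review comments to address.")
                return
            prompt = (
                "Address the following code review comments and update the tests:\n"
                + "\n".join(comments)
            )
            # Hand the findings to the coding agent in headless mode; it edits the
            # working tree, after which CI and the reviewer re-evaluate the PR.
            subprocess.run(["claude", "-p", prompt], check=True)


        if __name__ == "__main__":
            main()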

    Subsequent Stages: We are still in the early innings of leveraging AI tools for subsequent phases of testing and QA, such as end-to-end testing. We believe this is at least partially because productized solutions are still maturing—in end-to-end testing, for example, QA Wolf is successfully bridging that gap with an AI-native services model.

    What This Will Look Like Tomorrow

    How the SDLC ultimately evolves—and the eventual role of humans in the workflow—remains unknown. Even so, most engineering leaders agree on a few critical points:

    • Going Earlier: The feedback loop between code generation and testing/QA is tightening and increasingly shifting left. A common question arises: “Why wait until code review?” Historically, efforts to surface bugs earlier—such as in the IDE—have had mixed results, with developers often feeling overwhelmed by “too much noise.” But in an AI-first world, this changes. AI can now autonomously filter that noise, identifying and even remediating bugs before they ever reach the pull request stage.
    • Going Parallel: Given the sheer volume of noise to filter and tests to run, parallelization is becoming a requirement. We see this taking shape in end-to-end testing in particular, where companies such as Propolis are seeking to deploy swarms of browser agents that run potentially hundreds of simulated user sessions in parallel to identify bugs (see the sketch after this list).
    • Getting Better Context: Current market sentiment suggests that QA/testing tools are getting better at analyzing code in isolation—but still fall short of evaluating it in full context. Engineering leaders consistently want tools that can assess code against both the broader codebase and the original requirements. Delivering this will likely require tighter integration with Git repositories (as highlighted by Codex’s recent launch) and ticketing systems like Jira.
    • Utilizing Telemetry: Shipping code is only the beginning—the real challenge is ensuring it runs reliably in production. Teams have historically underutilized telemetry to address performance bottlenecks before they become incidents. That’s starting to change, as teams work to create a tighter loop between telemetry and code. Our portfolio company Honeycomb recently acquired Grit, citing its ability to help answer the question, “Why isn’t my software doing what I expect?” Emerging tools like Digma are similarly using telemetry to autonomously optimize code.
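
    On the parallelization point above, here is a minimal sketch of what running many simulated user sessions concurrently can look like, using Playwright’s async API. The target URL and user flows are hypothetical placeholders; a product like Propolis would drive far richer agentic behavior within each session.

        # Minimal sketch of parallel browser sessions (Playwright async API).
        # BASE_URL and USER_FLOWS are hypothetical stand-ins for real user journeys.
        import asyncio

        from playwright.async_api import async_playwright

        BASE_URL = "https://staging.example.com"  # hypothetical test environment
        USER_FLOWS = [f"flow-{i}" for i in range(20)]  # simulated user journeys


        async def run_session(browser, flow_id: str) -> str:
            """Run one isolated user session and report whether it hit any page errors."""
            context = await browser.new_context()  # fresh cookies/storage per simulated user
            page = await context.new_page()
            errors: list[str] = []
            page.on("pageerror", lambda exc: errors.append(str(exc)))  # capture JS errors
            await page.goto(f"{BASE_URL}/?flow={flow_id}")
            await context.close()
            return f"{flow_id}: {'FAIL: ' + errors[0] if errors else 'ok'}"


        async def main() -> None:
            async with async_playwright() as p:
                browser = await p.chromium.launch()
                # Fan the sessions out concurrently; in practice you would also shard
                # across machines and feed failures back to a triage agent.
                results = await asyncio.gather(*(run_session(browser, f) for f in USER_FLOWS))
                await browser.close()
            print("\n".join(results))


        asyncio.run(main())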

    Key Questions

    As we look to invest against this thesis, we have a number of open questions around how the ecosystem will evolve. Two of the biggest for us:

    • Need for separate adversarial systems: Engineering leaders today suggest having distinct systems for code generation and testing/QA is critical—there may be limitations to a system evaluating itself, and each use case requires different prompting, context, and levels of synchrony. But as the market matures, will parts of testing and QA become absorbed into broader ecosystems like Claude or Codex? We’re already seeing these coding agents handle local, code-based tasks like unit testing, and Codex’s cloud-native architecture and tight GitHub integration indicate OpenAI is looking beyond the local IDE—toward cloud-centric, end-to-end developer workflows. Companies such as Qodo are already publicly building toward this vision of a single platform covering code generation, testing, and review.
      This raises important questions: Will developers eventually adopt unified ecosystems offering both code generation and QA? Or will standalone, purpose-built tools dominate? Does the answer change based on the type of workflow—white-box vs. black-box, local vs. cloud?
    • Evolution of the SDLC: The SDLC as we know it may look very different in the coming years, as it shifts away from human-centered workflows toward agent-driven automation. In a world where AI agents can autonomously detect and fix bugs in real time as code is written, will code reviews still be a necessary step? Could tests like end-to-end testing move much earlier in the development process? And how might CI/CD pipelines evolve to reflect this new sequencing? Will intelligent test selection become increasingly critical to manage costs?

    Despite the open questions, we’re bullish on AI’s ability to reimagine software QA and testing. The business case for automation is clear, and while past efforts have taken many forms, we believe agents are the technological unlock that will finally make it possible at scale. We are excited to invest in the category—if you are building in any of these areas, please get in touch with us by reaching out to jacob@headline.com and ram@headline.com.
