Build an AI Code Review Bot with GitHub Actions
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
Two weekends and one prompt template later, you have an AI reviewer on every PR. Here is the build, the prompt that actually works, and the honest take on when CodeRabbit beats DIY - with code you can copy.
The AI code review market in 2026 looks settled from the outside - CodeRabbit raised a Series B, Greptile is the YC darling, PR-Agent owns the open-source mindshare. From the inside it is the opposite. Every senior team I work with has either built their own reviewer, ripped out a SaaS one, or is debating both. The reason is simple: code review is the highest-leverage place to put an LLM in your dev loop, and nobody wants their style guide, severity threshold, or model provider decided by a third party. This post walks through the build, the buy, and the honest tradeoff between them - with a GitHub Action you can copy.
The 2026 landscape
Three vendors dominate the SaaS side. CodeRabbit has the polished product and the deepest GitHub integration - chat-with-the-PR, learnings file, sequence diagrams in the review summary. Greptile bets on whole-repo context: it indexes your codebase and the bot answers review comments with citations to other files. Qodo's PR-Agent (formerly Codium-PR-Agent) is the open-source default, runs as a GitHub Action you self-host, and supports OpenAI, Anthropic, Google, and a dozen other providers behind a unified interface. Bito and Sourcery sit in the long tail with adequate products and aggressive pricing.
DIY became plausible in 2024 and obvious in 2026. The Anthropic SDK ships prompt caching, which means the system prompt and your style guide cost roughly a tenth of the normal input price after the first call. GitHub Actions now have a first-class Anthropic action and the @octokit/rest library covers the Reviews API in one import. The whole thing is roughly 40 lines of TypeScript plus a 20-line YAML workflow. The leverage shifted from infrastructure (which the SaaS vendors do well) to prompt and policy (which only you can do for your team).
Build vs buy - the honest matrix
Most build-vs-buy posts gloss over the actual decision. Here is the version I use with clients. Buy is the right answer for most teams. Build is the right answer for specific ones.
| Dimension | Buy (CodeRabbit / Greptile) | Build (this post) |
|---|---|---|
| Time to first review | 30 minutes | 1 to 2 weekends |
| Cost at 10 devs | $150/mo flat | $30 to $80/mo inference |
| Cost at 100 devs | $1,500/mo flat | $300 to $800/mo inference |
| Privacy posture | Vendor SOC 2, third-party processor | Your GitHub account, your API key |
| Custom rules | Learnings file, limited | Anything you can prompt or RAG |
| Maintenance | None | One engineer-day per quarter |
The crossover is around 60 engineers, but it is heavily skewed by non-cost factors. A 15-engineer fintech with strict data residency rules ships the build path. A 200-engineer mobile shop with no custom-rule needs ships CodeRabbit. The dimension that decides it most often is "how badly do you want your style guide enforced identically across every reviewer."
Architecture
The flow is the same whether you build or buy. A GitHub event fires when a PR opens or pushes. A handler fetches the diff, prepares the context (changed files only, generated files excluded, system prompt with your conventions), and calls an LLM. The model returns a structured list of findings. The handler posts them via the Reviews API as a single review with inline comments. The whole round-trip takes 6 to 25 seconds for a typical diff.
Four design decisions matter at this stage. First: line-level versus file-level comments. Inline comments require you to know the diff position (a brittle number the GitHub API exposes), but they land next to the code and engineers act on them. File-level comments are easier to render and easier to ignore. Pick inline; pay the engineering tax. Second: review-on-every-push or review-on-final-push. Debounce on the synchronize event so the bot only reviews after 60 seconds of git inactivity - otherwise a push-heavy PR triggers 12 reviews. Third: chunking. A 5,000-line diff does not fit in your prompt with the context you want; split per-file and post per-file reviews. Fourth: rate-limiting per repo and per PR to bound your bill if a bug causes runaway reviews.
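The chunking decision can be sketched as a greedy packer. This is a minimal sketch under stated assumptions, not the production script: ChangedFile and chunkFiles are hypothetical names, and character count stands in for token count as a rough proxy (a real version would count tokens). Setting the budget very low reduces it to the one-review-per-file split described above.

```typescript
// Hypothetical shape: the subset of a changed-file entry this sketch needs.
interface ChangedFile {
  filename: string;
  patch?: string; // may be absent, e.g. for binary files
}

// Greedily pack files into chunks whose combined patch size stays under
// maxChars. A single file larger than the budget gets a chunk of its own.
function chunkFiles(files: ChangedFile[], maxChars: number): ChangedFile[][] {
  const chunks: ChangedFile[][] = [];
  let current: ChangedFile[] = [];
  let used = 0;
  for (const file of files) {
    if (!file.patch) continue; // nothing to review
    const size = file.patch.length;
    // Close the current chunk if this file would push it over budget.
    if (current.length > 0 && used + size > maxChars) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(file);
    used += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk then becomes one model call and one posted review, which also gives you a natural place to enforce a per-PR call limit.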
GitHub Action setup
The workflow file lives at .github/workflows/ai-review.yml. It fires on pull-request events, sets the right permissions, and runs a single Node script. Setting concurrency with cancel-in-progress gives you a coarse debounce on the synchronize trigger: GitHub cancels the in-flight run when a newer push arrives for the same PR.
# .github/workflows/ai-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize, reopened]
permissions:
  contents: read
  pull-requests: write
concurrency:
  group: ai-review-${{ github.event.pull_request.number }}
  cancel-in-progress: true
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 22 }
      - run: npm ci
      - run: npx tsx scripts/ai-review.ts
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
That is the entire YAML. GITHUB_TOKEN is provided automatically by Actions and gets its pull-requests: write scope from the permissions block. ANTHROPIC_API_KEY goes in repo secrets. The cancel-in-progress setting cancels the previous run if a new push lands while a review is in flight, which prevents duplicate comments.
The action script
Here is the working reviewer in 40 lines. It fetches the PR diff, sends it to Claude with a tuned system prompt, parses structured findings, and posts an inline review. The full version on a real repo adds retry logic and chunking but the core loop is this.
// scripts/ai-review.ts
import Anthropic from "@anthropic-ai/sdk";
import { Octokit } from "@octokit/rest";
import { z } from "zod";

// SYSTEM_PROMPT is the template string from the prompt-design section below;
// define it above this point in the same file.

// Shape of one finding the model is asked to return.
const Finding = z.object({
  file: z.string(),
  line: z.number().int(),
  severity: z.enum(["info", "warn", "error", "critical"]),
  category: z.enum(["bug", "security", "perf", "clarity", "test"]),
  message: z.string(),
});
const Findings = z.object({ findings: z.array(Finding) });

const [owner, repo] = process.env.REPO!.split("/");
const pull_number = Number(process.env.PR_NUMBER);
const gh = new Octokit({ auth: process.env.GITHUB_TOKEN });
const claude = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Fetch PR metadata and the changed files (per_page defaults to 30; 100 is the maximum).
const { data: pr } = await gh.pulls.get({ owner, repo, pull_number });
const { data: files } = await gh.pulls.listFiles({ owner, repo, pull_number, per_page: 100 });

// Skip generated, vendored, and lockfile content, plus files without a patch.
// This is a loose substring match; tighten it for your repo layout.
const reviewable = files.filter(
  (f) => !/(generated|vendor|dist|\.lock|snap)/i.test(f.filename) && f.patch
);
const diff = reviewable.map((f) => `# ${f.filename}\n${f.patch}`).join("\n\n");

const result = await claude.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 4096,
  system: [
    // cache_control marks the system prompt as a cacheable prefix.
    { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: `PR title: ${pr.title}\n\nDiff:\n${diff}` }],
});

// Claude returns text; parse it as JSON and validate it (Zod throws on drift).
const text = result.content.find((c) => c.type === "text")!.text;
const { findings } = Findings.parse(JSON.parse(text));

// Drop info-level findings and map the rest to inline review comments.
const comments = findings
  .filter((f) => f.severity !== "info")
  .map((f) => ({ path: f.file, line: f.line, body: `**[${f.severity}]** ${f.message}` }));

await gh.pulls.createReview({
  owner,
  repo,
  pull_number,
  event: "COMMENT", // advisory only
  body: `AI review found ${comments.length} item(s).`,
  comments,
});
Three details are worth noticing. The cache_control block on the system prompt is what makes this cheap - Anthropic caches it for 5 minutes, and every subsequent PR in that window pays roughly 10% of the system-prompt token cost. Caching only kicks in above a minimum prompt length (about a thousand tokens for Sonnet; check Anthropic's prompt-caching docs for the current figure), so it pays off once the style guide is in the prompt. The Zod schema is never sent to the model (Claude returns plain text); it validates the JSON we asked the model to produce, and Zod throws if the model drifted from the format. The event: "COMMENT" on the review leaves it as advisory; switch to "REQUEST_CHANGES" once you trust the bot.
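One fragile spot in the core loop is the line number the model reports: as far as I know, GitHub rejects the whole review if any inline comment points at a line that is not part of the diff. A minimal sketch of a guard, assuming the patch text is a standard unified diff starting at the first @@ hunk header; addedLines and dropOutsideDiff are hypothetical helper names, not part of any library.

```typescript
// Collect the new-file line numbers of every added ("+") row in a unified diff patch.
function addedLines(patch: string): Set<number> {
  const lines = new Set<number>();
  let next = 0; // new-file line number of the next non-removed row
  for (const row of patch.split("\n")) {
    const hunk = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/.exec(row);
    if (hunk) {
      next = Number(hunk[1]); // hunk header resets the counter
    } else if (row.startsWith("+")) {
      lines.add(next);
      next += 1;
    } else if (row.startsWith(" ")) {
      next += 1; // context row: present in the new file, but not changed
    }
    // "-" rows and "\ No newline at end of file" do not advance the counter.
  }
  return lines;
}

interface Located {
  file: string;
  line: number;
}

// Keep only findings that point at an added line of a file present in the diff.
function dropOutsideDiff<T extends Located>(findings: T[], patches: Map<string, string>): T[] {
  return findings.filter((f) => {
    const patch = patches.get(f.file);
    return patch !== undefined && addedLines(patch).has(f.line);
  });
}
```

In the script above, the patches map would be built from the reviewable files (filename to patch) and the filter applied to findings before they are mapped to comments.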
Prompt design - what good reviewers ask
The system prompt is 80% of the quality. A bad prompt produces lint output; a good one produces senior-engineer comments. The skeleton I use:
const SYSTEM_PROMPT = `
You are a senior staff engineer reviewing a pull request. Your job is to
catch problems that would embarrass the author in a public code review:
real bugs, security holes, race conditions, broken error handling, missing
test coverage on changed logic, and unclear naming that will confuse the
next reader.
You DO NOT comment on:
- Code style or formatting (a linter handles this)
- Personal opinions about idiom or aesthetics
- Files that did not change
- Generated, vendored, or lockfile content
- Things you cannot see in the diff (only reason about lines shown)
Severity rubric:
- critical: silent data loss, auth bypass, leaked credentials, infinite loops
- error: clear bug under realistic input, regression, broken contract
- warn: fragile pattern, missing test, performance footgun
- info: suggestion (will be filtered out before posting)
Output strictly this JSON, with no prose, no markdown fences:
{ "findings": [ { "file": "...", "line": 42, "severity": "warn",
"category": "bug", "message": "..." } ] }
For each finding the message must (1) state the problem in one sentence,
(2) explain why it is wrong, (3) suggest the specific fix. Aim for under
3 findings per file unless severity is error or critical.
`;
The list of things to NOT comment on does more work than the list of things to look for. Without it the model defaults to a lint persona - formatting nits, opinionated naming, suggestions to add comments to obvious code. With it, the model reads like a senior reviewer with limited time and high standards. The severity rubric is the other lever; without it, every finding shows up as "medium" and the bot becomes white noise. Tool-call design principles apply directly here - see my tool calling best practices post for the broader pattern.
Avoiding noise
Every team that ships an AI reviewer goes through a noise crisis around week two. The bot posts 40 comments on a 200-line PR, the author marks them resolved without reading, and within a month the notifications are muted. Five controls keep the noise floor low.
- Severity threshold. Filter out everything below warn before posting. Info-level findings live in a private log for prompt-tuning, never in PR comments.
- Per-file cap. Hard limit at 3 comments per file unless severity is error or critical. The model will happily find 10; engineers read 2.
- Skip globs. Generated code, vendored deps, lockfiles, snapshots, migrations, fixtures. Maintain a glob list in the script and skip matching files before sending to the model.
- Line-level only. If you cannot point to a specific changed line, you cannot post. File-level comments on a PR get ignored.
- Resolve-on-rewrite. When the next push changes a commented line, the old comment should show as outdated. GitHub handles this for you if you pin the review to the head commit by passing its SHA as commit_id when creating the review.
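The first two controls are plain post-processing and can be sketched in a few lines. This is a minimal sketch, assuming findings shaped like the schema in the script; applyNoiseControls and RANK are hypothetical names, and the defaults (warn threshold, cap of 3) mirror the numbers in the list above.

```typescript
type Severity = "info" | "warn" | "error" | "critical";

interface Finding {
  file: string;
  line: number;
  severity: Severity;
  message: string;
}

const RANK: Record<Severity, number> = { info: 0, warn: 1, error: 2, critical: 3 };

// Apply the severity threshold and the per-file cap.
// Findings at error or critical severity bypass the cap.
function applyNoiseControls(
  findings: Finding[],
  minSeverity: Severity = "warn",
  perFileCap = 3
): Finding[] {
  const perFile = new Map<string, number>();
  return [...findings]
    .sort((a, b) => RANK[b.severity] - RANK[a.severity]) // most severe first
    .filter((f) => {
      if (RANK[f.severity] < RANK[minSeverity]) return false;
      const count = perFile.get(f.file) ?? 0;
      if (count >= perFileCap && RANK[f.severity] < RANK.error) return false;
      perFile.set(f.file, count + 1);
      return true;
    });
}
```

Sorting before capping means the cap is spent on the most severe findings first, so a file with one critical issue and five warnings keeps the critical one and two warnings.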
Cost math
The per-PR cost on Claude Sonnet 4.5 at 2026 pricing ($3/$15 per million in/out tokens) shakes out as follows. A typical 300-line diff is roughly 4,000 input tokens for the diff plus 1,500 for the system prompt. The system prompt caches after the first call. Output is around 800 tokens of structured JSON. That is $0.012 per PR for the diff input, $0.0045 for the system prompt (or about $0.0005 cached), and $0.012 for output - a total of about $0.025 per PR with a warm cache. Call it $0.05 with retry buffer.
A 50-engineer team merging 500 PRs per week burns about $25 per week, or $1,300 per year. Same team on CodeRabbit at $15/dev/month is $9,000 per year. On inference alone the DIY path is cheaper at every team size (about $26 versus $180 per engineer per year on these numbers); a break-even only appears once you price in build and maintenance time, and the SaaS price also covers UX, learnings, and zero maintenance. For deeper LLM cost engineering - caching, batch API, provider selection - see my OpenAI API cost breakdown; the same techniques apply to Anthropic.
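The arithmetic above is easy to rerun with your own numbers. A minimal sketch, assuming the $3/$15 per-million-token rates and the "roughly 10%" cache-read rate quoted in this post; Pricing, costPerPr, and annualCost are hypothetical names, and the token counts are the post's estimates rather than measurements.

```typescript
interface Pricing {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
  cacheReadMultiplier: number; // fraction of the input price paid on a cache hit
}

// Assumed rates, taken from the figures quoted in the text.
const SONNET: Pricing = { inputPerMTok: 3, outputPerMTok: 15, cacheReadMultiplier: 0.1 };

// Cost in USD of one review: diff and output at full price,
// system prompt at the cache-read rate when the cache is warm.
function costPerPr(
  diffTokens: number,
  systemTokens: number,
  outputTokens: number,
  cached: boolean,
  p: Pricing = SONNET
): number {
  const systemRate = cached ? p.inputPerMTok * p.cacheReadMultiplier : p.inputPerMTok;
  const total =
    diffTokens * p.inputPerMTok + systemTokens * systemRate + outputTokens * p.outputPerMTok;
  return total / 1_000_000;
}

// Yearly spend for a team, given PRs per week and a per-PR cost (with any retry buffer).
function annualCost(prsPerWeek: number, perPr: number): number {
  return prsPerWeek * perPr * 52;
}

const warm = costPerPr(4000, 1500, 800, true);
const cold = costPerPr(4000, 1500, 800, false);
console.log(warm.toFixed(4), cold.toFixed(4), annualCost(500, 0.05).toFixed(0));
```

The gap between the warm and cold figures is small because the diff and the output dominate the bill; caching matters more as the system prompt grows with a pasted style guide.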
Custom rules via RAG over your conventions
Once the basic reviewer ships, the next request is always the same: "teach it our style guide." The light version is to paste the guide into the system prompt and let prompt caching absorb the cost. The heavy version is RAG - embed your style docs, ADRs, and past code-review comments, and retrieve the relevant rules per diff. The heavy version wins past about 5,000 words of guidance.
The retrieval step is straightforward: for each changed file, embed a summary of the diff, query your vector store for top-K relevant conventions, and inject them into the system prompt for that file's review. The full architecture mirrors agent-style retrieval - I wrote it up in agentic RAG architecture - and the simpler base pattern lives in my RAG architecture tutorial. For a code-review reviewer, a flat pgvector table of conventions chunked at one rule per row is enough; you do not need a heavyweight vector DB.
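The top-K step can be sketched in memory before committing to a database. This is a minimal sketch, assuming embeddings are already computed by whatever embedding model you use; Convention, cosine, and topKConventions are hypothetical names, and the linear scan stands in for the pgvector query mentioned above.

```typescript
// One row per rule, mirroring the "one rule per row" chunking described above.
interface Convention {
  rule: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the k conventions most similar to the query embedding, best first.
function topKConventions(query: number[], rows: Convention[], k: number): Convention[] {
  return rows
    .map((row) => ({ row, score: cosine(query, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.row);
}
```

The retrieved rules are then joined into the prompt for that file's review. A linear scan like this is fine for a few hundred rules; the database earns its keep once the table outgrows memory or needs to be shared across runners.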
Privacy
The biggest argument for DIY is that you control the data path end-to-end. The action runs inside your GitHub account, your runner (GitHub-hosted or self-hosted), and calls your Anthropic or OpenAI account. Both providers contract not to train on API traffic and delete inputs after 30 days. Nothing touches a third-party intermediary.
SaaS reviewers (CodeRabbit, Greptile, Bito) ship the diff to their infrastructure, which then calls the model. They have SOC 2 reports and similar contracts but the surface area is larger. For regulated industries - fintech, health-tech, defense, anything HIPAA - the DIY path or a self-hosted model (Llama 3.3 70B, DeepSeek-V3, Qwen 2.5-Coder) is the safer call. Self-hosting code review on an in-house GPU is genuinely viable in 2026; quality is roughly 85% of Sonnet on review tasks at near-zero marginal cost.
The 5 things to never auto-comment
Five categories of comment will get your bot muted within a sprint. Every one of them is easy to filter out at the prompt level or in the post-processing step.
- Formatting and style. Prettier and ESLint do this for free. An AI comment that says "consider using consistent spacing" is noise.
- Opinion-based naming. "Consider renaming this to be more descriptive" - no. Either name a concrete problem (shadows a builtin, conflicts with a sibling) or stay quiet.
- Unchanged files. The model loves to suggest improvements to code it can see but did not change. Anchor every comment to a diff position and drop the rest.
- Vendored or generated code. Skip via globs before sending. A review on a generated GraphQL types file teaches the author nothing.
- "Consider adding a test" without specifics. Drop it unless the model can name what the test should assert. Vague test-suggestion comments are pure noise.
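The "skip via globs" item can be sketched with anchored path patterns. This is a minimal sketch with a hypothetical pattern list (SKIP_PATTERNS and isSkippable are illustrative names, and the directory names are assumptions to adapt to your repo). Anchoring on path segments avoids the false positives of a plain substring match, where a file such as src/distance.ts would be skipped because it contains "dist".

```typescript
// Hypothetical skip list: directories, generated-file infixes, lockfiles, snapshots.
const SKIP_PATTERNS: RegExp[] = [
  /(^|\/)(vendor|dist|node_modules|__snapshots__|fixtures|migrations)\//,
  /(^|\/)[^/]*\.generated\.[^/]+$/,
  /\.(lock|snap)$/,
  /(^|\/)(package-lock\.json|pnpm-lock\.yaml)$/,
];

// True if the path should be excluded before the diff is sent to the model.
function isSkippable(path: string): boolean {
  return SKIP_PATTERNS.some((pattern) => pattern.test(path));
}

// Example: filter a list of changed paths down to the reviewable ones.
const changed = ["src/distance.ts", "dist/index.js", "yarn.lock", "src/api/types.generated.ts"];
console.log(changed.filter((p) => !isSkippable(p)));
```

Filtering before the model call keeps skipped files out of the token bill as well as out of the comments.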
SaaS alternatives compared
Five tools cover most of the SaaS market. I have used all of them on at least one client repo; this is the honest snapshot.
| Tool | Best for | Price (per dev/mo) | Tradeoff |
|---|---|---|---|
| CodeRabbit | Polished UX, mid-market | $12 to $24 | Excellent product, opinionated workflow |
| Greptile | Whole-repo context | $20 to $30 | Best for large monorepos, slower per review |
| Qodo PR-Agent | Open-source DIY | Free + your model spend | You configure, you maintain, you pay the bill |
| Bito | Budget option | $8 to $15 | Adequate quality, less polished UI |
| Codium (Qodo) | Test-gen focused | $15 to $25 | Strong on test generation, weaker on review |
Useful external reading: the CodeRabbit GitHub Marketplace listing, Qodo PR-Agent for the open-source reference implementation, and docs.anthropic.com for the prompt-caching and message-API patterns the DIY path depends on.
How this fits the rest of your AI dev loop
An AI reviewer is one node in a broader AI-assisted development stack. The IDE side (Claude Code, Cursor) handles drafting; the reviewer side (this post) handles checking; the deployment side handles safety nets. I compare the IDE tools in Claude Code vs Cursor, and the broader pattern of where to put AI in your workflow belongs to AI workflow automation as a discipline. If you want a senior engineer to set up the whole loop - reviewer, evals, observability, and the org policies around them - the AI integration practice is the right entry point, and you can also hire an AI developer in Kosovo directly for the implementation. I am also the person behind OmniAPI, which uses a near-identical reviewer on its own monorepo.
Ship it on Monday
The minimum-viable rollout is one workflow file, one script, one secret, and one week of comment-only mode. Drop the workflow into .github/workflows/ai-review.yml, put the script at scripts/ai-review.ts, set ANTHROPIC_API_KEY in repo secrets, and open a throwaway PR to test. After a week of comments, look at which findings engineers actioned versus ignored. Tune the severity rubric and the "do not comment on" list. After a month, promote the critical-severity findings to a required status check. After three months, decide whether to add RAG over your conventions or just paste the style guide into the system prompt and let caching absorb the cost.
The full DIY path costs you two weekends and roughly $30 per month per ten engineers. The SaaS path costs you 30 minutes and $150 per month per ten engineers. The decision between them is rarely about the bot itself - it is about whether your reviewer needs to know anything that does not fit in a vendor's opinionated UX.
Frequently asked questions
What is an AI code review tool?
An AI code review tool reads a pull request diff and posts review comments the way a senior engineer would - flagging bugs, unsafe patterns, missing tests, and unclear naming. The good ones in 2026 (CodeRabbit, PR-Agent, Greptile, Bito, Codium) run as a GitHub App or GitHub Action, fetch the diff on every push, send it to an LLM with a tuned prompt, and leave inline comments. You can also build your own in a weekend or two with the Anthropic SDK and @octokit/rest - the entire mechanism is roughly 40 lines of TypeScript.
Should I build my own AI code reviewer or use CodeRabbit?
Buy if you want a polished UX in 24 hours, do not have unusual privacy requirements, and have under ~50 engineers. CodeRabbit at ~$15/dev/month is cheaper than the engineering time to build and maintain a custom reviewer. Build if you have strict data residency rules, want to enforce your own style guide via RAG, need to gate on internal systems (Jira tickets, security scanners, code-ownership graphs), or are a 100+ engineer org where the SaaS pricing crosses $20K/year. The breakeven is around 60 engineers in my experience.
How much does an AI code reviewer cost to run per PR?
With Claude Sonnet on a typical 300-line diff, expect roughly $0.03 to $0.05 per PR once the system prompt is cached. With Opus on a larger 1,500-line diff plus extended thinking, you can hit $1 to $2. A 50-engineer team merging 500 PRs per week burns about $25 per week on Sonnet inference. Caching trims only the system-prompt share of that bill, since the diff and the output are billed at full price. SaaS tools price at $12 to $25 per developer per month and absorb the inference cost - once you count build and maintenance time they win on total cost under ~30 devs, and lose past ~80 devs unless you eat the model bill anyway.
Will the bot leak my private source code to model providers?
Depends on the path you pick. The Anthropic API, OpenAI API, and Vercel AI Gateway all contract to not train on API traffic - your diffs are processed and deleted after a short retention window. CodeRabbit and Greptile have SOC 2 reports and similar contracts. Concerns increase with smaller SaaS vendors. The build-your-own path keeps everything in your GitHub account and your own API key, which is why regulated industries (finance, healthcare, defense) tend to DIY. Self-hosted models on your infrastructure (Llama 3, DeepSeek-V3) are the strictest path and are now viable for code review.
How do you stop an AI reviewer from posting useless noise?
Four controls. First, prompt the model to grade every finding against a severity rubric and drop anything below warn before posting. Second, restrict comments to lines that actually changed in the diff (the GitHub Reviews API rejects inline comments that point outside it). Third, skip generated files, vendored code, lockfiles, and snapshots via a glob list in the script. Fourth, the quiet hack: cap comments per file unless severity is error or critical. A reviewer that posts five critical findings is read; one that posts 50 nits is muted in a week.
Can the AI reviewer learn my team's style guide?
Yes, with two patterns. Light version: paste your style guide into the system prompt (cached, so the cost is negligible). Heavy version: embed your style-guide docs into a vector store and retrieve the relevant rules for each diff - agentic RAG over your conventions. Both work; the second scales better past ~5,000 words of guidance. CodeRabbit and Greptile expose a learnings file you can commit to the repo; the DIY path lets you build the same thing with a JSON file the action reads on startup.
Does it run on every PR or only on demand?
Default: every PR, on every push, with a synchronize-debounce so you do not pay for a review on every keystroke. The on-demand pattern (review only when a maintainer comments /ai-review) is better for repos with high PR churn from external contributors, since otherwise you burn tokens on PRs that will not merge. The GitHub Action pattern in this post supports both: the on-demand variant adds an issue_comment trigger and an if: condition on the job that checks for the /ai-review command.
Can the reviewer block merging or only comment?
Both. Treat it as a comment-only bot for the first month while you tune the prompt and severity threshold - the failure mode of a blocking reviewer with a noisy prompt is engineers disabling it. After tuning, you can promote critical findings to required status checks via the GitHub Checks API. The pattern that works in 2026: AI reviewer comments, humans approve, and a separate security-scanner action blocks for hard issues (secrets, known CVEs). Do not put judgment-call findings on the blocking path.