Detect Duplicate PRs With Vector Embeddings
Detect duplicate pull requests across large repos using vector embeddings and LLM reasoning. Find goal duplication even across different files.
What You'll Learn
- How to detect duplicate pull requests across large open-source repositories using vector embeddings and LLM reasoning
- Techniques for compressing and filtering massive code changes to fit within API budgets
- How to classify redundant PRs into actionable categories (SHADOW, SUPERSET, COMPETING)
- Practical strategies for managing AI API rate limits on a zero-dollar budget
- Common pitfalls in semantic code analysis and how to overcome them
Introduction: The Duplicate PR Problem
High-traffic open-source repositories face an invisible cost: maintainers drowning in duplicate pull requests. AI coding agents amplify this problem—they generate working solutions at scale, but often solve the same problem multiple times across different files, architectures, and approaches. A recent analysis of 200 PRs in shadcn-ui/ui revealed 69 valid redundancies, many solving identical functional failures through completely different code paths. This tutorial walks you through building a duplicate PR detection system that identifies semantic redundancy, not just code clones.
Why This Matters
Maintainers spend hours triaging PRs that solve problems already fixed. The system described here detects goal duplication—when three different PRs modify config files, utilities, and registries, all to fix the same broken documentation link. This is the most expensive type of noise to filter manually because the changes are architecturally sound but functionally redundant. By automating this detection, maintainers reclaim hours and contributors get faster feedback.
Prerequisites
- API Access: Free or trial accounts for Gemini embeddings, an LLM provider (Gemini, Llama 2 via OpenRouter), and a vector database (Upstash Vector recommended for free tier)
- Development Environment: Node.js 18+, TypeScript (optional but recommended)
- GitHub Access: A GitHub personal access token with read access to public repositories
- Repository Knowledge: Familiarity with REST APIs, vector databases, and basic LLM concepts
- Token Budget Awareness: Understanding of API rate limits and free-tier constraints
Step-by-Step Guide
Phase 1: Data Ingestion & Context Compression
The first bottleneck is volume. A single large PR can contain thousands of lines across multiple files. Without aggressive filtering, you'll exhaust API budgets before analyzing 50 PRs. The strategy is to extract only semantic signal—the code changes that matter—and discard noise.
Step 1.1: Set up GitHub API access
Install Octokit, the official GitHub API client:
npm install @octokit/rest dotenv
Create a .env file with your GitHub token:
GITHUB_TOKEN=ghp_xxxxxxxxxxxxx
REPO_OWNER=shadcn-ui
REPO_NAME=ui
Initialize the client to paginate through recent PRs:
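require("dotenv").config(); // load GITHUB_TOKEN, REPO_OWNER, REPO_NAME from .env
const { Octokit } = require("@octokit/rest");
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
// The await calls in this tutorial assume they run inside an async function,
// for example: async function main() { ... }
// paginate() walks every page of results, so expect many requests on a busy repo.
const prs = await octokit.paginate("GET /repos/{owner}/{repo}/pulls", {
  owner: process.env.REPO_OWNER,
  repo: process.env.REPO_NAME,
  state: "closed",
  per_page: 100,
});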
Step 1.2: Filter and compress PR content
Not all files matter. SVGs, lockfiles, node_modules entries, and auto-generated documentation add noise without signal. Extract only meaningful code changes:
const IGNORE_PATTERNS = [
  /\.svg$/,
  /package-lock\.json$/,
  /yarn\.lock$/,
  /node_modules/,
  /\.md$/,
];

function isRelevantFile(filename) {
  return !IGNORE_PATTERNS.some((pattern) => pattern.test(filename));
}

async function compressPRContent(pr) {
  const files = await octokit.paginate(
    "GET /repos/{owner}/{repo}/pulls/{pull_number}/files",
    {
      owner: process.env.REPO_OWNER,
      repo: process.env.REPO_NAME,
      pull_number: pr.number,
      per_page: 100,
    }
  );
  let compressedContent = "";
  for (const file of files) {
    if (!isRelevantFile(file.filename)) continue;
    // Extract only the diff hunks (+ and - lines)
    const patch = file.patch || "";
    const changes = patch
      .split("\n")
      .filter((line) => line.startsWith("+") || line.startsWith("-"))
      .filter((line) => !line.startsWith("+++") && !line.startsWith("---"))
      .join("\n");
    if (changes.length > 0) {
      compressedContent += `File: ${file.filename}\n${changes}\n\n`;
    }
  }
  // If content still exceeds 1500 chars, truncate to the first 1500
  if (compressedContent.length > 1500) {
    compressedContent = compressedContent.substring(0, 1500);
  }
  return compressedContent;
}
Step 1.3: Build a metadata index
Store PR metadata alongside compressed content for later reference. This becomes your audit trail:
const prIndex = [];
for (const pr of prs.slice(0, 200)) {
  const compressed = await compressPRContent(pr);
  prIndex.push({
    number: pr.number,
    title: pr.title,
    author: pr.user.login,
    created_at: pr.created_at,
    merged_at: pr.merged_at,
    compressed_content: compressed,
  });
}
console.log(`Indexed ${prIndex.length} PRs with compression applied.`);
Phase 2: Vector Embeddings & Semantic Search
Raw text similarity (string matching) misses the point. Two PRs solving the same bug in different ways look completely different as strings. Vector embeddings convert semantic meaning into high-dimensional space, allowing you to find similar problems even when code syntax differs.
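To make "similar in high-dimensional space" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. In this pipeline the vector database computes the score for you, and the metric it uses depends on how the index is configured, so treating it as cosine is an assumption. The function name `cosineSimilarity` is illustrative.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; higher means the vectors point in a more similar direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Two PR embeddings that score close to 1 are candidates for the same underlying change, regardless of which files they touch.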
Step 2.1: Generate embeddings with Gemini
Install the Google AI SDK:
npm install @google/generative-ai
Initialize the embedding model and create vectors for each PR:
const { GoogleGenerativeAI } = require("@google/generative-ai");
const client = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

async function generateEmbedding(text) {
  const model = client.getGenerativeModel({
    model: "embedding-001",
  });
  const result = await model.embedContent(text);
  return result.embedding.values;
}

const embeddedIndex = [];
for (const item of prIndex) {
  // Fall back to the title when a PR has no relevant code changes left after filtering
  const embedding = await generateEmbedding(item.compressed_content || item.title);
  embeddedIndex.push({
    ...item,
    embedding,
  });
  console.log(`Embedded PR #${item.number}`);
}
Step 2.2: Index vectors in Upstash
Install the Upstash Vector SDK:
npm install @upstash/vector
Create a vector index in Upstash (free tier available), making sure its dimension matches the length of your embedding vectors, then upsert your embeddings:
const { Index } = require("@upstash/vector");
const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN,
});

for (const item of embeddedIndex) {
  await index.upsert({
    id: `pr_${item.number}`,
    vector: item.embedding, // the Upstash SDK expects the field name "vector"
    metadata: {
      pr_number: item.number,
      title: item.title,
      author: item.author,
    },
  });
}
console.log("Vectors indexed in Upstash.");
Step 2.3: Query for similar PRs
For each PR, retrieve the 8 nearest vectors. The PR's own vector is normally among them, so it is filtered out and up to 7 candidates remain:
async function findSimilarPRs(prNumber, embedding) {
  const results = await index.query({
    vector: embedding,
    topK: 8,
    includeMetadata: true,
  });
  // Exclude the PR itself from results
  return results.filter((r) => r.metadata.pr_number !== prNumber);
}

for (const item of embeddedIndex) {
  const similar = await findSimilarPRs(item.number, item.embedding);
  item.similar_candidates = similar;
}
Phase 3: LLM-Driven Reasoning & Classification
The LLM doesn't do duplicate detection—it does intent analysis. Given two PRs and their semantic similarity score, the model reasons about what problem each solves and classifies the relationship. This is the critical filtering step that separates true duplicates from false alarms.
Step 3.1: Set up multi-provider LLM access
Install axios to call each provider's HTTP API directly and handle fallbacks yourself:
npm install axios
Create a resilient LLM router that pivots between providers on rate limits:
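const axios = require("axios");

// Each provider builds its own request and extracts the reply text,
// because Gemini and OpenRouter use different payload shapes and auth headers.
const LLM_PROVIDERS = [
  {
    name: "Gemini",
    url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent",
    headers: () => ({ "x-goog-api-key": process.env.GEMINI_API_KEY }),
    body: (prompt) => ({ contents: [{ parts: [{ text: prompt }] }] }),
    parse: (data) => data.candidates?.[0]?.content?.parts?.[0]?.text,
  },
  {
    name: "OpenRouter",
    url: "https://openrouter.ai/api/v1/chat/completions",
    headers: () => ({ Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}` }),
    // OPENROUTER_MODEL is an assumed env var holding the model ID you want to use
    body: (prompt) => ({
      model: process.env.OPENROUTER_MODEL,
      messages: [{ role: "user", content: prompt }],
    }),
    parse: (data) => data.choices?.[0]?.message?.content,
  },
];

let currentProvider = 0;

async function callLLMWithFallback(prompt, maxRetries = 3) {
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const provider = LLM_PROVIDERS[currentProvider];
      console.log(`Attempt ${attempt + 1} with ${provider.name}...`);
      const response = await axios.post(provider.url, provider.body(prompt), {
        headers: provider.headers(),
        timeout: 10000,
      });
      // Return the model's reply as plain text
      return provider.parse(response.data);
    } catch (error) {
      lastError = error;
      if (error.response?.status === 429 || error.response?.status === 503) {
        // Rate limited or service unavailable: try the next provider
        currentProvider = (currentProvider + 1) % LLM_PROVIDERS.length;
        const backoff = Math.pow(2, attempt) * 1000;
        console.log(`Rate limited. Waiting ${backoff}ms before retry...`);
        await new Promise((resolve) => setTimeout(resolve, backoff));
      }
    }
  }
  throw new Error(
    `All LLM providers exhausted. Last error: ${lastError.message}`
  );
}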
Step 3.2: Analyze PR intent and classify redundancy
For each pair of similar PRs, construct a reasoning prompt that forces the model to explicitly state the functional goal of each PR:
async function analyzeRedundancy(pr1, pr2)
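The body of `analyzeRedundancy` is not shown above. The following is a minimal sketch of what it could look like, not the original implementation. It assumes the LLM call resolves to the model's reply as a string, and that the reply is JSON with the fields the report step reads (`are_duplicates`, `confidence`, `category`, `functional_goal_pr1`). The helper names `buildRedundancyPrompt` and `parseRedundancyResponse`, the prompt wording, and the optional third parameter are all illustrative. The category definitions are left out of the prompt on purpose; add your own wording for each one.

```javascript
const CATEGORIES = ["SHADOW", "SUPERSET", "COMPETING"];

// Build a prompt that forces the model to state each PR's goal before judging.
function buildRedundancyPrompt(pr1, pr2) {
  return [
    "You are reviewing two pull requests from the same repository.",
    "First state the functional goal of each PR in one sentence.",
    "Then decide whether both PRs pursue the same goal.",
    `If they do, pick one category: ${CATEGORIES.join(", ")}.`,
    "Respond with JSON only, using exactly these keys:",
    '{"functional_goal_pr1": string, "functional_goal_pr2": string,',
    ' "are_duplicates": boolean, "category": string or null,',
    ' "confidence": number between 0 and 1}',
    "",
    `PR #${pr1.number}: ${pr1.title}`,
    pr1.compressed_content,
    "",
    `PR #${pr2.number}: ${pr2.title}`,
    pr2.compressed_content,
  ].join("\n");
}

// Pull the first JSON object out of the reply and validate the fields we rely on.
// Returns null when the reply cannot be trusted.
function parseRedundancyResponse(text) {
  if (typeof text !== "string") return null;
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    const parsed = JSON.parse(text.slice(start, end + 1));
    if (typeof parsed.are_duplicates !== "boolean") return null;
    if (typeof parsed.confidence !== "number") return null;
    if (parsed.are_duplicates && !CATEGORIES.includes(parsed.category)) return null;
    return parsed;
  } catch {
    return null;
  }
}

// callLLM defaults to the router from Step 3.1 (assumed to be in scope);
// pass your own function to test without network access.
async function analyzeRedundancy(pr1, pr2, callLLM = callLLMWithFallback) {
  const reply = await callLLM(buildRedundancyPrompt(pr1, pr2));
  return parseRedundancyResponse(reply);
}
```

Returning `null` for malformed replies lets the caller skip a pair instead of crashing partway through a long run.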
Step 3.3: Build the redundancy report
Run the full analysis and aggregate results:
const redundancyReport = [];
for (const item of embeddedIndex) {
  for (const candidate of item.similar_candidates) {
    // Look up the candidate's full record (title, compressed content) by PR number
    const other = embeddedIndex.find(
      (e) => e.number === candidate.metadata.pr_number
    );
    if (!other) continue; // candidate is not part of this run's index
    const analysis = await analyzeRedundancy(item, other);
    if (analysis && analysis.are_duplicates && analysis.confidence > 0.7) {
      redundancyReport.push({
        primary_pr: item.number,
        duplicate_pr: candidate.metadata.pr_number,
        category: analysis.category,
        confidence: analysis.confidence,
        functional_goal: analysis.functional_goal_pr1,
      });
    }
  }
}
console.log(
  `Found ${redundancyReport.length} high-confidence duplicates across ${embeddedIndex.length} PRs.`
);
console.log(JSON.stringify(redundancyReport, null, 2));
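The report is a flat list of pairs, but duplicates often come in groups of three or more. One way to surface those groups is to merge the pairs into clusters with a small union-find structure. This is an illustrative addition, not part of the original pipeline, and `clusterDuplicates` is a hypothetical helper name.

```javascript
// Merge pairwise duplicate findings into clusters of related PR numbers.
// Each report entry is expected to carry primary_pr and duplicate_pr.
function clusterDuplicates(report) {
  const parent = new Map();

  // Find the representative of x, registering x on first sight.
  const find = (x) => {
    if (!parent.has(x)) parent.set(x, x);
    let root = x;
    while (parent.get(root) !== root) root = parent.get(root);
    parent.set(x, root); // shortcut for later lookups
    return root;
  };

  for (const { primary_pr, duplicate_pr } of report) {
    const a = find(primary_pr);
    const b = find(duplicate_pr);
    if (a !== b) parent.set(a, b);
  }

  const clusters = new Map();
  for (const pr of parent.keys()) {
    const root = find(pr);
    if (!clusters.has(root)) clusters.set(root, []);
    clusters.get(root).push(pr);
  }
  // Sort PR numbers inside each cluster for stable output.
  return [...clusters.values()].map((c) => c.sort((x, y) => x - y));
}
```

Calling `clusterDuplicates(redundancyReport)` after the loop gives maintainers one entry per underlying problem rather than one per pair.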
Real-World Results: What the System Found
Running this pipeline on shadcn-ui/ui's 200 most recent PRs identified 69 valid redundancies. The most interesting case involved three separate PRs targeting the same broken /blocks page link:
- PR #10156: Simple URL replacement in config file
- PR #10088: Path normalization in utilities
- PR #10096: Reference key refactoring in registry
All three solved identical functional failures, but in completely different files using different strategies. The vector embedding phase connected them despite architectural differences, and the LLM phase confirmed they were solving the same problem. This is "goal duplication"—the most expensive noise for maintainers to filter manually because the changes look legitimate in isolation.
Troubleshooting Common Issues
Issue 1: Vector Similarity Misses Architectural Differences
Symptom: The system flags two PRs as semantically similar, but they're solving completely different problems in the same file.
Root Cause: Embeddings of compressed diffs are strongly influenced by surface structure (similar code shape) and can miss intent. Two different JSON additions to the same registry might have identical structure but target different problems.
Solution: Tighten the candidate filter before calling the LLM. Raise the minimum similarity score, query only the top 5 candidates instead of 8, and increase the LLM confidence threshold to 0.75 or higher. You'll miss some duplicates but cut false positives substantially.
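A stricter candidate filter could be sketched as follows. It assumes each query result carries a numeric `score` field, as Upstash Vector results do. The helper name `selectCandidates` and the default cutoff of 0.8 are assumptions to tune against your own data.

```javascript
// Keep only the strongest matches before spending LLM calls on them.
// minScore and maxCandidates are illustrative defaults, not values from the original run.
function selectCandidates(results, { minScore = 0.8, maxCandidates = 5 } = {}) {
  return results
    .filter((r) => typeof r.score === "number" && r.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxCandidates);
}
```

Applying this to the output of `findSimilarPRs` trades some recall for fewer, higher-quality LLM comparisons.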
Issue 2: Smaller Models Show Structural Bias
Symptom: After hitting quota limits and falling back to a smaller 8B model, the system flags new registry entries as duplicates just because they look similar structurally (same JSON shape).
Root Cause: Smaller models struggle to weigh semantic content (URL values, IDs) over syntactic structure (JSON nesting). They pattern-match on shape rather than meaning.
Solution: Avoid smaller models for this task. If you must use them due to budget constraints, add a pre-processing step that emphasizes literal values in your prompts: "Focus on URLs, IDs, and exact values. Ignore structural similarity."
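One possible pre-processing step is to pull URLs and quoted string values out of each diff and list them explicitly in the prompt. The sketch below is a rough regex heuristic, not the original author's method; `extractLiterals` and `literalHint` are hypothetical names, and apostrophes in prose can produce spurious matches.

```javascript
// Collect URLs and quoted string values from a compressed diff.
function extractLiterals(diffText) {
  const found = new Set();
  // URLs first, so they appear at the top of the list.
  for (const m of diffText.matchAll(/https?:\/\/[^\s"'`)]+/g)) {
    found.add(m[0]);
  }
  // Then anything inside double or single quotes on a single line.
  for (const m of diffText.matchAll(/"([^"\n]+)"|'([^'\n]+)'/g)) {
    found.add(m[1] ?? m[2]);
  }
  return [...found];
}

// Turn the literals into a short prompt section for one PR.
function literalHint(pr) {
  const literals = extractLiterals(pr.compressed_content);
  if (literals.length === 0) return "";
  return `Literal values changed in PR #${pr.number}:\n- ${literals.join("\n- ")}`;
}
```

Appending `literalHint(pr1)` and `literalHint(pr2)` to the prompt, along with the instruction quoted above, gives a smaller model concrete values to compare instead of only JSON shape.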
Issue 3: Rate Limiting Exhausts Budget Mid-Analysis
Symptom: You complete 50 PRs and then hit 429 (Too Many Requests) errors, with no fallback.
Root Cause: Free-tier APIs have aggressive rate limits. A single provider isn't enough.
Solution: Implement the multi-provider router shown in Phase 3. Rotate between Gemini and models served through OpenRouter (such as Llama). Add exponential backoff (1s, 2s, 4s) before retries. Cache LLM responses to avoid re-analyzing the same pair of PRs.
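Caching by pair can be sketched with an in-memory map keyed on the two PR numbers in sorted order, so that (A, B) and (B, A) share one entry. The names `pairKey` and `analyzeOnce` are hypothetical, and the cache lives only for the duration of one run; persisting it is left out.

```javascript
const analysisCache = new Map();

// Order-independent key so (A, B) and (B, A) map to the same entry.
function pairKey(a, b) {
  return a < b ? `${a}:${b}` : `${b}:${a}`;
}

// Run analyze(pr1, pr2) at most once per unordered pair.
// The promise itself is cached, so concurrent callers share one request.
function analyzeOnce(pr1, pr2, analyze) {
  const key = pairKey(pr1.number, pr2.number);
  if (!analysisCache.has(key)) {
    const pending = Promise.resolve()
      .then(() => analyze(pr1, pr2))
      .catch((err) => {
        analysisCache.delete(key); // allow a later retry after a failure
        throw err;
      });
    analysisCache.set(key, pending);
  }
  return analysisCache.get(key);
}
```

In the report loop, `analyzeOnce(item, other, analyzeRedundancy)` would replace the direct call. Note that a cached result keeps the PR order of the first call, so fields such as `functional_goal_pr1` refer to whichever PR was passed first.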
Issue 4: Large PRs Dominate the Vector Index
Symptom: A single large PR with 5,000 lines truncated to 1,500 chars produces weak embeddings; semantic information is lost.
Root Cause: Truncation loses context. Cutting at 1,500 characters keeps only whatever files happen to come first in the diff, so most of a 5,000-line PR never reaches the embedding model.
Solution: Slice large PRs by file, not by character count. Generate separate embeddings for each modified file, then aggregate similarity scores. This preserves semantic information across large changes.
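A per-file approach could be sketched like this. The source does not say how to aggregate per-file scores, so the rule used here (for each file in PR A take its best match in PR B, then average) is an assumption, and it is not symmetric. `chunkByFile` and `aggregateSimilarity` are hypothetical names.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Build one text chunk per changed file instead of one truncated blob per PR.
// files uses the shape returned by the GitHub "list pull request files" endpoint.
function chunkByFile(files, maxCharsPerFile = 1500) {
  return files
    .filter((f) => f.patch) // binary files have no patch
    .map((f) => ({
      filename: f.filename,
      text: `File: ${f.filename}\n${f.patch}`.slice(0, maxCharsPerFile),
    }));
}

// Aggregate per-file vectors into one PR-level score:
// average of each A-file's best match among B's files.
function aggregateSimilarity(vectorsA, vectorsB) {
  if (vectorsA.length === 0 || vectorsB.length === 0) return 0;
  const best = vectorsA.map((va) =>
    Math.max(...vectorsB.map((vb) => cosine(va, vb)))
  );
  return best.reduce((sum, x) => sum + x, 0) / best.length;
}
```

Each chunk's `text` would be embedded separately. Keep in mind that this multiplies embedding and vector-store operations by the number of files per PR, which matters on free tiers.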
Best Practices
1. Cache Embeddings Aggressively
Generating embeddings costs tokens. Once created, save them to persistent storage (PostgreSQL, DynamoDB). Re-use cached embeddings for repeat analyses. Store both the raw vector and metadata (PR number, title, author) together.
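As a lightweight stand-in for a database, the sketch below caches embeddings in a local JSON file keyed by a SHA-256 hash of the input text. The JSON file is a swapped-in simplification of the PostgreSQL or DynamoDB storage suggested above, and `createEmbeddingCache` is a hypothetical helper that wraps any async embedding function.

```javascript
const crypto = require("crypto");
const fs = require("fs");

// Wrap an async embed(text) function with a file-backed cache.
// Identical text always maps to the same key, so unchanged PRs are never re-embedded.
function createEmbeddingCache(filePath, embed) {
  const store = fs.existsSync(filePath)
    ? JSON.parse(fs.readFileSync(filePath, "utf8"))
    : {};

  return async function cachedEmbed(text) {
    const key = crypto.createHash("sha256").update(text).digest("hex");
    if (!store[key]) {
      store[key] = await embed(text);
      // Rewrite the whole file on each miss; fine for a few hundred PRs.
      fs.writeFileSync(filePath, JSON.stringify(store));
    }
    return store[key];
  };
}
```

With the pipeline above, `const cachedEmbed = createEmbeddingCache("embeddings.json", generateEmbedding);` would let repeat runs skip the embedding API for PRs whose compressed content has not changed.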
2. Validate High-Confidence Results Manually
Even at 0.9 confidence, LLM classifications can be wrong. For the top 10-20 flagged duplicates, manually verify the analysis. This builds trust with maintainers and improves your feedback loop for future runs.
3. Monitor for Hallucination in Wide Sweeps
When analyzing large repositories (500+ PRs), the system sometimes identifies false connections between unrelated changes that happen to touch the same package. Combat this by:
- Requiring high similarity scores (top-3 candidates only, not top-8)
- Increasing LLM confidence threshold to 0.8+
- Adding a "false positive penalty" in scoring when PRs modify different core files
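The "false positive penalty" could be sketched as a small deduction applied when two PRs share no files. This assumes you also store each PR's list of changed filenames, which the index built earlier does not include. The helper names and the penalty size of 0.1 are illustrative. Keep the penalty small: the goal duplicates this system is built to find often live in different files, so a large penalty would filter out exactly those cases.

```javascript
// Jaccard overlap between two lists of filenames: shared / total distinct.
function fileOverlap(filesA, filesB) {
  const a = new Set(filesA);
  const b = new Set(filesB);
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const f of a) {
    if (b.has(f)) shared++;
  }
  return shared / (a.size + b.size - shared);
}

// Reduce the LLM's confidence when the two PRs touch entirely different files.
function penalizedConfidence(confidence, filesA, filesB, penalty = 0.1) {
  if (fileOverlap(filesA, filesB) > 0) return confidence;
  return Math.max(0, confidence - penalty);
}
```

A pair would then be reported only if `penalizedConfidence(...)` still clears the 0.8 threshold.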
4. Segment by PR Age and Feature Area
Don't analyze all 200 PRs at once. Segment by feature area (docs, components, themes) and by age (recent first). This narrows the search space and improves precision. A PR from 2 weeks ago is more likely to conflict with a new PR than one from 6 months ago.
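Segmentation could be sketched as a grouping step that runs before embedding. How a PR maps to a feature area is repository-specific, so the sketch takes that mapping as a caller-supplied function. `segmentPRs`, `areaOf`, and the 90-day default window are assumptions.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Group PRs by feature area, drop those older than maxAgeDays,
// and order each group newest first.
function segmentPRs(prs, areaOf, { maxAgeDays = 90, now = Date.now() } = {}) {
  const segments = new Map();
  for (const pr of prs) {
    const ageDays = (now - new Date(pr.created_at).getTime()) / DAY_MS;
    if (ageDays > maxAgeDays) continue;
    const area = areaOf(pr);
    if (!segments.has(area)) segments.set(area, []);
    segments.get(area).push(pr);
  }
  for (const list of segments.values()) {
    list.sort((a, b) => new Date(b.created_at) - new Date(a.created_at));
  }
  return segments;
}
```

For example, `areaOf` might return `"docs"` when a PR title starts with `docs:` and `"components"` otherwise. Each segment can then be embedded and queried on its own.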
5. Build a Live Bot, Not Just a Batch Script
The real value emerges when you alert maintainers in real-time. Extend this pipeline into a GitHub bot that:
- Listens for new PR events
- Queries the historical vector index (your 200-PR corpus)
- Posts a comment if a duplicate is detected: "This may be a duplicate of PR #xxxx. See analysis: [link]"
- Allows maintainers to provide feedback, refining your detection over time
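The core of such a bot can be sketched independently of any framework. The function below takes its collaborators (embedding, vector query, LLM analysis, comment posting) as injected functions, so it could be wired to Probot, a GitHub Actions script, or a plain webhook server. All names are hypothetical, and the flow is a simplification: a real bot would first load the candidate's full record before calling the LLM.

```javascript
// Handle one newly opened PR: find candidates, confirm with the LLM,
// and comment on the first confirmed duplicate.
// deps: { embed, querySimilar, analyze, postComment, minConfidence? }
async function handlePullRequestOpened(pr, deps) {
  const { embed, querySimilar, analyze, postComment, minConfidence = 0.7 } = deps;

  const vector = await embed(pr.compressed_content);
  const candidates = await querySimilar(vector);

  for (const candidate of candidates) {
    const otherNumber = candidate.metadata.pr_number;
    if (otherNumber === pr.number) continue; // skip self-matches

    const analysis = await analyze(pr, candidate);
    if (analysis && analysis.are_duplicates && analysis.confidence > minConfidence) {
      await postComment(
        pr.number,
        `This may be a duplicate of PR #${otherNumber}.`
      );
      return otherNumber;
    }
  }
  return null; // nothing confirmed, stay silent
}
```

Stopping at the first confirmed match keeps the bot to one comment per PR, which tends to be less noisy than listing every candidate.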
Cost Optimization on a $0 Budget
This entire system ran on free-tier APIs. Here's how:
- Gemini Embeddings: 60 requests/minute free. Process in batches with 1-second delays.
- Upstash Vector: 10,000 free operations. 200 PRs = ~400 operations (one upsert and one query per PR). Plenty of room.
- LLM Inference (OpenRouter): Free trial credits cover ~100 inference calls. Rotate providers to stretch credits further.
- GitHub API: 60 requests/hour unauthenticated, 5,000 requests/hour authenticated. Authenticate to stay under limits.
Total cost for analyzing 200 PRs: $0, assuming you don't exceed free-tier quotas. At scale (1000+ PRs), expect ~$20-50/month across all services.
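Processing with a fixed delay between calls can be sketched as a small sequential helper. `mapWithDelay` is a hypothetical name, and the 1-second default mirrors the delay suggested for the embedding quota above; check your provider's current limits before relying on it.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run fn over items one at a time, pausing delayMs between calls
// so the request rate stays under a per-minute quota.
async function mapWithDelay(items, fn, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(delayMs);
    results.push(await fn(items[i], i));
  }
  return results;
}
```

For the embedding step this would look like `await mapWithDelay(prIndex, (item) => generateEmbedding(item.compressed_content))`, which stays at or below roughly 60 requests per minute.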
Next Steps: Extending the System
1. Deploy a GitHub Bot
Convert the batch analysis into a real-time bot using Probot or GitHub Actions. When a PR is opened, query your vector index and post an automated comment if duplicates are detected. This closes the feedback loop with maintainers.
2. Build a Maintainer Dashboard
Visualize semantic clusters of similar PRs. Show maintainers where contributors are accidentally overlapping. Highlight high-value SUPERSET PRs (broader fixes that subsume smaller ones). This transforms raw analysis into actionable intelligence.
3. Fine-Tune on Your Repository
After running the initial analysis, collect false positives and false negatives. Fine-tune a small embedding model (like sentence-transformers) on examples from your specific repository. This improves recall for domain-specific patterns your LLM might miss.
4. Integrate with CI/CD
Add a pre-merge check that runs this analysis on every PR. Block merge if high-confidence duplicates are detected, forcing maintainers to explicitly acknowledge the redundancy or provide context.
Summary
Detecting duplicate pull requests in high-traffic repositories requires three phases: aggressive data compression to fit API budgets, vector embeddings to identify semantic similarity despite architectural differences, and LLM reasoning to classify the type of redundancy. This tutorial implemented a complete system that analyzed 200 PRs in shadcn-ui/ui and found 69 valid duplicates, many solving identical functional failures across completely different files.
The system isn't looking for code clones or string matches. It evaluates functional intent: when three separate PRs all fix the same broken link through config, utilities, and registry changes, the system identifies them as goal duplicates. This is the type of noise that costs maintainers the most time to filter by hand.
Key insights from implementation:
- Compress PRs intelligently (keep only changed lines, and slice large PRs by file rather than by raw character count) to preserve semantic signal while managing API budgets
- Vector similarity alone produces false positives due to structural bias; always validate with LLM reasoning
- Implement multi-provider LLM routing to survive rate limits on free-tier APIs
- Cache embeddings aggressively and validate high-confidence results manually before deploying
- The real value emerges when extended into a live GitHub bot that alerts maintainers in real-time
For maintainers interested in deploying this on their own repository, the code includes error handling, exponential backoff, and provider fallback, which makes it a reasonable starting point. For contributors wanting to extend this work, the natural next step is a live bot and a maintainer dashboard that visualizes semantic clusters and surfaces high-impact SUPERSET PRs.
Original Source
https://dev.to/chinmaymhatre/i-analysed-200-prs-in-shadcn-uiui-to-find-duplicates-it-went-surprisingly-well-3jh7