Detect Duplicate PRs With Vector Embeddings

Dev.to by Chinmay Mhatre April 19, 2026

What You'll Learn

How to detect duplicate pull requests across large open-source repositories using vector embeddings and LLM reasoning
Techniques for compressing and filtering massive code changes to fit within API budgets
How to classify redundant PRs into actionable categories (SHADOW, SUPERSET, COMPETING)
Practical strategies for managing AI API rate limits on a zero-dollar budget
Common pitfalls in semantic code analysis and how to overcome them

Introduction: The Duplicate PR Problem

High-traffic open-source repositories face an invisible cost: maintainers drowning in duplicate pull requests. AI coding agents amplify this problem—they generate working solutions at scale, but often solve the same problem multiple times across different files, architectures, and approaches. A recent analysis of 200 PRs in shadcn-ui/ui revealed 69 valid redundancies, many solving identical functional failures through completely different code paths. This tutorial walks you through building a duplicate PR detection system that identifies semantic redundancy, not just code clones.

Why This Matters

Maintainers spend hours triaging PRs that solve problems already fixed. The system described here detects goal duplication—when three different PRs modify config files, utilities, and registries, all to fix the same broken documentation link. This is the most expensive type of noise to filter manually because the changes are architecturally sound but functionally redundant. By automating this detection, maintainers reclaim hours and contributors get faster feedback.

Prerequisites

API Access: Free or trial accounts for Gemini embeddings, an LLM provider (Gemini, Llama 2 via OpenRouter), and a vector database (Upstash Vector recommended for free tier)
Development Environment: Node.js 18+, TypeScript (optional but recommended)
GitHub Access: A GitHub personal access token with read access to public repositories
Repository Knowledge: Familiarity with REST APIs, vector databases, and basic LLM concepts
Token Budget Awareness: Understanding of API rate limits and free-tier constraints

Step-by-Step Guide

Phase 1: Data Ingestion & Context Compression

The first bottleneck is volume. A single large PR can contain thousands of lines across multiple files. Without aggressive filtering, you'll exhaust API budgets before analyzing 50 PRs. The strategy is to extract only semantic signal—the code changes that matter—and discard noise.

Step 1.1: Set up GitHub API access

Install Octokit, the official GitHub API client:

npm install @octokit/rest dotenv

Create a .env file with your GitHub token:

GITHUB_TOKEN=ghp_xxxxxxxxxxxxx
REPO_OWNER=shadcn
REPO_NAME=ui

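Initialize the client to paginate through recent closed PRs. The snippets in this guide use await at the top level for brevity; in a CommonJS script, wrap them in an async function:

require("dotenv").config(); // load GITHUB_TOKEN, REPO_OWNER, and REPO_NAME from .env
const { Octokit } = require("@octokit/rest");
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });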

const prs = await octokit.paginate("GET /repos/{owner}/{repo}/pulls", {
  owner: process.env.REPO_OWNER,
  repo: process.env.REPO_NAME,
  state: "closed",
  per_page: 100,
});

Step 1.2: Filter and compress PR content

Not all files matter. SVGs, lockfiles, node_modules entries, and auto-generated documentation add noise without signal. Extract only meaningful code changes:

const IGNORE_PATTERNS = [
  /\.svg$/,
  /package-lock\.json$/,
  /yarn\.lock$/,
  /node_modules/,
  /\.md$/,
];

function isRelevantFile(filename) {
  return !IGNORE_PATTERNS.some((pattern) => pattern.test(filename));
}

async function compressPRContent(pr) {
  const files = await octokit.paginate(
    "GET /repos/{owner}/{repo}/pulls/{pull_number}/files",
    {
      owner: process.env.REPO_OWNER,
      repo: process.env.REPO_NAME,
      pull_number: pr.number,
      per_page: 100,
    }
  );

  let compressedContent = "";
  for (const file of files) {
    if (!isRelevantFile(file.filename)) continue;

    // Extract only the diff hunks (+ and - lines)
    const patch = file.patch || "";
    const changes = patch
      .split("\n")
      .filter((line) => line.startsWith("+") || line.startsWith("-"))
      .filter((line) => !line.startsWith("+++") && !line.startsWith("---"))
      .join("\n");

    if (changes.length > 0) {
      compressedContent += `File: ${file.filename}\n${changes}\n\n`;
    }
  }

  // If content still exceeds 1500 chars, truncate to the first 1500
  if (compressedContent.length > 1500) {
    compressedContent = compressedContent.substring(0, 1500);
  }
  return compressedContent;
}

Step 1.3: Build a metadata index

Store PR metadata alongside compressed content for later reference. This becomes your audit trail:

const prIndex = [];

for (const pr of prs.slice(0, 200)) {
  const compressed = await compressPRContent(pr);
  prIndex.push({
    number: pr.number,
    title: pr.title,
    author: pr.user.login,
    created_at: pr.created_at,
    merged_at: pr.merged_at,
    compressed_content: compressed,
  });
}
console.log(`Indexed ${prIndex.length} PRs with compression applied.`);

Phase 2: Vector Embeddings & Semantic Search

Raw text similarity (string matching) misses the point. Two PRs solving the same bug in different ways look completely different as strings. Vector embeddings convert semantic meaning into high-dimensional space, allowing you to find similar problems even when code syntax differs.
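
To make the idea concrete, here is a minimal sketch of cosine similarity, a metric commonly used to compare embeddings. The three-dimensional vectors are made-up toy values (real embeddings have hundreds of dimensions), and the variable names are purely illustrative.

```javascript
// Cosine similarity: close to 1 means the vectors point the same way,
// 0 means they are orthogonal (unrelated).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (hypothetical values, not real embeddings)
const fixLinkInConfig = [0.9, 0.1, 0.2];
const fixLinkInUtils = [0.8, 0.2, 0.3];
const addNewComponent = [0.1, 0.9, 0.1];

console.log(cosineSimilarity(fixLinkInConfig, fixLinkInUtils).toFixed(3));
console.log(cosineSimilarity(fixLinkInConfig, addNewComponent).toFixed(3));
```

The first pair scores much higher than the second, which is the property the pipeline relies on: two PRs with the same goal should land close together even when their text differs.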

Step 2.1: Generate embeddings with Gemini

Install the Google AI SDK:

npm install @google/generative-ai

Initialize the embedding model and create vectors for each PR:

const { GoogleGenerativeAI } = require("@google/generative-ai");
const client = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

async function generateEmbedding(text) {
  const model = client.getGenerativeModel({
    model: "embedding-001",
  });
  const result = await model.embedContent(text);
  return result.embedding.values;
}
const embeddedIndex = [];
for (const item of prIndex) {
  const embedding = await generateEmbedding(item.compressed_content);
  embeddedIndex.push({
    ...item,
    embedding,
  });
  console.log(`Embedded PR #${item.number}`);
}

Step 2.2: Index vectors in Upstash

Install Upstash SDK:

npm install @upstash/vector

Create a vector index in Upstash (free tier available) and upsert your embeddings:

const { Index } = require("@upstash/vector");

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN,
});
for (const item of embeddedIndex) {
  await index.upsert({
    id: `pr_${item.number}`,
    values: item.embedding,
    metadata: {
      pr_number: item.number,
      title: item.title,
      author: item.author,
    },
  });
}
console.log("Vectors indexed in Upstash.");

Step 2.3: Query for similar PRs

For each PR, query its 8 nearest neighbors. The PR itself is normally among them and is filtered out, which leaves up to 7 candidates:

async function findSimilarPRs(prNumber, embedding) {
  const results = await index.query({
    vector: embedding,
    topK: 8,
    includeMetadata: true,
  });

  // Exclude the PR itself from results
  return results.filter((r) => r.metadata.pr_number !== prNumber);
}
for (const item of embeddedIndex) {
  const similar = await findSimilarPRs(item.number, item.embedding);
  item.similar_candidates = similar;
}

Phase 3: LLM-Driven Reasoning & Classification

The LLM doesn't do duplicate detection—it does intent analysis. Given two PRs and their semantic similarity score, the model reasons about what problem each solves and classifies the relationship. This is the critical filtering step that separates true duplicates from false alarms.

Step 3.1: Set up multi-provider LLM access

Install axios to call each provider's HTTP API and handle fallbacks:

npm install axios

Create a resilient LLM router that pivots between providers on rate limits:

const axios = require("axios");

// Each provider expects its own auth header and request body.
const LLM_PROVIDERS = [
  {
    name: "Gemini",
    url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent",
    headers: () => ({ "x-goog-api-key": process.env.GEMINI_API_KEY }),
    body: (prompt) => ({ contents: [{ parts: [{ text: prompt }] }] }),
  },
  {
    name: "OpenRouter",
    url: "https://openrouter.ai/api/v1/chat/completions",
    headers: () => ({ Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}` }),
    // OPENROUTER_MODEL is an assumed variable; set it to the model ID you want.
    body: (prompt) => ({
      model: process.env.OPENROUTER_MODEL,
      messages: [{ role: "user", content: prompt }],
    }),
  },
];

let currentProvider = 0;

async function callLLMWithFallback(prompt, maxRetries = 3) {
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const provider = LLM_PROVIDERS[currentProvider];
    try {
      console.log(`Attempt ${attempt + 1} with ${provider.name}...`);
      const response = await axios.post(provider.url, provider.body(prompt), {
        headers: provider.headers(),
        timeout: 10000,
      });
      // Response shape differs per provider (Gemini: candidates, OpenRouter: choices)
      return response.data;
    } catch (error) {
      lastError = error;
      if (error.response?.status === 429 || error.response?.status === 503) {
        // Rate limited or service unavailable: switch provider, then back off
        currentProvider = (currentProvider + 1) % LLM_PROVIDERS.length;
        const backoff = Math.pow(2, attempt) * 1000;
        console.log(`Rate limited. Waiting ${backoff}ms before retry...`);
        await new Promise((resolve) => setTimeout(resolve, backoff));
      }
    }
  }
  throw new Error(
    `All LLM providers exhausted. Last error: ${lastError.message}`
  );
}

Step 3.2: Analyze PR intent and classify redundancy

For each pair of similar PRs, construct a reasoning prompt that forces the model to explicitly state the functional goal of each PR:

async function analyzeRedundancy(pr1, pr2)
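
The body of analyzeRedundancy is not shown in the source. As a rough sketch of what it might involve, the two hypothetical helpers below build a prompt that asks for each PR's functional goal and then parse the model's reply. The keys are_duplicates, category, confidence, and functional_goal_pr1 come from how Step 3.3 reads the result; functional_goal_pr2, the prompt wording, and the validation rules are assumptions.

```javascript
const CATEGORIES = ["SHADOW", "SUPERSET", "COMPETING"];

// Hypothetical helper: asks the model to state each PR's goal before judging.
function buildRedundancyPrompt(pr1, pr2) {
  return [
    "You are reviewing two pull requests from the same repository.",
    "First state the functional goal of each PR in one sentence.",
    "Then decide whether both PRs pursue the same goal.",
    `If they do, pick one category: ${CATEGORIES.join(", ")}.`,
    "Reply with JSON only, using exactly these keys:",
    '{"functional_goal_pr1": string, "functional_goal_pr2": string,',
    ' "are_duplicates": boolean, "category": string or null,',
    ' "confidence": number from 0 to 1}',
    "",
    `PR #${pr1.number}: ${pr1.title}`,
    pr1.compressed_content,
    "",
    `PR #${pr2.number}: ${pr2.title}`,
    pr2.compressed_content,
  ].join("\n");
}

// Hypothetical helper: extracts and validates the JSON object in a reply.
// Returns null when the reply cannot be trusted, so callers can skip the pair.
function parseRedundancyReply(text) {
  const match = typeof text === "string" ? text.match(/\{[\s\S]*\}/) : null;
  if (!match) return null;
  let parsed;
  try {
    parsed = JSON.parse(match[0]);
  } catch {
    return null;
  }
  const validConfidence =
    typeof parsed.confidence === "number" &&
    parsed.confidence >= 0 &&
    parsed.confidence <= 1;
  if (typeof parsed.are_duplicates !== "boolean" || !validConfidence) {
    return null;
  }
  if (parsed.are_duplicates && !CATEGORIES.includes(parsed.category)) {
    return null;
  }
  return parsed;
}
```

Under these assumptions, analyzeRedundancy(pr1, pr2) could pass buildRedundancyPrompt(pr1, pr2) to callLLMWithFallback, pull the reply text out of the provider-specific response, and return parseRedundancyReply(text). Returning null for malformed replies fits the `analysis &&` guard used in Step 3.3.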

Step 3.3: Build the redundancy report

Run the full analysis and aggregate results:

const redundancyReport = [];

for (const item of embeddedIndex) {
  for (const candidate of item.similar_candidates) {
    const other = embeddedIndex.find(
      (e) => e.number === candidate.metadata.pr_number
    );
    const analysis = await analyzeRedundancy(item, other);

    if (analysis && analysis.are_duplicates && analysis.confidence > 0.7) {
      redundancyReport.push({
        primary_pr: item.number,
        duplicate_pr: candidate.metadata.pr_number,
        category: analysis.category,
        confidence: analysis.confidence,
        functional_goal: analysis.functional_goal_pr1,
      });
    }
  }
}
console.log(
  `Found ${redundancyReport.length} high-confidence duplicates across ${embeddedIndex.length} PRs.`
);
console.log(JSON.stringify(redundancyReport, null, 2));

Real-World Results: What the System Found

Running this pipeline on shadcn-ui/ui's 200 most recent PRs identified 69 valid redundancies. The most interesting case involved three separate PRs targeting the same broken /blocks page link:

PR #10156: Simple URL replacement in config file
PR #10088: Path normalization in utilities
PR #10096: Reference key refactoring in registry

All three solved identical functional failures, but in completely different files using different strategies. The vector embedding phase connected them despite architectural differences, and the LLM phase confirmed they were solving the same problem. This is "goal duplication"—the most expensive noise for maintainers to filter manually because the changes look legitimate in isolation.

Troubleshooting Common Issues

Issue 1: Vector Similarity Misses Architectural Differences

Symptom: The system flags two PRs as semantically similar, but they solve completely different problems in the same file.

Root Cause: Embeddings of raw diffs are dominated by surface structure (similar code shape) and only partly reflect intent. Two different JSON additions to the same registry might have identical structure but target different problems.

Solution: Tighten the candidate filter before sending pairs to the LLM. Query only the top 5 candidates instead of 8, and raise the LLM confidence threshold to 0.75 or higher. You'll miss some duplicates, but false positives drop sharply.
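
As a rough illustration of that filtering step, the sketch below trims query results by similarity score and count before any LLM call. It assumes results shaped like the Upstash query output used earlier (id, score, metadata); selectCandidates is a hypothetical helper, and the 0.85 minimum score is an assumed starting point to tune, not a value from the original analysis.

```javascript
// results: array of { id, score, metadata: { pr_number } } from index.query().
// Keeps only strong matches, best first, and drops the PR's own entry.
function selectCandidates(
  results,
  selfNumber,
  { minScore = 0.85, maxCandidates = 5 } = {}
) {
  return results
    .filter((r) => r.metadata.pr_number !== selfNumber)
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxCandidates);
}
```

Because the filters run before the sort, the input array is never mutated, so the same query results can be reused with different thresholds while tuning.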

Issue 2: Smaller Models Show Structural Bias

Symptom: After hitting quota limits and falling back to a smaller 8B model, the system flags new registry entries as duplicates just because they look similar structurally (same JSON shape).

Root Cause: Smaller models struggle to weigh semantic content (URL values, IDs) over syntactic structure (JSON nesting). They pattern-match on shape rather than meaning.

Solution: Avoid smaller models for this task. If you must use them due to budget constraints, add a pre-processing step that emphasizes literal values in your prompts: "Focus on URLs, IDs, and exact values. Ignore structural similarity."
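
One possible shape for that pre-processing step is sketched below. Both helpers are hypothetical, and the regular expressions are deliberately simple assumptions that will miss some literals (for example, strings containing escaped quotes).

```javascript
// Pull URLs and quoted string values out of a diff so they can be listed
// explicitly in the prompt. Duplicates are removed, first occurrence wins.
function extractLiterals(diffText) {
  const urls = diffText.match(/https?:\/\/[^\s"'`)]+/g) || [];
  const quoted = [...diffText.matchAll(/["'`]([^"'`\n]{2,})["'`]/g)].map(
    (m) => m[1]
  );
  return [...new Set([...urls, ...quoted])];
}

// Append the instruction from the text above plus the extracted literals.
function withLiteralEmphasis(prompt, diffText) {
  const literals = extractLiterals(diffText);
  if (literals.length === 0) return prompt;
  return [
    prompt,
    "",
    "Focus on URLs, IDs, and exact values. Ignore structural similarity.",
    "Literal values changed in this diff:",
    ...literals.map((value) => `- ${value}`),
  ].join("\n");
}
```

Listing the literals separately gives a smaller model something concrete to compare, so it has less room to match on JSON shape alone.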

Issue 3: Rate Limiting Exhausts Budget Mid-Analysis

Symptom: You complete 50 PRs and then hit 429 (Too Many Requests) errors, with no fallback.

Root Cause: Free-tier APIs have aggressive rate limits. A single provider isn't enough.

Solution: Implement the multi-provider router shown in Phase 3. Rotate between Gemini and OpenRouter-hosted models such as Llama. Add exponential backoff (1s, 2s, 4s) before retries. Cache LLM responses to avoid re-analyzing the same pair of PRs.
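
The caching part can be sketched as follows. This is a minimal in-memory version under the assumption that one process handles the whole run; pairKey and createPairCache are hypothetical names. The key is order-independent, so analyzing PR A against B and later B against A costs one LLM call instead of two.

```javascript
// Order-independent key so (A, B) and (B, A) share one cache entry.
function pairKey(a, b) {
  return a < b ? `${a}:${b}` : `${b}:${a}`;
}

function createPairCache() {
  const cache = new Map();
  return {
    // analyze is any async function (prA, prB) => result,
    // for example analyzeRedundancy.
    async getOrCompute(prA, prB, analyze) {
      const key = pairKey(prA.number, prB.number);
      if (!cache.has(key)) {
        const pending = Promise.resolve().then(() => analyze(prA, prB));
        // Do not keep failures, so a later attempt can retry the pair.
        pending.catch(() => cache.delete(key));
        // Store the promise so concurrent callers share one request.
        cache.set(key, pending);
      }
      return cache.get(key);
    },
    size: () => cache.size,
  };
}
```

In the Step 3.3 loop, `await cache.getOrCompute(item, other, analyzeRedundancy)` could stand in for the direct call. Persisting the map to disk between runs would be a natural extension.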

Issue 4: Large PRs Dominate the Vector Index

Symptom: A single large PR with 5,000 lines truncated to 1,500 chars produces weak embeddings; semantic information is lost.

Root Cause: Truncation loses context. A 5,000-line PR cut to 1,500 characters keeps only a small fraction of its changes, all from the first files in the diff.

Solution: Slice large PRs by file, not by character count. Generate separate embeddings for each modified file, then aggregate similarity scores. This preserves semantic information across large changes.
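
A minimal sketch of that approach is shown below. chunkByFile and aggregateScores are hypothetical helpers; the per-chunk cap of 1,500 characters and the choice to report both the maximum and the mean are assumptions, since the source does not say how scores were aggregated.

```javascript
// Split a PR into one text chunk per file that has a patch.
// files: entries shaped like GitHub's pull request files response
// ({ filename, patch }). maxChars is an assumed per-chunk cap.
function chunkByFile(files, maxChars = 1500) {
  return files
    .filter((file) => typeof file.patch === "string" && file.patch.length > 0)
    .map((file) => ({
      filename: file.filename,
      text: `File: ${file.filename}\n${file.patch}`.slice(0, maxChars),
    }));
}

// Combine chunk-level similarity scores into PR-level numbers.
// max answers "does any file strongly match?", mean answers
// "do the PRs match overall?".
function aggregateScores(chunkScores) {
  if (chunkScores.length === 0) return { max: 0, mean: 0 };
  const sum = chunkScores.reduce((total, score) => total + score, 0);
  return {
    max: Math.max(...chunkScores),
    mean: sum / chunkScores.length,
  };
}
```

Each chunk would be embedded and upserted under its own ID (for example the PR number plus the filename), with isRelevantFile applied first. The trade-off is more embedding and vector operations per PR, which counts against the free-tier quotas.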

Best Practices

1. Cache Embeddings Aggressively

Generating embeddings costs tokens. Once created, save them to persistent storage (PostgreSQL, DynamoDB). Re-use cached embeddings for repeat analyses. Store both the raw vector and metadata (PR number, title, author) together.
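
For a small corpus, a JSON file can stand in for the databases mentioned above. The sketch below swaps in that simpler storage on purpose; contentKey, loadCache, and getCachedEmbedding are hypothetical helpers, and the embed argument is any async function that returns a vector, such as generateEmbedding.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");

// Hash the text so a changed diff produces a new cache entry.
function contentKey(text) {
  return crypto.createHash("sha256").update(text).digest("hex");
}

// Load the cache object from disk, or start empty if the file is missing.
function loadCache(path) {
  return fs.existsSync(path) ? JSON.parse(fs.readFileSync(path, "utf8")) : {};
}

// Return the cached vector for this text, calling embed only on a miss.
async function getCachedEmbedding(cache, path, text, embed) {
  const key = contentKey(text);
  if (!cache[key]) {
    cache[key] = await embed(text);
    fs.writeFileSync(path, JSON.stringify(cache));
  }
  return cache[key];
}
```

Keying on a hash of the compressed content rather than the PR number means a PR that receives new commits is re-embedded automatically, while unchanged PRs cost nothing on repeat runs.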

2. Validate High-Confidence Results Manually

Even at 0.9 confidence, LLM classifications can be wrong. For the top 10-20 flagged duplicates, manually verify the analysis. This builds trust with maintainers and improves your feedback loop for future runs.

3. Monitor for Hallucination in Wide Sweeps

When analyzing large repositories (500+ PRs), the system sometimes identifies false connections between unrelated changes that happen to touch the same package. Combat this by:

Requiring high similarity scores (top-3 candidates only, not top-8)
Increasing LLM confidence threshold to 0.8+
Adding a "false positive penalty" in scoring when PRs modify different core files
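
The penalty idea in the last item could look something like the sketch below. Both helpers are hypothetical, the overlap measure (Jaccard) is an assumed choice, and the default weight of 0.2 is an arbitrary starting value.

```javascript
// Jaccard overlap between the sets of files two PRs touch:
// 0 means no shared files, 1 means the same set of files.
function fileOverlap(filesA, filesB) {
  const a = new Set(filesA);
  const b = new Set(filesB);
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 0;
  let shared = 0;
  for (const file of a) {
    if (b.has(file)) shared++;
  }
  return shared / union.size;
}

// Scale the LLM's confidence down when the PRs share few files.
// penalty is the maximum fraction removed (applied in full at zero overlap).
function adjustedConfidence(confidence, filesA, filesB, penalty = 0.2) {
  const overlap = fileOverlap(filesA, filesB);
  return confidence * (1 - penalty * (1 - overlap));
}
```

Keep the weight small: the goal duplicates described earlier live in entirely different files, so a heavy penalty would suppress exactly the cases this system is meant to find.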

4. Segment by PR Age and Feature Area

Don't analyze all 200 PRs at once. Segment by feature area (docs, components, themes) and by age (recent first). This narrows the search space and improves precision. A PR from 2 weeks ago is more likely to conflict with a new PR than one from 6 months ago.
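
Here is one way that segmentation might be sketched. The path patterns in AREA_RULES are hypothetical and need to match your repository's layout, the 90-day window is an assumed default, and each PR is assumed to carry a filenames array, which the Step 1.3 index does not store as written.

```javascript
// Hypothetical path rules; adjust them to your repository layout.
const AREA_RULES = [
  { area: "docs", pattern: /(^|\/)docs\// },
  { area: "components", pattern: /(^|\/)components\// },
  { area: "themes", pattern: /(^|\/)themes\// },
];

// Pick the area that most of a PR's files fall into.
function featureArea(filenames, rules = AREA_RULES) {
  const counts = new Map();
  for (const name of filenames) {
    const rule = rules.find((r) => r.pattern.test(name));
    const area = rule ? rule.area : "other";
    counts.set(area, (counts.get(area) || 0) + 1);
  }
  let best = "other";
  let bestCount = 0;
  for (const [area, count] of counts) {
    if (count > bestCount) {
      best = area;
      bestCount = count;
    }
  }
  return best;
}

// Drop PRs older than maxAgeDays, then group the rest by area, newest first.
function segmentPRs(prs, { now = new Date(), maxAgeDays = 90 } = {}) {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  const recent = prs
    .filter((pr) => new Date(pr.created_at).getTime() >= cutoff)
    .sort((a, b) => new Date(b.created_at) - new Date(a.created_at));
  const segments = {};
  for (const pr of recent) {
    const area = featureArea(pr.filenames);
    (segments[area] ||= []).push(pr);
  }
  return segments;
}
```

One caveat: strict segmentation by path would separate goal duplicates that touch different areas, so it may be worth running one unsegmented pass as well.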

5. Build a Live Bot, Not Just a Batch Script

The real value emerges when you alert maintainers in real-time. Extend this pipeline into a GitHub bot that:

Listens for new PR events
Queries the historical vector index (your 200-PR corpus)
Posts a comment if a duplicate is detected: "This may be a duplicate of PR #xxxx. See analysis: [link]"
Allows maintainers to provide feedback, refining your detection over time
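
The comment-posting step of such a bot might be sketched as below. buildDuplicateComment and postDuplicateComment are hypothetical helpers, and the shape of each match (pr_number, category, confidence) is an assumption modeled on the redundancy report. The Octokit call is the standard Issues API endpoint, which also handles pull request comments.

```javascript
// Format the detected matches as a comment body, or null if there are none.
function buildDuplicateComment(matches) {
  if (matches.length === 0) return null;
  const lines = matches.map(
    (m) =>
      `- #${m.pr_number} (${m.category}, confidence ${m.confidence.toFixed(2)})`
  );
  return [
    "This may be a duplicate of:",
    ...lines,
    "",
    "Reply to this comment if the match is wrong so the detector can be tuned.",
  ].join("\n");
}

// octokit is an authenticated @octokit/rest client.
// Returns true if a comment was posted, false if there was nothing to report.
async function postDuplicateComment(octokit, { owner, repo, pullNumber, matches }) {
  const body = buildDuplicateComment(matches);
  if (!body) return false;
  await octokit.rest.issues.createComment({
    owner,
    repo,
    issue_number: pullNumber,
    body,
  });
  return true;
}
```

Keeping the formatting separate from the API call makes the message easy to test and reword without touching GitHub.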

Cost Optimization on a $0 Budget

This entire system ran on free-tier APIs. Here's how:

Gemini Embeddings: 60 requests/minute free. Process in batches with 1-second delays.
Upstash Vector: 10,000 free operations. 200 PRs = ~400 operations (one upsert and one query per PR). Plenty of room.
LLM Inference (OpenRouter): Free trial credits cover ~100 inference calls. Rotate providers to stretch credits further.
GitHub API: 60 requests/hour unauthenticated, 5,000 authenticated. Authenticate to stay under limits.

Total cost for analyzing 200 PRs: $0, assuming you don't exceed free-tier quotas. At scale (1000+ PRs), expect ~$20-50/month across all services.
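
The arithmetic behind these quotas can be checked with a small estimator. This is a back-of-the-envelope sketch: estimateUsage is a hypothetical helper, and it assumes one embedding, one upsert, and one query per PR, with the PR's own match removed from each query result.

```javascript
// Rough request counts for one full run of the pipeline.
function estimateUsage({ prCount, topK, embedRatePerMinute }) {
  return {
    embeddingRequests: prCount,
    // One upsert plus one query per PR
    vectorOperations: prCount * 2,
    // Upper bound: every non-self candidate is sent to the LLM
    maxLlmCalls: prCount * (topK - 1),
    // Lower bound on embedding time at the free-tier rate
    minEmbeddingMinutes: Math.ceil(prCount / embedRatePerMinute),
  };
}

console.log(estimateUsage({ prCount: 200, topK: 8, embedRatePerMinute: 60 }));
```

For 200 PRs with topK of 8, the upper bound on LLM calls is 1,400, far more than the roughly 100 calls the trial credits are said to cover. Under these assumptions, filtering candidates and avoiding repeat analysis of the same pair have the largest effect on staying within budget.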

Next Steps: Extending the System

1. Deploy a GitHub Bot

Convert the batch analysis into a real-time bot using Probot or GitHub Actions. When a PR is opened, query your vector index and post an automated comment if duplicates are detected. This closes the feedback loop with maintainers.

2. Build a Maintainer Dashboard

Visualize semantic clusters of similar PRs. Show maintainers where contributors are accidentally overlapping. Highlight high-value SUPERSET PRs (broader fixes that subsume smaller ones). This transforms raw analysis into actionable intelligence.
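
Before anything can be visualized, the pairwise report has to be grouped into clusters. The sketch below does this with a union-find structure, which is an assumed technique rather than something the source describes; clusterDuplicates is a hypothetical helper that reads the primary_pr and duplicate_pr fields from the redundancy report.

```javascript
// Group duplicate pairs into clusters of connected PRs.
// report: array of { primary_pr, duplicate_pr } entries.
function clusterDuplicates(report) {
  const parent = new Map();

  // Find the representative of x's cluster, shortening the path as we go.
  const find = (x) => {
    if (!parent.has(x)) parent.set(x, x);
    while (parent.get(x) !== x) {
      parent.set(x, parent.get(parent.get(x)));
      x = parent.get(x);
    }
    return x;
  };

  // Merge the clusters of each reported pair.
  for (const { primary_pr, duplicate_pr } of report) {
    parent.set(find(primary_pr), find(duplicate_pr));
  }

  // Collect members under their representative.
  const clusters = new Map();
  for (const pr of parent.keys()) {
    const root = find(pr);
    if (!clusters.has(root)) clusters.set(root, []);
    clusters.get(root).push(pr);
  }
  return [...clusters.values()].map((members) =>
    members.sort((a, b) => a - b)
  );
}
```

With this grouping, the three /blocks PRs would appear as a single cluster even if the report only linked #10156 to #10088 and #10088 to #10096, so a dashboard can show one row per underlying problem instead of one per pair.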

3. Fine-Tune on Your Repository

After running the initial analysis, collect false positives and false negatives. Fine-tune a small embedding model (like sentence-transformers) on examples from your specific repository. This improves recall for domain-specific patterns your LLM might miss.

4. Integrate with CI/CD

Add a pre-merge check that runs this analysis on every PR. Block merge if high-confidence duplicates are detected, forcing maintainers to explicitly acknowledge the redundancy or provide context.

Summary

Detecting duplicate pull requests in high-traffic repositories requires three phases: aggressive data compression to fit API budgets, vector embeddings to identify semantic similarity despite architectural differences, and LLM reasoning to classify the type of redundancy. This tutorial implemented a complete system that analyzed 200 PRs in shadcn-ui/ui and found 69 valid duplicates, many solving identical functional failures across completely different files.

The system isn't looking for code clones or string matches. It evaluates intent: when three separate PRs all fix the same broken link through config, utility, and registry changes, the system identifies them as goal duplicates. This is the type of noise that costs maintainers the most time to filter by hand.

Key insights from implementation:

Slice large PRs by file rather than truncating by raw character count, to preserve semantic signal while managing API budgets
Vector similarity alone produces false positives due to structural bias; always validate with LLM reasoning
Implement multi-provider LLM routing to survive rate limits on free-tier APIs
Cache embeddings aggressively and validate high-confidence results manually before deploying
The real value emerges when extended into a live GitHub bot that alerts maintainers in real-time

For maintainers interested in deploying this on their own repository, the code shown here covers error handling, exponential backoff, and provider fallback. For contributors wanting to extend this work, the natural next step is a live bot and a maintainer dashboard that visualizes semantic clusters and surfaces high-impact SUPERSET PRs.
