ENGINEERING

Duplicate Image Detection: How to Find and Remove Duplicate Photos

Giorgi Kenchadze

2026-04-08 · 9 min read

A user uploads a product photo to your marketplace. That same photo is already on 14 other listings posted by different sellers. Or your photo library has 50,000 images and a good chunk of them are slightly different versions of the same shot. Resized, cropped, filtered, re-compressed.

Detecting duplicates sounds like a simple problem until you try to define what "duplicate" means. An exact copy? A resized version? A cropped version with a filter on it? A screenshot of the original? Each level of "duplicate" requires a different detection approach.

This post covers three approaches, from simple to sophisticated, with real numbers on speed and accuracy so you can pick the right one for your use case.

Approach 1: File Hashing (Exact Duplicates)

The simplest approach. Compute an MD5 or SHA-256 hash of the raw file bytes. If two files have the same hash, they're byte-for-byte identical.

import hashlib
from pathlib import Path


def file_hash(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Compare two files
if file_hash("photo_a.jpg") == file_hash("photo_b.jpg"):
    print("Exact duplicate")

Speed: You can hash tens of thousands of files per second. The bottleneck is disk I/O, not the hash computation.

What it catches: Exact copies. Same file re-uploaded, same file saved to two locations.

What it misses: Everything else. Resize the image by 1 pixel, re-save as a different JPEG quality, add a single byte of metadata, and the hash changes completely. This approach treats a 4000x3000 original and its 800x600 thumbnail as completely different files.

When to use it: As the first pass in a multi-stage pipeline. It's free, it's instant, and it catches the easy cases. Every duplicate detection system should start here.
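As a sketch of that first pass, here's a stdlib-only scan that groups every file in a folder by its SHA-256 digest; any group with two or more paths is a set of exact copies. (The directory layout and return shape are illustrative choices, not a fixed API.)

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_exact_duplicates(image_dir: str) -> list[list[str]]:
    """Group files that are byte-for-byte identical."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for path in Path(image_dir).iterdir():
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(str(path))
    # Only digests shared by two or more files form a duplicate group
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Because the bottleneck is disk I/O, this scales to large folders without any tuning; for very large files you could feed the hash in chunks instead of reading the whole file at once.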

Approach 2: Perceptual Hashing (Near-Duplicates)

Perceptual hashing fingerprints what the image looks like, not its raw bytes. Two photos of the same scene will produce similar hashes even if they differ in resolution, compression level, or file format.

The most common variants:

  • aHash (Average Hash): Resize to 8x8, grayscale, compare each pixel to the mean. Fastest, least robust.
  • dHash (Difference Hash): Compares adjacent pixel brightness. Fast and surprisingly effective.
  • pHash (Perceptual Hash): Uses DCT (discrete cosine transform). More robust to edits.

All produce a compact hash (usually 64 bits). You compare two hashes by counting how many bits differ (Hamming distance). Low distance means the images look similar.
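To make the aHash recipe concrete, here's a minimal stdlib-only sketch that builds the 64-bit hash from an already-prepared 8x8 grid of grayscale values. In practice a library (like Pillow, used below) handles the resize and grayscale conversion; this just shows the bit-packing step.

```python
def average_hash_64(pixels: list[list[int]]) -> int:
    """aHash over an 8x8 grayscale grid: each bit is 1 if the pixel
    is at or above the mean brightness, 0 otherwise."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value >= mean else 0)
    return bits
```

A grid whose top half is black and bottom half is white hashes to 32 zero bits followed by 32 one bits; resizing or recompressing the source image barely moves pixels relative to the mean, which is why the hash survives those edits.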

from pathlib import Path

from PIL import Image
import imagehash  # pip install imagehash


def detect_near_duplicate(path_a: str, path_b: str, threshold: int = 5):
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    distance = hash_a - hash_b  # Hamming distance
    return distance <= threshold


# Find all duplicates in a folder
def find_duplicates(image_dir: str, threshold: int = 5):
    hashes = {}
    duplicates = []

    for path in Path(image_dir).glob("*.jpg"):
        h = imagehash.phash(Image.open(path))
        for existing_hash, existing_path in hashes.items():
            if h - existing_hash <= threshold:
                duplicates.append((str(path), str(existing_path)))
                break
        else:
            hashes[h] = path

    return duplicates

Hamming Distance Thresholds

For a 64-bit pHash:

  Distance   Meaning
  0          Perceptually identical
  1-5        Near-duplicate (resized, recompressed, minor color shift)
  6-10       Probably similar (some edits, possible false positives)
  >10        Different images

Most production systems use a threshold of 5 or lower.

Speed: 500-2,000 images/second for hashing. Comparison is millions per second (it's just XOR on 64-bit integers). A pHash index for 1 million images fits in about 8 MB of RAM.
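The comparison really is just an XOR plus a bit count, which is why it runs at millions of pairs per second. A stdlib-only sketch:

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count differing bits between two 64-bit perceptual hashes.
    XOR leaves a 1 wherever the hashes disagree; counting those
    1-bits gives the Hamming distance."""
    return bin(hash_a ^ hash_b).count("1")
```

On Python 3.10+, `(hash_a ^ hash_b).bit_count()` does the same thing faster.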

What it catches: Resized copies, different JPEG compression levels, minor color adjustments, format conversions (PNG to JPEG).

What it misses: Crops, rotations beyond a few degrees, text overlays, watermarks, filters, screenshots of photos. If the visual structure of the image changes significantly, perceptual hashing breaks.

When to use it: As the second layer after file hashing. Catches the common re-upload scenarios on marketplaces and content platforms. Fast enough to run on every upload in real time.

Approach 3: Embedding Similarity (Semantic Duplicates)

This is the most powerful approach. A vision model (like CLIP or SigLIP) looks at the image and produces a vector embedding that captures what the image actually contains. Two photos of the same red sneaker from different angles will have similar embeddings, even though they look different at the pixel level.

Instead of comparing pixel patterns (like hashing), you're comparing meaning.

import requests
import os

API_KEY = os.getenv("VECSTORE_API_KEY")
DB_ID = os.getenv("VECSTORE_DB_ID")
BASE = "https://api.vecstore.app/api"
HEADERS = {"X-API-Key": API_KEY, "Content-Type": "application/json"}


def check_duplicate(image_url: str, threshold: float = 0.90) -> dict | None:
    """Check if a similar image already exists in the database."""
    resp = requests.post(
        f"{BASE}/databases/{DB_ID}/search",
        headers=HEADERS,
        json={"image_url": image_url, "top_k": 1},
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    results = resp.json().get("results", [])
    if results and results[0]["score"] >= threshold:
        return results[0]  # duplicate found
    return None


def insert_if_unique(image_url: str, threshold: float = 0.90):
    """Only insert an image if it's not already in the database."""
    duplicate = check_duplicate(image_url, threshold)
    if duplicate:
        print(f"Duplicate found (score: {duplicate['score']:.2f})")
        return {"status": "duplicate", "match": duplicate}

    resp = requests.post(
        f"{BASE}/databases/{DB_ID}/documents",
        headers=HEADERS,
        json={"image_url": image_url},
    )
    print("Inserted new image")
    return {"status": "inserted", "data": resp.json()}

Similarity Score Thresholds

  Score       Meaning
  0.95-1.0    Near-identical (same photo, different crop or quality)
  0.85-0.95   Very similar (same subject, different angle or lighting)
  0.70-0.85   Related (same category, some visual overlap)
  <0.70       Different images

For duplicate detection, 0.90+ works well for most use cases. For "find similar" features, 0.75+ is usually the right range.

Speed: Depends on where you run the model. Self-hosted CLIP on a GPU gives 50-100 images/sec. Through a managed API like Vecstore, the embedding happens server-side and search comes back in under 200ms.

What it catches: Everything the other approaches catch, plus: crops, rotations, filters, watermarks, screenshots of photos, photos of screens, text overlays, aspect ratio changes. It even catches two different photos of the same object from different angles.

What it misses (or over-matches): It can flag two different red sneakers as "similar" because it's matching on semantics, not pixels. For strict duplicate detection, raise the threshold to 0.95+. For "same product, different photo" detection, 0.85+ works well.

When to use it: When perceptual hashing isn't enough. If your duplicates include edited versions, screenshots, or photos taken from different angles, embedding similarity is the only reliable approach.

Which Approach Should You Use?

It depends on what you're trying to catch.

  Scenario                                 Best Approach
  Same file re-uploaded                    File hash (SHA-256)
  Resized or recompressed copies           Perceptual hash (pHash)
  Cropped, filtered, or edited versions    Embedding similarity
  Same object photographed differently     Embedding similarity
  All of the above                         Layered: file hash → pHash → embeddings

For most production systems, the layered approach makes the most sense:

  1. File hash first. Instant, free, catches exact copies.
  2. Perceptual hash second. Fast, catches resized/recompressed copies.
  3. Embedding similarity third. Catches everything else.

Each layer filters out the easy cases before the more expensive check runs. The file hash catches re-uploads before pHash even runs. pHash catches resized copies before you spend an API call on embedding similarity.

That said, if your duplicates are commonly edited (marketplace photo theft, social media reposts, content scraping), skip straight to embedding similarity. The simpler approaches will miss them.

Building a Production Pipeline

Here's a more complete example that combines all three approaches. This is what a real upload handler might look like:

import hashlib
import imagehash
from PIL import Image


class DuplicateDetector:
    def __init__(self, hash_threshold=5, similarity_threshold=0.90):
        self.file_hashes = set()
        self.perceptual_hashes = {}
        self.hash_threshold = hash_threshold
        self.similarity_threshold = similarity_threshold

    def check(self, file_path: str) -> dict:
        # Layer 1: exact file hash
        with open(file_path, "rb") as f:
            fhash = hashlib.sha256(f.read()).hexdigest()

        if fhash in self.file_hashes:
            return {"duplicate": True, "method": "file_hash"}
        self.file_hashes.add(fhash)

        # Layer 2: perceptual hash
        phash = imagehash.phash(Image.open(file_path))
        for existing_hash, existing_path in self.perceptual_hashes.items():
            if phash - existing_hash <= self.hash_threshold:
                return {
                    "duplicate": True,
                    "method": "perceptual_hash",
                    "match": str(existing_path),
                }
        self.perceptual_hashes[phash] = file_path

        # Layer 3: embedding similarity. check_duplicate_via_api is assumed
        # to wrap the Vecstore search shown earlier for a local file path.
        duplicate = check_duplicate_via_api(
            file_path, self.similarity_threshold
        )
        if duplicate:
            return {
                "duplicate": True,
                "method": "embedding",
                "score": duplicate["score"],
            }

        return {"duplicate": False}

The first two layers are local and fast. The third layer only runs for images that passed both previous checks. This means you're only making API calls for images that are genuinely new or have been significantly edited.

Common Pitfalls

Threshold tuning is the hard part. There's no universal threshold. A marketplace detecting stolen product photos needs different thresholds than a photo library removing personal duplicates. Start with the defaults from this post and adjust based on your false positive/negative rates.

Pairwise comparison doesn't scale. Comparing every image against every other image is O(n^2). At 100,000 images, that's about 5 billion pairwise comparisons. Use indexed search (vector databases, hash tables) instead of brute-force comparison.
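For perceptual hashes, one common indexing trick works by pigeonhole: split each 64-bit hash into four 16-bit bands and index on every band. Two hashes within Hamming distance 3 can differ in at most 3 bands, so at least one band must match exactly; you only verify the full distance against candidates that share a band. Here's a stdlib-only sketch of that idea (the banding scheme is an illustration, not a feature of any particular library):

```python
from collections import defaultdict


class BandedHashIndex:
    """Index 64-bit hashes in four 16-bit bands. For queries within
    Hamming distance 3, at least one band must match exactly
    (pigeonhole), so only band collisions need a full check."""

    def __init__(self, max_distance: int = 3):
        self.max_distance = max_distance
        self.bands: list[dict[int, list[int]]] = [
            defaultdict(list) for _ in range(4)
        ]
        self.hashes: list[int] = []

    def _band_values(self, h: int) -> list[int]:
        # Slice the 64-bit hash into four 16-bit chunks
        return [(h >> (16 * i)) & 0xFFFF for i in range(4)]

    def add(self, h: int) -> None:
        idx = len(self.hashes)
        self.hashes.append(h)
        for band, value in zip(self.bands, self._band_values(h)):
            band[value].append(idx)

    def query(self, h: int) -> list[int]:
        """Return stored hashes within max_distance of h."""
        candidate_ids = set()
        for band, value in zip(self.bands, self._band_values(h)):
            candidate_ids.update(band.get(value, ()))
        # Verify the full Hamming distance only on band collisions
        return [
            self.hashes[i]
            for i in candidate_ids
            if bin(self.hashes[i] ^ h).count("1") <= self.max_distance
        ]
```

Each query touches four hash-table buckets instead of the whole collection, so lookups stay fast even at millions of images. Vector databases apply the same principle, with approximate nearest-neighbor indexes instead of exact band matches.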

Screenshots and photos-of-screens. These are one of the hardest cases. The image is captured through a different device, adding noise, color shift, and perspective distortion. Perceptual hashing usually misses these. Embedding similarity catches them if the threshold isn't too strict.

JPEG re-encoding. Every time a JPEG is saved, it loses a tiny bit of quality. After a few rounds of save-upload-download-save, the file is different enough that file hashing misses it, but the image still looks identical to a human. Perceptual hashing handles this well.

Wrapping Up

For most applications, start with file hashing and perceptual hashing. They're fast, they're local, and they catch the obvious cases. When you need to catch edited, cropped, or re-photographed duplicates, add embedding similarity on top.

If you want to skip the hashing layers and go straight to embedding-based detection, Vecstore handles the embeddings and similarity search for you. Insert your images, search each new upload against your database, and check if the top result scores above your threshold.

Get started with Vecstore - 25 free credits on signup, no credit card required.
