PDF Batch Watermark Remover: Elegantly Clean Up Watermarks and Text from Your PDF Documents

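This article was last updated 141 days ago, so some of the information may be out of date. If you spot an error, please point it out in the comments!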

In day-to-day work and study, we often run into PDF documents with watermarks. These watermarks hurt readability, especially when they overlap the main content. I recently downloaded a compressed archive of research papers, and, unsurprisingly, some of them carried watermarks from tutoring agencies, which was quite annoying (as shown in the image below). Most of the solutions I found online use the PyMuPDF library to cover the watermark with a white block, which I didn’t find very elegant. So I spent some time developing a more refined solution and am documenting it here.

Features

Recursive Search: Automatically processes PDF files in a specified folder and all its subfolders.
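Precise Matching: Deletes only text that matches the specified strings, leaving other content untouched. Note that PyMuPDF’s search_for() ignores upper/lower case for ASCII characters, so the match is not case-sensitive.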
In-Place Processing: Modifies the original files directly, saving disk space.
Safe and Reliable: Saves to a temporary file first and replaces the original only after the save succeeds.
Error Handling: Robust error handling ensures that batch processing won’t be interrupted by failures with individual files.
Detailed Logging: Provides detailed information about the processing progress.

How to Use

Ensure you have a Python environment installed.
Install the necessary dependency:

pip install PyMuPDF

Download and configure the script (complete code at the end of this article):

import fitz  # PyMuPDF library
import os

# --- User Configuration ---

# 1. Enter all text strings you want to remove in this list.
TEXTS_TO_DELETE = [
    "The watermark text you want to delete"
]

# 2. The path to the folder containing all your PDF files.
SOURCE_FOLDER = "Your file path"
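
Before running the script, it can help to sanity-check the two settings above. The following is a minimal sketch using only the standard library; validate_config is a hypothetical helper name and is not part of the original script.

```python
import os


def validate_config(source_folder, texts_to_delete):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not part of the original script.
    """
    problems = []
    if not os.path.isdir(source_folder):
        problems.append(f"SOURCE_FOLDER is not an existing folder: {source_folder!r}")
    # Empty or whitespace-only entries are never useful search strings.
    if not any(text.strip() for text in texts_to_delete):
        problems.append("TEXTS_TO_DELETE contains no non-empty strings")
    return problems
```

One way to use it would be to print each returned problem and stop before any PDF is touched, for example by looping over validate_config(SOURCE_FOLDER, TEXTS_TO_DELETE).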

Run the script. It will automatically:

Search for all PDF files in the specified folder and its subfolders.
Find and delete the specified watermark text.
Update the processed files in place.
Display the processing progress and results summary.
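
The first step in the list above, finding every PDF in a folder and its subfolders, can be sketched on its own. This is an illustrative generator built on os.walk(), as the script is; the name iter_pdf_paths is an assumption made for this example.

```python
import os


def iter_pdf_paths(source_folder):
    """Yield the full path of every PDF under source_folder, recursively.

    The extension check is case-insensitive, so "Report.PDF" is included.
    """
    for root, _dirs, files in os.walk(source_folder):
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield os.path.join(root, name)
```

Because it is a generator, the paths are produced lazily, which keeps memory use flat even for very large folder trees.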

How It Works

File Search: Uses os.walk() to recursively traverse all folders and find all PDF files.
Text Processing:

Scans each page of each PDF file.
Uses PyMuPDF’s search_for() method to locate the target text.
Safely deletes the text using add_redact_annot() and apply_redactions() methods.
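
The three steps above can be condensed into one small function per page. This is a sketch rather than the script's exact code: redact_texts_on_page is a hypothetical name, and the page argument is assumed to be a PyMuPDF page, of which only search_for(), add_redact_annot() and apply_redactions() are used.

```python
def redact_texts_on_page(page, texts):
    """Mark every occurrence of each string on one page, then apply once.

    `page` is assumed to be a PyMuPDF page object. Returns the number of
    occurrences that were marked for redaction.
    """
    marked = 0
    for text in texts:
        # search_for() returns one rectangle per occurrence of the text.
        for rect in page.search_for(text):
            # Same white fill as in the article's script.
            page.add_redact_annot(rect, fill=(1, 1, 1))
            marked += 1
    # Only rewrite the page content if something was actually marked.
    if marked:
        page.apply_redactions()
    return marked


# Possible usage with PyMuPDF installed (paths and strings are placeholders):
#   import fitz
#   doc = fitz.open("example.pdf")
#   total = sum(redact_texts_on_page(page, ["Sample watermark"]) for page in doc)
```

Counting matches per page, instead of using one flag for the whole file, means apply_redactions() is only called on pages that actually contain a match.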

Safe Saving:

Saves the modified document to a temporary file next to the original.
Replaces the original file only after the new file is successfully saved.
Automatically cleans up temporary files in case of errors.
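
The save-then-swap pattern described above is independent of PDFs, so it can be shown with the standard library alone. In this sketch, replace_file_safely and the write_to callback are hypothetical names; it uses os.replace() to swap the files in a single call instead of a separate delete and rename.

```python
import os
import tempfile


def replace_file_safely(target_path, write_to):
    """Write new content to a temp file, then swap it in for target_path.

    write_to(temp_path) must create the new file content at temp_path.
    If it raises, the temp file is removed and target_path is left as it was.
    """
    folder = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file in the same folder so the swap stays on one filesystem.
    fd, temp_path = tempfile.mkstemp(suffix=".tmp", dir=folder)
    os.close(fd)  # write_to() opens the path itself
    try:
        write_to(temp_path)
        os.replace(temp_path, target_path)
    except Exception:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

With PyMuPDF, write_to could be a small callback that calls doc.save(temp_path, ...) and then doc.close(), so that the original file is no longer open when it is replaced.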

Error Handling:

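Catches errors raised while opening or reading corrupted PDF files, then moves on to the next file.
Skips password-protected PDF files (detected via doc.needs_pass).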
Errors on individual pages don’t affect the overall progress.
Provides detailed error logging.
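
The idea that one bad file should not stop the batch can be sketched generically. Here run_batch and process_one are hypothetical names: process_one stands for any per-file function that returns True when it modified the file, and failures are collected instead of being raised.

```python
import logging

logger = logging.getLogger("pdf_batch")


def run_batch(paths, process_one):
    """Call process_one(path) for every path, isolating failures per file.

    Returns (modified, failed): the paths for which process_one returned
    True, and (path, error message) pairs for the paths that raised.
    """
    modified, failed = [], []
    for path in paths:
        try:
            if process_one(path):
                modified.append(path)
        except Exception as exc:  # keep the batch going on any per-file error
            logger.warning("Skipping %s: %s", path, exc)
            failed.append((path, str(exc)))
    return modified, failed
```

Returning the failures as data makes it easy to print a summary at the end or to retry only the files that failed.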

Complete Code

import fitz  # PyMuPDF library
import os

# --- User Configuration ---

# 1. Enter all text strings you want to remove in this list.
#    The script will find and remove all text that exactly matches the strings in this list.
#    You can add as many strings as needed.
TEXTS_TO_DELETE = [
    "The watermark text you want to delete"
]

# 2. The path to the folder containing all your PDF files.
SOURCE_FOLDER = "Your file path"

def process_pdf_file(filepath):
    """
    Processes a single PDF file, removes the specified text, and overwrites the original file.

    Args:
        filepath: The full path to the PDF file.

    Returns:
        bool: True if the file was modified, False otherwise.
    """
    filename = os.path.basename(filepath)
    is_modified = False

    try:
        # Try to open the PDF file.
        doc = fitz.open(filepath)
        print(f"\nProcessing: {filename}...")

        if doc.needs_pass:
            print(f"  -> Skipping: File {filename} is password protected.")
            doc.close()
            return False

        # Iterate through each page of the PDF.
        for page_num in range(doc.page_count):
            try:
                page = doc[page_num]
            except Exception as e:
                print(f"  -> Warning: Error accessing page {page_num + 1}, skipping this page: {e}")
                continue

            # Search this page for each target string and mark every match for redaction.
            for text in TEXTS_TO_DELETE:
                try:
                    # search_for() returns a list of rectangles for all matching text instances.
                    found_instances = page.search_for(text)

                    # If matches are found.
                    if found_instances:
                        is_modified = True
                        print(f"  -> Found and marked text '{text}' on page {page_num + 1}.")
                        # Add a redaction annotation for each found instance.
                        for inst in found_instances:
                            try:
                                page.add_redact_annot(inst, fill=(1, 1, 1))
                            except Exception as e:
                                print(f"  -> Warning: Error adding redaction annotation on page {page_num + 1}: {e}")
                                continue
                except Exception as e:
                    print(f"  -> Warning: Error searching for text on page {page_num + 1}: {e}")
                    continue

            # Apply all redaction annotations on this page to actually remove the text.
            if is_modified:
                try:
                    page.apply_redactions()
                except Exception as e:
                    print(f"  -> Warning: Error applying redactions on page {page_num + 1}: {e}")
                    continue

        # If the file was modified, save it in place.
        if is_modified:
            try:
                # Save using a temporary file.
                temp_filepath = filepath + ".temp"
                doc.save(temp_filepath, garbage=4, deflate=True, clean=True)
                doc.close()

                # Replace the original with the temporary file in a single step,
                # so there is no moment at which neither file exists.
                os.replace(temp_filepath, filepath)

                print("  -> Successfully removed the specified text and overwrote the original file.")
            except Exception as e:
                print(f"  -> Error saving file: {e}")
                # Clean up the temporary file (if it exists).
                if os.path.exists(temp_filepath):
                    os.remove(temp_filepath)
                return False
        else:
            print(f"  -> No text to delete found in this file, skipping.")
            doc.close()
        return is_modified

    except Exception as e:
        print(f"  -> Error processing file {filename}: {e}")
        return False

def batch_delete_text_by_content():
    """
    Recursively searches and processes all PDF files, removing the specified text content.
    """
    total_files = 0
    modified_files = 0

    # Recursively traverse all folders.
    for root, _, files in os.walk(SOURCE_FOLDER):
        # Filter out all PDF files.
        pdf_files = [f for f in files if f.lower().endswith('.pdf')]
        total_files += len(pdf_files)

        # Process all PDF files in the current folder.
        for pdf_file in pdf_files:
            filepath = os.path.join(root, pdf_file)
            if process_pdf_file(filepath):
                modified_files += 1

    print(f"\n🎉 Processing complete!")
    print(f"Total PDF files processed: {total_files}")
    print(f"Files modified: {modified_files}")

# --- Run the script ---
if __name__ == "__main__":
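    # The placeholder value "Your file path" is not a real folder, so this check catches it too.
    if not os.path.isdir(SOURCE_FOLDER):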
        print("Error: Please set the correct path for 'SOURCE_FOLDER' in the script!")
    else:
        batch_delete_text_by_content()
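
Since the script overwrites the original files, a read-only dry run may be worth doing first. The sketch below is not part of the script above: count_matches is a hypothetical helper, and doc is assumed to be an open PyMuPDF document, which can be iterated page by page.

```python
def count_matches(doc, texts):
    """Return {text: number of occurrences} across all pages of `doc`.

    `doc` is assumed to be an open PyMuPDF document; nothing is modified.
    """
    counts = {text: 0 for text in texts}
    for page in doc:
        for text in texts:
            counts[text] += len(page.search_for(text))
    return counts


# Possible usage with PyMuPDF installed (the path is a placeholder):
#   import fitz
#   with fitz.open("example.pdf") as doc:
#       print(count_matches(doc, ["Sample watermark"]))
```

Running the same count again after processing gives a quick check that no occurrences are left.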
