PDF Batch Watermark Remover: Elegantly Clean Up Watermarks and Text from Your PDF Documents

In day-to-day work and study, we often run into PDF documents with watermarks. They can hurt readability, especially when they overlap the main content. I downloaded a compressed archive of research papers online, and, unsurprisingly, some of the papers carried watermarks from tutoring agencies, which was quite annoying (as shown in the image below). Most solutions I found online use the PyMuPDF library to cover the watermarks with white blocks, which I didn’t find very elegant. So I spent some time developing a more refined solution and am documenting it here.

Features

  • Recursive Search: Automatically processes PDF files in a specified folder and all its subfolders.
  • Precise Matching: Removes only the text that matches the strings you specify (PyMuPDF’s search ignores letter case), leaving other content untouched.
  • In-Place Processing: Overwrites the original files, so no second copy of each PDF is kept on disk.
  • Safe and Reliable: Saves each result to a temporary file first and replaces the original only after the save succeeds.
  • Error Handling: Robust error handling ensures that batch processing won’t be interrupted by failures with individual files.
  • Detailed Logging: Provides detailed information about the processing progress.

How to Use

  1. Ensure you have a Python environment installed.
  2. Install the necessary dependency:
pip install PyMuPDF
  3. Download and configure the script (complete code at the end of this article):
import fitz  # PyMuPDF library
import os

# --- User Configuration ---

# 1. Enter all text strings you want to remove in this list.
TEXTS_TO_DELETE = [
    "The watermark text you want to delete"
]

# 2. The path to the folder containing all your PDF files.
SOURCE_FOLDER = "Your file path"
  4. Run the script. It will automatically:
  • Search for all PDF files in the specified folder and its subfolders.
  • Find and delete the specified watermark text.
  • Update the processed files in place.
  • Display the processing progress and results summary.
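
Before starting a long batch run, it can help to sanity-check the two settings. The following is only a rough sketch and not part of the script itself: the helper name find_config_problems is hypothetical, and it relies on nothing beyond the standard library.

```python
import os


def find_config_problems(source_folder, texts_to_delete):
    """Return a list of human-readable problems with the two settings (hypothetical helper)."""
    problems = []
    # The folder must exist, otherwise os.walk() silently finds nothing.
    if not os.path.isdir(source_folder):
        problems.append(f"SOURCE_FOLDER is not an existing folder: {source_folder!r}")
    # An empty list means there is nothing to remove.
    if not texts_to_delete:
        problems.append("TEXTS_TO_DELETE is empty")
    # Blank strings cannot be searched for in a meaningful way.
    for text in texts_to_delete:
        if not text.strip():
            problems.append("TEXTS_TO_DELETE contains a blank string")
    return problems


# Example with the unedited placeholder path from the configuration block.
for problem in find_config_problems("Your file path", ["The watermark text you want to delete"]):
    print("Config problem:", problem)
```

An empty result means both settings look usable; anything else is worth fixing before the files are touched.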

How It Works

  1. File Search: Uses os.walk() to recursively traverse all folders and find all PDF files.
  2. Text Processing:
  • Scans each page of each PDF file.
  • Uses PyMuPDF’s search_for() method to locate the target text.
  • Marks each match with add_redact_annot(), then removes the marked text with apply_redactions().
  3. Safe Saving:
  • Saves the modified document to a temporary file next to the original.
  • Replaces the original file only after the new file is successfully saved.
  • Automatically cleans up temporary files in case of errors.
  4. Error Handling:
  • Handles corrupted PDF files.
  • Skips encrypted PDF files.
  • Errors on individual pages don’t affect the overall progress.
  • Provides detailed error logging.

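The safe-saving idea does not depend on PyMuPDF and can be sketched with the standard library alone. In this illustration the helper name save_then_replace and the write_to callback are hypothetical; os.replace() is used for the swap so that the new file takes the place of the old one in a single step.

```python
import os
import tempfile


def save_then_replace(target_path, write_to):
    """Call write_to(temp_path), then swap the temp file in for target_path (hypothetical helper)."""
    folder = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file in the same folder so the final swap stays on one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=folder, suffix=".temp")
    os.close(fd)  # only the reserved name is needed; write_to opens the file itself
    try:
        write_to(temp_path)
        os.replace(temp_path, target_path)
    except Exception:
        # Leave the original untouched and remove the leftover temp file.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


def write_demo(path):
    """Stand-in for the real save call; writes fixed bytes."""
    with open(path, "wb") as handle:
        handle.write(b"new content")


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "paper.pdf")
    with open(target, "wb") as handle:
        handle.write(b"old content")
    save_then_replace(target, write_demo)
    with open(target, "rb") as handle:
        print(handle.read())
```

With PyMuPDF, write_to could be a small function that calls doc.save() on the temporary path. On Windows the document should also be closed before the swap, since an open file cannot be replaced there.
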
Complete Code

import fitz  # PyMuPDF library
import os

# --- User Configuration ---

# 1. Enter all text strings you want to remove in this list.
#    The script will find and remove every occurrence of the strings in this list.
#    You can add as many strings as needed.
TEXTS_TO_DELETE = [
    "The watermark text you want to delete"
]

# 2. The path to the folder containing all your PDF files.
SOURCE_FOLDER = "Your file path"

def process_pdf_file(filepath):
    """
    Processes a single PDF file, removes the specified text, and overwrites the original file.

    Args:
        filepath: The full path to the PDF file.

    Returns:
        bool: True if the file was modified, False otherwise.
    """
    filename = os.path.basename(filepath)
    is_modified = False

    try:
        # Try to open the PDF file.
        doc = fitz.open(filepath)
        print(f"\nProcessing: {filename}...")

        if doc.needs_pass:
            print(f"  -> Skipping: File {filename} is password protected.")
            doc.close()
            return False

        # Iterate through each page of the PDF.
        for page_num in range(doc.page_count):
            try:
                page = doc[page_num]
            except Exception as e:
                print(f"  -> Warning: Error accessing page {page_num + 1}, skipping this page: {e}")
                continue

            # Search this page for each target string and mark every match for redaction.
            page_modified = False
            for text in TEXTS_TO_DELETE:
                try:
                    # search_for() returns a list of rectangles for all matching text instances.
                    found_instances = page.search_for(text)

                    # If matches are found.
                    if found_instances:
                        print(f"  -> Found and marked text '{text}' on page {page_num + 1}.")
                        # Add a redaction annotation for each found instance.
                        for inst in found_instances:
                            try:
                                page.add_redact_annot(inst, fill=(1, 1, 1))
                                page_modified = True
                            except Exception as e:
                                print(f"  -> Warning: Error adding redaction annotation on page {page_num + 1}: {e}")
                                continue
                except Exception as e:
                    print(f"  -> Warning: Error searching for text on page {page_num + 1}: {e}")
                    continue

            # Apply the redaction annotations added to this page to actually remove the text.
            # Pages without matches are left alone.
            if page_modified:
                try:
                    page.apply_redactions()
                    is_modified = True
                except Exception as e:
                    print(f"  -> Warning: Error applying redactions on page {page_num + 1}: {e}")
                    continue

        # If the file was modified, save it in place.
        if is_modified:
            temp_filepath = filepath + ".temp"
            try:
                # Save to a temporary file first so a failed save cannot damage the original.
                doc.save(temp_filepath, garbage=4, deflate=True, clean=True)
                doc.close()

                # Swap the temporary file in for the original in a single step.
                os.replace(temp_filepath, filepath)

                print("  -> Successfully removed the specified text and overwrote the original file.")
            except Exception as e:
                print(f"  -> Error saving file: {e}")
                # Release the document if it is still open.
                if not doc.is_closed:
                    doc.close()
                # Clean up the temporary file (if it exists).
                if os.path.exists(temp_filepath):
                    os.remove(temp_filepath)
                return False
        else:
            print(f"  -> No text to delete found in this file, skipping.")
            doc.close()
        return is_modified

    except Exception as e:
        print(f"  -> Error processing file {filename}: {e}")
        return False

def batch_delete_text_by_content():
    """
    Recursively searches and processes all PDF files, removing the specified text content.
    """
    total_files = 0
    modified_files = 0

    # Recursively traverse all folders.
    for root, _, files in os.walk(SOURCE_FOLDER):
        # Filter out all PDF files.
        pdf_files = [f for f in files if f.lower().endswith('.pdf')]
        total_files += len(pdf_files)

        # Process all PDF files in the current folder.
        for pdf_file in pdf_files:
            filepath = os.path.join(root, pdf_file)
            if process_pdf_file(filepath):
                modified_files += 1

    print(f"\n🎉 Processing complete!")
    print(f"Total PDF files processed: {total_files}")
    print(f"Files modified: {modified_files}")

# --- Run the script ---
if __name__ == "__main__":
    if SOURCE_FOLDER == "Your file path" or not os.path.isdir(SOURCE_FOLDER):
        print("Error: Please set the correct path for 'SOURCE_FOLDER' in the script!")
    else:
        batch_delete_text_by_content()
