In day-to-day work and study, we often encounter PDF documents with watermarks. These watermarks can hinder readability, especially when they overlap with the main content. I downloaded a compressed archive of research papers online, and unsurprisingly, some papers had watermarks from tutoring agencies, which was quite annoying (as shown in the image below). I checked online, and most solutions use the PyMuPDF library to cover the watermarks with white blocks, which I didn’t find very elegant. So, I spent some time developing a more refined solution and am documenting it here.
Features
- Recursive Search: Automatically processes PDF files in a specified folder and all its subfolders.
- Precise Matching: Only deletes text that matches the specified strings, leaving other content untouched. Note that PyMuPDF’s search_for() ignores letter case.
- In-Place Processing: Modifies the original files directly, saving disk space.
- Safe and Reliable: Employs a temporary file mechanism to ensure file safety.
- Error Handling: Robust error handling ensures that batch processing won’t be interrupted by failures with individual files.
- Detailed Logging: Provides detailed information about the processing progress.
How to Use
- Ensure you have a Python environment installed.
- Install the necessary dependency:
pip install PyMuPDF
- Download and configure the script (complete code at the end of this article):
import fitz  # PyMuPDF library
import os

# --- User Configuration ---
# 1. Enter all text strings you want to remove in this list.
TEXTS_TO_DELETE = [
    "The watermark text you want to delete"
]

# 2. The path to the folder containing all your PDF files.
SOURCE_FOLDER = "Your file path"
- Run the script. It will automatically:
  - Search for all PDF files in the specified folder and its subfolders.
  - Find and delete the specified watermark text.
  - Update the processed files in place.
  - Display the processing progress and results summary.
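Because the script modifies files in place, it can help to preview which files a run would touch. The following is a minimal sketch of the search step on its own, not part of the script: the helper name find_pdfs is hypothetical, and it only lists paths without opening any PDF.

```python
import os


def find_pdfs(folder):
    """Yield the full path of every .pdf file under `folder`, recursively.

    `find_pdfs` is a hypothetical helper name used for illustration only.
    """
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            # Compare in lower case so "PAPER.PDF" is also picked up.
            if name.lower().endswith(".pdf"):
                yield os.path.join(root, name)


if __name__ == "__main__":
    # Replace the placeholder with your own folder before running.
    for path in find_pdfs("Your file path"):
        print(path)
```

Running it against your folder prints one path per line, which makes it easy to confirm the target set before any file is changed.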
How It Works
- File Search: Uses os.walk() to recursively traverse all folders and find all PDF files.
- Text Processing:
  - Scans each page of each PDF file.
  - Uses PyMuPDF’s search_for() method to locate the target text.
  - Safely deletes the text using the add_redact_annot() and apply_redactions() methods.
- Safe Saving:
  - Uses a temporary file mechanism to ensure file safety.
  - Replaces the original file only after the new file is successfully saved.
  - Automatically cleans up temporary files in case of errors.
- Error Handling:
  - Handles corrupted PDF files.
  - Skips encrypted PDF files.
  - Errors on individual pages don’t affect the overall progress.
  - Provides detailed error logging.
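The safe-saving idea can be tried in isolation with only the standard library. The sketch below is one possible way to express it, under stated assumptions: the function name replace_file_safely and the write_new_version callback are hypothetical, and os.replace() is used to swap the finished temporary file over the original.

```python
import os
import tempfile


def replace_file_safely(filepath, write_new_version):
    """Write a new version to a temporary file, then swap it over the original.

    `write_new_version` is a hypothetical callback: it receives the temporary
    path and must write the complete new content there.
    """
    folder = os.path.dirname(os.path.abspath(filepath))
    # Create the temporary file next to the original so the final swap
    # stays on the same filesystem.
    fd, temp_path = tempfile.mkstemp(suffix=".temp", dir=folder)
    os.close(fd)
    try:
        write_new_version(temp_path)
        # The original is only touched once the new file is fully written.
        os.replace(temp_path, filepath)
    except Exception:
        # Clean up the temporary file and leave the original as it was.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

With PyMuPDF, the callback would be the place to save the document to the temporary path and then close it. On Windows in particular, the original file usually cannot be replaced while it is still open.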
Complete Code
import fitz  # PyMuPDF library
import os

# --- User Configuration ---
# 1. Enter all text strings you want to remove in this list.
#    The script finds and removes every occurrence of these strings.
#    Note: PyMuPDF's search_for() ignores letter case.
#    You can add as many strings as needed.
TEXTS_TO_DELETE = [
    "The watermark text you want to delete"
]

# 2. The path to the folder containing all your PDF files.
SOURCE_FOLDER = "Your file path"


def process_pdf_file(filepath):
    """
    Processes a single PDF file, removes the specified text, and overwrites the original file.

    Args:
        filepath: The full path to the PDF file.

    Returns:
        bool: True if the file was modified, False otherwise.
    """
    filename = os.path.basename(filepath)
    is_modified = False
    try:
        # Try to open the PDF file.
        doc = fitz.open(filepath)
        print(f"\nProcessing: {filename}...")

        if doc.needs_pass:
            print(f" -> Skipping: File {filename} is password protected.")
            doc.close()
            return False

        # Iterate through each page of the PDF.
        for page_num in range(doc.page_count):
            try:
                page = doc[page_num]
            except Exception as e:
                print(f" -> Warning: Error accessing page {page_num + 1}, skipping this page: {e}")
                continue

            # Track whether this particular page received any redaction annotations.
            page_has_redactions = False

            # Search this page for each target string.
            for text in TEXTS_TO_DELETE:
                try:
                    # search_for() returns a list of rectangles for all matching text instances.
                    found_instances = page.search_for(text)
                    # If matches are found.
                    if found_instances:
                        print(f" -> Found and marked text '{text}' on page {page_num + 1}.")
                        # Add a redaction annotation for each found instance.
                        for inst in found_instances:
                            try:
                                page.add_redact_annot(inst, fill=(1, 1, 1))
                                page_has_redactions = True
                            except Exception as e:
                                print(f" -> Warning: Error adding redaction annotation on page {page_num + 1}: {e}")
                                continue
                except Exception as e:
                    print(f" -> Warning: Error searching for text on page {page_num + 1}: {e}")
                    continue

            # Apply the redaction annotations on this page to actually remove the text.
            if page_has_redactions:
                try:
                    page.apply_redactions()
                    is_modified = True
                except Exception as e:
                    print(f" -> Warning: Error applying redactions on page {page_num + 1}: {e}")
                    continue

        # If the file was modified, save it in place.
        if is_modified:
            # Save to a temporary file first so a failed save cannot damage the original.
            temp_filepath = filepath + ".temp"
            try:
                doc.save(temp_filepath, garbage=4, deflate=True, clean=True)
                doc.close()
                # Replace the original only after the new file has been saved.
                os.replace(temp_filepath, filepath)
                print(" -> Successfully removed specified text and overwritten the original file.")
            except Exception as e:
                print(f" -> Error saving file: {e}")
                if not doc.is_closed:
                    doc.close()
                # Clean up the temporary file (if it exists).
                if os.path.exists(temp_filepath):
                    os.remove(temp_filepath)
                return False
        else:
            print(" -> No text to delete found in this file, skipping.")
            doc.close()

        return is_modified
    except Exception as e:
        print(f" -> Error processing file {filename}: {e}")
        return False


def batch_delete_text_by_content():
    """
    Recursively searches and processes all PDF files, removing the specified text content.
    """
    total_files = 0
    modified_files = 0

    # Recursively traverse all folders.
    for root, _, files in os.walk(SOURCE_FOLDER):
        # Keep only the PDF files.
        pdf_files = [f for f in files if f.lower().endswith('.pdf')]
        total_files += len(pdf_files)

        # Process all PDF files in the current folder.
        for pdf_file in pdf_files:
            filepath = os.path.join(root, pdf_file)
            if process_pdf_file(filepath):
                modified_files += 1

    print("\n🎉 Processing complete!")
    print(f"Total PDF files processed: {total_files}")
    print(f"Files modified: {modified_files}")


# --- Run the script ---
if __name__ == "__main__":
    # The placeholder value is not a real folder, so this also catches an unset path.
    if not os.path.isdir(SOURCE_FOLDER):
        print("Error: Please set the correct path for 'SOURCE_FOLDER' in the script!")
    else:
        batch_delete_text_by_content()