PDF Metadata Extraction with Python

This paper explores techniques for programmatically extracting metadata from PDF files using Python. It begins by detailing the internal structure of PDF documents, focusing on the system of indirect references and objects within the PDF binary, the document information dictionary, and the XMP metadata contained in the file's metadata streams. Next, the paper examines the most common means of accessing PDF metadata with Python: the high-level PyPDF and PyPDF2 libraries. This examination reveals deficiencies in the methodologies used by these modules that make them inappropriate for use in digital forensics investigations. An alternative low-level technique, carving the PDF binary directly with Python using the re module from the standard library, is then described and found to extract all of the pertinent metadata from the PDF file accurately and with a degree of completeness suitable for digital forensics use cases. These low-level techniques are built into a stand-alone open source Linux utility, pdf-metadata, which is discussed in the paper's final section.
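To illustrate the kind of carving the abstract describes, the following is a minimal sketch that scans the raw bytes of a PDF with the re module. It is not the paper's implementation or the code of the pdf-metadata utility: the function name carve_metadata, the three regular expressions, and the shape of the returned dictionary are all assumptions made for this example. It handles only literal (parenthesised) strings without nested parentheses, and it does not look inside compressed object streams or encrypted files.

```python
import re

# All patterns below are illustrative assumptions, not taken from the paper.
# Trailer reference to the document information dictionary, e.g. "/Info 1 0 R".
INFO_REF_RE = re.compile(rb"/Info\s+(\d+)\s+(\d+)\s+R")
# A complete XMP packet body, from <x:xmpmeta ...> to </x:xmpmeta>.
XMP_RE = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)
# A "/Key (literal string)" pair; escaped characters are allowed,
# nested unescaped parentheses and hex strings are not handled.
LITERAL_RE = re.compile(rb"/(\w+)\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def carve_metadata(data: bytes) -> dict:
    """Carve info-dictionary fields and XMP packets from raw PDF bytes."""
    info = []
    # Follow every distinct /Info reference found anywhere in the file.
    for num, gen in sorted(set(INFO_REF_RE.findall(data))):
        obj_re = re.compile(
            rb"(?<!\d)" + num + rb"\s+" + gen + rb"\s+obj\b(.*?)endobj",
            re.DOTALL,
        )
        # An incrementally updated file can hold several objects with the
        # same number, so every match is kept rather than only the last.
        for body in obj_re.findall(data):
            fields = {
                key.decode("latin-1"): value.decode("latin-1")
                for key, value in LITERAL_RE.findall(body)
            }
            info.append(fields)
    xmp = [packet.decode("utf-8", "replace") for packet in XMP_RE.findall(data)]
    return {"info": info, "xmp": xmp}
```

A call such as carve_metadata(open("sample.pdf", "rb").read()), where sample.pdf is a hypothetical path, returns every matching information dictionary and XMP packet found in the bytes, including any left behind by earlier revisions of the file.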

5 Feb 2019

By Christopher A. Plaisance

All papers are copyrighted

No re-posting of papers is permitted
