Quick PDF Analysis


Today we demonstrate how to quickly analyze a suspicious PDF file to determine whether it contains malicious content. PDF documents are a favorite vector for attackers because they support embedded scripts, multimedia, and complex object structures that can hide shellcode.

Filename: 010820170003375296186050723708.pdf
MD5: b2fbd8077726f78884e5330979b213a1
Status: Download Sample



Full Analysis Walkthrough



Understanding PDF Malware Structure

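To analyze a PDF, you must understand its four-part structure: the Header (the %PDF version line), the Body (numbered objects such as text, images, and scripts), the Cross-Reference (XREF) Table (the byte offset of each object), and the Trailer (which points the reader to the XREF table and the root object, so parsing effectively starts from the end of the file).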
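To make the four parts concrete, here is a minimal sketch that locates them in raw bytes. The function names and the tiny two-object sample are hypothetical, and the regular expressions assume a classic, uncompressed cross-reference table; a real parser has to handle much more.

```python
import re


def build_sample_pdf() -> bytes:
    """Assemble a tiny, hypothetical two-object PDF for demonstration."""
    body = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    )
    xref_offset = len(body)  # the xref section starts right after the body
    xref = (
        b"xref\n0 3\n"
        b"0000000000 65535 f \n"
        + b"%010d 00000 n \n" % body.index(b"1 0 obj")
        + b"%010d 00000 n \n" % body.index(b"2 0 obj")
    )
    trailer = b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    tail = b"startxref\n" + str(xref_offset).encode() + b"\n%%EOF\n"
    return body + xref + trailer + tail


def describe_pdf_layout(data: bytes) -> dict:
    """Report where the four structural parts sit in a raw PDF byte string."""
    # Readers tolerate some junk before the header, so search the first 1 KB.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    # The last startxref keyword holds the byte offset of the xref section.
    startxref = re.findall(rb"startxref\s+(\d+)", data)
    return {
        "version": header.group(1).decode() if header else None,
        "object_count": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)),
        "xref_offset": int(startxref[-1]) if startxref else None,
        "has_trailer": b"trailer" in data,
    }
```

Calling describe_pdf_layout(build_sample_pdf()) returns the header version, the number of "N G obj" definitions, and the offset at which the xref keyword should be found. A mismatch between that offset and the actual position of the table is itself worth noting during triage.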

Attackers typically hide malicious content in "Streams", using filters like /FlateDecode to compress the data so that it is unreadable in a plain text view, and use the /JS and /JavaScript keys to attach scripts that run exploit code automatically.
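Since /FlateDecode is zlib compression, a stream can often be inflated with a few lines of Python. The sketch below is a rough illustration only: inflate_streams is a hypothetical helper, it assumes each stream uses a single /FlateDecode filter, and it ignores the /Length entry, filter chains, and encryption.

```python
import re
import zlib

# Everything between the "stream" and "endstream" keywords, across newlines.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return the content of every stream, inflated when it is zlib data."""
    results = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates stray end-of-line bytes around the data.
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate data (or another filter is in use): keep the raw bytes.
            results.append(raw)
    return results
```

The inflated output is where keywords, URLs, and script bodies that were invisible in the raw file usually show up.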



The Analysis Process



Step 1: Initial Triage and Identification

Malicious PDFs are often smaller than legitimate documents because they focus on delivering a specific exploit rather than visual content. However, they may also be inflated with garbage data to evade automated scanners. Start with an antivirus scan and a lookup of static identifiers, such as the file's MD5 hash, to see whether the sample is already known.
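Computing the hashes yourself is a quick first step before any lookup. This is a minimal sketch using only the standard library; the helper names are hypothetical, and SHA-256 is included alongside MD5 only as an assumption about what a lookup service may ask for.

```python
import hashlib
import io


def hashes_of_stream(stream, chunk_size: int = 65536) -> dict:
    """Hash a binary file-like object in chunks and report its size."""
    md5, sha256, size = hashlib.md5(), hashlib.sha256(), 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        sha256.update(chunk)
        size += len(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest(), "size": size}


def hashes_of_file(path: str) -> dict:
    """Convenience wrapper for a sample stored on disk."""
    with open(path, "rb") as handle:
        return hashes_of_stream(handle)


def hashes_of_bytes(data: bytes) -> dict:
    """Convenience wrapper for a sample already held in memory."""
    return hashes_of_stream(io.BytesIO(data))
```

Running hashes_of_file on the sample and comparing the md5 value with the one published for it confirms you are analyzing the same file, and the reported size gives a first hint about padding.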



Step 2: Metadata and Properties

Examine the document properties using tools like ExifTool. Suspicious metadata, such as a creation date that is wildly inconsistent with the content or an "Author" that looks like a random string, can provide clues about the origin of the file.
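When ExifTool is not at hand, a rough look at the same fields is possible with a regular expression. The sketch below is not how ExifTool works; the helper names are hypothetical, and it assumes the Info dictionary is stored uncompressed, with plain literal strings and a full 14-digit date.

```python
import re
from datetime import datetime

# Matches entries such as "/Author (Jane)" with a simple literal string value.
INFO_FIELD_RE = re.compile(
    rb"/(Author|Creator|Producer|CreationDate|ModDate)\s*\(([^)]*)\)"
)


def read_info_fields(data: bytes) -> dict:
    """Collect the first value seen for each common Info dictionary key."""
    fields = {}
    for key, value in INFO_FIELD_RE.findall(data):
        fields.setdefault(key.decode(), value.decode("latin-1"))
    return fields


def parse_pdf_date(value: str):
    """Parse the leading D:YYYYMMDDHHmmSS part of a PDF date, else None."""
    match = re.match(r"D:(\d{14})", value)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S") if match else None
```

Comparing the parsed creation date with dates mentioned in the document text, or with the date the file was received, is one simple way to spot the inconsistencies described above.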



Step 3: Object Analysis with PDF Tools

The most effective way to find hidden threats is to parse the internal objects.



  • PDFiD: Quickly scans the file for "dangerous" keywords like /JS, /JavaScript, /OpenAction, or /EmbeddedFile.
  • PDFStreamDumper: An essential tool for extracting and decompressing stream data to find shellcode or hidden URLs.
  • Origami: A powerful Ruby framework for parsing and manipulating the PDF object tree.
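The keyword-scanning idea behind the first tool can be sketched in a few lines. This is not PDFiD's implementation, only a simplified illustration: the names are hypothetical, the keyword list is the one from the bullet above, and the hex-escape decoding is applied bluntly to the whole file rather than to names only.

```python
import re

SUSPICIOUS_NAMES = ("/JS", "/JavaScript", "/OpenAction", "/EmbeddedFile")


def normalize_names(data: bytes) -> bytes:
    """Decode #xx hex escapes, which PDF allows inside names (/J#53 is /JS)."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def count_keywords(data: bytes) -> dict:
    """Count each suspicious name after undoing hex-escape obfuscation."""
    text = normalize_names(data)
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Reject a trailing letter or digit so /JS does not match /JSFoo.
        pattern = re.escape(name.encode()) + rb"(?![A-Za-z0-9])"
        counts[name] = len(re.findall(pattern, text))
    return counts
```

A non-zero count is a reason to look closer, not a verdict: many legitimate forms use JavaScript, and keywords inside compressed streams stay invisible until the streams are inflated.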


Step 4: Script Obfuscation & JavaScript

JavaScript is the most common delivery mechanism for PDF exploits. Attackers use heavily obfuscated scripts to bypass static detection. Tools like PDF-JS or the SpiderMonkey JavaScript engine can be used to extract and "de-obfuscate" these scripts in a safe environment to see what they are attempting to download or execute.
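Two encodings that commonly appear in such scripts can be undone statically, without running anything. The sketch below is an illustration under assumptions: both helper names are hypothetical, the %u decoder assumes little-endian byte order for each 16-bit unit (as on x86), and the fromCharCode decoder only handles literal decimal arguments.

```python
import re


def decode_percent_u(text: str) -> bytes:
    """Turn unescape()-style %uXXXX runs into raw bytes, low byte first."""
    units = re.findall(r"%u([0-9A-Fa-f]{4})", text)
    return b"".join(int(unit, 16).to_bytes(2, "little") for unit in units)


def decode_from_char_code(text: str) -> list:
    """Return the string built by each String.fromCharCode(...) call."""
    decoded = []
    for args in re.findall(r"String\.fromCharCode\(([\d\s,]+)\)", text):
        decoded.append("".join(chr(int(n)) for n in re.findall(r"\d+", args)))
    return decoded
```

The bytes returned by decode_percent_u can then be searched for URLs or handed to a disassembler. Scripts that build their payload at run time still need an instrumented JavaScript engine.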



Step 5: Dynamic Sandboxing

If static analysis is inconclusive, run the document in a sandbox like Cuckoo or Hybrid Analysis. This allows you to monitor network calls (HTTP/DNS requests) and file system changes without risking your host machine.



Conclusion

PDF analysis is a game of "hide and seek" within the object tree. By combining keyword scanning, stream decompression, and dynamic sandboxing, you can effectively unmask malicious intent. Always perform your analysis in a segmented, virtualized laboratory.



Happy hunting.