Embedding Referential Metadata in PDFs Is a Good Idea

By John Howie

The PDF standard lets users embed non-visible metadata within PDFs as attribute name and attribute value pairs. This feature can be used to embed referential metadata, which is normally stored outside the files and used to find or otherwise work with them. Here are some reasons why embedding metadata values can be a good idea:

Facilitate Transferring Files - Content management systems accumulate metadata about individual files. Copying or transferring the entire collection, or a subset of it, means exporting that metadata along with path and file names that sync up with the target system. One way to avoid this complexity is to embed the metadata within the PDFs themselves so that the target system can index them and extract the metadata name and value pairs at the same time.

Simplify Coordination of Multiple Versions - Some systems maintain different versions of documents for various reasons, e.g., there may be *.txt files with the same name and folder structure as image-only TIF files in order to provide search capability for the documents, or there may be separate files containing translations. Rather than coordinating the maintenance of all those copies, they can simply be embedded in the PDF representations.

Get Benefits of Both PDF and Native Files - A major benefit of the PDF format is that PDFs can be widely distributed regardless of whether the recipients have the software that created the original files. However, distributing just PDFs can deprive the recipients who do have the original software of the ability to edit and work with those native files. Embedding the original native file in the PDF version gives a solution that is the best of both worlds.

Make Original Folder Paths Searchable - People who create folder structures to store files often use folder names that have informational value, e.g., an energy company might have top-level folders for country, then a deeper folder level for document type, then names of specific projects with latitude and longitude coordinates. When those files are indexed, the search or content management system will usually not permit searching the folder and file names, especially if they contain characters like underscores or dashes.

One solution is to parse the folders in the file path and build attribute name/value pairs to include within the PDF. Then, when the search or content management system indexes the PDFs with embedded metadata, those terms become searchable and usable. This is particularly useful for organizations that use systems like:

  • EMC Documentum D2
  • OpenText
  • SharePoint with either FAST or MOSS
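
The path-parsing step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the folder-level names (Country, DocumentType, Project) follow the energy company example, and the function name, attribute names, and sample path are all hypothetical.

```python
from pathlib import PureWindowsPath

# Hypothetical mapping from folder depth to attribute name. The levels
# mirror the energy company example and are an assumption.
LEVEL_NAMES = ["Country", "DocumentType", "Project"]


def path_to_attributes(path, root, level_names=LEVEL_NAMES):
    """Turn the folders between root and the file into name/value pairs."""
    full = PureWindowsPath(path)
    relative = full.relative_to(PureWindowsPath(root))
    folders = relative.parts[:-1]  # every part except the file name
    attrs = {}
    for name, folder in zip(level_names, folders):
        # Replace underscores and dashes with spaces so that each word
        # in the folder name can be indexed as a separate term.
        attrs[name] = folder.replace("_", " ").replace("-", " ")
    # Keep the untouched original location as referential metadata.
    attrs["OriginalPath"] = str(full)
    attrs["FileName"] = relative.name
    return attrs


# Hypothetical sample path, for illustration only.
example = path_to_attributes(
    r"S:\Archive\Norway\Well_Logs\North-Sea_Block7\report.pdf",
    r"S:\Archive",
)
for attr_name, attr_value in example.items():
    print(f"{attr_name}: {attr_value}")
```

The resulting dictionary would then be written into the PDF with whatever PDF library or tool the organization uses. Note that replacing every dash would also strip the minus sign from a negative latitude or longitude, so folder names that carry coordinates would need their own handling.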

Overcome Broken Link Issues - Search and content management systems depend on files remaining where they were located when they were initially indexed. However, files sometimes get moved to new drives, or changes in drive mappings alter the paths to them. In either case the pointers to the indexed files are no longer valid and those files can be essentially lost from view. By including the referential metadata values within the PDF, those values are no longer susceptible to being disconnected from the files. Whenever the files are indexed, those metadata values can be re-incorporated in the index to help find and work with those files.

Use Same Approach for Inferential Metadata - Inferential metadata is data that can be inferred from examination of the files themselves, e.g., Loan Applicant, Well Number, Property Description, etc. Having data values associated with specific fields enables far easier and more precise searching than is available with text alone. Inferential attribute name and value pairs can be stored within PDFs in the same manner as referential metadata so that they can be used independently of the search or content system that manages them.
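
As a rough illustration of how inferential name/value pairs might be produced, the Python sketch below pulls two of the example fields out of a document's extracted text with regular expressions. Pattern matching is used here only as a simple stand-in for whatever examination technique is actually applied. The patterns, attribute names, and sample text are hypothetical, and real documents would need patterns tuned to their layouts.

```python
import re

# Hypothetical patterns for two of the example fields. Real documents
# would need patterns tuned to their own wording and layout.
FIELD_PATTERNS = {
    "WellNumber": re.compile(
        r"Well\s+(?:Number|No\.?)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "LoanApplicant": re.compile(r"Loan\s+Applicant\s*:\s*(.+)", re.IGNORECASE),
}


def infer_attributes(text, patterns=FIELD_PATTERNS):
    """Return name/value pairs for every pattern that matches the text."""
    attrs = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            attrs[name] = match.group(1).strip()
    return attrs


# Made-up sample text, for illustration only.
sample_text = "Loan Applicant: Jane Doe\nWell Number: 42-A17\n"
inferred = infer_attributes(sample_text)
print(inferred)
```

The resulting pairs have the same shape as referential name/value pairs, so they can be embedded in the PDF in the same way.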

John Howie is a consultant with information governance specialists BeyondRecognition (BR) based in Houston, Texas. Contact info@beyondrecognition.net.