Does PDF/A pass muster for archiving?

Is PDF/A the right solution for long-term archiving? A German academic has published a detailed paper that criticises the format's accessibility and reusability and considers the alternatives.

PDF/A differs from PDF by prohibiting features ill-suited to long-term archiving, such as font linking (fonts must be embedded in the file instead) and encryption.

Marco Klindt of the Zuse Institute Berlin, a research institute for applied mathematics and computer science, believes the PDF/A format has shortcomings and potential pitfalls that may create problems for present and future content users.

Klindt says, “Converting ‘normal’ PDFs to PDF/A a-level conformance automatically is not advisable as a lot of information may already be lost during the creation process of the document.

“The benefit and convenience of PDF to easily capture all kinds of textual and graphical information in an electronic equivalent of a stack of paper comes at a cost for digital archives. In the digital preservation workflow technical validation is an essential step to ensure files are valid with respect to the specification of the file format they claim to be. This process will always be costly as it involves manual assessment as the tools are not yet usable for a fully automatic workflow …”
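The gap between what a file claims to be and what it actually is can be illustrated with a short Python sketch. It only reads the PDF/A level a file declares in its XMP metadata (the `pdfaid:part` and `pdfaid:conformance` properties); it does not validate anything. The function name `declared_pdfa_level` is hypothetical, and the sketch assumes the XMP packet is stored uncompressed in the file, which may not hold for every document.

```python
import re
from typing import Optional

# Match both XMP serialisations of the PDF/A identification properties:
# element form  <pdfaid:part>1</pdfaid:part>
# attribute form pdfaid:part="1"
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def declared_pdfa_level(data: bytes) -> Optional[str]:
    """Return the PDF/A level a file claims (e.g. '1b'), or None if it makes no claim.

    This is a heuristic byte scan, not a validator: a file can declare
    PDF/A conformance and still violate the specification.
    """
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conf = _CONF.search(data)
    if conf is not None:
        level += conf.group(1).decode("ascii").lower()
    return level


# Made-up metadata fragment for illustration only.
sample = (
    b'<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">'
    b"<pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance>"
    b"</rdf:Description>"
)
print(declared_pdfa_level(sample))
```

A result such as `1b` says only that the file makes the claim. Confirming the claim is the costly validation step the paper describes.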

“Despite the reusability issues, exporting to PDF sometimes also results in significant loss of information apart from text structure. Two examples: Spreadsheet formulas and numerical precision are lost, making testing data sets more difficult. Storing OCR results as invisible text over the digital facsimiles loses the confidence values for characters of the recognition software.”
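The spreadsheet example can be made concrete with a few lines of Python. This is a minimal sketch under assumed conditions: a hypothetical sheet whose cell C1 holds the formula =A1/B1 and is displayed with two decimal places, which is then flattened to a page. The helper `render_cell` is invented for illustration.

```python
def render_cell(value: float, decimals: int = 2) -> str:
    """Format a number as it might be displayed in a printed spreadsheet cell."""
    return f"{value:.{decimals}f}"


# Hypothetical sheet: C1 holds the formula =A1/B1.
a1, b1 = 1.0, 3.0
c1 = a1 / b1

printed = render_cell(c1)    # the only thing that reaches the flattened page
recovered = float(printed)   # the best a later reader can extract from it

print(printed)               # 0.33
print(abs(recovered - c1))   # size of the precision loss
```

The page keeps neither the formula nor the full stored value, so anyone re-testing the data set starts from `0.33` rather than from the original calculation.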

However, he concedes that “there is no viable alternative to PDF as a universal digital container of everything that can be flattened to printed pages.

“PDF/A is perceived to be an archival solution for digital documents. Discussion within the community revealed the reason for that is three-fold: Firstly, it is marketed as an archival format. The A in PDF/A might stand for ‘Archive’ or ‘Archival’ or simply for the letter ‘A’; I haven’t found any official explanation for the choice of A in the acronym.

“The second reason may be that it is used by so many institutions to a point where a critical mass is reached. They cannot altogether err in their risk assessment, so the reasoning is that you simply cannot be wrong when you run with the flock. And thirdly, there does not seem to be a better alternative available …”

One of the strategies he suggests is to employ PDF/A-3, which allows the original source documents to be embedded and linked alongside the PDF so that the full text and structure are retained.

The full paper, “PDF/A considered harmful for digital preservation,” is available online.

Carl Wilson at the Open Preservation Foundation has responded to the paper with “PDF/A and Long Term Preservation.”

He notes that “PDF/A is not a standard you’d have designed with long-term preservation in mind. It’s also significantly more complex than necessary in many cases.”

However, he adds, “Regardless of opinions regarding the format, a major consideration for memory institutions with a mandate for preservation is pragmatism. Governmental and commercial organisations currently make wide use of the PDF format and there seems little prospect of that changing in the short to medium term.”

