Emails for eternity - A Primer on PDF Archiving

By Dietrich von Seggern, Callas Software

According to a recent survey conducted on behalf of the digital association Bitkom, an average of 26 emails are received per professional mailbox in Germany every day. Processing them takes up a large part of working time. In addition, emails are an integral part of business processes.

Some of them must be retained under tax law. This includes purchase orders and invoices, as well as any documents that may be relevant in connection with a business transaction. In addition, electronic messages often contain valuable knowledge that must be retained.

But how can emails be archived elegantly? To date, there is no ideal solution. However, for a number of reasons, the PDF route currently seems to be the most practical.

The good news is that emails are digital per se and already contain metadata. This makes it fundamentally easier to archive them than paper-based communications. However, in many cases, there are no company guidelines in this regard, so users decide individually how to handle their emails. As a result, there is a high risk that business-relevant messages are lost.

Emails are handled by various specialized systems that enable the creation, transport, viewing and storage of these electronic messages (lifecycle: client, server, relay, archiving system). To understand how to archive emails, we first need to take a closer look at what an email consists of: a header, a body and, optionally, attachments.

The header is basically the equivalent of the letterhead and contains the sender and recipient information, the creation date and some optional information such as the subject in the form of metadata. Often, an ID is also included here to help the email client associate it with other emails when an email sequence consists of replies and forwards.
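To make the header fields more tangible, here is a minimal sketch using the email module from the Python standard library. The sample message, its addresses and its IDs are invented for illustration; real headers usually carry many more fields.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message; all addresses and IDs are invented.
RAW_EMAIL = (
    "From: Alice Example <alice@example.com>\n"
    "To: Bob Example <bob@example.com>\n"
    "Subject: Re: Invoice 2021-07\n"
    "Date: Mon, 05 Jul 2021 09:30:00 +0200\n"
    "Message-ID: <reply-1@example.com>\n"
    "In-Reply-To: <original-1@example.com>\n"
    "References: <original-1@example.com>\n"
    "\n"
    "Thanks, the invoice has been paid.\n"
)

# policy=default yields structured header objects instead of raw strings.
msg = message_from_string(RAW_EMAIL, policy=default)

header_info = {
    "sender": msg["From"].addresses[0].addr_spec,
    "recipient": msg["To"].addresses[0].addr_spec,
    "subject": str(msg["Subject"]),
    "created": msg["Date"].datetime.isoformat(),
    # The IDs below are what a client uses to relate replies and forwards.
    "message_id": str(msg["Message-ID"]),
    "in_reply_to": str(msg["In-Reply-To"]),
}

for key, value in header_info.items():
    print(f"{key}: {value}")
```
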

In order to properly assess emails and the reliability of header information, it is important to understand that the actual routing is independent of the header data and takes place via the Simple Mail Transfer Protocol (SMTP). SMTP acts as an envelope, so to speak, and controls the routing of the electronic message.

When sending, the email client therefore issues SMTP commands to the email server and hands over the payload of the email (including the header) along with them. It is the recipient address in the SMTP commands, not the one in the header, that is decisive for the routing.
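The difference between envelope and header can be sketched in a few lines of Python. The addresses are invented, and nothing is actually sent here; the envelope values are simply the ones a client would hand to the server, for example via smtplib.SMTP.sendmail(from_addr, to_addrs, msg).

```python
from email.message import EmailMessage
from email.utils import getaddresses

# The message as the recipient will later see it (header plus body).
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Quarterly figures"
msg.set_content("Please see the figures below.")

# The envelope: sender and recipients as given in the SMTP dialogue.
# Hypothetical values; the second recipient is a blind copy and
# therefore never appears in the header.
envelope_sender = "alice@example.com"
envelope_recipients = ["bob@example.com", "auditor@example.org"]

# Collect every address that is visible in the To and Cc headers.
header_values = [str(v) for v in msg.get_all("To", []) + msg.get_all("Cc", [])]
header_recipients = [addr for _, addr in getaddresses(header_values)]

# Recipients that routing knows about but the header does not show.
only_in_envelope = sorted(set(envelope_recipients) - set(header_recipients))
print(only_in_envelope)
```

For archiving, this means that a message file alone does not necessarily reveal everyone who received it.
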

The body, i.e. the actual mail content, is displayed differently depending on the user-defined settings in the email software. The options are plain text (ASCII) without umlauts, simply formatted text (such as bold or italics) with support for country-specific encodings (umlauts), and extensive HTML formatting with embedded images, etc.

An email file can contain several of these variants at the same time, and there is no guarantee that their content matches: the variants can easily carry different text. Often, for example, the ASCII text part only contains a note that an HTML-capable email client is required for display. This is a crucial aspect for possible format conversions during archiving.
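The following sketch builds such a message with Python's standard email module and lists the body variants it contains. The subject and texts are made up; the point is only that the two variants of one and the same email need not say the same thing.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Offer"
# Plain-text variant: only a note, as described above.
msg.set_content("This message requires an HTML-capable email client.")
# HTML variant: carries the actual content.
msg.add_alternative(
    "<html><body><p>Our offer: <b>1,200 EUR</b></p></body></html>",
    subtype="html",
)

# Walk the MIME tree and collect every text part by its content type.
variants = {
    part.get_content_type(): part.get_content().strip()
    for part in msg.walk()
    if part.get_content_maintype() == "text"
}

for content_type, text in variants.items():
    print(content_type, "->", text)
```
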

The third, optional part consists of attachments. This is where the infinite field of file formats, feared by every archivist, opens up: attachments are often documents or images, possibly combined in a ZIP file, but exotic file formats, executable programs or scripts can also be included.
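As a rough illustration, an archiving step might first take stock of what is attached. The sketch below uses Python's standard email module; the file names, the placeholder contents and the list of suffixes flagged for review are assumptions made for this example, not a recommendation.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Documents"
msg.set_content("Please find the documents attached.")
# Placeholder bytes stand in for real file contents.
msg.add_attachment(b"%PDF-1.7 placeholder", maintype="application",
                   subtype="pdf", filename="invoice.pdf")
msg.add_attachment(b"PK placeholder", maintype="application",
                   subtype="zip", filename="scans.zip")
msg.add_attachment(b"echo hello", maintype="application",
                   subtype="octet-stream", filename="setup.bat")

# Assumption: suffixes an archive might want a human to look at first.
RISKY_SUFFIXES = (".exe", ".bat", ".js", ".vbs")

inventory = []
for part in msg.iter_attachments():
    name = part.get_filename()
    inventory.append({
        "filename": name,
        "content_type": part.get_content_type(),
        "size": len(part.get_content()),  # decoded size in bytes
        "needs_review": name.lower().endswith(RISKY_SUFFIXES),
    })

for entry in inventory:
    print(entry)
```
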

As already described, email is transported via SMTP, namely from the client to the server on the sender's side, then via the mail relays to the server on the recipient's side and from there to the recipient's client.

Since emails are often sent in "conversations" as replies and the complete history is not always included, it would be ideal to archive the entire mail system in order to be able to fully trace the email communication with all steps later on. In practice, this is hardly feasible. Alternatively, it would be good if at least the receiving or sending mailbox could be archived completely with all references of the emails to each other.
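How the references between emails can be followed within a single mailbox, and where the trail ends, can be sketched as follows. The two messages and their IDs are invented, and the helper thread_of is a hypothetical name; it relies only on the Message-ID and In-Reply-To header fields.

```python
from email import message_from_string
from email.policy import default

# Hypothetical mailbox: two replies, but the original message is missing.
RAW_MAILBOX = [
    "Message-ID: <b@example.com>\nIn-Reply-To: <a@example.com>\n"
    "Subject: Re: Offer\n\nSounds good.\n",
    "Message-ID: <c@example.com>\nIn-Reply-To: <b@example.com>\n"
    "Subject: Re: Re: Offer\n\nConfirmed.\n",
]

messages = [message_from_string(raw, policy=default) for raw in RAW_MAILBOX]
by_id = {str(m["Message-ID"]): m for m in messages}


def thread_of(message_id):
    """Follow In-Reply-To links back as far as the mailbox allows.

    Returns the chain (oldest first) and the ID of the first parent
    that is not in the mailbox, or None if the chain is complete.
    """
    chain = []
    current = message_id
    # The second condition guards against circular references.
    while current in by_id and current not in chain:
        chain.append(current)
        parent = by_id[current]["In-Reply-To"]
        current = str(parent) if parent is not None else None
    return list(reversed(chain)), current


chain, missing_parent = thread_of("<c@example.com>")
print("chain:", chain)
print("not in mailbox:", missing_parent)
```
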

To date, however, there is no standardized, interoperable approach to this, although there are some interesting initiatives and approaches (e.g. a report recently produced by the University of Illinois with the support of the PDF Association).

Furthermore, such an approach is problematic because the Microsoft technology most commonly used in business processes relies on its own proprietary format (MSG). Although this format is documented, it is subject to frequent changes.

Content is sometimes not even inserted into the body of the email by the programs, but sent as "Winmail.dat" attachments, which can then only be interpreted and displayed by appropriately prepared clients on the recipient side. For these reasons alone, it seems essential to convert the emails into a standard format suitable for archiving.

The problem becomes even more daunting when attachments are taken into consideration. There is virtually no limit to the file formats that may be used in attachments. It is therefore impossible to guarantee that an application will be available for years, or even decades, with which the attachments can be displayed - one of the reasons why PDF/A was developed and became established so quickly.

PDF/A for secure archiving

To break free from this dependency, system-independent archiving of all emails and attachments in PDF/A is recommended. The format has long been established for general archiving purposes.

Recently, the PDF/A-4f conformance level has become available as the successor to PDF/A-3. Like PDF/A-3, it allows any files to be embedded. On this basis, at least the question of format for email archiving can be answered satisfactorily.

Most email systems offer an export function to PDF. Unfortunately, however, this approach often falls short, because usually only the email body is taken into account and not the header or any attachments.

If emails are to be archived in PDF in their entirety, the header data should be saved as XMP metadata in the PDF file. This can then be used as the basis for a targeted search for emails.
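What such metadata could look like is sketched below in Python. The mapping used here (subject to dc:title, sender to dc:creator, sending date to dc:date from the Dublin Core schema) is an assumption made for this example, as is the function name header_to_xmp. Fields such as the recipient or the message ID would need a further schema, and writing the result into a PDF requires a PDF library; neither is shown.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{tag}"


def header_to_xmp(subject, sender, sent_iso):
    """Return XMP metadata (without the xpacket wrapper) as a string."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative (rdf:Alt) in XMP.
    title_alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(title_alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = subject

    # dc:creator and dc:date are ordered lists (rdf:Seq).
    for tag, value in (("creator", sender), ("date", sent_iso)):
        seq = ET.SubElement(ET.SubElement(desc, q("dc", tag)), q("rdf", "Seq"))
        ET.SubElement(seq, q("rdf", "li")).text = value

    return ET.tostring(xmpmeta, encoding="unicode")


xmp = header_to_xmp(
    "Re: Invoice 2021-07", "alice@example.com", "2021-07-05T09:30:00+02:00"
)
print(xmp)
```
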

The email body is ideally converted on the basis of the body branch (plain ASCII, formatted text, HTML) that most comprehensively reflects the content. Links or referenced images in HTML must then also be integrated.
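One possible heuristic for picking the body branch is to compare how much visible text each variant contains. The Python sketch below is exactly that, a heuristic under stated assumptions: the helper names visible_text and richest_branch are made up, the sample texts are invented, and a production converter would need more careful rules.

```python
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(part):
    """Return the whitespace-normalised text a reader would see."""
    content = part.get_content()
    if part.get_content_subtype() == "html":
        parser = _TextExtractor()
        parser.feed(content)
        parser.close()
        content = " ".join(parser.chunks)
    return " ".join(content.split())


def richest_branch(message):
    """Pick the text part with the most visible text (a simple heuristic)."""
    candidates = [
        part for part in message.walk()
        if part.get_content_maintype() == "text" and not part.is_attachment()
    ]
    return max(candidates, key=lambda part: len(visible_text(part)))


msg = EmailMessage()
msg.set_content("Please use an HTML-capable email client.")
msg.add_alternative(
    "<html><body><p>Dear Mr. Miller,</p>"
    "<p>we are pleased to offer 40 licences at a total of 1,200 EUR.</p>"
    "</body></html>",
    subtype="html",
)

chosen = richest_branch(msg)
print(chosen.get_content_type())
```
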

The greatest flexibility in the use of archived emails is available if the original email file in EML or MSG format and the attachments are also embedded in the PDF, which is possible with PDF/A-3 or PDF/A-4f.

But experience has shown that this is not the only reason why emails archived as PDF/A are almost always larger than the original files. Another factor is that the PDF/A standard requires the embedding of fonts or ICC profiles for colours in order to ensure the reproducibility of emails over the years.

On the other hand, file size can be minimized via compression methods built into the PDF, an option that does not exist in "email formats".
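The effect of such compression can be illustrated with the zlib module from the Python standard library, which implements the deflate method that PDF uses for its Flate-compressed streams. The sample text is invented and deliberately repetitive, so the ratio seen here says nothing about real emails.

```python
import zlib

# Invented, highly repetitive sample content.
body = ("Dear customer, please find the invoice attached.\n" * 200).encode("utf-8")

compressed = zlib.compress(body, level=9)
ratio = len(compressed) / len(body)

print(f"original:   {len(body)} bytes")
print(f"compressed: {len(compressed)} bytes ({ratio:.1%} of original)")

# Compression is lossless: the original bytes can be fully restored.
restored = zlib.decompress(compressed)
```
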

If you have made it this far in this rather long blog post, here is some more helpful information. Since callas is a PDF Association member, I will be presenting on the same topic, under the title 'Archiving email – as PDF?', at the upcoming PDF Days Europe 2021, for which you can find the full agenda here. (All sessions will also be live streamed.)

I will summarise, with practical demonstrations, exactly what needs to be done in order to include as much information as possible in email archiving and to be able to retrieve and use it in the future. If you would like to take part in this event and receive a discounted ticket (available until the end of July 2021), please write to us at info@callassoftware.com.

Originally published HERE