Does this data have business value? I don’t care!

By Andrew Sohn

Anyone who's had the responsibility to manage an end-user centred system has heard this refrain when asking people to get rid of old emails, documents, files, voice mail recordings, pictures, or any other electronically stored information (ESI). "I need it to do my job" or "It has business value and I need to save it".

I've decided not to challenge them or to ask for justification anymore. When I put on my CIO or CISO hat, I don't care if the data has business value or not. But there are many other things that I need to care about regarding this hoard of ESI. (When I put on my CDO or Data Analyst hat I care - more on that some other time.) Actually, I usually do have an opinion on the matter. But unless it's data in my domain of responsibility, I'm not in a position to really judge whether a set of data has business value to someone or some business unit. So if someone says they need to keep an email longer than the 90-day retention period or they need an exception to the 3-year timetable for Record Type X, they need to appropriately document that fact, but I'm not going to fight them on the business need.

These conversations usually come up regarding ESI in email systems, network fileshares, hard drives, document management systems, storage as a service repositories (e.g. Box), software as a service repositories (e.g. Salesforce) and collaboration systems (such as the big junk pile known as SharePoint). But they also come up in transactional and operational systems built around structured data. How many years of data should we keep in the Data Warehouse? Are transactions from 1990 on obsolete products useful? If the company's economists say so, who am I to argue that point?

While I can't argue on the value of the data, I do have the responsibility to manage and control how that data is kept. What do I care about? I care about where and how the data is stored. I care about ensuring that the risk of the data is understood, managed and minimized. I care that the data is properly governed so it can be found and eventually destroyed. I also care about the cost of processing and storing the data.

Where the data is stored

If the data being kept is deemed to be a corporate record or needs to be retained due to regulatory requirements (e.g. all electronic communications of regulated employees at a bank), then there's really no question that the data needs to be moved into an appropriately compliant system. There are well defined capabilities that need to be in place to properly retain this data. But most of the data I'm talking about are not formal records. They are project related documents or summarized transaction records in a data reporting repository.

In order to manage, search, secure and ultimately get rid of this data, the retained data needs to be stored in a capable system, albeit with much less rigour than a records management system. The system must be able to control and monitor who has access to the data. It also needs to be able to associate and maintain a minimal set of metadata with the data. Simple things like who actually owns the document, when it was created and (really) last accessed, its security classification, what project/department/account/etc. it relates to, and others.
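As a rough illustration of that minimal metadata set, here is a sketch in Python. The class name, field names and the disposal rule are all hypothetical, chosen only to mirror the list above; a real system would carry its own schema and a richer exception process.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RetainedItem:
    """Hypothetical minimal metadata kept alongside a retained document."""
    path: str
    owner: str                 # who actually owns the document
    created: date
    last_accessed: date
    classification: str        # e.g. "public", "internal", "confidential"
    related_to: str            # project, department, account, etc.
    retention_days: int = 90   # assumed default retention period
    exception_reason: Optional[str] = None  # documented business need, if any

    def is_due_for_disposal(self, today: date) -> bool:
        """True once the retention period has lapsed and no exception is documented."""
        if self.exception_reason:
            return False
        return today > self.created + timedelta(days=self.retention_days)
```

The point of the sketch is that disposal becomes a simple query once the owner, dates and any documented exception travel with the data.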

A standard Windows fileshare and out of the box SharePoint don't provide these capabilities. Emails stored within PST files on a stand-alone PC are problematic. Until recently, services like Box and Dropbox didn't support custom metadata. There are now a number of options to either move data to a better system or augment existing systems.

There are a growing number of Enterprise File Services systems that can manage and protect this data in ways that standard Windows cannot (e.g. CommVault, Varonis). Third party tools for SharePoint can help rein in the free-for-all document dumping. And some level of automated classification and categorization tools can help transparently manage the ESI at creation time. (Policies and procedures are all fine and you need to have them, but I've rarely seen them followed if there’s any extra thought or work needed from the user community).

Managing Risk

One of the critical items that needs to be determined is whether the ESI contains any sensitive or otherwise risky information. This can be a complex problem and is unique for each company. If you're going to keep data, I'm going to scan the hell out of it for compliance with PCI, competitive business information and a list of organizationally defined sensitive items. In the structured data world there are many straightforward tools available to do this, although they are not cheap, take significant compute resources and can be intrusive. For unstructured data, this is a lot more difficult. I found the best tools to do this came from the eDiscovery technology space. The same tools that can look to see if ESI is responsive to preservation requests or collection have a lot of features to perform sophisticated text analytics. This is a growing market and there are a lot of new tools and convergence happening in it (see Microsoft’s recent purchase of Equivio).
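To give a flavour of what the simplest kind of scan looks like, here is a minimal Python sketch that looks for candidate payment card numbers in free text and keeps only those that pass the Luhn checksum. The function names and the pattern are illustrative assumptions; commercial scanning and eDiscovery tools do far more than this.

```python
import re
from typing import List

# Candidate card numbers: 13 to 19 digits, optionally separated by
# single spaces or hyphens (an assumed, deliberately loose pattern).
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return digit strings in the text that look like valid card numbers."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

A checksum filter like this cuts down false positives from order numbers and phone numbers, but it only finds one narrow class of sensitive item. Competitive information and organization-specific terms need the heavier text analytics described above.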

Once sensitive data is found there are a few courses of action. This is when I may challenge a user on whether they really still want to keep this data given its level of risk. Assuming the data is still necessary and has business value, then it must be properly protected. Encryption is always a good option, but it's not enough since at some time the data needs to be unencrypted to be used. Data masking, anonymisation, and other techniques can be used to "neuter" the data. This is especially relevant for structured data found in data analysis repositories where trends are being analysed and specific information on individual parties is not important.
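Two of the simplest "neutering" techniques can be sketched in a few lines of Python. The function names and the 16-character truncation are hypothetical choices, and keyed hashing is strictly pseudonymisation rather than full anonymisation, since anyone holding the key can recompute the tokens.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash.

    The same input and key always give the same token, so records can
    still be grouped for trend analysis without exposing the identifier.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; an assumed choice


def mask_card(pan: str) -> str:
    """Mask all but the last four digits of a card number."""
    return "*" * (len(pan) - 4) + pan[-4:]
```

For example, mask_card("4111111111111111") returns "************1111", which is enough for a reporting view while keeping the full number out of the analysis repository.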

Managing the Cost

You commonly hear the phrase "storage is cheap". Well, compared to a few years ago, some storage infrastructure is certainly cheaper. But there is also a greater array of storage technologies, some of which have quite an expensive unit cost. So, even if a unit of commodity unmanaged storage is cheap, the total cost for storage media and management is not, and it keeps growing.

In the past few years I've been in many budget discussions talking about the need for tens of millions of capital dollars for new Windows filers, additional SharePoint storage and increased infrastructure for a multitude of document management systems. There are many strategies for reducing the total cost of storage. These include technologies like deduplication, compression, and tiered information lifecycle management. They also include understanding the probable use patterns of the data and placing it on the appropriate infrastructure. High speed, highly available replicated media should be limited to special use cases. The use of virtual tape and other archiving technologies should be maximized for the rarely accessed data and those CYA documents. Even in the storage as a service world, companies like Amazon have different tiers of products (e.g. Glacier) and prices depending on the requirements. So, I'm not just being nosy when I ask how you are planning on using the data and how that changes over time; I'm doing my duty to control some very significant costs.
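The placement decision can be pictured as a simple rule keyed on how recently the data was used. The Python sketch below is a simplification with hypothetical thresholds and tier names; real tiering policies would also weigh availability requirements, retrieval times and actual prices.

```python
from datetime import date

# Hypothetical thresholds, in days since last access, checked in order.
TIERS = [
    (30, "replicated high-speed storage"),
    (365, "commodity disk"),
]
ARCHIVE_TIER = "archive (virtual tape or a cold cloud tier)"


def choose_tier(last_accessed: date, today: date) -> str:
    """Pick a storage tier from how long the data has sat untouched."""
    idle_days = (today - last_accessed).days
    for limit, tier in TIERS:
        if idle_days <= limit:
            return tier
    return ARCHIVE_TIER
```

This is why the question about expected use matters: the answer decides which row of a table like this the data lands in, and therefore what it costs to keep.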

In conclusion, if you say you have a business reason to keep some documents or data, I won't argue with you. In return, I don't expect you to argue with me on how that data is kept.

Andrew Sohn is an experienced Information Technology Executive, Information & Enterprise Architect