Preventing Information Chaos with Unstructured Data

At the recent Chief Data Officer Summit in Melbourne, Veritas Senior Vice President Greg Muscarella presented a session on “The Importance of Gaining Insight into Your Unstructured Data”. Afterwards, IDM asked Greg to outline some of his recommendations. Greg leads Product Management and Engineering for the Veritas Information Intelligence team, where he is responsible for products such as Data Insight, Information Map, Enterprise Vault and Enterprise, and the eDiscovery Platform powered by Clearwell.

IDM: You highlight the fact that organisations need to analyse and manage unstructured data rather than just buying more storage. Is that still a message you think people need to receive? Are there still some who don’t understand the danger and the risk of unmanaged data?

GM: The No. 1 problem that most organizations have is that they have no idea what information they have, what it is worth, or why they keep it. But they will keep it. Forever. This is mainly because the information you do need to keep for regulatory or other reasons is all in the same pile with everything else. Unfortunately, since it all looks exactly the same, the IT department, which is usually left trying to care for it, basically just freezes up and says, “Okay, I’m going to keep everything forever.” The lack of differentiation amongst the data prevents managers from making decisions about their information and taking action. In fact, if organizations are investing in data cleanup, they are spending the majority of their time focusing on structured data. With disciplines like Master Data Management and other data quality initiatives taking focus at organizations, there has been much more overall investment targeting the management of structured information, which is important, but it seems to have left organisations with a bit of a gap in terms of what they have been doing with all of that unstructured data.

IDM: Whose job should it be to fix this?

GM: In many organizations, acquiring authorization to dispose of data, or even to move data to less expensive storage, is nearly impossible. Within many organisations, no one claims ownership of unstructured data. So essentially IT is left holding the bag, because at the end of the day they run the systems that have to store all this stuff. But they don’t get to decide whether it’s being kept or not. Every organisation has a different take on things, but often the lawyers think that things need to be retained. In some companies they take a harsher line and say, you know, “We need to dispose of stuff.” But then you have lines of business or the business units which say, “We want to hold onto this stuff forever.” The IT guys would love to delete things or at least archive them, but they can’t get the lawyers, the lines of business, the records managers and others to agree. That’s where you end up in this default space of just retaining everything forever.

IDM: Do you think you’ll ever be able to get rid of that line of thinking? 

GM: I’m hopeful that if we can start to show that machine learning actually works, as we have in the legal space, we can apply those insights to retention. It’s never going to be 100% accurate, but if we can get to the point where it passes the bar, then we will see organizations actually starting to delete things. In the short term people may not be confident enough to delete data, and will instead put it in an archive. And then maybe in two, three, five years’ time, whatever it is, when no one’s accessed it we might say, “Okay, are we ready yet? Can we hit the delete button now?” But the pain will continue to grow because, even though storage priced on a per-gigabyte basis is getting much, much cheaper, organizations will start requiring petabytes and petabytes of it, which is still a very significant cost. Gartner estimates that for every petabyte of information it costs $5 million a year just to keep it plugged in and available. We have customers with 50 petabytes and above, and that’s growing at 20, 30, 40% per year, so costs go up dramatically from there.
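To make the arithmetic in that answer concrete, here is a rough sketch in Python of how the annual bill compounds. It uses only the figures quoted above ($5 million per petabyte per year, a 50-petabyte estate, 20 to 40% annual growth). The five-year horizon, the assumption that the per-petabyte cost stays flat, and the function name project_storage_cost are illustrative assumptions, not figures from Veritas or Gartner.

```python
def project_storage_cost(start_pb, annual_growth, cost_per_pb, years):
    """Return a list of (year, petabytes, annual_cost) tuples.

    Assumes simple compound growth and a flat cost per petabyte,
    both of which are simplifications made for illustration.
    """
    rows = []
    pb = start_pb
    for year in range(years + 1):
        rows.append((year, pb, pb * cost_per_pb))
        pb *= 1 + annual_growth
    return rows


# USD per petabyte per year, the estimate quoted in the interview
COST_PER_PB = 5_000_000

if __name__ == "__main__":
    for growth in (0.20, 0.30, 0.40):
        # Look only at the last row: the position after the full horizon
        year, pb, cost = project_storage_cost(50, growth, COST_PER_PB, 5)[-1]
        print(f"{growth:.0%} growth: {pb:.0f} PB, "
              f"${cost / 1e6:,.0f}M per year after {year} years")
```

The starting point is 50 petabytes at $5 million each, or $250 million a year, and each year of growth multiplies that figure by 1.2 to 1.4.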

IDM: The IP that you use within the eDiscovery Platform to do predictive coding, are you offering that as a product for automated classification of unstructured data? 

GM: Right now we’re using the learnings from the application of predictive coding within the eDiscovery Platform and applying them to data classification, or really what I would call retention management, the automated decision-making around retention. We’re also working with the Stanford University Statistics Department to really develop our machine learning methodology and to evaluate the algorithms we employ. So it’s both the machine learning to make the decision as well as the statistics to see how well you’re making decisions.

IDM: What are the key steps to get more visibility into this problem?

GM: Firstly, it’s critical to determine what you have by developing a holistic view of your organisation’s storage repositories. Next, you need to ensure everybody knows the rules, so create a policy guide that summarizes all existing information policies. Ultimately, achieving positive outcomes within the first 90 days is critical, so prioritise low-hanging fruit: determine which projects will yield quick wins and develop a project plan (e.g., PST remediation, stale file clean-up).
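As an illustration of what a first pass at stale file clean-up might look like, the Python sketch below walks a directory tree and reports files that have not been modified for a given number of days. It is a minimal, report-only example and not how Data Insight or any Veritas product works. The function name find_stale_files, the one-year threshold and the example path are hypothetical, and it uses last-modified time as a stand-in for "stale" because last-access times are often disabled or unreliable on file servers.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Yield (path, size_bytes, age_days) for files older than max_age_days.

    "Older" here means last modified before the cutoff; this is an
    assumption made for illustration, not a retention policy.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # skip files that vanish or cannot be read
            if info.st_mtime < cutoff:
                yield path, info.st_size, (now - info.st_mtime) / 86400


if __name__ == "__main__":
    # Hypothetical share path; nothing is deleted, the scan only reports.
    total_bytes = 0
    for path, size, age in find_stale_files("/srv/shares/finance", 365):
        total_bytes += size
        kind = "PST" if path.suffix.lower() == ".pst" else "file"
        print(f"{kind}\t{age:.0f} days\t{size} bytes\t{path}")
    print(f"Stale data found: {total_bytes / 1e9:.2f} GB")
```

A report like this only supports the first step, determining what you have. Whether anything is archived or deleted remains a policy decision for the legal, records and business owners discussed earlier.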