Data profiling to rein in compliance costs

Index Engines, an enterprise data management and archiving company, is looking to provide a cost- and time-effective solution to the Big Data storage crisis with its latest release, the Catalyst Data Profiling Engine.

The Catalyst Data Profiling Engine processes all forms of unstructured files and document types, creating a searchable index of what exists, where it is located, who owns it, when it was last accessed and what key terms are in it. High-level summary reports give instant insight into enterprise storage, providing previously unavailable knowledge of data assets.

Through this process, mystery data can be managed and classified, including content that has outlived its business value or that is owned by ex-employees and now sits abandoned on the network.

Data profiling relies on an enterprise-class index of metadata from user files and email databases, such as last modified or accessed time, number of duplicates, size, owner, location, file type, and more. Using summary reports combined with filters, users can view content on specific servers or locations, and see a chart of top owners by capacity, age of data, files by type and much more.
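To make the idea of a metadata-only profile concrete, here is a minimal sketch in Python. It is not Index Engines' implementation: the function names `profile_tree` and `capacity_by` are hypothetical, and it assumes a local file tree rather than network shares or email databases. It gathers the kind of fields listed above without opening any file, then totals capacity by a chosen field.

```python
import os
from collections import Counter
from pathlib import Path

def profile_tree(root):
    """Walk root and collect one metadata record per file (no content is read).

    Hypothetical helper for illustration only.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # unreadable entry or broken link: skip it
            records.append({
                "path": str(path),
                "size": st.st_size,
                "owner": st.st_uid,        # numeric owner id (always 0 on Windows)
                "modified": st.st_mtime,
                "accessed": st.st_atime,
                "type": path.suffix.lower() or "(none)",
            })
    return records

def capacity_by(records, key):
    """Sum file sizes grouped by a metadata field, largest total first."""
    totals = Counter()
    for rec in records:
        totals[rec[key]] += rec["size"]
    return totals.most_common()
```

Calling `capacity_by(profile_tree("/some/share"), "owner")` would give a rough "top owners by capacity" list; grouping by `"type"` gives capacity by file type.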

Optionally, data profiling can look beyond metadata and go deep within documents and email, finding content to support keyword searches, locate confidential information, or run compliance assurance audits for sensitive content misplaced behind the firewall in PSTs or on the wrong server.
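As a rough illustration of what a content-level scan for sensitive data can look like, the Python sketch below flags strings shaped like US Social Security numbers and payment card numbers. The patterns and the names `find_sensitive` and `luhn_ok` are assumptions made for this example, not the product's detection rules; the card check uses the standard Luhn checksum to cut down on false positives.

```python
import re

# Illustrative patterns only; a real scanner would need far more care.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators

def luhn_ok(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_sensitive(text):
    """Return (kind, match) pairs for SSN-like and card-like strings in text."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):
            hits.append(("card", m.group()))
    return hits
```

A scan of this kind would run over text extracted from documents and email, which is the step that goes beyond a metadata-only profile.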

Once the data is located, it can be remediated, archived or even moved to a different storage platform. Organisations are finding that significant capacity can be reclaimed by purging data that has no business or legal value, including ex-employees' files, duplicates, and content that is abandoned and has not been accessed in more than 7 years. Besides the ability to reclaim storage capacity and reduce the annual storage budget, data profiling supports proactive compliance, security and risk management.

“On a very granular level, you can search for Social Security and credit card numbers,” Index Engines vice president Jim McGann said. 

“But the biggest use case is likely going to be showing legal and compliance what information exists and getting the ball rolling on managing data and putting an information governance or data retention policy in place.”

The Catalyst Data Profiling Engine is designed for large enterprise-class environments, allowing organisations to uncover and analyse unstructured and mystery data. The index it creates has a footprint of only 1 percent, which results in extreme scalability.

From there, the indexing engine, version 5.0, allows action to take place on the data. Features include: 

- Deletion with Validation – Manage the defensible deletion of unstructured data using validation to ensure the content has not changed since it was profiled. Validation checks the modified date or optionally the signature of the document.

- Defensible Audit Logs – As disposition of the data is performed, including deletion, logs will be maintained that detail the date and disposition of the document, including the user that executed the disposition.

- Expanded Duplicate Reports – Summary reports include duplicates by file type, owner, age, location and more. These reports allow for deeper profiling of redundant content.

- Report Scheduling and History – Stored reports can be scheduled to run on a periodic basis and the results can be stored in order to access a historical perspective of the data environment. This allows a view into the data center including the incremental change of the content based on historical reports.

- Increased Capacity – This version breaks the 1PB barrier, supporting metadata profiling of up to 1PB of unstructured data using a single engine. This unprecedented scale and efficiency are unmatched in the market and allow enterprise-class data profiles to be achieved.
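The deletion-with-validation and audit-log features described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the product's code: `snapshot` and `delete_if_unchanged` are hypothetical names, SHA-256 stands in for whatever "signature" the engine uses, and the audit log is a plain in-memory list.

```python
import hashlib
import os
from datetime import datetime, timezone

def signature(path):
    """SHA-256 of the file contents (a stand-in for the document signature)."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(path, with_signature=False):
    """Record what the file looked like when it was profiled."""
    rec = {"path": str(path), "mtime_ns": os.stat(path).st_mtime_ns}
    if with_signature:
        rec["sha256"] = signature(path)
    return rec

def delete_if_unchanged(rec, user, audit_log):
    """Delete the file only if it still matches its profile; log either way."""
    path = rec["path"]
    unchanged = os.path.exists(path) and os.stat(path).st_mtime_ns == rec["mtime_ns"]
    if unchanged and "sha256" in rec:
        unchanged = signature(path) == rec["sha256"]
    if unchanged:
        os.remove(path)
    audit_log.append({
        "path": path,
        "disposition": "deleted" if unchanged else "skipped",
        "user": user,
        "when": datetime.now(timezone.utc).isoformat(),
    })
    return unchanged
```

Checking the modified time is cheap, while the optional signature comparison catches changes that leave the timestamp intact, at the cost of re-reading the file.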

Data profiling starts at $US1,000 a terabyte and is deployable through VMware and hardware.

“With a few clicks of a mouse you can find data on your network servers that have not been accessed in five, 10 years, who it belongs to and where it lives,” Jim McGann said. “From there it can be moved to cheaper storage, archived for compliance or purged from the system.”