Avoiding the Data Swamp: How do we govern big data?

By John Parkinson

Have you encountered problems governing new types of data? When talking about new kinds of data, I would suggest that most readers will think primarily of big data, both structured and, particularly in this case, unstructured.

Whilst it does not create “problems” as such, governing big data does bring new challenges to the organisation. For example, data may be brought in to build out information sources with no regard for its quality.

I read an interesting article on the Gartner blog recently which made the excellent point that many data lakes have no barrier to entry. This is clearly not the right approach: introducing poor-quality data into the enterprise actively encourages the very situation most organisations try to avoid.

Just because the platform is Hadoop and the project contains the words “Big Data” does not mean that common sense should be ignored. The result of such unrestricted behaviour is less a data lake than a data swamp, with poor visibility of the data sitting beneath the murky waters, and an unknown number of data quality issues that may come back and bite you later.

An interesting point to consider here is the big data lifecycle. A good colleague of mine, Ralf Teschner, Global Head of Data Governance for Capgemini, makes the point in his blog that the more mature an organisation’s information management, the closer to the start of the lifecycle governance is performed. Importing and using big data often pushes management towards the middle and end of the lifecycle, which makes it harder to anticipate and rectify problems: typically, the first time data is seen to be of poor quality or unsuitable is when someone tries to use it.

So how do we govern big data? Data governance can be defined as the expression of authority over data management and its alignment with the enterprise strategy, governance and risk model. Given that definition, let us start by deconstructing data management (I'm using the DAMA framework here) into its components:

  • Data architecture management.
  • Data operations management.
  • Data development.
  • Data security management.
  • Metadata management.
  • Data quality management.
  • Reference and master data management.
  • Document and content management.

On deconstruction we can see that large parts of the scope of data management apply perfectly well to big data. Whilst creating a detailed data model for big data may be outside the immediate scope, it is still possible to understand the high-level data architecture.

Using the DAMA framework as a template, the activities to be covered in architecture management include defining enterprise information needs, and defining and maintaining the data technology architecture, the data integration architecture, and the data warehousing and BI architecture, all of which are still perfectly possible within a big data solution.

Equally, within metadata management, it is still possible to define metadata standards, to create and integrate metadata, and to manage metadata repositories. It is still possible to list data sources in a catalogue, to classify data, and to record some form of lineage, even if that is just the website from which the data was sourced. It is still possible to implement security and access policies. In fact, I am aware of vendors that offer solutions to facilitate metadata management, lineage and security within a big data platform.
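
To make this concrete, here is a minimal sketch in Python of the kind of catalogue entry such a solution might capture on ingestion. The field names, dataset and URL are my own illustration, not drawn from any specific vendor's product:

    from dataclasses import dataclass, field
    from datetime import date

    # An illustrative catalogue entry; field names are hypothetical,
    # and real metadata tools define far richer schemas than this.
    @dataclass
    class CatalogueEntry:
        dataset_name: str
        source: str                 # lineage, even if just a source URL
        classification: str         # e.g. "public", "internal", "restricted"
        owner: str                  # who is accountable for the data
        ingested_on: date
        quality_checked: bool = False
        tags: list[str] = field(default_factory=list)

    # Registering a web-sourced dataset as it enters the lake gives
    # later users at least basic lineage and a point of accountability.
    entry = CatalogueEntry(
        dataset_name="postcode_reference",         # hypothetical dataset
        source="https://example.com/open-data",    # placeholder URL
        classification="public",
        owner="data-governance-team",
        ingested_on=date.today(),
        tags=["reference", "geography"],
    )

Even this much metadata, recorded at the point of entry, keeps governance near the start of the lifecycle rather than the end.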

For data quality management, it is possible to define data quality requirements, define business rules, and act to remediate data where necessary. The same applies to each area of data management. In short, big data governance should follow the same pattern as governance of small data: orchestrate data management so that you know where the data is, where it comes from and where it is going, and manage it well.
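
Again as an illustration only (the rules, field names and thresholds below are my assumptions, not from the article), a business rule can be as simple as a predicate applied on ingestion, with failing records quarantined for remediation rather than dropped silently into the lake:

    # A minimal sketch of rule-based quality checking, assuming
    # records arrive as dictionaries; the rules themselves are
    # illustrative, not prescriptive.
    def has_customer_id(record: dict) -> bool:
        return bool(record.get("customer_id"))

    def plausible_age(record: dict) -> bool:
        age = record.get("age")
        return isinstance(age, int) and 0 <= age <= 120

    RULES = [has_customer_id, plausible_age]

    def triage(records: list[dict]) -> tuple[list[dict], list[dict]]:
        """Split records into those passing every rule and those
        routed to a quarantine area for remediation."""
        clean, quarantine = [], []
        for record in records:
            if all(rule(record) for rule in RULES):
                clean.append(record)
            else:
                quarantine.append(record)
        return clean, quarantine

    clean, quarantine = triage([
        {"customer_id": "C001", "age": 42},   # passes both rules
        {"customer_id": "", "age": 230},      # fails both rules
    ])

The point is not the code but the placement: the check happens before anyone tries to use the data, which is exactly where mature information management puts it.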

However, trying to manage the full volume of big data in the same way as smaller, traditional data sources can be a waste of time for data management professionals unless they keep the business relevance of the data constantly in mind. In reality, very little of what sits in vast big data stores will ever be used by the organisation, and there is no point spending scarce resources managing data that will never actually be used. The data that does support the business and its processes, however, should be managed well.

The data lake must not become a swamp populated with data of unknown quality, murky descent and dubious lineage that will come back and bite you later. That is not only risky in and of itself; it breeds distrust of all data in the enterprise.

John Parkinson is UK Lead - FS Data Governance for Insights & Data at Capgemini.