The Dangers of Data Mining

By Keith Power

According to the recently released second annual Teradata report on enterprise decision-making, executives are having to make more complex decisions in less time, while at the same time they are being flooded with ever-increasing amounts of data.

One of the principal pitfalls in data mining is poor-quality, or "dirty", data. As Graeme Simsion, data management guru and author of Data Modeling Essentials, puts it: "If there's garbage in the database, you don't want to be making decisions based on it."

In addition, operational data that is used to support day-to-day decision making, such as an address or an account balance, is likely to be of higher quality than data that has been captured for management information purposes, such as a customer's occupation, as the former is much more likely to have been verified. However, it is the latter that data miners tend to get excited about, Simsion says.

Simsion sees two opposite risks in data mining. One is in overvaluing the outcome. A company can undertake a lot of expensive data mining work but the information that is provided to managers as a result won't actually make, and never would have made, much difference to the business, he says.

"One of the mistakes about data mining is to think that managers are mechanical people who make decisions based largely on structured data. I think that's a very naive view of the way that managers work. I think managers work on an enormous amount of soft and unstructured data.

"My observation would be that far more often people are not finding information that they make real decisions on, and there's no point spending money on data mining unless you expect to make decisions based on it," Simsion explains.

The opposite risk is that managers are going to make decisions based on what they're given, and they're given the wrong information due to poor quality data. If data mining works at all there's a risk of getting it wrong, Simsion says.

Mining dirty data can also produce meaningless patterns, and you can start finding correlations that don't truly exist. Coincidences do happen too, such as in the famous "storks and babies" case in the early 1900s, when one researcher found a rather high correlation between the number of storks sighted and the number of births occurring over a period of time.

According to Jim Kashner, chief technology officer, Teradata Data Mining Laboratory, if you unintentionally look for spurious results in data mining, you are almost assured of unintentionally finding them. Once data miners start seeing a pattern they should bring in appropriate expertise in the subject matter, he counsels.

Simsion concurs: "If you are looking for correlations you need to really understand statistics and go in with a scientific approach and the appropriate mathematical tools. Otherwise you are in danger of presenting conclusions to managers that are about accidental things and which make assumptions about cause and effect and are just not statistically significant.

"To be an effective data miner you have to understand the business, the data and statistics. Those are three quite different skills, and you have to bring them all together in a creative way."
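The warning about accidental correlations can be illustrated with a small simulation. The sketch below is an illustration only, not a method described by Simsion or Kashner: it fills a hypothetical table with pure random noise and counts how many pairs of columns nevertheless look "strongly" correlated. The column count, row count, threshold and function names are all assumptions chosen for the example.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_chance_correlations(n_columns=40, n_rows=20, threshold=0.5, seed=0):
    """Count pairs of unrelated noise columns whose |r| exceeds the threshold.

    All parameter values are illustrative assumptions, not figures from the article.
    """
    rng = random.Random(seed)
    # Every column is independent noise, so any correlation found is coincidence.
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)]
    pairs = list(combinations(range(n_columns), 2))
    strong = [
        (i, j) for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > threshold
    ]
    return len(strong), len(pairs)


if __name__ == "__main__":
    strong, total = count_chance_correlations()
    print(f"{strong} of {total} unrelated column pairs have |r| > 0.5")
```

Because 40 columns yield 780 possible pairs, some pairs will clear the threshold by chance alone. That is why a pattern found by scanning many variables needs a significance test that accounts for the number of comparisons, along with review by someone who knows the subject matter.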

Teradata also places a lot of emphasis on having the right team for data mining to succeed. In its Data Mining Primer for the Data Warehouse Professional white paper, it states that data mining projects must be a collaborative effort driven by business experts, developed by analytic modellers and supported by IT. According to Ariene Zaima, Teradata's data mining marketing manager, the necessary expertise and combination of skills are rare in just one person.

Simsion also believes that fishing around looking for nothing in particular is an extremely expensive process, and that it's a somewhat mythical or even romantic notion that data miners find nuggets of gold by chance. Rather, he advocates that data mining should start with hard questions, such as in the famous nappies and beer case: Are there any products that people tend to buy together? And as important as data quality is, it is also essential to understand the data and have sound definitions of it, something which is often overlooked, he says.

"If somebody doesn't understand what a code means in the data and misinterprets it, obviously you're going to get wrong stuff out of it. In many databases, though, particularly older ones, data hasn't been properly documented, and only the applications programmer really knows what it means. It's very easy for someone to assume a code 'C' means 'closed' or whatever and in fact it's something more subtle or complicated than that," he says.
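One way to guard against the misread-code problem is to decode values only through a documented data dictionary and to stop when an undocumented code turns up, rather than guessing. The following is a minimal sketch under stated assumptions: the status codes, their meanings and the function names are hypothetical and are not taken from any real system mentioned in the article.

```python
from collections import Counter

# Hypothetical data dictionary for an account-status column (assumed for illustration).
STATUS_CODES = {
    "A": "active",
    "C": "cancelled by customer",  # not simply "closed", as a newcomer might assume
    "X": "closed by the bank",
}


def undocumented_codes(values, dictionary):
    """Return a count of each value that has no documented meaning."""
    return Counter(v for v in values if v not in dictionary)


def decode(values, dictionary):
    """Translate codes to their documented meanings, refusing to guess at unknown ones."""
    unknown = undocumented_codes(values, dictionary)
    if unknown:
        raise ValueError(f"undocumented codes: {dict(unknown)}")
    return [dictionary[v] for v in values]


if __name__ == "__main__":
    print(decode(["A", "C", "X"], STATUS_CODES))
    print(undocumented_codes(["A", "Z", "Z"], STATUS_CODES))
```

Failing loudly on an unknown code forces someone to ask the person who knows what it means before the value reaches an analysis.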

Predictably perhaps, Teradata claims that the right technology is also necessary for data mining to succeed, and that a data warehouse provides the right foundation. "As a data warehouse matures to support full and rigorous analysis of data, the likelihood of spurious results drops proportionately," Kashner says.

Simsion, though, doesn't think a data warehouse is crucial; rather, its usefulness depends on what type of data mining you're doing. If you're interested in information that relates to a particular service or area of the business, then data mining on the individual application database that supports it is fine, he says.

However, if you're at a very strategic level of the organisation, you might need a data warehouse, as it pulls together data from multiple application databases and often carries a degree of historical data that operational databases don't. For example, a retail organisation isn't going to keep details of every single transaction in its operational system; they're going to be in its data warehouse.

Kashner also concedes that some vendors oversimplify data mining and that at the end of the day, the best data mining tool is the one between your ears.
