Making Sense of Unstructured Data with Text Analytics

By Brion Scheidel

In case you hadn’t noticed, the amount of data in the world is increasing at an exponential rate. For example, every minute there are nearly 4.2 million posts uploaded to Facebook, nearly 3 million tweets, and thousands of responses to open-ended survey questions.

I’m often asked about all this unstructured data and how companies can make sense of it. The short answer: categorisation using text analytics. Inevitably, the follow-up question is, “What’s the best approach to implementing categorisation using text analytics?” In this article I’m going to dive deeper into these two questions and their answers. My hope is that you’ll be able to see how text analytics can help you make sense of your company’s data.

How do companies make sense of all this unstructured data?

One main way is categorisation using text analytics. Categorisation is the process of defining a set of categories and then setting up a system to assign those categories to comments. Categorisation also typically involves deciding what sentiment to associate with each category assignment (e.g., Negative, Neutral, Positive). Once we have assigned categories and associated sentiments, we can start quantifying the results and displaying them in charts and reports alongside satisfaction and recommendation scores.

An airline, for example, might have a category set that would include categories such as:

  • Food Quality
  • Food Choice
  • Food Presentation
  • Food Freshness
  • Food General
  • Special Meals
  • Drinks
  • Alcoholic Drinks
  • Coffee
  • Tea

And that’s just the set of Food and Beverage categories. They would probably also have categories related to their customers’ experience at the airport, experience on the flight, and experience with customer service.

With categories and sentiment assigned, we can start analysing charts and querying data to answer specific questions such as “Which categories have the most assignments?”, “Which categories have the worst sentiment?”, “Which categories have downward-trending sentiment?”, etc.
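As a minimal sketch of that kind of querying, assume the categorised output is available as simple (comment, category, sentiment) rows. The data, the numeric sentiment coding, and the function names below are all hypothetical, chosen only for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical categorised output: (comment_id, category, sentiment) rows.
# Assumed coding: -1 = Negative, 0 = Neutral, +1 = Positive.
assignments = [
    (1, "Food Quality", -1),
    (1, "Coffee", 1),
    (2, "Food Quality", -1),
    (3, "Special Meals", 0),
    (4, "Food Quality", 1),
    (4, "Coffee", 1),
    (5, "Tea", -1),
]


def assignment_counts(rows):
    """Count how many assignments each category received."""
    return Counter(category for _, category, _ in rows)


def mean_sentiment(rows):
    """Average the sentiment scores per category."""
    scores = defaultdict(list)
    for _, category, sentiment in rows:
        scores[category].append(sentiment)
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


counts = assignment_counts(assignments)
sentiment = mean_sentiment(assignments)

# "Which categories have the most assignments?"
most_assigned = counts.most_common(1)[0][0]
# "Which categories have the worst sentiment?"
worst_sentiment = min(sentiment, key=sentiment.get)

print(most_assigned, worst_sentiment)
```

Trend questions would need a date on each row as well; that is left out here to keep the sketch short.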

What’s the best approach to implementing categorisation using text analytics?

This one is not so easy to answer. I’ll break my answers down into “How to implement categorisation” and “Who should implement categorisation”.

There are two main options for the methodology (the “how”) behind categorisation: machine learning and rules-based.

Machine learning - Systems like IBM Watson use machine learning to do automated categorisation. This means one must first define a set of categories, manually assign those categories to a training set of comments, and then feed that information to the machine learning algorithm. If all goes well, the algorithm learns the right patterns from the comments it was trained with. Using what it learned from the training set, the machine-learning-based system will then assign categories to the comments it processes. One benefit of this method of categorisation is that no special skills are required. Anyone can set it up. One drawback to machine learning is that it is difficult to get good results with larger category sets (i.e., category sets with more than a handful of categories).
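To make the train-then-assign workflow concrete, here is a toy sketch. It is not IBM Watson’s API or any vendor’s method; it is a small hand-written naive Bayes classifier (a common textbook technique swapped in for illustration), and the class name, training comments, and categories are all hypothetical:

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesCategoriser:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labelled_comments):
        # labelled_comments: (comment text, manually assigned category) pairs.
        self.word_counts = defaultdict(Counter)
        self.category_counts = Counter()
        self.vocabulary = set()
        for text, category in labelled_comments:
            tokens = tokenize(text)
            self.category_counts[category] += 1
            self.word_counts[category].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.category_counts.values())
        best_category, best_score = None, -math.inf
        for category, n in self.category_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.word_counts[category].values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[category][token] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical manually coded training set (far too small for real use).
training = [
    ("The coffee was cold and bitter", "Coffee"),
    ("Loved the fresh coffee on board", "Coffee"),
    ("My vegetarian meal was missing", "Special Meals"),
    ("They forgot the gluten free meal I ordered", "Special Meals"),
]

model = NaiveBayesCategoriser().fit(training)
print(model.predict("The coffee tasted burnt"))
print(model.predict("No vegetarian meal was available"))
```

Note that this sketch assigns exactly one category per comment, whereas real comments often touch several categories. Each additional category also needs its own manually coded examples.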

Rules-based - Another way to perform automated categorisation is through a rules-based approach. Using a combination of linguistic and logic skills, text analysts define a set of categories, then manually create rules for those categories. These rules combine the experience and expertise of the text analysts with the natural language processing (NLP) functions and processing power provided by a rules-based text analytics engine. Using those rules, the engine will then assign categories to the comments it processes. This method allows you to get good results, even with larger category sets.
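A heavily simplified sketch of the idea follows. Commercial rules engines offer far richer linguistic functions than this; here plain regular expressions stand in for the rules, and every pattern, word list, and function name is a hypothetical example rather than any product’s rule syntax:

```python
import re

# Hypothetical category rules: one regular expression per category.
CATEGORY_RULES = {
    "Coffee": re.compile(r"\b(coffee|espresso|latte)\b", re.IGNORECASE),
    "Tea": re.compile(r"\btea\b", re.IGNORECASE),
    "Special Meals": re.compile(
        r"\b(vegetarian|vegan|kosher|halal|gluten[- ]free)\b", re.IGNORECASE
    ),
}

# Hypothetical sentiment word lists.
POSITIVE = re.compile(r"\b(great|good|excellent|delicious|loved)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(cold|bad|terrible|stale|missing|awful)\b", re.IGNORECASE)


def sentiment(text):
    """Compare positive and negative word counts for the whole comment."""
    pos = len(POSITIVE.findall(text))
    neg = len(NEGATIVE.findall(text))
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Neutral"


def categorise(comment):
    """Return (category, sentiment) pairs for every rule the comment matches."""
    label = sentiment(comment)
    return [(cat, label) for cat, rule in CATEGORY_RULES.items() if rule.search(comment)]


print(categorise("The coffee was cold"))
print(categorise("Loved the gluten-free option"))
```

One deliberate simplification: sentiment is scored once per comment and reused for every matched category, so a comment that praises one thing and criticises another is handled poorly. Handling such cases is where the analysts’ linguistic rules earn their keep.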

Manual Categorisation - There’s actually a third option called manual categorisation which can be used instead of machine learning or rules, or in addition to them. This is the old school approach of having people read each comment and decide which categories the comment should be assigned to, and with what sentiment. While admittedly low-tech, this is actually a reasonable, cost-effective approach for small volumes and less common languages.

Once you’ve decided whether to use machine learning, rules-based, or manual categorisation, or even some combination, the next question to answer is “Who is going to implement and maintain this?” Your choices are essentially: pay someone else to do it (service-based) or do it yourself.

Service-based - With a service-based approach, you select which text analytics provider best suits your needs and you pay them to categorise your comments. With this approach, text analysts leverage years of experience on your behalf. While they do the heavy lifting, you can concentrate on analysing and interpreting the text analytics results. Should you need to add a new category or tweak the logic or training behind a category, however, you need to rely on your text analytics provider for updates. This can take time if your provider isn’t responsive. This is one reason some companies choose to do text analytics themselves.

Do it Yourself - With a DIY approach, you select which text analytics tool best suits your needs. Often the tool will come with an off-the-shelf set of categories for your sector, but this will typically get you only 70 percent of what you need. To get good results with a DIY approach, you need to be prepared to invest time and effort to configure the category and sentiment algorithms and maintain them over time.

Here is a rough idea of what DIYers can expect to invest:

  • Text Analytics Software (US$25k to US$200k annually)
  • Rules-based: Text Analysts to implement category set (200 to 300 hours per language); Text Analysts to audit and maintain category set (100 to 200 hours annually per language)
  • Machine learning-based: Manual coders to create training sets (about $1.00 per comment; typically need at least 100 comments per category)
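The figures above can be turned into a rough back-of-the-envelope estimate. The sketch below uses only the numbers quoted in the list; the function names are hypothetical, and it assumes each coded comment counts toward exactly one category, which real training sets will not strictly follow:

```python
def training_set_cost(num_categories, comments_per_category=100, cost_per_comment=1.00):
    """Minimum manual-coding cost for a machine learning training set."""
    return num_categories * comments_per_category * cost_per_comment


def rules_based_hours(num_languages):
    """Analyst-hour ranges (low, high) for a rules-based category set."""
    return {
        "setup": (200 * num_languages, 300 * num_languages),
        "annual_maintenance": (100 * num_languages, 200 * num_languages),
    }


# Example: a 50-category set, and a rules-based build in two languages.
print(training_set_cost(50))
print(rules_based_hours(2))
```

For a 50-category set this gives a floor of $5,000 in coding costs, and for two languages a rules-based build comes to 400 to 600 analyst hours up front plus 200 to 400 hours a year. Software licence costs come on top of both.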

If you decide to do it yourself, you may wind up spending a majority of your time creating text analytics results rather than using them. While this may be a deal breaker for some, you will have full control of the process and can fine-tune to your heart’s content.

Finding a Solution that Best Fits Your Company

Having a text analytics solution for your company is crucial. It will increase the speed at which you gain customer insights. Before setting up text analytics, companies first need to decide if they want to use a machine learning approach or a rules-based approach. They then need to decide on a DIY or service-based implementation.

In our experience, and for our purposes, the level of machine learning precision is not acceptable compared to the results we obtain with a rules-based approach. By developing and maintaining category sets with input from our customers, we ensure they are relevant and actionable. These category sets are also based on decades of experience in the various business sectors in which we work (automotive, retail, hospitality, banking, insurance, restaurant, telecommunications, etc.). We measure precision for each client implementation, and we are generally above 90 percent for categorisation, and 85 percent for sentiment.
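For readers who want to measure precision on their own implementation, here is one common way to compute it. This is the standard definition (the share of the system’s assignments that a human audit confirms as correct), not necessarily the exact procedure described above, and the function name and sample data are hypothetical:

```python
def precision(assigned, confirmed):
    """Share of system assignments that human auditors confirmed as correct.

    assigned:  set of (comment_id, category) pairs the system produced.
    confirmed: set of (comment_id, category) pairs auditors judged correct.
    Returns None when the system made no assignments.
    """
    if not assigned:
        return None
    return len(assigned & confirmed) / len(assigned)


# Hypothetical audit sample.
assigned = {(1, "Coffee"), (1, "Tea"), (2, "Food Quality"), (3, "Drinks")}
confirmed = {(1, "Coffee"), (2, "Food Quality"), (3, "Drinks"), (3, "Coffee")}

print(precision(assigned, confirmed))
```

Here three of the four assignments were confirmed, so precision is 75 percent. Precision says nothing about assignments the system missed, such as (3, "Coffee") above; that is measured separately as recall.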

In short, we don’t use machine learning because we’ve discovered we get more accurate, more reliable results with our rules-based approach. This is what works best for us. Implementing a machine learning solution may work well for your needs. It’s important to do your own research before making a decision.

Many companies we work with have opted for a service-based approach to categorisation, but some have chosen to invest the time and money to develop in-house text analytics solutions and expertise. And some even have a combination of the two. Most importantly, almost all have concluded that categorising their comments with text analytics is a key way to understand what their customers are saying.

Deciding on the “how” and the “who” is the first step in putting into place a solution that will help you make sense of the mounds of unstructured data that are increasingly flooding into your company.

Brion Scheidel is Director, Text Analytics at MaritzCX, a US company that utilises the technology to determine customer sentiment and apply predictive analytic techniques to help organizations increase customer retention.