How to Solve Content Auto-Classification with High Accuracy

By Yanko Ivanov, Senior Knowledge Management Consultant at Enterprise Knowledge (EK), LLC

As technologies evolve, we have seen the rise of auto-tagging, auto-classification, and auto-categorization tools that attempt to take over the task of describing the content we create. These tools apply metadata tags automatically so we don’t have to.

Yet, in many cases, the accuracy of auto-tagging efforts has been underwhelming. Why is that? More often than not, the cause is the way the technology has been applied, rather than the technology itself.

Even if we implement a machine learning algorithm, we still need to teach the machine our language and the way we describe things within our domain. A machine learning algorithm is like a toddler who first needs to learn the basics of your language. At EK we educate these “toddlers” by applying the following methodology.

Develop Your Taxonomy/Thesaurus, i.e. vocabulary - To start with, you need to teach your toddler the basics of your domain. This is where your business taxonomy is critical. It helps describe the knowledge in your organization and provides a structure from which the machine learns that solar energy is a type of energy source, and that an article containing that term may be talking about energy sources or clean energy. We help our clients design their taxonomy so that it is intuitive for people and simultaneously understandable for a machine. Utilizing industry standards, we apply alternative labels (e.g. synonyms) for terms, and we identify how terms are related to each other outside of a simple parent-child hierarchy.
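To make this concrete, here is a minimal Python sketch of a taxonomy with preferred labels, alternative labels, a parent term, and related terms. The `Concept` class, the `TAXONOMY` entries (including the synonyms "solar power" and "green energy"), and the `suggest_tags` function are all hypothetical names chosen for illustration, and the naive substring matching stands in for whatever your tagging tool actually does.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One taxonomy term (hypothetical structure, for illustration only)."""
    pref_label: str
    alt_labels: List[str] = field(default_factory=list)  # synonyms
    broader: Optional[str] = None                         # parent term
    related: List[str] = field(default_factory=list)     # non-hierarchical links


# Illustrative entries; the synonyms here are assumptions, not client data.
TAXONOMY: Dict[str, Concept] = {
    "energy source": Concept("energy source"),
    "clean energy": Concept("clean energy", alt_labels=["green energy"],
                            broader="energy source"),
    "solar energy": Concept("solar energy", alt_labels=["solar power"],
                            broader="energy source", related=["clean energy"]),
}


def suggest_tags(text: str, taxonomy: Dict[str, Concept]) -> List[str]:
    """Return matched concepts plus their broader and related concepts."""
    lowered = text.lower()
    tags = set()
    for key, concept in taxonomy.items():
        labels = [concept.pref_label, *concept.alt_labels]
        # Naive substring match on any label of the concept.
        if any(label.lower() in lowered for label in labels):
            tags.add(key)
            if concept.broader:
                tags.add(concept.broader)
            tags.update(concept.related)
    return sorted(tags)


print(suggest_tags("New solar power plant opens", TAXONOMY))
```

Because "solar power" is recorded as a synonym of solar energy, the sample headline picks up the solar energy tag along with its parent (energy source) and its related term (clean energy), even though none of those phrases appears in the text.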

Select Your Teaching Tool, i.e. corpus - Next, we need to expose our toddler to the world, or at least to a contained playground so that it can apply what it already knows (the taxonomy/thesaurus), and learn new things. To do that, we help our clients identify a representative sample of their content that we then feed to the machine learning algorithm. This achieves two goals:

  • confirm that the taxonomy/thesaurus we developed actually describes the content domain of the organization; and
  • identify potentially new terms or synonyms in the content that should be included in the taxonomy to ensure comprehensive coverage.
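The two goals above can be sketched roughly in Python. This is a simplified illustration under stated assumptions: the function name `coverage_and_candidates`, the `min_count` threshold, and the use of frequent two-word phrases as candidate terms are all hypothetical choices, and a real text mining tool would use far more sophisticated term extraction.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def coverage_and_candidates(documents: List[str],
                            known_labels: Iterable[str],
                            min_count: int = 2) -> Tuple[float, List[str]]:
    """Return (share of documents matching any known label,
    frequent two-word phrases that are not yet in the vocabulary)."""
    known = {label.lower() for label in known_labels}
    covered = 0
    candidates: Counter = Counter()
    for doc in documents:
        text = doc.lower()
        # Goal 1: does the existing vocabulary describe this document?
        if any(label in text for label in known):
            covered += 1
        # Goal 2: collect two-word phrases the vocabulary does not know yet.
        words = re.findall(r"[a-z]+", text)
        for first, second in zip(words, words[1:]):
            phrase = f"{first} {second}"
            if phrase not in known:
                candidates[phrase] += 1
    coverage = covered / len(documents) if documents else 0.0
    frequent = [p for p, n in candidates.most_common() if n >= min_count]
    return coverage, frequent


sample = [
    "Solar energy costs fall",
    "Wind power grows fast",
    "Wind power beats coal",
]
coverage, new_terms = coverage_and_candidates(sample, ["solar energy"])
print(coverage, new_terms)
```

In this toy sample only one of the three documents matches the vocabulary, and "wind power" surfaces as a recurring phrase worth reviewing for inclusion.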

Enhance Your Taxonomy/Thesaurus - Integral to the step above, we now need to define additional terms or concepts so that the toddler can understand what they are and how they fit in its world. In other words, through the automated content analysis and text mining in the previous step, we look through the items that the machine learning algorithm identified, and if applicable we include them in the correct place in the taxonomy, or add them as synonyms or alternative terms for items that already exist. This step helps enhance your taxonomy and increase its expressiveness.

Revising and enhancing your taxonomy enriches your toddler’s vocabulary so it can identify even more things with ever greater accuracy. This step is critical for achieving highly accurate auto-tagging results. Think of it this way: the richer your vocabulary, the more eloquent you are. Additionally, once the toddler has its base vocabulary, it will need less and less help when running across new terms. It will start identifying them correctly through their relationships with terms in its vocabulary.
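As a rough illustration of this enhancement step, the Python sketch below applies a reviewer's decisions to a small taxonomy: each candidate term is either added as a synonym of an existing term or placed as a new term under a parent. The dictionary layout, the `apply_review` function, and the sample decisions are hypothetical and only meant to show the shape of the workflow.

```python
from typing import Dict, List, Optional, Tuple

Taxonomy = Dict[str, dict]  # term -> {"alt_labels": set, "broader": str or None}


def apply_review(taxonomy: Taxonomy,
                 decisions: List[Tuple[str, str, Optional[str]]]) -> Taxonomy:
    """Return a new taxonomy with reviewed candidate terms merged in.

    Each decision is (candidate, action, target), where action is
    "synonym" (add candidate as an alternative label of target) or
    "new" (add candidate as a new term whose parent is target).
    """
    # Copy so the original taxonomy is left untouched.
    updated = {
        name: {"alt_labels": set(entry["alt_labels"]), "broader": entry["broader"]}
        for name, entry in taxonomy.items()
    }
    for candidate, action, target in decisions:
        if action == "synonym":
            updated[target]["alt_labels"].add(candidate)
        elif action == "new":
            updated[candidate] = {"alt_labels": set(), "broader": target}
        else:
            raise ValueError(f"Unknown action: {action}")
    return updated


base = {
    "energy source": {"alt_labels": set(), "broader": None},
    "solar energy": {"alt_labels": {"solar power"}, "broader": "energy source"},
}
# Illustrative reviewer decisions, not real client data.
reviewed = apply_review(base, [
    ("photovoltaics", "synonym", "solar energy"),
    ("wind power", "new", "energy source"),
])
print(sorted(reviewed))
```

Keeping a human reviewer in the loop, rather than merging every mined phrase automatically, is what lets the vocabulary grow without filling up with noise.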

Achieve Accurate Auto-tagging - The last step in this process is integrating and fine-tuning your auto-tagging process. By the time we get to this step, our toddler has learned quite a bit, and we’re really helping it refine its vocabulary. During this step, we apply rules to disambiguate terms that could easily be mixed up, like “share” as in stock vs. “share” as in a piece of the pie.
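A disambiguation rule of this kind might look something like the following Python sketch. It is only an illustration: the `RULES` table, the cue words, the sense names, and the `window` size are assumptions made for the example, and commercial tagging tools typically offer their own rule syntax.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical rule table: for an ambiguous term, each sense lists cue
# words that suggest that sense when they appear nearby.
RULES: Dict[str, List[Tuple[str, Set[str]]]] = {
    "share": [
        ("share (stock)", {"stock", "investor", "dividend", "market"}),
        ("share (portion)", {"pie", "slice", "portion", "piece"}),
    ],
}


def disambiguate(term: str, text: str, window: int = 5) -> Optional[str]:
    """Pick the sense whose cue words appear most often near the term.

    Returns None when no cue word is found, so the term can be left
    untagged or sent for manual review. On a tie, the sense listed
    first in RULES wins.
    """
    words = re.findall(r"[a-z]+", text.lower())
    scores = {sense: 0 for sense, _ in RULES[term]}
    for i, word in enumerate(words):
        if word != term:
            continue
        # Look at up to `window` words on each side of the term.
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        for sense, cues in RULES[term]:
            scores[sense] += sum(1 for w in context if w in cues)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("share", "Investors sold every share after the stock fell"))
print(disambiguate("share", "She ate her share of the pie"))
```

Returning `None` when no cue is present is a deliberate choice in this sketch: an untagged term is usually less harmful than a confidently wrong tag.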

In summary, auto-tagging is a powerful feature that helps organizations better describe their content while making better use of people’s time. By teaching your toddler your language, you no longer need to take precious time away from your content creators, SMEs, and end users to ensure that your content is properly tagged. This results in happier content creators and increased accuracy in content tagging. The ultimate result of this effort is content that is easier to find and reuse.

Yanko Ivanov is a management consultant focusing on business analysis, system design, and integration.