The sore idea of thesauri

Post date:

Sunday, April 13, 2003 - 01:00

Classification of business concepts into a working taxonomy is one of the most under-rated elements of knowledge management.

By Paul Montgomery

Know what you know. It is one of the catchphrases of knowledge management, and while it is easy to say, it is devilishly hard to do. Even if you manage to extract all the important knowledge that had been locked away in employees’ heads and in legacy IT systems, the problem remains of how to make sense of all this data: how to know what it is that you know now, after you set out to know what you knew. Confusing?

This may seem like mere semantics, but think about a library: where would you know to find a book if you didn’t have the Dewey classification system?

The taxonomies now applied to corporate records trace their lineage to Carolus Linnaeus, who in 1758 proposed a Latin-based naming scheme for biology which we now know as the Linnaean classification system. For instance, humans are usually referred to as Homo sapiens in biology, but the full classification for humans extends to twelve levels, through superfamily Hominoidea, infraclass Eutheria, subphylum Vertebrata, and all the way up to kingdom Animalia.

Thankfully, classification systems for companies are usually far less complex. Rather than organisms being classified based on numbers of legs or hardness of exoskeleton - as biological classifications specify - corporate data is most often evaluated according to function, activity and transaction. Function is the highest level, referring to what a business unit’s role or goal is for the organisation. Activity is a level where data is sorted according to the actions that it relates to, and the transaction layer is where records are labelled differently for each type of act, decision or communication.

“Records management classification schemes are based on hierarchies,” said Stephen Bedford, records manager at the State Library of NSW and one of the architects of Keyword AAA, the most popular classification scheme in Australia. “The hierarchy is generally meant to reflect the business activity that is going on in an organisation. The hierarchy is generally meant to have functions at the highest level: something like personnel, or if you are at a bank, then another function might be customer account management or customer services. That’s the highest level. The next level down is activity to support that, where you have things like product development. The lowest level is generally meant to represent transactions or groups of transactions. There you would have terms to describe individual account management.”

The application of this hierarchy of documents is achieved through the development of a thesaurus, which in the aforementioned scheme contains all the possible transactions an organisation performs, and matches them with their activity and function types. In a taxonomy project, older files are converted to conform to the new system, and knowledge workers are instructed to file new documents under the correct designation.
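As a rough illustration of that structure, the sketch below models a thesaurus as a list of three-level terms and looks up the full path for a transaction. It is a minimal sketch, not the Keyword AAA scheme itself: the `Term` class, the `classify` and `file_title` helpers and every sample entry are assumptions made for the example, loosely following the bank terms Mr Bedford mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One thesaurus entry: a transaction with its parent activity and function."""
    function: str
    activity: str
    transaction: str


# Illustrative entries only; a real thesaurus would be built with content experts.
THESAURUS = [
    Term("Personnel", "Recruitment", "Job applications"),
    Term("Customer Services", "Product Development", "Individual account management"),
    Term("Customer Services", "Product Development", "Product proposals"),
]


def classify(transaction: str) -> Term:
    """Return the full function/activity/transaction path for a transaction term."""
    wanted = transaction.strip().lower()
    matches = [t for t in THESAURUS if t.transaction.lower() == wanted]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one match for {transaction!r}, found {len(matches)}"
        )
    return matches[0]


def file_title(term: Term) -> str:
    """Build a file title that carries all three levels of the hierarchy."""
    return " - ".join((term.function, term.activity, term.transaction))
```

Calling `file_title(classify("individual account management"))` yields a title that starts with the function and ends with the transaction, which is the designation a knowledge worker would file the record under.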

The thesaurus is at the heart of a good taxonomy system, and the usual practice is to take an existing thesaurus relevant to the type of industry the organisation is in, and modify it to fit specific needs. As in many other RM pursuits, Australia has been leading the way in developing best-practice thesauri. Keyword AAA, developed by the Archives Authority of NSW, has become the most popular basis for Australian thesauri. Mr Bedford, one of those who formulated Keyword AAA at the Authority in 1995, said that while an organisation would always have to write its own thesaurus, it could take another thesaurus as a base.

“You have to build your own terms on top, because every organisation is unique,” he said. “If you’re a private company, your records have a different legal status than in public organisations. There can be differences state by state as well.”

Mr Bedford said that while the worthiness of a good thesaurus was usually measured by its ability to make document search and retrieval much better, there were also many benefits in records managers gaining better control over the records.

“Records management is evidence of what a business has done: what happened, who did it and why. Because they reflect that difference of purpose, RM classification schemes differ because they are about linking back records to the activities that created them. RM classification schemes are not just about finding information, but also a lot more about managing information. Most users, when they search for records, search on free text, or full text retrieval. How an RM thesaurus helps that process is by giving context to the hits they get back. If they search for something like Jones, is that Peter Jones the staff member, Jones Catering Company, or Alice Jones the client?”

“Retrieval is only part of the issue. Of equal or more importance is using classification to determine the default values to help manage records. If you classify records one way, that means you can attach a default workflow based on that class, you can define an access regime, and determine how long you need to keep it. There are a whole lot of other things that make thesauri quite valuable.”
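The idea of hanging default values off a record class can be sketched in a few lines. The example below is hypothetical throughout: the `ClassDefaults` structure, the workflow names, the access groups and the retention periods are invented for illustration and do not reflect any real disposal authority.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClassDefaults:
    """Management defaults attached to a (function, activity) class."""
    workflow: str
    access: frozenset
    retention_years: int


# Hypothetical values; real retention periods come from legislation and policy.
DEFAULTS = {
    ("Personnel", "Recruitment"): ClassDefaults("hr-review", frozenset({"hr"}), 7),
    ("Fleet Management", "Licensing"): ClassDefaults(
        "fleet-renewal", frozenset({"fleet", "finance"}), 5
    ),
}


def defaults_for(function: str, activity: str) -> ClassDefaults:
    """Look up the defaults a record inherits from its classification."""
    return DEFAULTS[(function, activity)]


def dispose_after(created: date, defaults: ClassDefaults) -> date:
    """Earliest disposal date: the creation date plus the retention period."""
    target_year = created.year + defaults.retention_years
    try:
        return created.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return created.replace(year=target_year, day=28)
```

Once a record is classified, its workflow, access regime and keep-until date all follow from a single lookup rather than from a decision made afresh for every document.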

CONTENT EXPERTS

Once you gain approval for a taxonomy project, there are several decisions you have to make in developing and implementing a thesaurus. On the development side, the usual practice is to hire a “content expert”, a consultant who is familiar with your industry, to help your own knowledge workers to formulate a coherent and comprehensive set of classification terms. Conni Christensen, managing director of Sydney-based consultancy firm Synercon, said the role of a consultant in a taxonomy project was “unifying things”, to deliver business alignment, consistency, and stability of systems.

“What we tend to be involved in, in a taxonomy project, is a part of an overall information management implementation. We are talking about an entire strategic IM approach. One thing those normally don’t have is a common classification scheme,” she said. “Taxonomy is an absolutely vital underpinning, a foundation infrastructure that must be in place. It supports records management, document management, knowledge management and content management.”

The development of a thesaurus is not a standard IT project where technical skill is paramount. Ms Christensen said that there were few people with the necessary blend of technological nous and a background in English skills to act as a content expert, as it was not a standard career path for an arts graduate.

“The hard part is, as it was in the old IT industry, that some people have the IT skills, and some have communication skills. People with both do well. Vendors have good technical skills, and consultants have good language skills. Thesauri are all about English comprehension, and using it through technology. People who bring together those skills are few and far between. When they have got them, it’s a powerful combination. Our job is to bring those two together,” she said.

Ms Christensen said consultants should be involved in a taxonomy project “from start to finish”, adopting a change management approach to ensure that the users satisfy their need to know where they are now, and where they should be.

“You can’t throw technology at the problem and get a fix until you get non-technical business rules and classification schemes. The infrastructure has to be there before you throw technology at the problem. We’ve always been involved in that. Where most of our clients come unstuck is when they don’t have business rules and classification, so we tell them they need to step back and put that in place. You don’t get return on investment until you have that stuff in place,” she said.

WORD PLAY

Once you have a suitable content expert, the work has only just started. To prevent confusion among users about which term should be applied to their newly-created record, the developers of the thesaurus in your organisation have to be very careful about the definition and application of terms, according to Mr Bedford.

“For example, if you are at a local council, you would be licensing dogs, and also own a fleet of cars,” he said. “You would have dog management and fleet management as functions. The second level term would be licensing, but it would be relating to both functions. If you look at licensing, you must say that you have to narrow your terms. There would be no way of being able to stop users missing the correct classification.”

Mr Bedford said there were often difficulties in taxonomy projects with user resistance, especially if the new system imposed an unfamiliar structure on the normal way employees had been doing their job.

“You are expecting users to understand and use your schemes. That might be a problem with the way thesauri are being developed now. You should be able to describe people’s work in a way they are familiar with. Records classification schemes should be written in such a way so that people can say, ‘Those are my five jobs’.”
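The local council example in this section, where licensing sits under both dog management and fleet management, is the kind of overlap a thesaurus developer can check for mechanically before users ever see the scheme. The sketch below assumes a hypothetical scheme held as a simple mapping; the function names and the "Complaints" and "Maintenance" activities are invented for illustration.

```python
# Hypothetical council scheme: the same activity term appears under two functions.
SCHEME = {
    "Dog Management": ["Licensing", "Complaints"],
    "Fleet Management": ["Licensing", "Maintenance"],
}


def functions_for(activity: str) -> list[str]:
    """List every function under which an activity term appears."""
    return sorted(f for f, activities in SCHEME.items() if activity in activities)


def ambiguous_activities() -> dict[str, list[str]]:
    """Activity terms that a user could file under more than one function."""
    all_activities = {a for activities in SCHEME.values() for a in activities}
    return {
        a: functions_for(a)
        for a in sorted(all_activities)
        if len(functions_for(a)) > 1
    }
```

Any term this check returns is a candidate for narrowing or for a scope note, since a user who picks it without also picking the function has not classified the record unambiguously.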

Mr Bedford is one of many Australian experts working on a replacement for Keyword AAA to help public and private organisations develop better thesauri. He said that the basis for Keyword AAA, the ISO 2788 standard for monolingual thesauri, had not taken into account the difference between RM thesauri and library-based thesauri.

“Keyword AAA is the best thing out there at the moment, but it does have some weaknesses, like in some of the terms at the second level. There can sometimes be some quite vague concepts,” he said. “For instance, under recruitment, which is a sub-function of personnel, you have reviewing and evaluating – what’s the difference? Evaluating is supposed to be the first time.”

MANUAL VERSUS AUTOMATED

After you have what you think is a comprehensive thesaurus, it's time to apply the technology. The major choice is how much of the classification of documents should be performed manually by employees, and how much should be left to software. For the taxonomy project leader this is a trade-off between cost and quality: a knowledge worker will always be able to classify documents better than software, but it is often not worth the expense of putting them to work on records which could quite reliably be handled by machine intelligence.

"There are obviously two different spheres here," said Alan Chate, managing director of specialist Australian taxonomy software developer This to That. “We have not gotten into automation. That effectively means scanning all your documents and producing a taxonomy for that particular site. Most of our customers are using paper documents and a classification scheme and therefore require manual entry. We’re looking for a specific area of the market.”

Mr Chate said the choice would be a “horses for courses situation”, depending partly on the size of the business.

“Either approach still requires a lot of careful checking to produce a quality thesaurus that can be used for accurate checking. You have to remove noise, remove duplicates, and check for consistency. Automated generation is used for very large situations,” he said.
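The three checks Mr Chate lists can be sketched as a small clean-up pass over candidate terms. This is only an outline of the mechanical part, under assumptions: the noise-word list is invented, "consistency" is interpreted narrowly as one term having more than one broader term, and the expert review he insists on still has to follow.

```python
import re

# Assumed noise words; a real list would be agreed with the content experts.
NOISE = {"misc", "general", "other", "various"}


def normalise(term: str) -> str:
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return re.sub(r"\s+", " ", term).strip().lower()


def clean_terms(candidates):
    """Drop noise terms and duplicates, preserving first-seen order."""
    seen = set()
    kept = []
    for raw in candidates:
        key = normalise(raw)
        if not key or key in NOISE or key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


def inconsistent_parents(pairs):
    """Return terms that have been assigned more than one broader term."""
    parents = {}
    for term, broader in pairs:
        parents.setdefault(normalise(term), set()).add(normalise(broader))
    return {t: sorted(p) for t, p in parents.items() if len(p) > 1}
```

The output of `inconsistent_parents` is a review list rather than an error list: a term filed under two parents may be legitimate, but someone who knows the business has to confirm it.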

Mr Chate pointed out that taxonomy was only a resource to point to documents, not to change them, and thus documents might well slip through the taxonomy if they are misfiled.

“It is not a matter of economics, I don’t believe,” he said. “It is a matter of obtaining the best level of accuracy for your thesaurus, so that you get the most effective use out of it. Whichever approach you choose, the thesaurus must be very carefully checked by experts in your field. Thesauri prepared by inexperienced people will be of no purpose.”

Mr Chate said that the “big thrust” in taxonomy was now occurring in government circles, where legislation has mandated the use of robust records management systems. He said these practices were “starting to trickle into the corporate world”, especially with the signature examples of bad records management at WorldCom, Enron and BAT.

“The confusion that exists in the use of thesauri for RM is why the IT2209 committee formed in Australia. It is a sleeping industry in Australia, but it will be an awakening giant, I’m sure of that, because government will put stricter requirements on organisations,” said Mr Chate.

While there are a number of specialist taxonomy software providers, many broader applications also include their own thesaurus features. Rob Whiter, business development manager at knowledge management software vendor Hummingbird, said that his company was now agnostic over the manual/automated debate, after having previously been on the automation bandwagon in its Fulcrum division.

“No doubt, organisations are looking to start off with a manual classification function,” he said. “They want to go in and tell the tool exactly how to do it: a fully invasive manual approach. To get these systems where risk mitigation is the main purpose, rather than discovery, you can’t have the tool changing the way it does something without telling someone.”

However, Mr Whiter said the automated approach was most appropriate for electronic records, which in services organisations in particular are becoming a primary source of important business records.

“Records managers are now trying to capture and qualify, from a business function point of view, information that is electronic, volatile and very mobile. Email is the obvious culprit here. Especially with legal firms, emails are increasingly representing business functions and actions. To ask people to classify all their emails is not practical,” he said.

“Taxonomy is important both for the way you do RM classification, and the way you look at the data. One method is to classify differently for browsing, which wouldn’t change the RM classification in an organisation. You would just change it for discovery purposes, it’s a transitive state. Finally, RM staff have had to relent a little on their hard line stance and admit that they do need machine intelligence to help them classify.”

Mr Bedford said that without modification, automated techniques would not work in records management classification.

“What RM is all about is maintaining the business context of what’s going on. With techniques like word frequency tests, I can’t conceive of how a word frequency test would tell you about the un-stated things in a business transaction. It may be possible to get automation in other transactions. If a particular workflow is chosen, that can be linked back,” he said.

Mr Bedford admitted that automation would be “quite valid” when used to index documents for retrieval purposes.

“There is a difference between classifying information to find, and to find and manage. If you’re talking about to find, automation is as useful a tool as any other. You have to be really sure of it if you are going to make management decisions based on those classifications,” he said.

CRITICAL EVENTS

John Townsend, managing director of content management software developer HarvestRoad, said his company viewed its worth to a taxonomy project as adding value by storing data only once, but linking it to all other documents, something which has not happened in the shredding scandals of the new century.

“It has taken some critical events to make public companies in particular stand up and take notice. Everyone has had eyes on the holy dollar in the last few years during the Internet boom, and they weren’t giving much thought to records management,” he said.

The main customer base of HarvestRoad is in education, which Mr Townsend said was an industry in which users were “very aware” of taxonomies.

“Unless there’s a massive amount of material, all data has to be classified very accurately, so typically it’s a manual approach,” he said. “You would be surprised how much work some data entry workers can achieve given good guidelines, rather than spending an awful amount of money on automated tools. You can write programs to go through a directory and automate the publishing process, and we do that. You can extract information out of objects, and you can do a reasonable job that way. Most customers end up spending a fair bit on it.”

Mr Townsend said his customers usually took from two weeks to twelve weeks to complete a taxonomy project, depending on the resources they had available, but that project leaders had to focus on the business case to determine the balance between quality and cost.

“You have to think of the outcome the business is trying to receive. If it costs $100,000 to do the back capture, are you going to get $100,000 of return out of that? Many projects don’t require that level of back capture, and you don’t want the project being delayed,” he said.

“Automation is not perfect, but neither are humans. A combination approach is advantageous. You don’t want to spend a lot of money for what people bashing away in a corner could achieve in the same time with greater accuracy.”

After the technology has been implemented, the final task in a taxonomy project is training staff to use it. Mr Townsend said it was not an easy thing for users to know how to file documents, or how to find them again using the new system.

“A very important part of an implementation plan is that end users know how to file something in the system. Typically, when presented with something that looks like the Windows XP file system, they would be tempted to file a document under one category. That is the limitation of the file system. The taxonomy idea is to file it under all types that it fits. You have got to give enough training to make users realise the capabilities of the system: that it is not just a file system, it is a powerful information categorisation tool. The way to do that is to say that it gives the user freedom,” he said.
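The difference between a single-folder file system and filing a document "under all types that it fits" can be shown with a small category index. This is a generic sketch, not a description of how HarvestRoad's product works; the `CategoryIndex` class, its methods and the document identifiers are assumptions made for the example.

```python
from collections import defaultdict


class CategoryIndex:
    """Store each document once, but make it findable under every category it fits."""

    def __init__(self):
        self._docs_by_category = defaultdict(set)

    def file(self, doc_id, categories):
        """File a document under one or more categories."""
        categories = list(categories)
        if not categories:
            raise ValueError("a document must be filed under at least one category")
        for category in categories:
            self._docs_by_category[category].add(doc_id)

    def find(self, *categories):
        """Return documents filed under every one of the given categories."""
        sets = [self._docs_by_category.get(c, set()) for c in categories]
        return sorted(set.intersection(*sets)) if sets else []
```

A memo filed with `index.file("memo-17", ["Personnel", "Fleet Management"])` then turns up whether the user browses either category alone or narrows the search to both, which a single folder path cannot offer.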
