Friday, March 23, 2007

LS 500 Thesauri, Taxonomies & Ontologies Oh My!

Gilchrist, A. (2003). Thesauri, Taxonomies and Ontologies - An Etymological Note. Journal of Documentation 59:7-18.

This article attempts to clarify the differences and similarities between the three terms: Thesauri, Taxonomies and Ontologies as they are presently being used by Information Scientists, AI Practitioners, those working on the foundations of the Semantic Web as well as others.

The beginning of the article lists the definitions from the Oxford English Dictionary (OED):

Thesaurus

A "treasury" or "storehouse" of knowledge, as a dictionary, encyclopedia or the like.
A collection of concepts or words arranged according to sense; also a dictionary of synonyms and antonyms.

Taxonomy

Classification, esp. in relation to its general laws or principles; that department of science, or of a particular science or subject, which consists in or relates to classification; especially the systematic classification of living organisms.

Ontology

The science or study of being; that department of metaphysics which relates to the being or essence of things, or to being in the abstract.

Thesauri

The article mentions that the word thesaurus makes most lay people think of Roget's Thesaurus of English Words and Phrases (first published in 1852). I would be one of those people who thought of Roget; it also makes me think of some kind of dinosaur that likes words, but that could just be my imagination taking over.

The next reference of the word comes from Wilkins' 1668 Essay Towards a Real Character and a Philosophical Language which included a decimal classification that ranged from God to "public relationships (civil, judiciary, naval, military, ecclesiastical)."

The third reference of the word thesaurus comes from a paper by Helen Brownson from the American National Science Foundation. Vickery quoted Brownson as saying "The application of a mechanized thesaurus based on networks of related meanings."

The article also points out that some scholars are of the opinion that "The thesaurus may become almost invisible to most users." The author suggests that the conventional thesaurus should be extended and elaborated to include term definitions, notes on term usage and more explicitly defined relationships. One of the great benefits of these elaborations is that the semantic network could be more easily manipulated by an inference engine, most likely using an IF...THEN operator.
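To make the idea of an elaborated thesaurus a little more concrete for myself, here is a minimal Python sketch. It is my own illustration, not something from Gilchrist's article: the Term class, its field names, the sample terms and the all_broader function are all assumptions I made up. It shows a thesaurus entry carrying a definition, a usage note and explicit relationships, plus one IF...THEN style rule (IF x has broader term y AND y has broader term z THEN x has broader term z).

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One hypothetical thesaurus entry with the extras the article suggests."""
    label: str
    definition: str = ""                             # term definition
    usage_note: str = ""                             # note on term usage
    broader: set[str] = field(default_factory=set)   # explicit "broader term" links
    related: set[str] = field(default_factory=set)   # explicit "related term" links
    use_for: set[str] = field(default_factory=set)   # non-preferred synonyms


def all_broader(thesaurus: dict[str, Term], label: str) -> set[str]:
    """Apply the rule: IF x has broader term y AND y has broader term z
    THEN x has broader term z. Returns every direct or inferred broader term."""
    found: set[str] = set()
    stack = list(thesaurus[label].broader)
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited, which also protects against cycles
        found.add(current)
        if current in thesaurus:
            stack.extend(thesaurus[current].broader)
    return found


# Made-up sample data, purely for illustration.
thesaurus = {
    "poodle": Term("poodle", definition="A breed of dog.", broader={"dog"}),
    "dog": Term("dog", usage_note="Use for domestic dogs only.",
                broader={"mammal"}, use_for={"canine"}),
    "mammal": Term("mammal", broader={"animal"}),
    "animal": Term("animal"),
}

print(sorted(all_broader(thesaurus, "poodle")))
```

Only the link from poodle to dog is stored directly; the links to mammal and animal are inferred by the rule, which is roughly what I take "manipulated by an inference engine" to mean.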

Taxonomies

Inevitably, individuals will introduce old words with new meanings into the consciousness of the mainstream. This occurrence has renewed an interest in taxonomies; four triggers generally cause this to happen:

Information Overload - Conventional search engines are inadequate in dealing effectively with very large databases and users are in desperate need of search aids and filters.

Information Literacy - End users have severe problems in properly knowing how to search for information causing much time to be wasted and critical information to be missed.

Organizational Terminology - Published classifications and thesauri do not reflect the languages of particular organizations, in which most often 80% of the information is created internally.

"Destructuring" of Organizations - Mergers and acquisitions create cultural problems at the implementation stage. Similar issues have been encountered when extranets have been combined and when virtual communities are established.

The word Taxonomy was most frequently being used with at least five separate meanings:

  1. Web Directories - on the Internet and more and more in Intranets
  2. Taxonomies to Support Automatic Indexing - commercial Web sites
  3. Taxonomies Created by Automatic Categorization - software packages capable of automatic analysis
  4. Front End Filters - a taxonomy created or imported and used in query formulation
  5. Corporate Taxonomies - A number of thesauri get merged into a "megathesaurus"
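The front end filter meaning (number 4) was the easiest one for me to picture in code. Below is a small Python sketch of how a taxonomy might be used in query formulation: a search term is expanded to include every category beneath it. This is only my own guess at the idea; the TAXONOMY data and the expand_query and to_boolean_query functions are hypothetical and do not come from the article or from any real search product.

```python
# Hypothetical taxonomy: each category maps to its direct subcategories.
TAXONOMY = {
    "vehicles": ["cars", "bicycles"],
    "cars": ["sedans", "hatchbacks"],
    "bicycles": [],
    "sedans": [],
    "hatchbacks": [],
}


def expand_query(term, taxonomy):
    """Return the term followed by every category beneath it (depth-first)."""
    expanded = [term]
    for child in taxonomy.get(term, []):
        expanded.extend(expand_query(child, taxonomy))
    return expanded


def to_boolean_query(term, taxonomy):
    """Join the expanded terms into a simple OR query string."""
    return " OR ".join(expand_query(term, taxonomy))


print(to_boolean_query("cars", TAXONOMY))
```

A user who types "cars" would then also find documents that mention only sedans or hatchbacks, without needing to know those narrower terms.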

Ontologies

In Vickery's 1997 paper he quotes Gruber (a leader in the Ontology Field) as saying, "An ontology can be defined as a formal, explicit specification of a shared conceptualization."

WordNet and CYC are two of the oldest and most widely known ontologies. WordNet contains 100,000 word meanings grouped by five categories: nouns, verbs, adjectives, adverbs and function words. Two areas where the use of ontologies is being touted prominently are Knowledge Management and the Semantic Web.

The article mentions three kinds of ontologies as being useful for organizational memory systems:

  • An Organizational Ontology - which describes the information meta-model
  • A Domain Ontology - which describes the content of the information source
  • An Enterprise Ontology - which is used for modeling business processes

Conclusion

The article quotes Wittgenstein as saying, "If you want to know the meaning of a word, you should look to see how it is used." Looking at the applications of thesauri, taxonomies and ontologies it is easy to see a progression of ideas that has resulted in some overlapping of detail. This progression has been driven mainly by three factors:

  1. The growing trend in organizations to collate external and internal information
  2. The vast quantities of information now available (Microsoft has 3 million documents on its Intranet)
  3. Available and Inexpensive Computing Power

The article states that all three words (Thesauri, Taxonomies and Ontologies) deal with natural language and notes that taxonomies use both classification and thesaurus techniques. In his final statement Gilchrist states it is very obvious that multidisciplinary teams will be needed if such dreams as the Semantic Web are to become a reality.

~ My Perspective ~

Well first and foremost I have to say that I understood the article much more after attending class. I don't think you should have to read the article two or three times to fully understand the concepts. At points I felt that the author was just blathering on to hear himself blather.

The one point that seems to come up time and time again is that users do not know how to search properly, get frustrated wasting time and miss the most valuable and useful information that they are searching for. My other main complaint with the article is that Mr. Gilchrist uses the term AI at least three times in the article. At no point does he clarify what it means or use it in a context that I was able to glean the meaning from. I am therefore left to assume (and I hate to assume) that it means Artificial Intelligence. What exactly that has to do with the three terms the article discusses I am not quite clear. At this point I'm quasi-clear on the three terms since class but I still don't like the article as a whole. I think it could have been done in a much clearer manner that the lay people category (which apparently I fit into from his Roget's Thesaurus comparison) could understand.
