Friday 5 October 2012

Knowledge formalization for the masses

Summary

One of the greatest achievements of the Internet is the dramatic reduction in the time it takes to find information. However, this capacity has not greatly improved over the past decade, due in part to the explosion of unstructured content. The problem is even worse on enterprise intranets, where information is much more fragmented and lacks the interconnectivity of the Web that improves navigation and search result ranking.

The situation can be improved if content actors (producers, managers, and consumers) can combine efforts to better formalize information, making it easier to process by computers. While technologies to support such activity have been evolving, much still remains to be done. Lacking in particular are systems that assist content consumers, by far the largest segment of content actors.

Problem

Despite Apple’s Siri, we are still pretty far away from the Star Trek computer, i.e., a computer that is capable of answering any question based on the available information. While content search technology has gradually improved over time, it has not been able to keep pace with the information explosion that we are experiencing (aka Big Data). In the end, you still get a (usually large) collection of documents that may or may not contain what you are looking for.

Even when information is semantically structured and we can ask a computer for a specific information object, there still remains a problem of searching across multiple structured data sets with different data models. One cannot specify query parameters if these parameters are not the same across datasets.
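To make the mismatch concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the two datasets, their field names, and the `FIELD_MAP` vocabulary mapping are assumptions, not part of any real system. It shows that a single query only works across both datasets once their field names are mapped to a shared vocabulary.

```python
# Two hypothetical datasets describing the same kind of object
# with different field names (illustrative data only).
library_a = [{"author": "Ada Lovelace", "title": "Notes"}]
library_b = [{"creator": "Ada Lovelace", "name": "Sketch of the Analytical Engine"}]

# Assumed mapping from each dataset's own field names to a shared vocabulary.
FIELD_MAP = {
    "library_a": {"author": "author", "title": "title"},
    "library_b": {"creator": "author", "name": "title"},
}


def normalize(record, source):
    """Rename a record's fields to the shared vocabulary; drop unmapped fields."""
    mapping = FIELD_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def search(datasets, **criteria):
    """Return normalized records from all datasets that match every criterion."""
    results = []
    for source, records in datasets.items():
        for record in records:
            unified = normalize(record, source)
            if all(unified.get(k) == v for k, v in criteria.items()):
                results.append(unified)
    return results


datasets = {"library_a": library_a, "library_b": library_b}
hits = search(datasets, author="Ada Lovelace")
```

Without the mapping, a query on `author` would silently miss every record in the second dataset, which is exactly the situation described above.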

What’s needed is a way to formalize and merge semantic information structure:
  • Give semantic structure to content during its creation
  • Structure existing content through
      ◦ Adding external structured properties called metadata (e.g., topic, author, date, etc.)
      ◦ Extracting structured facts from unstructured content as an alternative knowledge representation
  • Unify and interlink the resultant structure across all data sets

The result would be much easier for computers to process and search automatically, with powerful consequences for information consumers. They would then have their Star Trek computer.

Unfortunately, this is a very complex undertaking that requires a lot of investment on the part of information creators, managers, and consumers. The supporting technology therefore faces the major challenge of boosting the ROI enough to reach the tipping point of mass adoption.

So far, the technology has generally taken two opposing approaches:
  1. The content formalization work is carried out by dedicated trained people who are referred to as Information Architects, Content Curators, etc. These people create domain-specific data models, interconnect different models, and use them to formalize new and existing content. Some of this work can be automated to a certain degree, but automation usually introduces a significant amount of noise.
  2. The content formalization is carried out by content consumers via so-called “free tagging”, whereby users can add whatever metadata they wish to content. Free tags are just simple short phrases that add a bit of semantic structure. While content consumers can be motivated to improve semantic structure for better retrieval, the resultant degree of formalism is very weak.
Sophisticated software and algorithms now exist that help experts create, manage, consolidate, and reuse metadata and data models, such as linguistic rules and reference vocabularies. However, no tools are available that empower content consumers and consolidate their contributions with those made by experts.

Solution

What’s needed is a platform that makes it possible to effectively crowdsource the content formalization task to those who use the content. After all, this formalization is done for the benefit of content users, so it makes perfect sense that they should have a say in how the content is formalized. Just as one can crowdsource the production of data, one can crowdsource the production of data models and metadata.

Many research papers have been written on this subject (see an example), but strangely no effective commercial tools exist that would support such a process. Yet, the underlying concept is fairly simple:
  • Allow users to create, define, and modify tags, split them into properties and values, and create semantic links between tags (e.g., synonyms, translation, sub-terms).
  • Allow users to discuss and evaluate modifications proposed by others (e.g., using voting). Merge identical modification proposals and count them as votes.
  • Allow automatic acceptance based on the number of votes, user profile, etc.
  • Allow different moderation rights based on user profiles (e.g., certain users can have expert status and reject modifications made by others).
  • Assist and guide metadata creation and consolidation by using search keywords and expert-generated data models.
As research papers indicate, there are a lot of details to work out, but conceptually the main issues have already been resolved and modeled. All that’s needed is to create a viable commercial product.
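The workflow in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing product: the class names (`User`, `Proposal`, `TagRegistry`), the fixed threshold of three votes, the case-insensitive merging of proposals, and the rule that only experts may reject are all hypothetical choices.

```python
from dataclasses import dataclass, field

ACCEPT_THRESHOLD = 3  # assumed number of votes needed for automatic acceptance


@dataclass
class User:
    name: str
    is_expert: bool = False  # assumed flag granting moderation rights


@dataclass
class Proposal:
    tag: str
    relation: str  # e.g. "synonym", "translation", "sub-term"
    target: str
    voters: set = field(default_factory=set)
    status: str = "pending"

    @property
    def key(self):
        return (self.tag, self.relation, self.target)


class TagRegistry:
    def __init__(self):
        self.proposals = {}  # key -> Proposal
        self.links = set()   # accepted (tag, relation, target) links

    def propose(self, user, tag, relation, target):
        """Submit a link; identical proposals are merged and counted as votes."""
        key = (tag.lower(), relation, target.lower())
        proposal = self.proposals.setdefault(key, Proposal(*key))
        if proposal.status == "pending":
            proposal.voters.add(user.name)  # a set, so each user counts once
            if len(proposal.voters) >= ACCEPT_THRESHOLD:
                proposal.status = "accepted"
                self.links.add(proposal.key)
        return proposal

    def reject(self, user, proposal):
        """Only users with expert status may reject a proposal."""
        if not user.is_expert:
            raise PermissionError("only experts can reject proposals")
        proposal.status = "rejected"
        self.links.discard(proposal.key)


registry = TagRegistry()
for name in ("ann", "bob", "cy"):
    p = registry.propose(User(name), "JS", "synonym", "JavaScript")
```

After the third distinct user submits the same link, the proposal is accepted automatically. A real system would also need discussion threads, weighting of votes by user profile, and guidance from search keywords and expert-built data models, none of which is modeled here.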

Market Overview

Content is consumed through a huge number of diverse software systems. Many prominent systems already use free tagging:
  • Enterprise collaboration platforms, such as SharePoint, Drupal, and Confluence.
  • Public collaboration platforms, such as Twitter, Stack Exchange, and Delicious.
A few of these systems are starting to offer a basic level of tag management functionality. For example, on Stack Overflow, users with enough reputation can specify and edit tag definitions as well as suggest and validate tag synonyms. While this is a move in the right direction, it is far from sufficient.

At the other end of the spectrum, Google has recently launched Knowledge Graph, which is the most complete public collection of expert-structured knowledge. This collection can be used by external services via an API, thereby providing a good basis for semi-automated enrichment of free tags.

Business model

The goal of the proposed solution is to enhance the functionality of existing content platforms, which can be accomplished in two ways:
  • Sale of a software component to content platform providers (OEM license)
  • Sale of a platform plugin to content platform users (software license or SaaS subscription)

Go-to-market strategy

Many of the platforms using free tagging provide API access and application marketplaces (see an example). The best starting strategy would be to develop tag management plugins for such platforms.

Moreover, those platforms that chose not to implement free tagging did so knowing its limitations, and so may change their minds once a better system is in place.
