Thursday, April 19, 2007

WikiCorpus, Natural Language Processing

I'm tentatively calling the website 'WikiCorpus Project'. I want the interface to be as easy to use as possible, so that entering the knowledge one has just read is a rapid and ergonomic process. Interestingly, designing the interface and its algorithms for ergonomics turns out to resemble the process of natural language processing itself.

There will be an on-screen list of predicates offered to the user for each sentence. Hopefully, this list will be predicted accurately enough that the user does not have to search for or create a new predicate. The predictive list will likely improve over time as the corpus is populated, using patterns in the sentence (sharpened by word sense disambiguation, see: WordNet) and the surrounding context. This (for now theoretical) algorithm and its statistics-based dataset (both to be freely downloadable) might be useful as a component in some NLP approaches.
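To make the idea of the predictive list concrete, here is a minimal sketch of one way it could work, assuming the simplest possible statistic: count which predicates users chose for sentences containing each word, then rank predicates for a new sentence by those counts. The class and method names (PredicateSuggester, record, suggest) and the example predicates are hypothetical, and word sense and context are left out entirely. It is an illustration, not the project's algorithm.

```python
import re
from collections import Counter, defaultdict


class PredicateSuggester:
    """Hypothetical sketch: rank predicates by word/predicate co-occurrence."""

    def __init__(self):
        # word -> Counter of predicates chosen for sentences containing that word
        self.word_predicate_counts = defaultdict(Counter)
        # predicate -> how often it has been chosen overall
        self.predicate_counts = Counter()

    @staticmethod
    def _tokens(sentence):
        # Crude tokenizer (an assumption): lowercase alphabetic words, deduplicated.
        return set(re.findall(r"[a-z']+", sentence.lower()))

    def record(self, sentence, predicate):
        """Store the predicate a user picked for a sentence."""
        self.predicate_counts[predicate] += 1
        for word in self._tokens(sentence):
            self.word_predicate_counts[word][predicate] += 1

    def suggest(self, sentence, k=5):
        """Return up to k predicates, most likely first."""
        scores = Counter()
        for word in self._tokens(sentence):
            scores.update(self.word_predicate_counts.get(word, {}))
        # Sort by score, break ties by overall popularity, then by name.
        ranked = sorted(
            scores, key=lambda p: (-scores[p], -self.predicate_counts[p], p)
        )
        # Fill the remaining slots with globally frequent predicates.
        for predicate, _ in self.predicate_counts.most_common():
            if predicate not in ranked:
                ranked.append(predicate)
        return ranked[:k]


suggester = PredicateSuggester()
suggester.record("Paris is the capital of France", "capitalOf")
suggester.record("Berlin is the capital of Germany", "capitalOf")
suggester.record("Mozart was born in Salzburg", "bornIn")
print(suggester.suggest("Rome is the capital of Italy", k=2))
```

Every predicate a user confirms feeds back into the counts, which is the sense in which the list should improve as the corpus is populated.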

Pronouns and other indirect entity references will have a context menu where the user can select which entity the reference refers to. Again, ideally the correct referent is the first in the list offered; the goal is to minimize the probability that the user has to select 'Other' and pick from a comprehensive list of the entities in the document.
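As a rough sketch of how that context menu might be filled, the example below assumes two very simple heuristics: keep only entities that agree with the pronoun in gender and number, and list the most recently mentioned ones first, with 'Other' always last. The Mention record, the PRONOUN_FEATURES table and rank_candidates are hypothetical names introduced for illustration; a real system would need far better features than these.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    entity: str           # name of the entity being mentioned
    position: int         # token offset of the mention in the document
    gender: str           # "m", "f", "n", or "?" when unknown
    plural: bool = False


# Assumed feature table: pronoun -> (gender, plural)
PRONOUN_FEATURES = {
    "he": ("m", False), "him": ("m", False), "his": ("m", False),
    "she": ("f", False), "her": ("f", False),
    "it": ("n", False), "its": ("n", False),
    "they": ("?", True), "them": ("?", True), "their": ("?", True),
}


def rank_candidates(pronoun, pronoun_position, mentions, k=3):
    """Return up to k compatible entities, most recent first, then 'Other'."""
    gender, plural = PRONOUN_FEATURES.get(pronoun.lower(), ("?", None))
    last_seen = {}
    for m in mentions:
        if m.position >= pronoun_position:
            continue  # only consider entities mentioned before the pronoun
        if plural is not None and m.plural != plural:
            continue  # number disagreement
        if gender != "?" and m.gender not in (gender, "?"):
            continue  # gender disagreement
        last_seen[m.entity] = max(m.position, last_seen.get(m.entity, -1))
    ranked = sorted(last_seen, key=lambda entity: -last_seen[entity])
    return ranked[:k] + ["Other"]


mentions = [
    Mention("Marie Curie", 0, "f"),
    Mention("Pierre Curie", 5, "m"),
    Mention("radium", 9, "n"),
    Mention("Marie Curie", 12, "f"),
]
print(rank_candidates("she", 15, mentions))
```

Each time a user overrides the first suggestion, that choice is a data point that could be used to tune the ordering.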

So where NLP aims to resolve these completely, the ergonomics of this project aims to offer an improving list of options to a user entering the knowledge they are reading. Good ergonomics means that the predicate they are going to use next is already on screen (predicted), and that when a pronoun is resolved the correct candidate appears (hopefully first) in a drop-down list.

The ergonomics, then, is an “easier” problem resembling information retrieval and predictive search, and I hope it can be of use to some algorithmic approaches to the “less easy” task of mechanically and accurately performing the entire process (reading) on the sentences of a document.
