Monday, April 30, 2007

Treebuilding Example

Confident in the theory, I went to my bookshelf, grabbed a book, and found a random sentence, which I worked through using the following (rough draft) operations:

NEW: begin a new predicate
PUSHCUR: push a noun onto the stack; other operations may use the top
POPCUR: remove top element from stack
WRAPL: move from current position into the left position of a new outer predicate; WRAPL(P) = C → P(C,_). After a NEW, the top of the stack is the current position.
SETR: set the right argument of a predicate
IPWRAPL: perform a WRAPL around the last placed argument
TRNS: a transition such as yet, however, etc.
MODP: modify the predicate in the current scope

“Leibniz took up the question in his baccalaureate thesis, and argued in the true scholastic style for a principle of individuation which would preserve the independence of universals with respect to ephemeral sensations, and yet embodied universal ideas in the eternal natures of individuals.”


NEW
PUSHCUR leibniz
WRAPL TookUp
SETR the question
WRAPL In
SETR his baccalaureate thesis
NEW
WRAPL Argued
WRAPL Manner
SETR true scholastic style
MODP for (Argued → ArguedFor)
SETR principle of individuation
PUSHCUR principle of individuation
NEW
WRAPL Preserves
SETR independence of universals
IPWRAPL RespectTo
SETR ephemeral sensations
NEW
POPCUR
TRNS yet
WRAPL Embodied
SETR universal ideas
IPWRAPL Regarding
SETR eternal natures of individuals
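
To make these operations concrete, here's a minimal sketch in Python of how such an instruction sequence might be interpreted into nested predicates. The tuple/dict representation and the exact semantics of NEW are my own assumptions for illustration; IPWRAPL, MODP and TRNS are left for a later draft.

def interpret(instructions):
    stack = []        # PUSHCUR/POPCUR operate on this
    current = None    # the node that WRAPL/SETR act on
    roots = []        # completed top-level predicates
    for op, *args in instructions:
        if op == "NEW":
            # begin a new predicate; after NEW, the top of the stack is current
            if isinstance(current, dict):
                roots.append(current)
            current = stack[-1] if stack else None
        elif op == "PUSHCUR":
            current = args[0]
            stack.append(current)
        elif op == "POPCUR":
            stack.pop()
        elif op == "WRAPL":
            # move current into the left slot of a new outer predicate: C -> P(C,_)
            current = {"name": args[0], "left": current, "right": "_"}
        elif op == "SETR":
            current["right"] = args[0]
        else:
            raise NotImplementedError(op)   # IPWRAPL, MODP, TRNS omitted here
    if isinstance(current, dict):
        roots.append(current)
    return roots

# Example: the first clause of the Leibniz sentence.
program = [("NEW",), ("PUSHCUR", "leibniz"), ("WRAPL", "TookUp"),
           ("SETR", "the question"), ("WRAPL", "In"),
           ("SETR", "his baccalaureate thesis")]
print(interpret(program))
# [{'name': 'In',
#   'left': {'name': 'TookUp', 'left': 'leibniz', 'right': 'the question'},
#   'right': 'his baccalaureate thesis'}]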

This is a rough draft of an instruction set that a system could output and that can generate a sequence of recursive predicates. Some more instructions detailing scoping would be useful, for example to capture yet(A,B) where A and B are sequences of predicates; ideally, these connectives would pairwise connect the members of the sequences from each scope. Other modifications might be necessary after looking over more sentences. The noun phrases could be further structured; for example, the last portion could be

SETR universal ideas
IPWRAPL Regarding
SETR eternal natures
IPWRAPL Of
SETR individuals

The system should be able to tell from the knowledgebase whether a reifiable substructure, e.g. “Of(eternal natures,individuals)”, is itself a composite noun or a relatable entity.

The use of a stack is a preliminary approach; other example sentences indicate that the data structure(s) for noun handling are more complex, and a set of operations might be required that uses the last-used noun instead of pushing nouns onto a stack. It's also possible that scoping is related, with data structures used per scope. Noun handling in the sequential assembly of recursive parse trees is an interesting area.

I'm thinking about connectives such as “which would” and “yet”. Both are easily representable, and it might be advantageous to do so during NLP because correctly handling connectives like “yet”, “but” and “however”, which illustrate a semantic contrast of some sort between sequences of predicates, is one distinction between NLG and AI, or between first- and second-generation NLU.

Sunday, April 29, 2007

Machine Translation, Machine Reasoning

Using a numerical interlingua, paraphrase generating systems that use the same ontology might have interesting applications for machine translation.

Also, it shouldn't be terribly difficult to export the recursive predicates into CycL, KIF or RDF for machine reasoning applications. This format can represent its own rule system as well, which may allow implicit knowledge to be obtained via machine reading.
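
As a rough illustration of that export, nested predicates could be flattened into triples by assigning an identifier to each embedded predicate (a simple form of reification). This is only a sketch with an invented blank-node scheme, not actual CycL, KIF or RDF syntax.

import itertools

_ids = itertools.count(1)

def flatten(pred, triples):
    # assign a blank-node id to this predicate and emit one triple per slot
    name, left, right = pred
    node = f"_:p{next(_ids)}"
    triples.append((node, "predicate", name))
    for role, arg in (("left", left), ("right", right)):
        if isinstance(arg, tuple):
            arg = flatten(arg, triples)   # reify the embedded predicate
        triples.append((node, role, arg))
    return node

triples = []
flatten(("InOrderTo",
         ("Did", "tommy", ("PutOn", "book", "shelf")),
         ("Help", "tommy", "library")), triples)
for t in triples:
    print(t)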

Algorithm Design

After WikiCorpus is up and running, I've a few hypotheses to test on the dataset. The first is the use of finite state automata in the sequential processing hypothesis. The FSM might require one or more stacks or cursors. This would take a preprocessed sentence (via NL tools) as input and output a sequence of treebuilding instructions. Some substrings would have to be mapped to predicate candidates.
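
A skeletal sketch of that idea, with invented states and tags (this isn't a real grammar, just the shape of an FSM that emits treebuilding instructions from preprocessed tokens):

def tokens_to_instructions(tagged_tokens):
    # maps (token, tag) pairs to treebuilding instructions; the states and
    # transitions below are illustrative placeholders only
    instructions = [("NEW",)]
    state = "EXPECT_SUBJECT"
    for word, tag in tagged_tokens:
        if state == "EXPECT_SUBJECT" and tag == "NOUN":
            instructions.append(("PUSHCUR", word))
            state = "EXPECT_VERB"
        elif state == "EXPECT_VERB" and tag == "VERB":
            instructions.append(("WRAPL", word.capitalize()))
            state = "EXPECT_OBJECT"
        elif state == "EXPECT_OBJECT" and tag == "NOUN":
            instructions.append(("SETR", word))
            state = "EXPECT_VERB"          # allow further wrapping, e.g. prepositions
        elif tag == "PREP":
            instructions.append(("WRAPL", word.capitalize()))
            state = "EXPECT_OBJECT"
        # further states, stack handling and predicate-candidate mapping
        # would be needed for real sentences
    return instructions

print(tokens_to_instructions([("leibniz", "NOUN"), ("took up", "VERB"),
                              ("the question", "NOUN"), ("in", "PREP"),
                              ("his baccalaureate thesis", "NOUN")]))
# [('NEW',), ('PUSHCUR', 'leibniz'), ('WRAPL', 'Took up'), ('SETR', 'the question'),
#  ('WRAPL', 'In'), ('SETR', 'his baccalaureate thesis')]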

This first theory is to transform a sentence into an alphabet of treebuilding instructions that may have its own grammar (a universal one?). What I find interesting are the permutations on the predicates, at the logic-based or knowledge-based level, that allow transitions between noun-order paraphrases, the relationship between those permutations and the treebuilding alphabet, and the relation of both to the sentence-level paraphrases.

Rephrased, each sentence is a sequence of words that can be transformed (FSM, HMM, ?) into one or more sequences of treebuilding instructions, and the resulting sequence of predicates (or set of sequence candidates) can be permuted for noun-order paraphrases using the fact that a predicate can map to another with its arguments inverted. It's possible that permutations on the tree can be mapped to transformations on the treebuilding instruction sequence, and that this can map back to sentence(s). This level of natural language understanding could also be called a paraphrase generator (a less than exciting name for some rather complicated AI).

Thus, it's theoretically possible that the same system that turns sentences into sequences of predicates can be of use in natural language generation.

Thursday, April 26, 2007

Machine Reading, Paraphrases

A rule system may allow semantic subtrees to be mapped to one another. For example, the paraphrase:

Tommy put the book on the shelf to be helpful.
InOrderTo(Did(tommy,PutOn(book,shelf)),Is(tommy,helpful))

The rule Is(X,helpful) → Help(X,_) might express that if something is helpful then that something helps, or helped, some thing or things. Formulating rules in this manner might transcend, in part, the hermeneutic circles that occur when relating lexical elements to one another.
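
A minimal sketch of applying such a rule mechanically; the rule format and the naive single-letter variable matching are my own assumptions:

# Rewrite predicates using rules like Is(X, helpful) -> Help(X, _).
# Predicates are (name, arg1, arg2) tuples; uppercase single letters are variables.
RULES = [(("Is", "X", "helpful"), ("Help", "X", "_"))]

def apply_rules(pred):
    derived = []
    for lhs, rhs in RULES:
        bindings = {}
        if lhs[0] == pred[0] and all(
                l == a or (l.isupper() and len(l) == 1 and bindings.setdefault(l, a) == a)
                for l, a in zip(lhs[1:], pred[1:])):
            derived.append(tuple(bindings.get(t, t) for t in rhs))
    return derived

print(apply_rules(("Is", "tommy", "helpful")))   # [('Help', 'tommy', '_')]
print(apply_rules(("Is", "tommy", "tall")))      # []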

Hermeneutic circle refers to the semantic interconnectedness of a set of entities; for example, the definitions in a dictionary refer to other words in that dictionary. Philosophers from Schleiermacher and Dilthey through Heidegger and Gadamer considered this phenomenon. Wittgenstein remarked, regarding this, that “light dawns gradually over the whole”.

A rule-system lexicon might further assist in establishing the semantic equivalence of paraphrases, allowing semantic substructures to map to one another based on the lexical hermeneutic circle.

Like most things in AI, machine reading is more easily described than programmed. Any system that can equate noun-order paraphrases to the same set of predicates (or permutable equivalents) would be a milestone in my opinion.

Algorithm Design

I'm collecting important sentences from linguistics as parsing examples to look at for algorithm design in machine reading.

“The girl whose car is blocking my view of the tree that I planted last year is my friend.”

This sentence is from a psycholinguistics article [1] and illustrates recursion or “the use of relative pronouns to refer back to earlier parts of a sentence.”

1) IsFriendOf(girl,I)
2) Possesses(girl,car)
3) Obscuring(car,ViewOf(I,tree))
4) On(Planted(I,tree),last year)

Each of these is a simpler sentence. The predicates' arguments are bound identically, which motivates the use of URIs or integers rather than strings: the goal is to place these four into a knowledgebase where they can be used with other knowledge, and to retrieve them and reassemble the sentence (sentence aggregation [2]) or sentences as needed.
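
One way to make that binding explicit is to intern each noun as an integer ID (a stand-in for a URI) before storing the predicates, so that “girl”, “I” and “tree” denote the same entities wherever they appear. A minimal sketch:

# Intern nouns as integer IDs so predicate arguments share bindings.
entity_ids = {}

def eid(noun):
    # assign a stable integer to each distinct noun (a stand-in for a URI)
    return entity_ids.setdefault(noun, len(entity_ids) + 1)

facts = [
    ("IsFriendOf", eid("girl"), eid("I")),
    ("Possesses",  eid("girl"), eid("car")),
    ("Obscuring",  eid("car"),  ("ViewOf", eid("I"), eid("tree"))),
    ("On",         ("Planted", eid("I"), eid("tree")), eid("last year")),
]
print(entity_ids)   # {'girl': 1, 'I': 2, 'car': 3, 'tree': 4, 'last year': 5}
print(facts[1])     # ('Possesses', 1, 3)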

Simulating the process of accumulating these predicates when processing the sentence in left to right order, or sequentially:

P1(girl, A2)

P1(girl, A2)
Possesses(girl,A3)

P1(girl, A2)
Possesses(girl,car)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,A4)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,A5))

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
P2(I,tree)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
Planted(I,tree)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
On(Planted(I,tree),last year)

IsFriendOf(girl, I)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
On(Planted(I,tree),last year)
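
The bookkeeping in that trace can be simulated with placeholder arguments that are bound as later words resolve them. A minimal sketch, assuming the predicate and binding decisions themselves are given by hand:

def bind(slot, value, preds):
    # replace an unbound placeholder wherever it occurs, including nested predicates
    def subst(term):
        if term == slot:
            return value
        if isinstance(term, list):
            return [subst(t) for t in term]
        return term
    return [subst(p) for p in preds]

preds = [["P1", "girl", "A2"]]                     # "The girl"
preds.append(["Possesses", "girl", "A3"])          # "whose"
preds = bind("A3", "car", preds)                   # "car"
preds.append(["Obscuring", "car", "A4"])           # "is blocking"
preds = bind("A4", ["ViewOf", "I", "A5"], preds)   # "my view of"
preds = bind("A5", "tree", preds)                  # "the tree"
preds.append(["Planted", "I", "tree"])             # "that I planted" (P2 resolved)
preds[-1] = ["On", preds[-1], "last year"]         # "last year"
preds[0] = ["IsFriendOf", "girl", "I"]             # "is my friend" (P1 and A2 resolved)
for p in preds:
    print(p)   # matches the final state of the trace above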

Another pair of important sentences is:
1) “Fred saw the plane flying over Zurich.”
2) “Fred saw the mountains flying over Zurich.”

1a) Saw(fred,FlyingOver(plane,zurich))
2) While(Saw(fred,mountains),FlyingOver(fred,zurich))

Processing the two sentences side by side, sequentially:

   Sentence 1                               Sentence 2
1. P1(fred,A2)                              P1(fred,A2)
2. Saw(fred,A2)                             Saw(fred,A2)
3. Saw(fred,P2(plane,A3))                   Saw(fred,mountains)
4. Saw(fred,FlyingOver(plane,A3))           While(Saw(fred,mountains),FlyingOver(fred,A3))
5. Saw(fred,FlyingOver(plane,zurich))       While(Saw(fred,mountains),FlyingOver(fred,zurich))


Looking at step 4, where “flying” is incorporated, and assuming sequential processing, it appears that both hypotheses should be kept by an algorithm at that step. These sentences are an argument for knowledge-based processing and lexical data: it does appear that properties of “mountains” and “plane” can distinguish between hypothesized semantic parses. However, a word like “birds” could fit either parse structure, or both simultaneously, depending on the context, for example seeing birds from a plane. Theoretically, knowledge-based, statistical and context-based methodologies can all help discern between parse candidates.
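
A minimal sketch of that disambiguation step, with a toy stand-in for the knowledgebase (the CAN_FLY set and the hypothesis ordering are purely illustrative):

# Pick between the two parse hypotheses using a toy lexical knowledgebase.
CAN_FLY = {"plane", "bird"}          # stand-in for real lexical/world knowledge

def hypotheses(seer, seen):
    return [
        ("Saw", seer, ("FlyingOver", seen, "zurich")),                   # the seen thing flies
        ("While", ("Saw", seer, seen), ("FlyingOver", seer, "zurich")),  # the seer flies
    ]

def rank(seer, seen):
    h_object_flying, h_seer_flying = hypotheses(seer, seen)
    if seen in CAN_FLY:
        return [h_object_flying, h_seer_flying]   # keep both, object-flying preferred
    return [h_seer_flying]                        # mountains don't fly: drop hypothesis 1

print(rank("fred", "plane"))
print(rank("fred", "mountains"))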

Another possible representation of those sentences:
1b) While(Saw(fred,plane), FlyingOver(plane,zurich))
2) While(Saw(fred,mountains),FlyingOver(fred,zurich))

This representation makes clear that the difference lies in the binding of the first argument of FlyingOver; the side-by-side processing of the two would otherwise be equivalent. I'll have to look at more sentences to determine whether 1a or 1b is more useful, or whether they are equivalent via a rule system. The representation is important in discerning the algorithm, and I'm hopeful a corpus will aid in this area.

The sequential processing hypothesis is based on the proof-of-concept manner in which people read sequentially; however, machines need not process text in the same manner. Additionally, even within the sequential processing hypothesis there are possibilities; for example, the text processor could be one or more words ahead of the predicate generator.

Other hypotheses include structural processing, where the semantic tree is generated in a top-down or bottom-up manner based on data, patterns and substructural patterns collected and discerned from a corpus. This information can help determine a parse structure when, for a given usage of language, one parse structure is extremely rare and the other commonplace.

[1] Psycholinguistics, Wikipedia
[2] Natural Language Generation, Wikipedia

Wednesday, April 25, 2007

Semantic Parsing

A sentence that parses to a tree structure with all the arguments of each predicate filled can be said to be semantically complete. However, with a tree structure of nested predicates, it's possible that some sentences, even grammatically correct ones, will parse to structures having blank arguments. This can be explained by the fact that contextual information may already have been delivered earlier in a document, so the author can exercise brevity. Such brevity may be more commonplace in informal spoken language.

Example:
Tommy put the book on the shelf in order to help.
InOrderTo(Did(tommy,PutOn(book,shelf)),Help(tommy,_))

The context of a previous sentence in a document may make clear that the setting is a library and thus the person writing or speaking the above sentence may choose to omit that information, while maintaining grammatical correctness, with a listener being able to discern the semantically complete version.

It's possible when parsing sentences into a semantic tree representation, as above, that some argument slots may be blank. These slots can be filled by either nouns or semantic subtrees and empty slots can help a natural language understanding system to know what is unknown when processing a document.

Some NLU systems may be able to discern ranked candidates for these from previous content or context. Algorithmically, this can be achieved by maintaining a context state during document processing or by utilizing an event-driven knowledge acquisition engine.
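
A minimal sketch of the context-state approach; the way the context is keyed by predicate name here is an assumption purely for illustration:

# Fill blank argument slots from a running context state.
context = {}    # updated as the document is processed

def observe(key, value):
    context[key] = value

def fill_blanks(pred):
    name, left, right = pred
    if isinstance(left, tuple):
        left = fill_blanks(left)
    if isinstance(right, tuple):
        right = fill_blanks(right)
    if right == "_":
        right = context.get(name, "_")   # e.g. Help(tommy,_) -> Help(tommy,library)
    return (name, left, right)

observe("Help", "library")               # an earlier sentence established the setting
incomplete = ("InOrderTo", ("Did", "tommy", ("PutOn", "book", "shelf")),
              ("Help", "tommy", "_"))
print(fill_blanks(incomplete))
# ('InOrderTo', ('Did', 'tommy', ('PutOn', 'book', 'shelf')), ('Help', 'tommy', 'library'))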

If there's anything to this sentence-level semantic abbreviation in natural speech and writing, then NLG systems might be able to utilize it to produce less mechanical sounding text; a sequence of semantically complete sentences might sound formal or verbose.

Saturday, April 21, 2007

Speech Technology, Semantic Web

I'm optimistic about the combination of the following technology I've been reading about:

1) Speech to text
2) Machine reading (NLP)
3) Knowledgebase / web
4) Machine writing (NLG)
5) Text to speech

Maybe someday people will be able to talk to their computers to add to and access collective encyclopedic knowledge.

Reification, Ontological Consensus and Ergonomics

Ontological consensus is the goal of the WikiCorpus project and is, in theory, possible via the on screen predictive list of predicates. Users would rather select predicates (that work for the semantics of the sentence) from the list than create new ones, just as users prefer to find the results of a search on the first page of a search engine's results. This list, hopefully produced by an accurate predictive algorithm, will facilitate consistency in the dataset.
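
A minimal sketch of how such a predictive list might be ranked, using nothing more than co-occurrence counts between sentence words and previously chosen predicates (the recorded history below is made up):

# Rank candidate predicates for the on-screen list by word/predicate co-occurrence.
from collections import Counter, defaultdict

counts = defaultdict(Counter)         # word -> Counter of predicates chosen with it

def record(sentence_words, chosen_predicate):
    for w in sentence_words:
        counts[w][chosen_predicate] += 1

def suggest(sentence_words, k=5):
    scores = Counter()
    for w in sentence_words:
        scores.update(counts[w])
    return [pred for pred, _ in scores.most_common(k)]

# Illustrative history, then a prediction for a new sentence.
record(["tommy", "put", "book", "shelf"], "PutOn")
record(["tommy", "put", "book", "shelf"], "Did")
record(["sue", "placed", "vase", "table"], "PutOn")
print(suggest(["the", "boy", "put", "the", "cup", "on", "the", "shelf"]))
# ['PutOn', 'Did']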

In the user interface, the ergonomics of nested predicates would be that a completed predicate can be dragged (moved or copied) into the argument slot of a predicate being formed. The problem is, firstly, that moving and copying the predicate are both done with the drag-and-drop motion. Secondly, the nesting of predicates might detract from semantic and ontological consensus, as complicated constructions are possible from the elements in the predictive list.

Take for example:
1) The book was put on the shelf.
2) Tommy put the book on the shelf.
3) Tommy put the book on the shelf to help the library.

Each is a semantic superset of the previous sentence, duplicating and nesting the previous predicate.

1) PutOn(book,shelf)
2) Did(tommy,PutOn(book,shelf))
3) InOrderTo(Did(tommy,PutOn(book,shelf)),Help(tommy,library))

That is, the first sentence is the first predicate, the second is both the first and second, and the third is all three. Two terms related to this method of representing sentences are that sentences have a semantic core and a semantic root (I'll try to find out if other terminology already exists). The core is the predicate the sentence is constructed from, often the most deeply nested predicate, and the root is the least nested one, the root of the corresponding tree structure.

So, while recursive binary predicates appear to be able to capture natural language, the interface considerations and ergonomics are more complicated. Also, finding the semantic core of sentences appears to relate to placing the sentence into its semantic frame via the PropBank or FrameNet corpora; the algorithm would then construct the entirety from there. These nested predicates can be represented in the matrix format with some notational conventions.
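
A minimal sketch of picking out the semantic root and core from a nested predicate, using the terms as defined above (the tuple representation and the choice to follow the first nested argument are my assumptions):

# The root is the outermost predicate, the core the most deeply nested one.
def semantic_core(pred):
    name, left, right = pred
    nested = [a for a in (left, right) if isinstance(a, tuple)]
    if not nested:
        return pred
    # follow the (here assumed unique) nested argument downward
    return semantic_core(nested[0])

sentence3 = ("InOrderTo", ("Did", "tommy", ("PutOn", "book", "shelf")),
             ("Help", "tommy", "library"))
print(sentence3[0])                 # root: InOrderTo
print(semantic_core(sentence3))     # core: ('PutOn', 'book', 'shelf')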

Interestingly, similar to the reordering of nouns in the matrix format, the reordering of the predicate arguments or nouns can be done, resembling:

InOrderTo(DoneBy(PutOn(book,shelf),tommy),Help(tommy,library))
The book was put on the shelf by Tommy to help the library.

InOrderToInverse(Help(tommy,library),DoneBy(PutOn(book,shelf),tommy))
In order to help the library, the book was put on the shelf by Tommy.
In order for Tommy to help the library, the book was put on the shelf by him.

InOrderToInverse(Help(tommy,library),Did(tommy,PutOn(book,shelf)))
In order for Tommy to help the library, he put the book on the shelf.
To help the library, Tommy put the book on the shelf.

So, each binary predicate may be related to another predicate with its arguments in the opposite order. Some predicates may not be, limiting the possible noun orderings for paraphrases. This is just one approach to capturing semantics using nested predicates; I look forward to learning other approaches and designing a web interface for a collaborative corpus.
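
A minimal sketch of that inversion, using a lookup table of predicates and their argument-inverted counterparts drawn from the examples above:

# Map a nested predicate to a noun-order paraphrase candidate by
# swapping predicates for their argument-inverted counterparts.
INVERSES = {"Did": "DoneBy", "InOrderTo": "InOrderToInverse"}

def invert(pred):
    # return the predicate with its name replaced and its arguments swapped,
    # or None if no inverse is known
    name, left, right = pred
    if name not in INVERSES:
        return None
    return (INVERSES[name], right, left)

p = ("InOrderTo", ("Did", "tommy", ("PutOn", "book", "shelf")),
     ("Help", "tommy", "library"))
print(invert(p))
# ('InOrderToInverse', ('Help', 'tommy', 'library'),
#  ('Did', 'tommy', ('PutOn', 'book', 'shelf')))
print(invert(p[1]))   # ('DoneBy', ('PutOn', 'book', 'shelf'), 'tommy')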

Thursday, April 19, 2007

Knowledge Representation, Paraphrases

In machine reading, a goal is for all paraphrases to be processed into the same set of predicates or some other form of representing semantics. One method I found to capture this feature is to represent a sentence as a block matrix of the pairwise binary predicates between nouns. This utilizes the fact that n-ary predicates can be decomposed into a set of binary predicates.

The rows and columns are in the order of the nouns occurring in the sentence, and the entry in the i-th row and j-th column is the predicate(s) that relate(s) the i-th and j-th nouns (the nouns themselves sit on the main diagonal). Using this, permutations can change the noun ordering (one variety of paraphrase) while the semantics are preserved.

Example:
1) Tommy went to the store to get a crescent wrench. <Tommy,store,wrench>
2) To get a crescent wrench, Tommy went to the store. <wrench,Tommy,store>
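
A minimal sketch of the matrix for that example and of a permutation that changes the noun order while keeping every pairwise relation (the predicate names are illustrative):

# A sentence as a matrix of pairwise predicates between its nouns.
# Diagonal entries are the nouns; entry (i, j) relates noun i to noun j.
nouns = ["Tommy", "store", "wrench"]
matrix = {
    (0, 0): "Tommy", (1, 1): "store", (2, 2): "wrench",
    (0, 1): "WentTo",        # Tommy went to the store
    (0, 2): "InOrderToGet",  # Tommy (went) to get a wrench
}

def permute(matrix, order):
    # reorder the nouns (one variety of paraphrase) while keeping each pairwise relation
    pos = {old: new for new, old in enumerate(order)}
    return {(pos[i], pos[j]): v for (i, j), v in matrix.items()}

# <wrench, Tommy, store>: "To get a crescent wrench, Tommy went to the store."
print(permute(matrix, [2, 0, 1]))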

From this or another noun-order invariant representation format, the goal of NLG is then to compose grammatically correct sentences containing the semantics with the nouns in the order of the underlying matrix. A reason for robustness in this is that noun ordering is often context dependent, as sentences are composed in paragraphs and documents; both noun ordering and sentence aggregation [1] are overarching processes that are part of “good writing style”.

Using the phases of NLG from the article, the process with this knowledge representation may resemble:

Content determination: Knowledge is obtained from a knowledgebase or web.
Discourse planning: Knowledge is moved into a matrix format.
Sentence aggregation: Block diagonalization and utilization of other patterns discerned from a corpus of well-written articles.
Lexicalisation: Putting words to the concepts.
Referring expression generation: Linking words in the sentences by introducing pronouns and other types of means of reference.
Syntactic and morphological realisation: Permutations are applied as per patterns discerned from well-written articles; each sentence is realized with the nouns in the best order.
Orthographic realisation: Matters like casing, punctuation, and formatting are resolved.


[1] Natural Language Generation, Wikipedia article

WikiCorpus, Natural Language Processing

I'm tentatively calling the website the 'WikiCorpus Project'. I'm hoping to make the interface as easy to use as possible, so that entering read knowledge is a rapid and ergonomic process. Interestingly, the design of the interface and the algorithms for its ergonomics resemble the process of natural language processing.

There will be an on screen list of predicates offered to the user for each sentence. Hopefully, this list will be accurately predicted so that the user does not have to search for or create a new predicate. This predictive list will likely improve over time, as the corpus is populated, using patterns in the sentence (improved by word sense discernment, see: WordNet) and context. This (now theoretical) algorithm and its statistics-based dataset (both to be freely downloadable) might be of use as an algorithm component to some approaches.

Pronouns and other indirect entity references will have a context menu where the user can select which entity the reference refers to. Again, ideally the correct referent is first in the list offered, and the goal is to minimize the probability of the user having to select 'Other' and choose from a comprehensive list of the entities in the document.

So where NLP aims to resolve these completely, the ergonomics of this project aim to present an ever-improving list of options to a user entering the knowledge they are reading. The ergonomics say that the predicate they are going to use next should be on screen (predicted) and that pronoun resolution should offer the correct candidate (hopefully first) in a drop-down list.

The ergonomics then is an “easier” problem resembling information retrieval and predictive search that hopefully can be of use to some algorithmic approaches to the “less easy” task of mechanically and accurately doing the entire process (reading) on sentences in documents.

Monday, April 16, 2007

SRL Techniques (1 of 2)

I'm reading about approaches to SRL that include linearly interpolated relative frequency models [1], HMM's [2], SVM's [3], decision trees [4], and log-linear models [5]. I'm hoping to learn which are most compatible with resolving a sentence's syntax to an exhaustive set of predicates (hopefully capturing all of a sentence's semantics) with simple nouns or noun phrases as arguments.

As a rule, the more abstract roles have been proposed by linguists, who are more concerned with explaining generalizations across verbs in the syntactic realization of their arguments, while the more specific roles are more often proposed by computer scientists, who are more concerned with the details of the realization of the arguments of specific verbs. [1]

[1] Daniel Gildea, Daniel Jurafsky. Automatic Labeling of Semantic Roles. In Computational Linguistics (2002), 28(3):245–288

[2] Freitag, D., McCallum, A.: Information extraction with HMM structures learned by stochastic optimization. In: Proceedings of AAAI. (2000) 584–589

[3] Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James Martin, and Dan Jurafsky. 2004. Shallow semantic parsing using support vector machines. In Proceedings of HLT/NAACL-2004.

[4] Mihai Surdeanu, Sanda Harabagiu, John Williams, and Paul Aarseth. 2003. Using predicate-argument structures for information extraction. In Proceedings of ACL-2003.

[5] Nianwen Xue and Martha Palmer. 2004. Calibrating features for semantic role labeling. In Proceedings of EMNLP-2004.

Sunday, April 15, 2007

Website Design

For this website for collaborative generation of a linguistic corpus and ontology, I'm thinking about a drag and drop based UI, where users can work with sentences and, using gestures, enter the knowledge obtained from reading them.

I downloaded phalanger for PHP integration into Visual Studio and .NET (they also have one for Mono). I'm looking at prototype and mootools for the javascript library. Also found this extension to mediawiki that adds web services to wiki (SOAP, WSDL) — I'll probably code some web services into it.

I'm thinking that the site will show articles with entities (nouns) as draggable objects that can be manipulated in the environment to allow users to easily provide the knowledge they obtain from reading (sentence by sentence). This will include anaphoric resolution and semantic relations between entities. By processing entire articles, context will be obtainable from the dataset. Hopefully the interface will be intuitive enough to encourage users to provide an ever-improving downloadable resource comparable to penn treebank and redwoods.

Saturday, April 14, 2007

Wikipedia, Semantic Web

After I make the web-deliverable interface that allows users to expand these semantic frames into all possible related entities, I refactor some SRL parsers to turn natural language into this semantic data and I point the system at Wikipedia... weeks later, what does this enormous OWL file (possibly stored numerically) have to do with the Semantic Web?

A quote from Cycorp:

The success of the Semantic Web hinges on solving two key problems: (1) enabling novice users to create semantic markup easily, and (2) developing tools that can harvest the semantically rich but ontologically inconsistent web that will result. To solve the first problem, it is important that any novice be able to author a web page effortlessly, with full semantic markup, using any ontology he understands. The Semantic Web must allow novices to construct their own individual or specialized-local ontologies, without imposing the need for them to learn about or integrate with an overarching, globally consistent, master ontology.

Allowing users to type in natural language is the easiest way to generate semantic markup. Because users prefer to use natural language, any ontology that software can round-trip natural language with will likely become an overarching (prevalent) one. A possible problem with the approach I'm using is that the ontology of the dataset used to train SRL-based parsers would, instead of being handcrafted by an expert, be a collaborative effort of the people, hopefully experts, visiting a site; wikiontology is a relatively new idea.

Only after I have the dataset will I be able to say if the ontology from the planned website is advantageous to machine reasoning tasks. It shouldn't be terribly difficult to make a benchmark for the consistency of the wiki-generated ontology — possibly using natural language (after the parser is completed). For example, paraphrase corpora and other instruments could be of use in both generating and refining.

Wikipedia is a proof of concept that people can come together to generate collective knowledge resources, so — if we get the post-NLP/pre-NLG ontology right (prevalent as argued above) — the Semantic Web may resemble a distributed wiki-knowledgebase. The gigabytes of Wikipedia data would be a launching point.

WikiOntology Idea

I'm envisioning a web-deliverable system for the PropBank and FrameNet data that allows users to modify or amend relations between the arguments, for purposes of getting as many of these binary subpredicates into place as possible. Considering there are thousands, some sort of wiki technology might be useful, allowing users to download the up-to-date dataset. Basically, there would be a UI that indicated the arguments for each n-ary predicate or frame and would allow users to amend relationship types between them. Users would be able to use URIs from other domains or create new ones on the project's domain.

Given the sentence "The boy went to the store on his bike.", we might obtain from some parsers a predicate resembling "went(the boy, to the store on his bike)". The idea is to allow user input to get a dataset matching that sentence to a set of predicates also including "utilized(the boy, his bike, [to go] to the store)", "owns(the boy, [a] bike)". Some sort of XML output would capture this and a UI allowing pairwise connections between noun arguments would allow users to add all the knowledge contained in a sentence. Ideally, these predicates would be further simplied until only simple nouns were interrelated. The idea is to use wiki-collaboration to create a dataset for SRL algorithms.

Code Storage, Collaboration

If anybody visiting knows a place where I can upload code to and link to from here, please feel free to comment (on any blog entry, for that matter). I might end up using an open source repository; the only problem is I'm not sure whether the EULAs of some of the jars or IKVM'd dll's (etc.) allow that. As for the gigabytes of Wikipedia data after I run some tools on it, I might have to torrent that. Feel free to contact me or comment here if any of these projects sound interesting.

Also, if you know of any downloadable SRL projects or tools please let me know. [Update: found swirl and salsa]

Integer-Based Representation

I'm working with integer-based or numerical representations in triples and quadruples. Basically, this approach uses bitfields instead of URIs. I designed it to be computationally faster than string-based formats, and the bitfields allow differentiation between tense negation, logical negation, set complementing and other operations on entities and relation types. I'm working on some tools to convert XML-based data to and from this format.
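
A minimal sketch of what a bitfield-flagged integer triple could look like; the particular bit layout and flag names are invented for illustration:

# An integer triple with flag bits on the relation, e.g. logical negation or tense.
# Layout (illustrative): low 24 bits = relation id, higher bits = flags.
LOGICAL_NEGATION = 1 << 24
TENSE_PAST       = 1 << 25

def make_triple(subject_id, relation_id, object_id, flags=0):
    return (subject_id, relation_id | flags, object_id)

def relation_of(triple):
    return triple[1] & ((1 << 24) - 1)

PUT_ON = 42
t = make_triple(7, PUT_ON, 13, flags=TENSE_PAST)   # book(7) was put on shelf(13)
n = make_triple(7, PUT_ON, 13, flags=TENSE_PAST | LOGICAL_NEGATION)
print(t, relation_of(t))            # (7, 33554474, 13) 42
print(n, relation_of(n) == PUT_ON)  # flags don't disturb the relation id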

Also, n-ary predicates can be converted to a set of triples (or quadruples) by combinatorially relating the n arguments pairwise using binary subpredicates. This could be of use in converting sentence predicates into a triples-based language.

A rule system will probably be required to capture overlap between different predicates' subpredicates, which is unfortunate as there are upwards of 4,527 frames in FrameNet and 3,635 in PropBank. This pairwise relating of the syntactic elements in n-ary predicates or frames should result in more useful structured knowledge. This is a first attempt at a post-SRL/pre-NLG format.

Friday, April 13, 2007

Ontology and File Compression

Thinking on this post-SRL/pre-NLG ontology made me realize there was no easy way to compare different models.

If we look at rule systems, ontology and taxonomy as interoperating towards efficiently storing knowledge, then there might be a metric. That is, if system A compresses the same knowledge set better than system B and is more computationally efficient (in decompression/utilization), then we can say that system A is superior to system B (on that set) without resorting to aesthetics or philosophy. We have SUMO, the CYC upper ontology, ISO 15926, and others designed around real-world data and it's difficult to rank them.

The metaphor of file compression applied to knowledgebases might allow competition between differing methods. As systems are envisioned that mechanically generate rules, ontological structure or taxonomy (optimizing generators that create a system for a given knowledgebase), these metrics may be of use in comparing the resulting generated systems. Personally, I think it would be interesting to have algorithms that compress knowledgebases the way tar, zip and 7zip do files. Unfortunately, this approach is storage- and speed-based and doesn't consider interface considerations, for example sets of things that are categorized for navigation.

Here's a link to a paper describing a relationship between AI (my field of research) and file compression. Apparently, there's a prize for compressing Wikipedia.

Semantic Role Labeling

Semantic Role Labeling (SRL) appears to be the algorithmically independent term for parsing sentences into structures like PropBank or FrameNet.

I'm investigating an ontology to store the SRL'd structure that can additionally be used for NLG. I'm looking at loom, kpml, cypher and others to see if there's any overlap. It'd be “easier” to code if there were one format to and from natural language. I'm not sure whether this SRL'd/pre-NLG format would work well for machine reasoning (which is ideally what would be stored in a db).

[1] CoNLL-2005 Shared Task: Semantic Role Labeling

[2] CCG: Semantic Role Labeling Demo

Head-Driven Phrase Structure Grammar

I'm thinking that the PropBank style of parsing (Redwoods treebank) is more readily converted to structured knowledge than part-of-speech tagging. However, looking over the initial results of some software, it appears that some style of recursion would be of use in capturing all the structure (and substructure) of a sentence. Some arguments to the main predicate appear to have discernible structure remaining; if an argument to the predicate could itself be a predicate, then this would capture as much structure as possible. I'm also noticing the inability of this style of parser to capture parallel predicates in some sentences that use logical connectives. Different levels in the recursion should be able to reuse arguments from across the sentence. I may have to code up a prototype to obtain as much semantic structure as possible, possibly outputting a set of these parses that capture it in parallel.

I'm also looking into initializing this style of parser with the POS style to discern the main verb and then the remainder in order. After entity recognition and string concatenation of multiword nouns, parsers seem to function more accurately. I'm going to look at the parse trees for verb hierarchy and SBAR information to construct recursive predicate structures, utilize the NP information to bootstrap entity recognition, and post here about which code appears “easiest” to build from.

[1] Miyao, Y. and Tsujii, J. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. In Proceedings of the 20th international Conference on Computational Linguistics (Geneva, Switzerland, August 23 - 27, 2004). International Conference On Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, 1392.

[2] Towards Parsing Unrestricted Text into PropBank Predicate-Argument Structures

Thursday, April 12, 2007

Natural Language and Semantic Web

Been browsing on the Web for various tools in natural language processing (NLP) and natural language generation (NLG). Presently looking at cypher, heart of gold, gate, stanford parser, charniak parser, enju, opennlp, lkb, assert, halogen, kpml and a couple of others. Also looking at a couple of methods of storing structured knowledge from a parse including Berkeley's FrameNet ontology and some that came with halogen and kpml. Hoping to throw these tools at the Wikipedia dataset (9.74 gb) and post some results.