Tuesday, July 3, 2007

Dynamo

Been working on a new module that generates algorithms for mapping sequences to sequences. I'm coding it with C# generics for reusability. Basically, the module will generate memory-resident assemblies which map sequences from one domain to sequences from another. Kind of a 'black-box generator' (it's a superset of FSM generation). The use I'm planning is mapping parse trees to predicate-building sequences and/or quadruple-set sequence-building sequences. The code generation was fairly easy, but I'm working on the optimization algorithms now. Ideally, the generated assemblies will be fast enough for production environments.
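As a rough sketch of the shape of this (the interface and method names below are placeholders of mine, not the module's actual API):

// Sketch only: a generic contract for the generated mappers and their generator.
using System.Collections.Generic;

public interface ISequenceMapper<TIn, TOut>
{
    // Map a sequence from the input domain to a sequence in the output domain.
    IEnumerable<TOut> Map(IEnumerable<TIn> input);
}

public interface ISequenceMapperGenerator<TIn, TOut>
{
    // Generate a mapper (e.g., emitted into a memory-resident assembly) from example pairs.
    ISequenceMapper<TIn, TOut> Generate(
        IEnumerable<KeyValuePair<IList<TIn>, IList<TOut>>> examplePairs);
}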

The WikiCorpus project is also nearing completion; the adaptive AI algorithms have been bringing more coding joy, so I'll probably post a draft of that first.

Tuesday, May 29, 2007

Calculus of Language

I'm tentatively calling the new syntax and related logic a “calculus of language” as it has the properties of a predicate calculus and appears to represent natural language well. Syntactically, it's a superset of predicate calculus. I'm writing up a draft covering the formal logic, the grammars, and heuristics for the sequential parsing of natural language, and I hope to publish it soon. I'll comment on any developments as they occur; this technique appears to be working on arbitrary natural language sentences. However, the output is strings, meaning that getting to the exact semantics (URIs/integers) appears to be a second-step algorithm.

Sunday, May 27, 2007

New Predicate Syntax

While working on the parser, I developed a new syntax that appears to be more readily parsed into (from natural language), whereas the syntax described previously appears to be more readily generated from (to natural language). There's a simple heuristic for converting between the two; the new syntax and grammar are a superset of the old syntax and grammar.

In the new syntax there are notational varieties of predicates. These notational conventions appear to reduce the instruction set for building the predicates sequentially and to reduce the context information required during the processing of a sentence, making the heuristic simpler.

Tuesday, May 22, 2007

PCFG and Adaptive FSM

I'm having some success with a first approach to machine reading. The algorithm is to convert a PCFG parse into a sequence and map this to a sequence of predicate assembly instructions using an adaptive FSM. The adaptive FSM algorithm basically maps sequences from one domain to another and utilizes patterns in the mappings of subsequences to optimize memory usage. This optimization relates to generalization, in which patterns can be induced from training examples. I'll post further results as they occur; basically, the algorithm is an online parser (word-by-word) that converts sentences to predicates, though the training data is structural (sentence-by-sentence).
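For a sense of the data structure involved, here's a minimal sketch of a trie-backed transducer trained from aligned sequence pairs; the real adaptive FSM also merges shared subsequence patterns to generalize and to save memory, which this sketch does not attempt, and all of the names here are mine:

// Sketch: each input token advances one state and emits the output tokens recorded
// for that transition; training assumes token-aligned pairs, first example wins.
using System.Collections.Generic;

public class TransducerState<TIn, TOut>
{
    public Dictionary<TIn, TransducerState<TIn, TOut>> Next = new Dictionary<TIn, TransducerState<TIn, TOut>>();
    public Dictionary<TIn, List<TOut>> Emit = new Dictionary<TIn, List<TOut>>();
}

public class TrieTransducer<TIn, TOut>
{
    private readonly TransducerState<TIn, TOut> root = new TransducerState<TIn, TOut>();

    // Record one aligned training pair: alignedOutput[i] is emitted for input[i].
    public void Train(IList<TIn> input, IList<IList<TOut>> alignedOutput)
    {
        var state = root;
        for (int i = 0; i < input.Count; i++)
        {
            if (!state.Next.ContainsKey(input[i]))
            {
                state.Next[input[i]] = new TransducerState<TIn, TOut>();
                state.Emit[input[i]] = new List<TOut>(alignedOutput[i]);
            }
            state = state.Next[input[i]];
        }
    }

    // Map an unseen input sequence, emitting outputs for the transitions it shares with training data.
    public IEnumerable<TOut> Map(IEnumerable<TIn> input)
    {
        var state = root;
        foreach (var token in input)
        {
            if (!state.Next.ContainsKey(token)) yield break;  // unknown subsequence: no generalization here
            foreach (var o in state.Emit[token]) yield return o;
            state = state.Next[token];
        }
    }
}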

Sunday, May 13, 2007

Preprocessing and Predicate Patterns

I found an example sentence that shows a promising relationship between the preprocessing and patterns in the predicates. That is, a correct preprocessing can indicate patterns that an incorrect chunking does not.

For example,

{the scientific method}[Seeks+][To+][Explain]{the complexities}[Of]{nature}[In]{a replicable way}<comma><and>[To+][Use]{these explanations}[To+][Make]{useful predictions}

H(B+C+D(A,F(E,G)),I)
O+P(L+M(A,N),Q)

Notice that B could move across the <comma><and>:

H(B+C+D(A,F(E,G)),I)
O+P(B+L+M(A,N),Q)

However, the preprocessing:

{the scientific method}[Seeks][To+][Explain]{the complexities}[Of]{nature}[In]{a replicable way}<comma><and>[To+][Use]{these explanations}[To][Make]{useful predictions}

B(A,H(C+D(_,F(E,G)),I)&O(L+M(_,N),P(_,Q)))

or

B(A,H(C+D(_,F(E,G)),I))
B(A,O(L+M(_,N),P(_,Q)))

is more illustrative of the patterns indicative of sentences with <comma><and>, namely P(x, y1&y2...). The ampersand is a notational convention that is useful during sequential processing.

The pattern evidenced in the second predicate output, related to the <comma><and>, indicates that the second chunking was more correct than the first. This suggests that the chunking and the choice of predicates are not arbitrary, and that a set may emerge from patterns dependent upon the <> symbols in certain sentences.

I will look for more example sentences showing that patterns in the output can reinforce chunking hypotheses. One area of potential complexity in English is the word 'to'.

Bobby went to the store.
{bobby}[Went][+To]{the store}.
B+C(A,D)
Here the 'to' is in the sense of 'towards'.

Bobby likes to draw.
{bobby}[Likes][To+][Draw].
B(A,C(_,_))
Here the 'to' is in the sense of an infinitive.

Bobby bought crayons to draw pictures.
{bobby}[Bought]{crayons}[To][Draw]{pictures}.
D(B(A,C),E(_,F))
Here the 'to' is in the sense of 'in order to'.

Predicates and chunking appear to be a discernible property of patterns from <> in certain sentences. For example, the argument above is that SeeksToExplain(x,y) is not a predicate while Seeks(x,y) and Explain(x,y) are.

Tuesday, May 8, 2007

Artificial Intelligence

After WikiCorpus and some NL algorithms for machine reading/writing, I'm thinking of moving towards language understanding (NLU) post-processing for sentences like:

“Once the characteristic numbers of most notions are determined, the human race will have a new kind of tool, a tool that will increase the power of the mind much more than optical lenses helped our eyes, a tool that will be as far superior to microscopes or telescopes as reason is to vision.”

— Leibniz


This sentence compares a described tool to microscopes and telescopes, reason to vision, and compares those two comparisons. So, after an initial semantic parsing, a rule system may permute the structured knowledge into a format more conducive to machine reasoning and NLU.

However, this is moving from NL towards artificial intelligence or, at least, machines that can demonstrate intelligent behavior.

user> Birds are to air as fish are to what?
machine> Water.
user> Why?
machine> Birds move through the air and fish move through water.

Or,

user> Telescopes are to vision as what is to reasoning?
machine> A new kind of tool that the human race will have after the characteristic numbers of most notions are determined, according to Leibniz.

This sort of NL interaction appears to require machine reading, kb search, paraphrase generation, kb search and machine writing. A goal of mine after processing text into a knowledge representation is an NL interface (requiring machine reading/writing) to demonstrate NLU. The algorithm for the above examples is tentatively all-paths discovery, path permuting and parallel search of the kb.

Pseudocode:

// n1 is to n2 as n3 is to what?  (call with n4 == null)
// n1 is to n2 as what is to n4?  (call with n3 == null)
Node[] IsToAsIsTo(Node n1, Node n2, Node n3, Node n4)
{
 Path[] P = FindAllPaths(n1, n2);          // all kb paths relating n1 and n2
 Path[] PP = Rephrases(P);                 // semantically equivalent permutations of those paths
 Node[] N = null;
 if (n4 == null) N = Search(n3, PP);       // nodes reached from n3 along such paths
 else if (n3 == null) N = Search(PP, n4);  // nodes from which n4 is reached along such paths
 return N;
}

Sunday, May 6, 2007

WikiCorpus

So far so good on those hypotheses. The WikiCorpus alpha source code will be available (in a few weeks) here. I'm building it presently and testing it on Wikipedia articles. It will provide a user interface for generating corpora for pronoun and reference discernment, and semantic parsing algorithm design.

Friday, May 4, 2007

Hypotheses

The hypotheses that I'm presently testing are

1) Representation Hypothesis

Letting P(X,Y) = ~P(Y,X)

A natural language sentence can be represented:

S → S1 | S1 S2 | S1 S2 S3 ...
Si → P(A,A) | ~P(A,A)
A → P(A,A) | ~P(A,A) | N | _
N → n1 | n2 | n3 ...
P → p1 | p2 | p3 ...

2) Noun Order Hypothesis

We can view S as a sequence of trees. If the leaf nodes of those trees that are nouns are viewed as a sequence, then a noun can appear in that sequence only if the noun preceding it in the natural language sentence appears earlier in the sequence.

3) Hypothesis of Heuristic

It is heuristically possible to convert between this representation and natural language.


Optionally, for sentence structures resembling “X makes Y Z” and “X finds Y Z”, and should it help the heuristic, we can amend the representation hypothesis:

Si → P(A,A) | ~P(A,A) | P(A,P'(A,A)) | ~P(P'(A,A),A)
A → P(A,A) | ~P(A,A) | P(A,P'(A,A)) | ~P(P'(A,A),A) | N | _

Where P' can be determined from P.
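A minimal sketch of a data structure satisfying the grammar above, with '~' modeled as a flag and '_' as a blank argument (the type and field names are mine):

// Sketch: one argument type covering A → P(A,A) | ~P(A,A) | N | _ ; a sentence S is a list of Pred.
public class Arg
{
    public string Noun;                                     // set when this argument is N
    public Pred Nested;                                     // set when this argument is itself a predicate
    public bool IsBlank => Noun == null && Nested == null;  // the '_' case
}

public class Pred
{
    public string P;              // predicate symbol p1, p2, ...
    public bool Inverted;         // models ~P(A,A), i.e. P(X,Y) = ~P(Y,X)
    public Arg Left = new Arg();
    public Arg Right = new Arg();
}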

Machine Reading, Paraphrases

1) {Bobby} [likes][drawing] {pictures} [with] {crayons}.
Likes(bobby,With(Draw(_,pictures),crayons))

2) {Bobby} [likes][using] {crayons} [to draw] {pictures}.
Likes(bobby,Using(crayons,Draw(_,pictures)))
Likes(bobby,~With(crayons,Draw(_,pictures)))

So, the nesting of predicates can be discerned by noun order paraphrases. This is theoretical at this point, but so far so good. I'm using the premise of semantics-preserving noun order paraphrases as an axiom in theorizing machine reading algorithms.

Representation Notes

This sentence suggests that gerunds should be in noun brackets.
{Bobby} [likes] {drawing}. B(A,C)

However,
{Bobby} [likes] %drawing% B(A,C) or B(A,C(_,_))

{Bobby} [likes drawing] {pictures}. B(A,C)
{Bobby} [likes][drawing] {pictures}. B(A,C(_,D))
{Bobby} [likes][to draw] {pictures}. B(A,C(_,D))

So the gerund can be P(_,_) and the infinitive can be the same or, if a direct object is specified, be P(_,Y).

{Bobby} [drew]. B(A,_)

An argument for structural reuse is in the sentence:
As Bobby likes to draw pictures with crayons, he went to the store and he purchased a box of them.
<As> [[1]], {he}[went to]{the store} and {he}[purchased] {a box}[of]{them}.
As([[1]],[[2]])

The <> brackets represent elements or operators that relate predicates or sequences of predicates, logic operations and other punctuation relevant to processing.

Wikipedia Example 3

The term Artificial Intelligence (AI) was first used by John McCarthy who used it to mean "the science and engineering of making intelligent machines". It can also refer to intelligence as exhibited by an artificial (man-made, non-natural, manufactured) entity. The terms strong and weak AI can be used to narrow the definition for classifying such systems. AI is studied in overlapping fields of computer science, psychology, philosophy, neuroscience and engineering, dealing with intelligent behavior, learning and adaptation and usually developed using customized machines or computers.

Research in AI is concerned with producing machines to automate tasks requiring intelligent behavior. Examples include control, planning and scheduling, the ability to answer diagnostic and consumer questions, handwriting, natural language, speech and facial recognition. As such, the study of AI has also become an engineering discipline, focused on providing solutions to real life problems, knowledge mining, software applications, strategy games like computer chess and other video games. One of the biggest difficulties with AI is that of comprehension. Many devices have been created that can do amazing things, but critics of AI claim that no actual comprehension by the AI machine has taken place. [1]


{The term Artificial Intelligence|A} [was first used by|B] {John McCarthy|C} <who|D> [used|E] {it|F} [to mean|G] <"|H> {the science and engineering|I} [of|J] {making intelligent machines|K} <"|L>. {It|M} [can also refer to|N] {intelligence|O} as [exhibited by|P] {an artificial (man-made, non-natural, manufactured) entity|Q}. {The terms strong and weak AI|R} [can be used to narrow|S] {the definition|T} [for|U] [classifying|V] {such systems|W}. {AI|X} is [studied|Y] [in|Z] {overlapping fields|AA} [of|AB] {computer science|AC}, {psychology|AD}, {philosophy|AE}, {neuroscience and engineering|AF}, [dealing with|AG] {intelligent behavior|AH}, {learning and adaptation|AI} and [usually developed using|AJ] {customized machines|AK} [or|AL] {computers|AM}.

{Research in AI|AN} [is concerned with|AO] [producing|AP] {machines|AQ} [to automate|AR] {tasks|AS} [requiring|AT] {intelligent behavior|AU}. {Examples|AV} [include|AW] {control|AX}, [planning and scheduling|AY], [the ability to answer|AZ] {diagnostic and consumer questions|BA}, {handwriting|BB}, {natural language|BC}, {speech and facial recognition|BD}. <As such|BE>, {the study|BF} [of|BG] {AI|BH} [has also become|BI] {an engineering discipline|BJ}, [focused on providing|BK] {solutions|BL} [to|BM] {real life problems|BN}, {knowledge mining|BO}, {software applications|BP}, {strategy games|BQ} [like|BR] {computer chess and other video games|BS}. {One|BT} [of|BU] {the biggest difficulties|BV} [with|BW] {AI|BX} [is|BY] <that|BZ> <of|CA> {comprehension|CB}. {Many devices|CC} [have been created|CD] that [can do|CE] {amazing things|CF}, <but|CG> {critics|CH} [of|CI] {AI|CJ} [claim|CK] that <no|CL> {actual comprehension|CM} [by|CN] {the AI machine|CO} [has|CP] {taken place|CQ}.

B(A,C)
G(E(C,F),J(I,K))
N(M,P(O,Q))
S(R,U(T,V(_,W)))
Y+Z(X,AB(AA,AC&AD&AE&AF&AG&AH&AI))
AJ(X,AL(AK,AM))
AO(AN,AR(AP(_,AQ),AT(AS,AU)))
BE(AW(AV,AX&AY(_,_)&AZ(_,BA)&BB&BC&BD),BI(BG(BF,BH),BJ))
BE(AW(AV,AX&AY(_,_)&AZ(_,BA)&BB&BC&BD),BK(BJ,BM(BL,BN&BO&BP&BR(BQ,BS))))
BY(BW(BU(BT,BV),BX),CB)
CG(CE(CD(CC,_),CF),CK(CI(CH,CJ),!CP(CN(CM,CO),CQ)))

I'm looking into the linguistic reification of occurrence in the representation format (... CP CQ). A logic-based rule system may aid in this area.

Also interesting in this text is the passive voice, “Many devices have been created that can do amazing things.” The underscore indicates that the creator is unknown to the machine reading algorithm: CD(CC,_), as CD correlates to ~Created. I'm using the notation P(_,_) to represent gerunds.

One strategy in developing machine reading algorithms is to look at children's books and texts. These are already categorized by reading level, so any algorithm that appears to somehow build on itself as the reading level is increased would, philosophically, have additional merit, in my opinion.

[1] Artificial Intelligence, Wikipedia

Wikipedia Example 2

Knowledge representation is an issue that arises in both cognitive science and artificial intelligence. In cognitive science it is concerned with how people store and process information. In artificial intelligence (AI) the primary aim is to store knowledge so that programs can process it and achieve the verisimilitude of human intelligence. AI researchers have borrowed representation theories from cognitive science. Thus there are representation techniques such as frames, rules and semantic networks which have originated from theories of human information processing. Since knowledge is used to achieve intelligent behavior, the fundamental goal of knowledge representation is to represent knowledge in a manner as to facilitate inferencing i.e. drawing conclusions from knowledge. [1]


{Knowledge representation|A} [is|B] {an issue|C} that [arises in|D] both {cognitive science and artificial intelligence|E}. [In|F] {cognitive science|G}, {it|H} [is concerned with|I] [how|J] {people|K} [store and process|L] {information|M}. [In|N] {artificial intelligence|O} {the primary aim|P} [is|Q] [to store|R] {knowledge|S} [so|T] that {programs|U} [can process|V] {it|W} and [achieve|X] {the verisimilitude|Y} [of|Z] {human intelligence|AA}. {AI researchers|AB} [have borrowed|AC] {representation theories|AD} [from|AE] {cognitive science|AF}. <Thus|AG> <there are|*> {representation techniques|AH} [such as|AI] {frames|AJ}, {rules|AK} and {semantic networks|AL} which [have originated from|AM] {theories|AN} [of|AO] {human information processing|AP}. <Since|AQ> {knowledge|AR} [is used to achieve|AS] {intelligent behavior|AT}, {the fundamental goal|AU} [of|AV] {knowledge representation|AW} [is|AX] [to represent|AY] {knowledge|AZ} [in|BA] {a manner|BB} as [to facilitate|BC] {inferencing|BD} <i.e.|BE> {drawing conclusions|BF} [from|BG] {knowledge|BH}.

B(A,C)
D(C,E1)
D(C,E2)
I(F(H,G),J(K,L1(K,M)))
I(F(H,G),J(K,L2(K,M)))
Q(N(P,O),T(R(_,S),V(U,W)&X(U,Z(Y,AA))))
AG(AC(AB,AE(AD,AF)),AM(AI(AH,AJ),AO(AN,AP)))
AG(AC(AB,AE(AD,AF)),AM(AI(AH,AK),AO(AN,AP)))
AG(AC(AB,AE(AD,AF)),AM(AI(AH,AL),AO(AN,AP)))
AQ(AS(AR,AT),AX(AV(AU,AW),BC(BA(AY(_,AZ),BB),BD)))
BE(BD,BG(BF,BH))

[1] Knowledge Representation, Wikipedia

Natural Language Understanding

Nations can have literatures, as can corporations, philosophical schools or historical periods. Popular belief commonly holds that the literature of a nation, for example, comprises the collection of texts which make it [into/become] a whole nation. The Hebrew Bible, Persian Shahnama, the Indian Mahabharata, Ramayana and Thirukural, the Iliad and the Odyssey, Beowulf, and the Constitution of the United States, all fall within this definition of a kind of literature.


The difference between machine reading and natural language understanding, in my opinion, is that machine reading is towards natural language processing and natural language understanding is towards machine reasoning and artificial intelligence. Looking at the underlined portions of the quote, we can equate or otherwise relate them, but I would say this is a task of natural language understanding as defined and not machine reading.

This is a sort of pronoun or reference resolution; binding strings to the same underlying unique representations may be slightly more advanced than reference resolution, as in:

Nations can have literatures, as can corporations, philosophical schools or historical periods. Popular belief commonly holds that the literature of a nation, for example, comprises the collection of texts which make it [into/become] a whole nation. The Hebrew Bible, Persian Shahnama, the Indian Mahabharata, Ramayana and Thirukural, the Iliad and the Odyssey, Beowulf, and the Constitution of the United States, all fall within this definition of a kind of literature.


Revisiting the “X makes Y Z” sentence, the [into/becomes] predicate is unique to the makes predicate so as to allow paraphrases, and is not necessarily the same predicate as other text occurrences of the words “into” or “become”. While predicates and nouns are resolved from text to integers after machine reading structures the sentence, predicates generated during machine reading are algorithmically determined.

Algorithmically, “X P Y Z.” → “X P Y P' Z.”, where the empty string is a candidate for P' in P'(X,Y) during NLG, allows permutation-based paraphrasing and the generation of sentences that occur naturally, as in this document from Wikipedia. The context entered with P should notice, when Z (a noun chunk) is to be placed in a predicate position, that P' is needed and that Z is its right argument.

X P Y Z.

Start a predicate with X as the left argument.
Label that predicate P.
Place Y into the right argument of that predicate.
Z...
Wrap that argument (Y) into a predicate with the previous content in the left argument.
Label that predicate P'.
Place Z into the right argument of that predicate.

So, the “X makes Y Z” construction required some thinking. If you find any other sentences that are good examples to strengthen algorithm design, comment them here or email me and I'll post on them.

Thursday, May 3, 2007

Wikipedia Example

Nations can have literatures, as can corporations, philosophical schools or historical periods. Popular belief commonly holds that the literature of a nation, for example, comprises the collection of texts which make it a whole nation. The Hebrew Bible, Persian Shahnama, the Indian Mahabharata, Ramayana and Thirukural, the Iliad and the Odyssey, Beowulf, and the Constitution of the United States, all fall within this definition of a kind of literature. [1]


As written, this sentence does not immediately parse as per the working hypothesis. The problem occurs around the "collection of texts which make it a whole nation."

[literature of a nation][comprises][collection of texts] which [make][it][a whole nation].

Labeling the chunks A - F, we can see
B(A,C)
D(A,...)

However, by inserting the word “into”, or “become” into the text:

[literature of a nation][comprises][collection of texts] which [make][it][into][a whole nation].

and labeling the chunks A-G, we can see
B(A,C)
D(A,F(E,G))

So there may be occasions where the machine reading algorithm has to add text to a document in order to parse it correctly, or has to use pattern-based rules, for example involving the predicate make(X,Y). These can be reflected in NLG as well, allowing sentence candidates to be formed where, as in this example, the “into” or “become” is implicit.

So with that in mind,

Nations can have literatures, as can corporations, philosophical schools or historical periods. Popular belief commonly holds that the literature of a nation, for example, comprises the collection of texts which make it [into] a whole nation. The Hebrew Bible, Persian Shahnama, the Indian Mahabharata, Ramayana and Thirukural, the Iliad and the Odyssey, Beowulf, and the Constitution of the United States, all fall within this definition of a kind of literature. [1]


[Nations][can have][literatures], as [can][corporations], [philosophical schools] or [historical periods]. [Popular belief][commonly holds] that [the literature][of][a nation] ... [comprises] the [collection of texts] which [make][it][into][a whole nation]. [The Hebrew Bible], [Persian Shahnama], [the Indian Mahabharata], [Ramayana and Thirukural], [the Iliad and the Odyssey], [Beowulf], and [the Constitution of the United States], all [fall within] this [definition][of][a kind of literature].

Machine Reading
B(A,C)
B(E&F&G,C)
I(H,N(K(J,L),O))
P(O,R(Q,S))
AA(T&U&V&W&X&Y&Z,AC(AB,AD))

The ampersands indicate the processing of commas.

Start a new predicate with A as the left argument.
Name that predicate B.
Place C in the right argument of that predicate.
Start a new predicate with the label B.
Place C in the right argument of that predicate.
Place E in the left argument of that predicate.
In the context of a comma list,
Place F in the left argument of that predicate.
In the context of a comma list,
Place G in the left argument of that predicate.
Start a new predicate with H as the left argument.
Name that predicate I.
In the right argument of that predicate,
Start a new predicate with J as the left argument.
Name that predicate K.
Place L in the right argument of that predicate.
...
Wrap that predicate in a new predicate with the current content in the left argument.
Name that predicate N.
Place O in the right argument of that predicate.
Start a new predicate with O as the left argument.
Name that predicate P.
In the right argument of that predicate,
Start a new predicate with Q as the left argument.
Name that predicate R.
Place S in the right argument of that predicate.
Start a new predicate with T as the left argument.
In the context of a comma list,
Place U in the left argument of that predicate.
In the context of a comma list,
Place V in the left argument of that predicate.
In the context of a comma list,
Place W in the left argument of that predicate.
In the context of a comma list,
Place X in the left argument of that predicate.
In the context of a comma list,
Place Y in the left argument of that predicate.
In the context of a comma list,
Place Z in the left argument of that predicate.
Name that predicate AA.
In the right argument of that predicate,
Start a new predicate with AB as the left argument.
Name that predicate AC.
Place AD in the right argument of that predicate.

We can use “symbolic logic” operations to turn:
B(E&F&G,C)
into
B(E,C)
B(F,C)
B(G,C)

resulting in

B(A,C)
B(E,C)
B(F,C)
B(G,C)
I(H,N(K(J,L),O))
P(O,R(Q,S))
AA(T,AC(AB,AD))
AA(U,AC(AB,AD))
AA(V,AC(AB,AD))
AA(W,AC(AB,AD))
AA(X,AC(AB,AD))
AA(Y,AC(AB,AD))
AA(Z,AC(AB,AD))

We can also observe that V and W can become V1&V2, and W1&W2.

B(A,C)
B(E,C)
B(F,C)
B(G,C)
I(H,N(K(J,L),O))
P(O,R(Q,S))
AA(T,AC(AB,AD))
AA(U,AC(AB,AD))
AA(V1,AC(AB,AD))
AA(V2,AC(AB,AD))
AA(W1,AC(AB,AD))
AA(W2,AC(AB,AD))
AA(X,AC(AB,AD))
AA(Y,AC(AB,AD))
AA(Z,AC(AB,AD))

which in quad form is

A B C 1
E B C 2
F B C 3
G B C 4
J K L 5
5 N O 6
H I 6 7
Q R S 8
O P 8 9
AB AC AD 10
T AA 10 11
U AA 10 12
V1 AA 10 13
V2 AA 10 14
W1 AA 10 15
W2 AA 10 16
X AA 10 17
Y AA 10 18
Z AA 10 19
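As a sketch of the mechanical part of the quad-form step above (assuming the ampersand lists have already been distributed); Node, Quads and Flatten are names of mine:

// Sketch: flatten a nested binary predicate into quads (left, predicate, right, id),
// numbering bottom-up so a nested argument is replaced by the id of its own quad.
using System.Collections.Generic;

public class Node
{
    public string Label;        // predicate label, e.g. "AC"
    public object Left, Right;  // each is either a string (noun label) or another Node
}

public static class Quads
{
    public static void Flatten(Node root, List<string[]> quads)
    {
        FlattenInto(root, quads);
    }

    private static string FlattenInto(Node n, List<string[]> quads)
    {
        string left = n.Left is Node ? FlattenInto((Node)n.Left, quads) : (string)n.Left;
        string right = n.Right is Node ? FlattenInto((Node)n.Right, quads) : (string)n.Right;
        string id = (quads.Count + 1).ToString();
        quads.Add(new[] { left, n.Label, right, id });   // e.g. "AB AC AD 10"
        return id;                                       // outer predicates reference this id
    }
}

Passing the same list across all of a document's predicates keeps the ids running; a fuller version would also reuse the id of an already-flattened nested predicate, as the listing above reuses 10, rather than renumbering it per root.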

[1] Literature, Wikipedia

Wednesday, May 2, 2007

Machine Translation

The argument is that the sequential set that builds a semantic parse tree, using the nouns in the order that they occur, is an intermediate representation format in all languages, sitting between a set of phrasings and a semantic parse tree; and that changes in the semantic parse tree can be reflected in the sequential set, resulting in a different set of phrasings (paraphrase generation). The illustration of patterns in the examples in English is an existence proof of a heuristic, not an argument that such patterns are identical across all languages. However, the use of integers when representing semantic parse trees or sets of predicates is argued to be language independent.

Interestingly, natural language is a temporal process. The intermediate representation is both temporal and static and the sequence of predicates is a static knowledge representation that allows semantics-preserving permutations.

So, the argument is:

Machine Reading
Sentence(s) phrasing(s)
[patterns suggest heuristic to:]
Intermediate representation
[heuristic apparent to:]
Sequence of predicates

Machine Writing
Sequence of predicates
[heuristic apparent to:]
Intermediate representation
[patterns suggest heuristic to:]
Sentence(s) phrasing(s)

Paraphrase Generation
Sentence(s) phrasing(s)
Intermediate representation
Sequence of predicates
Permutations
Sequence of predicates
Intermediate representation
Sentence(s) phrasing(s)

The paraphrase generation aspect is theoretically important to move the semantics between languages during machine translation. One natural sounding sentence in one language may, for example, require multiple sentences to express the same semantics naturally in another; the sentence structures of a paragraph of content may differ in natural language usages, and so forth. The sequence of predicates is argued to represent the knowledge shared across all articulations.

Tuesday, May 1, 2007

Foundations of Machine Reading/Writing

[Leibniz][took up][the question][in][his baccalaureate thesis], and [argued][in][the true scholastic style][for][a principle of individuation] which [would preserve][the independence of universals] [with respect to][ephemeral sensations], and [yet][embodied][universal ideas][in][the eternal natures of individuals].

Let us label each bracketed chunk with a letter, A through S.

With the [yet] (O), we would have:

D(B(A,C),E)
{

G(F+I(A,J),H)
K(J,M(L,N))

}
O
{

P(A,R(Q,S))

}

Theoretically resembling:

D(B(A,C),E)
G(F+I(A,J),H)
K(J,M(L,N))
P(A,R(Q,S))
O(G(F+I(A,J),H),P(A,R(Q,S)))
O(K(J,M(L,N)),P(A,R(Q,S)))

but let us look at the following four and call the processing of O a metaoperation.

D(B(A,C),E)
G(F+I(A,J),H)
K(J,M(L,N))
P(A,R(Q,S))

So, in this paradigm of machine reading, the goal is to turn a sequence of chunks into a structure with necessary internal states to manage the noun chunks during the process. Transitions may occur via chunked and non-chunked tokens.

How would we describe the process of building these four predicates?

Start a new predicate with A as the left argument.
Name that predicate B.
Place C in the right argument of that predicate.
Move that predicate into the left position of a new predicate.
Name that predicate D.
Place E in the right argument of that predicate.
Start a new predicate with A as the left argument.
Name that predicate F.
Move that predicate into the left position of a new predicate.
Name that predicate G.
Place H in the right argument of that predicate.
Now back to the last predicate that we just moved into the left position of this predicate.
Add I to that predicate's label.
Place J in the right argument of that predicate.
Start a new predicate with J as the left argument.
Name that predicate K.
The next statement is set in the right argument of that predicate.
Start a new predicate with L as the left argument.
Name that predicate M.
Place N in the right argument of that predicate.
Start a new predicate with A as the left argument.
Name that predicate P.
The next statement is set in the right argument of that predicate.
Start a new predicate with Q as the left argument.
Name that predicate R.
Place S in the right argument of that predicate.

Patterns:

(×4)
Start a new predicate with <X> as the left argument.
Name that predicate <X+1>.

(×3)
Start a new predicate with <X> as the left argument.
Name that predicate <X+1>.
Place <X+2> in the right argument of that predicate.

(×2)
Name that predicate <X>.
The next statement is set in the right argument of that predicate.
Start a new predicate with <X+1> as the left argument.
Name that predicate <X+2>.
Place <X+3> in the right argument of that predicate.

“A” is the subject of the sentence: "Start a new predicate with A as the left argument" (×3)

In two of the occasions where "Start a new predicate with A as the left argument" occurs (Start a new predicate with <SUBJ> as the left argument), it is preceded by "Place <X> in the right argument of that predicate." As the sentence ends with that instruction, it is possible that, in a sequence of sentences, all three occurrences would be preceded by it.

I theorize that in a processed document, a continuous sequence of these instructions (across sentence boundaries) would have robust and complex patterns indicative of natural writing style. Furthermore, I theorize that this methodology will be able to explain why people read the active voice faster than the passive and how these are processed differently in this paradigm.

Monday, April 30, 2007

Treebuilding Example

Confident in the theory, I went to my bookshelf, grabbed a book and found a random sentence. Using the following (rough draft) operations:

NEW: begin a new predicate
PUSHCUR: push a noun onto the stack, other operations may use the top
POPCUR: remove top element from stack
WRAPL: move from current position into the left position of a new outer predicate; WRAPL(P) = C → P(C,_). After a NEW, the top of the stack is the current position.
SETR: set the right argument of a predicate
IPWRAPL: perform a WRAPL around the last placed argument
TRNS: a transition such as yet, however, etc.
MODP: modify the predicate in the current scope

Leibniz took up the question in his baccalaureate thesis, and argued in the true scholastic style for a principle of individuation which would preserve the independence of universals with respect to ephemeral sensations, and yet embodied universal ideas in the eternal natures of individuals.


NEW
PUSHCUR leibniz
WRAPL TookUp
SETR the question
WRAPL In
SETR his baccalaureate thesis
NEW
WRAPL Argued
WRAPL Manner
SETR true scholastic style
MODP for (Argued → ArguedFor)
SETR principle of individuation
PUSHCUR principle of individuation
NEW
WRAPL Preserves
SETR independence of universals
IPWRAPL RespectTo
SETR ephemeral sensations
NEW
POPCUR
TRNS yet
WRAPL Embodied
SETR universal ideas
IPWRAPL Regarding
SETR eternal natures of individuals

This is a rough draft of an instruction set that would be output from a system and that can generate a sequence of recursive predicates. Some more instructions detailing scoping would be useful, for example to capture yet(A,B) where A and B are sequences of predicates. Ideally, these connectives pairwise connect the members of each sequence from each scope. Some other modifications might be necessary after looking over more sentences. The noun phrases are capable of being further structured, for example the last portion could be

SETR universal ideas
IPWRAPL Regarding
SETR eternal natures
IPWRAPL Of
SETR individuals

The system should be able to tell from the knowledgebase whether the reifiable substructure, e.g. “Of(eternal natures,individuals)”, is itself a composite noun or relatable entity.

The use of a stack is a preliminary approach; other example sentences indicate that the data structure(s) for noun handling are more complex, and a set of operations might be required that uses the last-used noun instead of pushing nouns onto a stack. It is also possible that the scoping is related, with data structures used per scope. Noun handling in the sequential assembly of recursive parse trees is an interesting area.
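Here's a minimal sketch of an interpreter for a simplified subset of the operations above (NEW, PUSHCUR, POPCUR, WRAPL, SETR, IPWRAPL; MODP and TRNS omitted); the state handling is my own reading of the definitions and will differ from whatever the draft settles on:

// Sketch: build a sequence of recursive predicates from treebuilding instructions.
using System.Collections.Generic;

public class PNode                       // either a predicate (Label set) or a noun chunk (Noun set)
{
    public string Label;
    public string Noun;
    public PNode Left, Right;
}

public class TreeBuilder
{
    public List<PNode> Roots = new List<PNode>();      // completed predicates, in order
    private readonly Stack<PNode> nouns = new Stack<PNode>();
    private PNode root;      // the construction for the current NEW scope
    private PNode focus;     // the predicate whose right slot SETR fills
    private PNode lastArg;   // the last placed argument, for IPWRAPL

    public void New()
    {
        if (root != null) Roots.Add(root);
        root = nouns.Count > 0 ? nouns.Peek() : null;   // after a NEW, the top of the stack is current
        focus = lastArg = null;
    }

    public void PushCur(string noun)
    {
        var n = new PNode { Noun = noun };
        nouns.Push(n);
        if (root == null) root = n;
    }

    public void PopCur() { nouns.Pop(); }

    // WRAPL: move the current construction into the left position of a new outer predicate.
    public void WrapL(string label) { root = focus = new PNode { Label = label, Left = root }; }

    // SETR: set the right argument of the predicate in focus.
    public void SetR(string noun) { focus.Right = lastArg = new PNode { Noun = noun }; }

    // IPWRAPL: wrap the last placed argument in a new predicate occupying the same slot.
    public void IpWrapL(string label)
    {
        var p = new PNode { Label = label, Left = lastArg };
        focus.Right = p;
        focus = p;
    }

    public void Finish() { if (root != null) { Roots.Add(root); root = null; } }
}

Replaying PUSHCUR leibniz; WRAPL TookUp; SETR the question; WRAPL In; SETR his baccalaureate thesis against this sketch yields In(TookUp(leibniz, the question), his baccalaureate thesis), i.e. D(B(A,C),E).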

I'm thinking about connectives such as “which would” and “yet”. Both are easily representable, and it might be advantageous to do so during NLP because the correct usage of connectives like “yet”, “but” and “however”, which illustrate a semantic contrast of some sort between sequences of predicates, is one distinction between NLG and AI, or between first- and second-generation NLU.

Sunday, April 29, 2007

Machine Translation, Machine Reasoning

Using a numerical interlingua, paraphrase generating systems that use the same ontology might have interesting applications for machine translation.

Also it shouldn't be terribly difficult to export from the recursive predicates into CYCL, KIF or RDF for machine reasoning applications. This format can represent its own rule system as well which may allow for implicit knowledge to be obtained via machine reading.

Algorithm Design

After WikiCorpus is up and running, I've a few hypotheses to test on the dataset. The first is the use of finite state automata in the sequential processing hypothesis. The FSM might require one or more stacks or cursors. This would take a preprocessed sentence (via NL tools) as input and output a sequence of treebuilding instructions. Some substrings would have to be mapped to predicate candidates.

This first theory is to transform a sentence into an alphabet of treebuilding instructions that may have its own grammar (universal?). I find interesting the permutations on the predicates, at the logic-based or knowledge-based level, that allow noun-order paraphrases to be transitioned between; the relationship between these permutations and the treebuilding alphabet; and their relation to the sentence paraphrases.

Rephrased, each sentence is a sequence of words that can be transformed (FSM, HMM, ?) into one or more sequences of treebuilding instructions, and the resulting sequence of predicates (or set of sequence candidates) can be permuted for noun-order paraphrases using the fact that predicates can map to others with the arguments inverted. It's possible that permutations on the tree can be mapped to transformations on the treebuilding instruction set and that this can map back to sentence(s). This level of natural language understanding could also be called a paraphrase generator (a less than exciting name for some rather complicated AI).

Thus, it's theoretically possible that the same system that turns sentences into sequences of predicates can be of use in natural language generation.

Thursday, April 26, 2007

Machine Reading, Paraphrases

A rule system may allow semantic subtrees to be mapped to one another. For example, the paraphrase:

Tommy put the book on the shelf to be helpful.
InOrderTo(Did(tommy,PutOn(book,shelf)),Is(tommy,helpful))

The rule Is(X,helpful) → Help(X,_) might express that if something is helpful, then that something helps (or helped) some thing or things. Formulating rules in this manner might transcend, in part, the hermeneutic circles that occur when relating lexical elements to one another.
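A sketch of such a rule applied mechanically to a single flat predicate (Fact and the method name are placeholders of mine; real matching would be recursive over nested predicates):

// Sketch: apply the single-variable rewrite rule Is(X, helpful) → Help(X, _).
public class Fact { public string P; public string Left; public string Right; }

public static class Rules
{
    // If the fact matches Is(X, "helpful"), derive Help(X, _); otherwise derive nothing.
    public static Fact IsHelpfulToHelp(Fact f)
    {
        if (f.P == "Is" && f.Right == "helpful")
            return new Fact { P = "Help", Left = f.Left, Right = null };  // null stands for '_'
        return null;
    }
}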

Hermeneutic circle refers to the semantic interconnectedness of a set of entities— for example, the definitions in a dictionary refer to other words in that dictionary. Philosophers from Schleiermacher and Dilthey through Heidegger and Gadamer considered this phenomenon. Wittgenstein said regarding this that light dawns gradually over the whole.

A rule system lexicon might further assist the semantic equivalence of paraphrases allowing semantic substructures to map to one another based on the lexical hermeneutic circle.

Like most things in AI, machine reading is more easily described than programmed. Any system that can equate noun-order paraphrases to the same set of predicates (or permutable equivalents) would be a milestone in my opinion.

Algorithm Design

I'm collecting important sentences from linguistics to parse examples to look at for algorithm design in machine reading.

“The girl whose car is blocking my view of the tree that I planted last year is my friend.”

This sentence is from a psycholinguistics article [1] and illustrates recursion or “the use of relative pronouns to refer back to earlier parts of a sentence.”

1) IsFriendOf(girl,I)
2) Possesses(girl,car)
3) Obscuring(car,ViewOf(I,tree))
4) On(Planted(I,tree),last year)

Each of these is a simpler sentence. The predicates' arguments are bound identically, which shows the use of URIs or integers as opposed to strings: the goal is to be able to place these four into a knowledgebase where they can be used with other knowledge, and to retrieve them and reassemble the sentence (sentence aggregation [2]) or sentences as needed.

Simulating the process of accumulating these predicates when processing the sentence in left to right order, or sequentially:

P1(girl, A2)

P1(girl, A2)
Possesses(girl,A3)

P1(girl, A2)
Possesses(girl,car)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,A4)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,A5))

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
P2(I,tree)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
Planted(I,tree)

P1(girl, A2)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
On(Planted(I,tree),last year)

IsFriendOf(girl, I)
Possesses(girl,car)
Obscuring(car,ViewOf(I,tree))
On(Planted(I,tree),last year)

Another pair of important sentences are:
1) “Fred saw the plane flying over Zurich.”
2) “Fred saw the mountains flying over Zurich.”

1a) Saw(fred,FlyingOver(plane,zurich))
2) While(Saw(fred,mountains),FlyingOver(fred,zurich))

Sentence 1                                Sentence 2
P1(fred,A2)                               P1(fred,A2)
Saw(fred,A2)                              Saw(fred,A2)
Saw(fred,P2(plane,A3))                    Saw(fred,mountains)
Saw(fred,FlyingOver(plane,A3))            While(Saw(fred,mountains),FlyingOver(fred,A3))
Saw(fred,FlyingOver(plane,zurich))        While(Saw(fred,mountains),FlyingOver(fred,zurich))


Looking at the line where the two parses diverge, and assuming a sequential processing, it appears that both hypotheses should be kept by an algorithm at that step. These sentences are an argument for knowledge-based processing and lexical data. It does appear that properties of “mountains” and “plane” can distinguish between hypothesized semantic parses. However, a word like “birds” could be in either parse structure or both simultaneously, depending on the context, for example seeing birds from a plane. Theoretically, knowledge-based, statistical and context-based methodologies can help discern between parse candidates.

Another possible representation of those sentences:
1b) While(Saw(fred,plane), FlyingOver(plane,zurich))
2) While(Saw(fred,mountains),FlyingOver(fred,zurich))

This representation makes clear that the difference is in binding the first argument of FlyingOver. The side by side processing of these two would otherwise be equivalent. I'll have to look at more sentences to determine whether 1a or 1b is more useful or if they are equivalent via a rule system. The representation is important in discerning the algorithm. I'm hopeful a corpus will aid in this area.

The sequential processing hypothesis is based on the proof-of-concept manner in which people read sequentially; however, machines need not process text in the same manner. Additionally, even within the sequential processing hypothesis there are possibilities; for example, the text processor could be one or more words ahead of the predicate generator.

Other hypotheses include structural processing, where the semantic tree is generated in a top-down or bottom-up manner based on data, patterns and substructural patterns collected and discerned from a corpus. This information can help determine a parse structure when one usage of language, resulting in one parse structure, is extremely rare while the other is commonplace.

[1] Psycholinguistics, Wikipedia
[2] Natural Language Generation, Wikipedia

Wednesday, April 25, 2007

Semantic Parsing

A sentence that parses to a tree structure with all the arguments of each predicate filled can be said to be semantically complete. However, with a tree structure of nested predicates, it's possible that some sentences, even grammatically correct ones, will parse to structures having blank arguments. This can be explained by the fact that contextual information may have been already delivered earlier in a document and that the author can then exercise brevity. This brevity may be more commonplace in informal spoken language.

Example:
Tommy put the book on the shelf in order to help.
InOrderTo(Did(tommy,PutOn(book,shelf)),Help(tommy,_))

The context of a previous sentence in a document may make clear that the setting is a library and thus the person writing or speaking the above sentence may choose to omit that information, while maintaining grammatical correctness, with a listener being able to discern the semantically complete version.

It's possible when parsing sentences into a semantic tree representation, as above, that some argument slots may be blank. These slots can be filled by either nouns or semantic subtrees and empty slots can help a natural language understanding system to know what is unknown when processing a document.

Some NLU systems may be able to discern ranked candidates for these from previous content or context. Algorithmically, this can be achieved by maintaining a context state during document processing or utilizing an event driven knowledge acquisition engine.
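As a sketch of the context-state idea, with recency as the only ranking signal (the names are mine):

// Sketch: track recently mentioned entities while processing a document and
// propose ranked candidates for a blank argument slot.
using System.Collections.Generic;
using System.Linq;

public class DocumentContext
{
    private readonly List<string> mentioned = new List<string>();

    public void Mention(string entity)
    {
        mentioned.Remove(entity);      // re-mentioning moves an entity to the front
        mentioned.Insert(0, entity);
    }

    // Candidates for an empty slot, most recently mentioned first.
    public IEnumerable<string> CandidatesForBlankSlot(int max = 5) => mentioned.Take(max);
}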

If there's anything to this sentence-level semantic abbreviation in natural speech and writing, then NLG systems might be able to utilize it to produce less mechanical sounding text; a sequence of semantically complete sentences might sound formal or verbose.

Saturday, April 21, 2007

Speech Technology, Semantic Web

I'm optimistic about the combination of the following technology I've been reading about:

1) Speech to text
2) Machine reading (NLP)
3) Knowledgebase / web
4) Machine writing (NLG)
5) Text to speech

Maybe someday people will be able to talk to their computers to add to and access collective encyclopedic knowledge.

Reification, Ontological Consensus and Ergonomics

Ontological consensus is the goal of the WikiCorpus project and is, in theory, possible via the on screen predictive list of predicates. Users would rather select predicates (that work for the semantics of the sentence) from the list than create new ones, just as users prefer to find the results of a search on the first page of a search engine's results. This list, hopefully produced by an accurate predictive algorithm, will facilitate consistency in the dataset.

In the user interface, the ergonomics of nested predicates would be that a completed predicate can be dragged (moved or copied) into the argument slot of a predicate being formed. The problem is, firstly, that moving and copying the predicate are both done with the drag-and-drop motion. Secondly, the nesting of predicates might detract from semantic and ontological consensus, as complicated constructions are possible from the elements in the predictive list.

Take for example:
1) The book was put on the shelf.
2) Tommy put the book on the shelf.
3) Tommy put the book on the shelf to help the library.

Each is a semantic superset of the previous sentence, duplicating and nesting the previous predicate.

1) PutOn(book,shelf)
2) Did(tommy,PutOn(book,shelf))
3) InOrderTo(Did(tommy,PutOn(book,shelf)),Help(tommy,library))

That is, the first sentence is the first predicate, the second is both the first and second, and the third is all three. Two terms related to this method of representing sentences are the semantic core and the semantic root of a sentence (I'll try to find out if other terminology already exists). The core is the predicate the sentence is constructed from, often the most nested predicate, and the root is the least nested, the root of the corresponding tree structure.

So, while recursive binary predicates appear to be able to capture natural language, the interface considerations and ergonomics are more complicated. Also, finding the semantic core of sentences appears to relate to placing the sentence into its semantic frame via PropBank or FrameNet corpora; the algorithm would then be to construct the entirety. These nested predicates can be represented in the matrix format with some notational conventions.

Interestingly, similar to the reordering of nouns in the matrix format, the reordering of the predicate arguments or nouns can be done, resembling:

InOrderTo(DoneBy(PutOn(book,shelf),tommy),Help(tommy,library))
The book was put on the shelf by Tommy to help the library.

InOrderToInverse(Help(tommy,library),DoneBy(PutOn(book,shelf),tommy))
In order to help the library, the book was put on the shelf by Tommy.
In order for Tommy to help the library, the book was put on the shelf by him.

InOrderToInverse(Help(tommy,library),Did(tommy,PutOn(book,shelf)))
In order for Tommy to help the library, he put the book on the shelf.
To help the library, Tommy put the book on the shelf.

So, each binary predicate may be related to another predicate with its arguments in the opposite order. Some predicates may not, limiting the possible noun orderings for paraphrases. This is just one approach to capturing semantics using nested predicates; I look forward to learning other approaches and designing a web interface for a collaborative corpus.
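A sketch of how that inversion might be driven by a lookup table; the table contents and names are illustrative only, and predicates with no entry simply keep their argument order:

// Sketch: swap a predicate's arguments when a known inverse predicate exists,
// e.g. Did → DoneBy, InOrderTo → InOrderToInverse.
using System.Collections.Generic;

public class Pred { public string Label; public object Left, Right; }  // arguments: string or Pred

public static class Paraphrase
{
    public static readonly Dictionary<string, string> InverseOf = new Dictionary<string, string>
    {
        { "Did", "DoneBy" }, { "InOrderTo", "InOrderToInverse" }
    };

    // Invert one node if an inverse label is known; nested arguments are left untouched here.
    public static Pred Invert(Pred p)
    {
        string inv;
        if (!InverseOf.TryGetValue(p.Label, out inv)) return p;   // no inverse known: ordering is limited
        return new Pred { Label = inv, Left = p.Right, Right = p.Left };
    }
}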

Thursday, April 19, 2007

Knowledge Representation, Paraphrases

In machine reading, a goal is for all paraphrases to be processed into the same set of predicates or other form of representing semantics. One method I found of capturing this is to represent a sentence as a block matrix of the pairwise binary predicates between nouns. This utilizes the fact that n-ary predicates can be decomposed into a set of binary predicates.

The rows and columns are in the order of nouns occurring in a sentence and the entry in the i-th row and j-th column is meant as the predicate(s) that relate(s) the i-th and j-th noun (nouns on main diagonal). Using this, permutations can change noun ordering (one variety of paraphrases) and the semantics can be preserved.

Example:
1) Tommy went to the store to get a crescent wrench. <Tommy,store,wrench>
2) To get a crescent wrench, Tommy went to the store. <wrench,Tommy,store>
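A sketch of the matrix idea (names are mine): the nouns sit on the main diagonal in sentence order, off-diagonal cells hold the relating predicate(s), and one permutation applied to rows and columns together reorders the nouns without losing which predicate relates which pair.

// Sketch: pairwise-predicate matrix over the nouns of a sentence.
public class NounMatrix
{
    public string[] Nouns;        // main diagonal, in sentence order
    public string[,] Predicates;  // Predicates[i, j] relates Nouns[i] and Nouns[j]

    public NounMatrix Permute(int[] order)   // order[k] = index of the noun to place at position k
    {
        int n = Nouns.Length;
        var result = new NounMatrix { Nouns = new string[n], Predicates = new string[n, n] };
        for (int i = 0; i < n; i++)
        {
            result.Nouns[i] = Nouns[order[i]];
            for (int j = 0; j < n; j++)
                result.Predicates[i, j] = Predicates[order[i], order[j]];
        }
        return result;
    }
}

For the example above, permuting <Tommy,store,wrench> with the order {2,0,1} gives <wrench,Tommy,store>.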

From this or another noun-order invariant representation format, the goal of NLG is then to compose grammatically correct sentences containing the semantics with the nouns in the order of the underlying matrix. A reason for robustness in this is that noun ordering is often context dependent, as sentences are composed in paragraphs and documents; both noun ordering and sentence aggregation [1] are overarching processes that are part of “good writing style”.

Using the phases of NLG from the article, the process with this knowledge representation may resemble:

Content determination: Knowledge is obtained from a knowledgebase or web.
Discourse planning: Knowledge is moved into a matrix format.
Sentence aggregation: Block diagonalization and utilization of other patterns discerned from a corpus of well-written articles.
Lexicalisation: Putting words to the concepts.
Referring expression generation: Linking words in the sentences by introducing pronouns and other types of means of reference.
Syntactic and morphological realisation: Permutations are applied as per patterns discerned from well-written articles; each sentence is realized with the nouns in the best order.
Orthographic realisation: Matters like casing, punctuation, and formatting are resolved.


[1] Natural Language Generation, Wikipedia article

WikiCorpus, Natural Language Processing

I'm tentatively calling the website 'WikiCorpus Project'. I'm hoping to make the interface as easy to use as possible to make entering read knowledge a rapid and ergonomic process. Interestingly, the design of the interface and algorithms for ergonomics are similar to the process of natural language processing.

There will be an on screen list of predicates offered to the user for each sentence. Hopefully, this list will be accurately predicted so that the user does not have to search for or create a new predicate. This predictive list will likely improve over time, as the corpus is populated, using patterns in the sentence (improved by word sense discernment, see: WordNet) and context. This (now theoretical) algorithm and its statistics-based dataset (both to be freely downloadable) might be of use as an algorithm component to some approaches.

Pronouns and other indirect entity references will have a context-menu where the user can select which entity the reference refers to. Again, ideally the correct reference is the first in a list offered and the goal is to minimize the probability of the user having to select 'Other' and select from a comprehensive list of the entities in the document.

So where NLP aims to resolve these completely, the ergonomics of this project aim to present an improving list of options to a user entering the knowledge they are reading. The ergonomic goal is that the predicate they are going to use next should be on screen (predicted), and that the resolving of pronouns should have the correct candidate (hopefully first) in a drop-down list.

The ergonomics then is an “easier” problem resembling information retrieval and predictive search that hopefully can be of use to some algorithmic approaches to the “less easy” task of mechanically and accurately doing the entire process (reading) on sentences in documents.

Monday, April 16, 2007

SRL Techniques (1 of 2)

I'm reading about approaches to SRL that include linearly interpolated relative frequency models [1], HMM's [2], SVM's [3], decision trees [4], and log-linear models [5]. I'm hoping to learn which are most compatible with resolving a sentence's syntax to an exhaustive set of predicates (hopefully capturing all of a sentence's semantics) with simple nouns or noun phrases as arguments.

As a rule, the more abstract roles have been proposed by linguists, who are more concerned with explaining generalizations across verbs in the syntactic realization of their arguments, while the more specific roles are more often proposed by computer scientists, who are more concerned with the details of the realization of the arguments of specific verbs. [1]

[1] Daniel Gildea, Daniel Jurafsky. Automatic Labeling of Semantic Roles. In Computational Linguistics (2002), 28(3):245–288

[2] Freitag, D., McCallum, A.: Information extraction with HMM structures learned by stochastic optimization. In: Proceedings of AAAI. (2000) 584–589

[3] Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James Martin, and Dan Jurafsky. 2004. Shallow semantic parsing using support vector machines. In Proceedings of HLT/NAACL-2004.

[4] Mihai Surdeanu, Sanda Harabagiu, John Williams, and Paul Aarseth. 2003. Using predicate-argument structures for information extraction. In Proceedings of ACL-2003.

[5] Nianwen Xue and Martha Palmer. 2004. Calibrating features for semantic role labeling. In Proceedings of EMNLP-2004.

Sunday, April 15, 2007

Website Design

For this website for collaborative generation of a linguistic corpus and ontology, I'm thinking about a drag and drop based UI, where users can work with sentences and, using gestures, enter the knowledge obtained from reading them.

I downloaded phalanger for PHP integration into Visual Studio and .NET (they also have one for Mono). I'm looking at prototype and mootools for the javascript library. Also found this extension to mediawiki that adds web services to wiki (SOAP, WSDL) — I'll probably code some web services into it.

I'm thinking that the site will show articles with entities (nouns) as draggable objects that can be manipulated in the environment to allow users to easily provide the knowledge they obtain from reading (sentence by sentence). This will include anaphoric resolution and semantic relations between entities. By processing entire articles, context will be obtainable from the dataset. Hopefully the interface will be intuitive enough to encourage users to provide an ever-improving downloadable resource comparable to penn treebank and redwoods.

Saturday, April 14, 2007

Wikipedia, Semantic Web

After I make the web-deliverable interface that allows users to expand these semantic frames into all possible related entities, refactor some SRL parsers to turn natural language into this semantic data, and point the system at Wikipedia... weeks later, what does this enormous OWL file (possibly stored numerically) have to do with the Semantic Web?

A quote from Cycorp:

The success of the Semantic Web hinges on solving two key problems: (1) enabling novice users to create semantic markup easily, and (2) developing tools that can harvest the semantically rich but ontologically inconsistent web that will result. To solve the first problem, it is important that any novice be able to author a web page effortlessly, with full semantic markup, using any ontology he understands. The Semantic Web must allow novices to construct their own individual or specialized-local ontologies, without imposing the need for them to learn about or integrate with an overarching, globally consistent, master ontology.

Allowing users to type in natural language is the easiest way to generate semantic markup. Because users prefer to use natural language, any ontology that software can roundtrip natural language with will likely be an overarching (prevalent) one. A possible problem with the approach I'm using is that the ontology of the dataset used to train SRL-based parsers would, instead of being handcrafted by an expert, be a collaborative effort of people, hopefully experts, visiting a site — wikiontology is a relatively new idea.

Only after I have the dataset will I be able to say if the ontology from the planned website is advantageous to machine reasoning tasks. It shouldn't be terribly difficult to make a benchmark for the consistency of the wiki-generated ontology — possibly using natural language (after the parser is completed). For example, paraphrase corpora and other instruments could be of use in both generating and refining.

Wikipedia is a proof of concept that people can come together to generate collective knowledge resources, so — if we get the post-NLP/pre-NLG ontology right (prevalent as argued above) — the Semantic Web may resemble a distributed wiki-knowledgebase. The gigabytes of Wikipedia data would be a launching point.

WikiOntology Idea

I'm envisioning a web-deliverable system for the PropBank and FrameNet data that allows users to modify or amend relations between the arguments, for the purpose of getting as many of these binary subpredicates into place as possible. Considering there are thousands, some sort of wiki tech might be useful, allowing users to download the up-to-date dataset. Basically, there would be a UI that indicated the arguments for each n-ary predicate or frame and would allow users to amend relationship types between them. Users would be able to use URIs from other domains or create new ones on the project's domain.

Given the sentence "The boy went to the store on his bike.", we might obtain from some parsers a predicate resembling "went(the boy, to the store on his bike)". The idea is to allow user input to get a dataset matching that sentence to a set of predicates also including "utilized(the boy, his bike, [to go] to the store)" and "owns(the boy, [a] bike)". Some sort of XML output would capture this, and a UI allowing pairwise connections between noun arguments would allow users to add all the knowledge contained in a sentence. Ideally, these predicates would be further simplified until only simple nouns were interrelated. The idea is to use wiki-collaboration to create a dataset for SRL algorithms.

Code Storage, Collaboration

If anybody visiting knows a place where I can upload code to and link to here, please feel free to comment (on any blog entry for that matter). I might end up using an open source repository; the only problem is I'm not sure whether the EULAs of some of the jars or IKVM'd DLLs (etc.) allow that. As for the GBs of Wikipedia data after I run some tools on it, I might have to torrent that. Feel free to contact me or comment here if any of these projects sound interesting.

Also, if you know of any downloadable SRL projects or tools please let me know. [Update: found swirl and salsa]

Integer-Based Representation

I'm working with integer-based or numerical representations in triples and quadruples. Basically, this approach uses bitfields instead of URI. I designed it to be computationally faster than string-based formats and the bitfields allow differentiation between tense negation, logical negation, set complementing and other operations on entities and relation types. I'm working on some tools to convert XML-based data to and from this format.
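A sketch of the kind of bitfield packing meant here; the flag names and the 48-bit id width are illustrative, not the actual format:

// Sketch: pack an entity or relation id with modifier flags into one 64-bit value.
using System;

[Flags]
public enum TermFlags : ushort
{
    None            = 0,
    TenseNegation   = 1 << 0,
    LogicalNegation = 1 << 1,
    SetComplement   = 1 << 2
}

public static class Term
{
    public static ulong Pack(ulong id, TermFlags flags) => ((ulong)flags << 48) | (id & 0xFFFFFFFFFFFFUL);
    public static ulong Id(ulong term) => term & 0xFFFFFFFFFFFFUL;
    public static TermFlags Flags(ulong term) => (TermFlags)(term >> 48);
}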

Also, n-ary predicates can be converted to a set of triples (or quadruples) by combinatorically relating the n arguments (pairwise) using binary subpredicates. This could be of use in converting sentence predicates into a triples-based language.
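A sketch of the combinatorial decomposition; in practice the binary subpredicate for each argument pair would be chosen or looked up rather than mechanically derived as it is here:

// Sketch: relate the n arguments of an n-ary predicate pairwise, yielding one triple per pair.
using System.Collections.Generic;

public static class Decompose
{
    public static IEnumerable<string[]> ToTriples(string predicate, IList<string> args)
    {
        for (int i = 0; i < args.Count; i++)
            for (int j = i + 1; j < args.Count; j++)
                yield return new[] { args[i], predicate + "_" + i + "_" + j, args[j] };
    }
}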

A rule system will probably be required to capture overlap between different predicates' subpredicates — which is unfortunate as there are upwards of 4,527 frames in FrameNet and 3,635 in PropBank. This pairwise relating of the syntactic elements in n-ary predicates or frames should result in more useful structured knowledge. This is a first attempt at a post-SRL/pre-NLG format.

Friday, April 13, 2007

Ontology and File Compression

Thinking on this post-SRL/pre-NLG ontology made me realize there was no easy way to compare different models.

If we look at rule systems, ontology and taxonomy as interoperating towards efficiently storing knowledge, then there might be a metric. That is, if system A compresses the same knowledge set better than system B and is more computationally efficient (in decompression/utilization), then we can say that system A is superior to system B (on that set) without resorting to aesthetics or philosophy. We have SUMO, the CYC upper ontology, ISO 15926, and others designed around real-world data and it's difficult to rank them.

The metaphor of file compression to knowledgebases might allow competition between differing methods. As systems are envisioned that mechanically generate rules, ontological structure or taxonomy (optimizing generators that create a system for a given knowledgebase), these metrics may be of use in comparing the resulting generated systems. Personally, I think it would be interesting to have algorithms that compress knowledgebases like tar, zip and 7zip do to files. Unfortunately, this approach is storage- and speed-based and doesn't take interface considerations into account, for example sets of things that are categorized for navigation.

Here's a link to a paper describing a relationship between AI (my field of research) and file compression. Apparently, there's a prize for compressing Wikipedia.

Semantic Role Labeling

Semantic Role Labeling (SRL) appears to be the algorithmically independent term for parsing sentences into structures like PropBank or FrameNet.

I'm investigating an ontology to store the SRL'd structure that can additionally be used for NLG. I'm looking at loom, kpml, cypher and others to see if there's any overlap. It'd be “easier” to code if there were one format to and from natural language. Not sure if this SRL'd/pre-NLG format would work well for machine reasoning (which is ideally what'd be stored in a db).

[1] CoNLL-2005 Shared Task: Semantic Role Labeling

[2] CCG: Semantic Role Labeling Demo

Head-Driven Phrase Structure Grammar

I'm thinking that the PropBank style of parsing (Redwoods treebank) is more readily converted to structured knowledge than part-of-speech tagging. However, looking over the initial results of some software, it appears that some style of recursion would be of use in capturing all the structure (substructure) of a sentence. Some arguments to the main predicate appear to have discernible structure remaining; if an argument to the predicate could itself be a predicate, then this would capture as much structure as possible. I'm also noticing the inability of this style of parser to capture parallel predicates in some sentences that use logical connectives. Different levels in the recursion should be able to reuse arguments from across the sentence. I may have to code up a prototype to obtain as much semantic structure as possible, possibly outputting a set of these parses that capture it in parallel.

Also looking into initializing this style of parser with the POS-style to discern the main verb and then the remainder in order. After entity recognition and string concatenation of multiword nouns, parsers seem to function more accurately. I'm going to look at the parse trees for verb hierarchy and SBAR information to construct recursive predicate structures, utilize the NP information to bootstrap entity recognition, and post here which code appears “easiest” to build from.

[1] Miyao, Y. and Tsujii, J. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. In Proceedings of the 20th international Conference on Computational Linguistics (Geneva, Switzerland, August 23 - 27, 2004). International Conference On Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, 1392.

[2] Towards Parsing Unrestricted Text into PropBank Predicate-Argument Structures

Thursday, April 12, 2007

Natural Language and Semantic Web

Been browsing the Web for various tools in natural language processing (NLP) and natural language generation (NLG). Presently looking at cypher, heart of gold, gate, stanford parser, charniak parser, enju, opennlp, lkb, assert, halogen, kpml and a couple of others. Also looking at a couple of methods of storing structured knowledge from a parse, including Berkeley's FrameNet ontology and some that came with halogen and kpml. Hoping to throw these tools at the Wikipedia dataset (9.74 GB) and post some results.