I'm optimistic about the combination of the following technologies I've been reading about:
1) Speech to text
2) Machine reading (NLP)
3) Knowledgebase / web
4) Machine writing (NLG)
5) Text to speech
Maybe someday people will be able to talk to their computers to add to and access collective encyclopedic knowledge.
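As a rough illustration of how those five pieces might fit together, here is a minimal sketch in Python; every function is a placeholder stub standing in for a real speech, NLP, knowledgebase, or NLG component, not any particular tool.

# Minimal sketch of the loop described above; all stages are placeholder stubs.
def speech_to_text(audio): return audio                    # 1) pretend the audio is already a transcript
def machine_read(text): return {"fact": text}              # 2) NLP: text -> semantic structure
def machine_write(fact): return "Noted: " + fact["fact"]   # 4) NLG: structure -> text
def text_to_speech(text): return text                      # 5) a real system would synthesize audio

knowledgebase = []                                         # 3) collective store (here, just a list)

def handle_utterance(audio):
    fact = machine_read(speech_to_text(audio))
    knowledgebase.append(fact)                             # add to collective knowledge
    return text_to_speech(machine_write(fact))             # and speak a confirmation back

print(handle_utterance("the sky is blue"))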
Showing posts with label semantic web. Show all posts
Saturday, April 14, 2007
Wikipedia, Semantic Web
Suppose I build the web-deliverable interface that lets users expand these semantic frames into all possible related entities, refactor some SRL parsers to turn natural language into this semantic data, and point the system at Wikipedia. Weeks later, what does the resulting enormous OWL file (possibly stored numerically) have to do with the Semantic Web?
A quote from Cycorp:
The success of the Semantic Web hinges on solving two key problems: (1) enabling novice users to create semantic markup easily, and (2) developing tools that can harvest the semantically rich but ontologically inconsistent web that will result. To solve the first problem, it is important that any novice be able to author a web page effortlessly, with full semantic markup, using any ontology he understands. The Semantic Web must allow novices to construct their own individual or specialized-local ontologies, without imposing the need for them to learn about or integrate with an overarching, globally consistent, master ontology.
Letting users type natural language is the easiest way to generate semantic markup. Because users prefer natural language, any ontology that software can round-trip natural language through will likely become an overarching (prevalent) one. A possible problem with the approach I'm using is that the ontology of the dataset used to train SRL-based parsers would be, instead of being handcrafted by an expert, a collaborative effort of the people (hopefully experts) visiting a site; a wikiontology is a relatively new idea.
Only after I have the dataset will I be able to say whether the ontology from the planned website is advantageous for machine-reasoning tasks. It shouldn't be terribly difficult to build a benchmark for the consistency of the wiki-generated ontology, possibly using natural language once the parser is completed. Paraphrase corpora and similar instruments, for example, could be of use in both generating and refining it.
Wikipedia is a proof of concept that people can come together to generate collective knowledge resources, so if we get the post-NLP/pre-NLG ontology right (prevalent, as argued above), the Semantic Web may come to resemble a distributed wiki-knowledgebase. The gigabytes of Wikipedia data would be a launching point.
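To make the paraphrase idea concrete, here is a rough sketch of how such a consistency benchmark might be scored. parse_to_frame() is a hypothetical stand-in for the SRL-based parser, the stopword trick is only a crude placeholder for real role labeling, and the paraphrase pairs are invented examples.

# Rough sketch of a paraphrase-based consistency check for the wiki-generated ontology.
def parse_to_frame(sentence):
    # Placeholder: a real implementation would return a predicate plus
    # role-labeled arguments drawn from the wiki ontology. Here, a bag of
    # content words crudely approximates "same meaning, same frame".
    stop = {"the", "a", "was", "were", "is", "by"}
    words = {w.strip(".,").lower() for w in sentence.split()}
    return frozenset(words - stop)

def paraphrase_consistency(pairs):
    """Fraction of paraphrase pairs that map to the same frame."""
    same = sum(1 for a, b in pairs if parse_to_frame(a) == parse_to_frame(b))
    return same / len(pairs)

pairs = [
    ("The committee approved the plan.", "The plan was approved by the committee."),
    ("Rain delayed the launch.", "The launch was delayed by rain."),
]
print(paraphrase_consistency(pairs))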
Friday, April 13, 2007
Ontology and File Compression
Thinking about this post-SRL/pre-NLG ontology made me realize that there is no easy way to compare different models.
If we look at rule systems, ontologies, and taxonomies as interoperating toward efficiently storing knowledge, then there might be a metric: if system A compresses the same knowledge set better than system B and is more computationally efficient in decompression/utilization, then we can say that system A is superior to system B (on that set) without resorting to aesthetics or philosophy. We have SUMO, the Cyc upper ontology, ISO 15926, and others designed around real-world data, and it is difficult to rank them.
The metaphor of file compression for knowledgebases might allow competition between differing methods. As systems are envisioned that mechanically generate rules, ontological structure, or taxonomy (optimizing generators that create a system for a given knowledgebase), these metrics may be of use in comparing the resulting generated systems. Personally, I think it would be interesting to have algorithms that compress knowledgebases the way tar, zip, and 7zip compress files. Unfortunately, this approach is storage- and speed-based and doesn't account for interface concerns, for example, sets of things that are categorized for navigation.
Here's a link to a paper describing a relationship between AI (my field of research) and file compression. Apparently, there's a prize for compressing Wikipedia.
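As a crude illustration of the metric I have in mind, the sketch below compares two hypothetical serializations of the same knowledge using off-the-shelf byte compressors; a real comparison would use ontology-aware encoders and would also measure decompression/utilization time.

# Sketch of a compression-style comparison between two serialized knowledgebases.
# Ordinary byte compressors stand in for ontology-aware encoders; the data is made up.
import bz2, gzip, lzma

def compressed_sizes(data: bytes) -> dict:
    return {
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }

def compare(kb_a: bytes, kb_b: bytes) -> None:
    # On this crude metric, the "better" system is the one whose serialization
    # of the same knowledge compresses smaller.
    for name, kb in (("A", kb_a), ("B", kb_b)):
        print(name, compressed_sizes(kb))

compare(b"(isa Fido Dog) (isa Dog Mammal) " * 100,
        b"Fido is a dog. A dog is a mammal. " * 100)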
Semantic Role Labeling
Semantic Role Labeling (SRL) appears to be the algorithm-independent term for parsing sentences into structures like those of PropBank or FrameNet.
I'm investigating an ontology to store the SRL'd structure that could additionally be used for NLG. I'm looking at LOOM, KPML, Cypher, and others to see if there's any overlap; it would be “easier” to code if there were one format to and from natural language. I'm not sure whether this SRL'd/pre-NLG form would work well for machine reasoning (which is ideally what would be stored in a database). A sketch of one possible container follows the references.
[1] CoNLL-2005 Shared Task: Semantic Role Labeling
[2] CCG: Semantic Role Labeling Demo
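To give a feel for the kind of container I mean, here is an illustrative sketch of a frame with PropBank-style role labels and a trivial NLG method. The field names and the realize() behavior are mine, not drawn from LOOM, KPML, Cypher, or any other tool mentioned above.

# Illustrative container for an SRL'd parse that could also feed NLG.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Frame:
    predicate: str                                       # e.g. the main verb lemma
    roles: Dict[str, str] = field(default_factory=dict)  # e.g. ARG0, ARG1, ARGM-TMP

    def realize(self) -> str:
        # Trivial NLG placeholder: a real generator (KPML, HALogen, ...)
        # would do far more than reorder the arguments.
        return " ".join([self.roles.get("ARG0", ""), self.predicate,
                         self.roles.get("ARG1", "")]).strip()

f = Frame("acquired", {"ARG0": "Google", "ARG1": "YouTube", "ARGM-TMP": "in 2006"})
print(f.realize())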
Head-Driven Phrase Structure Grammar
I'm thinking that the PropBank style of parsing (cf. the Redwoods treebank) is more readily converted to structured knowledge than part-of-speech tagging. However, looking over the initial results of some software, it appears that some form of recursion would help capture all the structure (and substructure) of a sentence. Some arguments to the main predicate appear to have discernible structure remaining; if an argument to a predicate could itself be a predicate, this would capture as much structure as possible. I'm also noticing that this style of parser fails to capture parallel predicates in some sentences that use logical connectives. Different levels of the recursion should be able to reuse arguments from across the sentence. I may have to code up a prototype to obtain as much semantic structure as possible, possibly outputting a set of parses that capture it in parallel; a sketch of the intended structure follows the references below.
I'm also looking into initializing this style of parser with a POS-style parse to discern the main verb and then the remainder in order. After entity recognition and string concatenation of multiword nouns, parsers seem to function more accurately. I'm going to look at the parse trees for verb hierarchy and SBAR information to construct recursive predicate structures, use the NP information to bootstrap entity recognition, and post here which code appears “easiest” to build from.
[1] Miyao, Y. and Tsujii, J. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. In Proceedings of the 20th international Conference on Computational Linguistics (Geneva, Switzerland, August 23 - 27, 2004). International Conference On Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, 1392.
[2] Towards Parsing Unrestricted Text into PropBank Predicate-Argument Structures
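Here is an illustrative sketch of the recursive structure described above, where an argument slot may hold a plain string, a nested predicate, or a list of coordinated predicates (for logical connectives), and arguments can be reused across levels. It is my own illustration, not output from any of the cited parsers.

# Recursive predicate-argument structure: arguments may themselves be predicates.
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Pred:
    head: str
    args: Dict[str, "Argument"] = field(default_factory=dict)

# An argument is a plain string, a nested Pred, or a list of coordinated Preds.
Argument = Union[str, Pred, List[Pred]]

# "She said the board approved the merger and rejected the buyout."
parse = Pred("said", {
    "ARG0": "she",
    "ARG1": [Pred("approved", {"ARG0": "the board", "ARG1": "the merger"}),
             Pred("rejected", {"ARG0": "the board", "ARG1": "the buyout"})],
})
print(parse)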
Labels: hpsg, nlp, propbank, semantic web, sentence parsing
Thursday, April 12, 2007
Natural Language and Semantic Web
I've been browsing the Web for various tools in natural language processing (NLP) and natural language generation (NLG). Presently looking at Cypher, Heart of Gold, GATE, the Stanford parser, the Charniak parser, Enju, OpenNLP, LKB, ASSERT, HALogen, KPML, and a couple of others. Also looking at a couple of methods of storing structured knowledge from a parse, including Berkeley's FrameNet ontology and some that came with HALogen and KPML. Hoping to throw these tools at the Wikipedia dataset (9.74 GB) and post some results.
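The rough shape of that experiment might look like the sketch below: stream articles out of the Wikipedia XML dump and hand each sentence to whichever parser wins out. The dump path and the parse() stub are placeholders rather than any real tool's API.

# Sketch of the planned pass over the Wikipedia dump; parse() is a placeholder.
import xml.etree.ElementTree as ET

def articles(dump_path):
    # Stream <text> elements so the 9.74 GB dump never sits in memory at once.
    for _, elem in ET.iterparse(dump_path):
        if elem.tag.endswith("text") and elem.text:
            yield elem.text
            elem.clear()

def parse(sentence):
    return {"sentence": sentence}      # stand-in for the real SRL parser

def run(dump_path, limit=10):
    results = []
    for i, article in enumerate(articles(dump_path)):
        if i >= limit:
            break
        for sentence in article.split(". "):
            results.append(parse(sentence))
    return results

# Example call (path is hypothetical):
# frames = run("enwiki-latest-pages-articles.xml")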