How to resolve English sentence verbs semantically - nlp

I am trying to transform English statements into SQL queries.
e.g. "How many products were created last year?"
This should get transformed to:
select count(*) from products where manufacturing_date between '2015-01-01' and '2015-12-31'
I am not able to work out how to map the verb "created" to the "manufacturing_date" attribute in my table. I am using the Stanford CoreNLP suite to parse my statements. I am also using WordNet taxonomies through the JWI framework.
I have tried to map the verbs to the attributes by defining simple rules, but that is not a very generic approach, since I cannot know all the verbs in advance. Is there a better way to achieve this?
I would appreciate any help in this regard.
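For context, the simple-rule approach described above could be made a little more generic by expanding each hand-written verb with its WordNet synonyms, so that unseen verbs such as "make" or "produce" still resolve to the same column. The following is only an illustrative sketch in Python/NLTK rather than Java/JWI, and the seed verbs and column names are assumptions, not a real schema:

from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# Hand-written seed rules: verb -> (table, column). The schema is assumed.
RULES = {
    "create": ("products", "manufacturing_date"),
    "sell": ("orders", "sale_date"),
}

def expand_with_synonyms(rules):
    """Add the WordNet verb synonyms of each seed verb to the rule table."""
    expanded = dict(rules)
    for verb, target in rules.items():
        for synset in wn.synsets(verb, pos=wn.VERB):
            for lemma in synset.lemma_names():
                expanded.setdefault(lemma.lower().replace("_", " "), target)
    return expanded

VERB_TO_COLUMN = expand_with_synonyms(RULES)
print(VERB_TO_COLUMN.get("make"))     # ('products', 'manufacturing_date')
print(VERB_TO_COLUMN.get("produce"))  # ('products', 'manufacturing_date')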

I know this would require a tool change, but I would recommend checking out Adapt by Mycroft AI.
It is a very straightforward intent parser which transforms user input into a JSON semantic representation.
For example:
Input: "Put on my Joan Jett Pandora station."
JSON:
{
    "confidence": 0.61,
    "target": null,
    "Artist": "joan jett",
    "intent_type": "MusicIntent",
    "MusicVerb": "put on",
    "MusicKeyword": "pandora"
}
It looks like the rules are easy to specify and extend, so you would just need to build out your own rules and then have whatever tool you like process the JSON and emit the SQL query.
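For the question above, the same idea would mean registering the trigger verbs and table names as Adapt vocabularies. A rough sketch along the lines of Adapt's documented IntentBuilder API; the entity names, verbs, and tables here are assumptions:

from adapt.intent import IntentBuilder
from adapt.engine import IntentDeterminationEngine

engine = IntentDeterminationEngine()

# Vocabulary: verbs that should resolve to a "creation date" column,
# and the table names we expect to see in questions.
for verb in ["created", "made", "manufactured", "produced"]:
    engine.register_entity(verb, "CreatedVerb")
for table in ["products", "orders", "customers"]:
    engine.register_entity(table, "TableKeyword")

count_intent = IntentBuilder("CountIntent") \
    .require("CreatedVerb") \
    .require("TableKeyword") \
    .build()
engine.register_intent_parser(count_intent)

for intent in engine.determine_intent("How many products were created last year?"):
    if intent.get("confidence", 0) > 0:
        print(intent)
        # e.g. {'intent_type': 'CountIntent', 'CreatedVerb': 'created',
        #       'TableKeyword': 'products', 'confidence': ...}

A downstream step would then map the recognised verb and table onto the corresponding column and build the SQL query.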

Related

How to extract DBPedia categories through DBPedia Spotlight?

I'm trying to extract the types and their respective levels for an entity returned by DBpedia Spotlight. I have already looked in forums and in the GitHub documentation and found nothing. I would like to know a way to do this extraction. Thank you!
Given that your desired root is <http://www.w3.org/2002/07/owl#Thing>, you're actually looking for the rdf:type tree (not Wikipedia Categories, as such).
The typing of <http://dbpedia.org/resource/Semantic_Web> seems a bit odd, so I've used <http://dbpedia.org/resource/Cat> below. You'll note that the data does not always include a tree of the sort you wish.
This will get explicit rdf:type statements --
SELECT ?type
WHERE
{ <http://dbpedia.org/resource/Cat> a ?type
}
-- and this will climb to the top of any rdf:type trees --
SELECT ?type
WHERE
{ <http://dbpedia.org/resource/Cat> a+ ?type
}
A query to build the full tree would be rather more complex, but is entirely possible.
As mentioned here, you may need a SPARQL query like the following to fetch the categories for a DBpedia URI (note the dct: prefix for Dublin Core terms):
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?subject
WHERE { dbr:Semantic_Web dct:subject ?subject }
LIMIT 100
The results can be retrieved in various serializations, for example JSON.
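As an illustration, the public DBpedia endpoint will return the result set as JSON if you ask for that format; here is a minimal sketch using Python's requests library (the endpoint URL and query are as above):

import requests

QUERY = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?subject
WHERE { dbr:Semantic_Web dct:subject ?subject }
LIMIT 100
"""

response = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
response.raise_for_status()

# Each binding holds one ?subject URI, i.e. one dct:subject category.
for binding in response.json()["results"]["bindings"]:
    print(binding["subject"]["value"])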

Stanford CoreNLP API fails to parse some sentences

I have been trying to use the Stanford CoreNLP API included in the 2015-12-09 release. I start the server using:
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
The server works in general, but fails for some sentences, including the following:
"Aside from her specifically regional accent, she reveals by the use of the triad, ``irritable, tense, depressed, a certain pedantic itemization that indicates she has some familiarity with literary or scientific language ( i.e., she must have had at least a high-school education ) , and she is telling a story she has mentally rehearsed some time before."
I end up with a result that starts with:
{"sentences":[{"index":0,"parse":"SENTENCE_SKIPPED_OR_UNPARSABLE","basic-dependencies":
I would greatly appreciate some help setting this up. Am I not including some annotators in the NLP pipeline?
This same sentence works at http://corenlp.run/
If you're looking for a dependency parse (like that in corenlp.run), you should look at the basic-dependencies field rather than the parse field. If you want a constituency parse, you should include the parse annotator in the list of annotators you are sending to the server. By default, the server does not include the parser annotator, as it's relatively slow.
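For example, the annotators can be passed in the properties URL parameter of each request to the server; a small sketch using Python's requests, assuming the server is running on its default port 9000:

import json
import requests

text = "The quick brown fox jumps over the lazy dog."

# Ask explicitly for the constituency parser (parse); depparse produces
# the dependency parse that shows up in the basic-dependencies field.
properties = {
    "annotators": "tokenize,ssplit,pos,lemma,depparse,parse",
    "outputFormat": "json",
}

response = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(properties)},
    data=text.encode("utf-8"),
)
result = response.json()

print(result["sentences"][0]["parse"])               # constituency tree
print(result["sentences"][0]["basic-dependencies"])  # dependency parse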

Java library to get different declensions of a word (NLP?)

For a simple project in Java, I need a library that, given a word, returns a list of its related forms (plural, singular, adjective, etc.).
As an example, something like this:
"Photo" -> "Photo", "Photograph", "Photography"
"Walks" -> "Walk", "Walking" ...
I had a look at libraries like CoreNLP, but I cannot figure out how to achieve this kind of thing. Also, the documentation is rather poor and I can hardly find any good code examples.
Could someone help with this?

English word segmentation in NLP?

I am new to the NLP domain, but my current research needs some text parsing (also called keyword extraction) from URL addresses, e.g. this fake URL:
http://ads.goole.com/appid/heads
Two constraints are put on my parsing:
The first "ads" and the last "heads" should be distinct, because the "ads" in "heads" is just a suffix rather than an advertisement.
The "appid" should be parsed into two parts, "app" and "id", both of which carry semantic meaning on the Internet.
I have tried the Stanford NLP toolkit and the Google search engine. The former tries to classify each word grammatically, which falls short of what I expect. The Google engine shows more smarts about "appid", suggesting "app id".
I cannot look into Google's search history, which presumably suggests "app id" because many people have searched for those words. Are there any offline methods to perform similar parsing?
UPDATE:
Please skip the regex suggestions, because there is a potentially unknown number of word compositions like "appid" even in simple URLs.
Thanks,
Jamin
Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is, for example, a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but it should get you started. They even give URLs as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/
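The core idea in that tutorial (and in Norvig's chapter mentioned in the next answer) is to score every possible split with unigram word probabilities and keep the cheapest one via memoised recursion. Here is a toy sketch with a made-up word list, just to show the shape of the algorithm:

import math
from functools import lru_cache

# Toy unigram counts -- in practice these come from a large corpus.
COUNTS = {"app": 500, "id": 400, "appid": 1, "ads": 300, "heads": 200,
          "head": 250, "google": 900, "goole": 1, "com": 1000, "http": 800}
TOTAL = sum(COUNTS.values())

def word_cost(word):
    # Unknown words get a penalty that grows with their length.
    count = COUNTS.get(word, 0.1 / 10 ** len(word))
    return -math.log(count / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Return (cost, words) for the cheapest segmentation of text."""
    if not text:
        return 0.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_cost, tail_words = segment(tail)
        candidates.append((word_cost(head) + tail_cost, [head] + tail_words))
    return min(candidates)

print(segment("appid")[1])  # ['app', 'id'] with these toy counts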
The Python wordsegment module can do this. It's an Apache2-licensed module for English word segmentation, written in pure Python, and based on a trillion-word corpus.
It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
Installation is easy with pip:
$ pip install wordsegment
Simply call segment to get a list of words:
>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.
>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']
I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.

What does "SEM1:3ENCE_B:NW:NG102:EECT300:120:0900:2" mean?

In my project I am working with teachers and their timetables. I was provided with a text file that contains the teacher timetable from my university. They were unable to tell me what the syntax or format is, so I don't know how to read it and use it in my iPhone app. Can you help me identify what sort of format this is and how I can read it?
Sample:
SEM1:3ENCE_B:NW:NG102:EECT300:120:0900:2
SEM1:3ENCE_B,3ENCE_C:TW:NLG107:EEEL300:120:0900:1
19:3ENCE_A,3ENCE_B,3ENCE_C:TW:CLG.01:EEEL305_L:120:1100:1
19:3ENCE_A,3ENCE_B,3ENCE_C:TW:NLG107:EEEL305:120:0900:1
SEM1:3ENCE_A,3ENCE_B:TW::EEEL300:120:1100:4
SEM1&2:3ENCE_A,3ENCE_B,3ENCE_C,3ENCE_D:SK:CLG.06:EEEL315_L:120:1400:4
SEM1:3CS_A,3CS_B,3CS_C,3CS_D,3ENCE_A,3ENCE_B,3ENCE_C,3ENCE_D:DHE:CLLT:EICG301_L:120:0900:5
SEM1:3CS_A,3CS_B:ABO,DHE:N5.114:EICG301:120:1100:5
SEM1:3CS_A,3CS_B,3CS_C,3CS_D,3ENCE_A,3ENCE_B,3ENCE_C,3ENCE_D:NW:LTS205:EECT300_L:120:1600:2
27:3ENCE_A,3ENCE_B,3ENCE_C,3ENCE_CS::NG100:EEEL320:120:1100:2
SEM1:3CS_A,3CS_B,3CS_C,3CS_D:NW:C2.14:ECSC302_L:120:0900:3
SEM1:3CS_A:NW:NG100:EECT300:120:1400:2
It's not code, it's data. And the best way of interpreting it is to compare this representation with another: think Rosetta Stone.
Obviously, the colon is used to separate the fields, and each line probably represents a single timetable item. Each line appears to have 8 fields on it.
One field looks like a course ID : EECT300
Another looks like a time : 0900
As for the rest, you'll have to work it out...
University of Westminster, maybe...?
It is not a code language.
It is just a plain text file which contains data using colons (:) as separators.
I guess you have to parse it and retrieve the information for each column. You need to know the meaning of each column (if you don't, ask your university).
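Here is a minimal parsing sketch in Python, splitting each line on the colons; the field names below are only guesses from the sample data (semester or week, student groups, lecturer initials, room, module code, duration in minutes, start time, day of week) and should be confirmed with the university:

# The field names are guesses from the sample -- confirm them with the university.
FIELDS = ["semester_or_week", "groups", "lecturer", "room",
          "module", "duration_minutes", "start_time", "day"]

def parse_line(line):
    values = line.strip().split(":")
    record = dict(zip(FIELDS, values))
    # Some fields hold comma-separated lists (e.g. several student groups).
    record["groups"] = record["groups"].split(",") if record["groups"] else []
    return record

sample = "SEM1:3ENCE_B,3ENCE_C:TW:NLG107:EEEL300:120:0900:1"
print(parse_line(sample))
# {'semester_or_week': 'SEM1', 'groups': ['3ENCE_B', '3ENCE_C'], 'lecturer': 'TW',
#  'room': 'NLG107', 'module': 'EEEL300', 'duration_minutes': '120',
#  'start_time': '0900', 'day': '1'}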
