How to parse the location out of a query string? - search

We're getting a lot of queries like "something in Boston", "something near NY", "something miami fl" and we're looking for the best way to parse this out.

If I interpret your question correctly, you are looking for a way to parse the location/city out of a question?
Since words fly freely in English, the best proposal I have is that you create a table of the most common cities in the country you are interested in, and do a case-insensitive search through the text, scanning for those cities.
I made a quick test implementation in Python, using Wikipedia to extract a list of the cities in the USA, and created a fake question with the name of a city in it. The script reads both the question text and the city list from file and searches for a city, using:
275 cities in the list
question with 145 words
Time for this is shown below:
real 0m0.061s
user 0m0.040s
sys 0m0.016s
Start with a list of the most common cities and their most common misspellings (thanks ted-hop). Then use a simple strategy like:
1. Search for a city in the question.
2. If no city is found, mark the question for manual review; if a city or a misspelling of one turns up in review, add it to the list.
3. Go to 1.
After a couple of iterations you should have a good list that covers most of the cities.
I can post the code if you are interested; it's a really trivial brute-force search in ~12 lines of Python.
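Something along these lines (a minimal sketch rather than the original script; the file names are placeholders):

def load_cities(path="cities.txt"):
    # one city name per line, lower-cased for case-insensitive matching
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

def find_cities(question, cities):
    # return every city from the list that occurs somewhere in the question
    text = question.lower()
    return [city for city in cities if city in text]

if __name__ == "__main__":
    with open("question.txt") as f:
        question = f.read()
    print(find_cities(question, load_cities()))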
Update (since people still seem to read this post)
Have a look at difflib
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
This will probably ease the matching...

In terms of computational linguistics, you are looking for a methodology/technology called "Named Entity Recognition" (NER). There are numerous libraries, systems and solutions that perform NER, which you can find via Google, quite possibly for your chosen development language.
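For example, a minimal sketch with NLTK's built-in chunker (just one of many options; it needs the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources downloaded via nltk.download()):

import nltk

sentence = "I need a plumber near Boston"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# keep chunks labelled GPE (geo-political entity), which is how NLTK tags most place names
locations = [" ".join(word for word, tag in subtree.leaves())
             for subtree in tree.subtrees()
             if subtree.label() == "GPE"]
print(locations)  # expected: ['Boston']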

Related

Given a word get the meaning using WordNet API

I am currently using the WordNet search to get the meanings of words. However, I have a really long list of words and would therefore like to automate this.
For example, given the individual word goat, I want to get the meaning of it as provided by WordNet.
I see questions about getting the root word, hyponyms, etc., but I could not find a proper solution for retrieving the meaning of a given word.
Please let me know the possible options for doing this!
Here is how to get the definition:
from nltk.corpus import wordnet  # requires the WordNet corpus: nltk.download('wordnet')

syns = wordnet.synsets("goat")   # all synsets (senses) for the word
print(syns[0].definition())      # definition of the first listed sense
Output
any of numerous agile ruminants related to sheep but having a beard and straight horns

how to match keywords/phrases in a text?

I have...
a fixed large set (about 1,000,000) of keywords and phrases, like birthday, happy new year, vacation etc.
some variable text between 10 and 500 words.
I'd like to...
identify those keywords/phrases that are present in the text (e.g. Hi John, happy birthday to you. matches birthday), preferably with some information about how many times each one matches
tolerate grammatical variations (vacations should match vacation, countries should match country) and "misspellings" (nodejs == node.js).
In essence, something similar to what Google does for searching (though they probably use far more complicated methods) or what Stack Overflow does for tag matching / searching for answers.
Basically, the user enters some text, and my program should do its best to suggest relevant keywords.
In my case, the algorithm needs to operate mostly on English text, but it should also be applicable to other languages like German, Italian, French, Spanish, ...
Does some Linux / NodeJS library exist that can do that? Or at least a well-known algorithm?
For the first requirement you can simply read the whole set (or read it line by line) and run String.match() against each word you need to search for.
The second is a little trickier: you don't need an exact match, but a measure of how similar two strings are. There are many algorithms that can measure this; for example, take a look at the Levenshtein distance.
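For illustration, a plain dynamic-programming Levenshtein distance in Python (the node.js library mentioned next ships its own implementation, so you would not normally hand-roll this):

def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("vacations", "vacation"))  # 1
print(levenshtein("nodejs", "node.js"))      # 1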
There is a good node.js library that implements all of the above:
https://github.com/NaturalNode/natural
It can tokenize the text, search for an exact or a similar word, and it also implements tf-idf, which is the simplest way a search engine can work.

How do I get the context of a sentence?

There is a questionnaire that we use to evaluate the student knowledge level (we do this manually, as in a test paper). It consists of the following parts:
Multiple choice
Comprehension questions (e.g., Is a spider an insect?)
Now I have been given the task of making an expert system that will automate this. Basically we have a proper answer for each question, but my problem is the comprehension questions: I need to compare the context of the student's answer to the context of the correct answer.
I already did an initial search for an answer, but it seems like this is really a big task. What I have found so far is that I could do this through NLP, which is really new to me. Also, if I'm not mistaken, it seems that I would have to find a dictionary of all the words that could possibly appear in an answer.
Am I on the right track? If not, please suggest what I should do (what should I study?) or give me some links to the materials I need. Also, should I make my own dictionary? The words that I will be using are in the Filipino language.
Update: Comprehension question
The comprehension section of the questionnaire contains one paragraph explaining a certain scenario. The questions are fairly simple. Here is an example:
Bonnie's uncle told her to pick apples from the tree. Picking up a stick, she poked the fruits so they would fall. In the middle of doing this, a strong gust of wind blew. Due to her fear of the fruits falling on top of her head, she stopped what she was doing. After this, though, she noticed that the wind had caused apples to fall from the tree. These fallen apples were what she brought home to her uncle.
The questions are:
What did Bonnie's uncle tell her to do?
What caused Bonnie to stop picking apples from the tree?
Is Bonnie a good fruit picker? Please explain your answer.
The possible answers that the answer key states are:
For number 1:
1.1 Bonnie's uncle told her to pick apples from the tree
1.2 Get apples
For number 2:
2.1 A strong gust of wind blew
2.2 She might get hit in the head by the fruits
For number 3:
3.1 No, because the apples she got were already on the ground
3.2 No, because the wind was what caused the fruits to fall
3.3 Yes, because it is difficult to pick fruits when it's windy.
3.4 Yes, because at least she tried
Now, these are the answers that were given to me. The system's job is to compare the context of the student's answer to the context of the correct answer so that it can grade the student's answer.
One simplistic way of doing this that I can think of (off the top of my head) is to use a string similarity metric like cosine or Jaccard to identify whether certain keywords appear in both the test answer and the known correct answer.
Extracting these keywords automatically could be done with part of speech tagging using NLP. For example, you could extract all nouns (and possibly verbs). Then, representing each answer as a vector of keywords, you could compare the test vector with the known correct vector.
For example, in the second question, the vector for the two possible answers could be
gust, wind, blew
hit, head, fruits
An answer like "she picked up a stick", with the keywords picked, stick, would have a very low score compared to something like "afraid of fruit falling on her head", with the keywords fruit, falling, head.
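A minimal sketch of that comparison, using NLTK for the POS tagging and Jaccard similarity over the extracted noun/verb sets (the sentences are just for illustration):

import nltk  # needs the punkt and averaged_perceptron_tagger resources

def keywords(sentence):
    # keep nouns (NN*) and verbs (VB*) as the keyword set
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return {word.lower() for word, tag in tagged if tag.startswith(("NN", "VB"))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

correct = keywords("A strong gust of wind blew")
print(jaccard(correct, keywords("The wind blew strongly")))   # relatively high
print(jaccard(correct, keywords("She picked up a stick")))    # near zero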
Notes:
This can detect only wildly wrong answers. Wrong answers containing the right keywords would not be detected by this technique. :)
I'm not sure about non-English sentences. If that is the case, you might want to take every word in the answer as a keyword (after removing stopwords). This question might help as well.

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
Also, I can think of having a lookup hash table with names of countries and cities and then comparing every token extracted from the text against that table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
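For example, NLTK ships a wrapper around the Stanford tagger; a rough sketch (the jar and model paths are placeholders that depend on where you unpacked the Stanford NER download):

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

st = StanfordNERTagger(
    "/path/to/english.all.3class.distsim.crf.ser.gz",  # model file (placeholder path)
    "/path/to/stanford-ner.jar")                        # jar file (placeholder path)

tagged = st.tag(word_tokenize("Meet me near Central Park in New York tomorrow"))

# group consecutive LOCATION tokens into location phrases
locations, current = [], []
for word, label in tagged:
    if label == "LOCATION":
        current.append(word)
    elif current:
        locations.append(" ".join(current))
        current = []
if current:
    locations.append(" ".join(current))
print(locations)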
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitively, make sure the case of your list is already normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
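A minimal sketch of this idea, using bisect for the binary search and extending the phrase word by word so multi-word locations are covered (punctuation handling is deliberately naive):

from bisect import bisect_left

def find_locations(text, locations):
    # locations: lower-cased location names (possibly multi-word), sorted
    words = text.lower().replace(",", " ").split()
    found = []
    for i in range(len(words)):
        phrase = ""
        for word in words[i:]:
            phrase = word if not phrase else phrase + " " + word
            pos = bisect_left(locations, phrase)
            if pos < len(locations) and locations[pos] == phrase:
                found.append(phrase)   # exact (possibly multi-word) match
            if pos == len(locations) or not locations[pos].startswith(phrase):
                break                  # nothing in the list starts with this phrase
    return found

places = sorted(["boston", "new york", "miami", "people's republic of china"])
print(find_locations("something near New York, or maybe Miami", places))  # ['new york', 'miami']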
How fast are the tweets coming in? Is it the full Twitter firehose or some filtered queries?
A slightly more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well on tweets because of all the leet speak. The NLP can be tuned for precision or recall, depending on your needs, to limit how many lookups are performed in the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.

Streets recognition, deduction of severity

I'm trying to analyse a set of phrases, and I don't know exactly how natural language processing can help me; perhaps someone can share their knowledge with me.
The objective is to extract streets and localizations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way of parsing it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis which will extract nouns (for example). But where does a street name begin and where does it end? I assume that I will need to compare that analysis against a streets database, but I don't know which method is optimal.
Second, I would like to deduce the level of severity, for example of car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word deceased appears, +100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First, the extraction of the streets themselves. [...] But where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice, I assume that a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation of your NLP library for that.
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
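As a rough sketch of the classifier idea above, a bag-of-words baseline with scikit-learn could look like this; the training phrases and severity labels are invented toy examples, so with this little data the predictions are only illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "minor collision, no injuries reported",               # toy examples, not real data
    "driver slightly hurt after rear-end crash",
    "two people seriously injured in head-on collision",
    "one person deceased after car crash on the highway",
]
train_labels = ["low", "low", "high", "high"]

vectorizer = CountVectorizer()
clf = LinearSVC()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

print(clf.predict(vectorizer.transform(["pedestrian severely injured by car"])))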
