Extracting important words from a sentence using Node.js

I admit that I haven't searched extensively in the SO database. I tried reading about the natural npm package, but it doesn't seem to provide this feature. I would like to know if the requirement below is possible.
I have a database that has a list of all cities of a country. I also have ratings of these cities (best place to live, worst place to live, best rated city, worst rated city, etc.). Now, from the user interface, I would like to let the user enter free text and search my database from it.
For example: Best place to live in California
or places near California
or places in California
From the above sentence, I want to extract only the nouns (perhaps), as these will be the names of the city or country that I can search for.
Then extract 'best', which means I can sort in a particular order, etc.
Any suggestions or directions to look into?
I risk the chance that the question will be marked as 'debatable', but the reason I posted is to get some direction on how to proceed.

[I came across this question whilst looking for some use cases to test a module I'm working on. Obviously the question is a little old, but since my module addresses the question I thought I might as well add some information here for future searchers.]
You should be able to do what you want with a POS chunker. I've recently released one for Node that is modelled on the chunkers provided by the NLTK (Python) and Stanford NLP (Java) libraries (the chunk() and TokensRegex() methods, respectively).
The module processes strings that already contain part-of-speech tags, so first you'll need to run your text through a part-of-speech tagger, such as pos:
var pos = require('pos');
var words = new pos.Lexer().lex('Best place to live in California');
var tags = new pos.Tagger()
    .tag(words)
    .map(function(tag) { return tag[0] + '/' + tag[1]; })
    .join(' ');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.
Now you can use pos-chunker to find all proper nouns:
var chunker = require('pos-chunker');
var places = chunker.chunk(tags, '[{ tag: NNP }]');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.
Similarly, you could extract verbs to understand what people want to do ('live', 'swim', 'eat', etc.):
var verbs = chunker.chunk(tags, '[{ tag: VB }]');
Which would yield:
Best/JJS place/NN to/TO {live/VB} in/IN California/NNP ./.
You can also match words, sequences of words and tags, use lookahead, group sequences together to create chunks (and then match on those), and other such things.
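For instance, assuming sequences follow the same TokensRegex-style syntax as the single-token patterns above (a hypothetical pattern; check the module's documentation for the exact syntax), you might chunk a superlative followed by a noun:

var phrases = chunker.chunk(tags, '[{ tag: JJS }] [{ tag: NN }]'); // hypothetical: superlative + noun, e.g. "Best place"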

You probably don't have to identify what is a noun. Since you already have a list of city and country names that your system can handle, you just have to check whether the user input contains one of these names.
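For instance, a minimal sketch (the city list is a placeholder for whatever your database returns):

var cities = ['California', 'New York City', 'San Francisco']; // loaded from your database

function findKnownPlaces(input) {
    var lower = input.toLowerCase();
    return cities.filter(function(city) {
        return lower.indexOf(city.toLowerCase()) !== -1;
    });
}

findKnownPlaces('Best place to live in California'); // ['California']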

Well, firstly you'll need to find a way to identify nouns. There is no core Node module or anything else that can do this for you; you need to loop through all the words in the string and compare them against some kind of dictionary database, so you can look up each word and check whether it's a noun.
I found this API, which looks pretty promising. You query the API for a word and it sends you back a blob of data like this:
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <result>
    <term>consistent, uniform</term>
    <definition>the same throughout in structure or composition</definition>
    <partofspeech>adj</partofspeech>
    <example>bituminous coal is often treated as a consistent and homogeneous product</example>
  </result>
</results>
You can see that it includes a partofspeech member which tells you that the word "consistent" is an adjective.
Another (and better) option, if you have control over the text being stored, is to use some kind of markup language to identify the important parts of the string before you save it, something like BBCode. I even found a BBCode Node module that will help you do this.
Then you can save your strings to the database like this:
Best place to live in [city]California[/city] or places near [city]California[/city] or places in [city]California[/city].
or
My name is [first]Alex[/first] [last]Ford[/last].
If you're letting users type whole sentences of text and then trying to figure out which parts of those sentences are data you should use in your app, you're making things unnecessarily hard on yourself. You should either ask them to input the important pieces of data into their own text boxes, or give them a formatting language such as the aforementioned BBCode syntax so they can identify the important bits for you. The job of finding out which parts of a free-form string are important is going to be a huge one, I think.
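For example, once the text is marked up, pulling the tagged values back out can be as simple as a regular expression (a sketch; the tag name is whatever you chose when saving):

function extractTag(text, tag) {
    var re = new RegExp('\\[' + tag + '\\](.*?)\\[\\/' + tag + '\\]', 'g');
    var values = [];
    var match;
    while ((match = re.exec(text)) !== null) {
        values.push(match[1]); // the text between [tag] and [/tag]
    }
    return values;
}

extractTag('Best place to live in [city]California[/city]', 'city'); // ['California']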

Related

Finding organization and industry/sector from a string in DBpedia

I am generating a short list of 10 to 20 strings which I want to look up on DBpedia to see if they have an organization tag and, if so, return the industry/sector tag. I have been looking at the SPARQLWrapper queries on their website but am having trouble constructing one that returns the organization and sector/industry for my string. Is there a way to do this?
If I use the code below, I think I get a list of types rather than the industry of the company.
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
SELECT ?industry WHERE
{ <http://dbpedia.org/resource/IBM> a ?industry}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
Instead of looking at queries which are meant to help you understand the querying tool, you should start by looking at the data being queried. For instance, just click http://dbpedia.org/resource/IBM and look at the properties (the left-hand column) to see its rdf:type values (of which there are MANY)!
Note that IBM is not described as an ?industry. IBM is described as a <http://dbpedia.org/resource/Public_company> (among other things). On the other hand, IBM is also described as having three values for <http://dbpedia.org/ontology/industry>:
<http://dbpedia.org/resource/Cloud_computing>
<http://dbpedia.org/resource/Information_technology>
<http://dbpedia.org/resource/Cognitive_computing>
I don't know whether these are what you're actually looking for or not, but hopefully what I've done above will start you down the right path to whatever you do want to get out of DBpedia.
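For instance, a query along these lines should return those three industry values (a sketch reusing the question's SPARQLWrapper setup, with dbo:industry in place of rdf:type):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# Ask for dbo:industry values instead of rdf:type values.
sparql.setQuery("""
    SELECT ?industry WHERE
    { <http://dbpedia.org/resource/IBM> <http://dbpedia.org/ontology/industry> ?industry }
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["industry"]["value"])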

Tokenization not working the same in both cases

I have a document
doc = nlp('x-xxmessage-id:')
When I extract the tokens of this one, I get 'x', 'xx', 'message', 'id' and ':'. Everything goes well.
Then I create a new document
test_doc = nlp('id')
If I try to extract the tokens of test_doc, I get 'i' and 'd'. Is there any way to get past this problem? I want to get the same token as above, since this is creating problems in my text processing.
Just like language itself, tokenization is context-dependent and the language-specific data defines rules that tell spaCy how to split the text based on the surrounding characters. spaCy's defaults are also optimised for general-purpose text, like news text, web texts and other modern writing.
In your example, you've come across an interesting case: the abstract string "x-xxmessage-id:" is split on punctuation, while the isolated lowercase string "id" is split into "i" and "d", because in written text, it's most commonly an alternate spelling of "I'd" or "i'd" ("I would", "I had", etc.). You can find the respective rules here.
If you're dealing with specific texts that are substantially different from regular natural language texts, you usually want to customise the tokenization rules or possibly even add a Language subclass for your own custom "dialect". If there's a fixed number of cases you want to tokenize differently that can be expressed by rules, another option would be to add a component to your pipeline that merges the split tokens back together.
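For instance, a tokenizer special case can pin down how one fixed string is split (a minimal sketch; nlp is assumed to be the English pipeline from the question):

from spacy.symbols import ORTH

# Override the English exception that splits "id" into "i" + "d".
nlp.tokenizer.add_special_case('id', [{ORTH: 'id'}])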
Finally, you could also try using the language-independent xx / MultiLanguage class instead. It still includes very basic tokenization rules, like splitting on punctuation, but none of the rules specific to the English language.
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()

MarkLogic - Tokenize Search Phrase based on XML Field as a dictionary of phrases

I have a list of "known phrases" stored in an XML document under an element named label. I am trying to figure out how to write a function that can tokenize a search phrase into all of its label pieces (if available).
For instance, I have a label for North Korea and one for ICBM.
If the user types in North Korea ICBM, I would expect to get back two tokens, one for each label, as opposed to North, Korea, and ICBM.
In another example, if the user types in New York City, I would expect only one token (label) of "New York City".
If no labels are found, it would return the default tokenization of each word.
I tried to start writing this, but am not sure how to do it properly without a while-loop facility, and am pretty new to XQuery in general.
The code below was how I started, but I quickly realized it would not work for scaling out search terms.
Basically, it checks to see if the full phrase is in the label fields. If it is not, it starts to strip words from the back of the search phrase, checking what's left for a label.
let $label-query := cts:element-value-query(fn:QName('', 'label'), $searchTerm, ('case-insensitive', 'whitespace-sensitive'))
let $results := cts:search(fn:collection('typea'), $label-query)
let $test :=
  if (fn:empty($results)) then
    let $tokens := fn:tokenize($searchTerm, " ")
    let $tokenCount := fn:count($tokens)
    let $lastWord := $tokens[last()]
    let $firstPhrase := $tokens[position() ne last()]
    let $_ :=
      if (fn:count($firstPhrase) = 1) then
        ()
      else
        let $label-query2 := cts:element-value-query(fn:QName('', 'label'), $firstPhrase, ('case-insensitive', 'whitespace-sensitive'))
        let $results2 := cts:search(fn:collection('typea'), $label-query2)
        return
          if (fn:empty($results2)) then
            xdmp:log('second empty')
          else
            xdmp:log($results2)
    let $l := xdmp:log($firstPhrase)
    return $tokens
  else
    let $_ := xdmp:log('full')
    return element {'result'} {$results}
Does anyone have any advice on how I could implement this recursively, or perhaps any alternate strategies? I am essentially trying to say: break this sentence up into all of the phrases that exist in the label fields of the typea collection; if no labels are found, tokenize by word.
Thanks, I look forward to your guidance.
Update to help clarify my ultimate intention.
Below is the document referring to North Korea.
The goal is to parse the search phrase, and use extra information found in these documents to aid in search.
Meaning, if the person types in DPRK or North Korea, they should both search the same way. It should also include narrower labels as an OR condition on the search, and will more likely than not be updated to include other relationships as well (e.g. Kim Jong Un is notably associated with North Korea).
So, in short, I would like to reconcile multi-phrase search terms using the label field and then, if a match was found, use the information from all labels plus the narrower labels from that document.
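A sketch of that expansion step (the label and narrower-label element names, and the $phrase variable, are assumptions, since the referenced document isn't shown in full):

let $match := cts:search(fn:collection('typea'),
                cts:element-value-query(fn:QName('', 'label'), $phrase, 'case-insensitive'))[1]
return cts:or-query(
  (: OR together every label and narrower label of the matched document :)
  for $l in ($match//label, $match//narrower-label)
  return cts:word-query(fn:string($l)))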
Edit 2: I am trying to use cts:highlight to get the phrases. Once I have the phrases, I will do an element lookup to get to the right document, and then get the associated document's data for submission to query building.
The issue now is that cts:highlight does not always return the full phrase under one <phrase> tag.
let $phrases := cts:highlight(<node>New York City FC</node>,
                  cts:or-query((//label)),
                  <phrase>{ $cts:text }</phrase>)
A possible alternative approach, if you are using MarkLogic 9, is to set up a custom tokenization dictionary. See the custom dictionary API documentation and the Search Developer's Guide for details.
But the gist is, if you add an entry "North Korea" in your tokenization dictionary for a language, you'll get it back as a single token for that language. This will apply everywhere: in content or searches.
That said, it isn't clear from your code what you are ultimately trying to achieve with this. If it is more accuracy with phrase searches, there are better ways to achieve this (enabling fast-phrase for 2-word phrases, or word positions for longer ones).
If this is about search parsing only, you could still use the tokenization dictionary approach, but you probably want to use a special language code for it so it doesn't mess up your actual content, and then use cts:tokenize, e.g. cts:tokenize("North Korea ICBM", "xen") where "xen" is your special language code.
Another approach is to use cts:highlight to apply markup to matches to your phrases in the string and go from there:
cts:highlight(<node>North Korea ICBM</node>,
              cts:or-query((//label)),
              <phrase>{$cts:text}</phrase>)
That would embed the markup for any matching phrase: <node><phrase>North Korea</phrase> ICBM</node>
Some care would have to be taken around overlaps, if you want to force a particular winner, by applying the set you want to win first, and then running a second pass with the others.
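One possible shape of that two-pass idea (a sketch; the priority attribute is an assumption, and the second pass uses $cts:node to skip text that the first pass already wrapped in a <phrase>):

let $pass1 := cts:highlight(<node>North Korea ICBM</node>,
                cts:or-query((//label[@priority = 'high'])),
                <phrase>{$cts:text}</phrase>)
return cts:highlight($pass1,
         cts:or-query((//label)),
         if ($cts:node/parent::phrase) then $cts:text
         else <phrase>{$cts:text}</phrase>)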

Treating different characters as equal in Elasticsearch

We are building a search engine with Elasticsearch for internal use in our company. We use a single input field where users can enter their search words (Google-like), so it should be possible to search on different kinds of words separated by spaces.
Because it is possible to search on names, and names can be written in different ways, we would like to treat different characters as equal.
For example, the name "Heymans" can be written as "Hymans", "Heimans", "Hijmans", ...
If a user searches on "Hijmans", "Heymans" should be found, preferably with the same score as when searching on "Heymans".
Is it possible to set "ei", "ij", "ey" as equal values?
We know that there is the synonym feature, but if we do it that way, the scores are very low.
We do not want to set "Hymans", "Heimans", "Hijmans" as synonyms, because there are other names with the same problem...
Thanks for the help!
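One way to express this kind of equivalence (a sketch, not from the original thread; the index and analyzer names are placeholders) is a mapping character filter that normalises the variant digraphs to one canonical form at both index and search time:

PUT /names
{
  "settings": {
    "analysis": {
      "char_filter": {
        "digraph_normalizer": {
          "type": "mapping",
          "mappings": ["ij => ei", "ey => ei"]
        }
      },
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "char_filter": ["digraph_normalizer"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

With this analyzer on the name field, "Hijmans", "Heymans" and "Heimans" should all index and match as the same token and so score identically. Note that the filter rewrites those letter pairs everywhere in the text, which may be too aggressive for fields that aren't names.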

How do I get the most popular tags?

How do I get the most popular tags from the Instagram API? I searched the API, but didn't find a way to retrieve this data. This website gets it, but how?
The question is old but, for others struggling with the same problem: as Sebastien mentioned, there is no such query. However, I recently needed the same functionality and came upon the idea that a small traversal pretty much solves the problem.
Instagram doesn't respond with a list of tags when you query just one letter.
Example:
https://api.instagram.com/v1/tags/search?q=a
This URL returns just one element, which is "a". However, if you send a request containing two characters, like
https://api.instagram.com/v1/tags/search?q=aa
then you'll end up with the most popular tags starting with "aa".
Afterwards you can simply traverse your desired alphabet in O(N^2) time, and by joining the responses you'll end up with a list of the most popular tags.
With the English (Latin) alphabet, that is 26^2 = 676 requests.
You probably shouldn't try going any deeper, though, as the rate limit is still 5,000 requests and 26^3 would come to 17,576.
foreach (first in 'a'...'z')
{
    foreach (second in 'a'...'z')
    {
        // request https://api.instagram.com/v1/tags/search?q= with the
        // prefix first + second, i.e. 'aa'...'az'...'zz'
        // merge the JSON response's tag list into the running array
    }
}
// finally, sort the merged array by media_count
An alternate approach would be to parse some huge dictionary (a Wikipedia dump, for example), sort out the top 5,000 most common prefixes and try querying them.
I don't think the API supports that query. What you can do is check this popular media endpoint and deduce popular tags from there:
http://instagram.com/developer/endpoints/media/#get_media_popular
The website you mention could be using multiple real-time subscriptions and generating that list of popular tags by consolidating the harvested information. That would be my guess.
I think the best thing is to ask them directly.
