I am using syntaxnet in Spanish and I have found that all the words have a field called "feats" whose format depends on the type of word (noun, pronoun, verb). There are some fields whose meaning is obvious, but in other cases I cannot figure out what it is showing. For example, this is the case of fields such as "fPOS" or "Case" in pronouns. Is there any guide or list with explanations avaiable?
Universal features are documented here .
Related
I have a text field for tags. For example some entities:
{"tags": "apple. fruits. eat."}
{"tags": "green apple."}
{"tags": "banana. apple."}
I want to select entities with tag apple, not green apple or smth apple smth. Different variants lead to the one point: select a sentence with existing expression and it doesn't matter how this sentence looks like. But in this case it's matter.
How can I do it by using Lucene syntax or Azure Search tools? Or (in general) how can I search for a completely same sentence?
I presume that the "." is a deliminator for the different tags. There may be a way to express this in lucene, but you may need to add some custom analyzers to preserve the "."'s in tokenization.
A better strategy in this case would be use use a field of type Collection(Edm.String). This will allow you to better preserve structure the phrases for the tags, and you can use a filter to select the specific value of "apple". Collection(Edm.String) also allows you to enable faceting of the tags which is useful.
We are using Elastic Search and as part of a requirement we want to be able to distinguish hits generated by the synonym filter from those that are not because of synonyms.
For example if we had a query such as:
(car AND red) AND (NOT ford)
With synonym: color <-> red
Then we want to know:
[the red car] is a simple hit.
But,
[the color of the car] is a hit caused by the synonym filter.
Our synonym filter is defined as follows:
synonym_filter :
type : synonym
synonyms_path : synonyms.txt
ignore_case : true
expand : true
format : solr
Since the synonym filter does its work by modifying the token stream at index time there might not be a straightforward way to do this. Perhaps by using the highlighting functionality there might be an algorithm.
I was wondering if anybody has experience with this kind of solution or if a clever solution exists for this requirement. Thank you in advance.
I believe the best solution would be to search content with synonyms separately from content without.
That is, if you are applying the SynonymFilter at index time, then index the content twice, once without synonyms, and once with synonyms (and possibly any other filters to facilitate a broader search). You could then either run separate queries against the two fields, or you could run a single query with matches against the more direct field significantly boosted.
I m developing question answering system in java,in tht I have created templates manually which will be match to user asked question.
Problem is after pre processing i have list of
Keywords and these keywords I want to match with keywords in stored template to filter search.is there any algorithm?
Ex.ques. wht is features of java?
Keywords-features java
Extract Templates containing keywords features and java.
the thing which i have understood from your question you have some keywords and some other patterns containing those keywords in your lexicon and some others which come to your system from users question. then you need an algorithm to find these patterns in your system input.
as i know in java if you define a pattern from Pattern class then you can do like below to achieve the thing which you want:(A simple example)
Pattern pat = Pattern.compile("[A-Z]+");
Matcher matcher = pat.matcher("ABCD");
if(matcher.matches()) {
System.out.println("it matchs.");
}
I admit that I havent searched extensively in the SO database. I tried reading the natural npm package but doesnt seem to provide the feature. I would like to know if the below requirement is somewhat possible ?
I have a database that has list of all cities of a country. I also have rating of these cities (best place to live, worst place to live, best rated city, worsrt rated city etc..). Now from the User interface, I would like to enable the user to enter free text and from there I should be able to search my database.
For e.g Best place to live in California
or places near California
or places in California
From the above sentence, I want to extract the nouns only (may be ) as this will be name of the city or country that I can search for.
Then extract 'best' means I can sort is a particular order etc...
Any suggestions or directions to look for?
I risk a chance that the question will be marked as 'debatable'. But the reason I posted is to get some direction to proceed.
[I came across this question whilst looking for some use cases to test a module I'm working on. Obviously the question is a little old, but since my module addresses the question I thought I might as well add some information here for future searchers.]
You should be able to do what you want with a POS chunker. I've recently released one for Node that is modelled on chunkers provided by the NLTK (Python) and Standford NLP (Java) libraries (the chunk() and TokensRegex() methods, resepectively).
The module processes strings that already contain parts-of-speech, so first you'll need to run your text through a parts-of-speech tagger, such as pos:
var pos = require('pos');
var words = new pos.Lexer().lex('Best place to live in California');
var tags = new pos.Tagger()
.tag(words)
.map(function(tag){return tag[0] + '/' + tag[1];})
.join(' ');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.
Now you can use pos-chunker to find all proper nouns:
var chunker = require('pos-chunker');
var places = chunker.chunk(tags, '[{ tag: NNP }]');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.
Similarly you could extract verbs to understand what people want to do ('live', 'swim', 'eat', etc.):
var verbs = chunker.chunk(tags, '[{ tag: VB }]');
Which would yield:
Best/JJS place/NN to/TO {live/VB} in/IN California/NNP ./.
You can also match words, sequences of words and tags, use lookahead, group sequences together to create chunks (and then match on those), and other such things.
You probably don't have to identify what is a noun. Since you already have a list of city and country names that your system can handle, you just have to check whether the user input contains one of these names.
Well firstly you'll need to find a way to identify nouns. There is no core node module or anything that can do this for you. You need to loop through all words in the string and then compare them against some kind of dictionary database so you can find each word and check if it's a noun.
I found this api which looks pretty promising. You query the API for a word and it sends you back a blob of data like this:
<?xml version="1.0" encoding="UTF-8"?>
<results>
<result>
<term>consistent, uniform</term>
<definition>the same throughout in structure or composition</definition>
<partofspeech>adj</partofspeech>
<example>bituminous coal is often treated as a consistent and homogeneous product</example>
</result>
</results>
You can see that it includes a partofspeech member which tells you that the word "consistent" is an adjective.
Another (and better) option if you have control over the text being stored is to use some kind of markup language to identify important parts of the string before you save it. Something like BBCode. I even found a BBCode node module that will help you do this.
Then you can save your strings to the database like this:
Best place to live in [city]California[/city] or places near [city]California[/city] or places in [city]California[/city].
or
My name is [first]Alex[/first] [last]Ford[/last].
If you're letting user's type whole sentences of text and then you're trying to figure out what parts of those sentences is data you should use in your app then you're making things very unnecessarily hard on yourself. You should either ask them to input important pieces of data into their own text boxes or you should give the user a formatting language such as the aforementioned BBCode syntax so they can identify important bits for you. The job of finding out which parts of a string are important is going to be a huge one for you I think.
I have stemming enabled in my Solr instance, I had assumed that in order to perform an exact word search without disabling stemming, it would be as simple as putting the word into quotes. This however does not appear to be the case?
Is there a simple way to achieve this?
There is a simple way, if what you're referring to is the "slop" (required similarity) as part of a fuzzy search (see the Lucene Query Syntax here).
For example, if I perform this search:
q=field_name:determine
I see results that contain "determine", "determining", "determined", etc.. If I then modify the query like so:
q=field_name:determine~1
I only see results that contain the word "determine". This is because I'm specifying a required similarity of 1, which means "exact match". I can specify this value anywhere from 0 to 1.
Another thing you can do is index the same text without stemming in one field, and with stemming in another. Boost the non-stemmed field & that should prefer exact versions of words to stemmed versions. Of course you could also write your own query parser that directs quoted phrases to the non-stemmed field only.