How to use an analyzer in compass-lucene search

How do I add a Compass analyzer while indexing and searching data in Compass? I am using schema-based configuration for Compass. I want to use StandardAnalyzer with no stop words, because I want to index data as it is, without dropping terms like AND, OR, and IN. The default analyzer ignores AND, OR, and IN in the data I give it for indexing.
How do I configure a snowball analyzer, either through code or through XML? An example would be appreciated.

Below is an example. You can also find more details here.
<comp:searchEngine useCompoundFile="false" cacheInvalidationInterval="-1">
    <comp:allProperty enable="false" />
    <!--
        By default, Compass uses StandardAnalyzer for indexing and searching. StandardAnalyzer
        uses certain stop words (stop words are not indexed and hence not searchable) that are
        valid search terms in the data-source world, e.g. 'in' for Indiana, 'or' for Oregon, etc.
        So we need to provide our own analyzer.
    -->
    <comp:analyzer name="default" type="CustomAnalyzer"
        analyzerClass="com.ICStandardAnalyzer" />
    <comp:analyzer name="search" type="CustomAnalyzer"
        analyzerClass="com.ICStandardAnalyzer" />
    <!--
        Disable the optimizer, as we will optimize the index as a separate batch job.
        Also, the merge factor is set to 1000 so that merging doesn't happen at commit time.
        Merging is a time-consuming process and will be done by the batched optimizer.
    -->
    <comp:optimizer schedule="false" mergeFactor="1000"/>
</comp:searchEngine>
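The custom analyzer class referenced above is not shown in the answer. As a rough sketch (not the original poster's code), assuming the older Lucene API bundled with Compass and reusing the class name com.ICStandardAnalyzer from the configuration, it could simply delegate to a StandardAnalyzer built with an empty stop-word set:

package com;

import java.io.Reader;
import java.util.Collections;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// StandardAnalyzer configured with an empty stop-word set,
// so terms such as AND, OR and IN are indexed as-is.
public class ICStandardAnalyzer extends Analyzer {

    private final StandardAnalyzer delegate = new StandardAnalyzer(Collections.EMPTY_SET);

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }
}

For the snowball variant asked about in the question, a similar wrapper around Lucene's org.apache.lucene.analysis.snowball.SnowballAnalyzer (constructed with a language name such as "English") should work, plugged in through the same analyzerClass attribute.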

Related

How to remove a feature from a product?

So I am still learning this amazingly complex system folks call Hybris, or SAP Commerce, along with lots of other names :) I ran into a problem and am looking to learn how to get out of it. I have added four new classification attributes (Utilization, Fit, Material, Function). When I went to add them to the products, I added a space between each attribute name and the numeric code that comes after it:
$feature1=#Utilization, 445 [$clAttrModifiers]; # Style
$feature2=#Fit, 446 [$clAttrModifiers]; # Colour
$feature3=#Material, 447 [$clAttrModifiers]; # Connections
$feature4=#Function, 448 [$clAttrModifiers]; # Function
INSERT_UPDATE Product;code[unique=true];$feature1;$feature2;$feature3;$feature4;$catalogVersion;
;300413166;my;feature;has;a space
The problem is I want to take the space out, as seen in the following code:
$feature1=#Utilization,445 [$clAttrModifiers];# Style
$feature2=#Fit,446 [$clAttrModifiers];# Colour
$feature3=#Material,447 [$clAttrModifiers];# Connections
$feature4=#Function,448 [$clAttrModifiers];# Function
INSERT_UPDATE Product;code[unique=true];$feature1;$feature2;$feature3;$feature4;$catalogVersion;
;300413166;Bottom;Loose;Yam type;Sportswear
When I run both of these scripts together, I get 8 features.
So how do I remove the four features that have spaces in them?
How do I go about actually removing the first set of features?
The ImpEx below removes the ClassAttributeAssignment records, which also removes the ProductFeature entries assigned to all products. However, you need to find the correct classification category code (e.g. clasificationCategory1) that the attribute (e.g. Utilization, 445) belongs to. The classification category is the grouping/header you will find under the product's Attributes.
$classificationCatalog=ElectronicsClassification
$classificationSystemVersion=systemVersion(catalog(id[default=$classificationCatalog]),version[default='1.0'])[unique=true,default=$classificationCatalog:1.0]
$classificationCatalogVersion=catalogversion(catalog(id[default=$classificationCatalog]),version[default='1.0'])[unique=true]
$class=classificationClass(code,$classificationCatalogVersion)[unique=true]
$attribute=classificationAttribute(code,$classificationSystemVersion)[unique=true]
REMOVE ClassAttributeAssignment[batchmode=true];$class;$attribute
;clasificationCategory1;Utilization, 445
;clasificationCategory1;Fit, 446
You can also remove the ProductFeature entries (for all products) like this, but it doesn't remove the ClassAttributeAssignment.
REMOVE ProductFeature[batchmode=true];qualifier[unique=true]
;ElectronicsClassification/1.0/clasificationCategory1.Utilization, 445
Other Reference:
Classification System API: https://help.sap.com/viewer/d0224eca81e249cb821f2cdf45a82ace/1905/en-US/8b7ad17c86691014aa0ee2d228c56dd1.html

Tokenizer Training with StanfordNLP

So my requirement is simple to state. I need the StanfordCoreNLP default models along with my custom-trained model, based on custom entities. In the final run, I need to be able to isolate specific phrases from a given sentence (RegexNER will be used).
Following are my efforts:
EFFORT I:
So I wanted to use the StanfordCoreNLP CRF files, tagger files and NER model files, along with my custom-trained NER models.
I tried to find out whether there is an official way of doing this, but didn't get anything. There is a property "ner.model" for the StanfordCoreNLP pipeline, but it skips the default models if used.
EFFORT II:
Next (might not be the smartest thing ever, sorry! Just a guy trying to make ends meet!), I extracted the model jar stanford-corenlp-models-3.7.0.jar and copied all:
*.ser.gz (Parser Models)
*.tagger (POS Tagger)
*.crf.ser.gz (NER CRF Files)
and tried to set the properties "parser.model", "pos.model" and "ner.model" respectively to comma-separated values, as follows:
parser.model=models/ner/default/anaphoricity_model.ser.gz,models/ner/default/anaphoricity_model_conll.ser.gz,models/ner/default/classification_model.ser.gz,models/ner/default/classification_model_conll.ser.gz,models/ner/default/clauseSearcherModel.ser.gz,models/ner/default/clustering_model.ser.gz,models/ner/default/clustering_model_conll.ser.gz,models/ner/default/english-embeddings.ser.gz,models/ner/default/english-model-conll.ser.gz,models/ner/default/english-model-default.ser.gz,models/ner/default/englishFactored.ser.gz,models/ner/default/englishPCFG.caseless.ser.gz,models/ner/default/englishPCFG.ser.gz,models/ner/default/englishRNN.ser.gz,models/ner/default/englishSR.beam.ser.gz,models/ner/default/englishSR.ser.gz,models/ner/default/gender.map.ser.gz,models/ner/default/md-model-dep.ser.gz,models/ner/default/ranking_model.ser.gz,models/ner/default/ranking_model_conll.ser.gz,models/ner/default/sentiment.binary.ser.gz,models/ner/default/sentiment.ser.gz,models/ner/default/truecasing.fast.caseless.qn.ser.gz,models/ner/default/truecasing.fast.qn.ser.gz,models/ner/default/word_counts.ser.gz,models/ner/default/wsjFactored.ser.gz,models/ner/default/wsjPCFG.ser.gz,models/ner/default/wsjRNN.ser.gz
ner.model=models/ner/default/english.all.3class.caseless.distsim.crf.ser.gz,models/ner/default/english.all.3class.distsim.crf.ser.gz,models/ner/default/english.all.3class.nodistsim.crf.ser.gz,models/ner/default/english.conll.4class.caseless.distsim.crf.ser.gz,models/ner/default/english.conll.4class.distsim.crf.ser.gz,models/ner/default/english.conll.4class.nodistsim.crf.ser.gz,models/ner/default/english.muc.7class.caseless.distsim.crf.ser.gz,models/ner/default/english.muc.7class.distsim.crf.ser.gz,models/ner/default/english.muc.7class.nodistsim.crf.ser.gz,models/ner/default/english.nowiki.3class.caseless.distsim.crf.ser.gz,models/ner/default/english.nowiki.3class.nodistsim.crf.ser.gz
pos.model=models/tagger/default/english-left3words-distsim.tagger
But I get the following exception:
Caused by: edu.stanford.nlp.io.RuntimeIOException: Error while loading a tagger model (probably missing model file)
Caused by: java.io.StreamCorruptedException: invalid stream header: EFBFBDEF
EFFORT III:
I thought I would be able to handle this with RegexNER, and I was successful to some extent. The problem is that the entities it learns through RegexNER are not applied to subsequent expressions. E.g. it will find the entity "CUSTOM_ENTITY" inside a text, but if I write a RegexNER rule like ( [ {ner:CUSTOM_ENTITY} ] /with/ [ {ner:CUSTOM_ENTITY} ] ), it never succeeds in finding the right phrase.
I really need help here! I don't want to train the complete model again; the Stanford folks have over a GB of model information which is useful to me. I just want to add custom entities on top.
First of all, make sure your CLASSPATH has the proper jars in it.
Here is how you should include your custom trained NER model:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.model <csv-of-model-paths> -file example.txt
-ner.model should be set to a comma-separated list of all the models you want to use.
Here is an example of what you could put:
edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz,/path/to/custom_model.ser.gz
Note in my example that all of the standard models will be run, and then finally your custom model will be run. Make sure your custom model is in the CLASSPATH.
You also probably need to add this to your command: -ner.combinationMode HIGH_RECALL. By default the NER combination will only use the tags for a particular class from the first model. So if you have model1,model2,model3 only model1's LOCATION will be used. If you set things to HIGH_RECALL then model2 and model3's LOCATION tags will be used as well.
Another thing to keep in mind, model2 can't overwrite decisions by model1. It can only overwrite "O". So if model1 says that a particular token is a LOCATION, model2 can't say it's an ORGANIZATION or a PERSON or anything. So the order of the models in your list matters.
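If you are building the pipeline from Java instead of the command line, the same settings can be passed as properties. The sketch below is just an illustration of that approach: the property names mirror the command-line flags above, and the custom model path is a placeholder.

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CombinedNerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Default models first, custom model last (placeholder path).
        props.setProperty("ner.model",
                "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,"
                + "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz,"
                + "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz,"
                + "/path/to/custom_model.ser.gz");
        // Let later models fill in tags the earlier models left as "O".
        props.setProperty("ner.combinationMode", "HIGH_RECALL");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = new Annotation("Jane visited Indiana with her yoga teacher.");
        pipeline.annotate(annotation);

        // Print each token with its NER tag.
        List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
        for (CoreLabel token : tokens) {
            System.out.println(token.word() + "\t" + token.ner());
        }
    }
}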
If you want to write rules that use entities found by previous rules, you should look at my answer to this question:
TokensRegex rules to get correct output for Named Entities
From your given context, use this instead of comma-separated values, and try to have all the jars within the same directory:
parser.model=models/ner/default/anaphoricity_model.ser.gz
parser.model=models/ner/default/anaphoricity_model_conll.ser.gz
parser.model=models/ner/default/classification_model.ser.gz
parser.model=models/ner/default/classification_model_conll.ser.gz
parser.model=models/ner/default/clauseSearcherModel.ser.gz
parser.model=models/ner/default/clustering_model.ser.gz
parser.model=models/ner/default/clustering_model_conll.ser.gz
parser.model=models/ner/default/english-embeddings.ser.gz
parser.model=models/ner/default/english-model-conll.ser.gz
parser.model=models/ner/default/english-model-default.ser.gz
parser.model=models/ner/default/englishFactored.ser.gz
parser.model=models/ner/default/englishPCFG.caseless.ser.gz
parser.model=models/ner/default/englishPCFG.ser.gz
parser.model=models/ner/default/englishRNN.ser.gz
parser.model=models/ner/default/englishSR.beam.ser.gz
parser.model=models/ner/default/englishSR.ser.gz
parser.model=models/ner/default/gender.map.ser.gz
parser.model=models/ner/default/md-model-dep.ser.gz
parser.model=models/ner/default/ranking_model.ser.gz
parser.model=models/ner/default/ranking_model_conll.ser.gz
parser.model=models/ner/default/sentiment.binary.ser.gz
parser.model=models/ner/default/sentiment.ser.gz
parser.model=models/ner/default/truecasing.fast.caseless.qn.ser.gz
parser.model=models/ner/default/truecasing.fast.qn.ser.gz
parser.model=models/ner/default/word_counts.ser.gz
parser.model=models/ner/default/wsjFactored.ser.gz
parser.model=models/ner/default/wsjPCFG.ser.gz
parser.model=models/ner/default/wsjRNN.ser.gz
Now copy the above lines, build the other model properties the same way, and paste them into a server.properties file.
If you don't have a server.properties file, create it.
Then use the following command to start your server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -serverProperties server.properties

Finding Related Topics using Google Knowledge Graph API

I'm currently working on a behavioral targeting application and I need a considerably large keyword database/tool/provider that lets applications reach similar keywords via a given keyword. I recently found Freebase, which had been providing a similar service before Google acquired it and integrated it into the Knowledge Graph. I was wondering if it's possible to get a list of related topics/keywords for a given entity.
import json
import urllib

api_key = 'API_KEY_HERE'
query = 'Yoga'
service_url = 'https://kgsearch.googleapis.com/v1/entities:search'
params = {
    'query': query,
    'limit': 10,
    'indent': True,
    'key': api_key,
}
url = service_url + '?' + urllib.urlencode(params)
response = json.loads(urllib.urlopen(url).read())
for element in response['itemListElement']:
    print element['result']['name'] + ' (' + str(element['resultScore']) + ')'
The script above returns the results below, though I'd like to receive topics related to yoga, such as health, fitness, gym and so on, rather than things that have the word "Yoga" in their name.
Yoga Sutras of Patanjali (71.245544)
Yōga, Tokyo (28.808222)
Sri Aurobindo (28.727333)
Yoga Vasistha (28.637642)
Yoga Hosers (28.253984)
Yoga Lin (27.524054)
Patanjali (27.061115)
Yoga Journal (26.635073)
Kripalu Center (26.074436)
Yōga Station (25.10318)
I'd really appreciate any suggestions, and I'm also open to using any other API if there is any that I could make use of. Cheers.
I see your point :) So here's the script I use for that with Serpstat's API. Here's how it works:
The script collects keywords from Serpstat's database
Then it collects search suggestions from Serpstat's database
Finally, it collects search suggestions from Google's suggestions
Note that for the script to work correctly, it's preferable to fill in all input boxes, although not all of them are required.
Keyword — required keyword
Search Engine — the search engine for which the analysis will be carried out. For example, for US Google you need to set g_us. The entire list of available search engines can be found here.
Limit — the maximum number of phrases from the organic results that will take part in the analysis. You cannot set more than 1000 here.
Default keys — list of two-word keywords. You should give each of them some "weight" to receive some kind of result if something goes wrong.
Format: type, keyword, "weight". Every keyword should be written from a new line.
Types:
w — one word
p — two words
Examples:
"w; bottle; 50" — initial weight of word bottle is 50.
"p; plastic bottle; 30" — initial weight of phrase plastic bottle is 30.
"w; plastic bottle; 20" — incorrect. You cannot use a two-word phrase for the "w" type.
Bad words — comma-separated list of words you want the script to exclude from the results.
Token — here you need to enter your token for API access. It can be found on your profile page.
You can download the source code for the script here

Get default stop word list in elastic search

I am trying to find out what the predefined stop word lists for Elasticsearch are, but I have found no documented read API for this.
So I want to find the word lists for these predefined variables (_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_).
I found the English stop word list in the documentation, but I want to check whether it is the one my server really uses, and also check the stop word lists for the other languages.
The stop words used by the English Analyzer are the same as the ones defined in the Standard Analyzer, namely the ones you found in the documentation.
The stop word files for all other languages can be found in the Lucene repository in the analysis/common/src/resources/org/apache/lucene/analysis folder.
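If you just want to dump the English set Lucene ships with (which, per the answer above, is what the _english_ list corresponds to), a small sketch like the following works, assuming a recent Lucene jar on the classpath. Note that it prints Lucene's defaults rather than reading anything back from your running cluster:

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

public class PrintEnglishStopWords {
    public static void main(String[] args) {
        // Default stop-word set used by Lucene's EnglishAnalyzer.
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
        for (Object entry : stopWords) {
            // CharArraySet stores its entries as char[].
            System.out.println(new String((char[]) entry));
        }
    }
}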

Using geospatial commands in rethinkdb with changefeed

Right now I have a little problem:
I want to use geospatial commands (like getIntersecting) together with the changefeed feature of RethinkDB, but I always get:
RqlRuntimeError: Cannot call changes on an eager stream in: r.db("Test").table("Message").getIntersecting(r.circle([-117.220406,32.719464], 10, {unit: 'mi'}), {index: 'loc'})).changes()
The big question is: can I use getIntersecting with changes() (I couldn't find anything related to that in the docs, by the way), or do I have to abandon the idea of using RethinkDB's geospatial features and just use changes() to get ALL added or changed documents and do the geospatial work outside of RethinkDB?
You can't use .getIntersecting with .changes, but you can write essentially the same query by adding a filter after .changes that checks whether loc is within the circle. While .changes limits what you can write before it, you can write basically any query after the .changes and it will work.
r.table('Message')
  .changes()
  .filter(
    r.circle([-117.220406, 32.719464], 10, {unit: 'mi'})
      .intersects(r.row('new_val')('loc'))
  )
Basically, every time there is a change in the table, the update will get pushed to the changefeed, but updates whose loc is outside the circle will get filtered out. Since there is not a lot of support for combining geospatial queries and changefeeds, this is more or less how you would need to integrate the two.
In the future, changefeeds will be much broader and you'll be able to write basically any query with .changes at the end.
