Solr - Include a field only if other fields were found - search

Assuming I have the fields
textFieldA
textFieldB
specialC
in my index. Now I want to query these with
textFieldA:"searchVal" textFieldB:"searchVal" specialC:"somecode"
But I only want to boost matches on specialC if there were also matches on at least one of the other fields.
Example:
DocumentA:
textFieldA:"This is a test" textFieldB:"for clarification" specialC:"megacode"
DocumentB:
textFieldA:"Doesnt contain" textFieldB:"searched word here" specialC:"megacode"
DocumentC:
textFieldA:"But this again" textFieldB:"contains test" specialC:"supercode"
Now when searching for example with
textFieldA:"test" textFieldB:"test" specialC:"supercode"
I want the results
DocumentC
DocumentA
with document C having the highest rank, but document B being excluded.
How can this be achieved?

q=textFieldA:"test" OR textFieldB:"test" OR textFieldA:"test" AND specialC:"supercode" OR textFieldB:"test" AND specialC:"supercode"&bq=(specialC:"supercode")^100
This should return only DocumentC and DocumentA, in the desired order. bq is a boost query: it raises the score of documents that match a given field value; see https://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F for more.
As far as I know, query boosting only works if you actually query for the thing you want to boost (kind of intuitive). That is why I added the last two clauses to the query.
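The request parameters above can also be assembled programmatically. The following is a minimal sketch rather than the answerer's own code: it assumes the edismax query parser (bq is a parameter of the dismax/edismax parsers), adds parentheses to make the intended grouping of the AND clauses explicit, and uses a hypothetical helper name, build_solr_params.

```python
from urllib.parse import urlencode


def build_solr_params(term, code, boost=100):
    """Build q and bq for 'match the text fields, boost on specialC'.

    Hypothetical helper; the field names come from the question.
    """
    text_a = f'textFieldA:"{term}"'
    text_b = f'textFieldB:"{term}"'
    special = f'specialC:"{code}"'
    # Every clause requires a hit on textFieldA or textFieldB, so a document
    # that matches only specialC is never returned.
    q = " OR ".join([
        text_a,
        text_b,
        f"({text_a} AND {special})",
        f"({text_b} AND {special})",
    ])
    return {
        "defType": "edismax",  # assumption: bq needs a dismax-style parser
        "q": q,
        "bq": f"({special})^{boost}",
    }


params = build_solr_params("test", "supercode")
query_string = urlencode(params)  # ready to append to a select URL
```

Because every clause in q includes a text-field match, DocumentB (which matches only specialC) stays out of the result set, while bq lifts DocumentC above DocumentA.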

Related

Irrelevant results returned from view search in arangodb

We have a collection AbstractEvent with a field 'text', which contains 1~30 Chinese characters, and we want to perform a LIKE match with %keyword% at high performance (less than 0.3 seconds for more than 2 million records).
After a bunch of effort, we decided to use a VIEW with the identity analyzer to do this:
FOR i IN AbstractEventView
SEARCH ANALYZER(i.text LIKE '%keyword%', 'identity')
LIMIT 10
RETURN i.text
And here is the definition of view AbstractEventView
{
"name":"AbstractEventView",
"type":"arangosearch",
"links":{
"AbstractEvent":{
"analyzers":[
"identity"
],
"fields":{
"text":{}
}
}
}
}
However, the records returned include irrelevant ones.
The following is an example:
FOR i IN AbstractEventView
SEARCH ANALYZER(i.text LIKE '%速%', 'identity')
LIMIT 10
RETURN i.text
and the result is
[
"全球经济增速虽军官下滑",
"油食用消费出现明显下滑",
"本次国家经济快速下行",
"这场所迅速爆发的情况",
"经济减速风景空间资本大规模流出",
"苜蓿草众人食品物资价格不稳定",
"荤菜价格快速走低",
"情况快速升级",
"情况快速进展",
"四季功劳增速断崖式回落后"
]
油食用消费出现明显下滑 and 苜蓿草众人食品物资价格不稳定 are irrelevant.
We've been struggling with this for days; can anyone help me out? Thanks.
PS:
Why do we not use a FULL-TEXT index?
A full-text index indexes fields by tokenized text, so we cannot match '货币超发' when the keyword is '货', because '货币' is recognized as a word.
Why do we not use FILTER with the LIKE operator directly?
Filtering without an index takes about 1 second, which is not acceptable.

How to extract relationships from a text

I am new to NLP and need guidance on how I can solve this problem.
I am currently implementing a filtering step where I need to brand data in a database as either correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that follows those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[1-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).
This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
"are bigger than" : ">",
"are smaller than" : "<",
"contain" : "LIKE",
...
}
value_desc = {
"numeric characters" : "%[1-9]%",
...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
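As a rough illustration of the lexicon approach, here is a minimal Python sketch. The function name parse_rule, the filler words "which"/"that", and the "values in" prefix handling are assumptions made for this example; the lexicon entries are copied from the answer as-is (note that "%[1-9]%" would not cover the digit 0, so "[0-9]" may be what is actually wanted).

```python
COLUMN_TRIGGER = "Values in the column"

# Lexicons copied from the answer; extend them as new phrasings appear.
RELATION_DICT = {
    "are bigger than": ">",
    "are smaller than": "<",
    "contain": "LIKE",
}
VALUE_DESC = {
    "numeric characters": "%[1-9]%",
}
FILLER_WORDS = ("which", "that")   # assumed syntactic sugar
TARGET_PREFIX = "values in "       # assumed marker for a target column


def parse_rule(sentence):
    """Return (column, operator, target) for one condition sentence."""
    if not sentence.startswith(COLUMN_TRIGGER):
        raise ValueError(f"unrecognised sentence: {sentence!r}")
    rest = sentence[len(COLUMN_TRIGGER):].strip()
    column, _, rest = rest.partition(" ")

    # Drop the filler word that links the column to the relationship.
    first, _, remainder = rest.partition(" ")
    if first in FILLER_WORDS:
        rest = remainder

    for phrase, symbol in RELATION_DICT.items():
        if rest.startswith(phrase):
            target = rest[len(phrase):].strip()
            break
    else:
        raise ValueError(f"no known relationship in: {sentence!r}")

    # The target is either another column, a described value, or a literal.
    if target.startswith(TARGET_PREFIX):
        target = target[len(TARGET_PREFIX):]
    target = VALUE_DESC.get(target, target)
    return column, symbol, target


rules = [
    "Values in the column ID which are bigger than 99",
    "Values in the column EndDate that are smaller than values in StartDate",
    "Values in the column Name that contain numeric characters",
]
conditions = [" ".join(parse_rule(r)) for r in rules]
```

Each sentence is reduced to the three fields named in the answer (column, relationship, target), which can then be formatted into whatever condition syntax the surrounding program expects.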

Elasticsearch get a selection of predefined types as result in one query

I've got an Elasticsearch index with a large set of product properties. They all look like this:
{'_id':1,'type':'manufacturer','name':'Toyota'},
{'_id':2,'type':'color','name':'Green'},
{'_id':3,'type':'category','name':'SUV Cars'},
{'_id':4,'type':'material','name':'Leather'},
{'_id':5,'type':'manufacturer','name':'BMW'},
{'_id':6,'type':'color','name':'Red'},
{'_id':7,'type':'category','name':'Cabrios'},
{'_id':8,'type':'material','name':'Steel'},
{'_id':9,'type':'category','name':'Cabrios Hardtop'},
{'_id':10,'type':'category','name':'Cabrios Softtop'},
... and 1 million more ...
There are 4 different types of product properties: categories, manufacturers, colors and materials.
The question: how can I get the best-matching result for each type with only one query (a fixed performance requirement)?
So if I run a full-text search query, e.g. "Green Toyota Cabrios", I should get the following results:
{'_id':2,'type':'color','name':'Green'},
{'_id':1,'type':'manufacturer','name':'Toyota'},
{'_id':7,'type':'category','name':'Cabrios'},
{one matching result of the 'material'-type if found by the query}
That would be the perfect result set: always at most 4 results (one result per 'type'). If there is no matching result for a specific type, just 3 result items should be returned.
How is that possible with Elasticsearch? Thanks for your ideas!
I don't clearly understand your use case. What are you actually indexing?
If you index cars, you should index them like this:
{
"color": "Green",
"manufacturer": "Toyota",
"category": "Cabrios"
}
That said, from the question you ask:
You can probably define your fields as not_analyzed. That way, if you search for "Green Toyota Cabrios" in the field "name", you won't get "Cabrios Hardtop".
I'm not sure I really answered the question, but I don't see your use case...
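For the "one result per type in a single request" part of the question, one option (not something the answer above proposes) is a terms aggregation on type with a top_hits sub-aggregation of size 1. The sketch below only builds the request body as a Python dict. It assumes that type is mapped so that whole values form the buckets (for example not analyzed), that the Elasticsearch version in use supports top_hits, and the names by_type, best and best_hit_per_type_body are made up for this example.

```python
import json


def best_hit_per_type_body(text, max_types=4):
    """Request body: full-text match on 'name', best hit per 'type' bucket."""
    return {
        "size": 0,  # ordinary hits are not needed, only the buckets
        "query": {"match": {"name": text}},
        "aggs": {
            "by_type": {
                # One bucket per distinct type among the matching documents.
                "terms": {"field": "type", "size": max_types},
                "aggs": {
                    # top_hits sorts by score by default, so size 1 keeps
                    # the best-scoring document of each bucket.
                    "best": {"top_hits": {"size": 1}}
                },
            }
        },
    }


body = best_hit_per_type_body("Green Toyota Cabrios")
payload = json.dumps(body)  # what would be sent to the _search endpoint
```

A type with no matching document produces no bucket, which corresponds to the "only 3 result items" case described in the question.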

How do I get all hits from a cts:search() in Marklogic

I have a collection containing lots of documents.
When I search the collection, I need to get a list of matches independent of documents. If I search for the word "pie", I get back a list of documents, properly sorted by relevance. However, some of these documents contain the word "pie" in more than one place. I would like to get back a list of all matches, regardless of the document where each match was found. This list of all hits also needs to be sorted by relevance (weight), again totally independent of the document (not grouped by document).
Following code searches and returns matches grouped by the document...
let $searchfor := "pie"
let $query := cts:and-query((
cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))
let $resultset := cts:search(fn:collection("docs"), $query)[0 to 100]
for $n in $resultset
return cts:score($n)
What I need is $n to be the "match-node", not a "document-node"...
Thanks!
Document relevance is determined by TFIDF. Matches contribute to a document's score but don't have scores relative to each other. cts:search already returns results ordered by document relevance, so you could do this to get match nodes ordered by their ancestor document score:
let $searchfor := "pie"
let $query := cts:and-query((
cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))
return
cts:search(//(title|para),$query)[0 to 100]/cts:highlight(.,$query,element match {$cts:node})//match/*
You need to split the document (fragment it) into smaller documents. Every text node could be a document, with its original XPath stored so that the context is not lost.
I recommend that you look at the Search API (http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf and http://community.marklogic.com/pubs/5.0/apidocs/SearchAPI.html). This API will give you what you want, providing match nodes as well as the URIs of the actual documents. You should also find it easier to use for the general cases, although there will be edge cases where you need to fall back to cts:search.
search:search is the specific function you will want to use. It will give you back responses similar to this:
<search:response total="1" start="1" page-length="10" xmlns=""
xmlns:search="http://marklogic.com/appservices/search">
<search:result index="1" uri="/foo.xml"
path="fn:doc(&quot;/foo.xml&quot;)" score="328"
confidence="0.807121" fitness="0.901397">
<search:snippet>
<search:match path="fn:doc(&quot;/foo.xml&quot;)/foo">
<search:highlight>hello</search:highlight></search:match>
</search:snippet>
</search:result>
<search:qtext>hello sample-property-constraint:boo</search:qtext>
<search:report id="SEARCH-FLWOR">(cts:search(fn:collection(),
cts:and-query((cts:word-query("hello", ("lang=en"), 1),
cts:properties-query(cts:word-query("boo", ("lang=en"), 1))),
()), ("score-logtfidf"), 1))[1 to 10]
</search:report>
<search:metrics>
<search:query-resolution-time>PT0.647S</search:query-resolution-time>
<search:facet-resolution-time>PT0S</search:facet-resolution-time>
<search:snippet-resolution-time>PT0.002S</search:snippet-resolution-time>
<search:total-time>PT0.651S</search:total-time>
</search:metrics>
</search:response>
Here you can see that every result has one or possibly more match elements defined.
How would you determine the relevance of a word independent of the document? Relevance is a measure of document relevance, not word relevance. I don't know how one would measure word relevance.
You could potentially return all words ordered by document relevance, then words for each document in "document order" which means the order in which they appear in the document. That would be relatively easy to do with search:search where you iterate over all results and extract each matching word. What would you present with each match? Its surrounding snippet?
Keep in mind that what you're asking for would potentially take a long time to execute.
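The idea of listing every match ordered by its document's score, and in document order within each document, can be sketched on the client side. The Python below is only an illustration: it parses a trimmed copy of the sample response (with the quotes inside attribute values escaped as &quot; so that the XML is well-formed), and the function name flatten_matches is hypothetical.

```python
import xml.etree.ElementTree as ET

SEARCH_NS = "http://marklogic.com/appservices/search"
NS = {"search": SEARCH_NS}

# Trimmed version of the sample search:response shown earlier.
SAMPLE = """<search:response total="1" start="1" page-length="10"
    xmlns:search="http://marklogic.com/appservices/search">
  <search:result index="1" uri="/foo.xml" score="328">
    <search:snippet>
      <search:match path="fn:doc(&quot;/foo.xml&quot;)/foo"><search:highlight>hello</search:highlight></search:match>
    </search:snippet>
  </search:result>
</search:response>"""


def flatten_matches(response_xml):
    """Return (score, uri, path, text) for every match in the response."""
    root = ET.fromstring(response_xml)
    rows = []
    for result in root.findall("search:result", NS):
        score = int(result.get("score"))
        for match in result.iter(f"{{{SEARCH_NS}}}match"):
            text = "".join(match.itertext()).strip()
            rows.append((score, result.get("uri"), match.get("path"), text))
    # Stable sort: matches of one document keep their document order.
    rows.sort(key=lambda row: -row[0])
    return rows


matches = flatten_matches(SAMPLE)
```

Every match inherits the score of its document, which reflects the point made above that relevance is computed per document rather than per word.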

Problem searching MySQL table using MATCH AGAINST

I have a MySQL table containing event data.
On this table, I have a FULLTEXT index, incorporating event_title,event_summary,event_details of types varchar,text,text respectively.
Examples of titles include: "Connections Count", "First Aid", "Health & Safety".
I can search the table as follows:
SELECT * FROM events WHERE MATCH (event_title,event_summary,event_details) AGAINST ('connections');
Which returns the events named "Connections Count" no problem.
However, no matter what I try, I get an empty result set when running the following queries:
SELECT * FROM events WHERE MATCH (event_title,event_summary,event_details) AGAINST ('first aid');
SELECT * FROM events WHERE MATCH (event_title,event_summary,event_details) AGAINST ('first');
SELECT * FROM events WHERE MATCH (event_title,event_summary,event_details) AGAINST ('aid');
I tried renaming an event to "Rich Aid" and could search for that just fine. Also, "First Rich" works great too.
Any ideas of why this is happening or how to fix it would be great!
Thanks for your time.
Rich
"first" is a stopword, and by default words shorter than 4 characters (such as "aid") are not indexed unless you lower the ft_min_word_len setting. After changing that setting, the FULLTEXT index has to be rebuilt.

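To see why the queries in this question behave the way they do, the two rules from the answer can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: a minimum word length of 4 (the default the answer refers to) and a one-word sample of the stopword list; the real MySQL list is much longer, and searchable_terms is a hypothetical name.

```python
FT_MIN_WORD_LEN = 4            # default threshold mentioned in the answer
SAMPLE_STOPWORDS = {"first"}   # tiny sample only; MySQL's list is far longer


def searchable_terms(query):
    """Return the words of a query that a default FULLTEXT search would use."""
    return [
        word
        for word in query.lower().split()
        if len(word) >= FT_MIN_WORD_LEN and word not in SAMPLE_STOPWORDS
    ]


examples = {
    query: searchable_terms(query)
    for query in ("connections", "first aid", "Rich Aid", "First Rich")
}
```

"first aid" leaves no usable term at all, which matches the empty result sets, while "Rich Aid" and "First Rich" are both found through the word "rich".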
Resources