How to use more than one multivalued field in Solr search

I have documents with multivalued fields in my Solr index, and I want to search against these multivalued fields.
When I query with:
http://localhost:8983/solr/demo/select?q=*:*&fq=id:FEAE38C2-ABFF-4F0C-8AFD-9B8F51036D8A
it gives me the following result:
"response": {
  "numFound": 1,
  "start": 0,
  "docs": [
    {
      "created_date": "2016-03-23T13:47:46.55Z",
      "solr_index_date": "2016-04-01T08:21:59.78Z",
      "TitleForUrl": "it-s-a-wonderful-life",
      "modified_date": "2016-03-30T08:45:44.507Z",
      "id": "FEAE38C2-ABFF-4F0C-8AFD-9B8F51036D8A",
      "title": "It's a wonderful life",
      "article": "An angel helps a compassionate but despairingly frustrated businessman by showing what life would have been like if he never exis",
      "Cast": [
        "James Stewart",
        "Donna Reed",
        "Lionel Barrymore"
      ],
      "IsCastActive": [
        "false",
        "true",
        "true"
      ]
    }
  ]
}
As you can see, I have two multivalued fields named "Cast" and "IsCastActive".
My problem is that when I add filters like Cast:"James Stewart" AND IsCastActive:"true", as in:
http://localhost:8983/solr/demo/select?q=*:*&fq=id:FEAE38C2-ABFF-4F0C-8AFD-9B8F51036D8A&fq=Cast:"James Stewart"&fq=IsCastActive:"true"
Solr still returns the same result, even though "James Stewart" is not active in that document. In that case I don't want Solr to return any document for my query.
I think I'm doing something wrong. What's the correct way to do it?

This does not look possible in a straightforward manner in Solr, because the values of the two multivalued fields are not correlated by position at query time. A more effective way would be to use the cast member's name as a key and associate it with the value true or false, then filter on the name as the key; something like James Stewart:["true"]. Alternatively, you can use a single field that stores the cast name and his/her activity status delimited by a colon, something like castInfo:["James Stewart:false","John Sanders:true"]. You can then filter on it with fq=castInfo:"James Stewart:false".
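A minimal sketch of the combined-field idea, assuming you build the documents before sending them to Solr (the function name is hypothetical): pairing each cast name with its status produces a single multivalued castInfo field that a single filter query can match.

```javascript
// Build "Name:status" strings so name and activity status live in one token
// and can be filtered together with a single fq clause.
function buildCastInfo(cast, isCastActive) {
  return cast.map(function (name, i) {
    return name + ":" + isCastActive[i];
  });
}

buildCastInfo(["James Stewart", "Donna Reed"], ["false", "true"]);
// → ["James Stewart:false", "Donna Reed:true"]
```

A filter like fq=castInfo:"James Stewart:true" then matches only documents where that exact name/status pair was indexed.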

I want to propose an alternative solution to your problem: store true/false as integer payloads. The idea is to have a field called cast with a definition in the schema like:
<field name="cast" type="payloads" indexed="true" stored="true"/>
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="integer"/>
  </analyzer>
  <similarity class="payloadexample.PayloadSimilarityFactory" />
</fieldtype>
The content can be indexed for instance as:
James Stewart|0
Donna Reed|1
where 0/1 is true/false.
Using payloads would also allow you to read directly from the posting list, improving your performance on the relevant queries.
Here you can find an example explaining how to achieve what I explained above.

Related

Lowercasing complex object field names in azure data factory data flow

I'm trying to lowercase the field names in a row entry in an Azure Data Factory data flow. Inside a complex object I've got something like:
{
  "field": "sample",
  "functions": [
    {
      "Name": "asdf",
      "Value": "sdfsd"
    },
    {
      "Name": "dfs",
      "Value": "zxcv"
    }
  ]
}
and basically what I want is for "Name" and "Value" to be "name" and "value". However, I can't seem to find any expression that works for the nested fields of a complex object in the expression builder.
I've tried using something like a select transformation with a rule-based mapping, the rule being 1 == 1 and lower($$), but $$ seems to work only for root columns of the complex object and not for the nested fields inside.
As suggested by @Mark Kromer MSFT, to change the case of columns inside a complex type, select the functions at the Hierarchy level.
Please check the below for your reference; here I have used both, and you can see the difference in the results.

ElasticSearch default scoring mechanism

What I am looking for is a plain, clear explanation of how the default scoring mechanism of Elasticsearch (Lucene) really works. I mean, does it use Lucene scoring, or does it use a scoring mechanism of its own?
For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Consider this type of query:
IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
  s.From(0)
   .Size(300)
   .Explain()
   .Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);
which is translated to this JSON query:
{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}
There are about 1.1 million documents that the search is performed on. What I get in return is (this is only part of the result, formatted on my own):
650 "ExampleName" 7,313398
651 "ExampleName" 7,313398
652 "ExampleName" 7,313398
653 "ExampleName" 7,239194
654 "ExampleName" 7,239194
860 "ExampleName of Something" 4,5708737
where the first field is just an Id, the second is the Name field on which Elasticsearch performed its search, and the third is the score.
As you can see, there are many duplicates in the ES index. Since some of the found documents have different scores despite being exactly the same (apart from the Id), I concluded that different shards performed the search on different parts of the whole dataset, which leads me to believe that the score is somehow based on the overall data in a given shard, not exclusively on the document actually being considered by the search engine.
The question is, how exactly does this scoring work? I mean, could you tell me/show me/point me to the exact formula used to calculate the score for each document found by ES? And finally, how can this scoring mechanism be changed?
The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented here. You can customize scoring by configuring your own Similarity, or using something like a custom_score query.
The odd score variation in the first five results shown seems small enough that it doesn't concern me much, as far as the validity of the query results and their ordering, but if you want to understand the cause of it, the explain api can show you exactly what is going on there.
The score variation is based on the data in a given shard (as you suspected). By default ES uses a search type called 'query then fetch', which sends the query to each shard and finds all the matching documents, computing scores using local TF/IDF statistics (these vary with the data on a given shard; here's your problem).
You can change this by using the 'dfs query then fetch' search type: it first pre-queries each shard for term and document frequencies, and then sends the query to each shard, etc.
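The per-shard effect can be sketched with Lucene's classic idf formula, idf = 1 + ln(numDocs / (docFreq + 1)), which is computed from each shard's own statistics (the shard sizes below are made-up numbers for illustration):

```javascript
// Classic Lucene idf: computed per shard, so identical documents on
// different shards can receive different scores.
function idf(numDocs, docFreq) {
  return 1 + Math.log(numDocs / (docFreq + 1));
}

// Shard A: 500k docs, the term appears in 1,000 of them.
const shardA = idf(500000, 1000);
// Shard B: 600k docs, the term appears in 4,000 of them.
const shardB = idf(600000, 4000);

console.log(shardA > shardB); // same term, different idf on each shard
```

With dfs_query_then_fetch, the frequencies are aggregated across shards first, so both shards would use the same idf.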
You can set it in the URL:
$ curl -XGET 'localhost:9200/index/type/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}'
There is a great explanation in the Elasticsearch documentation:
What is relevance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Theory behind relevance scoring:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

couchdb - Map Reduce - How to Join different documents and group results within a Reduce Function

I am struggling to implement a map / reduce function that joins two documents and sums the result with reduce.
First document type is Categories. Each category has an ID and within the attributes I stored a detail category, a main category and a division ("Bereich").
{
  "_id": "a124",
  "_rev": "8-089da95f148b446bd3b33a3182de709f",
  "detCat": "Life_Ausgehen",
  "mainCat": "COL_LEBEN",
  "mainBereich": "COL",
  "type": "Cash",
  "dtCAT": true
}
The second document type is a transaction. The attributes show all the details for each transaction, including the field "newCat" which is a reference to the category ID.
{
  "_id": "7568a6de86e5e7c6de0535d025069084",
  "_rev": "2-501cd4eaf5f4dc56e906ea9f7ac05865",
  "Value": 133.23,
  "Sender": "Comtech",
  "Booking Date": "11.02.2013",
  "Detail": "Oki Drucker",
  "newCat": "a124",
  "dtTRA": true
}
Now I want to develop a map/reduce that produces a result of the form:
e.g. "Name of Main Category", "Sum of all values in transactions".
I figured out that I could reference another document via "_id" and ?include_docs=true, but in that case I cannot use a reduce function.
I looked in other postings here, but couldn't find a suitable example.
Would be great if somebody has an idea how to solve this issue.
I understand that multiple Category documents may have the same mainCat value. The technique called view collation suits some cases where a single join would be used in the relational model. In your case it will not help: although you use two document schemes, you really have a three-level structure: main-category <- category <- transaction. I think you should consider changing the DB design a bit.
Duplicating the data by also storing the mainCat value in the transaction document would help. I also suggest using a meaningful ID for the transaction instead of a generated one. Consider, for example, "COL_LEBEN-7568a6de86e5e" (the mainCat concatenated with some random value, where the - delimiter never appears in the mainCat). Then, with a simple parser in the map function, you emit ["COL_LEBEN", "7568a6de86e5e"] for transactions and ["COL_LEBEN"] for categories, and reduce to get the sum.
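A minimal sketch of the duplicated-mainCat variant, assuming each transaction document carries a mainCat field copied in at write time (field names follow the documents shown above; emit is provided by CouchDB's view engine):

```javascript
// Map: key transactions by their main category; the value is the amount.
const map = function (doc) {
  if (doc.dtTRA && doc.mainCat) {
    emit(doc.mainCat, doc.Value);
  }
};

// Reduce: sum the values per key. CouchDB's built-in `_sum` does the same;
// a hand-written version looks like this.
const reduce = function (keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
};
```

Querying the view with group=true then yields one row per main category with the summed transaction values.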

Searching and match count for phrase with Solr

I am using Solr to index documents, and now I need to search those documents for an exact phrase and sort the results by the number of times this phrase appears in each document. I also have to present the number of times the phrase is matched back to the user.
I was using the following query (here I am searching by the word SAP):
{
  :params => {
    :wt => "json",
    :indent => "on",
    :rows => 100,
    :start => 0,
    :q => "((content:SAP) AND (doc_type:ClientContact) AND (environment:production))",
    :sort => "termfreq(content,SAP) desc",
    :fl => "id,termfreq(content,SAP)"
  }
}
Of course this is a representation of the actual query, that is done by transforming this hash into a query string at runtime.
I managed to get the search working by using content:"the query here" instead of content:the query here, but the hard part is returning and sorting by the termfreq.
Any ideas on how I could make this work?
Obs: I am using Ruby but this is a legacy application and I can't use any RubyGems, I am using the HTTP interface to Solr here.
I was able to make it work by adding a ShingleFilter to my schema.xml.
In my case I started using SunSpot, so I just had to make the following change:
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- This is the line I added -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
  </analyzer>
</fieldType>
After making that change, restarting Solr, and reindexing, I was able to use termfreq(content, "the query here") in my query (q=), in the returned fields (fl=), and even in sorting (sort=).
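Why this works can be illustrated with a small sketch of what a shingle filter produces (a simplified model, not the actual Lucene implementation): with maxShingleSize=4 and outputUnigrams=true, every run of one to four adjacent tokens becomes its own indexed term, so a whole phrase like "the query here" is a single term that termfreq can count.

```javascript
// Simplified illustration of shingling: emit every run of 1..maxSize
// adjacent tokens as a joined term.
function shingles(tokens, maxSize) {
  const out = [];
  for (let i = 0; i < tokens.length; i++) {
    for (let n = 1; n <= maxSize && i + n <= tokens.length; n++) {
      out.push(tokens.slice(i, i + n).join(" "));
    }
  }
  return out;
}

shingles(["the", "query", "here"], 4);
// → ["the", "the query", "the query here", "query", "query here", "here"]
```

Because "the query here" is now a term in the index, termfreq(content, "the query here") simply reads its frequency from the posting list.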
Put debug=results at the end of the Solr URL; it will give you the phrase frequency as well.

SOLR sorting based on date : storing date as toISOString(), whereas toUTCString fails

When storing a date field into Solr, I convert Date() to toISOString() and Solr accepts it. I tried storing it with toUTCString(), but that fails.
Now, while searching, I sort by date. I do get results, but they are not sorted in descending order; instead they come back in mixed order.
I tried specifying a range, using [NOW-1YEAR/DAY TO NOW/DAY+1DAY], but the result is still the same: first I get a 6-day-old document, then a 30-minute-old doc, and then a 2-month-old doc.
What would be the right approach?
EDIT:
Here is the date field that I added in schema.xml:
<field name="message_date" type="date" indexed="true" stored="false" />
and here are the parameters I am sending with each search:
query = "*:*";
var options = {
  fq: '{!geofilt}',
  sfield: 'location',
  pt: latitude + ',' + longitude,
  d: 10,
  sort: ["message_date desc", "geodist() asc"],
  start: 0,
  rows: 10
};
solrclient.query(query, options, function (err, solrRes) {
  ....
});
This is server-side JavaScript (node.js) code.
The above code is fine and working. The problem was that after retrieving the result from Solr, I do a finer search in my database to get more details, and that lookup was not sorted.
So I sorted the result from MongoDB after retrieving the result from Solr, and that worked.
I am using node.js and the solr module at https://github.com/gsf/node-solr.
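The re-sort step can be sketched as follows (a minimal example with the message_date field from the schema above; the function name is hypothetical): since the secondary database lookup does not preserve Solr's order, the merged documents are sorted by date again before being returned.

```javascript
// Re-sort merged results by message_date, newest first, because the
// secondary DB lookup loses Solr's sort order.
function sortByDateDesc(docs) {
  return docs.slice().sort(function (a, b) {
    return new Date(b.message_date) - new Date(a.message_date);
  });
}
```

The same idea works with MongoDB's own sort({ message_date: -1 }) if the date is stored in the secondary database as well.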
