Remove duplicates without considering position

Is there a filter factory that can remove duplicate tokens without considering their positions?
I cannot use the RemoveDuplicatesTokenFilterFactory because it takes positions into account (it only drops a token that duplicates another token at the same position).

I had a similar issue: many fields contained duplicate values, and I wanted each value to appear only once. The solution was to add an update processor to the solrconfig.xml file, as in the example below. Every value in the listed fields will be unique. My field names are ingredient_substance, active_moiety ...
<updateRequestProcessorChain>
  <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>ingredient_substance</str>
      <str>active_moiety</str>
      <str>generic_medicine</str>
      <str>inactive_ingredient_substance</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
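To make the effect of that processor concrete, here is a minimal client-side sketch in Python of the same idea: keep the first occurrence of each value in the listed multi-valued fields. The function name unique_field_values and the sample document are hypothetical, and this is only an illustration of the behaviour described above, not Solr's implementation.

```python
def unique_field_values(doc, fields):
    """Return a copy of doc in which each listed multi-valued field
    keeps only the first occurrence of every value (order preserved)."""
    result = dict(doc)
    for name in fields:
        values = result.get(name)
        if isinstance(values, list):
            # dict.fromkeys keeps insertion order and drops later duplicates
            result[name] = list(dict.fromkeys(values))
    return result


# Hypothetical document with repeated values in two fields
doc = {
    "id": "1",
    "ingredient_substance": ["water", "sugar", "water"],
    "active_moiety": ["caffeine", "caffeine"],
}
cleaned = unique_field_values(doc, ["ingredient_substance", "active_moiety"])
print(cleaned)
```

Note that this deduplicates whole field values, not the tokens produced by an analyzer, so it only helps when the duplicates are separate values of a multi-valued field.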

Related

SOLR: how to exclude fields globally?

I'm using Apache Solr for my site search. The website has a large number of pages, and each page has a boolean field called 'searchEnabled' that is either true or false. I want to exclude the disabled pages (those with searchEnabled set to false) from all search results; the site has a number of different searches.
I can use a filter query (fq) to exclude these pages, but my site uses a number of different searches with different queries, and I do not want to add the filter query to every search query across the website. Is there an easy way to exclude all documents whose 'searchEnabled' field is set to false,
so that no Solr search returns the documents/pages where the field value is false?
You can add a parameter that will always be present to solrconfig.xml for the request handler you're making your request against.
By using the name appends for your parameter list, the parameter will always be appended to the other given parameters.
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="appends">
    <str name="fq">searchEnabled:true</str>
  </lst>
</requestHandler>
This will always add a filter query to your requests behind the scenes, limiting the result set to those documents that have searchEnabled set to true.
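As a rough illustration of what "appends" means for the parameters the handler finally sees, the following Python sketch merges a client's query string with an appended fq. The helper effective_params and the sample query string are hypothetical; the merge happens inside Solr, so this only models the outcome described above.

```python
from urllib.parse import parse_qsl, urlencode

# Parameters configured under "appends" in the handler (from the answer above)
APPENDS = [("fq", "searchEnabled:true")]


def effective_params(query_string, appends=APPENDS):
    """Model the parameter list after the appended values are added
    to whatever the client sent; client values are kept, not replaced."""
    params = parse_qsl(query_string, keep_blank_values=True)
    return params + list(appends)


# Hypothetical client request that already carries its own filter query
merged = effective_params("q=shoes&fq=category:sale")
print(urlencode(merged))
```

Because fq can occur several times, the client's own filter and the appended one both apply.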

Adding a single repeating node to target schema using BizTalk mapper

I'm currently working on a BizTalk project and I encountered a problem which I'm sure should be solvable within a mapping. However, I don't seem to be able to figure out how. Hopefully someone can help me out.
The situation is as follows: in both the source and the target schema, there's a repeating node with roughly the same child elements underneath it (no records or attributes involved). The repeating node within the source has a structure like this:
<fruit>
  <item>
    <sort>Apple</sort>
    <size>5cm</size>
    <colour>red</colour>
  </item>
  <item>
    <sort>Pear</sort>
    <size>8cm</size>
    <colour>green</colour>
  </item>
</fruit>
While the repeating node in the target looks more like this:
<FRUIT>
  <SORT>Apple</SORT>
  <SIZE>5cm</SIZE>
  <COLOUR>red</COLOUR>
</FRUIT>
<FRUIT>
  <SORT>Pear</SORT>
  <SIZE>8cm</SIZE>
  <COLOUR>green</COLOUR>
</FRUIT>
What I need the mapping to do, is ensure that if there's any fruit available, there are at least two records 'fruit' available in the target. This should be achieved by adding a generic fruit (just think of some arbitrary pineapple, if you like) to the target, if there's only one fruit available at the source.
As a first step, I tried just adding one more fruit node to the target and failed to do so. I'm pretty confident that if I know how to do this, I can solve the actual problem by combining it with the 'count records' and 'logical equality'(=1) functoids.
So the question boils down to: how do I add a single record to the target inside a mapping?
I've tried several options (and combinations of these), namely:
using a looping functoid over the elements "fruit" or "item", together with some extra value mapping.
directly adding an extra value mapping
adding direct links between "item" or "fruit" and "FRUIT", either with or without a looping functoid in between.
In most of these cases, either I got a result containing children like
<FRUIT>
<SORT>Pear</SORT>
<SORT>Pineapple</SORT>
<SIZE>8cm</SIZE>
<COLOUR>green</COLOUR>
</FRUIT>
(and a corresponding error since this doesn't satisfy the schema - which doesn't allow for multiple elements SORT), or a whole bunch of pineapples. Neither of which is wished for.
Can anyone help me out?
Unfortunately, there's no easy way to do this in the BizTalk mapper. When you use a looping functoid, it will create an xsl:for-each and will create the destination node exactly the number of times it appears in the source schema.
Since your destination nodes aren't that complex, your best bet is probably to use a Scripting Functoid with inline XSLT, using the following:
<xsl:for-each select="/fruit/item">
  <FRUIT>
    <SORT><xsl:value-of select="sort"/></SORT>
    <SIZE><xsl:value-of select="size"/></SIZE>
    <COLOUR><xsl:value-of select="colour"/></COLOUR>
  </FRUIT>
  <xsl:if test="last() = 1"> <!-- there's only one node here, we want at least two -->
    <FRUIT>
      <SORT>pineapple</SORT>
      <SIZE>8cm</SIZE>
      <COLOUR>green</COLOUR>
    </FRUIT>
  </xsl:if>
</xsl:for-each>
Link the output of that functoid to the FRUIT node in your destination schema.
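To check the intended result outside BizTalk, the same "at least two records" rule can be sketched in Python with the standard library's ElementTree. The function name map_fruit is hypothetical, and this is only a model of the XSLT's logic for testing expectations, not a replacement for the map.

```python
import xml.etree.ElementTree as ET

# Source document with a single item, the case that needs a filler record
SOURCE = (
    "<fruit><item><sort>Apple</sort><size>5cm</size>"
    "<colour>red</colour></item></fruit>"
)


def map_fruit(source_xml):
    """Build one FRUIT element per source item; if there is exactly
    one item, append a generic pineapple so the target has two records."""
    items = ET.fromstring(source_xml).findall("item")
    records = []
    for item in items:
        fruit = ET.Element("FRUIT")
        for tag in ("sort", "size", "colour"):
            ET.SubElement(fruit, tag.upper()).text = item.findtext(tag)
        records.append(fruit)
    if len(items) == 1:  # same condition as last() = 1 in the XSLT
        filler = ET.Element("FRUIT")
        for tag, value in (("SORT", "pineapple"), ("SIZE", "8cm"), ("COLOUR", "green")):
            ET.SubElement(filler, tag).text = value
        records.append(filler)
    return records


records = map_fruit(SOURCE)
sorts = [record.findtext("SORT") for record in records]
print(sorts)
```

With zero source items nothing is emitted, and with two or more items no filler is added, which matches the behaviour of the xsl:if inside the for-each.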

Solr conditional query fields (qf)

Is it possible to define query fields in Solr based on certain conditions? For example, I have three fields: text, title and product. The Solr config definition:
<str name="qf">text^0.5 title^10.0 Product</str>
What I'm looking for is to include "product" as a searchable field only when a certain condition is met, e.g. if author:"Tom", then search in Product as well.
Is there a way to do that at query time using edismax?
The alternative I have is to add the product information to either the text or the title of the document (where author=Tom) at index time so that it is searchable. But I'm trying to avoid this if possible.
Any pointers will be appreciated.
-Thanks
To search in different fields based on different conditions, Solr first has to search for each specific condition, so this is more or less the same as issuing multiple queries.
That said, if it has to be done as a single query (e.g. to use out-of-the-box sorting, grouping or other Solr features), nested queries can be used.
To define two different conditions (as in the original question, but it can easily be extended with more OR clauses), the q parameter can receive the following value:
_query_:"{!edismax fq=$fq1 qf=$qf1 v=$condQuery}"
OR
_query_:"{!edismax fq=$fq2 qf=$qf2 v=$condQuery}"
The query uses Parameter Dereferencing, so there is no need to manually escape any special characters before passing the parameters to solr.
fq1 - first special condition
qf1 - list of fields to search in for first special condition (fq1)
fq2 - second special condition
qf2 - list of fields to search in for second special condition (fq2)
condQuery - the actual search term/query
The fq1 may be empty in order to define a baseline (in this particular case - search in text and title, but not in product).
The raw parameters themselves will look like this:
fq1=&qf1=text^0.5 title^10.0&fq2=author:"Tom"&qf2=text^0.5 title^10.0 Product&condQuery=5
And the final query will be something like this:
http://localhost:8983/solr/collection1/select?q=_query_%3A%22%7B!edismax+fq%3D%24fq1+qf%3D%24qf1+v%3D%24condQuery%7D%22+OR+_query_%3A%22%7B!edismax+fq%3D%24fq2+qf%3D%24qf2+v%3D%24condQuery%7D%22&fl=*%2Cscore&wt=xml&indent=true&fq1=&qf1=text^0.5%20title^10.0&fq2=author:%22Tom%22&qf2=text^0.5%20title^10.0%20Product&condQuery=5
Or the same query as echoed back by Solr in its response (provided only to show it in a structured way):
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">_query_:"{!edismax fq=$fq1 qf=$qf1 v=$condQuery}" OR _query_:"{!edismax fq=$fq2 qf=$qf2 v=$condQuery}"</str>
      <str name="condQuery">5</str>
      <str name="indent">true</str>
      <str name="fl">*,score</str>
      <str name="fq1"/>
      <str name="qf1">text^0.5 title^10.0</str>
      <str name="fq2">author:"Tom"</str>
      <str name="qf2">text^0.5 title^10.0 Product</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="..." start="..." maxScore="...">
    ...
  </result>
</response>
Even though it works, I suggest considering the effect on query time (each condition runs as a separate internal search query) and measuring how it affects your specific case.
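The request URL from this answer is tedious to assemble by hand, so here is a small Python sketch that builds it from a list of (fq, qf) pairs using only the standard library. The function name build_conditional_query is hypothetical; the parameter names (fq1, qf1, condQuery and so on) and the host are the ones used in the answer, and the sketch only builds the URL without sending it.

```python
from urllib.parse import urlencode


def build_conditional_query(base_url, term, conditions):
    """Build a select URL with one nested edismax clause per
    (fq, qf) pair, using parameter dereferencing as in the answer."""
    params = {"fl": "*,score", "wt": "xml", "indent": "true", "condQuery": term}
    clauses = []
    for i, (fq, qf) in enumerate(conditions, start=1):
        params[f"fq{i}"] = fq
        params[f"qf{i}"] = qf
        # Doubled braces produce literal { and } in the f-string
        clauses.append(f'_query_:"{{!edismax fq=$fq{i} qf=$qf{i} v=$condQuery}}"')
    params["q"] = " OR ".join(clauses)
    # urlencode percent-escapes quotes, carets and spaces
    return base_url + "?" + urlencode(params)


url = build_conditional_query(
    "http://localhost:8983/solr/collection1/select",
    "5",
    [
        ("", "text^0.5 title^10.0"),  # baseline: no extra condition
        ('author:"Tom"', "text^0.5 title^10.0 Product"),
    ],
)
print(url)
```

Adding a third condition is then a matter of appending another (fq, qf) pair to the list.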
I haven't tried it myself, but it looks like this could be achieved using http://wiki.apache.org/solr/FunctionQuery#Boolean_Functions
Shamik,
I don't think there is an easy way to do this in Solr. One thing to consider is managing these rules over time, too; it would be a nightmare for a large system.
If you really wanted to do something like this, maybe you can issue two calls to Solr to get the result set.

Solr: How to specify field relevancy/weight

I'm currently indexing data using Solr that consists of about 10 fields. When I perform a search I would like certain fields to be weighted higher. Could anyone help point me in the right direction?
For example, searching across all fields for a term such as "superman" should return hits in the "Title" field before the "Description" field.
I've found documentation on how to make one field score higher from the query, but I would prefer to set this in a configuration file or similar. The following would require all searches to specify the weight. Is it possible to specify this in the solr config file?
q=title:superman^2 description:superman
Try using qf with ExtendedDisMax. Your query would then look like this:
q=superman
While your config will look like:
<str name="qf">title^2 description</str>
You can get some working examples here
"The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field's importance in the query. For example, the query below:
qf="fieldOne^2.3 fieldTwo fieldThree^0.4"
assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and fieldThree a boost of 0.4. These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree."
Source: Apache Lucene
In your case: qf="title^100 description" may do the trick - if you're using Solr in a library I'd love to chat.
By using edismax we can achieve what you are looking for.
Try adding these two parameters to your request handler, changing the field names to your own.
You can remove a particular field completely, if you don't want it.
<str name="defType">edismax</str>
<str name="qf">YourField^50 YourAnotherField^30 YetAnotherField</str>
The higher the boost (the number after ^), the more weight matches in that field receive.

Solr - index JSON query string from database?

I would like to know whether it is possible to index data that contains a JSON string, so that the string is decoded and each JSON value is indexed as a separate field.
I am using the DIH to connect to a MySQL database and am able to index the individual columns.
The result currently looks like the following:
<response name="response" numFound="1" start="0" maxScore="2.7143538">
...
<result name="response" numFound="1" start="0" maxScore="2.7143538">
<doc>
<float name="score">2.7143538</float>
<str name="id">82</str>
<str name="name">jorge</str>
<str name="otherinfo">{"day":15,"year":1989,"month":"January"}</str>
</doc>
</result>
</response>
The problem is that "otherinfo" is a JSON string that I would like to decode and have something like the following in my index:
<response name="response" numFound="1" start="0" maxScore="2.7143538">
...
<result name="response" numFound="1" start="0" maxScore="2.7143538">
<doc>
<float name="score">2.7143538</float>
<str name="id">82</str>
<str name="name">jorge</str>
<str name="day">15</str>
<str name="year">1989</str>
<str name="month">January</str>
</doc>
</result>
</response>
Would this be possible to do at all with Solr?
Thanks in advance
I commented on this. I decided that I should answer instead.
The fix for your issue isn't at the Solr level. You shouldn't be storing your data this way in the DB to begin with. In the long run, it would be better to fix this problem there, as opposed to trying to hack this at the Solr indexing level.
Your question proves that someone, probably an end user, is interested in searching by this data. This implies that it should probably be stored in the database as an actual Date or Timestamp field so that it can be properly selected or sorted on.
I'm sure people won't like that this doesn't exactly answer your question, but someone needs to tell you this.
If you know your way around Java you could write your own, custom transformer that would handle your specific case.
Have you tried using DIH RegexTransformer to parse JSON?
I think that should be doable, especially if you have fixed json format (doesn't contain document in document in document in ...).
I've just noticed ScriptTransformer, which allows you to write your own parser. I think this is the way to go...
Is the otherinfo field in the DB a JSON string to start with?
You would need dynamic fields (docs, explanation) and client-side code to let Solr store data with an arbitrary schema.
You would need to define dynamic fields in your schema like:
dyn_string_*: store text as it is
dyn_text_*: store text and index it for search
etc
Then you will need to tell DIH to map DB fields to solr dynamic fields (pseudocode warning; sorry, but I am not familiar with DIH):
Select
day as dyn_number_day,
name as dyn_text_name
from
tablename
Edit
You do have the requirement to query into the data structure. This needs a schema-less datastore.
Document DBs like MongoDB offer exactly this functionality: they store data in arbitrary fields that you determine at insert time, and they can run ad hoc queries on your data.
I am not aware of a request handler that can index your data this way. You can write code that periodically fetches updated (or added or removed) rows, decodes the JSON field, and indexes the result into Solr.
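The decoding step of such a program could look like the following Python sketch, which turns one database row into a flat document with the JSON keys promoted to top-level fields. The function name flatten_row is hypothetical, the row mirrors the example in the question, and fetching rows from MySQL and sending documents to Solr are left out.

```python
import json


def flatten_row(row, json_column="otherinfo"):
    """Copy the plain columns of a row and add one field per key of the
    decoded JSON column; existing columns win if a key clashes."""
    doc = {key: value for key, value in row.items() if key != json_column}
    raw = row.get(json_column)
    extra = json.loads(raw) if raw else {}
    for key, value in extra.items():
        doc.setdefault(key, value)
    return doc


# Row as it would come out of the database in the question's example
row = {
    "id": "82",
    "name": "jorge",
    "otherinfo": '{"day":15,"year":1989,"month":"January"}',
}
doc = flatten_row(row)
print(json.dumps([doc]))
```

The printed JSON array has the shape that Solr's JSON update handler accepts, assuming the schema defines day, year and month or matching dynamic fields. Nested JSON objects would need further flattening, which this sketch does not attempt.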
I recommend a skinny data model for storing attributes and properties independently of the current DB schema. I asked a question, 'Set intersection in MySQL: a clean way', a while back.
Recap: MongoDB and friends contain exactly the functionality you need. If you want relations and referential integrity, you can keep using RDBMS. If you still want that JSON thing, develop an active system that will parse it and index it to solr. But I recommend moving to a skinny data model, since you can get the same (conditions apply!) query capabilities that Solr gives you by SQL.
Exotic technology: Graph databases like Neo4j contain document database functionality (ad-hoc queries) and relations: a relation directly links one node to another, no joins involved. So it's just one step short of referential integrity.

Resources