Solr conditional adds/updates? - search

I have a fairly simple need to do a conditional update in Solr, which is easily accomplished in MySQL.
For example,
I have 100 documents with a unique field called <id>
I am POSTing 10 documents, some of which may be duplicate <id>s, in which case Solr would update the existing records with the same <id>s
I have a field called <dateCreated> and I would like to only update a <doc> if the new <dateCreated> is greated than the old <dateCreated> (this applies to duplicate <id>s only, of course)
How would I be able to accomplish such a thing?
The context is trying to combat race conditions resulting in multiple adds for the same ID but executing in the wrong order.
Thanks.

I can think of two ways:
Write your own UpdateHandler and override addDoc to implement that checking.
Put the appropriate locks (critical sections) in your client code in order to fetch the stored document, compare the dates, and conditionally add the new document in a thread-safe manner.
Remember that Solr is not a database, comparing it to MySQL is comparing apples and oranges.

As of solr 4.0, optimistic concurrency is enabled via the _version_ field.
http://yonik.com/solr/optimistic-concurrency/
To enable, you need to make sure your schema.xml contains
<field name="_version_" type="long" indexed="true" stored="true"/>
and in solrconfig.xml
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
</updateHandler>

With really custom addition logic like this, I find that writing your own client side updater works better. It keeps you from mucking around in Solr internals, which makes it easier to update in the future. You can definitly do this in SolrJ, but if you aren't a Java dev, there is probably a clientside library in your own preferred language... PHP, Python, Ruby, C# etc...
The rsolr Ruby gem (http://github.com/mwmitchell/rsolr/tree/master) makes it VERY easy to hack together a custom load script.

Related

Can data in Solr be extended with manually defined meta data?

I have several documents in a solr collection that I want to be able to search through. Most of the data comes from web sites I can easily crawl, however, I need to add some attributes manually to because I have to add these attributes manually.
So as an example I get the following info from a site (all attributes returned from crawled site):
Name: Porsche Boxter
Year: 1996
...
I want to add additional fields through a web interface (info not present on crawled sites):
Cool: yes
foo: bar
My questions:
Does it make sense at all to store additional information along the indexed data within Solr (inside the documents) or would a best practice only have all crawled data in Solr and merge with an external managed database during query time? To me it makes more sense to have all my data that is eventually queried in Solr as some of the manually added attributed are required search criteria (e.g. look only for cool cars from the 90s).
Is it possible to use Solr to store additional information about indexed documents? I know the entire schema in advance, perhaps this is useful?
If I store my data exclusively in Solr, how can I ensure that during the next crawl the manually added data is not overwritten? Would partial update be required?
Since I am new to Solr it would also be very helpful if someone could simply manage what to look for in the documentation that describes my use case.
That depends on how often the external data changes. The more often, the less meaningful. Generally it is a good idea to store such data along the index data, because you get them without an additional database query.
Yes. Use indexed:falseand stored:true. If you knew not know all of such fields in advance you could use a dynamicField like <dynamicField name="*_stored" type="string" indexed="false" stored="true" />.
Yes. You have to use partial update. This is no problem in your case, because the fields not updated have stored:true.

Solr: How to specify field relevancy/weight

I'm currently indexing data using Solr that consists of about 10 fields. When I perform a search I would like certain fields to be weighted higher. Could anyone help point me in the right direction?
For example, searching across all fields for a term such as "superman" should return hits in the "Title" field before the "Description" field.
I've found documentation on how to make one field score higher from the query, but I would prefer to set this in a configuration file or similar. The following would require all searches to specify the weight. Is it possible to specify this in the solr config file?
q=title:superman^2 description:superman
Try using qf with ExtendedDisMax your query then would look like that:
q=superman
While your config will look like:
<str name="qf">title^2 description</str>
You can get some working examples here
The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field's importance in the query. For example, the query below:
qf="fieldOne^2.3 fieldTwo fieldThree^0.4"
Assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and fieldThree a boost of 0.4. These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree."
Source: Apache Lucene
In your case: qf="title^100 description" may do the trick - if you're using Solr in a library I'd love to chat.
By using edismax we can achieve what you looking for.
Try adding these two fields in your request handler by changing the fields.
You can remove a particular field completely, if you don't want it.
<str name="defType"> edismax </str>
<str name="qf"> YourField^50 YourAnotherField^30 YetAnotherField</str>
The more the power(^) increases, the more priority that field gets.

Expression Engine - Is it possible for exp:channel:entries tag to have complex filters?

Coming from Codeigniter and new to Expression Engine, I am at a loss on how to do complex filters on the exp:channel:entries tag.
I am interested in this filters
status
start_on
stop_before
How do you do you implement filters for a complex condition like this?
(status=X|Y|Z AND start_on=A AND stop_before=B) OR (status=X AND start_on=C AND stop_before=D)
Is this even possible?
You can only use the search= parameter to search on “Text Input”, “Textarea”, and “Drop-down Lists” fields unfortunately. So you'd need to use the query module for this.
If you're just querying on those parameters you should be able to get the entry id's you need from the exp_channel_titles table, then use something like the Stash plugin to feed the entry_id's of the results into a regular channel entries tag. Yes it's nominally one more query that way but as EE abstracts the db schema fairly heavily the alternative is to get lost in a mess of JOINs.
So something like (pseudocode, won't work as is):
Get the entries, statuses are just a string in exp_channel_titles, entry_date is the date column you want - which is stored as a unix timestamp, so you'll need to select it with something like DATE( FROM_UNIXTIME(entry_date)) depending on the format of your filter data.
{exp:stash:set name="filtered_ids"}{exp:query sql="SELECT entry_id
FROM exp_channel_titles
WHERE status LIKE ...<your filter here>"
backspace="1"
}{entry_id}|{/exp:query}{/exp:stash:set}
Later in template:
{exp:channel:entries
entry_id="{exp:stash:get name="filtered_ids"}"
}
{!--loop --}
{/exp:channel:entries}
Yes it's a mess compared to what you're probably used to in pure CI, but the trade off is all the stuff you get for free from EE (CP, templating, member management etc).
Stash is awesome by the way - can be used to massively mitigate most EE performance issues/get around parse order issues
You can get a lot of this functionality using the search= parameter in your {exp:channel:entries...} loop.
It's not immediately clear to me how you'll get the complexity you seek, so you might end up resorting to the query module though.
If you're working with dates you might find the DT plugin useful.

Why did they create the concept of "schema.xml" in Solr?

Lucene does searching and indexing, all by taking "coding"... Why doesn't Solr do the same ? Why do we need a schema.xml ? Whats its importance ? Is there a way to avoid placing all the fields we want into a schema.xml ? ( I guess dynamic fields are the way to go, right ? )
That's just the way it was built. Lucene is a library, so you link your code against it. Solr, on the other hand, is a server, and in some cases you can just use it with very little coding (e.g. using DataImportHandler to index and Velocity plugin to browse and search).
The schema allows you to declaratively define how each field is analyzed and queried.
If you want a schema-less server based on Lucene, take a look at ElasticSearch.
If you want to avoid constantly tweaking your schema.xml, then dynamic fields are indeed the way to go. For an example, I like the Sunspot schema.xml — it uses dynamic fields to set up type-based naming conventions in field names.
https://github.com/outoftime/sunspot/blob/master/sunspot/solr/solr/conf/schema.xml
Based on this schema, a field named content_text would be parsed as a text field:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
Which corresponds to its earlier definition of the text fieldType.
Most schema.xml files that I work with start off based on the Sunspot schema. I have found that you can save a lot of time by establishing and reusing a good convention in your schema.xml.
Solr acts as a stand-alone search server and can be configured with no coding. You can think of it as a front-end for Lucene. The purspose of the schema.xml file is to define your index.
If possible, I would suggest defining all your fields in the schema file. This gives you greater control over how those fields are indexed and it will allow you to take advantage of copy fields (if you need them).

How do I get the guids of taxonomy/metadata terms for updating lists?

I'm integrating a legacy system with SP2010 using a literal SOAP calls. I can do a call such as CopyIntoItems or UpdateListItems and doing so, set metadata fields by providing a Fields node such as:
<Fields>
<FieldInformation Type="Note" DisplayName="Country_0"
Id="de1e6424-7a8a-42a5-8d21-73402fe2e609"
Value="UK|91c89925-16e6-4d41-9e71-ec45e8f2a113" />
...
</Fields>
In the Value attribute, I have to give the guid of the taxonomy or metadata term. That works fine in the above example since I happen to know the guid of the term UK. However, how can I dynamically figure out what the guid of some other value, say France for example, would be? I was thinking of making a utility to get them all and cache them on my system somehow so I can look them up easily, but where are they defined?
Possibly a stupid question, I get the feeling I've missed something...
The GetList call should allow you to get all the ID's.
But the UpdateListItems (and) call should work using just the internal names.
From the MSDN Sharepoint 2010 forum:
http://social.msdn.microsoft.com/Forums/eu/sharepoint2010general/thread/63553372-b87f-4d69-8d4c-63fb8a7b3c6d
Not sure if this is the best way but it's the answer I'm going with for the time being.

Resources