I have several documents in a solr collection that I want to be able to search through. Most of the data comes from web sites I can easily crawl, however, I need to add some attributes manually to because I have to add these attributes manually.
So as an example I get the following info from a site (all attributes returned from crawled site):
Name: Porsche Boxter
Year: 1996
...
I want to add additional fields through a web interface (info not present on crawled sites):
Cool: yes
foo: bar
My questions:
Does it make sense at all to store additional information along the indexed data within Solr (inside the documents) or would a best practice only have all crawled data in Solr and merge with an external managed database during query time? To me it makes more sense to have all my data that is eventually queried in Solr as some of the manually added attributed are required search criteria (e.g. look only for cool cars from the 90s).
Is it possible to use Solr to store additional information about indexed documents? I know the entire schema in advance, perhaps this is useful?
If I store my data exclusively in Solr, how can I ensure that during the next crawl the manually added data is not overwritten? Would partial update be required?
Since I am new to Solr it would also be very helpful if someone could simply manage what to look for in the documentation that describes my use case.
That depends on how often the external data changes. The more often, the less meaningful. Generally it is a good idea to store such data along the index data, because you get them without an additional database query.
Yes. Use indexed:falseand stored:true. If you knew not know all of such fields in advance you could use a dynamicField like <dynamicField name="*_stored" type="string" indexed="false" stored="true" />.
Yes. You have to use partial update. This is no problem in your case, because the fields not updated have stored:true.
Related
I have an issue of replication and I need your help in it.In couchDb replication,I want to replicate in such a way that during Couchdb replication I want to reset/update some specific attributes of a a document for some purpose and then these edited documents should be saved in replicated db without effecting the original ones.For example:
A document named Student with attributes id,name,class etc.
And I want to replicate this document in the way that its name and class should be reset/updated.
Will you please tell me how can I achieve it.
Thanks.
You can't update docs during the replication.
But you can exclude docs from being replicated with the help of a CouchDB filter (e.g. preventing all docs with a revision higher then 1 from being replicated).
If you want to have multiple versions of the same dataset (e.g. to have dataset revisions) - i use the term "dataset" instead of "doc" to clearly express that not the internal CouchDB doc revision handling is involved - you have to store them as separated docs that have all a unique id and a reference property like original: "UUID_of_the_original".
you can't use the CouchDB doc revision handling for that purpose (thats what many people think when they see the _rev property in the docs)
How the search everything kind of application is indexing & keeping track of data into its search indexes.
Recently I have been working on Apache Solr which is producing amazing results for a search. But it was for one particular products catalog section that is being searched. As Solr is a stores it's data document, we indexed searchable fields as document in solr. I'm not sure how it can be used to build a search everything kind of search? And how should I index data into Solr?
By search everything I mean, to search into different module for information like Customers, Services, Accounts, Orders, Catalog, Support Ticket, etc. So search return results which is combined as a result from a single search form and user don't need to go into different forms for search that module?
Do I need to build different indexes for each such data models or store them into solr as single document? What is the best strategy to implement this.
You can store all that data in a single index with each document having an extra field that stores its type (Customer, Order, etc.). For the within-module search, just restrict the search query to documents of that type. For the Search All functionality, use copyField to copy all the relevant fields in each document type into one big field, and search with the document type field unconstrained.
Lucene does searching and indexing, all by taking "coding"... Why doesn't Solr do the same ? Why do we need a schema.xml ? Whats its importance ? Is there a way to avoid placing all the fields we want into a schema.xml ? ( I guess dynamic fields are the way to go, right ? )
That's just the way it was built. Lucene is a library, so you link your code against it. Solr, on the other hand, is a server, and in some cases you can just use it with very little coding (e.g. using DataImportHandler to index and Velocity plugin to browse and search).
The schema allows you to declaratively define how each field is analyzed and queried.
If you want a schema-less server based on Lucene, take a look at ElasticSearch.
If you want to avoid constantly tweaking your schema.xml, then dynamic fields are indeed the way to go. For an example, I like the Sunspot schema.xml — it uses dynamic fields to set up type-based naming conventions in field names.
https://github.com/outoftime/sunspot/blob/master/sunspot/solr/solr/conf/schema.xml
Based on this schema, a field named content_text would be parsed as a text field:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
Which corresponds to its earlier definition of the text fieldType.
Most schema.xml files that I work with start off based on the Sunspot schema. I have found that you can save a lot of time by establishing and reusing a good convention in your schema.xml.
Solr acts as a stand-alone search server and can be configured with no coding. You can think of it as a front-end for Lucene. The purspose of the schema.xml file is to define your index.
If possible, I would suggest defining all your fields in the schema file. This gives you greater control over how those fields are indexed and it will allow you to take advantage of copy fields (if you need them).
I have three document types MainCategory, Category, SubCategory... each have a parentid which relates to the id of their parent document.
So I want to set up a view so that I can get a list of SubCategories which sit under the MainCategory (preferably just using a map function)... I haven't found a way to arrange the view so this is possible.
I currently have set up a view which gets the following output -
{"total_rows":16,"offset":0,"rows":[
{"id":"11098","key":["22056",0,"11098"],"value":"MainCat...."},
{"id":"11098","key":["22056",1,"11098"],"value":"Cat...."},
{"id":"33610","key":["22056",2,"null"],"value":"SubCat...."},
{"id":"33989","key":["22056",2,"null"],"value":"SubCat...."},
{"id":"11810","key":["22245",0,"11810"],"value":"MainCat...."},
{"id":"11810","key":["22245",1,"11810"],"value":"Cat...."},
{"id":"33106","key":["22245",2,"null"],"value":"SubCat...."},
{"id":"33321","key":["22245",2,"null"],"value":"SubCat...."},
{"id":"11098","key":["22479",0,"11098"],"value":"MainCat...."},
{"id":"11098","key":["22479",1,"11098"],"value":"Cat...."},
{"id":"11810","key":["22945",0,"11810"],"value":"MainCat...."},
{"id":"11810","key":["22945",1,"11810"],"value":"Cat...."},
{"id":"33123","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33453","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33667","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33987","key":["22945",2,"null"],"value":"SubCat...."}
]}
Which QueryString parameters would I use to get say the rows which have a key that starts with ["22945".... When all I have (at query time) is the id "11810" (at query time I don't have knowledge of the id "22945").
If any of that makes sense.
Thanks
The way you store your categories seems to be suboptimal for the query you try to perform on it.
MongoDB.org has a page on various strategies to implement tree-structures (they should apply to Couch and other doc dbs as well) - you should consider Array of Ancestors, where you always store the full path to your node. This makes updating/moving categories more difficult, but querying is easy and fast.
I have a fairly simple need to do a conditional update in Solr, which is easily accomplished in MySQL.
For example,
I have 100 documents with a unique field called <id>
I am POSTing 10 documents, some of which may be duplicate <id>s, in which case Solr would update the existing records with the same <id>s
I have a field called <dateCreated> and I would like to only update a <doc> if the new <dateCreated> is greated than the old <dateCreated> (this applies to duplicate <id>s only, of course)
How would I be able to accomplish such a thing?
The context is trying to combat race conditions resulting in multiple adds for the same ID but executing in the wrong order.
Thanks.
I can think of two ways:
Write your own UpdateHandler and override addDoc to implement that checking.
Put the appropriate locks (critical sections) in your client code in order to fetch the stored document, compare the dates, and conditionally add the new document in a thread-safe manner.
Remember that Solr is not a database, comparing it to MySQL is comparing apples and oranges.
As of solr 4.0, optimistic concurrency is enabled via the _version_ field.
http://yonik.com/solr/optimistic-concurrency/
To enable, you need to make sure your schema.xml contains
<field name="_version_" type="long" indexed="true" stored="true"/>
and in solrconfig.xml
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
</updateHandler>
With really custom addition logic like this, I find that writing your own client side updater works better. It keeps you from mucking around in Solr internals, which makes it easier to update in the future. You can definitly do this in SolrJ, but if you aren't a Java dev, there is probably a clientside library in your own preferred language... PHP, Python, Ruby, C# etc...
The rsolr Ruby gem (http://github.com/mwmitchell/rsolr/tree/master) makes it VERY easy to hack together a custom load script.