I have a site search I would like to implement using Solr. Unfortunately, I also have a lot of frequently updated dynamic data in my MySQL database from cron jobs, which I would also like to be searchable.
I would assume that constantly updating records in Solr is not a good idea, so is there a workable solution that gives me the text-search power of Solr while still letting me filter on these frequently updated fields?
I think this depends on what "frequently" means and how much Solr lag you can tolerate.
In my case, I update Solr twice every minute, which works fine, based on a MySQL DB that gets a few hundred updates a minute.
In this situation it's important NOT to run an optimize on every Solr update/commit. It's better to run an optimize every n hours.
So in the end, all the new MySQL data becomes visible in Solr with at most a 30-second delay.
Whether that is acceptable depends on your situation.
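To illustrate that cadence, here is a minimal sketch of a Node.js job that pushes only the rows changed since the last run to Solr's JSON update handler, committing on each push but never optimizing. It assumes Node.js 18+ (global fetch), the mysql2 package, a Solr core named "site" on localhost:8983, and a hypothetical articles table with an updated_at column:

```javascript
// Minimal sketch, assuming Node.js 18+ (global fetch), the `mysql2` package,
// a Solr core named "site" on localhost:8983, and a hypothetical `articles`
// table with an `updated_at` column.
const mysql = require('mysql2/promise');

// Commit on every push so changes become searchable, but never optimize here.
const SOLR_UPDATE = 'http://localhost:8983/solr/site/update?commit=true';

async function pushRecentChanges(pool, since) {
  // Grab only the rows that changed since the last run.
  const [rows] = await pool.query(
    'SELECT id, body_text, updated_at FROM articles WHERE updated_at > ?',
    [since]
  );
  if (rows.length === 0) return;

  // Solr's JSON update handler accepts an array of documents; documents with
  // an existing unique key are replaced, new ones are added.
  const res = await fetch(SOLR_UPDATE, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(rows),
  });
  if (!res.ok) throw new Error(`Solr update failed: ${res.status}`);
}

async function main() {
  const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'app' });
  let since = new Date(0);
  // Push twice a minute; a separate nightly/hourly job can run the optimize.
  setInterval(() => {
    const now = new Date();
    pushRecentChanges(pool, since)
      .then(() => { since = now; })
      .catch(console.error);
  }, 30 * 1000);
}

main().catch(console.error);
```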
The problem I am facing with CouchDB is that whenever I hit a database for the first time, it responds quite slowly, though it speeds up from the second request onward. Is there a workaround so that this first-request slowness goes away as well?
Secondary indexes in CouchDB are not updated during document write operations (see the docs), so the delay comes from the view actually being generated for the first time.
For CouchDB 3.x: look into tuning background indexing
For CouchDB 2.x and before: upgrade, and/or query your views regularly so they have already been built by the time you need them to respond quickly
Ah, and if you're doing Mango queries, make sure the required indexes are defined in the first place, so you're not rescanning the whole DB every time :)
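A minimal sketch of both ideas, assuming Node.js 18+ (global fetch) and an unauthenticated CouchDB on localhost:5984; the database "logs", the field "timestamp", and the design document "reports" with its "by_timestamp" view are all hypothetical:

```javascript
// Minimal sketch, assuming Node.js 18+ (global fetch) and an unauthenticated
// CouchDB on localhost:5984; the database "logs", the field "timestamp", and
// the design document "reports" are hypothetical. Add an Authorization header
// if your instance requires credentials.
const BASE = 'http://localhost:5984/logs';

// 1) Define the Mango index up front so /_find queries can use it instead of
//    falling back to a full database scan.
async function ensureIndex() {
  await fetch(`${BASE}/_index`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      index: { fields: ['timestamp'] },
      name: 'timestamp-index',
      type: 'json',
    }),
  });
}

// 2) "Warm" a map/reduce view periodically so it is already built when a real
//    user hits it; a limit=1 request is enough to trigger the index update.
async function warmView() {
  await fetch(`${BASE}/_design/reports/_view/by_timestamp?limit=1`);
}

ensureIndex()
  .then(() => setInterval(() => warmView().catch(console.error), 5 * 60 * 1000))
  .catch(console.error);
```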
I have an Elasticsearch index with a large number of documents. I've been using JavaScript up until this point, with Node.js. I made a cron job that runs every 24 hours to update the documents individually based on any metadata changes. As some of you may know, this is probably the slowest possible way to do it: single-threaded Node.js with individual indexing calls against Elasticsearch. When I run the cron job, it runs at a snail's pace. I can update a document every 1.5-2 seconds or so, which means it would take around 27 hours to update all the documents. I'm using a free-tier AWS ES instance, so I don't have access to certain features that would help me speed up the process.
Does anyone know of a way to speed up the indexing? If I were to call for a bulk update, how would that manifest in javascript? If I were to use another language to multi-thread it, what would be the fastest option?
I did not understand your question "If I were to call for a bulk update, how would that manifest in javascript?".
Bulk update should be the best solution irrespective of language/framework. Of course, you can explore other languages like Ruby to leverage threads and make the bulk updates more distributed and fast.
From experience, a bulk update with a batch size between 4k and 7k works just fine. You may want to fine-tune the size within this range.
Ensure refresh_interval is set to a sufficiently large value. This makes sure newly indexed documents are not refreshed (made searchable) too frequently, which keeps indexing overhead down. IMO the default value will also do. Read more here.
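To make the bulk idea concrete in JavaScript, here is a minimal sketch that sends partial updates through the _bulk endpoint in batches of around 5000 and optionally relaxes refresh_interval for the duration of the run. The index name "articles", the ES_URL, and the shape of the updates array are assumptions; if your AWS domain requires IAM request signing, you would need a signing client instead of plain fetch:

```javascript
// Minimal sketch, assuming Node.js 18+ (global fetch), an Elasticsearch
// endpoint reachable at ES_URL, a hypothetical index "articles", and an
// `updates` array of { id, metadata } objects produced by your cron job.
const ES_URL = 'https://my-es-endpoint.example.com';
const INDEX = 'articles';

// Send partial updates in batches via the _bulk API instead of one
// request per document.
async function bulkUpdate(updates, batchSize = 5000) {
  for (let i = 0; i < updates.length; i += batchSize) {
    const batch = updates.slice(i, i + batchSize);

    // _bulk expects newline-delimited JSON: an action line followed by a doc line.
    const body = batch
      .map(({ id, metadata }) =>
        JSON.stringify({ update: { _index: INDEX, _id: id } }) + '\n' +
        JSON.stringify({ doc: metadata }))
      .join('\n') + '\n';

    const res = await fetch(`${ES_URL}/_bulk`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-ndjson' },
      body,
    });
    const result = await res.json();
    if (result.errors) console.error('Some bulk items failed', result.items);
  }
}

// Optionally relax the refresh interval during the nightly run so segments
// are not refreshed after every batch; reset it afterwards if you need
// near-real-time search.
async function relaxRefresh(interval = '60s') {
  await fetch(`${ES_URL}/${INDEX}/_settings`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ index: { refresh_interval: interval } }),
  });
}
```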
I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them, because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but instead starts at a configured time, or better, is triggered just before the read operations start.
I'm doing all my processing from Node.js, so one option I see here is to export the data created in some period like [yesterday, today] and import it into the read instance myself, running the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I didn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and am missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date; it will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which selects the documents from the last month, followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change while you run your aggregations, so your reports won't have inconsistencies caused by the data changing underneath them.
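A minimal sketch of that $match + $out snapshot with the official mongodb Node.js driver; the database/collection names, the createdAt field, and the connection string are hypothetical:

```javascript
// Minimal sketch, assuming the official `mongodb` Node.js driver, a
// hypothetical "logs" collection with a `createdAt` date field, and a
// target collection name "logs_report".
const { MongoClient } = require('mongodb');

async function snapshotLastMonth() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    const db = client.db('app');
    const since = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);

    // $match selects last month's documents, $out writes them into a
    // separate collection that the report queries can hammer without
    // touching the live logs collection.
    await db.collection('logs').aggregate([
      { $match: { createdAt: { $gte: since } } },
      { $out: 'logs_report' },
    ]).toArray(); // the pipeline only runs when the cursor is consumed

    // Now run the heavy reporting aggregations against db.collection('logs_report').
  } finally {
    await client.close();
  }
}

snapshotLastMonth().catch(console.error);
```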
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
I have a Solr index with document fields something like:
id, body_text, date, num_upvotes, num_downvotes
In my application, a document is created with some integer id and some body_text (500 chars max). The date is set to the time of input, and num_upvotes and num_downvotes begin at 0.
My application gives users the ability to upvote and downvote the content mentioned above, and the reason I want to keep track of this in Solr instead of just the DB is that I want to be able to consider the number of upvotes and downvotes into my search.
This is a problem because you can't simply update a Solr document (i.e. increment num_upvotes); you must replace the entire document, which is probably fairly inefficient considering it would require hitting my DB to grab all the relevant data again.
I realize the solution may require a different layout of data, or possibly multiple indexes (although I don't know if you can query/score across Solr cores).
Is anyone able to offer any recommendations on how to tackle this?
A solution I use for a similar problem is to update that information in the database and do Solr updates/inserts every ten minutes, using only the documents that were modified since the last update.
Also, every night, when I don't have much traffic, I run an index optimize.
After each import, some warm-up queries set up in the Solr config are run.
My Solr index holds around 1.5 million documents; each document has 24 fields and around 2000 characters in total.
Every 10 minutes I update around 500 documents in the index (without optimizing it), and I run around 50 warm-up queries comprising the most common facets, the most used filter queries, and free-text searches.
I don't see a negative impact on performance (at least none that is visible): my queries run in 0.1 seconds on average (before introducing the 10-minute updates, the average was 0.09 seconds).
LATER EDIT:
I didn't encounter any problems during these updates. I always take the documents from the database and insert them into Solr with a unique key; if a document already exists in Solr, it is replaced (this is what I mean by update).
It never takes more than 3 minutes to update Solr. Actually, I take a 10-minute break after each update: I start the index update, wait for it to finish, and then wait another 10 minutes before starting again.
I did not look at the performance overnight, but for me that is not relevant, as I want fresh data during the peaks in user visits.
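Inside Solr, such warm-up queries are typically configured as listeners in solrconfig.xml. As an external alternative, the minimal sketch below fires a handful of common facet/filter/free-text queries right after each delta import commits; the core name, fields, and queries are hypothetical examples standing in for your most common ones, and it assumes Node.js 18+ (global fetch):

```javascript
// Minimal sketch, assuming Node.js 18+ (global fetch) and a Solr core called
// "articles"; the facet field, filter query, and search terms are hypothetical
// stand-ins for your most common ones.
const SOLR = 'http://localhost:8983/solr/articles/select';

const WARMUP_QUERIES = [
  { q: '*:*', facet: 'true', 'facet.field': 'category' },
  { q: '*:*', fq: 'num_upvotes:[10 TO *]' },
  { q: 'popular free text terms' },
];

// Run this right after the 10-minute delta import commits, so the caches of
// the new searcher are populated before real users hit it.
async function warmUp() {
  for (const params of WARMUP_QUERIES) {
    const url = `${SOLR}?${new URLSearchParams({ wt: 'json', rows: '0', ...params })}`;
    await fetch(url);
  }
}

warmUp().catch(console.error);
```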
The Join feature would help you here. Then you could store the up/down votes in a separate document.
The bad news is that you need to wait until Solr 4 unless you're comfortable running with a trunk build.
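For illustration, a minimal sketch of what such a join query could look like once the feature is available; it assumes Node.js 18+ (global fetch) and a single core "content" that holds both the text documents and small vote documents with hypothetical content_id and num_upvotes fields:

```javascript
// Minimal sketch, assuming Node.js 18+ (global fetch) and a core "content"
// holding both the text documents and small "vote" documents with the
// hypothetical fields `content_id` and `num_upvotes`.
const params = new URLSearchParams({
  q: 'body_text:solr',
  // Join from the vote documents back to the content documents:
  // keep only content whose matching vote doc has 10+ upvotes.
  fq: '{!join from=content_id to=id}num_upvotes:[10 TO *]',
  wt: 'json',
});

fetch(`http://localhost:8983/solr/content/select?${params}`)
  .then((res) => res.json())
  .then((data) => console.log(data.response.docs))
  .catch(console.error);
```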
If you are only going to be updating the up/down votes, then instead of going back to the database, just use the appropriate Solr client for your application: pull the document from the index, set the up/down values as needed, and then reinsert the document into the index.
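A minimal sketch of that pull-and-reinsert round trip over Solr's HTTP API, assuming Node.js 18+ (global fetch), a hypothetical core "content", and that all fields are stored so the document can be re-indexed without data loss:

```javascript
// Minimal sketch, assuming Node.js 18+ (global fetch), a core "content", and
// that every field is stored so the document can be re-indexed losslessly.
const CORE = 'http://localhost:8983/solr/content';

async function upvote(docId) {
  // 1) Pull the current document from the index.
  const params = new URLSearchParams({ q: `id:${docId}`, wt: 'json' });
  const res = await fetch(`${CORE}/select?${params}`);
  const doc = (await res.json()).response.docs[0];
  if (!doc) throw new Error(`Document ${docId} not found`);

  // 2) Bump the counter (drop internal fields like _version_ if present).
  delete doc._version_;
  doc.num_upvotes = (doc.num_upvotes || 0) + 1;

  // 3) Reinsert: the same unique key means the old document is replaced.
  await fetch(`${CORE}/update?commit=true`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify([doc]),
  });
}

upvote(42).catch(console.error);
```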
There is no solution to your problem within SOLR. You have a database problem and you are trying to solve it with a search engine.
The best way to deal with this is to keep a Redis store that maps each Solr document id to its up/down vote counts. Your app can then merge the data from both sources before displaying it.
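A minimal sketch of that split, assuming the redis v4 Node.js client and that the Solr results carry an id field; the key names are hypothetical:

```javascript
// Minimal sketch, assuming the `redis` v4 Node.js client and that search
// results come back from Solr with an `id` field; key names are hypothetical.
const { createClient } = require('redis');

const redis = createClient();

// Record a vote without touching Solr at all.
async function recordVote(docId, direction /* 'up' | 'down' */) {
  await redis.hIncrBy(`votes:${docId}`, direction, 1);
}

// Merge vote counts into the documents returned by a Solr query
// just before rendering them.
async function mergeVotes(solrDocs) {
  return Promise.all(
    solrDocs.map(async (doc) => {
      const votes = await redis.hGetAll(`votes:${doc.id}`);
      return {
        ...doc,
        num_upvotes: Number(votes.up || 0),
        num_downvotes: Number(votes.down || 0),
      };
    })
  );
}

async function main() {
  await redis.connect();
  await recordVote(42, 'up');
  const merged = await mergeVotes([{ id: 42, body_text: '...' }]);
  console.log(merged);
  await redis.quit();
}

main().catch(console.error);
```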
I have a news site with 150,000 news articles. About 250 new articles are added to the database daily, at an interval of 5-15 minutes. I understand that Solr is optimized for millions of records and my 150K won't be a problem for it. But I am worried that the frequent updates will be a problem, since the cache gets invalidated with every update. On my dev server, a cold load of a page takes 5-7 seconds (since every page runs a few MLT queries).
Would it help if I split my index into two: an archive index and a latest index? The archive index would be updated only once a day.
Can anyone suggest any ways to optimize my installation for a constantly updating index?
Thanks
My answer is: test it! Don't try to optimize yet if you don't know how it performs. Like you said, 150K is not a lot; it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from a few concurrent threads (to simulate users) while you index more documents, and see how it behaves.
One setting you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit on each document (you would bring Solr down). The value you choose for this setting lets you tune the latency of the system (how long it takes for new documents to show up in results) while keeping the system responsive.
Consider using mlt=true in the main query instead of issuing per-result MoreLikeThis queries. You'll save the round trips, so it will be faster.
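A minimal sketch of such a combined query, assuming Node.js 18+ (global fetch), a hypothetical core "articles", a hypothetical body_text field to base the similarity on, and that the MoreLikeThis search component is enabled on the handler (it is part of Solr's default component list):

```javascript
// Minimal sketch, assuming Node.js 18+ (global fetch), a core "articles", and
// a hypothetical text field `body_text` to base the similarity on.
const params = new URLSearchParams({
  q: 'title:election',
  rows: '10',
  wt: 'json',
  // Ask the MoreLikeThis component for similar docs in the same request,
  // instead of issuing one MLT query per result.
  mlt: 'true',
  'mlt.fl': 'body_text',
  'mlt.count': '5',
});

fetch(`http://localhost:8983/solr/articles/select?${params}`)
  .then((res) => res.json())
  .then((data) => {
    // data.response.docs holds the main results;
    // data.moreLikeThis is keyed by document id with the similar articles.
    console.log(Object.keys(data.moreLikeThis));
  })
  .catch(console.error);
```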