I am trying to request a large number of documents from my database (which has over 400k documents). I started with the _all_docs built-in view. My first attempt was this query:
http://database:port/databasename/_all_docs?limit=100&include_docs=true
No problem. Completes as expected. Now to ramp it up:
http://database:port/databasename/_all_docs?limit=1000&include_docs=true
Still fine. Took longer, more data, etc. as expected. Ramp it up again:
http://database:port/databasename/_all_docs?limit=10000&include_docs=true
The request never completes. Chrome's dev tools show Size = 5.3 MB (which seems significant), and this happens for any limit value over roughly 6500. Whether I specify 6500 or 10,000, it always shows 5.3 MB downloaded and the request stalls.
I have also tried other combinations, such as adding "skip", and it seems that limit + skip must stay under ~6500 or I get the same stall.
My environment: Couchdb 1.6.1, Ubuntu 14.04.3 LTS, Azure A1 standard
You have to pre-warm your queries. Throwing a request for 100K or more docs at CouchDB and expecting to get them all back in one go simply won't work.
When you ask for some items from a view (in your case the default view), CouchDB will notice on the first read that the B-tree for the view doesn't exist yet, and will build it as part of that read. Depending on how many documents are in your database, that can take a while and puts a substantial load on the server.
On every subsequent read, CouchDB checks which documents have changed since the view was last updated and runs only those through the map and reduce functions. So if you query a view only from time to time but have lots of changes in between, expect some delay on the next read.
There are two ways to handle this situation:
1. Pre-warm your view: run a cron job that reads from it regularly, so the B-tree for the view is already built when real requests arrive (a minimal sketch follows this list).
2. Prepare your view in advance for a particular query, before inserting the data into CouchDB.
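A hedged sketch of option 1, assuming a hypothetical design document named `reports` with a view named `by_date` on a local CouchDB; requesting the view with `limit=0` returns no rows but still forces the index to be (re)built:

```javascript
// prewarm.js - run from cron so views are already indexed when users arrive.
// Host, port, database and view names below are placeholders.
const http = require('http');

const views = [
  '/databasename/_design/reports/_view/by_date?limit=0',
  // add one entry per view you want kept warm
];

views.forEach(function (path) {
  http.get({ host: 'localhost', port: 5984, path: path }, function (res) {
    res.resume(); // discard the body; the read itself triggers indexing
    console.log(path, '->', res.statusCode);
  }).on('error', console.error);
});
```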
And for now, if you really want to read all of your docs, don't read them in one request; page through them with skip and limit range queries.
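A rough sketch of that kind of paged read (host, database name and page size are placeholders); each request stays small instead of asking for everything at once:

```javascript
// fetch-all-docs.js - read _all_docs in pages rather than in one huge request.
// Host, database name and PAGE size are placeholders.
const http = require('http');

const PAGE = 1000;

function getPage(skip, cb) {
  const path = '/databasename/_all_docs?include_docs=true' +
               '&limit=' + PAGE + '&skip=' + skip;
  http.get({ host: 'localhost', port: 5984, path: path }, function (res) {
    let body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { cb(null, JSON.parse(body)); });
  }).on('error', cb);
}

function readAll(skip) {
  getPage(skip, function (err, page) {
    if (err) throw err;
    // ...process page.rows here...
    if (page.rows.length === PAGE) readAll(skip + PAGE);
  });
}

readAll(0);
```

In practice, paging by startkey (passing the last key of the previous page and skipping one row) scales better than an ever-growing skip, but the loop structure is the same.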
Related
The problem I am facing with CouchDB is that whenever I hit any database for the first time, the fetch is quite slow, though it speeds up from the second request onward. Is there any workaround so that the first request is fast as well?
Secondary indexes in CouchDB are not updated during document write operations (doc), so the delay is due to the view actually being generated for the first time.
For CouchDB 3.x: look into tuning background indexing
For CouchDB 2.x and earlier: upgrade, and/or prefetch your views regularly so they have already been built by the time you need them
Ah, and if you're doing mango queries, then make sure required indexes are defined in the first place so you're not rescanning the DB every time :)
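On the mango point, a minimal hedged sketch (database name and the `created_at` field are invented, and this assumes a dev setup where no authentication is needed): define the index once via `POST /{db}/_index`, after which `_find` queries on that field no longer have to scan the whole database:

```javascript
// create-index.js - define a mango index so _find doesn't full-scan the DB.
// Host, database name and field names are placeholders.
const http = require('http');

function post(path, body, cb) {
  const req = http.request(
    { host: 'localhost', port: 5984, path: path, method: 'POST',
      headers: { 'Content-Type': 'application/json' } },
    function (res) {
      let data = '';
      res.on('data', function (c) { data += c; });
      res.on('end', function () { cb(null, JSON.parse(data)); });
    });
  req.on('error', cb);
  req.write(JSON.stringify(body));
  req.end();
}

// 1. Define the index (only needed once).
post('/databasename/_index',
     { index: { fields: ['created_at'] }, name: 'created-at-index' },
     function (err) {
       if (err) throw err;
       // 2. Queries whose selector uses that field can now hit the index.
       post('/databasename/_find',
            { selector: { created_at: { $gt: '2020-01-01' } }, limit: 10 },
            function (err, resp) {
              if (err) throw err;
              console.log(resp.docs);
            });
     });
```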
According to what I've found online, you make a request to /_changes?since=0&limit=1, do what you want with the change, then take the last_seq value, pass it as since, and request again.
My problem is that this skips changes. You can keep requesting /_changes?since=0&limit=1 and get a different change each time, only occasionally getting the actual first change to the database; sometimes you get the 7th change, or the 4th, etc. If you then repeat using the last_seq value, it skips further ahead, and as far as I can tell it never goes back and picks up the changes it skipped.
Is there a proper way to periodically poll a CouchDB changes feed on a cluster, without resorting to a continuous socket connection instead?
What we have right now is a PHP script that runs as a cron task, requests the last 1000 changes, works through them, and syncs SQL databases to match what is in CouchDB. With CouchDB appearing to skip changes, this is a big problem.
The CouchDB 2.x documentation states (see):
"The results returned by _changes are partially ordered. In other words, the order is not guaranteed to be preserved for multiple calls."
So when you call /_changes?since=0&limit=1 you may obtain a different result each time, because the order is not guaranteed.
The _changes response contains a pending attribute with the number of elements left out of the response. If you take the last_seq value from the last request and use it as the since parameter of the next request, you'll get the next batch of changes, and the pending value decreases consistently.
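A rough sketch of that loop (host, database name and batch size are placeholders); each batch resumes from the previous response's last_seq, and, per the documentation note quoted below, the processing step must be safe to repeat:

```javascript
// poll-changes.js - read the _changes feed in batches, resuming from last_seq.
// Host and database name are placeholders; processing must be idempotent.
const http = require('http');

function getJSON(path, cb) {
  http.get({ host: 'localhost', port: 5984, path: path }, function (res) {
    let body = '';
    res.on('data', function (c) { body += c; });
    res.on('end', function () { cb(null, JSON.parse(body)); });
  }).on('error', cb);
}

function poll(since) {
  const path = '/databasename/_changes?limit=1000&since=' +
               encodeURIComponent(since);
  getJSON(path, function (err, resp) {
    if (err) throw err;
    resp.results.forEach(function (change) {
      // apply change.id / change.changes to the SQL side here;
      // this must tolerate seeing the same change twice
    });
    // persist resp.last_seq somewhere durable, then continue from it
    if (resp.pending > 0) poll(resp.last_seq);
  });
}

poll(0); // or the last_seq saved by the previous cron run
```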
Also, you should be careful about the following documentation note:
If the specified replicas of the shards in any given since value are unavailable, alternative replicas are selected, and the last known checkpoint between them is used. If this happens, you might see changes again that you have previously seen. Therefore, an application making use of the _changes feed should be ‘idempotent’, that is, able to receive the same data multiple times, safely.
Reading changes in batches is a recommendation of the CouchDB Replication Protocol (see), used by CouchDB-compatible clients such as Cloudant Sync, so the approach you described should be correct.
Please don't use the numeric prefix of the change seq to infer that changes have been missed, as this number is computed from cluster state that may vary between calls. You can check this answer for more detail.
I have been thinking about how to make my NodeJS app faster, so I tried querying for only some fields versus the entire document, because the MongoDB documentation says it's faster to query for only the fields you need. The problem is that my results seem to contradict this, so where am I going wrong? Here is the code I am using; it saves results to CSV so I can build a chart in LibreOffice:
http://pastebin.com/G8KRRY3n
The first option (A) fetches the entire document.
The second option (B) fetches only some fields.
Here is the graph I got from it (every operation in milliseconds):
http://prntscr.com/5oofoz
I process almost 9,500 users. As you can see, for the first (0~200) items processed the two are the same, but then the second option starts to take longer... I have tried switching the order of the options in case the garbage collector had something to do with it, but the results are almost the same.
Yes, the first option is faster for the first elements. So the question is: in a high-traffic web app, which option is recommended, and why? I am a newbie in the performance field, so I am pretty sure I'm doing something wrong...
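For reference, a minimal sketch of the two variants with the current official `mongodb` Node.js driver (connection string, collection and field names are made up, since the actual code only exists in the pastebin):

```javascript
// projection-vs-full.js - compare fetching whole documents vs a projection.
// Connection string, collection and field names are placeholders.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const users = client.db('mydb').collection('users');

  // Option A: fetch the entire document.
  const full = await users.find({}).toArray();

  // Option B: fetch only the fields you actually need.
  const partial = await users
    .find({}, { projection: { name: 1, email: 1, _id: 0 } })
    .toArray();

  console.log(full.length, partial.length);
  await client.close();
}

main().catch(console.error);
```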
I understand that CouchDB hashes the source of each design document to derive the name of the index file. Whenever I change the source code, the index needs to be rebuilt. CouchDB does this when a view in the document is requested for the first time.
What I'd expect to happen and want to happen
Each time I change a design doc, the first call to a view will take significantly longer than usual and may time out. The index will continue to build. Once this is completed, the view will only process changes and will be very fast.
What actually happens
1. When running an amended view for the first time, I see the indexing process in the status window slowly reach 100%. This takes about 2 hours, during which all CPUs are fully utilized.
2. Once the process reaches 99%, it remains there for about an hour and then disappears. CPU utilization drops to just one CPU.
3. After the process has disappeared, the data file for the view keeps growing for about half an hour to an hour. CPU utilization is near 0%.
4. The index file suddenly stops increasing in size.
5. If I request the view again once I've reached stage 4, the characteristics of stage 3 start again. I have to repeat this process between 5 and 50 times until I can finally retrieve the view values.
If the view gets requested a second time while still in stage 1 or 2, CouchDB will almost certainly run out of memory and I have to restart the service. This is despite the DB rarely using more than 2 GByte when running just one job, and more than 4 GByte being free in usual operation.
I have tried tweaking configuration settings and adding more memory, but nothing seems to have an impact.
My Question
Do I misunderstand the concept of running views or is something wrong with my setup?
If this is expected, is there anything I can tweak to reduce the number of reruns?
Context
My documents are pretty large (1 to 20 MByte). The data they contain is well structured; they are usually web-analytics reports and would, in a relational database, be stored as several tens of thousands of rows.
My map function extracts these rows and emits the dimensions as the key array. The key array sometimes exceeds 20 columns; most views have fewer than 10.
The reduce function aggregates (sums) all values in rows with identical keys. The metrics are stored in a dictionary and may contain different keys; the reduce function treats a key missing from one document as 0 when adding it to the aggregate.
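For illustration, a simplified sketch of that kind of map/reduce (the row structure, field names and dimensions are invented; the real reports will differ):

```javascript
// Map: emit one row per report line, keyed by its dimension values.
// In the design document this would be the string body of views.<name>.map.
function (doc) {
  (doc.rows || []).forEach(function (row) {
    // e.g. key = [date, country, channel], value = { visits: 10, clicks: 2 }
    emit([row.date, row.country, row.channel], row.metrics);
  });
}

// Reduce: sum the metric dictionaries, treating a missing key as 0.
function (keys, values, rereduce) {
  var acc = {};
  values.forEach(function (v) {
    Object.keys(v).forEach(function (k) {
      acc[k] = (acc[k] || 0) + v[k];
    });
  });
  return acc;
}
```

A custom JavaScript reduce that builds up an object like this is also exactly the kind of reduce the answer below warns about.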
I am using CouchDB 1.5.0 on Windows Server 2008 R2 with 2 CPUs and 8 GByte of memory.
The views are written in JavaScript using the couchjs query server.
My design documents usually consist of several views, plus a '_lib' view that does not emit any data but contains an exhaustive library of functions used by the actual views.
It is a known issue, but just in case: if you have gigabytes of docs, you can forget about custom reduce functions. Only the built-in ones will be fast enough.
It is also possible to set the query server timeout (os_process_timeout, in milliseconds) to an extra-low value (1 second, for example). This way you can detect which docs take long to index and optimize your map function for performance.
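As a hedged sketch for CouchDB 1.x, where the node configuration is exposed at `/_config` (on 2.x+ it lives under `/_node/{node-name}/_config`, and admin credentials are required outside an admin-party dev setup), the timeout could be lowered temporarily like this:

```javascript
// set-timeout.js - temporarily lower the query server timeout so slow
// documents fail fast and show up in the log. Host/port are placeholders.
const http = require('http');

const req = http.request(
  { host: 'localhost', port: 5984,
    path: '/_config/couchdb/os_process_timeout', method: 'PUT',
    headers: { 'Content-Type': 'application/json' } },
  function (res) {
    let body = '';
    res.on('data', function (c) { body += c; });
    // CouchDB answers with the previous value of the setting
    res.on('end', function () { console.log(res.statusCode, body); });
  });

req.on('error', console.error);
req.write(JSON.stringify('1000')); // config values are JSON-encoded strings
req.end();
```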
I have a news site with 150,000 news articles. About 250 new articles are added daily, at intervals of 5-15 minutes. I understand that Solr is optimized for millions of records and that my 150K won't be a problem for it, but I am worried that the frequent updates will be, since the cache gets invalidated with every update. On my dev server, a cold load of a page takes 5-7 seconds (since every page runs a few MLT queries).
Would it help if I split my index into two, an archive index and a latest index? The archive index would only be updated once a day.
Can anyone suggest any ways to optimize my installation for a constantly updating index?
Thanks
My answer is: test it! Don't try to optimize before you know how it performs. Like you said, 150K is not a lot; it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from different concurrent threads (to simulate users) while you index more documents, and see how it behaves.
One setting you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit on each document (you will bring Solr down). The value you choose for this setting lets you trade off the latency of the system (how long it takes for new documents to show up in results) against keeping the system responsive.
Consider using mlt=true in the main query instead of issuing a separate MoreLikeThis query per result. You'll save the round trips, so it will be faster.
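A rough sketch of such a combined query (core name, fields and counts are placeholders; this assumes the default /select handler, which includes the MoreLikeThis component):

```javascript
// mlt-query.js - one search with MoreLikeThis piggy-backed on the results.
// Core name and field names are placeholders.
const http = require('http');

const params = new URLSearchParams({
  q: 'title:election',
  rows: '10',
  wt: 'json',
  mlt: 'true',            // enable the MoreLikeThis component
  'mlt.fl': 'title,body', // fields to compute similarity on
  'mlt.count': '5',       // similar docs to return per result
});

http.get({ host: 'localhost', port: 8983,
           path: '/solr/articles/select?' + params.toString() },
  function (res) {
    let body = '';
    res.on('data', function (c) { body += c; });
    res.on('end', function () {
      const json = JSON.parse(body);
      // json.response.docs holds the main results;
      // json.moreLikeThis holds similar docs keyed by result id.
      console.log(Object.keys(json.moreLikeThis || {}));
    });
  }).on('error', console.error);
```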