Limiting documents in temporary views - CouchDB

I'm playing around with map and reduce through temporary views, but at 1,000,000+ documents it is a bit slow. Rather than creating a separate dataset for testing, is it possible to use only a subset of the data in the temporary view?

A map-reduce view is more like "CREATE INDEX" than it is like "SELECT * FROM".
In other words, when you do a map-reduce view, CouchDB will crunch through every document.
However, for testing, one thing you can do is make a normal (non-temporary) view. Just develop your work in a throwaway design document, _design/my_experiments.
Save your map-reduce view code and then query the view with the ?stale=update_after option. You will probably get no results, but stale=update_after tells CouchDB to begin processing the view in the background. Now try your query again: you will see the results that have been processed so far. Try a third time, and you will see even more data reflected.
Roughly speaking, views process documents in the same order that a _changes query returns them to you: the oldest update is processed first, then the rest in order, with the most recent change processed last.
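As a concrete example of that query-again loop, here is a minimal Python sketch, assuming a CouchDB 1.x server on localhost:5984, a database named db, and a saved view _design/my_experiments/_view/by_key (database and view names are hypothetical):

    import time
    import requests

    # Hypothetical database and view names; adjust to your setup.
    VIEW = "http://localhost:5984/db/_design/my_experiments/_view/by_key"

    for attempt in range(3):
        # stale=update_after returns whatever is already indexed and then
        # kicks off a background index update for the next query to pick up.
        result = requests.get(VIEW, params={"stale": "update_after",
                                            "limit": 10}).json()
        print(attempt, result.get("total_rows"),
              [row["key"] for row in result.get("rows", [])])
        time.sleep(5)  # give the indexer time to process more documents

Each iteration should report progressively more rows as the indexer works through the backlog.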

Related

CouchDB request never completes

I am trying to request a large number of documents from my database (which has over 400k documents). I started with the built-in _all_docs view. I first tried this query:
http://database:port/databasename/_all_docs?limit=100&include_docs=true
No problem. Completes as expected. Now to ramp it up:
http://database:port/databasename/_all_docs?limit=1000&include_docs=true
Still fine. Took longer, more data, etc. as expected. Ramp it up again:
http://database:port/databasename/_all_docs?limit=10000&include_docs=true
Request never completes. The dev tools in Chrome show Size = 5.3 MB (which seems to be significant), and this occurs no matter what value over roughly 6,500 I use for the limit parameter. Whether I specify 6,500 or 10,000, it always shows 5.3 MB downloaded, and the request stalls.
I have also tried other combinations, such as "skip", and it seems that limit + skip must be < 6,500 or I get the same stall.
My environment: Couchdb 1.6.1, Ubuntu 14.04.3 LTS, Azure A1 standard
You have to pre-warm your queries; just throwing 100k or more docs at CouchDB and expecting to get them all out in one request won't work.
When you ask for some items from a view (in your case the default view), CouchDB will notice on the first read that the B-tree for the view doesn't exist yet, so it goes ahead and builds it. Depending on how many documents you have in your database, that can take a while, putting a heavy workload on your database.
On every subsequent read, CouchDB will check which documents have changed since the view was last updated and run the changed documents through the map and reduce functions. So if you only query a view from time to time, but have lots of changes in between, expect some delay on the next read.
There are two ways to handle this situation:
1. Pre-warm your view - run a cron job that reads from the view regularly to make sure its B-tree stays up to date.
2. Prepare your view in advance, before inserting the data into CouchDB.
And if you really do want to read all your docs now, don't read them all at once; page through them with skip and limit range queries, as in the sketch below.
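A rough sketch of that paged read in Python (the database URL is hypothetical; this assumes the requests library and a CouchDB on localhost):

    import requests

    DB = "http://localhost:5984/databasename"  # hypothetical URL
    BATCH = 1000  # stay well under the size at which the request stalled
    skip = 0

    while True:
        # Ask for one modest page at a time instead of everything at once.
        resp = requests.get(DB + "/_all_docs",
                            params={"limit": BATCH, "skip": skip,
                                    "include_docs": "true"}).json()
        rows = resp.get("rows", [])
        if not rows:
            break
        for row in rows:
            print(row["id"])  # stand-in for whatever processing you need
        skip += len(rows)

Note that CouchDB still walks past all skipped rows internally, so for very deep pages, paging by startkey (resuming from the last ID seen) scales better than an ever-growing skip value.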

How to implement custom pagination with LINQ2Entities using 1 call to the database

I am working on an ASP.NET Web Forms project and I use a jQuery DataTable to visualize data fetched from SQL Server. I need to pass both the results for the current page and the total number of results; so far I have this code:
    var queryResult = query.Select(p => new[] { p.Id.ToString(),
                                                p.Name,
                                                p.Weight.ToString(),
                                                p.Address })
                           .Skip(iDisplayStart)
                           .Take(iDisplayLength)
                           .ToArray();
and the result that I get when I return it to the view, like:
    iTotalRecords = queryResult.Count(),
is the number of records that the user has chosen to see per page. Logical, but I hadn't thought about it while building my method chain. Now I am thinking about the optimal way to implement this. Since it's likely to be used with relatively big amounts of data (10,000 rows, maybe more), I would like to leave as much work as I can to the SQL server. I have found several questions about this, and the impression I get is that I have to either make two queries to the database or manipulate the total result in my code. But I don't think that will be efficient when working with many records.
So what can I do here to get best performance?
With regard to what you're looking for, I don't think there is a simple answer.
I believe the only way you can currently do this is by running more than one query like you have already suggested, whether this would be encapsulated inside a stored procedure (SPROC) call or generated by EF.
However, I believe you can make optimisations so that your query runs quicker.
First of all, because you are chaining your methods together, every execution may result in the actual query needing to be recompiled and re-cached by SQL Server (if that is your chosen technology) before being executed. This normally only takes a few milliseconds, but if the query itself only takes a few milliseconds to run, the compilation overhead is relatively expensive.
Entity Framework will translate this LINQ query and execute it using derived tables. With a small result set of approx. 1k records to be paged, your current solution may be best suited. This also depends on how complex the SQL filtering generated by your method chaining is.
If your result set to be paged grows towards 15k, I would suggest writing a SPROC to get the best performance and scalability. It would insert the records into a temp table and run two queries against it: first to get the page of records, and second to get the total count.
    alter proc dbo.usp_GetPagedResults
        @Skip int = 10,  -- first row number to return
        @Take int = 10   -- page size
    as
    begin
        select
            row_number() over (order by id) [RowNumber],
            t.Name,
            t.Weight,
            t.Address
        into
            #results
        from
            dbo.MyTable t

        declare @To int = @Skip + @Take - 1

        -- First result set: the requested page of rows.
        select * from #results where RowNumber between @Skip and @To
        -- Second result set: the total number of rows.
        select max(RowNumber) from #results
    end
    go

    -- Example call (returns rows 11-20 plus the total count):
    -- exec dbo.usp_GetPagedResults @Skip = 11, @Take = 10
You can use the EF to map a SPROC call to entity types or create a new custom type containing the results and the number of results.
Stored Procedures with Multiple Results
I found that the cost of running the above SPROC was approximately a third of running the query which EF generated to get the same result, based on a result set of 15k records. It was, however, three times slower than the EF query on a 1k record result set, due to the temp table creation.
Encapsulating this logic inside a SPROC allows the query to be refactored and optimised as your result set grows without having to change any application based code.
The suggested solution doesn't simply wrap the derived table query that Entity Framework generates inside a SPROC, as I found there was only a marginal performance difference between running that query in a SPROC and running it directly.

Most efficient way to create CouchDB views

My CouchDB view indexes are being created slower than I would like. Writing the documents is not such a problem but the users can edit them offline and then bulk update, which seems to slow things right down.
This answer helped, but I was just wondering: is it better to separate the various views into different design documents (Eg. 1), or to store them all in one (Eg. 2)?
Eg. 1
_design/posts/_view/id
_design/comments/_view/id
_design/tags/_view/id
Eg. 2
_design/webresources/_view/_id?key="posts"
_design/webresources/_view/_id?key="comments"
_design/webresources/_view/_id?key="tags"
*This example is just for illustration purposes. I am only concerned with the time it takes to build the indexes.
You will gain better performance if you read often. CouchDB views are updated and built at read time, so you can read the view every time a document is updated to keep it hot*.
Or maybe listen to the changes feed and keep track of the documents updated. Once they reach a certain threshold value, read the view (see the sketch below).
Another option is to use the stale parameter.
If stale=ok is set, CouchDB will not refresh the view even if it is stale; the benefit is improved query latency. If stale=update_after is set, CouchDB will update the view after the stale result is returned.
Every design document is a separate Erlang process, so separating your views across different design documents will cause them to be built concurrently. However, each view will still be built in a blocking manner; that is, two views in different design documents can start updating at the same time, but the time it takes to update each individual view will be the same as if they were in the same design document.
*You don't necessarily have to care about the result. Our goal here is to trick CouchDB into updating the view, so you can fire off the request in a separate async process and be done with it.
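A minimal Python sketch of that changes-feed approach, assuming CouchDB 1.x, the requests library, and hypothetical database and view names (the view here is one from Eg. 1):

    import requests

    DB = "http://localhost:5984/db"            # hypothetical database URL
    VIEW = DB + "/_design/posts/_view/id"      # hypothetical view
    THRESHOLD = 100  # refresh the index after this many changes

    since = requests.get(DB).json()["update_seq"]
    changed = 0

    while True:  # run as a small background daemon
        # Long-poll the changes feed from where we left off.
        feed = requests.get(DB + "/_changes",
                            params={"feed": "longpoll",
                                    "since": since}).json()
        since = feed["last_seq"]
        changed += len(feed["results"])
        if changed >= THRESHOLD:
            # Touch the view so the index catches up; the result is discarded.
            requests.get(VIEW, params={"limit": 0})
            changed = 0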

Delete all documents in a CouchDB database *except* the design documents

Is it possible to delete all documents in a couchdb database, except design documents, without creating a specific view for that?
My first approach has been to access the _all_docs standard view, and discard those documents starting with _design. This works but, for large databases, is too slow, since the documents need to be requested from the database (in order to get the document revision) one at a time.
If this is the only valid approach, I think it would be much more practical to delete the complete database and create it from scratch, inserting the design documents again.
I can think of a couple of ideas.
Use _all_docs
You do not need to fetch all the documents; you only need the ID and revision, and by default that is all that _all_docs returns. You can make pretty big requests in batches (10k or 100k docs at a time should be fine), as in the sketch below.
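For instance, a sketch of that batched delete in Python, assuming a database named db on localhost and the requests library (names hypothetical):

    import requests

    DB = "http://localhost:5984/db"  # hypothetical database URL
    BATCH = 10000

    while True:
        rows = requests.get(DB + "/_all_docs",
                            params={"limit": BATCH}).json().get("rows", [])
        # _all_docs already gives us the ID and revision; skip design docs.
        docs = [{"_id": r["id"], "_rev": r["value"]["rev"], "_deleted": True}
                for r in rows if not r["id"].startswith("_design/")]
        if not docs:
            break  # nothing but design documents left
        requests.post(DB + "/_bulk_docs", json={"docs": docs})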
Replicate then delete
You could use an _all_docs query to get the IDs of all design documents.
GET /db/_all_docs?startkey="_design/"&endkey="_design0"
Then replicate them somewhere temporary.
POST /_replicator
{ "source":"db", "target":"db_ddocs", "create_target":true
, "user_ctx": {"roles":["_admin"]}
, "doc_ids": ["_design/ddoc_1", "_design/ddoc_2", "etc..."]
}
Now you can just delete the original database and replicate the temporary one back by swapping the "source" and "target" values.
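Sketched end to end in Python (server URL and database names hypothetical; writes to the _replicator database require admin credentials, omitted here):

    import requests

    COUCH = "http://localhost:5984"  # hypothetical server URL

    # 1. Find the design documents.
    view = requests.get(COUCH + "/db/_all_docs",
                        params={"startkey": '"_design/"',
                                "endkey": '"_design0"'}).json()
    doc_ids = [row["id"] for row in view["rows"]]

    # 2. Copy them to a temporary database. Replication is asynchronous:
    # in real use, poll the replication document's _replication_state
    # until it reads "completed" before moving on.
    requests.post(COUCH + "/_replicator",
                  json={"source": "db", "target": "db_ddocs",
                        "create_target": True,
                        "user_ctx": {"roles": ["_admin"]},
                        "doc_ids": doc_ids})

    # 3. Drop and recreate the original, then replicate the design docs back.
    requests.delete(COUCH + "/db")
    requests.put(COUCH + "/db")
    requests.post(COUCH + "/_replicator",
                  json={"source": "db_ddocs", "target": "db",
                        "user_ctx": {"roles": ["_admin"]},
                        "doc_ids": doc_ids})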
Deleting vs "deleting"
Note, these are really apples vs. oranges techniques. By deleting a database, you are wiping out the edit history of all its documents. In other words, you cannot replicate those deletion events to any other database. When you "delete" a document in CouchDB, it stores a record of that deletion. If you replicate that database, those deletions will be reflected in the target. (CouchDB stores "tombstones" indicating the document ID, its revision history, and its deleted state.)
That may or may not be important to you. The first idea is probably considered more "correct"; however, I can see the value of the second: you can visualize the entire program needed to accomplish it in your head. It's only a few queries and you're done. No looping through _all_docs batches, no headache. Your specific situation will probably make it obvious which is better.
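If you want to see a tombstone for yourself, here is a quick Python sketch (database URL and document ID hypothetical):

    import requests

    DB = "http://localhost:5984/db"  # hypothetical

    # Create and then delete a throwaway document.
    rev = requests.put(DB + "/throwaway", json={"x": 1}).json()["rev"]
    deleted_rev = requests.delete(DB + "/throwaway",
                                  params={"rev": rev}).json()["rev"]

    # A plain GET now reports the document as deleted...
    print(requests.get(DB + "/throwaway").json())
    # => {"error": "not_found", "reason": "deleted"}

    # ...but the tombstone is still stored and will replicate.
    print(requests.get(DB + "/throwaway",
                       params={"rev": deleted_rev}).json())
    # => {"_id": "throwaway", "_rev": "2-...", "_deleted": True}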
Install couchapp, pull down the design doc to your hard disk, delete the DB in Futon, then push the design doc back up to your recreated database. =)
You could write a shell script that goes through the list of all documents and deletes them one by one, except for design docs. Apparently couch-batch can do that. Note that you don't need to fetch the whole docs to do this, just the ID and revision.
Other than that, I think filtered replication (or the replication proposed by JasonSmith) is your best bet.

Including documents in the emit compared to include_docs = true in CouchDB

I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?
Yes, it will increase the size of your index, because CouchDB effectively copies the entire document into the index in those cases. For cases in which you can, use include_docs=true.
There is, however, a race condition to be aware of when using this, mentioned in the wiki: it is possible that, between reading the view data and fetching the document, the document has changed (or has been deleted, in which case _deleted will be true). This is documented here under "Querying Options".
This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk because CouchDB includes the emitted data directly into the index file. However, this means that, when querying your data, CouchDB can just stream the content directly from the index file on disk. This is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, on querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups from the main data file, meaning that the cost and time of returning data increases significantly.
While the query time difference for small numbers of documents is small, it will add up over every call made by the application. For me, therefore, emitting needed fields from a document into the index is usually the right call: disk is cheap, users' attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice.
I did a totally unscientific test on this to get a feel for the difference. I found about an 8x increase in response time and a 50% increase in CPU when using include_docs=true to read 100,000 documents from a view, compared to a view where the documents were emitted directly into the index itself.
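To make the two styles concrete, here is a hedged Python sketch that defines both flavors of the same view (the map functions are ordinary JavaScript strings inside the design document) and queries each; all names are hypothetical:

    import requests

    DB = "http://localhost:5984/db"  # hypothetical

    # Two flavors of the same view: one copies the document into the index,
    # the other emits a null value and relies on include_docs at query time.
    requests.put(DB + "/_design/example", json={
        "views": {
            "with_doc":  {"map": "function(doc) { emit(doc.type, doc); }"},
            "with_null": {"map": "function(doc) { emit(doc.type, null); }"}
        }
    })

    # Bigger index, faster reads: rows stream straight from the index file.
    fat = requests.get(DB + "/_design/example/_view/with_doc").json()

    # Smaller index, slower reads: one extra document lookup per row.
    lean = requests.get(DB + "/_design/example/_view/with_null",
                        params={"include_docs": "true"}).json()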
