CouchDB slow list function

In general I know what the problem is, but I have no idea how to solve it.
I have a simple map function:
function(doc) {
    if (doc.Type === 'Mission') {
        for (var i in doc.Sections) {
            emit(doc._id, {_id: doc.Sections[i].id});
        }
    }
}
Based on the result of the map function, I use a list function to do some formatting:
function(head, req) {
    var result = [];
    var row;
    var topo = require('lib/topojson');
    while ((row = getRow())) {
        if (row !== null) {
            if (row.value._id) {
                row.doc.Geometry.properties.IDs.Section_ID = row.value._id;
            } else {
                row.doc.Geometry.properties.IDs.Section_ID = row.value;
            }
            var geojson = {
                type: "Feature",
                geometry: row.doc.Geometry.geometry,
                properties: row.doc.Geometry.properties
            };
            result.push(geojson);
        } else {
            send(JSON.stringify({
                status_code: 404
            }));
        }
    }
    send(JSON.stringify(result));
}
The more documents match the map function, the longer the list function takes to do its processing. The limiting factor is the couchjs view server: first the result of the map function has to be serialized, and only then can the list function do its work.
As I wrote, for a small number of documents the processing time isn't dramatic, but as the number of documents increases, so does the list function's processing time.
Does anyone have an idea how to improve my way of formatting the result?
Is it better to let the client do the work?

There are several tricks to speed up _list functions.
Make your list and map functions live in two different design docs, to ensure they run in different SpiderMonkey instances.
Send the response in large chunks, tens or even hundreds of kilobytes. Find the optimal chunk size: chunks that are too large are bad in terms of TTFB and memory consumption, while small chunks produce I/O overhead between SpiderMonkey and Erlang.
Minimize the Storage->Erlang->JS serialize/deserialize overhead. Make your map function emit strings that are serialized JSON, and parse each row's JSON inside your list function from the plain string (see the sketch after these tips). The simpler the structure you pass to Erlang, the less time is spent on the Erlang side processing it and passing it to SpiderMonkey.
You can also use a caching approach, but you must clearly understand what you're doing.
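A rough sketch of the third trick, reusing the map/list pair from the question (the chunk size is an arbitrary assumption, and note that serializing the row value gives up the value._id/include_docs linked-document lookup the original map relies on, so it only applies where the row value itself carries the data):
function(doc) {
    if (doc.Type === 'Mission') {
        for (var i = 0; i < doc.Sections.length; i++) {
            // The row value is a flat string, not a nested object, so Erlang
            // has less structure to walk before handing it to SpiderMonkey.
            emit(doc._id, JSON.stringify({_id: doc.Sections[i].id}));
        }
    }
}
function(head, req) {
    start({headers: {'Content-Type': 'application/json'}});
    var row, chunk = [], first = true;
    send('[');
    while ((row = getRow())) {
        // Parse the flat string back into an object inside the JS server.
        var value = JSON.parse(row.value);
        chunk.push((first ? '' : ',') + JSON.stringify(value._id));
        first = false;
        if (chunk.length >= 500) { // flush in large chunks, not one send() per row
            send(chunk.join(''));
            chunk = [];
        }
    }
    send(chunk.join('') + ']');
}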

A list function is executed at query time, which means the processing time is proportional to the number of rows the view returns. You can use a list function to format the last 20 posts of a blog, but you can't use it to process 100,000 documents. That kind of work must be done inside a map function. In your place I would move the operations you are doing inside the list function into the map function, or, even better, perform them before saving the document.
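A rough sketch of that advice, assuming the Geometry shown in the question lives on the section documents themselves (the original fetches them via value._id plus include_docs): index the section documents directly and emit the finished GeoJSON feature, so the view rows already contain the formatted result.
function(doc) {
    // Hypothetical guard; adjust to however section documents are typed.
    if (doc.Geometry) {
        // Deep-copy the properties: documents passed to map functions are sealed.
        var props = JSON.parse(JSON.stringify(doc.Geometry.properties));
        props.IDs.Section_ID = doc._id;
        emit(doc._id, {
            type: "Feature",
            geometry: doc.Geometry.geometry,
            properties: props
        });
    }
}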

Related

How to filter view query results based on reduce value (and not just on key)

Using map/reduce views only (not Mango) and the following example from the documentation, one may count the occurrences of each unique label using the map and reduce functions below.
Rows returned by the view
{"total_rows":9,"offset":0,"rows":[
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"bike","value":null},
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"couchdb","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"couchdb","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"couchdb","value":null},
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"drums","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"hypertext","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"music","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"mustache","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"philosophy","value":null}
]}
Map function
function(doc) {
    if (doc.name && doc.tags) {
        doc.tags.forEach(function(tag) {
            emit(tag, 1);
        });
    }
}
Reduce function
function(keys, values) {
return sum(values);
}
Response with grouping
{"rows":[
{"key":"bike","value":1},
{"key":"couchdb","value":3},
{"key":"drums","value":1},
{"key":"hypertext","value":1},
{"key":"music","value":1},
{"key":"mustache","value":1},
{"key":"philosophy","value":1}
]}
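For reference, the grouped response above is what CouchDB returns when the view is queried with group=true (database and design document names below are placeholders):
GET /recipes/_design/docs/_view/tags?group=true
Without group=true, the reduce collapses everything into a single row holding the grand total (9 here).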
Now my question is: using map/reduce views only (not Mango), how can I query the view so that it returns only rows with a specific reduced value (for example 3)? All the view parameters seem to focus on filtering by key, but I need to filter on the value. Ideally, greater-than and less-than comparisons on the reduced value would also be possible.
The ability to filter on the value is essential for scenarios like the one above, but also for more advanced scenarios involving linked documents. Of course, I am not interested in filtering in memory in the application layer, since in real-world scenarios the result set would be much larger than a dozen rows.

Efficiently validating large list of objects

I have a function that is meant to remove items from a collection if a certain field does not pass a validation check (either email or phone, but that's not important in this context). The problem is that a regular expression is relatively slow, and I have lists of 1 million+ items.
My function
public HashSet<ListItemModel> RemoveInvalid(HashSet<ListItemModel> listItems)
{
    string pattern = (this.phoneOrEmail == "email") // phoneOrEmail is set via config file
        // RFC 5322 compliant email regex, see http://www.regular-expressions.info/email.html
        ? @"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
        // North American phone number regex, see http://stackoverflow.com/questions/12101125/regex-to-allow-only-digits-hypens-space-parentheses-and-should-end-with-a-dig
        : @"(?:\d{3}(?:\d{7}|\-\d{3}\-\d{4}))|(?:\(\d{3}\)(?:\-\d{3}\-)|(?: \d{3} )\d{4})";
    Regex re = new Regex(pattern);
    if (phoneOrEmail == "email")
    {
        return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Email, 0)));
    }
    else
    {
        return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Tel, 0)));
    }
}
This takes way too long to execute. Is there a faster way of returning the subset that contains only valid emails/phone numbers?
I need to come up with something that is lightning quick. My other operations usually take only a couple of seconds on 700k+ items, but this method takes forever, and I hate that. I will be experimenting with a series of LINQ .Contains(x, y, z) checks, but in the meantime I'd like some input from people who are smarter than me.
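No answer is recorded here, but two standard levers are worth a sketch: construct the Regex once with RegexOptions.Compiled (and RegexOptions.IgnoreCase, since the email pattern above only lists lowercase character classes), and let PLINQ spread the matching across cores. The names reuse those from the question; treat this as a sketch under those assumptions, not a tested drop-in.
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static HashSet<ListItemModel> RemoveInvalidParallel(
    HashSet<ListItemModel> listItems, string pattern, bool useEmail)
{
    // Compile once up front; matching on a shared Regex instance is thread-safe.
    var re = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    return new HashSet<ListItemModel>(
        listItems.AsParallel()
                 .Where(x => re.IsMatch(useEmail ? x.Email : x.Tel)));
}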

CouchDB - Filtered Replication - Can the speed be improved?

I have a single database (300MB and 42,924 documents) consisting of about 20 different kinds of documents from about 200 users. The documents range in size from a few bytes to many kilobytes (150KB or so).
When the server is unloaded, replication with the following filter function takes about 2.5 minutes to complete.
When the server is loaded, it takes more than 10 minutes.
Can anyone comment on whether these times are expected and, if not, suggest how I might optimize things to get better performance?
function(doc, req) {
    var acceptedDate = true;
    if (doc.date) {
        var docDate = new Date();
        var dateKey = doc.date;
        docDate.setFullYear(dateKey[0], dateKey[1], dateKey[2]);
        var reqYear = req.query.year;
        var reqMonth = req.query.month;
        var reqDay = req.query.day;
        var reqDate = new Date();
        reqDate.setFullYear(reqYear, reqMonth, reqDay);
        acceptedDate = docDate.getTime() >= reqDate.getTime();
    }
    return doc.user_id && doc.user_id == req.query.userid && doc._id.indexOf("_design") != 0 && acceptedDate;
}
Filtered replication is slow because, for every fetched document, this logic runs to decide whether to replicate it:
1. CouchDB fetches the next document.
2. Because a filter function has to be applied, the document is converted to JSON.
3. The JSONified document is passed through stdio to the query server.
4. The query server decodes the document from JSON.
5. The query server looks up and runs your filter function, which returns a true or false value to CouchDB.
6. If the result is true, the document is replicated.
7. Back to step 1 for the next document.
For non-filtered replication, take this list, throw away steps 2-5, and let step 6 always return true. That overhead is what slows down the whole replication process.
To significantly improve filtered replication speed, you can use Erlang filters via the Erlang native query server. They run inside CouchDB, don't pass through any stdio interface, and incur no JSON encode/decode overhead.
Note that the Erlang query server, unlike the JavaScript one, does not run inside a sandbox, so you need to really trust the code you run with it.
Another option is to optimize your filter function, e.g. by reducing object creation and method calls (a sketch follows), though you won't actually win much with this.
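A sketch of that micro-optimization, assuming doc.date is the [year, month, day] array the question implies: compare the date components lexicographically instead of building two Date objects per document.
function(doc, req) {
    if (doc._id.indexOf("_design") === 0) return false;
    if (!doc.user_id || doc.user_id != req.query.userid) return false;
    if (doc.date) {
        var d = doc.date; // [year, month, day]
        var r = [parseInt(req.query.year, 10),
                 parseInt(req.query.month, 10),
                 parseInt(req.query.day, 10)];
        // Lexicographic compare: the document date must be >= the requested date.
        for (var i = 0; i < 3; i++) {
            if (d[i] > r[i]) return true;
            if (d[i] < r[i]) return false;
        }
    }
    return true;
}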

couchdb, disabling rereduce

I'm attempting to get a key-value pair out of CouchDB. The key is the player ID, and the value is how many games exist where it is that player's turn. I have a map function that successfully produces playerID, gameID rows, where playerID is whose turn it is in gameID. My reduce function is a simple length call.
function(keys, values) {
    return values.length;
}
When I run this from Futon, it runs fine. I get the sample output:
5,11
6,3
However, when I call it from Divan (a C# library for CouchDB), I get the result
null, 14
My guess is it's merging these into one item through a rereduce. Is there a way to disable rereduce?
Thanks.
-Nick
No, you can't disable rereduce. However, the difference here is that Futon adds group=true when calling your view and Divan does not, which explains the different results.
You should replace your reduce function with "_count", which correctly handles both the reduce and rereduce cases. Your function returns the length of the values array, which is only correct for the reduce case. A correct solution in JavaScript would look like this:
function(keys, values, rereduce) {
    if (rereduce) {
        return sum(values);
    } else {
        return values.length;
    }
}
In the reduce call, the values array contains whatever you emitted as the value, one entry per emit. Since you're counting, you don't care what those values are, only how many of them there were. In the rereduce call, the values array contains the results of previous reduce calls. There the length of the values array is irrelevant; instead you want the sum of the counts produced by the previous reduce phases.
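For completeness, a minimal design document sketch using the built-in reducer (document, view, and field names are assumptions): "_count" is implemented natively inside CouchDB and handles both phases, and querying the view with group=true yields the per-player counts.
{
  "_id": "_design/games",
  "views": {
    "by_turn": {
      "map": "function(doc) { if (doc.turnPlayerId) emit(doc.turnPlayerId, null); }",
      "reduce": "_count"
    }
  }
}
GET /db/_design/games/_view/by_turn?group=true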

Parallel.ForEach Ordered Execution

I am trying to execute parallel functions on a list of objects using the new C# 4.0 Parallel.ForEach function. This is a very long maintenance process. I would like to make it execute in the order of the list so that I can stop and continue execution at the previous point. How do I do this?
Here is an example. I have a list of objects: a1 to a100. This is the current order:
a1, a51, a2, a52, a3, a53...
I want this order:
a1, a2, a3, a4...
I am OK with some objects being run out of order, as long as I can find a point in the list where I can say that all objects before it have been run. I read the parallel programming C# whitepaper and didn't see anything about it, and there isn't a setting for this in the ParallelOptions class.
Do something like this:
int current = 0;
object lockCurrent = new object();
Parallel.For(0, list.Count,
    new ParallelOptions { MaxDegreeOfParallelism = MaxThreads },
    (ii, loopState) =>
    {
        // Parallel.For chunks the index range, each thread getting a chunk to work on,
        // e.g. [1-1,000], [1,001-2,000], [2,001-3,000], etc.
        // We have prioritized our job queue so that more important tasks come first,
        // so we don't want the range broken up; we want the tasks run in roughly the
        // same order we started with. So we ignore the passed-in loop variable and
        // just increment our own counter.
        int thisCurrent = 0;
        lock (lockCurrent)
        {
            thisCurrent = current;
            current++;
        }
        dothework(list[thisCurrent]);
    });
You can see how, when you break out of the parallel loop, you will know the last list item to have been executed, assuming you let all threads finish before breaking. I'm not a big fan of PLINQ or LINQ; I honestly don't see how LINQ/PLINQ leads to maintainable or readable source code. Parallel.For is a much better solution.
If you use Parallel.Break to terminate the loop, then you are guaranteed that all indices below the returned value have been executed. This is about as close as you can get. The example here uses For, but ForEach has similar overloads.
int n = ...
var result = new double[n];
var loopResult = Parallel.For(0, n, (i, loopState) =>
{
    if (/* break condition is true */)
    {
        loopState.Break();
        return;
    }
    result[i] = DoWork(i);
});
if (!loopResult.IsCompleted &&
    loopResult.LowestBreakIteration.HasValue)
{
    Console.WriteLine("Loop encountered a break at {0}",
        loopResult.LowestBreakIteration.Value);
}
In a ForEach loop, an iteration index is generated internally for each element in each partition. Execution takes place out of order, but after a break you know that all iterations lower than LowestBreakIteration have completed.
Taken from "Parallel Programming with Microsoft .NET" http://parallelpatterns.codeplex.com/
Available on MSDN. See http://msdn.microsoft.com/en-us/library/ff963552.aspx. The section "Breaking out of loops early" covers this scenario.
See also: http://msdn.microsoft.com/en-us/library/dd460721.aspx
For anyone else who comes across this question: if you're looping over an array or list (rather than an IEnumerable), you can use the overload of Parallel.ForEach that provides the element index to maintain the original order, too.
string[] MyArray; // array of stuff to do parallel tasks on
string[] ProcessedArray = new string[MyArray.Length];
Parallel.ForEach(MyArray, (ArrayItem, loopstate, ArrayElementIndex) =>
{
    string ProcessedArrayItem = TaskToDo(ArrayItem);
    ProcessedArray[ArrayElementIndex] = ProcessedArrayItem;
});
As an alternate suggestion, you could record which objects have been run and then filter the list when you resume execution to exclude the objects which have already run.
If this needs to be persistent across application restarts, you can store the IDs of the already-executed objects (I assume here the objects have some unique identifier), as sketched below.
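A minimal sketch of that bookkeeping, where Id, LoadProcessedIds, and SaveProcessedIds are hypothetical names standing in for whatever identifier and persistence you have:
// Load the IDs finished in a previous run (hypothetical helper).
var processedIds = new HashSet<string>(LoadProcessedIds());
// Skip anything already done, then work through the rest in parallel.
var remaining = list.Where(x => !processedIds.Contains(x.Id)).ToList();
Parallel.ForEach(remaining, item =>
{
    DoTheWork(item);
    lock (processedIds) { processedIds.Add(item.Id); }
});
SaveProcessedIds(processedIds); // hypothetical helper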
For anybody looking for a simple solution, I have posted 2 extension methods (one using PLINQ and one using Parallel.ForEach) as part of an answer to the following question:
Ordered PLINQ ForAll
Not sure if the question was altered, as my comment seems wrong now; here is an improved answer. The basic reminder is that parallel jobs run in an order outside your control: printing 10 numbers might result in 1, 4, 6, 7, 2, 3, 9, 0. If you want to stop your program and continue later, problems like this usually end up being solved by batching the workload and logging what was done. Say you had to check 10,000 numbers for primality: you could loop in batches of size 100 and keep a log per batch (log1 = 0..99, log2 = 100..199, and so on), setting a marker once each batch job has finished. It's a general approach, since the question isn't that exact either; a sketch follows.
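A sketch of that batching idea (the batch size and logging helper are placeholders): each batch runs in parallel internally, but batches complete strictly in order, so every batch boundary is a safe checkpoint for resuming.
const int BatchSize = 100;
for (int start = 0; start < items.Count; start += BatchSize)
{
    int end = Math.Min(start + BatchSize, items.Count);
    // Within a batch, items may finish in any order...
    Parallel.For(start, end, i => DoTheWork(items[i]));
    // ...but once Parallel.For returns, everything below 'end' is done.
    LogCompletedBatch(start, end); // hypothetical marker used when resuming
}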
