Traverse a graph in parallel - multithreading

I'm revising for an exam (still) and have come across a question (posted below) that has me stumped. I think, in summary, the question is asking "Think of any_old_process that has to traverse a graph and do some work on the objects it finds, including adding more work.". My question is, what data structure can be parallelised to achieve the goals set out in the question?
The role of a garbage collector (GC) is to reclaim unused memory.
Tracing collectors must identify all live objects by traversing graphs
of objects induced by aggregation relationships. In brief, the GC has
some work-list of tasks to perform. It repeatedly (a) acquires a task
(e.g. an object to inspect), (b) performs the task (e.g. marks the
object unless it is already marked), and (c) generates further tasks
(e.g. adds the children of an unmarked task to the work-list). It is
desirable to parallelise this operation.
In a single-threaded
environment, the work-list is usually a single LIFO stack. What would
you have to do to make this safe for a parallel GC? Would this be a
sensible design for a parallel GC? Discuss designs of data structure
to support a parallel GC that would scale better. Explain why you
would expect them to scale better.

The natural data structure for a graph is, well, a graph, i.e. a set of graph elements (nodes) that can refer to other elements. For better cache reuse, though, the elements can be allocated in one or more arrays (generally, vectors) so that neighboring elements sit as close together in memory as possible. Generally, each element or group of elements should have a mutex (spin_mutex) to protect access to it; contention means that some other thread is already working on that element, so there is no need to wait for it. Where possible, though, an atomic operation on a flag/state field is preferable, since it marks the element as visited without a lock. For example, the simplest data structure could be the following:
struct object {
    vector<object*> references;
    atomic<bool> is_visited; // for simplicity, or epoch counter
                             // if nothing resets it to false
    void inspect();          // processing method
};
vector<object> objects; // also for simplicity, if it can be for real
                        // things like `parallel_for` would be perfect here
Given this data structure and the way the GC's work is described, it is a good fit for recursive parallelism such as the divide-and-conquer pattern:
void object::inspect() {
    if( ! is_visited.exchange(true) ) {   // only the first visitor gets `false` back
        for( object* o : references )     // alternatively it can be `parallel_for` in some variants
            cilk_spawn o->inspect();      // for Cilk or `task_group::run` for TBB or PPL
        // further processing of the object
    }
}
If the data structure in the question refers to how the tasks are organized, I'd recommend a work-stealing scheduler (like TBB or Cilk). There are tons of papers on this subject. To put it simply, each worker thread has its own deque of tasks, which other threads can also access, and when its deque is empty, a thread steals tasks from the others' deques.
The scalability comes from the property that each task can add other tasks, which can then run in parallel.
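The deque-per-worker idea can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions: JavaScript has no shared-memory threads here, so the "workers" take turns in a loop, and the names `markWithWorkStealing`, `numWorkers` and the `{ references: [...] }` object shape are hypothetical. In a real parallel collector the mark test would be an atomic test-and-set and the deques would need synchronized (or lock-free) access.

```javascript
// Simulated work stealing: each worker owns a deque, pops from its own tail,
// and steals from the head of another worker's deque when it runs dry.
function markWithWorkStealing(roots, numWorkers) {
  const deques = Array.from({ length: numWorkers }, () => []);
  const marked = new Set();
  let steals = 0;

  // Seed the root objects round-robin across the workers.
  roots.forEach((root, i) => deques[i % numWorkers].push(root));
  let pending = roots.length; // tasks currently sitting in any deque

  while (pending > 0) {
    for (let w = 0; w < numWorkers; w++) {
      let obj = deques[w].pop();          // owner end: LIFO, good locality
      if (obj === undefined) {
        const victim = deques.find(d => d.length > 0);
        if (victim === undefined) continue;
        obj = victim.shift();             // thief end: oldest task
        steals++;
      }
      pending--;
      if (marked.has(obj)) continue;      // would be atomic in real code
      marked.add(obj);
      for (const child of obj.references) {
        deques[w].push(child);            // new work stays with the finder
        pending++;
      }
    }
  }
  return { marked, steals };
}
```

Because owners and thieves work at opposite ends of a deque, they rarely touch the same task, which is why this layout tends to contend less than a single shared stack.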

Your questions:
Think of any_old_process that has to traverse a graph and do some work on the objects it finds, including adding more work.
... what data structure can be parallelised to achieve the goals set out in the question?
Quoted questions:
Some stuff about garbage collection.
Since you are specifically interested in parallelizing graph algorithms, I'll give an example of one kind of graph traversal that can be parallelized well.
Executive Summary
Finding local minima ("basins") or maxima ("peaks") is a useful operation in digital image processing. A concrete example is geological watershed analysis. One approach to the problem treats each pixel or small group of pixels in the image as a node and finds non-overlapping minimum spanning trees (MST) with the local minima as the tree roots.
Gory details
Below is a simplistic example. It's a web interview question from Palantir Technologies brought to Programming Puzzles & Code Golf by AnkitSablok. It's simplified by two assumptions:
That a pixel/cell only has four neighbors instead of the usual eight.
That a cell either has all uphill neighbors (it's a local minimum) or has a unique downhill neighbor. I.e., plains aren't allowed.
Below that is some JavaScript that solves this problem. It violates every reasonable coding standard against use of side-effects, but illustrates where some of the opportunities for parallelization exist.
In the "Create list of sinks (i.e. roots)" loop, note that each cell can be evaluated completely independently for elevation with respect to its neighbors, as long as the elevation data is static. In a sequential program, one thread of execution examines each cell. In a parallel program, the cells are divvied up so that one, and only one, thread reads and writes the local minima state information for a given cell (sink[] in the program below). If the list of minima/roots is generated in parallel, the queuing operations for the stack have to be synchronized. For a discussion of how to do that for stacks and other queues, see "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms", Michael & Scott, 1996. For modern updates, follow the citation tree on Google Scholar (no mutex required :).
In the "Each root explores it's basin" loop, note that each basin could be explored/enumerated/flooded in parallel.
If you want to dive deeper into parallelizing MSTs, see "Scalable Parallel Minimum Spanning Forest Computation", Nobari, Cao, Karras, Bressan, 2012. The first two pages contain a clear and concise survey of the field.
Simplified example
A group of farmers has some elevation data, and we’re going to help them understand how rainfall flows over their farmland. We’ll represent the land as a two-dimensional array of altitudes and use the following model, based on the idea that water flows downhill:
If a cell’s four neighboring cells all have higher altitudes, we call this cell a sink; water collects in sinks. Otherwise, water will flow to the neighboring cell with the lowest altitude. If a cell is not a sink, you may assume it has a unique lowest neighbor and that this neighbor will be lower than the cell.
Cells that drain into the same sink – directly or indirectly – are said to be part of the same basin.
Your challenge is to partition the map into basins. In particular, given a map of elevations, your code should partition the map into basins and output the sizes of the basins, in descending order.
Assume the elevation maps are square. Input will begin with a line with one integer, S, the height (and width) of the map. The next S lines will each contain a row of the map, each with S integers – the elevations of the S cells in the row. Some farmers have small land plots such as the examples below, while some have larger plots. However, in no case will a farmer have a plot of land larger than S = 5000.
Your code should output a space-separated list of the basin sizes, in descending order. (Trailing spaces are ignored.)
Here's an example:
Input:
5
1 0 2 5 8
2 3 4 7 9
3 5 7 8 9
1 2 5 4 2
3 3 5 2 1
Output: 11 7 7
The basins, labeled with A’s, B’s, and C’s, are:
A A A A A
A A A A A
B B A C C
B B B C C
B B C C C
// lm.js - find the local minima
// Globalization of variables.
/*
The map is a 2 dimensional array. Indices for the elements map as:
[0,0] ... [0,n]
...
[n,0] ... [n,n]
Each element of the array is a structure. The structure for each element is:
Item Purpose Range Comment
---- ------- ----- -------
h Height of cell integers
s Is it a sink? boolean
x X of downhill cell (0..maxIndex) if s is true, x&y point to self
y Y of downhill cell (0..maxIndex)
b Basin name ('A'..'A'+# of basins)
Use a separate array-of-arrays for each structure item. The index range is
0..maxIndex.
*/
var height = [];
var sink = [];
var downhillX = [];
var downhillY = [];
var basin = [];
var maxIndex;
// A list of sinks in the map. Each element is an array of [ x, y ], where
// both x & y are in the range 0..maxIndex.
var basinList = [];
// An unordered list of basin sizes.
var basinSize = [];
// Functions.
function isSink(x,y) {
var myHeight = height[x][y];
var imaSink = true;
var bestDownhillHeight = myHeight;
var bestDownhillX = x;
var bestDownhillY = y;
/*
Visit the neighbors. If this cell is the lowest, then it's the
sink. If not, find the steepest downhill direction.
*/
function visit(deltaX,deltaY) {
var neighborX = x+deltaX;
var neighborY = y+deltaY;
if (myHeight > height[neighborX][neighborY]) {
imaSink = false;
if (bestDownhillHeight > height[neighborX][neighborY]) {
bestDownhillHeight = height[neighborX][neighborY];
bestDownhillX = neighborX;
bestDownhillY = neighborY;
}
}
}
if (x !== 0) {
// upwards neighbor exists
visit(-1,0);
}
if (x !== maxIndex) {
// downwards neighbor exists
visit(1,0);
}
if (y !== 0) {
// left-hand neighbor exists
visit(0,-1);
}
if (y !== maxIndex) {
// right-hand neighbor exists
visit(0,1);
}
downhillX[x][y] = bestDownhillX;
downhillY[x][y] = bestDownhillY;
return imaSink;
}
function exploreBasin(x,y,currentSize,basinName) {
// This cell is in the basin.
basin[x][y] = basinName;
currentSize++;
/*
Visit all neighbors that have this cell as the best downhill
path and add them to the basin.
*/
function visit(x,deltaX,y,deltaY) {
if ((downhillX[x+deltaX][y+deltaY] === x) && (downhillY[x+deltaX][y+deltaY] === y)) {
currentSize = exploreBasin(x+deltaX,y+deltaY,currentSize,basinName);
}
return 0;
}
if (x !== 0) {
// upwards neighbor exists
visit(x,-1,y,0);
}
if (x !== maxIndex) {
// downwards neighbor exists
visit(x,1,y,0);
}
if (y !== 0) {
// left-hand neighbor exists
visit(x,0,y,-1);
}
if (y !== maxIndex) {
// right-hand neighbor exists
visit(x,0,y,1);
}
return currentSize;
}
// Read map from file (1st argument).
var lines = $EXEC('cat "' + $ARG[0] + '"').split('\n');
maxIndex = lines.shift() - 1;
for (var i = 0; i<=maxIndex; i++) {
height[i] = lines.shift().split(' ').map(Number); // parse so that elevations compare numerically, not as strings
// Create all other 2D arrays.
sink[i] = [];
downhillX[i] = [];
downhillY[i] = [];
basin[i] = [];
}
for (var i = 0; i<=maxIndex; i++) { print(height[i]); }
// Everyone decides if they are a sink. Create list of sinks (i.e. roots).
for (var x=0; x<=maxIndex; x++) {
for (var y=0; y<=maxIndex; y++) {
if (sink[x][y] = isSink(x,y)) {
// This node is a root (AKA sink).
basinList.push([x,y]);
}
}
}
//for (var i = 0; i<=maxIndex; i++) { print(sink[i]); }
// Each root explores it's basin.
var basinName = 'A';
for (var i=basinList.length-1; i>=0; --i) { // i-- makes Closure Compiler sad
var x = basinList[i][0];
var y = basinList[i][1];
basinSize.push(exploreBasin(x,y,0,basinName));
basinName = String.fromCharCode(basinName.charCodeAt() + 1);
}
for (var i = 0; i<=maxIndex; i++) { print(basin[i]); }
// Done.
print(basinSize.sort(function(a, b){return b-a}).join(' '));
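As a complement to the recursive program above, the per-cell independence can be made more explicit with a two-phase version. This is a sketch, not a replacement for lm.js: the function name `basinSizes` and the flat `downhill` array are hypothetical, and the loops run sequentially here. The point is that every iteration of phase 1 reads only static elevation data, and every iteration of phase 2 reads only the finished `downhill` array, so either loop could be split across threads (the size counters in phase 2 would then need atomic increments or per-thread partial counts).

```javascript
// Two-phase basin labelling on a square elevation grid (array of rows).
function basinSizes(height) {
  const n = height.length;

  // Phase 1: each cell independently picks its lowest strictly-lower
  // neighbor among the four neighbors; a sink points to itself.
  const downhill = [];
  for (let r = 0; r < n; r++) {
    for (let c = 0; c < n; c++) {
      let best = r * n + c;
      let bestHeight = height[r][c];
      for (const [dr, dc] of [[-1, 0], [1, 0], [0, -1], [0, 1]]) {
        const nr = r + dr, nc = c + dc;
        if (nr < 0 || nr >= n || nc < 0 || nc >= n) continue;
        if (height[nr][nc] < bestHeight) {
          bestHeight = height[nr][nc];
          best = nr * n + nc;
        }
      }
      downhill.push(best);
    }
  }

  // Phase 2: each cell independently follows the pointers to its sink.
  const sizes = new Map();
  for (let cell = 0; cell < n * n; cell++) {
    let sink = cell;
    while (downhill[sink] !== sink) sink = downhill[sink];
    sizes.set(sink, (sizes.get(sink) || 0) + 1);
  }
  return [...sizes.values()].sort((a, b) => b - a);
}
```

For the 5x5 example map from the problem statement, `basinSizes(map).join(' ')` gives `11 7 7`.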

Related

I cannot find out why this code keeps skipping a loop

Some background on what is going on:
We are processing addresses into standardized forms. This is the code that takes addresses scored by how many components were found and then rescores them using a Levenshtein algorithm across similar post codes
The scores are how many components were found in that address divided by the number missed, to return a ratio
The input data, scoreDict, is a dictionary containing arrays of arrays. The first set of arrays is the scores, so there are 12 arrays because there are 12 scores in this file (it adjusts by file). There are then however many addresses fit that score in their own separate arrays stored in that. Don't ask me why I'm doing it that way, my brain is dead
The code correctly goes through each score array and each one is properly filled with the unique elements that make it up. It is not short by any amount, nothing is duplicated, I have checked
When we hit the score that is -1 (this goes to any address where it doesn't fit in some rule so we can't use its post code to find components so no components are found) the loop specifically ONLY DOES EVERY OTHER ADDRESS IN THIS SCORE ARRAY
It doesn't do this to any other score array, I have checked
I have tried changing the number to something else like 99, same issue except one LESS address got rescored, and the rest stayed at the original failing score of 99
I am going insane. Can anyone find where in this loop something may be going wrong to cause it to only do every other line? The index counters line and sc come through in the correct order and do not skip over. I have checked
I am sorry this is not professional, I have been at this one loop for 5 hours
Rescore: function Rescore(scoreDict) {
let tempInc = 0;
//Loop through all scores stored in scoreDict
for (var line in scoreDict) {
let addUpdate = "";
//Loop through each line stored by score
for (var sc in scoreDict[line.toString()]) {
console.log(scoreDict[line.toString()].length);
let possCodes = new Array();
const curLine = scoreDict[line.toString()][sc];
console.log(sc);
const curScore = curLine[1].split(',')[curLine[1].split(',').length-1];
switch (true) {
case curScore == -1:
let postCode = (new RegExp('([A-PR-UWYZ][A-HK-Y]?[0-9][A-Z0-9]?[ ]?[0-9][ABD-HJLNP-UW-Z]{2})', 'i')).exec(curLine[1].replace(/\\n/g, ','));
let areaCode;
//if (curLine.split(',')[curLine.split(',').length-2].includes("REFERENCE")) {
if ((postCode = (new RegExp('(([A-Z][A-Z]?[0-9][A-Z0-9]?(?=[ ]?[0-9][A-Z]{2}))|[0-9]{5})', 'i').exec(postCode))) !== null) {
for (const code in Object.keys(addProper)) {
leven.LoadWords(postCode[0], Object.keys(addProper)[code]);
if (leven.distance < 2) {
//Weight will have adjustment algorithms based on other factors
let weight = 1;
//Add all codes that are close to the same to a temp array
possCodes.push(postCode.input.replace(postCode[0], Object.keys(addProper)[code]).split(',')[0] + "(|W|)" + (leven.distance/weight));
}
}
let highScore = 0;
let candidates = new Array();
//Use the component script from cityprocess to rescore
for (var i=0;i<possCodes.length;i++) {
postValid.add([curLine[1].split(',').slice(0,curLine[1].split(',').length-2) + '(|S|)' + possCodes[i].split("(|W|)")[0]]);
if (postValid.addChunk[0].split('(|S|)')[postValid.addChunk[0].split('(|S|)').length-1] > highScore) {
candidates = new Array();
highScore = postValid.addChunk[0].split('(|S|)')[postValid.addChunk[0].split('(|S|)').length-1];
candidates.push(postValid.addChunk[0]);
} else if (postValid.addChunk[0].split('(|S|)')[postValid.addChunk[0].split('(|S|)').length-1] == highScore) {
candidates.push(postValid.addChunk[0]);
}
}
score.Rescore(curLine, sc, candidates[0]);
}
//} else if (curLine.split(',')[curLine.split(',').length-2].contains("AREA")) {
// leven.LoadWords();
//}
break;
case curScore > 0:
//console.log("That's a pretty good score mate");
break;
}
//console.log(line + ": " + scoreDict[line].length);
}
}
console.log(tempInc)
score.ScoreWrite(score.scoreDict);
}
The issue was that I was looping over the same array I was editing. As each element was rescored, it was removed from the array and moved into a separate one, so the array got shorter and everything after it shifted up by one index. When the first element was rescored and removed, the loop moved on to the second index, which by then held what was originally the third element, so every other address was skipped.
I fixed it by simply putting an empty array in place of each removed element, so everything kept its index and the array kept its length, and then clearing the empty values later in the code.
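The effect is easy to reproduce in isolation. The sketch below is not the original Rescore code; the function names are made up for illustration, and it uses a plain index loop with `splice` to stand in for "rescore and move out of the array". It also shows a second way to avoid the problem: iterate over a copy of the array.

```javascript
// Removing the current element while walking forward skips the next one.
function removeWhileIterating(items) {
  const visited = [];
  for (let i = 0; i < items.length; i++) {
    visited.push(items[i]);
    items.splice(i, 1); // everything after index i shifts down by one
  }
  return visited;
}

// Iterating over a snapshot keeps the traversal independent of the removals.
function removeFromSnapshot(items) {
  const visited = [];
  for (const item of items.slice()) {
    visited.push(item);
    items.splice(items.indexOf(item), 1);
  }
  return visited;
}
```

With `['a', 'b', 'c', 'd']` as input, the first function only visits `a` and `c`, while the second visits all four elements.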

How to detect string tone from FFT

I've got spectrum from a Fourier transformation. It looks like this:
Police was just passing nearby
Color represents intensity.
X axis is time.
Y axis is frequency - where 0 is at top.
While whistling or a police siren leave only one trace, many other tones seem to contain a lot of harmonic frequencies.
Electric guitar plugged directly into microphone (standard tuning)
The really bad thing is that, as you can see, there is no single dominant intensity: there are 2-3 frequencies that are almost equally strong.
I have written a peak detection algorithm to highlight the most significant peak:
function findPeaks(data, look_range, minimal_val) {
if(look_range==null)
look_range = 10;
if(minimal_val == null)
minimal_val = 20;
//Array of peaks
var peaks = [];
//Currently the max value (that might or might not end up in peaks array)
var max_value = 0;
var max_value_pos = 0;
//How many values did we check without changing the max value
var smaller_values = 0;
//Tmp variable for performance
var val;
var lastval=Math.round(data.averageValues(0,4));
//console.log(lastval);
for(var i=0, l=data.length; i<l; i++) {
//Remember the value for performance and readability
val = data[i];
//If last max value is larger than the current one, proceed and remember
if(max_value>val) {
//Count the values that are smaller than our champion
smaller_values++;
//If there has been enough smaller values we take this one for confirmed peak
if(smaller_values > look_range) {
//Remember peak
peaks.push(max_value_pos);
//Reset other variables
max_value = 0;
max_value_pos = 0;
smaller_values = 0;
}
}
//Only take values when the difference is positive (next value is larger)
//Also only take values that are larger than the minimum threshold
else if(val>lastval && val>minimal_val) {
//Remember this as our new champion
max_value = val;
max_value_pos = i;
smaller_values = 0;
//console.log("Max value: ", max_value);
}
//Remember this value for next iteration
lastval = val;
}
//Sort peaks so that the largest one is first
peaks.sort(function(a, b) {return -data[a]+data[b];});
//if(peaks.length>0)
// console.log(peaks);
//Return array
return peaks;
}
The idea is that I walk through the data and remember a value that is larger than the threshold minimal_val. If the next look_range values are smaller than the chosen value, it's considered a peak. This algorithm is not very smart, but it's very easy to implement.
However, it can't tell which is the major frequency of the string, much like I anticipated:
The red dots highlight the strongest peak
Here's a jsFiddle to see how it really works (or rather doesn't work).
What you see in the spectrum of a string tone is the set of harmonics at
f0, 2*f0, 3*f0, ...
with f0 being the fundamental frequency or pitch of your string tone.
To estimate f0 from the spectrum (output of the FFT, absolute value, probably logarithmic), you should not look for the strongest component but for the spacing between these harmonics.
One very nice method to do so is a second (inverse) FFT of the (abs, real) spectrum. This produces a strong line at t0 == 1/f0.
The sequence fft -> abs()^2 -> inverse fft is equivalent to calculating the auto-correlation function (ACF), thanks to the Wiener–Khinchin theorem. Strictly, the theorem applies to the power spectrum, i.e. the squared magnitude; applying the inverse FFT to the logarithmic magnitude instead gives the cepstrum, which shows a peak at the same t0.
The precision of this approach depends on the length of the FFT (or ACF) and your sampling rate. You can improve the precision a lot if you interpolate the "real" maximum between the sampling points of the result using a sinc function.
For even better results you could correct the intermediate spectrum: most sounds have, on average, a pink spectrum. If you amplify the higher frequencies (according to an inverse pink spectrum) before the inverse FFT, the ACF will be "better" (it takes the higher harmonics more into account, improving accuracy).
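As a rough illustration of the ACF idea, the sketch below computes the autocorrelation directly in the time domain rather than through two FFTs. That is a deliberate simplification: it gives the same kind of peak at t0 but costs more for long windows, and it does none of the interpolation or spectral correction described above. The function name `estimatePitch` and the `minFreq`/`maxFreq` search limits are assumptions made for this example.

```javascript
// Estimate the fundamental frequency from the strongest autocorrelation lag.
// samples: array of time-domain samples; sampleRate in Hz.
function estimatePitch(samples, sampleRate, minFreq, maxFreq) {
  const minLag = Math.floor(sampleRate / maxFreq);
  const maxLag = Math.ceil(sampleRate / minFreq);
  let bestLag = minLag;
  let bestValue = -Infinity;
  for (let lag = minLag; lag <= maxLag; lag++) {
    const count = samples.length - lag;
    let sum = 0;
    for (let n = 0; n < count; n++) {
      sum += samples[n] * samples[n + lag];
    }
    const value = sum / count; // normalize by the overlap length
    if (value > bestValue) {
      bestValue = value;
      bestLag = lag;
    }
  }
  return sampleRate / bestLag; // t0 is bestLag samples, so f0 = 1/t0
}
```

Because the lag is an integer number of samples, the result is quantized; for a 110 Hz tone sampled at 8000 Hz the estimate lands within a couple of Hz, which is where the interpolation mentioned above would help.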

Bayes' formula for updating probabilistic map

I'm trying to get a mobile robot to map an arena based on what it can see from a camera. I've created a map, and managed to get the robot to identify items placed in the arena and give an estimated location. However, as I'm only using an RGB camera, the resulting numbers can vary slightly every frame due to noise, changes in lighting, etc. What I am now trying to do is create a probability map using Bayes' formula to give a better map of the arena.
Bayes' Formula
P(i | x) = (p(i) p(x|i)) / (sum_j p(j) p(x|j))
This is what I've got so far. All points on the map are initialised to 0.5.
// Gets the likelihood of the event being correct
// Para 1 = Is the object likely to be at that location
// Para 2 = is the sensor saying it's at that location
private double getProbabilityNum(bool world, bool sensor)
{
if (world && sensor)
{
// number to test the function works
return 0.6;
}
else if (world && !sensor)
{
// number to test the function works
return 0.4;
}
else if (!world && sensor)
{
// number to test the function works
return 0.2;
}
else //if (!world && !sensor)
{
// number to test the function works
return 0.8;
}
}
// A function to update the map's probability of an object being at location (x,y)
// Para 3 = does the sensor pick up an object at (x,y)
public double probabilisticMap(int x,int y,bool sensor)
{
// gets the current likelihood from the map (prior Probability)
double mapProb = get(x,y);
//decide if object is at location (x,y)
bool world = (mapProb < threshold);
//Bayes' formula to update the probability
double newProb =
(getProbabilityNum(world, sensor) * mapProb) / ((getProbabilityNum(world, sensor) * mapProb) + (getProbabilityNum(!world, sensor) * (1 - mapProb)));
// update the location on the map
set(x,y,newProb);
// return the probability as well
return newProb;
}
It does work, but the numbers seem to jump rapidly and then flicker when they are at the top; it also errors if the numbers drop too near to zero. Does anyone have any idea why this might be happening? I think it's something to do with the way the equation is coded, but I'm not too sure. (I found this, but I don't quite understand it, so I'm not sure of its relevance, but it seems to be talking about the same thing.)
Thanks in Advance.
Use log-likelihoods when doing numerical computations involving probabilities.
Consider
P(i | x) = (p(i) p(x|i)) / (sum_j p(j) p(x|j)).
Because x is fixed, the denominator, p(x), is a constant. Thus
P(i | x) ~ p(i)p(x|i)
where ~ denotes "is proportional to."
The log-likelihood function is just the log of this. That is,
L(i | x) = log(p(i)) + log(p(x|i)).
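For a map cell with only two states (object present or not), one common way to put this advice into practice is to store the log-odds of the cell and add a log-likelihood ratio for each reading. The sketch below is an assumption about how the advice could be applied, not code from the question: the names `updateCell`, `logit` and `sigmoid` are hypothetical, and the default sensor probabilities 0.6 and 0.2 are just the test values from the question (the chance of a sensor hit when an object is, or is not, at the location).

```javascript
// Convert between probability and log-odds.
const logit = p => Math.log(p / (1 - p));
const sigmoid = l => 1 / (1 + Math.exp(-l));

// Add the log-likelihood ratio of one sensor reading to a cell's log-odds.
function updateCell(logOdds, sensorHit, pHitGivenObject = 0.6, pHitGivenEmpty = 0.2) {
  const likeObject = sensorHit ? pHitGivenObject : 1 - pHitGivenObject;
  const likeEmpty = sensorHit ? pHitGivenEmpty : 1 - pHitGivenEmpty;
  return logOdds + Math.log(likeObject) - Math.log(likeEmpty);
}
```

Starting from a prior of 0.5 (log-odds 0), one hit gives `sigmoid(updateCell(logit(0.5), true))`, which is 0.75, the same value the direct formula gives: 0.5*0.6 / (0.5*0.6 + 0.5*0.2). Since each update is an addition, the stored value stays well-behaved even when the probability itself gets very close to 0 or 1.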

How to compute the visible area based on a heightmap?

I have a heightmap. I want to efficiently compute which tiles in it are visible from an eye at any given location and height.
This paper suggests that heightmaps outperform turning the terrain into some kind of mesh, but they sample the grid using Bresenham's.
If I were to adopt that, I'd have to do a line-of-sight Bresenham's line for each and every tile on the map. It occurs to me that it ought to be possible to reuse most of the calculations and compute the heightmap in a single pass if you fill outwards away from the eye - a scanline fill kind of approach perhaps?
But the logic escapes me. What would the logic be?
Here is a heightmap with the visibility from a particular vantage point (green cube) ("viewshed" as in "watershed"?) painted over it:
Here is the O(n) sweep that I came up with. It seems the same as Franklin and Ray's method, given in the paper linked from the answer below, only in this case I am walking from the eye outwards instead of walking the perimeter doing a Bresenham's towards the centre. To my mind, my approach would have much better caching behaviour (i.e. be faster) and use less memory, since it doesn't have to track the vector for each tile, only remember a scanline's worth:
typedef std::vector<float> visbuf_t;
inline void map::_visibility_scan(const visbuf_t& in,visbuf_t& out,const vec_t& eye,int start_x,int stop_x,int y,int prev_y) {
const int xdir = (start_x < stop_x)? 1: -1;
for(int x=start_x; x!=stop_x; x+=xdir) {
const int x_diff = abs(eye.x-x), y_diff = abs(eye.z-y);
const bool horiz = (x_diff >= y_diff);
const int x_step = horiz? 1: x_diff/y_diff;
const int in_x = x-x_step*xdir; // where in the in buffer would we get the inner value?
const float outer_d = vec2_t(x,y).distance(vec2_t(eye.x,eye.z));
const float inner_d = vec2_t(in_x,horiz? y: prev_y).distance(vec2_t(eye.x,eye.z));
const float inner = (horiz? out: in).at(in_x)*(outer_d/inner_d); // get the inner value, scaling by distance
const float outer = height_at(x,y)-eye.y; // height we are at right now in the map, eye-relative
if(inner <= outer) {
out.at(x) = outer;
vis.at(y*width+x) = VISIBLE;
} else {
out.at(x) = inner;
vis.at(y*width+x) = NOT_VISIBLE;
}
}
}
void map::visibility_add(const vec_t& eye) {
const float BASE = -10000; // represents a downward vector that would always be visible
visbuf_t scan_0, scan_out, scan_in;
scan_0.resize(width);
vis[eye.z*width+eye.x-1] = vis[eye.z*width+eye.x] = vis[eye.z*width+eye.x+1] = VISIBLE;
scan_0.at(eye.x) = BASE;
scan_0.at(eye.x-1) = BASE;
scan_0.at(eye.x+1) = BASE;
_visibility_scan(scan_0,scan_0,eye,eye.x+2,width,eye.z,eye.z);
_visibility_scan(scan_0,scan_0,eye,eye.x-2,-1,eye.z,eye.z);
scan_out = scan_0;
for(int y=eye.z+1; y<height; y++) {
scan_in = scan_out;
_visibility_scan(scan_in,scan_out,eye,eye.x,-1,y,y-1);
_visibility_scan(scan_in,scan_out,eye,eye.x,width,y,y-1);
}
scan_out = scan_0;
for(int y=eye.z-1; y>=0; y--) {
scan_in = scan_out;
_visibility_scan(scan_in,scan_out,eye,eye.x,-1,y,y+1);
_visibility_scan(scan_in,scan_out,eye,eye.x,width,y,y+1);
}
}
Is it a valid approach?
it is using centre-points rather than looking at the slope between the 'inner' pixel and its neighbour on the side that the LoS passes
could the trig used to scale the vectors and such be replaced by factor multiplication?
it could use an array of bytes since the heights are themselves bytes
it's not a radial sweep; it's doing a whole scanline at a time, but away from the point; it only uses a couple of scanlines' worth of additional memory, which is neat
if it works, you could imagine distributing it nicely using a radial sweep of blocks; you have to compute the centre-most tile first, but then you can distribute all immediately adjacent tiles from that (they just need to be given the edge-most intermediate values), and then in turn more and more parallelism.
So how to most efficiently calculate this viewshed?
What you want is called a sweep algorithm. Basically you cast rays (Bresenham's) to each of the perimeter cells, but keep track of the horizon as you go and mark any cells you pass on the way as being visible or invisible (and update the ray's horizon if visible). This gets you down from the O(n^3) of the naive approach (testing each cell of an nxn DEM individually) to O(n^2).
More detailed description of the algorithm in section 5.1 of this paper (which you might also find interesting for other reasons if you aspire to work with really enormous heightmaps).
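The horizon bookkeeping along a single ray can be sketched as follows. This is a one-dimensional simplification under stated assumptions: the cells along the ray are taken as already extracted (for example by Bresenham's) into an array at unit spacing, the eye sits in cell 0, and the name `visibleAlongRay` is hypothetical. A full viewshed would run this once per perimeter cell.

```javascript
// Mark which cells along one ray are visible from the eye in cell 0.
// heights: terrain heights along the ray; eyeHeight: absolute eye height.
function visibleAlongRay(heights, eyeHeight) {
  const visible = [true];   // the eye's own cell
  let horizon = -Infinity;  // steepest slope seen so far
  for (let d = 1; d < heights.length; d++) {
    const slope = (heights[d] - eyeHeight) / d;
    if (slope >= horizon) {
      visible.push(true);   // rises to or above the horizon: seen
      horizon = slope;
    } else {
      visible.push(false);  // hidden behind nearer terrain
    }
  }
  return visible;
}
```

Each cell costs one slope comparison, so a ray of length n is O(n), and casting rays to the O(n) perimeter cells gives the O(n^2) total mentioned above.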

faster min and max of different array components with CouchDb map/reduce?

I have a CouchDB database with a view whose values are paired numbers of the form [x,y]. For documents with the same key, I need (simultaneously) to compute the minimum of x and the maximum of y. The database I am working with contains about 50000 documents. Building the view takes several hours, which seems somewhat excessive. (The keys are themselves length-three arrays.) I show the map and reduce functions below, but the basic question is: how can I speed up this process?
Note that the builtin functions won't work because the values have to be numbers, not length-two arrays. It is possible that I could make two different views (one for min(x) and one for max(y)), but it is unclear to me how to combine them to get both results simultaneously.
My current map function looks basically like
function(doc) {
emit ([doc.a, doc.b, doc.c], [doc.x, doc.y])
}
and my reduce function looks like
function(keys, values) {
var x = null;
var y = null;
for (i = 0; i < values.length; i++) {
if (values[i][0] == null) break;
if (values[i][1] == null) break;
if (x == null) x = values[i][0];
if (y == null) y = values[i][1];
if (values[i][0] < x) x = values[i][0];
if (values[i][1] > y) y = values[i][1];
}
emit([x, y]);
}
Just two more notes. Using Math.max() and Math.min() should be a little faster.
function(keys, values) {
    var x = Infinity,   // running minimum of x
        y = -Infinity;  // running maximum of y
    for (var i = 0, v; v = values[i]; i++) {
        x = Math.min(x, v[0]);
        y = Math.max(y, v[1]);
    }
    return [x, y];
}
And if CouchDB is treating the values as strings, it is because you are storing them as strings in the document.
Hope it helps.
This turned out to be a combination of two factors. One is obvious in the code posted above, where the reduce function uses "emit" when it should use "return".
The other factor is less obvious and was only found by making a smaller version of the database and logging the steps in the reduce function. Although the entries in "values" were meant to be integers, they were being treated by CouchDB as character strings. Using the parseInt function corrected that problem.
After those two fixes, the entire build of the reduced view took about five minutes, so the speed problem evaporated.
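Putting the two fixes together, the reduce function might look like the sketch below. It is written as a named function (`minMaxReduce`, a hypothetical name) so it can be tried outside CouchDB; in a design document it would be the anonymous `function(keys, values, rereduce)`. The assumption here is that the emitted values may arrive as strings, as described above, so each one is parsed before comparing.

```javascript
// Reduce [x, y] pairs to [min(x), max(y)], parsing string values first.
function minMaxReduce(keys, values, rereduce) {
  var minX = Infinity;
  var maxY = -Infinity;
  for (var i = 0; i < values.length; i++) {
    var x = parseInt(values[i][0], 10);
    var y = parseInt(values[i][1], 10);
    if (isNaN(x) || isNaN(y)) continue; // skip missing or malformed pairs
    minX = Math.min(minX, x);
    maxY = Math.max(maxY, y);
  }
  // The output has the same [x, y] shape as the input, so the same code
  // also works when CouchDB calls it again with rereduce set to true.
  return [minX, maxY];
}
```

The string case is where the original comparison went wrong: as strings, "10" sorts before "9", so an unparsed minimum can be incorrect even though the code looks right.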
Please check http://www.geeksforgeeks.org/archives/4583 . This may be extended to your application.
