Say I have a map that emits the following objects
{"basePoints": 2000, "bonusPoints": 1000}
{"basePoints": 1000, "bonusPoints": 50}
{"basePoints": 10000, "bonusPoints": 5000}
How could I write a reduce in Erlang (not javascript) that would return an aggregate object like this:
{"basePoints": 13000, "bonusPoints": 6050}
(I would rather not have to write two separate views that each emit one of the values, if I can help it.)
Many Thanks!
You do not actually need a custom reduce here. The built-in _sum is enough, since it can sum not only numbers but also arrays of numbers.
Just emit [basePointsNum, 0] for basePoints and [0, bonusPointsNum] for bonusPoints. Or, if both fields are in the same doc, emit [basePointsNum, bonusPointsNum].
After reducing with the built-in _sum you will receive an array of two numbers, each being the sum of the corresponding array position. This feature seems to be undocumented, but it works in both CouchDB and PouchDB, and it's blazing fast.
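A minimal sketch of the idea, assuming each document carries both fields under the names from the question. The map function is what would go into the view (with "_sum" as the reduce). sumArrays and the local emit are hypothetical stand-ins that only imitate the behaviour described above, so the example can be run outside CouchDB.

```javascript
// View map function: emit both numbers as one array value per document.
// Inside CouchDB, `emit` is provided by the view server.
function map(doc) {
  if (typeof doc.basePoints === 'number' && typeof doc.bonusPoints === 'number') {
    emit(null, [doc.basePoints, doc.bonusPoints]);
  }
}

// Hypothetical stand-in for the built-in _sum, for local illustration only:
// adds arrays of numbers position by position.
function sumArrays(values) {
  return values.reduce(function (acc, row) {
    row.forEach(function (n, i) {
      acc[i] = (acc[i] || 0) + n;
    });
    return acc;
  }, []);
}

// Local stand-in for CouchDB's emit: collect the emitted values.
var emitted = [];
function emit(key, value) {
  emitted.push(value);
}

// Run the map function against the documents from the question.
[
  { basePoints: 2000, bonusPoints: 1000 },
  { basePoints: 1000, bonusPoints: 50 },
  { basePoints: 10000, bonusPoints: 5000 }
].forEach(map);

var totals = sumArrays(emitted);
console.log(totals); // [ 13000, 6050 ]
```

The result is an array rather than an object, so the first position stands for basePoints and the second for bonusPoints.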
Basically, what I want to do is the Sequelize equivalent of this question:
More efficient way of querying for this data?
My use case is a bit different from the above question though, much more troublesome. In particular:
Unlike the original question, I use MySQL.
My case could potentially have not just a pair of values, but a set of up to 4 different values (the number of values in each set is not fixed), all thanks to my company's immaculate database.
The maximum number of sets is not limited to ~100. I can see this easily exceeding 2000 sets. (This is my main concern.)
This query is part of an already rather complex function. I have tried to trim the thing down as much as possible, but it still takes quite a while to run. By my estimate, this query would be triggered 5 to 7 times during one run of the function. I have tried the following:
The conventional way of just stuffing the processed search set inside of [Op.or] would fire up a really long query, which could exceed MySQL's query line limit (I'm not allowed to change this).
Querying item by item is reliable but slower.
The main function currently runs in approximately 1 minute (note that this is with a smaller data set for testing; the actual runtime can easily be 4-5 times this), which I don't think is acceptable, as it is called multiple times a day. I also can't heavily modify the database itself, as it is a legacy database that is also used by other applications. If the original database had been designed properly, we wouldn't have come to this, but alas, I can only try my best.
Any help would be much appreciated.
In MySQL, you can use a tuple in the WHERE clause, and you can fill in a missing value with ANY_VALUE(column name) so that it matches anything.
SELECT * FROM Employees
WHERE (name, age, dept, salary) IN (
    ('Alice', 40, ANY_VALUE(dept), ANY_VALUE(salary)),
    ('Bob', ANY_VALUE(age), 'Tech', 120),
    ('Mike', 25, 'HR', ANY_VALUE(salary))
)
I tested this against 100k rows with 1k criteria, and the query returned in 2.954s on my laptop.
========================================================
UPDATE
If you always have 4 values and no need for ANY_VALUE, it can be written in Sequelize with a minimum of literals.
const criteria = [
  ['Alice', 40, 'Tech', 120],
  ['Bob', 30, 'Tech', 120],
  ['Mike', 25, 'HR', 120]
];
const result = await db.Employee.findAll({
  where: Sequelize.where(Sequelize.literal('(name, age, dept, salary)'), Op.in, [criteria])
});
However, in your case the sets are not guaranteed to have all 4 values, so ANY_VALUE is needed. Unfortunately, I cannot use Sequelize.fn('ANY_VALUE', 'name') inside Sequelize.where, because Sequelize tries to escape it and it cannot be escaped.
Therefore, the 3rd argument to Sequelize.where also needs to be replaced with a literal. At that point the code is mostly literals, and I don't see any difference from just using Sequelize.query, unless you are using many other options such as offset, limit, attributes... that can still benefit from Sequelize's query generator.
const result = await db.sequelize.query(
  `SELECT * FROM Employees WHERE (name, age, dept, salary) IN (${constructedCriteria})`,
  { type: Sequelize.QueryTypes.SELECT } // Returns just the rows as plain objects, similar to `findAll` with `raw: true`.
);
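The constructedCriteria string is not shown above. One possible way to build it is sketched below, assuming each search set is an object keyed by column name and that a missing (undefined or null) value means "match anything". buildCriteria and demoEscape are hypothetical names; demoEscape is a demo-only stand-in and should be replaced by a real escaping function in production.

```javascript
// Column order must match the tuple on the left-hand side of IN.
const columns = ['name', 'age', 'dept', 'salary'];

// Build "(v1, v2, ...), (...)" for the IN list. A missing value (undefined or
// null) becomes ANY_VALUE(column), so that position matches any row.
// `escape` must turn a JS value into a safely quoted SQL literal.
function buildCriteria(sets, escape) {
  return sets
    .map((set) => {
      const parts = columns.map((col) =>
        set[col] === undefined || set[col] === null
          ? `ANY_VALUE(${col})`
          : escape(set[col])
      );
      return `(${parts.join(', ')})`;
    })
    .join(', ');
}

// Demo-only stand-in for an escaping function; NOT safe for untrusted input.
const demoEscape = (v) =>
  typeof v === 'number' ? String(v) : `'${String(v).replace(/'/g, "''")}'`;

const constructedCriteria = buildCriteria(
  [
    { name: 'Alice', age: 40 },
    { name: 'Bob', dept: 'Tech', salary: 120 },
    { name: 'Mike', age: 25, dept: 'HR' }
  ],
  demoEscape
);
console.log(constructedCriteria);
```

In a real application the second argument would more likely be something like (v) => db.sequelize.escape(v), assuming your Sequelize version exposes that method. Note that with this convention a set cannot ask for rows where a column is actually NULL.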
Some thoughts.
What is the use case for more than 1k criteria? Are all 1k criteria distinct?
Maybe this scenario is better suited to a search engine like ElasticSearch? (If your situation is flexible.)
Can we use process.hrtime() as a universally unique ID within the current process?
var uuid = parseInt(process.hrtime().join(''));
You can use process.hrtime() to create identifiers with low chance of collision, but they are not unique, especially not across application restarts (which matters if you persist any of them to a database or similar), and not when several threads/processes/instances are involved.
From the documentation:
These times are relative to an arbitrary time in the past, and not related to the time of day
Also, by using parseInt(....join('')), you are introducing a second way for collisions to happen: e.g. [1, 23] and [12, 3] will lead to the same result.
If you want to build your own solution (a[0] * 1e9 + a[1] comes to mind as a naive approach), you should also be aware of the precision limits of JavaScript numbers -- there's a reason why hrtime() returns a tuple and not just a single number. When in doubt, when you need proper UUIDs, you should probably use proper UUIDs ;-)
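Both points can be illustrated with a short sketch. hrtimeToNanos is a hypothetical helper name; it uses BigInt (available in Node 10.4 and later) to sidestep the precision limit of regular numbers. The result is still only a timestamp relative to an arbitrary point in the past, not a UUID.

```javascript
// Joining the tuple as strings loses the boundary between seconds and nanoseconds.
const a = parseInt([1, 23].join(''), 10);
const b = parseInt([12, 3].join(''), 10);
console.log(a === b); // true: both are 123

// Combining the parts as BigInt keeps distinct tuples distinct and avoids the
// 2^53 precision limit of regular JavaScript numbers.
function hrtimeToNanos([seconds, nanos]) {
  return BigInt(seconds) * 1000000000n + BigInt(nanos);
}

console.log(hrtimeToNanos([1, 23]) === hrtimeToNanos([12, 3])); // false
console.log(hrtimeToNanos(process.hrtime()));
```

Newer Node versions also offer process.hrtime.bigint(), which returns the same kind of value directly.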
This question is rather old, but I managed to figure out something that works like this (partially, on a single machine, and NOT completely validated yet). See "Is this a viable, monotonically increasing timeId in javascript?" (Note: it requires Node 10 or 11, and I have only validated it on macOS Mojave so far.)
That solution will not provide a UUID, but it should produce an ID that is always increasing in value. Some other machine/process ID would have to be appended to it to make it really unique.
I am writing a custom search strategy with builds() (this doesn't matter w.r.t. this question) which shall use hypothesis.strategies.integers(min_value=None, max_value=None) to generate integer data with an explicit step size other than 1, let's say a delta of 10. I do not need a list of values like [10, 20, 30, 40, etc.]. Instead, I need subsequent calls of the test function to receive integer values with a step size of 10, e.g. 10 for the first call, 20 for the second call, etc. What is the easiest way to achieve this?
You can easily adapt existing strategies, for example generating even numbers via:
integers().map(lambda x: x * 2)
And just to check - are you using a recent version of Hypothesis? You linked to the documentation for v1.8, which is unsupported and significantly less powerful than the current version 3.48.
Finally, consider a composite strategy if you need to have a particular relationship between the parts of whatever you're constructing - builds() is simpler but doesn't support dependencies between arguments.
I need subsequent calls of the test function to be called with integer values with step size of 10, e.g. with 10 for the first call, 20 for the second call, etc.
Hypothesis only supports stateful testing via the hypothesis.stateful module.
By design, each example provided by @given is independent of any other. If this doesn't work for your use case, Hypothesis is probably the wrong tool for the job.
With a CouchDB view, we get results ordered by key. I have been using this to get the value associated with the highest number. For example, take this result (in key: value form):
{1:'sam'}
{2:'jim'}
{4:'joan'}
{5:'jill'}
CouchDB will sort those according to the key. (It could be helpful to think of the key as the "score".) I want to find out who has the highest or lowest score.
I have written a reduce function like so:
function(keys, values) {
  var len = values.length;
  return values[len - 1];
}
I know there's _stats and the like, but these are not an option in my application (this is a slimmed-down, hypothetical example).
Usually when I run this reduce, I get either 'sam' or 'jill', depending on whether descending is set. This is what I want. However, in large data sets, I sometimes get someone from the middle of the list.
I suspect this is happening on rereduce. I had assumed that the order of results is preserved when rereduce runs. However, I can find no assurance that this is the case. I know that on rereduce the keys argument is null, so the values would not be sorted by the normal sorting rules. Is this the case?
If so, any advice on how to get my highest scorer?
Yeah, I don't think sorting order is guaranteed, probably because it cannot be guaranteed in clustered environments. I suspect the way you're using map/reduce here is a little iffy, but you should post your view code if you really want a good answer here.
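One way to make the reduce independent of input order is sketched below, under the assumption that the map function emits the score as part of the value (the question does not show its map function, so that part is hypothetical). Because the reduce returns a value of the same shape as its inputs, the same logic works for both reduce and rereduce.

```javascript
// Assumed map function (hypothetical): the score goes into the value as well,
// so the reduce never has to rely on the order of its input.
// function (doc) { emit(doc.score, [doc.score, doc.name]); }

// Returns the [score, name] pair with the highest score. The output has the
// same shape as each input value, so the same code serves reduce and rereduce.
function reduce(keys, values, rereduce) {
  var best = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i][0] > best[0]) {
      best = values[i];
    }
  }
  return best;
}

// Local simulation: two partial reduces, then a rereduce over their results,
// deliberately passed in reverse order.
var partA = reduce(null, [[1, 'sam'], [5, 'jill']], false);
var partB = reduce(null, [[2, 'jim'], [4, 'joan']], false);
var highest = reduce(null, [partB, partA], true);
console.log(highest); // [ 5, 'jill' ]
```

If only the single highest or lowest row is needed, querying the view without a reduce and with descending=true&limit=1 (or just limit=1 for the lowest) may be simpler still.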
I have some documents with a "status" field of "Green", "Red", "Amber".
I'm sure it's possible to use MapReduce to produce a grouped response containing three keys (one for each status), each with a value containing an array of all the documents with that key. However, I'm struggling with how to use the (re)reduce functions.
Map function:
function(doc) {
  emit(doc.status, doc);
}
Reduce function: ???
This is not a problem that reduce is intended to solve; reduce in CouchDB is for aggregation.
If I understand you correctly, you want this:
Map:
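function(doc) {
  // status is a single string ("Green", "Red" or "Amber"), so emit it directly as the key.
  // The value is null because include_docs=true fetches the document at query time.
  if (doc.status) {
    emit(doc.status, null);
  }
}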
You can then find all docs of status Green with:
/_design/foo/_view/bar?key="Green"&include_docs=true
This will return a list of all docs with that status. If you wish to find docs of more than one status in a single query, use an HTTP POST with a body of this form:
{"keys":["Green", "Red"]}
HTH,
B.
Generally speaking, you will not use a reduce function to obtain your list of documents. A reduce is meant to take a list and reduce it to a single value. In fact, there is an upper limit on the size of a reduce value anyway, and returning entire documents will trigger a reduce_overflow_error. Examples of reduces are counts, sums, averages, etc. Stick with the map query, and you will have your values collated and sorted by the status value.
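Since the view rows come back sorted by key, turning them into one array of documents per status on the client side is straightforward. A minimal sketch, assuming the view was queried with include_docs=true; groupDocsByKey is a hypothetical helper and the sample response is made up for illustration.

```javascript
// Groups the rows of a view response (queried with include_docs=true) by key.
function groupDocsByKey(viewResponse) {
  const groups = {};
  for (const row of viewResponse.rows) {
    if (!groups[row.key]) {
      groups[row.key] = [];
    }
    groups[row.key].push(row.doc);
  }
  return groups;
}

// Made-up sample in the shape of a CouchDB view response.
const sample = {
  total_rows: 3,
  offset: 0,
  rows: [
    { id: 'a', key: 'Green', value: null, doc: { _id: 'a', status: 'Green' } },
    { id: 'b', key: 'Green', value: null, doc: { _id: 'b', status: 'Green' } },
    { id: 'c', key: 'Red', value: null, doc: { _id: 'c', status: 'Red' } }
  ]
};

const grouped = groupDocsByKey(sample);
console.log(Object.keys(grouped)); // [ 'Green', 'Red' ]
console.log(grouped.Green.length); // 2
```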
On another, possibly unrelated note, I would not emit the document from your view. You can just use the include_docs view query parameter and achieve the same effect, while saving disk space in the process. The trade-off is that internally the docs will have to be retrieved one by one (but since they're already indexed by _id anyway, it's usually a negligible difference).