MongoDB map-reduce aggregation solution or MERGE INTO alternative - Node.js

Below is the logic I am trying to implement, but I am finding it difficult to figure out a way to do it in a MongoDB/Node.js app.
Data: country, state, val1
I need to compute the mean and standard deviation using the formula below. I checked other Stack Overflow posts, but the standard deviation formula I am working with is not the same:
for each (country, state) group:
    mean = sum(val1) / count
for each row in the group:
    deviation += Math.pow(val1 - mean, 2)
for each (country, state) group:
    stddev = Math.sqrt(deviation / count)
The problem is the way the deviation needs to be computed: it looks like I need an aggregation pass for the mean before I can compute the deviation/standard deviation through map-reduce, and I cannot find a way to do that. Could anyone suggest a way to do this?
If that is not possible, is there a way to issue an update statement in MongoDB similar to the traditional merge query below? I would update the mean value on all the rows first and later invoke map-reduce for the deviation/standard deviation.
MERGE INTO Tbl1 a USING
  (SELECT b.country, b.state, SUM(b.val1) / COUNT(b.val1) AS mean
     FROM Tbl1 b
    GROUP BY b.country, b.state) c
ON (a.country = c.country AND
    a.state = c.state)
WHEN MATCHED THEN
  UPDATE SET a.mean = c.mean
I am pretty new to NoSQL and Node.js, and it would be great if you could suggest a solution or an alternative.

Yes, computing the standard deviation with map-reduce is tricky, as the traditional algorithm needs to compare each data value to the mean.
Take a look at this solution based upon the parallel calculation algorithm: https://gist.github.com/RedBeard0531/1886960
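The key idea in that gist is to accumulate the count, the sum, and the sum of squares per group in a single pass, then derive both the mean and the standard deviation from those three numbers, so no separate aggregation for the mean is needed. Here is a minimal sketch of the same idea (shown with pymongo for brevity; the pipeline itself is driver-agnostic, and the connection and collection names are assumptions):

import math
from pymongo import MongoClient

coll = MongoClient()["test"]["measurements"]  # hypothetical connection/collection

# Single pass: per (country, state) group, accumulate count, sum, and sum of squares.
pipeline = [
    {"$group": {
        "_id": {"country": "$country", "state": "$state"},
        "n": {"$sum": 1},
        "total": {"$sum": "$val1"},
        "sumSq": {"$sum": {"$multiply": ["$val1", "$val1"]}},
    }}
]

for g in coll.aggregate(pipeline):
    mean = g["total"] / g["n"]
    # Population variance from the moments: E[x^2] - E[x]^2 (clamped at 0
    # to guard against tiny negative values from floating-point rounding).
    stddev = math.sqrt(max(0.0, g["sumSq"] / g["n"] - mean * mean))
    print(g["_id"], mean, stddev)

If you still want the mean materialized on every row, as in the merge query above, the same grouped results can drive a bulk_write of UpdateMany operations keyed on country and state, but with the single-pass formula that extra write-back step is no longer necessary.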

Related

Update a PySpark Delta table using a Python boolean function

I have a Delta table that I want to update based on a condition combining two column values, i.e.:
delta_table.update(
    condition=is_eligible(col("name"), col("age")),
    set={"pension_eligible": lit("yes")}
)
I'm aware that I can do something similar to:
delta_table.update(
    condition=(col("name") == "Einar") & (col("age") > 65),
    set={"pension_eligible": lit("yes")}
)
But since my logic for computing this is quite complex (I need to look up the name in a database), I would like to define my own Python function for computing it (is_eligible(...)). Another reason is that this function is used elsewhere, and I would like to minimize code duplication.
Is this possible at all? As I understand it, you could define it as a UDF, but UDFs only take one parameter and I need at least two. I cannot find anything about more complex conditions in the Delta Lake documentation, so I'd really appreciate some guidance here.
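For what it's worth, PySpark UDFs can take multiple columns, so one possible direction is a sketch like the following (the body of is_eligible below is a placeholder for the real logic, and the UDF should be deterministic to be safe inside an update condition):

from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import BooleanType

# Plain Python predicate; the real version might consult a database.
def is_eligible(name: str, age: int) -> bool:
    return name == "Einar" and age > 65  # placeholder logic

# Wrap it as a UDF taking two columns and returning a boolean.
is_eligible_udf = udf(is_eligible, BooleanType())

delta_table.update(
    condition=is_eligible_udf(col("name"), col("age")),
    set={"pension_eligible": lit("yes")}
)

Note that a per-row database lookup inside a UDF tends to be very slow; loading the lookup data as a DataFrame and joining on it is usually the more scalable design.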

Use of Subquery in Informix for Left Outer Join

I have inherited a slow query in Informix. I suspect part of the slowness is due to the use of subqueries to do left outer joins. Here is a sample of the code:
FROM intide_rec AS IDE
LEFT OUTER JOIN (SELECT idp_cmpy_id, idp_idc_ctl_no, idp_itm_ctl_no, idp_brh,
                        idp_invt_typ, idp_frm, idp_grd, idp_size, idp_fnsh,
                        idp_whs, idp_mill, idp_heat, idp_tag_no, idp_num_size1,
                        idp_num_size2, idp_num_size3, idp_num_size4,
                        idp_num_size5, idp_wdth, idp_lgth, idp_idia, idp_odia,
                        idp_ga_size, idp_ohd_mat_val, idp_ohd_pcs, idp_ohd_wgt,
                        idp_invt_sts, idp_invt_qlty, idp_bgt_for, idp_ownr_id
                   FROM intidp_rec) AS IDP
    ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND
        IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
LEFT OUTER JOIN (SELECT prm_pep, prm_frm, prm_grd, prm_size, prm_fnsh
                   FROM inrprm_rec) AS PRM
    ON (IDP.idp_frm = PRM.prm_frm AND IDP.idp_grd = PRM.prm_grd AND
        IDP.idp_size = PRM.prm_size AND IDP.idp_fnsh = PRM.prm_fnsh)
Notice that the subqueries are simply retrieving columns. There is no manipulation of the columns. What is odd to me is why there are SELECT statements, i.e. subqueries, here.
Why not just remove the subqueries, move the columns into the main SELECT statement (since there is no manipulation of the columns), and write the joins like this:
FROM intide_rec AS IDE
LEFT OUTER JOIN intidp_rec AS IDP ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
LEFT OUTER JOIN inrprm_rec AS PRM ON (IDP.idp_frm = PRM.prm_frm AND IDP.idp_grd = PRM.prm_grd AND IDP.idp_size = PRM.prm_size AND IDP.idp_fnsh = PRM.prm_fnsh)
What are your thoughts on the original code with subqueries vs. my rewrite? Is the original inefficient from a performance perspective, or is it acceptable?
Thanks for any thoughts.
One way to provide some answer is to analyze the output from SET EXPLAIN ON for the two queries. Ideally, there shouldn't be a difference between the query plans. If the query plans are demonstrably 'the same' or 'equivalent', then the optimizer is doing its job well. Determining that they are equivalent may be harder than either of us would like. However, if there is a major difference in the query plans, the subqueries probably are slower, and your rewrite should be at least as fast as the original and probably faster. Also, remember that query plans are only indicative of what the optimizer thinks will happen; time the different queries on production data as well.
You don't mention which version of Informix you're using or which platform you're using it on. It probably doesn't matter, since it must be a relatively recent version to support the LEFT OUTER JOIN notation (this millennium rather than the last, at any rate), but it is still helpful to state it. Note that only versions 12.10 and 14.10 are under support unless you've made special arrangements with IBM or HCL.

When to use sum vs. lpSum in PuLP?

In the case study "A Set Partitioning Problem" in the PuLP documentation, sum is used to express the constraints, for example:
# A guest must be seated at one and only one table
for guest in guests:
    seating_model += sum([x[table] for table in possible_tables
                          if guest in table]) == 1, "Must_seat_%s" % guest
Whereas "lpSum" seems to be used otherwise. Consider for example the following constraint in the case study A Transportation Problem
for b in Bars:
    prob += lpSum([vars[w][b] for w in Warehouses]) >= demand[b], "Sum_of_Products_into_Bar%s" % b
Why is "sum" used in the first example? And when to use "sum" v.s. "lpSum"?
You should be using lpSum every time you have at least one PuLP variable inside the sum expression. Using sum is not wrong, just less efficient. We should probably change the docs so they are consistent. Feel free to open an issue or, even better, submit a PR so we can correct the Set Partitioning Problem example.
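To see that the two forms build the same constraint, consider this self-contained sketch (the model and variable names are made up):

from pulp import LpMaximize, LpProblem, LpVariable, lpSum

prob = LpProblem("demo", LpMaximize)
xs = [LpVariable("x%d" % i, lowBound=0, upBound=1) for i in range(3)]

# Both lines add the same constraint. lpSum builds the affine expression
# in a single pass, while sum() creates a fresh intermediate expression
# for every addition, which can turn quadratic on long variable lists.
prob += lpSum(xs) <= 2, "with_lpSum"
prob += sum(xs) <= 2, "with_sum"

With a handful of variables the difference is negligible; with tens of thousands of terms the repeated copying done by sum() becomes noticeable, which is why lpSum is the recommended form.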

Raw SQL with many columns

I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw SQL.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-SQL the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all data and then filtering in Haskell, because otherwise I'd have way too much data to handle.
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements, so now I am writing additional functions from9, to9, from10, to10, and so on. After that, all of these are converted using functions of type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to denormalize the database and duplicate data, e.g. by including the lagged value as a column, so that I can query the data with Esqueleto.

Update the quantile for a dataset when a new datapoint is added

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and variance as you go; then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q] (note that NormalDistribution takes the standard deviation, hence the Sqrt), which should be the value such that a fraction q of the values are smaller, which is what the q-quantile is.
(I see belisarius was thinking along the same lines.
Here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm )
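As a concrete illustration of that idea outside Mathematica, here is a sketch in Python (statistics.NormalDist needs Python 3.8+; all names are made up):

import math
from statistics import NormalDist

# Welford's online algorithm: maintain count, mean, and M2 (the sum of
# squared deviations) one datapoint at a time, without storing the data.
n, mean, m2 = 0, 0.0, 0.0

def update(x):
    global n, mean, m2
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)

def quantile(q):
    # Normality assumption: invert the fitted normal CDF at q.
    variance = m2 / (n - 1)  # sample variance; needs n >= 2
    return NormalDist(mean, math.sqrt(variance)).inv_cdf(q)

After streaming each new datapoint through update, quantile(q) returns the estimate without the list of previous datapoints ever being stored.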
Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternatively, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your datapoints that contains (1) the double value and (2) a timestamp for when the data came in. In a SortedList of these datapoint objects (which compares based on value), you could get the quantile very fast by simply indexing into the sorted datapoints. Want a historical quantile? Simply filter on the timestamps in your sorted list.
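A rough Python rendering of that suggestion (a sketch only; it assumes the third-party sortedcontainers package, and the nearest-rank indexing used here is just one convention):

import math
import time
from sortedcontainers import SortedList

# Keep (value, timestamp) pairs ordered by value; tuples compare on the
# first element, so the list stays sorted as new datapoints arrive.
data = SortedList()

def add(value):
    data.add((value, time.time()))

def quantile(q):
    # Nearest-rank lookup: index directly into the sorted values.
    idx = min(len(data) - 1, math.floor(q * len(data)))
    return data[idx][0]

def historical_quantile(q, cutoff):
    # "Filter on the timestamps": use only points seen before the cutoff;
    # filtering preserves the value ordering, so no re-sort is needed.
    vals = [v for v, t in data if t <= cutoff]
    return vals[min(len(vals) - 1, math.floor(q * len(vals)))]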
