Is there a STDDEV equivalent in Cloud Spanner functions? - google-cloud-spanner

I don't see it in the documentation; is it perhaps part of an existing function I didn't notice, or available in some other way?

As Andrei Tigau mentioned, STDDEV is not supported yet. That said, you can calculate it in two passes. Assuming you are interested in column x of YourTable:
SELECT SQRT(SUM(POW(x - avg, 2)/(n-1)))
FROM (SELECT AVG(x) AS avg, COUNT(*) AS n FROM YourTable)
CROSS JOIN YourTable;
You may try the following one-pass solution as well.
SELECT SQRT((s2 - s*s/n) / (n - 1))
FROM (
SELECT COUNT(*) AS n, SUM(x) AS s, SUM(x*x) AS s2
FROM YourTable
);
Depending on the column type, you may have to cast to FLOAT64 (double), especially for s2, to avoid overflow. Both approaches will suffer from floating-point errors.
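To illustrate the floating-point caveat, here is a small Python sketch (made-up data, not Spanner-specific) comparing the two formulas; with values that sit on a large offset, the one-pass form can lose most of its precision:
import math
import random

xs = [1e9 + random.random() for _ in range(10_000)]  # large offset, small spread

# Two-pass: subtract the mean first (numerically stable for this data).
mean = sum(xs) / len(xs)
two_pass = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))

# One-pass: sum of squares minus square of sum (prone to cancellation).
n, s, s2 = len(xs), sum(xs), sum(x * x for x in xs)
one_pass = math.sqrt(max((s2 - s * s / n) / (n - 1), 0.0))  # clamp tiny negative results

print(two_pass, one_pass)  # the one-pass value can be wildly off here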

This function does not exist in the official documentation that you referenced, so it probably still doesn't exist. If you need something like that, you should calculate the standard deviation yourself programmatically. The AVG function gives you the mean, which you need for the standard deviation, and the COUNT function gives you the number of entries.
double standardDeviation;        // sample standard deviation
double sumOfDifferences = 0;
for (int i = 0; i < count; i++) {
    sumOfDifferences += pow(entry(i) - avg, 2); // entry(i) is a value of the column; avg and count come from AVG and COUNT
}
standardDeviation = sqrt(sumOfDifferences / (count - 1));

STDDEV is supported since 2020-06-03.
https://cloud.google.com/spanner/docs/release-notes?hl=en#June_03_2020

Related

Divide a value in excel by a set of preset values to find out how many of each are needed

I am curious if there is a way to make my life easier. In Excel I am producing a total value, say 750, and need to find out how many orders of pipe I need from values of 50, 100, 200, 250, 500. Is there any way to have Excel take a value and then return how many of each of these numbers I would need, so for the 750 case 1 x 500 and 1 x 250?
Currently the solution is just worked out in my head.
Assuming you want to try to fit pipes in decreasing order of size, and that you have access to the required functions, you can use REDUCE as demonstrated here to step through the sizes and successively divide by each one, although the formula is a little laboured:
=LET(pipes,{500;250;200;100;50},reqd,750,DROP(REDUCE(0,pipes,
LAMBDA(a,c,VSTACK(a,QUOTIENT(reqd-IF(ROWS(a)>1,SUM(DROP(a,1)*TAKE(pipes,ROWS(a)-1)),0),c)))),1))
As pointed out by @Jos Woolley, this may not give you the answer you want if the total is something like 749. It will fit as many values in as possible and give a result of 500+200 (total 700, remainder 49). You could fix it perhaps by rounding up to the next multiple of 50.
For the example of 823, you would have:
=LET(pipes,{500;250;200;100;50},reqd,CEILING(823,MIN(pipes)),DROP(REDUCE(0,pipes,
LAMBDA(a,c,VSTACK(a,QUOTIENT(reqd-IF(ROWS(a)>1,SUM(DROP(a,1)*TAKE(pipes,ROWS(a)-1)),0),c)))),1))
which gives 500+250+100=850.
Well, I've got a bit obsessed with this now, and I am determined to get a lambda working to find the optimal answer! I have been looking at the brute-force solution for finding the minimum number of coins required to make up a given total (in the reference linked below) and have managed to translate it into a lambda using REDUCE:
Mincoins1= LAMBDA(coins, m, v,
IF(
v <= 0,
0,
REDUCE(
999,
coins,
LAMBDA(a, c,
IF(v >= c, LET(mc, mincoins1.mincoins1(coins, m, v - c) + 1, IF(mc < a, mc, a)), a)
)
)
)
)
This does give the correct answer, 2, for the case when you want to make up a value of 400 from the list of pipes given. The next step will be to modify the code to return the list of pipes which give that total (200,200).
https://www.enjoyalgorithms.com/blog/minimum-coin-change
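For reference, the same brute-force recursion sketched in Python (pipe sizes and target taken from the question's example; this mirrors the lambda above):
def min_pipes(sizes, value):
    # Minimum number of pipes needed to make up `value` exactly;
    # 999 plays the same sentinel role as in the lambda.
    if value <= 0:
        return 0
    best = 999
    for size in sizes:
        if value >= size:
            best = min(best, 1 + min_pipes(sizes, value - size))
    return best

print(min_pipes([500, 250, 200, 100, 50], 400))  # 2, i.e. 200 + 200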
Here is the lambda modified to return a string containing the chosen pipes:
Mincoins2= LAMBDA(coins, m, v,
IF(
v <= 0,
"",
REDUCE(
rept("x",999),
coins,
LAMBDA(a, c,
IF(v >= c, LET(mc, c&"|"&mincoins2.mincoins2(coins, m, v - c), IF(len(mc) < len(a), mc, a)), a)
)
)
)
);
It does work BUT (and this is a big but) it hits a limit as soon as the value to be produced exceeds 1000, and you get a #VALUE! error. Disappointing. But interesting, I think, as a proof of concept.
Not sure I understand the question, but let's try.
If you have 1,450 to divide, have a formula that divides 1,450 by your highest length (750) and then rounds it down.
So the formula would be something along the lines of: =ROUNDDOWN(1450 / 750; 0)
You will then get the answer that you need 1 of the length 750.
Then keep track of how much length you have remaining, with a formula like:
=1450 - 750 * [the answer from the previous formula = 1]. This would come to 700.
Then start over with the same thing, but divide 700 by 500 (the second largest size).
Your question is extremely difficult: one might think of this easy solution, starting with value_begin:
amount_of_500 = value_begin DIV 500; // integer division
temp = value_begin - 500 * amount_of_500;
amount_of_250 = temp DIV 250; // again integer division
temp = temp - 250 * amount_of_250;
amount_of_200 = temp DIV 200; // again integer division
temp = temp - 200 * amount_of_200;
...
However, this will not work because of the value 200, which is far too close to 250: just start with value_begin equal to 400 (algorithm solution: 250 + 100 + 50, while the best solution is 200 + 200).
Are you sure you need both 200 and 250 as possible numbers to divide by? If yes, you might have a serious problem getting this implemented.
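For what it's worth, a quick Python sketch of that greedy scheme (sizes and totals from the question) makes the failure case concrete:
def greedy(total, sizes):
    # Take as many of each size as possible, working from largest to smallest.
    counts, remaining = {}, total
    for size in sizes:
        counts[size], remaining = divmod(remaining, size)
    return counts, remaining

sizes = [500, 250, 200, 100, 50]
print(greedy(750, sizes))  # {500: 1, 250: 1, 200: 0, 100: 0, 50: 0}, remainder 0
print(greedy(400, sizes))  # {500: 0, 250: 1, 200: 0, 100: 1, 50: 1} -- three pipes, although 200 + 200 needs only two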

Pyomo: define objective Rule based on condition

In a transport problem, I'm trying to insert the following rule into the objective function:
If the supply of BC is < 19,000 tons, then we will have a penalty of $125/MT.
I added a constraint to check the condition but would like to apply the penalty in the objective function.
I was able to do this in Excel Solver, but the values do not match. I've already checked both and debugged the code, but I could not figure out what's wrong.
Here is the constraint:
def bc_rule(model):
    return sum(model.x[supplier, market]
               for supplier in model.suppliers
               for market in model.markets
               if 'BC' in supplier) >= 19000

model.bc_rules = Constraint(rule=bc_rule, doc='Minimum production')
The problem is in the objective rule:
def objective_rule(model):
    PENALTY_THRESHOLD = 19000
    PENALTY_COST = 125
    cost = sum(model.costs[supplier, market] * model.x[supplier, market]
               for supplier in model.suppliers for market in model.markets)
    # what is the problem here?
    bc = sum(model.x[supplier, market]
             for supplier in model.suppliers
             for market in model.markets
             if 'BC' in supplier)
    if bc < PENALTY_THRESHOLD:
        cost += (PENALTY_THRESHOLD - bc) * PENALTY_COST
    return cost

model.objective = Objective(rule=objective_rule, sense=minimize, doc='Define objective function')
I'm getting a much lower value than found in Excel Solver.
Your condition (if) depends on a variable in your model.
Normally, ifs should never be used in a mathematical model, and that is not specific to Pyomo. Even in Excel, if statements in formulas are simply converted to a scalar value before optimization, so I would be very careful about saying that the Excel result is the real optimal value.
The good news is that if statements are easily converted into mathematical constraints.
For that, you need to add a binary variable (0/1) to your model. It will take the value of 1 if bc <= PENALTY_TRESHOLD. Let's call this variable y; it is defined as model.y = Var(domain=Binary).
You will add model.y * PENALTY_COST as a term of your objective function to include the penalty cost.
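A sketch of the revised objective along those lines (reusing the names from the question; treat it as illustrative rather than a drop-in replacement):
def objective_rule(model):
    PENALTY_COST = 125
    cost = sum(model.costs[supplier, market] * model.x[supplier, market]
               for supplier in model.suppliers for market in model.markets)
    return cost + model.y * PENALTY_COST  # penalty term is active only when y = 1

model.objective = Objective(rule=objective_rule, sense=minimize)
Note that this charges a flat penalty when the threshold is missed; if the penalty should scale with the shortfall (as the $125/MT wording suggests), you would need a continuous shortfall variable instead of the flat term.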
Then, for the constraint, add the following piece of code:
def y_big_M(model):
    # bigM should be a big number: big enough that it is bigger than any number in your
    # model, but small enough that it stays around the same order of magnitude. Avoid
    # utterly big numbers like 1e12 and up if you don't need to, since having numbers
    # that are too large causes problems.
    bigM = 10000
    PENALTY_TRESHOLD = 19000
    return PENALTY_TRESHOLD - sum(
        model.x[supplier, market]
        for supplier in model.suppliers
        for market in model.markets
        if 'BC' in supplier
    ) <= model.y * bigM

model.y_big_M = Constraint(rule=y_big_M)
The previous constraint ensures that y will take a value greater than 0 (i.e. 1) when the sum that calculates bc is smaller than PENALTY_TRESHOLD. Any positive value of that difference forces the model to set y = 1, since when y = 1 the right-hand side of the constraint becomes 1 * bigM, which is a number big enough that PENALTY_TRESHOLD - bc will always be smaller than it.
Please also check your Excel model to see whether your if statements really work during the solver computations. Last time I checked, the Excel solver does not convert if statements into big-M constraints. The modeling technique I showed you works for absolutely any mathematical programming tool, even Excel.

Two independent sample proportion test

This question is pretty simple (I hope). I am working my way through some introductory SAS material and cannot find the proper way of running a two-sample proportion test.
proc freq data;
tables / binomial (p=...)
run;
requires a known proportion (i.e. testing against a known value). I'd like to compare two samples of a categorical variable, with the null hypothesis p1 = p2 and the one-sided alternative p1 < p2.
Data resembles:
V1 Yes
V1 No
V2 Yes
V2 No
For many lines. I need to compare the proportion of Yes's and No's between the two populations (V1 and V2). Can someone point me towards the correct procedure? Google search has left me spinning.
Thanks.
Comparing 1/0 against 1/0 seems like a chi-squared test.
data for_test;
  do _t = 1 to 20;
    x1 = ifn(ranuni(7) < 0.5, 1, 0);
    x2 = ifn(ranuni(7) < 0.5, 1, 0);
    output;
  end;
run;

proc freq data=for_test;
  tables x1*x2 / chisq;
run;
The chi-square test is for equality of proportions, so the one-sided use case p1 < p2 is still pending here. It is possible to do the test for p1 < p2 using the chi-square statistic as well.

Parallel For to sum an array of ushorts (big array 18M)

I would like to use parallel processing for taking array statistics for large arrays of unsigned short (16 bit) values.
ushort[] array = new ushort[2560 * 3072]; // x = rows(2560) y = columns(3072)
double avg = Parallel.For (0, array.Length, WHAT GOES HERE);
The same for standard deviation & standard deviation of row means.
I have normal for loop versions of these functions and they take too long when combined with Median Filter methods.
The end product is to try and get a Median Filter for the array. But the first steps are important as well. So if you have the whole solution great but if you want to help with the first parts as well it is all appreciated.
Have you tried PLINQ?
double average = array.AsParallel().Average(n => n);
I'm not sure how performant it will be with a large array of ushort values, but it's worth testing to see if it meets your needs.

What is an efficient way to compute the Dice coefficient between 900,000 strings?

I have a corpus of 900,000 strings. They vary in length, but have an average character count of about 4,500. I need to find the most efficient way of computing the Dice coefficient of every string as it relates to every other string. Unfortunately, this results in the Dice coefficient algorithm being used some 810,000,000,000 times.
What is the best way to structure this program for increased efficiency? Obviously, I can prevent computing the Dice of sections A and B, and then B and A--but this only halves the work required. Should I consider taking some shortcuts or creating some sort of binary tree?
I'm using the following implementation of the Dice coefficient algorithm in Java:
public static double diceCoefficient(String s1, String s2) {
Set<String> nx = new HashSet<String>();
Set<String> ny = new HashSet<String>();
for (int i = 0; i < s1.length() - 1; i++) {
char x1 = s1.charAt(i);
char x2 = s1.charAt(i + 1);
String tmp = "" + x1 + x2;
nx.add(tmp);
}
for (int j = 0; j < s2.length() - 1; j++) {
char y1 = s2.charAt(j);
char y2 = s2.charAt(j + 1);
String tmp = "" + y1 + y2;
ny.add(tmp);
}
Set<String> intersection = new HashSet<String>(nx);
intersection.retainAll(ny);
double totcombigrams = intersection.size();
return (2 * totcombigrams) / (nx.size() + ny.size());
}
My ultimate goal is to output an ID for every section that has a Dice coefficient of greater than 0.9 with another section.
Thanks for any advice that you can provide!
Make a single pass over all the Strings, and build up a HashMap which maps each bigram to a set of the indexes of the Strings which contain that bigram. (Currently you are building the bigram set 900,000 times, redundantly, for each String.)
Then make a pass over all the sets, and build a HashMap of [index,index] pairs to common-bigram counts. (The latter Map should not contain redundant pairs of keys, like [1,2] and [2,1] -- just store one or the other.)
Both of these steps can easily be parallelized. If you need some sample code, please let me know.
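A rough Python sketch of those two steps (the question's code is Java; the names here are only illustrative):
from collections import defaultdict
from itertools import combinations

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def common_bigram_counts(strings):
    # Step 1: one pass over the strings, mapping each bigram to the set of string indexes containing it.
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for bg in bigrams(s):
            index[bg].add(i)
    # Step 2: one pass over the buckets, counting shared bigrams per unordered pair (i < j).
    common = defaultdict(int)
    for members in index.values():
        for i, j in combinations(sorted(members), 2):
            common[(i, j)] += 1
    return common
The Dice coefficient for a pair is then 2 * common[(i, j)] / (len(bigrams(strings[i])) + len(bigrams(strings[j]))), and only pairs that share at least one bigram ever get scored.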
NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26x26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long, you will probably find almost the same bigrams in each String. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.
You should come up with some kind of inequality like: if D(X1,X2) > 1-p and D(X1,X3) < 1-q, then D(X2,X3) < 1-q+p. Or something like that. Now, if 1-q+p < 0.9, then probably you don't have to evaluate D(X2,X3).
PS: I am not sure about this exact inequality, but I have a gut feeling that this might be right (but I do not have enough time to actually do the derivations now). Look for some of the inequalities with other similarity measures and see if any of them are valid for Dice co-efficient.
=== Also ===
If there are a elements in set A, and if your threshold is r (=0.9), then set B should have a number of elements b such that: r*a/(2-r) <= b <= (2-r)*a/r. This should eliminate the need for lots of comparisons, IMHO. You can probably sort the strings by length and use the window described above to limit comparisons.
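As a small Python illustration of that window (same r = 0.9 threshold as in the question):
def bigram_count_window(a, r=0.9):
    # Only sets whose size b falls in this window can reach Dice >= r against a set of size a,
    # because Dice <= 2 * min(a, b) / (a + b).
    return r * a / (2 - r), (2 - r) * a / r

print(bigram_count_window(1000))  # roughly (818.2, 1222.2)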
Disclaimer first: This will not reduce the number of comparisons you'll have to make. But this should make a Dice comparison faster.
1) Don't build your HashSets every time you do a diceCoefficient() call! It should speed things up considerably if you just do it once for each string and keep the result around (see the sketch after this list).
2) Since you only care about whether a particular bigram is present in the string, you could get away with a BitSet with a bit for each possible bigram, rather than a full HashSet. Coefficient calculation would then be simplified to ANDing two bit sets and counting the number of set bits in the result.
3) Or, if you have a huge number of possible bigrams (Unicode, perhaps?) - or monotonous strings with only a handful of bigrams each - a sorted array of bigrams might provide faster, more space-efficient comparisons.
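Along the lines of suggestion 1), a minimal Python sketch (illustrative only, since the question's code is Java) of precomputing the bigram sets once and reusing them:
def bigrams(s):
    # Same bigram extraction as diceCoefficient(), computed once per string.
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(nx, ny):
    return 2.0 * len(nx & ny) / (len(nx) + len(ny))

strings = []                                        # load the 900,000 sections here
bigram_sets = [bigrams(s) for s in strings]         # built exactly once
matches = [(i, j)
           for i in range(len(strings))
           for j in range(i + 1, len(strings))
           if dice(bigram_sets[i], bigram_sets[j]) > 0.9]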
Is the charset limited somehow? If it is, you can compute character counts by their code in each string and compare these numbers. Such a pre-computation will occupy 2*900K*S bytes of memory (if we assume no character occurs more than 65K times in the same string), where S is the number of distinct characters. Computing the coefficient would then take O(S) time. Sure, this would be helpful if S < 4500.
