python3: how to calculate all the variations of 10 factors (each has 15 values) grouped by 3 - python-3.x

Could you help me, please, to calculate all the variations of 10 factors (each has 15 values) grouped by 3?
We have 10 factors.
Each factor has 15 values, e.g. 1, 2, 3, 4, 5, 6, ..., 15.
All the possible combinations of the first triple of factors (e.g. factor1, factor2, factor3) are:
15 (factor1 values) x 15 (factor2 values) x 15 (factor3 values) = 3,375
This should be calculated for all the possible triplets among the 10 factors:
3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 = 59,049 possible combinations of 3 factors
(except duplicates like factor1, factor1, factor2)
As a result we have 59,049 combinations of 3 factors x 3,375 combinations of their values = ~199 million records
Desirable output:
| 1st place | 2nd place | 3rd place | 1st place value | 2nd place value | 3rd place value |
|-----------|-----------|-----------|-----------------|-----------------|-----------------|
| factor1 | factor2 | factor3 | 1 | 1 | 1 |
| factor1 | factor2 | factor3 | 1 | 1 | 2 |
| factor1 | factor2 | factor3 | 1 | 1 | 3 |
| ... | ... | ... | ... | ... | ... |
| factor8 | factor9 | factor10 | 15 | 15 | 15 |
Thank you for any hint on how to meet the goal.

Key to your question: the number of combinations "except duplicates" is simply a binomial coefficient. The factor triples can be generated with itertools.combinations(), and the value combinations with itertools.product() or pandas.MultiIndex.from_product() (see also this answer).
Therefore, the exact number of (factor1, factor2, factor3) triples is binom(10, 3) = 120 instead of 3**10 = 59,049. The total number of rows is thus 120 * 3,375 = 405,000.
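A quick sanity check of those counts (just a sketch, assuming Python 3.8+ for math.comb):
from math import comb

comb(10, 3)            # 120 distinct factor triples
comb(10, 3) * 15 ** 3  # 405000 rows in total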
Solution:
I parameterized all the numbers just to make the mathematical logic clear. In addition, this solution can be applied to a varying number of values per factor by recalculating comb_facs accordingly.
import pandas as pd
import numpy as np
import itertools
from scipy.special import comb

# data and parameters
n_cols = 10  # number of factors
k_cols = 3   # group size; binomial coeff. (n k)
n_vals = 15  # 15 values per factor

dic = {}
for i in range(1, n_cols + 1):
    dic[f"f{i}"] = np.array([j for j in range(1, 1 + n_vals)], dtype=object)
df = pd.DataFrame(dic)

# preallocate the output arrays: factors and values
comb_cols = comb(n_cols, k_cols)   # binom(10, 3) = 120
comb_facs = int(n_vals ** k_cols)  # NOTE: must be recalculated if the number of values is not constant
total_len = int(comb_cols * comb_facs)
factors = np.zeros((total_len, k_cols), dtype=object)
values = np.zeros((total_len, k_cols), dtype=int)

# the actual iteration
for i, tup in enumerate(itertools.combinations(df.columns, k_cols)):
    # 1. Cartesian product of (facA, facB, facC).
    #    Can also use list(itertools.product())
    vals = pd.MultiIndex.from_product(
        [df[col].values for col in tup]  # e.g. df.f1, df.f2, df.f3
    )
    arr_vals = pd.DataFrame(index=vals).reset_index().values
    # 2. Populate factor names and values into the output arrays
    factors[i * comb_facs:(i + 1) * comb_facs, :] = tup  # broadcasting
    values[i * comb_facs:(i + 1) * comb_facs, :] = arr_vals

# result
pd.concat([pd.DataFrame(factors, columns=["1p fac", "2p fac", "3p fac"]),
           pd.DataFrame(values, columns=["1p val", "2p val", "3p val"])], axis=1)
Out[41]:
1p fac 2p fac 3p fac 1p val 2p val 3p val
0 f1 f2 f3 1 1 1
1 f1 f2 f3 1 1 2
2 f1 f2 f3 1 1 3
3 f1 f2 f3 1 1 4
4 f1 f2 f3 1 1 5
... ... ... ... ... ...
404995 f8 f9 f10 15 15 11
404996 f8 f9 f10 15 15 12
404997 f8 f9 f10 15 15 13
404998 f8 f9 f10 15 15 14
404999 f8 f9 f10 15 15 15
[405000 rows x 6 columns]
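For comparison, a minimal pure-itertools sketch (not part of the answer above, just an alternative under the same assumptions) builds the same 405,000-row frame without preallocating numpy arrays:
import itertools
import pandas as pd

factors = [f"f{i}" for i in range(1, 11)]  # f1 ... f10
values = range(1, 16)                      # 1 ... 15

rows = [
    (*triple, *vals)
    for triple in itertools.combinations(factors, 3)  # 120 factor triples
    for vals in itertools.product(values, repeat=3)   # 3,375 value combos per triple
]
out = pd.DataFrame(rows, columns=["1p fac", "2p fac", "3p fac",
                                  "1p val", "2p val", "3p val"])
print(len(out))  # 405000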

Related

Pandas, by-row calculations with the first value that satisfies a condition

I have a dataframe:
df = pd.DataFrame(
    [
        [10, 20, 40],
        [2, 1, 26],
        [1, 2, 60],
    ], columns=['f1', 'f2', 'f3']
)
df['cumsum'] = df.sum(axis=1)
df['cumsum_perc'] = (df['cumsum'] * 0.1).astype(int)
| | f1 | f2 | f3 | cumsum | cumsum_perc |
|---:|-----:|-----:|-----:|---------:|--------------:|
| 0 | 10 | 20 | 40 | 70 | 7 |
| 1 | 2 | 1 | 26 | 29 | 2 |
| 2 | 1 | 2 | 60 | 63 | 6 |
As you can see, for each row I have calculated the cumulative sum, and then, from the cumulative sum, an arbitrary percentage of it (in this case 10%).
Each f column has its ponder (weight) value (f_pon): f1 = 1, f2 = 2, f3 = 3.
Now, for each row, I have to find the f column with the highest value that is less than or equal to cumsum_perc (f_le), in order to determine its f_pon.
Let's consider the third row as an example.
f_le = f2 (2 < 6), which implies f_pon = 2.
Now I have to check whether there is any remainder left in cumsum_perc after subtracting f_le.
rem = cumsum_perc (6) - f_le (2) = 4.
I have to express the remainder as a fraction of the first f column to the right of f_le (here f3), so rem_perc = rem (4) / f3 (60) = 0.066.
The final result for the third row is f_pon (2) + rem_perc = 2.066.
If we apply the same logic to the first row, then f1 is f_le, and there is no remainder because cumsum_perc (7) - f_le (10) = -3. If rem is negative, it should be set to 0.
So the result is f1_pon (1) + rem (0) / f2 (20) = 1.
For the second row, the result is also 1, because there is no remainder.
How can I calculate the final result for each row in the most efficient way?
To be honest it is difficult to follow your rules, but since you know your rules, I suggest implementing a helper function and using df.apply(helper, axis=1) row-wise.
This might not be the fastest implementation, but at least you get your results.
def helper(x):
    basic_set = x[['f1', 'f2', 'f3']]
    cumsum_perc = x['cumsum_perc']
    f_pon = basic_set[basic_set < cumsum_perc].max()
    rem = cumsum_perc - f_pon
    if not rem:
        rem = 0
    rem_perc = rem / x['cumsum']
    if not rem_perc:
        rem_perc = 0
    return f_pon + rem_perc

df['ans'] = df.apply(helper, axis=1)
>>> df
f1 f2 f3 cumsum cumsum_perc ans
0 10 20 40 70 7 NaN
1 2 1 26 29 2 1.034483
2 1 2 60 63 6 2.063492
I think you can adapt the helper, if mine is wrong.
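For what it's worth, here is a sketch that follows the stated rules more literally; the fall-back to f1 when no column is <= cumsum_perc is an assumption taken from the first-row example, and the ponder weights are hard-coded from the question:
f_cols = ['f1', 'f2', 'f3']
ponder = {'f1': 1, 'f2': 2, 'f3': 3}  # f_pon weights from the question

def helper(row):
    # columns whose value is <= cumsum_perc
    eligible = [c for c in f_cols if row[c] <= row['cumsum_perc']]
    # f_le: eligible column holding the highest value; assumed fall-back to f1 otherwise
    f_le = max(eligible, key=lambda c: row[c]) if eligible else 'f1'
    rem = max(row['cumsum_perc'] - row[f_le], 0)  # negative remainder -> 0
    nxt = f_cols.index(f_le) + 1                  # first f column to the right of f_le
    rem_perc = rem / row[f_cols[nxt]] if nxt < len(f_cols) and rem else 0
    return ponder[f_le] + rem_perc

df['ans'] = df.apply(helper, axis=1)
# expected per the question: 1.0, 1.0, ~2.067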

Locate dataframe rows where values are outside bounds specified for each column

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
          [10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i in range(len(df.columns)):
    df = df.query('(bounds[0][i] < df.columns[i]) & (df.columns[i] < bounds[1][i])')
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?
One way using pandas.DataFrame.apply with pandas.Series.between:
bounds = dict(zip(df.columns, zip(*bounds)))
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
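If you want to stay with pure array operations (as the question asks), a numpy-based sketch along these lines should also work, using the original bounds tuple (before it is replaced by the dict above) and the strict inequalities from the question:
import numpy as np

lower = np.asarray(bounds[0])
upper = np.asarray(bounds[1])

# True only for rows where every column value lies strictly inside its bounds
mask = ((df.values > lower) & (df.values < upper)).all(axis=1)
new_df = df[mask]
print(new_df)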

How to find exponential formula coefficients?

I have the following pairs of values:
X Y
1 2736
2 3124
3 3560
4 4047
5 4594
6 5205
7 5890
8 6658
9 7518
10 8480
18 21741
32 108180
35 152237
36 170566
37 191068
38 214087
39 239838
40 268679
When I put these pairs in Excel, I get an exponential formula:
Y = 2559*e^(0.1167*X)
with an accuracy of 99.98%.
Is there a way to ask from Excel to provide a formula in the following format:
Y = (A/B)*C^X-D
If not, is it possible to convert the above formula to the wanted one?
Note, that I am not familiar with Matlab.
You already have it!
A = 2559
B = 1
C = exp(0.1167)
D = 0
You'll see that it is equivalent to your formula Y = 2559*e^(0.1167*X), because e^(0.1167*X) = (e^0.1167)^X
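A quick numeric check of that identity (just a sketch; 0.1167 is Excel's rounded exponent):
import numpy as np

A, B, C, D = 2559, 1, np.exp(0.1167), 0
X = np.array([1, 5, 10, 20, 40])

lhs = (A / B) * C ** X - D       # Y = (A/B)*C^X - D
rhs = 2559 * np.exp(0.1167 * X)  # Y = 2559*e^(0.1167*X)
print(np.allclose(lhs, rhs))     # True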

Why does Excel's average give a different result?

Here's the table:
Shouldn't they give the same result mathematically (the average of the per-column averages and the average of the per-row averages)?
The missing cells mean that your cells aren't all weighted evenly.
For example, row 11 has only two cells, 82.67 and 90. So in your row average for row 11 they are weighted much more heavily than in your column averages, where they count as 1/13 and 1/14 of a column instead of 1/2 of a row.
Try filling up all the empty cells with 0 and the averages should match.
Taking a more extreme version of Ruslan Karaev's example:
5 5 5 | 5
1     | 1
0     | 0
---------
2 5 5

Average of row averages    = (5 + 1 + 0) / 3 = 2
Average of column averages = (2 + 5 + 5) / 3 = 4
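A small pandas sketch of that extreme example, with the missing cells as NaN (which mean(), like Excel's AVERAGE, skips), reproduces the discrepancy:
import numpy as np
import pandas as pd

t = pd.DataFrame([[5, 5, 5],
                  [1, np.nan, np.nan],
                  [0, np.nan, np.nan]])

print(t.mean(axis=1).mean())  # average of row averages    -> 2.0
print(t.mean(axis=0).mean())  # average of column averages -> 4.0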
Yes, for example, the following two expressions:
((a + b)/2 + (X + Y)/2) / 2    and    ((a + X)/2 + (b + Y)/2) / 2
are indeed mathematically equivalent, both coming out to be (a + b + X + Y) / 4.
However, unless you have sufficient precision to store the values, you may find that rounding errors accumulate differently depending on the order of operations.
You can see this sort of effect in a much simpler example if you assume a 3-digit precision and divide one by three, then multiply the result of that by three again:
1 / 3 -> 0.333, 0.333 x 3 -> 0.999
Contrast that with doing the operations in the opposite order:
1 x 3 = 3, 3 / 3 = 1
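In Python you can mimic that 3-digit rounding like this (a tiny sketch of the same point):
print(round(round(1 / 3, 3) * 3, 3))  # 0.999 (divide first, then multiply)
print(round(round(1 * 3, 3) / 3, 3))  # 1.0   (multiply first, then divide)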

Pandas multi-index subtract from value based on value in other column part 2

Based on a thorough and accurate response to this question, I am now faced with a new issue based on slightly different data.
Given this data frame:
df = pd.DataFrame({
    ('A', 'a'): [23, 3, 54, 7, 32, 76],
    ('B', 'b'): [23, 'n/a', 54, 'n/a', 32, 76],
    ('possible', 'possible'): [100, 100, 100, 100, 100, 100]
})
df
A B possible
a b possible
0 23 23 100
1 3 n/a 100
2 54 54 100
3 7 n/a 100
4 32 32 100
5 76 76 100
I'd like to subtract 4 from 'possible', per row, for any instance (column) where the value is 'n/a' for that row (and then change all 'n/a' values to 0).
A B possible
a b possible
0 23 23 100
1 3 n/a 96
2 54 54 100
3 7 n/a 96
4 32 32 100
5 76 76 100
Some conditions:
It may occur that a column is all floats (though they appear to be integers upon inspection). This was not factored into the original question.
It may also occur that a row contains two instances (columns) of 'n/a' values. This was addressed by the previous solution.
Here is the previous solution:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -= (df.loc[:, idx[('A','B'),:]] == 'n/a').sum(axis=1) * 4
df.replace({'n/a':0}, inplace=True)
It works, except for where a column (A or B) contains all floats (seemingly integers). When that's the case, this error occurs:
TypeError: Could not compare ['n/a'] with block values
I think you can add casting to string with astype in the condition:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -= \
    (df.loc[:, idx[('A', 'B'), :]].astype(str) == 'n/a').sum(axis=1) * 4
df.replace({'n/a': 0}, inplace=True)
print(df)
A B possible
a b possible
0 23 23 100
1 3 0 96
2 54 54 100
3 7 0 96
4 32 32 100
5 76 76 100
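An equivalent sketch without IndexSlice (just another way to express the same fix, assuming the same toy frame):
# count 'n/a' cells per row across the A and B column groups
na_count = df[['A', 'B']].astype(str).eq('n/a').sum(axis=1)

df[('possible', 'possible')] -= na_count * 4
df = df.replace({'n/a': 0})
print(df)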
