Pandas, by row calculations with the first value that satisfies condition

I have a dataframe:
df = pd.DataFrame(
[
[10,20,40],
[2,1,26],
[1, 2, 60],
], columns = ['f1', 'f2', 'f3']
)
df['cumsum'] = df.sum(axis=1)
df['cumsum_perc'] = (df['cumsum'] * 0.1).astype(int)
| | f1 | f2 | f3 | cumsum | cumsum_perc |
|---:|-----:|-----:|-----:|---------:|--------------:|
| 0 | 10 | 20 | 40 | 70 | 7 |
| 1 | 2 | 1 | 26 | 29 | 2 |
| 2 | 1 | 2 | 60 | 63 | 6 |
As you can see, for each row I have calculated the cumulative sum, and then an arbitrary percentage (in this case 10%) of that cumulative sum.
Each f column has its ponder value (f_pon): f1 = 1, f2 = 2, f3 = 3.
Now, for each row, I have to find the f column with the highest value that is less than or equal to cumsum_perc (f_le) in order to determine its f_pon.
Let's consider the third row, for example.
f_le = f2 (2 < 6), which implies f_pon = 2.
Now I have to see whether there is any remainder in cumsum_perc - f_le.
rem = cumsum_perc (6) - f_le (2) = 4.
I have to express the remainder as a fraction of the value of the first f column to the right of f_le (f3), so here we have rem_perc = rem (4) / f3 (60) = 0.0667.
The final result for the third row is f_pon (2) + rem_perc = 2.0667.
If we apply the same logic to the first row, then f1 is f_le, and there is no remainder because cumsum_perc (7) - f_le (10) = -3. If rem is negative it should be set to 0.
So the result is f1_pon (1) + rem (0) / f2 (20) = 1.
For the second row, the result is also 1, because there is no remainder.
How can I calculate the final results for each row in the most efficient way?

To be honest, it is difficult to follow your rules, but since you know your rules, I suggest implementing a helper function and using df.apply(helper, axis=1) row-wise.
This might not be the fastest implementation, but at least you get your results.
def helper(x):
    basic_set = x[['f1','f2','f3']]
    cumsum_perc = x['cumsum_perc']
    f_pon = basic_set[basic_set<cumsum_perc].max()
    rem = cumsum_perc - f_pon
    if not rem:
        rem = 0
    rem_perc = rem / x['cumsum']
    if not rem_perc:
        rem_perc = 0
    return f_pon + rem_perc
df['ans'] = df.apply(helper, axis=1)
>>> df
f1 f2 f3 cumsum cumsum_perc ans
0 10 20 40 70 7 NaN
1 2 1 26 29 2 1.034483
2 1 2 60 63 6 2.063492
I think you can adapt the helper, if mine is wrong.
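One possible adaptation, as a minimal sketch that follows the rules as worded in the question rather than the helper above. Several points are assumptions because the question does not spell them out: when no f column is less than or equal to cumsum_perc, f1 is used as f_le (as in the question's first-row example); when f_le is the last column there is nothing to its right, so the remainder term is taken as 0; ties are resolved in favour of the leftmost column. The names F_COLS, PONDER and ponder_score are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 20, 40], [2, 1, 26], [1, 2, 60]],
    columns=["f1", "f2", "f3"],
)
df["cumsum"] = df.sum(axis=1)
df["cumsum_perc"] = (df["cumsum"] * 0.1).astype(int)

F_COLS = ["f1", "f2", "f3"]           # hypothetical name: the f columns, left to right
PONDER = {"f1": 1, "f2": 2, "f3": 3}  # ponder values from the question


def ponder_score(row):
    vals = row[F_COLS]
    limit = row["cumsum_perc"]
    # f_le: column with the highest value that is <= cumsum_perc
    eligible = vals[vals <= limit]
    # Assumption: fall back to f1 when nothing qualifies (first-row example)
    f_le = eligible.idxmax() if not eligible.empty else F_COLS[0]
    # Negative remainders are set to 0, as the question requires
    rem = max(limit - vals[f_le], 0)
    pos = F_COLS.index(f_le)
    # Assumption: no column to the right of f_le means no remainder term
    if rem == 0 or pos + 1 >= len(F_COLS):
        return float(PONDER[f_le])
    return PONDER[f_le] + rem / vals[F_COLS[pos + 1]]


df["result"] = df.apply(ponder_score, axis=1)
print(df)
```

On the sample frame this gives 1, 1 and 2 + 4/60 for the three rows, which matches the values worked out by hand in the question.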

Related

Pandas find max column, subtract from another column and replace the value

I have a df like this:
A | B | C | D
14 | 5 | 10 | 5
4 | 7 | 15 | 6
100 | 220 | 6 | 7
For each row, I want to find the max value across columns A, B, C, subtract column D from it, and replace the max with the result.
Expected result:
A | B | C | D
9 | 5 | 10 | 5
4 | 7 | 9 | 6
100 | 213 | 6 | 7
So for the first row, it would select 14 (the max of 14, 5, 10), subtract column D from it (14 - 5 = 9) and replace the initial value 14 with 9.
I know how to find the max value of A, B, C and subtract D from it, but I am stuck on the replacing part.
I thought of putting the result in another column called E, then finding the max of A, B, C again and replacing it with column E, but that would make no sense since I would be attempting to assign a value to a function call. Is there any other option to do this?
# Example df
list_columns = ['A', 'B', 'C', 'D']
list_data = [[14, 5, 10, 5], [4, 7, 15, 6], [100, 220, 6, 7]]
df = pd.DataFrame(columns=list_columns, data=list_data)
# Calculate the max and subtract
df['e'] = df[['A', 'B', 'C']].max(axis=1) - df['D']
# To replace, maybe something like this. But this line makes no sense since it's backwards
# (left commented out: assigning to a function call is a SyntaxError)
# df[['A', 'B', 'C']].max(axis=1) = df['D']
Use DataFrame.mask to replace only the maximal value: compare all values of the selected columns with the row maximum and substitute s - df['D'] where they match:
cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
print (df)
A B C D
0 9 5 10 5
1 4 7 9 6
2 100 213 6 7
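Note that comparing against the row maximum replaces every cell that ties for the maximum. If only one cell per row should change, a possible variant is to locate the first maximum by position with NumPy. This is a sketch under that assumption, not part of the original answer; the names arr, rows and pos are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [14, 4, 100], "B": [5, 7, 220], "C": [10, 15, 6], "D": [5, 6, 7]}
)
cols = ["A", "B", "C"]

# Work on a writable copy of the selected columns
arr = df[cols].to_numpy(copy=True)
rows = np.arange(len(df))
# Position of the first maximum in each row
pos = arr.argmax(axis=1)
# Subtract D only at that position
arr[rows, pos] -= df["D"].to_numpy()
df[cols] = arr
print(df)
```

For the sample data, where no row has tied maxima, this produces the same frame as the mask approach.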

python3: how to calculate all the variations of 10 factors (each has 15 values) grouped by 3

Would you please help me calculate all the variations of 10 factors (each has 15 values) grouped by 3?
We have 10 factors.
Each factor has 15 values. E.g. 1,2,3,4,5,6...15
All the possible combinations of the first triple of factors (e.g. factor1, factor2, factor3) are:
15 (factor1 combination values) x 15 (factor2 combination values) x 15 (factor3 combination values) = 3 375
This should be calculated for all the possible triplets among the 10 factors:
3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 = 59 049 possible combinations of 3 factors
(except duplicates like factor1, factor1, factor2)
As a result we have 59 049 combinations of 3 factors x 3 375 combinations of its values = 199 mln records
Desirable output:
1st place 2nd place 3rd place 1st place value 2nd place value 3rd place value
factor1 factor2 factor3 1 1 1
factor1 factor2 factor3 1 1 2
factor1 factor2 factor3 1 1 3
… … … … … …
factor8 factor9 factor10 15 15 15
Thank you for any hint on how to meet the goal.
Key to your question: the number of combinations "except duplicates" is simply a binomial coefficient, and the instances can be generated by itertools.product() or pandas.MultiIndex.from_product() (this answer also).
Therefore, the exact number of (factor1, factor2, factor3) triples is binom(10, 3) = 120 instead of 3**10 = 59,049. The total number of rows is thus 120 * 3375 = 405,000.
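The counts can be checked with the standard library alone. A small sketch, assuming Python 3.8+ for math.comb; the variable names are hypothetical:

```python
import math

n_factors = 10
k = 3
n_values = 15

n_triples = math.comb(n_factors, k)   # unordered triples of distinct factors
per_triple = n_values ** k            # value combinations for one triple
total_rows = n_triples * per_triple

print(n_triples, per_triple, total_rows)  # 120 3375 405000
```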
Solution:
I parameterized all the numbers just to make the mathematical logic clear. In addition, this solution can be applied to a varying number of values by recalculating comb_facs accordingly.
import pandas as pd
import numpy as np
import itertools
from scipy.special import comb
# data and parameters
n_cols = 10
k_cols = 3 # binomial coeff. (n k)
n_vals = 15 # 15 vals
dic = {}
for i in range(1, n_cols+1):
    dic[f"f{i}"] = np.array([j for j in range(1, 1+n_vals)], dtype=object)
df = pd.DataFrame(dic)
# preallocate the output arrays: factors and values
comb_cols = comb(n_cols, k_cols) # binom(10,3) = 120
comb_facs = int(n_vals ** k_cols) # NOTE: must recalculate if number of values are not constant
total_len = int(comb_cols * comb_facs)
factors = np.zeros((total_len, k_cols), dtype=object)
values = np.zeros((total_len, k_cols), dtype=int)
# the actual iteration
for i, tup in enumerate(itertools.combinations(df.columns, k_cols)):
    # 1. Cartesian product of (facA, facB, facC).
    # can also use list(itertools.product())
    vals = pd.MultiIndex.from_product(
        [df[tup[j]].values for j in range(k_cols)] # df.f1, df.f2, df.f3
    )
    arr_vals = pd.DataFrame(index=vals).reset_index().values
    # 2. Populate factor names and values into output arrays
    factors[i * comb_facs:(i + 1) * comb_facs, :] = tup # broadcasting
    values[i * comb_facs:(i + 1) * comb_facs, :] = arr_vals
# result
pd.concat([pd.DataFrame(factors, columns=["1p fac", "2p fac", "3p fac"]),
           pd.DataFrame(values, columns=["1p val", "2p val", "3p val"])], axis=1)
Out[41]:
1p fac 2p fac 3p fac 1p val 2p val 3p val
0 f1 f2 f3 1 1 1
1 f1 f2 f3 1 1 2
2 f1 f2 f3 1 1 3
3 f1 f2 f3 1 1 4
4 f1 f2 f3 1 1 5
... ... ... ... ... ...
404995 f8 f9 f10 15 15 11
404996 f8 f9 f10 15 15 12
404997 f8 f9 f10 15 15 13
404998 f8 f9 f10 15 15 14
404999 f8 f9 f10 15 15 15
[405000 rows x 6 columns]
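For comparison, the same rows can be produced lazily with itertools only, which the answer mentions as an option. This is a sketch, not the answer's code: it assumes every factor takes the same values 1..15, and the generator name iter_rows is hypothetical.

```python
import itertools

factors = [f"f{i}" for i in range(1, 11)]  # f1 .. f10
values = range(1, 16)                      # 1 .. 15, same for every factor


def iter_rows(factors, values, k=3):
    """Yield (fac1, fac2, fac3, val1, val2, val3) tuples one at a time."""
    for names in itertools.combinations(factors, k):
        for vals in itertools.product(values, repeat=k):
            yield names + vals


all_rows = list(iter_rows(factors, values))
print(len(all_rows))   # 405000
print(all_rows[0])     # ('f1', 'f2', 'f3', 1, 1, 1)
print(all_rows[-1])    # ('f8', 'f9', 'f10', 15, 15, 15)
```

Because iter_rows is a generator, the rows can also be streamed to a file instead of being collected into a list.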

Only drop duplicates if number of duplicates is less than X

I need to drop duplicate rows in my DataFrame only if the number of duplicates is less than x (e.g. 3)
(if more than 3 duplicates, keep them !)
Sample:
where count is number of duplicates and duplicates are in col data
data | count
-------------
a | 1
b | 2
b | 2
c | 1
d | 3
d | 3
d | 3
Desired result:
data | count
-------------
a | 1
b | 1
c | 1
d | 3
d | 3
d | 3
How can I achieve this? Thanks in advance.
I believe you need to chain two conditions in boolean indexing: DataFrame.duplicated (inverted) to keep the first row of each group, OR-ed with a test for count greater than or equal to N. Finally, set count to 1 for the rows below N:
N = 3
df1 = df[~df.duplicated('data') | df['count'].ge(N)].copy()
df1.loc[df1['count'] < N, 'count'] = 1
print (df1)
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
IIUC, you could do the following:
# create mask for non-duplicates and groups larger than 3
mask = (df.groupby('data')['count'].transform('count') >= 3) | ~df.duplicated('data')
# filter
filtered = df.loc[mask].drop('count', axis=1)
# reset count column
filtered['count'] = filtered.groupby('data')['data'].transform('count')
print(filtered)
Output
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
N = 3
df['count'] = df['count'].apply(lambda x: 1 if x < N else x)
result = pd.concat([df[df['count'].eq(1)].drop_duplicates(), df[df['count'].ge(N)]])
result
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
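None of the snippets above constructs the sample frame, so here is a self-contained sketch for experimenting. It assumes the group size should be derived from the data column itself (via transform('size')) rather than read from a precomputed count column; the names size, keep and out are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"data": list("abbcddd")})
N = 3

# Number of rows sharing each value of `data`
size = df.groupby("data")["data"].transform("size")
# Keep the first row of every group, plus all rows of groups with N or more rows
keep = ~df.duplicated("data") | size.ge(N)
# Recompute the count on the filtered frame
out = df[keep].assign(count=lambda d: d.groupby("data")["data"].transform("size"))
print(out)
```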

Dataframe: Computed row based on cell above and cell on the left

I have a dataframe with a bunch of integer values. I then compute the column totals and append it as a new row to the dataframe. So far so good.
Now I want to append another computed row where the value of each cell is the sum of cell above and the cell on the left. You can see what I mean below:
----------------------------------------------------------------
|250000 |0 |145000 |145000 |220000 |165000 |145000 |145000 |
----------------------------------------------------------------
|250000 |250000 |395000 |540000 |760000 |925000 |1070000|1215000 |
----------------------------------------------------------------
How can this be done?
I think you need Series.cumsum, selecting the last row (the total row) with DataFrame.iloc:
df = pd.DataFrame({
'B':[4,5,4],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
})
df.loc['sum'] = df.sum()
df.loc['cumsum'] = df.iloc[-1].cumsum()
#if need only cumsum row
#df.loc['cumsum'] = df.sum().cumsum()
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
sum 13 24 9 14
cumsum 13 37 46 60
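To see why a cumulative sum matches "cell above plus cell on the left", here is a small check using the totals row from the question. It is only a sketch; the names totals, running and manual are hypothetical.

```python
import pandas as pd

# Totals row taken from the question
totals = pd.Series([250000, 0, 145000, 145000, 220000, 165000, 145000, 145000])

# Vectorised version: running total along the row
running = totals.cumsum()

# Explicit recurrence: new cell = cell above + new cell on the left
manual = []
left = 0  # nothing to the left of the first column
for above in totals:
    left = left + above
    manual.append(left)

print(running.tolist())
print(manual == running.tolist())  # True
```

The printed list reproduces the second row shown in the question, ending in 1215000.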

Excel to calculate if values in ranges meet certain criteria using VBA

I have two ranges in excel, say:
x | y
------------
5 | -1
46 | -4
2 | 1
67 | -1
22 | 1
6 | 0
34 | 0
7 | -2
I want to calculate the sum of the second column for values less than 0 only if the respective value in the first column is less than 10 (i.e. the sum of y(i) where y(i) < 0 and x(i) < 10). Hence in this case the sum will be -3.
Assuming your headers are in A1:B1 and your data is A2:B9 use this:
=SUMIFS(B2:B9,A2:A9,"<10",B2:B9,"<0")
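As a cross-check outside Excel, the same two-condition sum can be reproduced in pandas with the sample data from the question. This is only an illustrative sketch and not part of the Excel answer:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [5, 46, 2, 67, 22, 6, 34, 7],
        "y": [-1, -4, 1, -1, 1, 0, 0, -2],
    }
)

# Sum y where x < 10 and y < 0, mirroring the two SUMIFS criteria
total = df.loc[(df["x"] < 10) & (df["y"] < 0), "y"].sum()
print(total)  # -3
```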
Try something like
Function calc() As Integer
    Dim sum As Integer: sum = 0
    Dim c As Range
    For Each c In ThisWorkbook.Worksheets(1).Range("A1:A15")
        If c.Value < 10 And c.Offset(0, 1).Value < 0 Then
            sum = sum + c.Offset(0, 1).Value
        End If
    Next c
    calc = sum
End Function
