Looping to create a new column based on other column values in Python Dataframe [duplicate] - python-3.x

This question already has answers here:
How do I create a new column from the output of pandas groupby().sum()?
(4 answers)
Closed 3 years ago.
I want to create a new column in a pandas DataFrame based on the values of another column across multiple rows.
For example, my python dataframe df:
A | B
------------
10 | 1
20 | 1
30 | 1
10 | 1
10 | 2
15 | 3
10 | 3
I want to create variable C based on the value of variable A, conditioned on variable B across multiple rows. When variable B has the same value in rows i, i+1, ..., the value of C is the sum of variable A over those rows. In this case, my output data frame will be:
A | B | C
--------------------
10 | 1 | 70
20 | 1 | 70
30 | 1 | 70
10 | 1 | 70
10 | 2 | 10
15 | 3 | 25
10 | 3 | 25
I have no idea what the best way to achieve this is. Can anyone help?
Thanks in advance.

Recreate the data:
import pandas as pd
A = [10,20,30,10,10,15,10]
B = [1,1,1,1,2,3,3]
df = pd.DataFrame({'A':A, 'B':B})
df
A B
0 10 1
1 20 1
2 30 1
3 10 1
4 10 2
5 15 3
6 10 3
Then I'll create a lookup Series from the df, holding the sum of A for each value of B:
lookup = df.groupby('B')['A'].sum()
lookup
A
B
1 70
2 10
3 25
Then I'll use that lookup on the df using apply, picking the sum that matches each row's B:
df.loc[:,'C'] = df.apply(lambda row: lookup[lookup.index == row['B']].values[0], axis=1)
df
A B C
0 10 1 70
1 20 1 70
2 30 1 70
3 10 1 70
4 10 2 10
5 15 3 25
6 10 3 25
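The lookup step above can also be written without a row-wise apply. The following is a minimal sketch, assuming the same df and lookup as in this answer; it uses Series.map to look up each row's B value in the index of the lookup Series.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 10, 10, 15, 10],
                   "B": [1, 1, 1, 1, 2, 3, 3]})

# Sum of A for each distinct value of B (index = B, values = sums).
lookup = df.groupby("B")["A"].sum()

# map() replaces each B value with the matching entry in lookup.
df["C"] = df["B"].map(lookup)
print(df)
```

This avoids calling a Python lambda once per row, which tends to matter on larger frames.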

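Use groupby() to group the rows on B, then transform('sum') on A. Unlike a plain sum(), transform returns one value per original row, so the result can be assigned directly as a new column. Pass the string 'sum' rather than the built-in sum function, which newer pandas versions warn about:
df['C'] = df.groupby('B')['A'].transform('sum')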

Related

How to do similar type of columns addition in Pyspark?

I want to do addition of similar type of columns (total columns are more than 100) as follows:
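id   | b  | c    | d    | b_apac | c_apac | d_apac
---------------------------------------------------
abcd | 3  | 5    | null | 45     | 9      | 1
bcd  | 13 | 15   | 1    | 45     | 2      | 10
cd   | 32 | null | 6    | 45     | 90     | 1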
resultant table should look like this:
id   | b_sum | c_sum | d_sum
----------------------------
abcd | 48    | 14    | 1
bcd  | 58    | 17    | 11
cd   | 77    | 90    | 7
Please help me with some generic code, as I have more than 100 columns to do this for.
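You can use Python's built-in sum and check the prefix of each column name. Wrap each column in coalesce so that a null counts as 0; otherwise a single null makes that row's sum null, which would not match the expected output:
from pyspark.sql import functions as F

df.select(
    'id',
    sum([F.coalesce(df[col], F.lit(0)) for col in df.columns if col.startswith('b')]).alias('b_sum'),
    sum([F.coalesce(df[col], F.lit(0)) for col in df.columns if col.startswith('c')]).alias('c_sum'),
    sum([F.coalesce(df[col], F.lit(0)) for col in df.columns if col.startswith('d')]).alias('d_sum'),
).show(10, False)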
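With more than 100 columns, writing one line per prefix does not scale. The sketch below shows one way to derive the column groups from the column names alone. It assumes every regional column ends in `_apac` and that `id` is the only key column; the helper name `group_columns_by_base` is hypothetical, not part of PySpark.

```python
from collections import defaultdict


def group_columns_by_base(columns, suffix="_apac", key="id"):
    """Map each base column name to the columns that should be summed together."""
    groups = defaultdict(list)
    for col in columns:
        if col == key:
            continue  # the key column is carried through, not summed
        # Strip the regional suffix to find the base name, e.g. b_apac -> b.
        base = col[: -len(suffix)] if col.endswith(suffix) else col
        groups[base].append(col)
    return dict(groups)


columns = ["id", "b", "c", "d", "b_apac", "c_apac", "d_apac"]
groups = group_columns_by_base(columns)
print(groups)
```

Each entry can then feed the select shown in the answer: for a group `base, cols`, an expression such as `sum(F.coalesce(F.col(c), F.lit(0)) for c in cols).alias(base + "_sum")` builds the summed column, where `F` is `pyspark.sql.functions`.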

How can I get the count of sequential events pairs from a Pandas dataframe?

I have a dataframe that looks like this:
ID EVENT DATE
1 1 142
1 5 167
1 3 245
2 1 54
2 5 87
3 3 165
3 2 178
And I would like to generate something like this:
EVENT_1 EVENT_2 COUNT
1 5 2
5 3 1
3 2 1
The idea is how many items (ID) go from one event to the next one. Don't care about previous states, I just want to consider the next state from the current state (e.g.: for ID 1, I don't want to count a transition from 1 to 3 because first, it goes to event 5 and then to 3).
The date format is the number of days from a specific date (sort of like SAS format).
Is there a clean way to achieve this?
Let's try this:
(df.groupby([df['EVENT'].rename('EVENT_1'),
df.groupby('ID')['EVENT'].shift(-1).rename('EVENT_2')])['ID']
.count()).rename('COUNT').reset_index().astype(int)
Output:
| | EVENT_1 | EVENT_2 | COUNT |
|---:|----------:|----------:|--------:|
| 0 | 1 | 5 | 2 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 1 |
Details: Groupby on 'EVENT' and shifted 'EVENT' within each ID, then count.
You could use groupby and shift. We'll also use rename_axis and reset_index to tidy up the final output:
(pd.concat([f.groupby([f['EVENT'], f['EVENT'].shift(-1).astype('Int64')]).size()
for _, f in df.groupby('ID')])
.groupby(level=[0, 1]).sum()
.rename_axis(['EVENT_1', 'EVENT_2']).reset_index(name='COUNT'))
[out]
EVENT_1 EVENT_2 COUNT
0 1 5 2
1 3 2 1
2 5 3 1
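Both answers rely on the rows already being in date order within each ID. A minimal sketch that makes this explicit, assuming the column names from the question, sorts by ID and DATE first and then pairs each event with the next one for the same ID:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3, 3],
    "EVENT": [1, 5, 3, 1, 5, 3, 2],
    "DATE":  [142, 167, 245, 54, 87, 165, 178],
})

# Order events chronologically within each ID before pairing them.
ordered = df.sort_values(["ID", "DATE"])

# Pair each event with the next event of the same ID; the last event of an
# ID has no successor (NaN) and is dropped.
pairs = pd.DataFrame({
    "EVENT_1": ordered["EVENT"],
    "EVENT_2": ordered.groupby("ID")["EVENT"].shift(-1),
}).dropna().astype(int)

counts = pairs.groupby(["EVENT_1", "EVENT_2"]).size().reset_index(name="COUNT")
print(counts)
```

If the input is guaranteed to be sorted already, the sort_values call can be dropped.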

Creating A new column based on other columns' values with specific requirement in Python Dataframe

I want to create a new column in Python dataframe with specific requirements from other columns. For example, my python dataframe df:
A | B
-----------
5 | 0
5 | 1
15 | 1
10 | 1
10 | 1
20 | 2
15 | 2
10 | 2
5 | 3
15 | 3
10 | 4
20 | 0
I want to create new column C, with below requirements:
1. When the value of B = 0, then C = 0.
2. Rows with the same value in B get the same value in C. Rows sharing a value in B are classified as start, middle, and end. So value 1 has 1 start, 2 middle, and 1 end; value 3 has 1 start, 0 middle, and 1 end. The calculation for each section, with a threshold I specify as threshold = 10, is as follows.
   Let's look at values B = 1:
   Start:
   C.loc[2] = min(threshold, A.loc[1]) + A.loc[2]
   Middle:
   C.loc[3] = A.loc[3]
   C.loc[4] = A.loc[4]
   End:
   C.loc[5] = min(threshold, A.loc[6])
   The output value of C is the sum of the above calculations.
3. When the value of B is unique and not 0, for example when B = 4:
   C[10] = min(threshold, A.loc[9]) + min(threshold, A.loc[11])
I can solve point 0 and 3. But I'm struggling to solve point 2.
So, the final output will be:
A | B | C
--------------------
5 | 0 | 0
5 | 1 | 45
15 | 1 | 45
10 | 1 | 45
10 | 1 | 45
20 | 2 | 50
15 | 2 | 50
10 | 2 | 50
5 | 3 | 25
15 | 3 | 25
10 | 4 | 20
20 | 0 | 0
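One way to read the rules above: for each run of consecutive rows with the same non-zero B, C is the capped A of the row just before the run, plus every A in the run except the last, plus the capped A of the row just after the run. A single-row run is then the same formula with nothing in between. The sketch below follows that reading. It assumes groups are runs of consecutive equal B values (so the two B = 0 rows are separate runs) and that a missing neighbour at the edge of the frame contributes 0; neither assumption is stated in the question.

```python
import pandas as pd

threshold = 10
df = pd.DataFrame({
    "A": [5, 5, 15, 10, 10, 20, 15, 10, 5, 15, 10, 20],
    "B": [0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 0],
})

# Label each run of consecutive equal B values with its own id.
run = (df["B"] != df["B"].shift()).cumsum()
is_first = run != run.shift(1)   # first row of each run
is_last = run != run.shift(-1)   # last row of each run

# Neighbouring A values capped at the threshold (assumed 0 at the frame edges).
prev_a = df["A"].shift(1).clip(upper=threshold).fillna(0)
next_a = df["A"].shift(-1).clip(upper=threshold).fillna(0)

# Per-row contribution: own A unless the row ends the run, plus the capped
# previous A on the first row and the capped next A on the last row.
contrib = (df["A"].where(~is_last, 0)
           + prev_a.where(is_first, 0)
           + next_a.where(is_last, 0))

# Broadcast the run total to every row of the run; B == 0 always gives 0.
df["C"] = (contrib.groupby(run).transform("sum")
           .where(df["B"] != 0, 0)
           .astype(int))
print(df)
```

On the sample data this reproduces the C column from the expected output, which is the only check available for this reading.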

Create "leakage-free" Variables in Python?

I have a pandas data frame with several thousand observations and I would like to create "leakage-free" variables in Python. So I am looking for a way to calculate e.g. a group-specific mean of a variable without the single observation in row i.
For example:
| Group | Price | leakage-free Group Mean |
-------------------------------------------
| 1 | 20 | 25 |
| 1 | 40 | 15 |
| 1 | 10 | 30 |
| 2 | ... | ... |
I would like to do that with several variables and I would like to create mean, median and variance in such a way, so a computationally fast method might be good. If a group has only one row I would like to enter 0s in the leakage-free Variable.
As I am rather a beginner in Python, some piece of code might be very helpful. Thank You!!
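With a one-liner (select the Price column before transform so the result is a single Series):
import pandas as pd
df = pd.DataFrame({'Group': [1,1,1,2], 'Price':[20,40,10,30]})
df['lfgm'] = df.groupby('Group')['Price'].transform(lambda x: (x.sum()-x)/(len(x)-1)).fillna(0)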
print(df)
Output:
Group Price lfgm
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
3 2 30 0.0
Update:
For median and variance (not one-liners unfortunately):
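df = pd.DataFrame({'Group': [1,1,1,1,2], 'Price':[20,100,10,70,30]})
def f(x):
    x = x.copy()  # work on a copy rather than mutating the group in place
    for i in x.index:
        # all prices in the group except row i
        z = x.loc[x.index != i, 'Price']
        x.at[i, 'mean'] = z.mean()
        x.at[i, 'median'] = z.median()
        x.at[i, 'var'] = z.var()
    return x[['mean', 'median', 'var']]
# group_keys=False keeps the original row index so the join lines up on recent pandas versions
df = df.join(df.groupby('Group', group_keys=False).apply(f))
print(df)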
Output:
Group Price mean median var
0 1 20 60.000000 70.0 2100.000000
1 1 100 33.333333 20.0 1033.333333
2 1 10 63.333333 70.0 1633.333333
3 1 70 43.333333 20.0 2433.333333
4 2 30 NaN NaN NaN
Use:
grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')
df['new_col'] = (mean*n - df['Price'])/(n-1)
print(df)
Group Price new_col
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
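Note: This solution will be faster than using apply; you can check this by timing each version with %%timeit. A group with only one row produces 0/0 here, so add .fillna(0) if you want 0 instead of NaN for those rows.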
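The same subtract-the-current-row idea extends to the variance, since a sample variance can be computed from a sum and a sum of squares. The following is a sketch under that assumption, using the five-row example from the update above; the column names loo_mean and loo_var are illustrative. The median has no such shortcut, so it still needs a per-row computation like the loop shown earlier.

```python
import pandas as pd

df = pd.DataFrame({"Group": [1, 1, 1, 1, 2], "Price": [20, 100, 10, 70, 30]})

grp = df.groupby("Group")["Price"]
n = grp.transform("count")                     # group size
total = grp.transform("sum")                   # group sum of Price
total_sq = (df["Price"] ** 2).groupby(df["Group"]).transform("sum")

m = n - 1                                      # rows left after dropping the current one
loo_sum = total - df["Price"]                  # sum without the current row
loo_sq = total_sq - df["Price"] ** 2           # sum of squares without the current row

# where() turns invalid denominators into NaN so no division by zero occurs.
loo_mean = loo_sum / m.where(m > 0)
loo_var = (loo_sq - m * loo_mean ** 2) / (m - 1).where(m > 1)

# Groups too small to define the statistic get 0, as the question asks.
df["loo_mean"] = loo_mean.fillna(0)
df["loo_var"] = loo_var.fillna(0)
print(df)
```

The sum-of-squares form can lose precision when values are large relative to their spread, so the loop-based version remains a useful cross-check.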

Looking for an Excel Formula

I have a table full of numbers with headings. I also have a separate list of numbers that are contained in the table. I would like to find the location of each number on the list in the table, and then use the cell location to return the corresponding row heading. I demonstrate what I'm looking for below.
How do I go about doing this? I'm imagining some combination of index/match functions, or perhaps vlookup, but none of the formulas that I've tried have worked so far. I'm completely lost at this point, so any help will be appreciated.
Thanks in advance!
Imagine something like this:
Table:
- Category A 1 2 3 4 5
- Category B 6 7 8 9 10
- Category C 11 12 13 14 15
- Category D 16 17 18 19 20
- Category E 21 22 23 24 25
List:
22
5
10
4
18
6
14
2
Desired Outcome:
- 22 Category E
- 5 Category A
- 10 Category B
- 4 Category A
- 18 Category D
- 6 Category B
- 14 Category C
- 2 Category A
Step 1: Find the row that the matching value is in
You can find the matching row by using a combination of a boolean function and SUMPRODUCT:
SUMPRODUCT((dataRange=22)*ROW(dataRange))
(note that this assumes that the items are all unique; it will not work if you have more than one match)
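Step 2: Find the category for that row
OFFSET counts rows relative to its reference cell, so subtract the reference cell's own row number from the matching row:
OFFSET(categoryACell, rows - ROW(categoryACell), 0)
so the resulting function would be:
OFFSET(categoryACell, SUMPRODUCT((dataRange=22)*ROW(dataRange)) - ROW(categoryACell), 0)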
A | B | C | D | E | F
_________________________________________________________
1 || Category A | 1 | 2 | 3 | 4 | 5
2 || Category B | 6 | 7 | 8 | 9 | 10
3 || Category C | 11 | 12 | 13 | 14 | 15
4 || Category D | 16 | 17 | 18 | 19 | 20
5 || Category E | 21 | 22 | 23 | 24 | 25
6 ||
7 ||
8 ||
9 ||
10 || 22 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A10)*ROW(B1:F5)))
11 || 5 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A11)*ROW(B1:F5)))
12 || 10 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A12)*ROW(B1:F5)))
13 || 4 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A13)*ROW(B1:F5)))
14 || 18 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A14)*ROW(B1:F5)))
15 || 6 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A15)*ROW(B1:F5)))
16 || 14 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A16)*ROW(B1:F5)))
17 || 2 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A17)*ROW(B1:F5)))
