Pandas - create column with aggregate results - python-3.x

I have a dataset with one row per loan, and a borrower can have multiple loans. The 'Property' flag shows whether there is any security behind the loan. I am trying to aggregate this flag at the borrower level: if any of a borrower's Property flags is 'Y', I want an additional column that is 'Y' on every row for that borrower.
The short example below shows what the end result should look like (the 'Result' column). Any help would be appreciated.
import pandas as pd
data = {'Borrower': [1,2,2,2,3,3,4,5,6,6],
'Loan' : [1,2,3,4,5,6,7,8,9,10],
'Property': ["Y","N","Y","Y","N","Y","N","Y","N","N"],
'Result': ['Y','Y','Y','Y','Y','Y','N','Y','N','N']}
df = pd.DataFrame.from_dict(data)

You can use transform on Property after grouping by Borrower. Because 'Y' sorts after 'N' (its ASCII code is larger), max(Property) returns 'Y' whenever a borrower has at least one property flagged 'Y'.
df['Result2'] = df.groupby('Borrower')['Property'].transform('max')
df
Out[202]:
   Borrower  Loan Property Result Result2
0         1     1        Y      Y       Y
1         2     2        N      Y       Y
2         2     3        Y      Y       Y
3         2     4        Y      Y       Y
4         3     5        N      Y       Y
5         3     6        Y      Y       Y
6         4     7        N      N       N
7         5     8        Y      Y       Y
8         6     9        N      N       N
9         6    10        N      N       N
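If relying on the string ordering of 'Y' and 'N' feels too implicit, the same flag can be derived from an explicit boolean test. The following is only a sketch of that alternative, not part of the original answer; the column name Result3 is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Borrower": [1, 2, 2, 2, 3, 3, 4, 5, 6, 6],
    "Loan": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Property": ["Y", "N", "Y", "Y", "N", "Y", "N", "Y", "N", "N"],
})

# True where the loan is secured, then "any" per borrower,
# broadcast back to every row of that borrower
has_property = df["Property"].eq("Y").groupby(df["Borrower"]).transform("any")

# Map the boolean back to the Y/N convention used in the question
# (Result3 is a hypothetical column name)
df["Result3"] = has_property.map({True: "Y", False: "N"})
print(df)
```

This version keeps working if the flag values change to something that does not sort conveniently, as long as the comparison in eq() is updated.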

Related

Two new columns based on return has two values in dataframe apply

I have a DataFrame:
num
1
2
3
def foo(x):
    return x**2, x**3
When I run df['sq','cube'] = df['num'].apply(foo)
it creates a single column, like below:
num (sq,cub)
1 (1,1)
2 (4,8)
3 (9,27)
I want these as separate columns with their values:
num sq cub
1 1 1
2 4 8
3 9 27
How can I achieve this...?
obj = df['num'].apply(foo)
df['sq'] = obj.str[0]
df['cube'] = obj.str[1]
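Another way to split the tuples, sketched here as an assumption rather than as the answerer's method, is to build a two-column DataFrame from the list of tuples and join it back on the index. The variable name result is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3]})

def foo(x):
    return x**2, x**3

# Each element returned by apply is a (square, cube) tuple;
# tolist() turns the Series into a list of tuples that DataFrame can unpack
result = pd.DataFrame(df["num"].apply(foo).tolist(),
                      index=df.index, columns=["sq", "cube"])

# Join on the shared index so rows stay aligned
df = df.join(result)
print(df)
```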

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement but only get error statements. I think it might need nesting but don't have the experience to do so.
index A B C D
0 X d 4 2
1 Y b 5 2
Where column D represents frequency of column C.
desired output is:
index A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
#recreate the sample dataframe
df = pd.DataFrame({"A":["X","Y"],"B":["d","b"],"C":[4,5],"D":[3,2]}, columns=list("ABCD"))
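#the positional axis argument to drop() was removed in pandas 2.0, so name the column explicitly
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)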
print(df)
Sample output
   A  B  C
0  X  d  4
1  X  d  4
2  X  d  4
3  Y  b  5
4  Y  b  5
4 Y b 5
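The same idea can also be written with .loc, which selects the repeated index labels and the wanted columns in one step. This is a sketch using the D values from the question (2 and 2); the variable name flat is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]})

# Repeat each index label D times, pick those rows and only columns A, B, C,
# then renumber the rows from 0
flat = (df.loc[df.index.repeat(df["D"]), ["A", "B", "C"]]
          .reset_index(drop=True))
print(flat)
```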

Cycle through a variable range with another variable

I need to loop over two ranges at the same time, cycling through the shorter one repeatedly until the longer one is exhausted.
For example
import itertools
x = 5 #input by user
y = 8 #input by user
for x_val, y_val in itertools.zip_longest(range(x), range(y), fillvalue='-'):
    print(x_val)
    print(y_val)
Expected output
0
0
1
1
2
2
3
3
4
4
0
5
1
6
2
7
I tried:
x = 5
x_cyc = itertools.cycle(range(x))
y = 8
for x_val, y_val in itertools.zip_longest(range(x), x_cyc):
    print(x_val)
    print(y_val)
but that didn't make much sense.
You don't need zip_longest. Create an infinite cycle over the range of the smaller number and a normal range for the larger number, so the shorter range repeats indefinitely while the longer one stays finite.
A plain zip then stops as soon as the finite range is exhausted.
from itertools import cycle
x = 8
y = 5
min_range = cycle(range(min(x, y)))
max_range = range(max(x, y))
for x_val, y_val in zip(min_range, max_range):
    print(x_val)
    print(y_val)
OUTPUT
0
0
1
1
2
2
3
3
4
4
0
5
1
6
2
7
UPDATE BASED ON COMMENTS
Now x_val and y_val stay bound to the x and y ranges respectively, and whichever of x and y is smaller has its range cycled.
from itertools import cycle
x = 8
y = 5
x_range = range(x)
y_range = range(y)
if x > y:
    y_range = cycle(y_range)
elif y > x:
    x_range = cycle(x_range)
for x_val, y_val in zip(x_range, y_range):
    print(x_val)
    print(y_val)
Note that the output now depends on which of x and y is larger, since x_val is always printed first.
OUTPUT x=2, y=3
0
0
1
1
0
2
OUTPUT x=3 y=2
0
0
1
1
2
0
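For reuse, the updated logic could be wrapped in a small function that returns the pairs instead of printing them. This is only a sketch; the name paired_ranges is hypothetical and not from the answer.

```python
from itertools import cycle

def paired_ranges(x, y):
    """Hypothetical helper: pair range(x) with range(y), cycling the shorter one."""
    x_range, y_range = range(x), range(y)
    if x > y:
        # y is shorter, so repeat it indefinitely
        y_range = cycle(y_range)
    elif y > x:
        # x is shorter, so repeat it indefinitely
        x_range = cycle(x_range)
    # zip stops when the finite (longer) range runs out
    return list(zip(x_range, y_range))

for x_val, y_val in paired_ranges(2, 3):
    print(x_val, y_val)
```

When x equals y neither range is cycled, so the result is simply the two ranges zipped together.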

Pandas Aggregate data other than a specific value in specific column

I have my data in a pandas DataFrame like this:
df = pd.DataFrame({
'ID':range(1, 8),
'Type':list('XXYYZZZ'),
'Value':[2,3,2,9,6,1,4]
})
The output I want keeps every row whose Type is Y as it is, without aggregating it, and aggregates the rows of the other types. How can I generate this with a pandas DataFrame?
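First filter the rows with boolean indexing, aggregate the non-Y rows, add the filtered-out Y rows back, and finally sort:
mask = df['Type'] == 'Y'
# aggregate every type except Y
agg = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID':'first', 'Value':'sum'}))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = pd.concat([df[mask], agg]).sort_values('ID')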
print (df1)
   ID Type  Value
0   1    X      5
2   3    Y      2
3   4    Y      9
1   5    Z     11
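If you want the ID column to run from 1 to the length of the data:
import numpy as np
mask = df['Type'] == 'Y'
agg = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID':'first', 'Value':'sum'}))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = (pd.concat([df[mask], agg])
         .sort_values('ID')
         .assign(ID = lambda x: np.arange(1, len(x) + 1)))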
print (df1)
   ID Type  Value
0   1    X      5
2   2    Y      2
3   3    Y      9
1   4    Z     11
Another idea is to create a helper column that holds a unique value for each Y row (and a shared value for all other rows), then aggregate by both columns:
mask = df['Type'] == 'Y'
df['g'] = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type','g'], as_index=False)
.agg({'ID':'first', 'Value':'sum'})
.drop('g', axis=1)[['ID','Type','Value']])
print (df1)
ID Type Value
0 1 X 5
1 3 Y 2
2 4 Y 9
3 5 Z 11
A similar alternative keeps g as a standalone array instead of a column, so the drop is not necessary:
mask = df['Type'] == 'Y'
g = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type',g], as_index=False)
.agg({'ID':'first', 'Value':'sum'})[['ID','Type','Value']])
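The helper-key idea can also be written without NumPy by using each Y row's own index label as its key. The sketch below is an assumption-laden variant, not the answerer's code: the column name key, the sentinel -1 and the variable names keyed and out are all made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 8),
    "Type": list("XXYYZZZ"),
    "Value": [2, 3, 2, 9, 6, 1, 4],
})

mask = df["Type"] == "Y"

# Y rows keep their own index label as a unique key;
# every other row gets the shared sentinel -1 (hypothetical choice)
keyed = df.assign(key=df.index.to_series().where(mask, -1))

# Group by Type and key, so each Y row forms its own group
out = (keyed.groupby(["Type", "key"], sort=False, as_index=False)
            .agg(ID=("ID", "first"), Value=("Value", "sum"))
            [["ID", "Type", "Value"]])
print(out)
```

The sentinel only has to differ from every real index label, which holds for -1 with a default RangeIndex.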

python3 modifying rows in a dataframe based on a condition

I have a dataframe something like
A B C
1 4 x
2 8 y
3 7 z
4 12 y
5 10 b
I need to modify column B based on conditions, something like:
if B <= 5 then B = 1
if B > 5 and B <= 10 then B = 2
if B > 10 and B < 15 then B = 3
so that my dataframe becomes
A B C
1 1 x
2 2 y
3 2 z
4 3 y
5 2 b
I am okay with adding a new column first and then dropping column B. Could anyone help, please?
You should use the apply function to implement this.
def check(row):
    if (row['B']) <= 5:
        return 1
    elif (row['B'] > 5) and (row['B'] <= 10):
        return 2
    elif (row['B'] > 10) and (row['B'] <= 15):
        return 3
    # any value above 15 falls through and returns None
The call below applies the function to each row, and the checks run inside it:
df['B'] = df.apply(check, axis = 1)
Then the resulting DF would look like:
A B C
1 1 x
2 2 y
3 2 z
4 3 y
5 2 b
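For larger frames, a vectorized binning may be worth considering instead of a row-wise apply. The sketch below uses pd.cut and is an alternative technique, not the answer's method; it assumes the upper edge 15 is inclusive, as in the check function above, and that no value exceeds 15.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [4, 8, 7, 12, 10],
                   "C": list("xyzyb")})

# Right-closed bins: (-inf, 5], (5, 10], (10, 15]
bins = [-np.inf, 5, 10, 15]

# pd.cut returns a categorical; astype(int) assumes every value fell into a bin
df["B"] = pd.cut(df["B"], bins=bins, labels=[1, 2, 3]).astype(int)
print(df)
```

Values above 15 would become missing after the cut, so the integer conversion would fail for them; that case needs an extra bin or explicit handling.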
