duplicate rows and swap columns based on condition - python-3.x

I have the following table:
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 2 0 AB 105
2 name2 1 2 BB 110
3 name3 1 0 AN 120
and I want to do the following: for every row whose type does not consist of the same two letters, duplicate the row, swap the a0 and a1 values, and reverse the letters of the type column. So my result will be:
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 0 1 BA 100
2 name1 2 0 AB 105
3 name1 0 2 BA 105
4 name2 1 2 BB 110
5 name3 1 0 AN 120
6 name3 0 1 NA 120
Note that the same name can appear with the same type but different a0 and a1 (and hence val); for example, name1 appears with type AB in the first two lines of the original table.
I tried:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['name1', 'name1', 'name2', 'name3'],
                    'a0': [1, 2, 1, 1], 'a1': [0, 0, 2, 0],
                    'type': ['AB', 'AB', 'BB', 'AN'],
                    'val': [100, 105, 110, 120]})
for idx in df1.index:
    a1 = df1.loc[idx, 'a0']
    a0 = df1.loc[idx, 'a1']
    val = df1.loc[idx, 'val']
    name = df1.loc[idx, 'name']
    if df1.loc[idx, 'type'] == 'AB':
        new_type = 'BA'
    elif df1.loc[idx, 'type'] == 'AN':
        new_type = 'NA'
    row = pd.DataFrame({'name': name, 'a0': a0, 'a1': a1,
                        'type': new_type, 'val': val}, index=np.arange(idx))
    df1 = df1.append(row, ignore_index=False)
df1 = df1.sort_index().reset_index(drop=True)
but it gives me:
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 2 0 BA 105
2 name1 0 2 BA 105
3 name1 2 0 BA 105
4 name1 0 2 BA 105
5 name1 2 0 BA 105
6 name1 0 2 BA 105
7 name1 2 0 AB 105
8 name2 1 2 BB 110
9 name3 1 0 AN 120

First create a mask to identify values with two different letters, create a new DataFrame with DataFrame.assign, swap the values in the a0/a1 columns, join everything together, sort by index, and finally create default index values:
mask = df['type'].apply(set).str.len() == 2
df1 = df[mask].assign(type=lambda x: x['type'].str[1] + x['type'].str[0])
df1[['a0','a1']] = df1[['a1','a0']].to_numpy()
# for pandas versions below 0.24 use .values instead:
# df1[['a0','a1']] = df1[['a1','a0']].values
df = pd.concat([df, df1]).sort_index().reset_index(drop=True)
print(df)
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 0 1 BA 100
2 name1 2 0 AB 105
3 name1 0 2 BA 105
4 name2 1 2 BB 110
5 name3 1 0 AN 120
6 name3 0 1 NA 120
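A side note on the swap in this answer: assigning df1[['a1','a0']] back directly would align on column labels and change nothing, which is why the raw array (.to_numpy()) is needed. A minimal sketch illustrating the difference on a hypothetical toy frame:
import pandas as pd

t = pd.DataFrame({'a0': [1, 2], 'a1': [0, 0]})
t[['a0','a1']] = t[['a1','a0']]             # aligns on column labels: no change
t[['a0','a1']] = t[['a1','a0']].to_numpy()  # raw array: values actually swap
print(t)
   a0  a1
0   0   1
1   0   2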

You can use:
def myfunc(x):
    x['type'] = x['type'][::-1]
    x[['a0','a1']] = x[['a1','a0']].values
    return x

m = df.type.apply(set).str.len().gt(1)
pd.concat([df, df.loc[m].apply(myfunc, axis=1)], ignore_index=True).sort_values(['name','val'])
name a0 a1 type val
0 name1 1 0 AB 100
4 name1 0 1 BA 100
1 name1 2 0 AB 105
5 name1 0 2 BA 105
2 name2 1 2 BB 110
3 name3 1 0 AN 120
6 name3 0 1 NA 120


Grouping several dataframe columns based on another columns values

I have this dataframe:
refid col2 price1 factor1 price2 factor2 price3 factor3
0 1 a 200 1 180 3 150 10
1 2 b 500 1 450 3 400 10
2 3 c 700 1 620 2 550 5
And I need to get this output:
refid col2 price factor
0 1 a 200 1
1 1 b 500 1
2 1 c 700 1
3 2 a 180 3
4 2 b 450 3
5 2 c 620 2
6 3 a 150 10
7 3 b 400 10
8 3 c 550 5
Right now I'm trying to use the df.melt method, but I can't get it to work. This is the code and the current result:
df2_melt = df2.melt(id_vars=["refid","col2"],
                    value_vars=["price1","price2","price3",
                                "factor1","factor2","factor3"],
                    var_name="Price",
                    value_name="factor")
refid col2 Price factor
0 1 a price1 200
1 2 b price1 500
2 3 c price1 700
3 1 a price2 180
4 2 b price2 450
5 3 c price2 620
6 1 a price3 150
7 2 b price3 400
8 3 c price3 550
9 1 a factor1 1
10 2 b factor1 1
11 3 c factor1 1
12 1 a factor2 3
13 2 b factor2 3
14 3 c factor2 2
15 1 a factor3 10
16 2 b factor3 10
17 3 c factor3 5
Since you have a wide DataFrame with common prefixes, you can use wide_to_long:
out = pd.wide_to_long(df, stubnames=['price','factor'],
                      i=["refid","col2"], j='num').droplevel(-1).reset_index()
Output:
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
Note that your expected output has an error where factors don't align with refids.
You can melt twice and then concat the results:
import pandas as pd

df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
                   'price1': [200, 500, 700], 'factor1': [1, 1, 1],
                   'price2': [180, 450, 620], 'factor2': [3, 3, 2],
                   'price3': [150, 400, 550], 'factor3': [10, 10, 5]})

prices = [c for c in df if c.startswith('price')]
factors = [c for c in df if c.startswith('factor')]
df1 = pd.melt(df, id_vars=["refid","col2"], value_vars=prices, value_name='price').drop('variable', axis=1)
df2 = pd.melt(df, id_vars=["refid","col2"], value_vars=factors, value_name='factor').drop('variable', axis=1)
df3 = pd.concat([df1, df2['factor']], axis=1).reset_index(drop=True)
print(df3)
Here is the output:
refid col2 price factor
0 1 a 200 1
1 2 b 500 1
2 3 c 700 1
3 1 a 180 3
4 2 b 450 3
5 3 c 620 2
6 1 a 150 10
7 2 b 400 10
8 3 c 550 5
One option is pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
(df
 .pivot_longer(
     index=['refid', 'col2'],
     names_to='.value',
     names_pattern=r"(.+)\d",
     sort_by_appearance=True)
)
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
The idea for this particular reshape is that whatever group in the regular expression is paired with the .value stays as the column header.
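For reference, a rough pure-pandas sketch of the same .value reshape, assuming the column names follow the price1/factor1 pattern: split the wide columns into a (stub, number) MultiIndex, then stack the number level:
tmp = df.set_index(['refid', 'col2'])
# ('price1', ...) -> ('price', '1'): strip the trailing digit into its own level
tmp.columns = pd.MultiIndex.from_tuples(
    [(c.rstrip('0123456789'), c[-1]) for c in tmp.columns])
out = (tmp.stack(level=1)
          .reset_index(level=2, drop=True)
          .reset_index()[['refid', 'col2', 'price', 'factor']])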

Checking if the value is there in any specified column of the same table

I want to check whether the values in one column of a particular row are present in another column of the same row.
df:
sno  id1  id2  id3
1    1,2  7    1,2,7,22
2    2    8,9  2,8,9,15,17
3    1,5  6    1,5,6,17,33
4    4    NaN  4,12,18
5    NaN  9    9,14
The scoring logic I want, in pseudocode, for each row:
for i in sno:
    if id1 in id3:
        score = 50
    elif id2 in id3:
        score = 50
    if id1 in id3 and id2 in id3:
        score = 75
I finally want a score column out of that logic.
You can convert all values to sets with split and then compare them with issubset; bool(a) is used to omit empty sets (created from missing values):
print(df)
sno id1 id2 id3
0 1 1,2 7 1,20,70,22
1 2 2 8,9 2,8,9,15,17
2 3 1,5 6 1,5,6,17,33
3 4 4 NaN 4,12,18
4 5 NaN 9 9,14
import numpy as np

def convert(x):
    return set(x.split(',')) if isinstance(x, str) else set()

cols = ['id1', 'id2', 'id3']
df1 = df[cols].applymap(convert)

m1 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id1'], df1['id3'])])
m2 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id2'], df1['id3'])])

df['new'] = np.select([m1 & m2, m1 | m2], [75, 50], np.nan)
print(df)
sno id1 id2 id3 new
0 1 1,2 7 1,20,70,22 NaN
1 2 2 8,9 2,8,9,15,17 75.0
2 3 1,5 6 1,5,6,17,33 75.0
3 4 4 NaN 4,12,18 50.0
4 5 NaN 9 9,14 50.0
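One detail worth noting about the np.select call: conditions are evaluated in order and the first match wins, so the stricter m1 & m2 must come before m1 | m2. A small sketch with hypothetical mask values:
import numpy as np

m1 = np.array([True, True, False])
m2 = np.array([True, False, False])
print(np.select([m1 & m2, m1 | m2], [75, 50], np.nan))  # [75. 50. nan]
print(np.select([m1 | m2, m1 & m2], [50, 75], np.nan))  # [50. 50. nan] - the 75 branch is unreachable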

Taking all duplicate values in column as single value in pandas

My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe as:
  Name  term  Grade
0    A     1     35
           2     40
1    B     1     50
           2     45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated to create a boolean mask, combined with numpy.where:
mask = df['Name'].duplicated()
# more general alternative:
# mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print(df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
The difference between the two masks can be seen with a modified DataFrame:
print(df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If there are multiple separate consecutive groups with the same name (like the two A groups here), the general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print(df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 A 4 43
5 3 46
Whereas the simple duplicated mask blanks every repeated name, even when it starts a new consecutive group:
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print(df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 4 43
5 3 46
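For completeness, the same blanking can be done in place with .loc instead of numpy.where; a minimal sketch with the simple mask (substitute the general mask as needed):
df.loc[df['Name'].duplicated(), 'Name'] = ''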

Subset and Loop to create a new column [duplicate]

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1 25
2 45
Name: values
How can I get
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
Does anyone have a simpler way to do it?
This is not as direct, but I found the use of map to create new columns from another column very intuitive, and it can be applied to many other cases:
gb = df.groupby('A').sum()['values']

def getvalue(x):
    return gb[x]

df['sum'] = df['A'].map(getvalue)
df
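Since Series.map also accepts another Series as a lookup table (values are matched against its index), the helper function can be dropped; a shorter sketch of the same idea:
df['sum'] = df['A'].map(df.groupby('A')['values'].sum())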
In [15]: def sum_col(df, col, new_col):
   ....:     df[new_col] = df[col].sum()
   ....:     return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
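Of the approaches above, transform is the most direct: it broadcasts the per-group aggregate back to the shape of the original frame in one step, so the string alias works too:
df['sum_values_A'] = df.groupby('A')['values'].transform('sum')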

Sum of next n rows in python

I have a dataframe which is grouped at the product, store, day_id level. Say it looks like the below; I need to create a column with a rolling sum:
prod store day_id visits
111 123 1 2
111 123 2 3
111 123 3 1
111 123 4 0
111 123 5 1
111 123 6 0
111 123 7 1
111 123 8 1
111 123 9 2
I need to create a dataframe as below:
prod store day_id visits rolling_4_sum cond
111 123 1 2 6 1
111 123 2 3 5 1
111 123 3 1 2 1
111 123 4 0 2 1
111 123 5 1 4 0
111 123 6 0 4 0
111 123 7 1 NA 0
111 123 8 1 NA 0
111 123 9 2 NA 0
I am looking to create a cond column that recursively checks a condition: say, if rolling_4_sum is greater than 5, mark the next 4 rows as 1; otherwise do nothing, i.e. even if the condition is not met, retain what was already filled in before. Do this check for each row up to the 7th row.
How can I achieve this using Python? I am trying
d1['rolling_4_sum'] = d1.groupby(['prod', 'store']).visits.rolling(4).sum()
but I am getting an error.
The rolling sums can be formed with the rolling method, using a boxcar window:
df['rolling_4_sum'] = df.visits.rolling(4, win_type='boxcar', center=True).sum().shift(-2)
The shift by -2 is because you apparently want the sums to be placed at the left edge of the window.
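A plain trailing-window rolling sum shifted to the left edge gives the same forward-looking result without the scipy-backed boxcar window; and for the per-group version the question attempts, transform keeps the result aligned with the original index (groupby().rolling() instead prepends the group keys to the index, which is the usual cause of the assignment error). A hedged sketch, with column and group names taken from the question:
# same forward-looking 4-row sum, without win_type
df['rolling_4_sum'] = df.visits.rolling(4).sum().shift(-3)

# per (prod, store) group, aligned with df's index
df['rolling_4_sum'] = df.groupby(['prod', 'store']).visits.transform(
    lambda s: s.rolling(4).sum().shift(-3))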
Next, the condition on the rolling sums (here, whether they are less than 7):
df['cond'] = 0
for k in range(1, 4):
    df.loc[df.rolling_4_sum.shift(k) < 7, 'cond'] = 1
A new column is inserted and filled with 0; then for each k = 1, 2, 3, we look k steps back: if the rolling sum there is less than 7, the condition is set to 1.
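The backward-looking loop can also be vectorized; a small sketch producing the same cond column (NaN comparisons evaluate to False, so incomplete windows are ignored):
import pandas as pd

# stack the three shifted rolling sums side by side, then test each row
shifted = pd.concat([df.rolling_4_sum.shift(k) for k in range(1, 4)], axis=1)
df['cond'] = (shifted < 7).any(axis=1).astype(int)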
