Collapse/Transpose Columns of a DataFrame Based on Repeating - pandas - python-3.x

I have a data frame sample_df like this,
   id   pd     pd_dt  pd_tp  pd.1   pd_dt.1  pd_tp.1  pd.2   pd_dt.2  pd_tp.2
0   1  100  per year    468   200  per year      400   300  per year      320
1   2  100  per year     60   200  per year      890   300  per year      855
I need my output like this,
id pd pd_dt pd_tp
1 100 per year 468
1 200 per year 400
1 300 per year 320
2 100 per year 60
2 200 per year 890
2 300 per year 855
I tried the following,
sample_df.stack().reset_index().drop('level_1',axis=1)
This does not work.
The columns pd, pd_dt and pd_tp repeat with .1, .2, ... suffixes.
How can I achieve this output?

You want pd.wide_to_long, but with a small tweak, since the first few columns do not share the same pattern as the rest:
# rename
df.columns = [x+'.0' if '.' not in x and x != 'id' else x
for x in df.columns]
pd.wide_to_long(df, stubnames=['pd','pd_dt','pd_tp'],
i='id', j='order', sep='.')
Output:
pd pd_dt pd_tp
id order
1 0 100 per year 468
2 0 100 per year 60
1 1 200 per year 400
2 1 200 per year 890
1 2 300 per year 320
2 2 300 per year 855
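For completeness, here is a self-contained sketch of the same wide_to_long idea that also reshapes the result into the layout asked for in the question. The frame is rebuilt by hand from the sample rows, and the names wide and long_df are only illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the question
sample_df = pd.DataFrame({
    "id": [1, 2],
    "pd": [100, 100], "pd_dt": ["per year", "per year"], "pd_tp": [468, 60],
    "pd.1": [200, 200], "pd_dt.1": ["per year", "per year"], "pd_tp.1": [400, 890],
    "pd.2": [300, 300], "pd_dt.2": ["per year", "per year"], "pd_tp.2": [320, 855],
})

# Give the unsuffixed columns a ".0" suffix so every stub follows the same pattern
wide = sample_df.rename(columns=lambda c: c if c == "id" or "." in c else c + ".0")

long_df = (
    pd.wide_to_long(wide, stubnames=["pd", "pd_dt", "pd_tp"], i="id", j="order", sep=".")
      .sort_index()            # order rows by id, then by suffix number
      .reset_index()           # turn id and order back into columns
      .drop(columns="order")   # the suffix number is not part of the desired output
)
print(long_df)
```

Sorting the index before resetting it keeps the three rows of each id together, which is the row order shown in the question.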

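You can use NumPy's split to cut the values into n arrays and concatenate them back together. Then repeat the id column once per block so it lines up with the rows of the new dataframe.
import numpy as np
import pandas as pd
# split the non-id columns into blocks of 3 and stack the blocks vertically
new_df = pd.DataFrame(np.concatenate(np.split(df.iloc[:,1:].values, (df.shape[1] - 1)//3, axis=1)))
new_df.columns = ['pd','pd_dt','pd_tp']
# repeat the id column once per block (number of blocks = new rows // original rows)
new_df['id'] = pd.concat([df.id] * (new_df.shape[0]//len(df)), ignore_index=True)
new_df.sort_values('id')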
Result:
pd pd_dt pd_tp id
0 100 per year 468 1
2 200 per year 400 1
4 300 per year 320 1
1 100 per year 60 2
3 200 per year 890 2
5 300 per year 855 2

You can do this (it assumes id is the index of df, e.g. after df = df.set_index('id')):
dt_mask=df.columns.str.contains('dt')
tp_mask=df.columns.str.contains('tp')
new_df=pd.DataFrame()
new_df['pd']=df[df.columns[~(dt_mask|tp_mask)]].stack().reset_index(level=1,drop=True)
new_df['pd_dt']=df[df.columns[dt_mask]].stack().reset_index(level=1,drop=True)
new_df['pd_tp']=df[df.columns[tp_mask]].stack().reset_index(level=1,drop=True)
new_df.reset_index(inplace=True)
print(new_df)
id pd pd_dt pd_tp
0 1 100 per_year 468
1 1 200 per_year 400
2 1 300 per_year 320
3 2 100 per_year 60
4 2 200 per_year 890
5 2 300 per_year 855

Related

How to exclude rows from a groupby operation

I am working on a groupby operation using the attribute column but I want to exclude the desc_type 1 and 2 that will be used to calculate total discount inside each attrib.
pd.DataFrame({'ID':[10,10,10,20,30,30],'attribute':['attrib_1','desc_type1','desc_type2','attrib_1','attrib_2','desc_type1'],'value':[100,0,0,100,30,0],'discount':[0,6,2,0,0,13.3]})
output:
ID attribute value discount
10 attrib_1 100 0
10 desc_type1 0 6
10 desc_type2 0 2
20 attrib_1 100 0
30 attrib_2 30 0
30 desc_type1 0 13.3
I want to groupby this dataframe by attribute but excluding the desc_type1 and desc_type2.
The desired output:
attribute ID_count value_sum discount_sum
attrib_1 2 200 8
attrib_2 1 30 13.3
explanations:
attrib_1 has discount_sum=8 because ID 10, which belongs to attrib_1, has two desc_type rows (6 + 2).
attrib_2 has discount_sum=13.3 because ID 30 has one desc_type row.
ID=20 has no discount types.
What I did so far:
df.groupby('attribute').agg({'ID':'count','value':'sum','discount':'sum'})
But the line above does not exclude the desc_type 1 and 2 from the groupby
Important: an ID may have a discount or not.
You can fill the attributes per ID, then groupby.agg:
m = df['attribute'].str.startswith('desc_type')
group = df['attribute'].mask(m).groupby(df['ID']).ffill()
out = (df
.groupby(group, as_index=False)
.agg(**{'ID_count': ('ID', 'nunique'),
'value_sum': ('value', 'sum'),
'discount_sum': ('discount', 'sum')
})
)
output:
ID_count value_sum discount_sum
0 2 200 8.0
1 1 30 13.3
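I think this helps. Note that it filters out the desc_type rows before grouping, so their discounts are not included in the sums: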
df.loc[(df['attribute'] != 'desc_type1') &( df['attribute'] != 'desc_type2')].groupby('attribute').agg({'ID':'count','value':'sum','discount':'sum'})
Output :
ID value discount
attribute
attrib_1 2 200 0.0
attrib_2 1 30 0.0
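A variant of the fill-then-group idea that keeps the attribute label in the result, as in the desired output. This is only a sketch: it assumes each ID's own attribute row comes before its desc_type rows (as in the sample), and the names is_desc and owner are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 10, 10, 20, 30, 30],
    "attribute": ["attrib_1", "desc_type1", "desc_type2", "attrib_1", "attrib_2", "desc_type1"],
    "value": [100, 0, 0, 100, 30, 0],
    "discount": [0, 6, 2, 0, 0, 13.3],
})

# Blank out the desc_type labels, then forward-fill the real attribute within each ID
is_desc = df["attribute"].str.startswith("desc_type")
owner = df["attribute"].mask(is_desc).groupby(df["ID"]).ffill().rename("attribute")

out = (
    df.groupby(owner)
      .agg(ID_count=("ID", "nunique"),
           value_sum=("value", "sum"),
           discount_sum=("discount", "sum"))
      .reset_index()   # bring the attribute label back as a column
)
print(out)
```

Using nunique rather than count for ID_count matters here, because after the fill every desc_type row also carries its owner's attribute.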

Re-formatting a dataframe to show sequence number and time difference after a groupby

I have a pandas dataframe that has an identifier, a sequence number, and a timestamp.
For example:
MyIndex seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10
I want to reformat by showing a sequence number for each index and with the time difference, something like:
MyIndex seq_no timediff
1 1 0
1 2 1
1 3 2
2 1 0
2 2 3
3 1 0
3 2 3
3 3 2
I know I can get the seq_no by doing
df.groupby("MyIndex")["seq_no"].rank(method="first", ascending=True)
but how do I get the time difference? Bonus points if you show me how to do the time difference between steps, or total timediff from the start.
I think the simplest way to get the difference is to convert the timestamp to a single unit. You can then calculate the difference with groupby and shift.
import pandas as pd
from io import StringIO
data = """Index seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
# use cumcount to get new seq_no
df['seq_no_new'] = df.groupby('Index').cumcount() + 1
# can convert timestamp by splitting string
# and then casting to int
time = df['timestamp'].str.split(':', expand=True).astype(int)
df['time'] = time.iloc[:, 0] * 60 + time.iloc[:, 1]
# you then calculate the difference with groupby/shift
# fillna values with 0 and cast to int
df['timediff'] = (df['time'] - df.groupby('Index')['time'].shift(1)).fillna(0).astype(int)
# pick columns you want at the end
df = df.loc[:, ['Index', 'seq_no_new', 'timediff']]
Output
>>>df
Index seq_no_new timediff
0 1 1 0
1 1 2 1
2 1 3 2
3 2 1 0
4 2 2 3
5 3 1 0
6 3 2 3
7 3 3 2
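The question also asks for the total time difference from the start of each group. A minimal sketch of both variants, assuming the timestamps are "H:MM" strings on the same day; the column names step_diff and total_diff are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "MyIndex": [1, 1, 1, 2, 2, 3, 3, 3],
    "seq_no": [181, 182, 183, 184, 185, 186, 187, 188],
    "timestamp": ["7:56", "7:57", "7:59", "8:01", "8:04", "8:05", "8:08", "8:10"],
})

# Convert "H:MM" strings to minutes since midnight
parts = df["timestamp"].str.split(":", expand=True).astype(int)
minutes = parts[0] * 60 + parts[1]

grouped = minutes.groupby(df["MyIndex"])
# Difference to the previous row within the same MyIndex (0 for the first row)
df["step_diff"] = grouped.diff().fillna(0).astype(int)
# Difference to the first row of the same MyIndex
df["total_diff"] = minutes - grouped.transform("first")
print(df)
```

diff() is equivalent to subtracting shift(1) within each group, and transform("first") broadcasts each group's first value back to every row of that group.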

pandas - sort columns and group by a particular field

I have a list of objects
[
{
"companyid": long,
"parentid": long,
"score": long,
...
}
]
The parentid is nothing but the cid of the parent company
Sample data will look something like this
cid parentid score
1 10 1000
2 10 100
3 10 1001
10 10 20
11 100 1000
12 100 100
100 100 200
111 1000 10
112 1000 100
1000 100 2000
I need to sort the values based on the score, but I want to group the values by parentid.
I tried this, which didn't really fit my requirements, since it groups and then sorts:
df.groupby('parentid').apply(lambda x: x.sort_values('score'))
Sorting by score will give this result:
cid parentid score
1000 100 2000
3 10 1001
1 10 1000
11 100 1000
100 100 200
2 10 100
112 1000 100
12 100 100
10 10 20
111 1000 10
Grouping by parentid on the sorted data (which is my end goal), should give this result
cid parentid score
1000 100 2000
11 100 1000 // since 100 is the parentid, it needs to be pushed up in the result set
100 100 200 // if multiple records are pushed up, then sorting should be based on score
12 100 100
3 10 1001 // 2nd group by is based on 10, since 1001 is the next highest score which
1 10 1000 // doesn't belong to the 100 parentid group
2 10 100
10 10 20
112 1000 100
111 1000 10
I am using pandas v0.24.2 and Python 3.7, if it matters.
Try this:
df.sort_values(['parentid', 'score'], ascending=[False, False])
Output:
cid parentid score
8 112 1000 100
7 111 1000 10
9 1000 100 2000
4 11 100 1000
6 100 100 200
5 12 100 100
2 3 10 1001
0 1 10 1000
1 2 10 100
3 10 10 20
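If the groups themselves should be ordered by their highest score, as in the desired output of the question, one possible approach is to attach each group's maximum score as a temporary sort key. This is a sketch rather than a definitive answer: group_max is a helper column invented here, and parentid is used as a tie-breaker so that groups with equal maxima stay together.

```python
import pandas as pd

df = pd.DataFrame({
    "cid": [1, 2, 3, 10, 11, 12, 100, 111, 112, 1000],
    "parentid": [10, 10, 10, 10, 100, 100, 100, 1000, 1000, 100],
    "score": [1000, 100, 1001, 20, 1000, 100, 200, 10, 100, 2000],
})

out = (
    # Broadcast each parentid group's highest score to all of its rows
    df.assign(group_max=df.groupby("parentid")["score"].transform("max"))
      # Groups with the highest maximum first, then rows by score inside each group
      .sort_values(["group_max", "parentid", "score"], ascending=[False, True, False])
      .drop(columns="group_max")
      .reset_index(drop=True)
)
print(out)
```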

How to remove Initial rows in a dataframe in python

I have 4 dataframes with weekly sales values for a year for 4 products. Some of the initial rows are 0 because there were no sales yet. There are also some other 0 values in between the weeks.
I want to remove those initial 0 values while keeping the in-between 0s.
For example
Week Sales(prod 1)
1 0
2 0
3 100
4 120
5 55
6 0
7 60
Week Sales(prod 2)
1 0
2 0
3 0
4 120
5 0
6 30
7 60
I want to remove rows 1 and 2 from the 1st table and rows 1, 2 and 3 from the 2nd.
A few assumptions based on your example dataframe:
- The DataFrame is created using pandas.
- Week always starts at 1.
- Only the leading weeks with 0 sales are removed.
Solution:
Python libraries required:
- pandas, more_itertools
Example DataFrame (df):
Week Sales
1 0
2 0
3 0
4 120
5 0
6 30
7 60
Python Code:
import pandas as pd
import more_itertools as mit

filter_col = 'Sales'
filter_val = 0

## function which returns the index labels to be removed
def return_initial_week_index_with_zero_sales(df, filter_col, filter_val):
    index_wzs = [False]
    # only look for a leading run if the first row matches filter_val
    if df[filter_col].iloc[0] == filter_val:
        index_list = df[df[filter_col] == filter_val].index.tolist()
        # group consecutive week numbers; the first group is the leading run
        index_wzs = [list(group) for group in mit.consecutive_groups(index_list)]
    return index_wzs[0]

## calling the above function and removing those index labels from the dataframe
df = df.set_index('Week')
weeks_to_be_removed = return_initial_week_index_with_zero_sales(df, filter_col, filter_val)
if weeks_to_be_removed:
    print('Initial weeks with 0 sales are {}'.format(weeks_to_be_removed))
    df = df.drop(index=weeks_to_be_removed)
else:
    print('No initial week has 0 sales')
df.reset_index(inplace=True)
Result (df):
Week Sales
4 120
5 0
6 30
7 60
I hope it helps, you can modify the function as per your requirement.
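A pandas-only alternative, sketched on the same example frame: build a boolean mask that becomes True at the first non-zero sale and stays True afterwards. The names keep and trimmed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Week": [1, 2, 3, 4, 5, 6, 7],
                   "Sales": [0, 0, 0, 120, 0, 30, 60]})

# Running count of non-zero sales; it is 0 only for the leading zero rows
keep = df["Sales"].ne(0).cumsum().gt(0)

trimmed = df[keep].reset_index(drop=True)
print(trimmed)
```

Because the running count never drops back to 0, zeros that occur after the first sale (week 5 here) are kept. The same two lines can be applied to each of the four product dataframes.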

Group by multiple columns and calculate the average sum

I have the below dataframe :
Customer Category Month Mon_exp
1 A 1 200
1 A 1 100
1 A 2 150
1 B 2 150
1 B 3 300
2 A 1 300
2 A 1 200
2 A 2 150
2 B 2 150
2 B 3 400
Expected Dataframe :
Customer Category Month Mon_exp Ave_Mon_exp
1 A 1 200 300
1 A 1 100 300
1 A 2 150 300
1 B 2 150 300
1 B 3 300 300
2 A 1 300 400
2 A 1 200 400
2 A 2 150 400
2 B 2 150 400
2 B 3 400 400
Explanation for the new column 'Ave_Mon_exp' :
1) For Each customer, sum the 'Mon_exp' and divide with the count of unique 'Month' value.
For eg. Customer - 1, Sum of 'Mon_exp' is 900 and count of unique 'Month' value is 3. Hence the Ave_Mon_exp is 300.
Can anyone help me to derive the new column 'Ave_Mon_exp' ?
Thanks
import pandas as pd
sample_df = pd.DataFrame({'Customer':[1,1,1,1,1,2,2,2,2,2],'Category':['A','A','A','B','B','A','A','A','B','B'], 'Month': [1,1,2,2,3,1,1,2,2,3], 'Mon_exp': [200, 100, 150, 150, 300,300,200,150,150,400]})
# total spend per customer divided by the number of distinct months for that customer
new_col = sample_df.groupby('Customer')['Mon_exp'].sum() / sample_df.groupby('Customer')['Month'].nunique()
new_col.name = 'Ave_Mon_exp'
# new_col is indexed by Customer, so join it back on that column
sample_df = sample_df.join(new_col, on='Customer')
print(sample_df)
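The same column can also be computed without a join by using groupby transform, which returns a result aligned with the original rows. A short sketch of that variant (by_customer is just a local helper name):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Customer": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Category": ["A", "A", "A", "B", "B", "A", "A", "A", "B", "B"],
    "Month": [1, 1, 2, 2, 3, 1, 1, 2, 2, 3],
    "Mon_exp": [200, 100, 150, 150, 300, 300, 200, 150, 150, 400],
})

by_customer = sample_df.groupby("Customer")
# Per-customer total divided by per-customer count of distinct months,
# both broadcast back to every row of that customer
sample_df["Ave_Mon_exp"] = (
    by_customer["Mon_exp"].transform("sum") / by_customer["Month"].transform("nunique")
)
print(sample_df)
```

For customer 1 this gives 900 / 3 = 300 and for customer 2 it gives 1200 / 3 = 400, matching the expected dataframe.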
