Reshape a pandas DataFrame using a combination of row values in two columns

I have data for multiple customers in a data frame as below:
Customer_id  event_type  month  mins_spent
1            live        CM     10
1            live        CM1    10
1            catchup     CM2    20
1            live        CM2    30
2            live        CM     45
2            live        CM1    30
2            catchup     CM2    20
2            live        CM2    20
I need the result data frame to have one row per customer, with columns formed by combining the values of the month and event_type columns, and mins_spent as the values. The result data frame should look like this:
Customer_id  CM_live  CM_catchup  CM1_live  CM1_catchup  CM2_live  CM2_catchup
1            10       0           10        0            30        20
2            45       0           30        0            20        20
Is there an efficient way to do this instead of iterating over the input data frame and building the new one row by row?

You can use pivot_table:
import numpy as np

# pivot your data frame
p = df.pivot_table(values='mins_spent', index='Customer_id',
                   columns=['month', 'event_type'], aggfunc=np.sum)
# flatten multi indexed columns with list comprehension
p.columns = ['_'.join(col) for col in p.columns]
             CM_live  CM1_live  CM2_catchup  CM2_live
Customer_id
1                 10        10           20        30
2                 45        30           20        20
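Note that pivot_table drops month/event_type combinations that never occur in the data, which is why the all-zero CM_catchup and CM1_catchup columns from the expected output are missing here. A sketch of how to keep and zero-fill them, assuming your pandas version supports dropna on pivot_table:
# dropna=False keeps combinations with no data; fill_value turns their NaN into 0
p = df.pivot_table(values='mins_spent', index='Customer_id',
                   columns=['month', 'event_type'],
                   aggfunc='sum', fill_value=0, dropna=False)
p.columns = ['_'.join(col) for col in p.columns]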

You can create a new column (key) by concatenating the month and event_type columns, and then use pivot() to reshape your data:
(df.assign(key=lambda d: d['month'] + '_' + d['event_type'])
   .pivot(index='Customer_id', columns='key', values='mins_spent'))
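One caveat: pivot performs no aggregation and raises a ValueError if the same (Customer_id, key) pair occurs more than once. A duplicate-safe sketch of the same idea, aggregating with groupby before reshaping:
(df.assign(key=lambda d: d['month'] + '_' + d['event_type'])
   .groupby(['Customer_id', 'key'])['mins_spent'].sum()
   .unstack(fill_value=0))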

Related

Want to calculate the count of pass instances of a data set using Python pandas

import pandas as pd

x = []
y1 = []
r1 = len(df)
L1 = len(df.columns)
for i in range(r1):
    ll = df.loc[i, 'LL']
    ul = df.loc[i, 'UL']
    count1 = 0
    for j in range(5, L1):
        # treat non-numeric cells as 0 before testing the limits
        if isinstance(df.iloc[i, j], str):
            df.iloc[i, j] = 0
        if ll <= df.iloc[i, j] <= ul:
            count1 = count1 + 1
    if count1 == (L1 - 5):
        x.append('Pass')
    else:
        x.append('Fail')
    y1.append(count1)
se = pd.Series(x)
se1 = pd.Series(y1)
# summary statistics must be computed before they are assigned below
min1 = df.iloc[:, 5:].min(axis=1)
mean1 = df.iloc[:, 5:].astype(float).mean(axis=1, skipna=True)
median1 = df.iloc[:, 5:].astype(float).median(axis=1, skipna=True)
max1 = df.iloc[:, 5:].max(axis=1)
df['Min'] = min1.values
df['Mean'] = mean1.values
df['Median'] = median1.values
df['Max'] = max1.values
df['Pass Count'] = se1.values
df['Result'] = se.values
yield1 = []
for i in range(len(se1)):
    yd1 = (se1[i] / (L1 - 3)) * 100
    yield1.append(yd1)
se2 = pd.Series(yield1)
df['Yield'] = se2.values
df1 = df.loc[:, ['PARAMETER', 'Min', 'Mean', 'Median', 'Max', 'Result', 'Pass Count', 'Yield']]
df1
Below is my data set; it is daily sensor data. Each day's reading should be within the Lower Limit (LL) and Upper Limit (UL). I want to count how many days each sensor's data is within LL and UL.
I have not been able to calculate this number of days with pandas. How can I do it?
Take a few key ideas:
- build a list of the columns that go into the calculation (daycols)
- transpose those columns and test them against the limits, which gives a boolean array
- sum that boolean array and you have your desired count
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""sensor location,LL,UL,day1,day2,day3,day4,day5,day6,day7,number of days sensor data within LL and UL
A,1,10,12,6,9,4,9,7,15,5
B,1,12,4,15,7,1,11,1,7,6
C,1,15,13,13,13,10,7,13,13,7
D,1,10,12,1,14,12,15,4,4,3
E,1,20,11,15,8,14,1,14,14,7"""))

daycols = [d for d in df.columns if "day" in d and "number" not in d]
df = df.assign(
    # True counts as 1, so summing the boolean array gives the answer
    daysBetween=lambda dfa: ((dfa.loc[:, daycols].T >= dfa["LL"]) &
                             (dfa.loc[:, daycols].T <= dfa["UL"])).sum()
)
print(df.to_string(index=False))
output
sensor location LL UL day1 day2 day3 day4 day5 day6 day7 number of days sensor data within LL and UL daysBetween
A 1 10 12 6 9 4 9 7 15 5 5
B 1 12 4 15 7 1 11 1 7 6 6
C 1 15 13 13 13 10 7 13 13 7 7
D 1 10 12 1 14 12 15 4 4 3 3
E 1 20 11 15 8 14 1 14 14 7 7
Speed up
If you have many columns, you can use slicing to identify them and turn them into integer positions so that iloc can be used. With row-wise ge/le comparisons, the transpose is also unnecessary.
dayi = [df.columns.get_loc(c) for c in df.columns[3:-1]]
df = df.assign(
    # align the comparison row-wise (axis=0) and count True values per row
    daysBetween=lambda dfa: (dfa.iloc[:, dayi].ge(dfa["LL"], axis=0) &
                             dfa.iloc[:, dayi].le(dfa["UL"], axis=0)).sum(axis=1)
)
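The same row-wise comparison can also replace the asker's Pass/Fail loop entirely. A minimal sketch, assuming (as in the question's code) that the data columns start at position 5 and that non-numeric cells should fail the test:
import pandas as pd

# coerce the data columns to numbers; strings become NaN and fail both comparisons
vals = df.iloc[:, 5:].apply(pd.to_numeric, errors='coerce')
in_range = vals.ge(df['LL'], axis=0) & vals.le(df['UL'], axis=0)
df['Pass Count'] = in_range.sum(axis=1)
df['Result'] = (df['Pass Count'] == vals.shape[1]).map({True: 'Pass', False: 'Fail'})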

Create multiple DataFrames using a loop & function

I have a df with over 1M rows, similar to this:
ID  Date    Amount
x   May 1   10
y   May 2   20
z   May 4   30
x   May 1   40
y   May 1   50
z   May 2   60
x   May 1   70
y   May 5   80
a   May 6   90
b   May 8   100
x   May 10  110
I have to sort the data based on the date and then create new dataframes depending on how many times a value is present in the Amount column. So if x has made a purchase 3 times, I need it in 3 different dataframes. The first_purchase dataframe would have every ID that has purchased even once, irrespective of date or amount.
If an ID purchases 3 times, I need that ID to be in first purchase, then second, and then third, with Date and Amount.
Doing it manually is easy with:
df = df.sort_values('Date')
first_purchase = df.drop_duplicates('ID')
after_1stpurchase = df[~df.index.isin(first_purchase.index)]
The second data frame would be created with:
after_1stpurchase = after_1stpurchase.sort_values('Date')
second_purchase = after_1stpurchase.drop_duplicates('ID')
after_2ndpurchase = after_1stpurchase[~after_1stpurchase.index.isin(second_purchase.index)]
How do I create the loop that provides me with each of these dataframes?
IIUC, I was able to achieve what you wanted.
import pandas as pd
import numpy as np

# source data for the dataframe
data = {
    "ID": ["x","y","z","x","y","z","x","y","a","b","x"],
    "Date": ["May 01","May 02","May 04","May 01","May 01","May 02","May 01","May 05","May 06","May 08","May 10"],
    "Amount": [10,20,30,40,50,60,70,80,90,100,110]
}
df = pd.DataFrame(data)
# convert the Date column to datetime while keeping the "May 01" format
df['Date'] = pd.to_datetime(df['Date'], format='%b %d').dt.strftime('%b %d')
# sort the values on ID and Date
df.sort_values(by=['ID', 'Date'], inplace=True)
df.reset_index(inplace=True, drop=True)
print(df)
Original Dataframe:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 40 May 01 x
4 70 May 01 x
5 110 May 10 x
6 50 May 01 y
7 20 May 02 y
8 80 May 05 y
9 60 May 02 z
10 30 May 04 z
# create a list of unique ids
list_id = sorted(set(df['ID']))
# create an empty list that will hold the dataframes
df_list = []
# count of iterations that must be separated out:
# for example, if we want to record 3 entries for each id,
# iter would be 3. This will create three new dataframes
# that will hold the transactions respectively.
iter = 3
for i in range(iter):
    df_list.append(pd.DataFrame())
for val in list_id:
    tmp_df = df.loc[df['ID'] == val].reset_index(drop=True)
    # consider only the top iter(=3) values to be distributed
    counter = np.minimum(tmp_df.shape[0], iter)
    for idx in range(counter):
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        df_list[idx] = pd.concat([df_list[idx], tmp_df.loc[tmp_df.index == idx]])
for df in df_list:
    df.reset_index(drop=True, inplace=True)
    print(df)
Transaction #1:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 50 May 01 y
4 60 May 02 z
Transaction #2:
Amount Date ID
0 40 May 01 x
1 20 May 02 y
2 30 May 04 z
Transaction #3:
Amount Date ID
0 70 May 01 x
1 80 May 05 y
Note that in your data there are four transactions for 'x'. If, say, you wanted to track the 4th transaction as well, all you need to do is change the value of iter to 4, and you will get a fourth dataframe with the following value:
Amount Date ID
0 110 May 10 x
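As an alternative sketch (not the approach above), groupby.cumcount can rank each ID's purchases directly and avoid the inner loops; here purchases[0] plays the role of first_purchase:
# rank each ID's purchases in date order: 0 = first, 1 = second, ...
df = df.sort_values('Date')
rank = df.groupby('ID').cumcount()
# one dataframe per purchase number
purchases = {n: g for n, g in df.groupby(rank)}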

Maximum for each column, return value of other for max, create new dataframe of returns

I hope the title is not misleading.
I need to go from this dataframe:
    Column_1      Columns_2  First  Second  Third
0  Element_1  to_be_ignored     10       5     77
1  Element_2  to_be_ignored     30      30     11
2  Element_3  to_be_ignored     60       7      3
3  Element_4  to_be_ignored     20      87     90
to:
  New_Column New_Column_1  Max
0  Element_3        First   60
1  Element_4       Second   87
2  Element_4        Third   90
What I need to do:
- get the maximum value of every column
- get the corresponding value of Column_1 for that maximum
- transform the result into a new dataframe
What I got so far:
data = {'Column_1': ['Element_1', 'Element_2', 'Element_3', 'Element_4'],
        'Columns_2': ['to_be_ignored', 'to_be_ignored', 'to_be_ignored', 'to_be_ignored'],
        'First': [10, 30, 60, 20], 'Second': [5, 30, 7, 87], 'Third': [77, 11, 3, 90]}
df = pd.DataFrame(data)
df.loc[df.iloc[:, 2:].idxmax(), ['Column_1']]
So I am able to get the index position and value for the maximum in each column:
2 Element_3
3 Element_4
3 Element_4
Unfortunately I can't figure out the rest. Thanks!
IIUC, melt then sort_values + drop_duplicates:
(df.melt(['Column_1', 'Columns_2'])
   .sort_values('value')
   .drop_duplicates(['variable'], keep='last'))
     Column_1      Columns_2 variable  value
2   Element_3  to_be_ignored    First     60
7   Element_4  to_be_ignored   Second     87
11  Element_4  to_be_ignored    Third     90
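If you would rather finish the idxmax approach you started, a sketch that builds the expected frame from it (New_Column and New_Column_1 taken from your desired output):
# index of the row holding each column's maximum
idx = df.iloc[:, 2:].idxmax()
out = pd.DataFrame({
    'New_Column': df.loc[idx, 'Column_1'].values,
    'New_Column_1': idx.index,
    'Max': df.iloc[:, 2:].max().values
})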

How to get the median of different intervals of a DataFrame based on label name? [duplicate]

This question already has answers here:
How to groupby consecutive values in pandas DataFrame
(4 answers)
Closed 3 years ago.
So I have a DataFrame with two columns, one with label names (df['Labels']) and the other with int values (df['Volume']).
df = pd.DataFrame({
    'Labels': ['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
    'Volume': [10,40,20,20,50,60,40,50,50,60,10,10,10,10,20,20,10,20,80,90,90,80,100]
})
I would like to identify the intervals where my labels change and then calculate the median of the 'Volume' column for each of these intervals. Afterwards I should replace every value of 'Volume' with the respective median of its interval.
In the case of label A, I would like a separate median for each of its two intervals.
Here is what my DataFrame should look like:
df2 = pd.DataFrame({
    'Labels': ['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
    'Volume': [20,20,20,20,50,50,50,50,50,50,10,10,10,10,10,10,10,10,90,90,90,90,90]
})
You want to groupby the blocks and transform median:
blocks = df['Labels'].ne(df['Labels'].shift()).cumsum()
df['group_median'] = df['Volume'].groupby(blocks).transform('median')
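To see why this works, blocks increments every time the label changes, so each consecutive run of equal labels gets its own group id:
print(blocks.tolist())
# [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]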
Use Series.shift + Series.cumsum to create the groups for groupby, and then use transform:
df['Volume'] = df.groupby(df['Labels'].ne(df['Labels'].shift()).cumsum())['Volume'].transform('median')
print(df)
Labels Volume
0 A 20
1 A 20
2 A 20
3 A 20
4 B 50
5 B 50
6 B 50
7 B 50
8 B 50
9 B 50
10 A 10
11 A 10
12 A 10
13 A 10
14 A 10
15 A 10
16 A 10
17 A 10
18 C 90
19 C 90
20 C 90
21 C 90
22 C 90

How to take the values in a column as the columns of a DataFrame in pandas

My current DataFrame is:
      Term  value
Name
A        1     35
A        2     40
A        3     50
B        1     20
B        2     45
B        3     50
I want to get a dataframe as:
Term   1   2   3
Name
A     35  40  50
B     20  45  50
How can I get it? I've tried using pivot_table but didn't get my expected output. Is there any way to achieve it?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = df.reset_index().pivot(index='Name', columns='Term', values='value')
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicates in Name/Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
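For example, with a hypothetical duplicated (A, 1) pair, the groupby version aggregates instead of raising:
# the duplicate (A, 1) rows sum to 35 + 5 = 40 in the output
dup = pd.DataFrame({'Name': ['A', 'A', 'B'],
                    'Term': [1, 1, 1],
                    'value': [35, 5, 20]}).set_index('Name')
print(dup.groupby(['Name', 'Term'])['value'].sum().unstack(fill_value=0))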
