pandas groupby week and rolling - python-3.x

I have a dataframe like this:
dfx = pd.DataFrame({"name": ["bag", "bag", "bag", "phone", "phone", "phone"],
                    "date": ["2022-11-14 00:00:00", "2022-11-21 00:00:00", "2022-11-28 00:00:00",
                             "2022-11-14 00:00:00", "2022-11-21 00:00:00", "2022-11-28 00:00:00"],
                    "view": [80, 90, 100, 200, 400, 450]})
'''
    name                 date  view
0    bag  2022-11-14 00:00:00    80
1    bag  2022-11-21 00:00:00    90
2    bag  2022-11-28 00:00:00   100
3  phone  2022-11-14 00:00:00   200
4  phone  2022-11-21 00:00:00   400
5  phone  2022-11-28 00:00:00   450
'''
I would like to group by name and week and get rolling views by week. Expected output:
name   one_week_view  two_weeks_view  three_weeks_view
bag               80             170               270
phone            200             600              1050

Given you already have weekly data, you could use a pivot and cumsum:
(dfx.pivot(index='name', columns='date', values='view')
    .cumsum(axis=1)
    .set_axis(['one_week_view', 'two_weeks_view', 'three_weeks_view'], axis=1)
)
Output:
       one_week_view  two_weeks_view  three_weeks_view
name
bag               80             170               270
phone            200             600              1050
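If the number of weeks can vary, a variant that avoids hard-coding the output labels is to pivot on a computed week counter. A minimal sketch, assuming one row per name per week and sortable dates (derived labels such as 1_weeks_view are illustrative, not the exact headers above):
out = (dfx.sort_values('date')
          .assign(week=lambda d: d.groupby('name').cumcount() + 1)
          .pivot(index='name', columns='week', values='view')
          .cumsum(axis=1)  # running total across the week columns
          .add_suffix('_weeks_view'))
print(out)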


How to create this year_month sales and previous year_month sales in two different columns?

I need to create two different columns from transaction-level data: one for the current month's sales and one for the previous month's sales.
Data format:
Date       | bill amount
2019-07-22 | 500
2019-07-25 | 200
2020-11-15 | 100
2020-11-06 | 900
2020-12-09 | 50
2020-12-21 | 600
Required format:
Year_month | This month Sales | Prev month sales
2019_07    | 700              | -
2020_11    | 1000             | -
2020_12    | 650              | 1000
The relatively tricky bit is figuring out what the previous month is. We do it by computing the beginning of the month for each date and then rolling back by one month. Note that this takes care of the January -> December-of-the-previous-year wraparound.
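As a quick sanity check of that wraparound (plain dateutil behaviour, not part of the solution itself):
from datetime import datetime
from dateutil.relativedelta import relativedelta

# rolling back one month across the year boundary also decrements the year
print(datetime(2020, 1, 1) + relativedelta(months=-1))  # 2019-12-01 00:00:00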
We start by creating a sample dataframe and importing some useful modules
import pandas as pd
from io import StringIO
from datetime import datetime
from dateutil.relativedelta import relativedelta
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
""")
df = pd.read_csv(data,sep='|')
df['date'] = pd.to_datetime(df['date'])
df
we get
        date  amount
0 2019-07-22     500
1 2019-07-25     200
2 2020-11-15     100
3 2020-11-06     900
4 2020-12-09      50
5 2020-12-21     600
Then we figure out the month start and the previous month start using datetime utilities
df['month_start'] = df['date'].apply(lambda d: datetime(year=d.year, month=d.month, day=1))
df['prev_month_start'] = df['month_start'].apply(lambda d: d + relativedelta(months=-1))
Then we summarize monthly sales using groupby on month start
ms_df = (df.drop(columns='date')
           .groupby('month_start')
           .agg({'prev_month_start': 'first', 'amount': 'sum'})
           .reset_index())
ms_df
so we get
  month_start prev_month_start  amount
0  2019-07-01       2019-06-01     700
1  2020-11-01       2020-10-01    1000
2  2020-12-01       2020-11-01     650
Then we join (merge) ms_df on itself by mapping 'prev_month_start' to 'month_start'
ms_df2 = ms_df.merge(ms_df, left_on='prev_month_start', right_on='month_start',
                     how='left', suffixes=('', '_prev'))
We are more or less there; now we make it pretty by getting rid of superfluous columns, adding labels, etc.
ms_df2['label'] = ms_df2['month_start'].dt.strftime('%Y_%m')
ms_df2 = ms_df2.drop(columns=['month_start', 'prev_month_start',
                              'month_start_prev', 'prev_month_start_prev'])
columns = ['label', 'amount', 'amount_prev']
ms_df2 = ms_df2[columns]
and we get
| | label | amount | amount_prev |
|---:|--------:|---------:|--------------:|
| 0 | 2019_07 | 700 | nan |
| 1 | 2020_11 | 1000 | nan |
| 2 | 2020_12 | 650 | 1000 |
Using @piterbarg's data, we can use resample combined with shift and concat to get your desired data:
import pandas as pd
from io import StringIO
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
"""
)
df = pd.read_csv(data, sep="|", parse_dates=["date"])
df
        date  amount
0 2019-07-22     500
1 2019-07-25     200
2 2020-11-15     100
3 2020-11-06     900
4 2020-12-09      50
5 2020-12-21     600
Get the sum for the current month's sales:
data = df.resample(on="date", rule="1M").amount.sum().rename("This_month")
data
date
2019-07-31     700
2019-08-31       0
2019-09-30       0
2019-10-31       0
2019-11-30       0
2019-12-31       0
2020-01-31       0
2020-02-29       0
2020-03-31       0
2020-04-30       0
2020-05-31       0
2020-06-30       0
2020-07-31       0
2020-08-31       0
2020-09-30       0
2020-10-31       0
2020-11-30    1000
2020-12-31     650
Freq: M, Name: This_month, dtype: int64
Now we can shift to get the previous month's values, and drop the rows that have 0 as total sales to get your final output:
(pd.concat([data, data.shift().rename("previous_month")], axis=1)
   .query("This_month != 0")
   .fillna(0))
            This_month  previous_month
date
2019-07-31         700             0.0
2020-11-30        1000             0.0
2020-12-31         650          1000.0
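A side note: pandas 2.2 and later deprecate the plain 'M' alias in favour of 'ME' for month-end frequencies, so if the resample call above raises a FutureWarning, the same monthly sum can be written as:
# month-end alias for newer pandas (>= 2.2)
data = df.resample(on="date", rule="ME").amount.sum().rename("This_month")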

Collapse/Transpose Columns of a DataFrame Based on Repeating - pandas

I have a data frame sample_df like this:
   id   pd     pd_dt  pd_tp  pd.1   pd_dt.1  pd_tp.1  pd.2   pd_dt.2  pd_tp.2
0   1  100  per year    468   200  per year      400   300  per year      320
1   2  100  per year     60   200  per year      890   300  per year      855
I need my output like this:
id   pd     pd_dt  pd_tp
 1  100  per year    468
 1  200  per year    400
 1  300  per year    320
 2  100  per year     60
 2  200  per year    890
 2  300  per year    855
I tried the following:
sample_df.stack().reset_index().drop('level_1', axis=1)
This does not work. The pd, pd_dt, pd_tp columns repeat with .1, .2, ... suffixes. How can I achieve this output?
You want pd.wide_to_long, but with a tweak, since your first few columns do not share the same pattern as the rest:
# give the un-suffixed first group a '.0' suffix so every stub follows the same pattern
df.columns = [x + '.0' if '.' not in x and x != 'id' else x
              for x in df.columns]

pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                i='id', j='order', sep='.')
Output:
            pd     pd_dt  pd_tp
id order
1  0       100  per year    468
2  0       100  per year     60
1  1       200  per year    400
2  1       200  per year    890
1  2       300  per year    320
2  2       300  per year    855
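If you want the rows ordered as in the expected output, you can sort the long result afterwards; a small follow-up sketch (out is just a local name):
out = (pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                       i='id', j='order', sep='.')
         .reset_index()
         .sort_values(['id', 'order'])
         .drop(columns='order'))
print(out)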
You can use numpy split to split it into n arrays and concatenate them back together, then repeat the id column by the number of rows in your new dataframe.
import numpy as np

# split the value columns into groups of three, stack them, then repeat the ids
new_df = pd.DataFrame(np.concatenate(np.split(df.iloc[:, 1:].values,
                                              (df.shape[1] - 1) // 3, axis=1)))
new_df.columns = ['pd', 'pd_dt', 'pd_tp']
new_df['id'] = pd.concat([df.id] * (new_df.shape[0] // df.shape[0]), ignore_index=True)
new_df.sort_values('id')
Result:
    pd     pd_dt  pd_tp  id
0  100  per year    468   1
2  200  per year    400   1
4  300  per year    320   1
1  100  per year     60   2
3  200  per year    890   2
5  300  per year    855   2
You can do this:
# use id as the index so only the value columns are stacked
df = df.set_index('id')

dt_mask = df.columns.str.contains('dt')
tp_mask = df.columns.str.contains('tp')

new_df = pd.DataFrame()
new_df['pd'] = df[df.columns[~(dt_mask | tp_mask)]].stack().reset_index(level=1, drop=True)
new_df['pd_dt'] = df[df.columns[dt_mask]].stack().reset_index(level=1, drop=True)
new_df['pd_tp'] = df[df.columns[tp_mask]].stack().reset_index(level=1, drop=True)
new_df.reset_index(inplace=True)
print(new_df)
   id   pd     pd_dt  pd_tp
0   1  100  per year    468
1   1  200  per year    400
2   1  300  per year    320
3   2  100  per year     60
4   2  200  per year    890
5   2  300  per year    855

difference between two columns of a dataframe

I am new to Python and would like to find the difference between two columns of a dataframe.
What I want is the difference between two columns, along with a corresponding third column. For example, I have a dataframe Soccer which contains the list of all the teams playing soccer, with the goals for and against their club. I want the goal difference along with the team name, i.e. GoalsDiff = GoalsFor - GoalsAgainst.
   Pos             Team  Seasons  Points  GamesPlayed  GamesWon  GamesDrawn  \
0    1      Real Madrid       86    5656         2600      1647         552
1    2        Barcelona       86    5435         2500      1581         573
2    3  Atletico Madrid       80    5111         2614      1241         598

   GamesLost  GoalsFor  GoalsAgainst
0        563      5947          3140
1        608      5900          3114
2        775      4534          3309
I tried creating a function and then iterating through each row of the dataframe, as below:
for index, row in football.iterrows():
    # pdb.set_trace()
    goalsFor = row['GoalsFor']
    goalsAgainst = row['GoalsAgainst']
    teamName = row['Team']
    if not total:
        total = np.array(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
    else:
        total = total.append(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
return total

def Goal_diff_count_Formal(gFor, gAgainst, team):
    goalsDifference = gFor - gAgainst
    return [team, goalsDifference]
However, I would like to know if there is a quicker way to get this, something like:
dataframe['goalsFor'] - dataframe['goalsAgainst']  # along with the team name in the dataframe
Solution if the Team column has unique values: create an index from Team, take the difference, and select teams by index:
df = df.set_index('Team')
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Real Madrid        2807
Barcelona          2786
Atletico Madrid    1225
dtype: int64
print (s['Atletico Madrid'])
1225
Solution if there may be duplicated values in the Team column:
I believe you need to group by Team and aggregate with sum first, and then take the difference:
# sample data changed so that the Team in the third row repeats Real Madrid
print (df)
   Pos         Team  Seasons  Points  GamesPlayed  GamesWon  GamesDrawn  \
0    1  Real Madrid       86    5656         2600      1647         552
1    2    Barcelona       86    5435         2500      1581         573
2    3  Real Madrid       80    5111         2614      1241         598

   GamesLost  GoalsFor  GoalsAgainst
0        563      5947          3140
1        608      5900          3114
2        775      4534          3309
df = df.groupby('Team')[['GoalsFor', 'GoalsAgainst']].sum()
df['diff'] = df['GoalsFor'] - df['GoalsAgainst']
print (df)
             GoalsFor  GoalsAgainst  diff
Team
Barcelona        5900          3114  2786
Real Madrid     10481          6449  4032
EDIT:
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Barcelona      2786
Real Madrid    4032
dtype: int64
print (s['Barcelona'])
2786
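As an aside, if Team stays a regular column, a one-line sketch with assign gives the same differences without touching the index (GoalsDiff is an illustrative name):
out = df.assign(GoalsDiff=df['GoalsFor'] - df['GoalsAgainst'])[['Team', 'GoalsDiff']]
print(out)
Here df is the original frame, before any set_index or groupby.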

how to pass pandas series element to another dataframe

I want to check whether an error occurred.
I have these two dataframes, built from Excel files.
Log_frame is a dataframe of log files, reporting recorded data and errors:
    Time  Voltage[V]  Freq[Hz]  Speed  Motor_Stt  ErrNo
0  10:00         220        50     30          1      0
1  10:10         220        50     30          1      0
2  10:20         220        50      0          2   3601
3  10:30         220        47      0          1   1500
4  10:40         250        50      0          1   7707
5  10:50         220        50      0          2   3601
6  11:00         220        50      0          2   3601
7  11:10         220        47      0          1   1500
8  11:20         220        50     30          1      0
9  11:30         220        50     30          1      0
Dev_frame is the dataframe of error descriptions:
   Fehler-Nr.          Descr                 Cause
0        1500    Chk_Voltage  Voltage out of range
1        7707      Chk_Freq.    Freq. out of range
2        3601  Chk_Motor_Stt           Motor_defec
3        7704    switch_trip         chk_over_curr
From Log_frame I can check whether, which, and how many errors occurred during a day by:
Err_log = Log_frame['ErrNo']
p = Err_log[Err_log != 0].drop_duplicates(keep='first').reset_index(drop=True)
and the result is a pandas Series:
<class 'pandas.core.series.Series'>
0    3601
1    1500
2    7707
I can "pass" first error (or second and all the other) by this:
Dev_Err = Dev_frame['Fehler-Nr.']
n = Dev_Err[Dev_Err == p.iloc[0]] #or 1, 2 and so on
I was wondering how to loop through p.iloc[i]. Should I use a for loop, or can this be done with a pandas function?
EDIT: e.g. if I put 1 in p.iloc[] I get:
0    1500
and if 2:
1    7707
No need to create a loop to check each value; you can use the isin method that pandas.DataFrame provides, as follows:
n = dev_frame[dev_frame['Fehler-Nr.'].isin(p)]['Fehler-Nr.']
which is going to return:
0    1500
1    7707
2    3601
Name: Fehler-Nr., dtype: int64
Ref: pandas.DataFrame.isin
If you're using pandas and reaching for for loops, you are usually doing it wrong: use pandas' vectorised operations instead, e.g. (simple example) df.apply(some_function, axis=...).
I'm not 100% convinced I understood what you're trying to achieve, but I believe you just want to merge/join the number of errors for a given error. If so, pandas.join() and pandas.merge() are there to help. Check the docs.
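For instance, a minimal sketch of the merge idea, assuming the Log_frame and Dev_frame shown above (the count column name is illustrative):
# count occurrences of each non-zero error, then attach the description
counts = (Log_frame.loc[Log_frame['ErrNo'] != 0, 'ErrNo']
                   .value_counts()
                   .rename_axis('ErrNo')
                   .reset_index(name='count'))
described = counts.merge(Dev_frame, left_on='ErrNo', right_on='Fehler-Nr.', how='left')
print(described[['ErrNo', 'count', 'Descr', 'Cause']])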

Excel VLOOKUP gives wrong value

I have a VLOOKUP cell which gives me the wrong value.
This is the table:
      PID  Product  Price  User  User name  Deal On  Amount After
in   1001  table     1001     1  Milly      No               1000
in   1001  table      100    13  Vernetta   Yes               900
out  1001  table       50    14  Mireya     No                900
out  1001  table      100    15  Rosanne    Yes              1000
out  1001  table      101    16  Belinda    No               1000
in   1001  table      200     1  Milly      Yes               800
in   1234  chair      300     2  Charlena   Yes               500
in   1234  chair      100     3  Mina       Yes               400
in   1234  chair       50     4  Sabina     Yes               350
in   8231  couch       20     5  Joni       Yes               330
in   1001  table      150     6  Armando    Yes               180
in   1001  table      100     7  Noemi      Yes                80
in   8231  couch       40     8  Ashlie     Yes                40
in   8231  couch       30     9  Ann        Yes                10
out  1001  table      201    10  Angelina   Yes               211
out  1234  chair      300    11  Melvina    Yes               511
out  8231  couch       21    12  Mattie     Yes               532
The product column is a VLOOKUP with the following formula
VLOOKUP(B6,$L$2:$M$10, 2)
B is the PID column. The lookup table in L2:M10 is the following:
PID   Product
1001  table
1234  chair
8231  desk
2311  closet
9182  book_shelf
1822  bed
1938  coffee_book_table
2229  couch
Now, as you can see, PID 8231 is a desk, but it appears as a couch. Can you see what the problem is?
The crux of your problem is the way you've written the formula: you forgot the last parameter, FALSE (or 0), which requests an EXACT match. Your formula should look like this:
VLOOKUP(B6, $L$2:$M$10, 2, FALSE)
OR
VLOOKUP(B6, $L$2:$M$10, 2, 0)
Both do the same thing.
The default for that parameter is TRUE, which looks for the closest (approximate) match; you only want that when the column you're looking up against is sorted.
