Modelling a moving window with a shift() function in Python - python-3.x

Problem: Let's suppose that we supply robots to a factory. Each of these robots is programmed to switch into work mode after 3 days (e.g. if it arrives on day 1, it starts working on day 4), and then it works for 5 days. After that, the battery runs out and it stops working. The number of robots supplied each day varies.
The following code sets up the supplies for the first 15 days:
import pandas as pd

df = pd.DataFrame({
    'date': ['01', '02', '03', '04', '05', '06', '07', '08',
             '09', '10', '11', '12', '13', '14', '15'],
    'value': [10, 20, 20, 30, 20, 10, 30, 20, 10, 20, 30, 40, 20, 20, 20]
})
df.set_index('date', inplace=True)
df
Let's now estimate the number of working robots on each of these days like so (we move two days back and sum up only the numbers within the past 5 days):
04 10
05 20+10 = 30
06 20+20 = 40
07 30+20 = 50
08 20+30 = 50
09 10+20 = 30
10 30+10 = 40
11 20+30 = 50
12 10+20 = 30
13 20+10 = 30
14 30+20 = 50
15 40+30 = 70
Is it possible to model this in Python? I have tried the following; it is not quite right, but close:
df_p = (((df.rolling(2)).sum())).shift(5).rolling(1).mean().shift(-3)
P.S. If you don't think it's complicated enough: for my real problem I also need to include the trailing 7-day average for each of these numbers.

Let's try first shifting the series forward by the window (5) less the rolling window length (2), then taking a rolling sum with min_periods set to 1:
shift_window = 5
rolling_window = 2
df['new_col'] = (
    df['value'].shift(shift_window - rolling_window)
               .rolling(rolling_window, min_periods=1).sum()
)
Or with hard-coded values:
df['new_col'] = df['value'].shift(3).rolling(2, min_periods=1).sum()
df:
value new_col
date
01 10 NaN
02 20 NaN
03 20 NaN
04 30 10.0
05 20 30.0
06 10 40.0
07 30 50.0
08 20 50.0
09 10 30.0
10 20 40.0
11 30 50.0
12 40 30.0
13 20 30.0
14 20 50.0
15 20 70.0
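For the 7-day average mentioned in the P.S., here is a minimal sketch (assuming the average is a trailing 7-day mean of the estimated counts in new_col; the column name new_col_7d_avg is just an illustration):
# Trailing 7-day mean of the working-robot estimate; partial windows allowed at the start.
df['new_col_7d_avg'] = df['new_col'].rolling(7, min_periods=1).mean()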

Related

Compare hourly data using python pandas

Currently I have two columns in a dataframe: one is a timestamp and the other is a temperature reading received every 5 minutes. The data looks like:
timestamp temp
2021-03-21 00:02:17 35
2021-03-21 00:07:17 32
2021-03-21 00:12:17 33
2021-03-21 00:17:17 34
...
2021-03-21 00:57:19 33
2021-03-21 01:02:19 30
2021-03-21 01:07:19 31
...
Now, if I want to compare the data on an hourly basis, how can I go ahead? I have tried the df.resample() method, but it just gives one result per hour.
The result I am expecting is like:
Data at 00:02:17 is 35 and at 01:02:19 is 30, so the answer will be 35 - 30 = 5.
For the second one, 00:07:17 is 32 and 01:07:19 is 31, so the answer will be 32 - 31 = 1.
How can I do it dynamically so that it computes the hourly difference?
Any help would be great. Thanks a lot.
Use:
result_df = df.assign(
    minute_diff=df.sort_values('timestamp', ascending=False)
                  .groupby(pd.to_datetime(df['timestamp']).dt.minute)['temp']
                  .diff()
)
print(result_df)
timestamp temp minute_diff
0 2021-03-21 00:02:17 35 5.0
1 2021-03-21 00:07:17 32 1.0
2 2021-03-21 00:12:17 33 NaN
3 2021-03-21 00:17:17 34 NaN
4 2021-03-21 00:57:19 33 NaN
5 2021-03-21 01:02:19 30 NaN
6 2021-03-21 01:07:19 31 NaN
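Note that this pairs rows by their minute value, so it assumes the readings keep the same minute from one hour to the next. As an alternative sketch (not from the answer above), pd.merge_asof can pair each reading with the one closest to an hour earlier; the two-minute tolerance and the column name temp_hour_ago are assumptions for illustration:
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values('timestamp')

# Pretend each reading happened an hour later, then match it against the current rows.
hour_ago = df.rename(columns={'temp': 'temp_hour_ago'}).assign(
    timestamp=lambda d: d['timestamp'] + pd.Timedelta(hours=1)
)

paired = pd.merge_asof(df, hour_ago, on='timestamp',
                       direction='nearest',
                       tolerance=pd.Timedelta(minutes=2))
# Earlier reading minus the current one, e.g. 35 - 30 = 5 for the 01:02:19 row.
paired['hourly_diff'] = paired['temp_hour_ago'] - paired['temp']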

"Max value day" of the week and tallying up each day that was highest Python

I was able to get the highest value of the week. Now, I need to figure out which day of the week it was so I can tally up how many times a certain day of the week is the highest.
For example,
Day of the week that has highest value of that week
Mon:5
Tue:2
Wed:3
Thur:2
Fri:1
This is what my dataframe looked like before I parsed the information that I needed.
Date Weekdays Week Open Close
0 2019-06-26 Wednesday 26 208.279999 208.509995
1 2019-06-27 Thursday 26 208.970001 212.020004
2 2019-06-28 Friday 26 213.000000 213.169998
3 2019-07-01 Monday 27 214.250000 214.619995
4 2019-07-02 Tuesday 27 214.380005 214.539993
.. ... ... ... ... ...
500 2021-06-21 Monday 25 275.619995 277.100006
501 2021-06-22 Tuesday 25 277.570007 276.920013
502 2021-06-23 Wednesday 25 276.890015 274.660004
503 2021-06-24 Thursday 25 275.000000 275.489990
504 2021-06-25 Friday 25 276.369995 278.380005
[505 rows x 5 columns]
Now I was able to get the highest value of the week, but I want to get the day and tally which days were the highest.
#Tally up the highest days of the week at OPEN
new_data.groupby(pd.Grouper('Week')).Open.max()
The result was
Week
26 213.000000
27 215.130005
28 215.210007
29 214.440002
30 208.369995
31 210.000000
32 204.199997
33 214.740005
34 210.050003
35 217.509995
36 222.000000
37 220.539993
38 220.279999
39 214.000000
40 214.300003
41 215.880005
42 216.740005
43 212.429993
44 213.550003
45 222.809998
46 228.500000
47 233.570007
48 233.919998
49 231.190002
50 231.259995
51 227.679993
52 226.860001
1 233.539993
2 234.789993
3 235.220001
4 233.000000
5 236.979996
6 241.429993
7 244.729996
8 248.070007
9 251.080002
10 264.220001
11 260.309998
12 252.750000
13 259.940002
14 264.220001
15 270.470001
16 272.299988
17 276.290009
18 289.970001
19 292.350006
20 290.200012
21 290.190002
22 292.910004
23 292.559998
24 286.660004
25 277.570007
53 230.500000
Name: Open, dtype: float64
I got you. We wrap the groupby in df.loc, then select the indexes for the max values of Open in each group. Finally just take the value_counts of the Weekdays.
df.loc[df.groupby(["Week"]).Open.idxmax()].Weekdays.value_counts()
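Broken into steps, the same idea looks like this (the intermediate names are only for illustration):
max_idx = df.groupby(["Week"]).Open.idxmax()        # index label of each week's highest Open
weekly_max_rows = df.loc[max_idx]                    # the full rows for those maxima
tally = weekly_max_rows["Weekdays"].value_counts()   # how often each weekday had the weekly high
print(tally)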

How to do mean centering on a dataframe in pandas?

I want to use two transformation techniques on a data frame: mean centering and standardization. How can I perform mean centering on my dataframe?
I have performed standardization using StandardScaler from sklearn.preprocessing:
from sklearn.preprocessing import StandardScaler
standard.iloc[:, 1:-1] = StandardScaler().fit_transform(standard.iloc[:, 1:-1])
I am expecting a transformed data frame that is mean-centered.
import pandas as pd

dataxx = {'Name': ['Tom', 'gik', 'Tom', 'Tom', 'Terry', 'Jerry', 'Abel', 'Dula', 'Abel'],
          'Age': [20, 21, 19, 18, 88, 89, 95, 96, 97],
          'gg': [1, 1, 1, 30, 30, 30, 40, 40, 40]}
dfxx = pd.DataFrame(dataxx)
dfxx["meancentered"] = dfxx.Age - dfxx.Age.mean()
dfxx
    Name  Age  gg  meancentered
0    Tom   20   1    -40.333333
1    gik   21   1    -39.333333
2    Tom   19   1    -41.333333
3    Tom   18  30    -42.333333
4  Terry   88  30     27.666667
5  Jerry   89  30     28.666667
6   Abel   95  40     34.666667
7   Dula   96  40     35.666667
8   Abel   97  40     36.666667
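To mean-center several columns at once, a minimal sketch on the example frame (for the question's data the same idea applies to standard.iloc[:, 1:-1]):
cols = ['Age', 'gg']                         # the numeric columns in the example
# Subtracting the column means (a Series) broadcasts across the selected columns.
dfxx[cols] = dfxx[cols] - dfxx[cols].mean()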

How to join a series into a dataframe

So I counted the frequency of the column 'address' from the dataframe 'df_two' and saved the data as a dict, then used that dict to create the series 'new_series'. Now I want to join this series into a dataframe, making 'df_three', so that I can do some maths with the column 'new_count' from 'new_series' and the column 'number' from 'df_two'.
I have tried to use merge / concat, but the items of 'new_count' were changed to NaN.
What I got (NaN):
df_three
number address name new_Count
14 12 ab pra NaN
49 03 cd ken NaN
97 07 ef dhi NaN
91 10 fg rav NaN
Input:
new_series
new_count
12 ab 8778
03 cd 6499
07 ef 5923
10 fg 5631
df_two
number address name
14 12 ab pra
49 03 cd ken
97 07 ef dhi
91 10 fg rav
Output:
df_three
number address name new_Count
14 12 ab pra 8778
49 03 cd ken 6499
97 07 ef dhi 5923
91 10 fg rav 5631
It seems you forgot the on parameter:
df = df_two.join(new_series, on='address')
print (df)
number address name new_count
0 14 12 ab pra 8778
1 49 03 cd ken 6499
2 97 07 ef dhi 5923
3 91 10 fg rav 5631
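As an alternative sketch (assuming new_series is indexed by the address values), Series.map gives the same column without a join:
df_two['new_count'] = df_two['address'].map(new_series)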

How can I delete duplicates across 3 columns using two criteria (the first two columns)?

This is my data set:
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
2 2018 6 62 47 18
3 2018 6 62 47 18
4 2018 6 62 47 18
In the last three columns there is already the sum for the year and week. I need to get rid of duplicates so that the table contains unique values (for the example above):
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
4 2018 6 62 47 18
I tried to group the data, but it somehow works incorrectly and does what I need for only one column.
df.groupby(['Year created', 'Week created']).size()
And output:
Year created Week created
2017 48 2
49 25
50 54
51 36
52 1
2018 1 17
2 50
3 37
But it is just one column, and I don't know which one, because even if I separate the data into three parts and do the same procedure for each part, I get the same result (as above) for all.
I believe you need drop_duplicates:
df = df.drop_duplicates(['Year created', 'Week created'])
print (df)
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
df2 = df.drop_duplicates(['Year created', 'Week created', 'SUM_New', 'SUM_Closed'])
print(df2)
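If you want to keep the last row of each duplicate group instead of the first (index 4 in your expected output), pass keep='last':
df = df.drop_duplicates(['Year created', 'Week created'], keep='last')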
hope this helps.
