Compare hourly data using python pandas - python-3.x

I currently have a dataframe with two columns: a timestamp and a temperature reading received every 5 minutes. The data looks like:
timestamp temp
2021-03-21 00:02:17 35
2021-03-21 00:07:17 32
2021-03-21 00:12:17 33
2021-03-21 00:17:17 34
...
2021-03-21 00:57:19 33
2021-03-21 01:02:19 30
2021-03-21 01:07:19 31
...
Now, if I want to compare each and every reading on an hourly basis, how can I go ahead? I have tried the df.resample() method, but it just gives one result per hour.
The result which I am expecting is like:
Data at 00:02:17 is 35 and at 01:02:19 is 30, so the answer will be 35 - 30 = 5.
For the second one, 00:07:17 is 32 and 01:07:19 is 31, so the answer will be 32 - 31 = 1.
How can I do this dynamically so that it compares the hourly difference for every reading?
Any help would be great.
Thanks a lot.

Use:
result_df = df.assign(minute_diff=df.sort_values('timestamp', ascending=False)
                                    .groupby(pd.to_datetime(df['timestamp']).dt.minute)['temp']
                                    .diff())
print(result_df)
timestamp temp minute_diff
0 2021-03-21 00:02:17 35 5.0
1 2021-03-21 00:07:17 32 1.0
2 2021-03-21 00:12:17 33 NaN
3 2021-03-21 00:17:17 34 NaN
4 2021-03-21 00:57:19 33 NaN
5 2021-03-21 01:02:19 30 NaN
6 2021-03-21 01:07:19 31 NaN
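The answer above groups readings by their minute-of-hour, which works here because each hour's readings share the same minute values. If the minutes can drift, a possible alternative (just a sketch, not from the original answer; it assumes timestamp can be parsed as a datetime and that readings roughly an hour apart, within a couple of minutes, should be paired) is to shift a copy of the data back one hour and align it with pd.merge_asof:
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values('timestamp')

# copy of the readings shifted back one hour, so each row can be paired
# with the reading taken roughly one hour after it
nxt = df[['timestamp', 'temp']].copy()
nxt['timestamp'] = nxt['timestamp'] - pd.Timedelta(hours=1)
nxt = nxt.rename(columns={'temp': 'temp_next_hour'})

result = pd.merge_asof(df, nxt.sort_values('timestamp'), on='timestamp',
                       direction='nearest', tolerance=pd.Timedelta(minutes=2))
result['hourly_diff'] = result['temp'] - result['temp_next_hour']
print(result)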

Related

Append columns to DataFrame from another DataFrame

Hi everyone!
Can you please help me with the problem below?
I have the first DataFrame, df_1:
key  end
1    10
1    20
2    30
2    40
And the second df_2:
key  time
1    13
1    25
2    35
2    45
I need to add columns from df_1 to df_2 with the condition:
df_1['key'] == df_2['key'] and df_2['time'] > df_1['end']
The final solution should look like:
key  time  end_1  end_2
1    13    10
1    25    10     20
2    35    30
2    45    30     40
I was thinking to solve it like in the example below:
for index_1, row_1 in df_2.iterrows():
    for index_2, row_2 in df_1.iterrows():
        if row_1[0] == row_2[0] and row_1[1] > row_2[1]:
            row_1.append(row_2)
But it doesn't work.
I would appreciate if someone could help me.
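One possible vectorised sketch (my own suggestion, assuming end_1, end_2, ... should follow the row order of df_1 within each key, and that cells with no match may stay empty/NaN): merge the two frames on key, keep only the pairs that satisfy the condition, and pivot the matching end values into wide columns.
import pandas as pd

df_1 = pd.DataFrame({'key': [1, 1, 2, 2], 'end': [10, 20, 30, 40]})
df_2 = pd.DataFrame({'key': [1, 1, 2, 2], 'time': [13, 25, 35, 45]})

# pair every df_2 row with every df_1 row sharing the same key,
# then keep only the pairs where time > end
merged = df_2.reset_index().merge(df_1, on='key')
merged = merged[merged['time'] > merged['end']]

# number the matches per original df_2 row and spread them into end_1, end_2, ...
merged['n'] = merged.groupby('index').cumcount() + 1
wide = merged.pivot(index='index', columns='n', values='end')
wide.columns = [f'end_{c}' for c in wide.columns]

result = df_2.join(wide)
print(result)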

How to sum certain columns ending with a certain word in a pandas dataframe?

I am trying to get the sum of the columns ending in 'Load' and 'Gen' into two new columns.
My dataframe is:
Date A_Gen A_Load B_Gen B_Load
1-1-2010 30 20 40 30
1-2-2010 45 25 35 25
The result wanted is:
Date A_Gen A_Load B_Gen B_Load S_Gen S_Load
1-1-2010 30 20 40 30 70 50
1-2-2010 45 25 35 25 80 50
Try using filter(like='..') to get the relevant columns, sum along axis=1, and return your 2 new columns:
df['S_Gen'], df['S_Load'] = df.filter(like='Gen').sum(axis=1), df.filter(like='Load').sum(axis=1)
Output:
df
Out[146]:
Date A_Gen A_Load B_Gen B_Load S_Gen S_Load
0 2010-01-01 30 20 40 30 70 50
1 2010-02-01 45 25 35 25 80 50
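Note that filter(like='Gen') matches the substring anywhere in the column name. If the columns must strictly end with the word, as the title suggests, a regex anchor may be safer (a small variation on the line above, same assumptions about the frame):
df['S_Gen'] = df.filter(regex='Gen$').sum(axis=1)
df['S_Load'] = df.filter(regex='Load$').sum(axis=1)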

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns, listed by date. What I'm trying to do is find the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject):
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # list files in the data directory with a csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in lotimport()['ozlotto'].columns.tolist():
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
import numpy as np

freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0) for columns, df.max(axis=1) for the index.
Ok so the final answer I came up with was a mix of a few things including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas series and append each column to it ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion of using numpy to get the counts of unique values and turn the values into the index; after that, I sort the column by count in descending order and return the top 10 values.
Below is the resulting code, I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject):
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # list files in the data directory with a csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in lotimport()['ozlotto'].columns.tolist():
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)

freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'], ascending=False).head(10)
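For what it's worth, the same top-10 count can likely be obtained with pandas alone, without the numpy step (a sketch reusing the lotimport() helper above):
lotcomb = lotimport()['ozlotto'].stack()
print(lotcomb.value_counts().head(10))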

Grouping data based on month-year in pandas and then dropping all entries except the latest one - Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year and then keeps only the entry with the latest date in each month-year, dropping the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year. I have not been able to put together proper code that groups the data by month-year and indicator and gets the correct results.
Use Series.dt.to_period for monthly periods, get the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
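An equivalent alternative (a sketch, not from the original answer) sorts by Date and keeps the last row of each monthly period:
df = df.sort_values('Date')
print(df[~df['Date'].dt.to_period('m').duplicated(keep='last')])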

Excel: Average values where column values match

What I am attempting to accomplish is this: where Report ID matches, I need to calculate the average of Value, and then fill the rows with matching Report IDs with the average for that group of Values.
The data essentially looks like this:
Report ID | Report Instance | Value
11111 1 20
11112 1 50
11113 1 40
11113 2 30
11113 3 20
11114 1 40
11115 1 20
11116 1 30
11116 2 40
11117 1 20
The end goal should look like this:
Report ID | Report Instance | Value | Average
11111 1 20 20
11112 1 50 50
11113 1 40 30
11113 2 30 30
11113 3 20 30
11114 1 40 40
11115 1 20 20
11116 1 30 35
11116 2 40 35
11117 1 20 20
I have tried using AVERAGE(IF()), INDEX(MATCH()), VLOOKUP(MATCH()) and similar combinations of functions, but I haven't had much luck in getting my final output. I'm relatively new to using arrays in Excel, and I don't have a strong grasp of how they're evaluated just yet, so any help is much appreciated.
Keep it simple :-)
Why not use =SUMIF(...)/COUNTIF(...)?
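For example (an illustration based on the sample above, assuming Report ID is in column A, Value in column C, and the ten data rows sit in rows 2 to 11), this formula in D2, filled down, should produce the Average column:
=SUMIF($A$2:$A$11, A2, $C$2:$C$11)/COUNTIF($A$2:$A$11, A2)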
