How to sum columns ending with a certain word in a pandas dataframe? - python-3.x

I am trying to sum the columns ending in 'Load' and the columns ending in 'Gen' into two new columns.
My dataframe is:
Date A_Gen A_Load B_Gen B_Load
1-1-2010 30 20 40 30
1-2-2010 45 25 35 25
The result wanted is:
Date A_Gen A_Load B_Gen B_Load S_Gen S_Load
1-1-2010 30 20 40 30 70 50
1-2-2010 45 25 35 25 80 50

Try using filter(like=...) to select the relevant columns, sum along axis=1, and assign your two new columns:
df['S_Gen'], df['S_Load'] = df.filter(like='Gen').sum(axis=1), df.filter(like='Load').sum(axis=1)
Output:
df
Out[146]:
        Date  A_Gen  A_Load  B_Gen  B_Load  S_Gen  S_Load
0 2010-01-01     30      20     40      30     70      50
1 2010-02-01     45      25     35      25     80      50
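For reference, here is a minimal, self-contained sketch of the same approach (the sample frame is rebuilt from the question; regex='Gen$' / 'Load$' anchor the match to the end of the column name, which is safer than like= when other columns merely contain the substring):

import pandas as pd

df = pd.DataFrame({
    'Date': ['1-1-2010', '1-2-2010'],
    'A_Gen': [30, 45], 'A_Load': [20, 25],
    'B_Gen': [40, 35], 'B_Load': [30, 25],
})

# Sum across all columns whose names end in 'Gen' / 'Load'.
df['S_Gen'] = df.filter(regex='Gen$').sum(axis=1)    # 70, 80
df['S_Load'] = df.filter(regex='Load$').sum(axis=1)  # 50, 50
print(df)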

Related

Compare hourly data using python pandas

Currently I have two columns in a dataframe: one is a timestamp and the other is a temperature, received every 5 minutes. So the data looks like:
timestamp temp
2021-03-21 00:02:17 35
2021-03-21 00:07:17 32
2021-03-21 00:12:17 33
2021-03-21 00:17:17 34
...
2021-03-21 00:57:19 33
2021-03-21 01:02:19 30
2021-03-21 01:07:19 31
...
Now, if I want to compare the data on an hourly basis, how can I go ahead? I have tried the df.resample() method, but it just gives one result per hour.
The result which I am expecting is like:
Data at 00:02:17 - 35 and 01:02:19 - 30, so the answer will be 35 - 30 = 5.
For the second one, 00:07:17 - 32 and 01:07:19 - 31, so the answer will be 32 - 31 = 1.
How can I do it dynamically so that it computes the hourly difference?
Any help would be great.
Thanks a lot.
Use:
result_df = df.assign(
    minute_diff=df.sort_values('timestamp', ascending=False)
                  .groupby(pd.to_datetime(df['timestamp']).dt.minute)['temp']
                  .diff()
)
print(result_df)
timestamp temp minute_diff
0 2021-03-21 00:02:17 35 5.0
1 2021-03-21 00:07:17 32 1.0
2 2021-03-21 00:12:17 33 NaN
3 2021-03-21 00:17:17 34 NaN
4 2021-03-21 00:57:19 33 NaN
5 2021-03-21 01:02:19 30 NaN
6 2021-03-21 01:07:19 31 NaN
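For completeness, here is a self-contained sketch of the same idea (sample values taken from the question; it pairs readings by their minute-of-hour, so it assumes each hour's readings land in the same minute):

import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2021-03-21 00:02:17', '2021-03-21 00:07:17',
                  '2021-03-21 00:12:17', '2021-03-21 00:17:17',
                  '2021-03-21 00:57:19', '2021-03-21 01:02:19',
                  '2021-03-21 01:07:19'],
    'temp': [35, 32, 33, 34, 33, 30, 31],
})
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Within each minute-of-hour group, sort newest first and take the
# difference, so each earlier reading gets (its temp - the later temp).
result_df = df.assign(
    minute_diff=df.sort_values('timestamp', ascending=False)
                  .groupby(df['timestamp'].dt.minute)['temp']
                  .diff()
)
print(result_df)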

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns listed by date. What I'm trying to do is find out the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject):
    # Clean up special characters in the column names and make the Date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List files in the data directory with a csv filename.
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code, but I got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in lotimport()['ozlotto'].columns.tolist():
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
import numpy as np

freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
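As an aside (not part of the original answer), the same counts can be obtained, already sorted by frequency, by flattening the frame and using value_counts:

import pandas as pd

# Assumes Date is already the index (as in lotnorm above), so to_numpy() yields only the numbers.
counts = pd.Series(df.to_numpy().ravel()).value_counts()
print(counts.head(10))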
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0)  # for columns
df.max(axis=1)  # for index
OK, so the final answer I came up with was a mix of a few things, including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas Series and append each column to it, ignoring dates to prevent index conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values and turn those values into the index; after that I sort the column by count in descending order and return the top 10 values.
Below is the resulting code; I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject):
    # Clean up special characters in the column names and make the Date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List files in the data directory with a csv filename.
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in lotimport()['ozlotto'].columns.tolist():
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)
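One caveat: Series.append was removed in pandas 2.0, so the loop above fails on current versions. Here is a sketch of the same pipeline using pd.concat and value_counts instead (it assumes the lotimport()/lotnorm helpers above and the 'ozlotto' key):

import pandas as pd

ozlotto = lotimport()['ozlotto']

# Stack every number column into one long Series, dropping the Date index.
lotcomb = pd.concat([ozlotto[col] for col in ozlotto.columns], ignore_index=True)

# value_counts already sorts by frequency in descending order.
lotop = lotcomb.value_counts().rename_axis("Numbers").to_frame("Frequency")
print(lotop.head(10))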

Creating URNs based on a row ID

I have a pandas dataset that has rows with the same Site ID. I want to create a new ID for each row. Currently I have a df like this:
SiteID SomeData1 SomeData2
100001 20 30
100001 20 30
100002 30 40
I am looking to achieve the below output
Output:
SiteID SomeData1 SomeData2 Site_ID2
100001 20 30 1000011
100001 20 30 1000012
100002 30 40 1000021
What would be the best way to achieve this?
Add a helper Series created by GroupBy.cumcount (converted to strings) to the SiteID column:
s = df.groupby(['SomeData1','SomeData2']).cumcount().add(1)
df['Site_ID2'] = df['SiteID'].astype(str).add(s.astype(str))
print(df)
SiteID SomeData1 SomeData2 Site_ID2
0 100001 20 30 1000011
1 100001 20 30 1000012
2 100002 30 40 1000021
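If rows sharing a SiteID could differ in their data columns, it may be more robust to group on SiteID directly. A small variation on the answer above (same idea, different grouping key), assuming the df from the question:

# Counter restarts for every SiteID regardless of the other columns.
s = df.groupby('SiteID').cumcount().add(1)
df['Site_ID2'] = df['SiteID'].astype(str).add(s.astype(str))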

Row subtraction with lambda in a pandas dataframe

I have a dataframe with multiple columns. One of the columns is a cumulative revenue column. If the year has not ended, the revenue will be constant for the rest of the period because the incoming daily revenue is 0.
The dataframe looks like this:
Now I want to create a new column where the previous row's value is subtracted from the current row's value; if the result is 0, the new column gets 0 for that row, otherwise it keeps the row value. The new dataframe should look like this:
My idea was to do this with the apply lambda method. So this is the thinking:
df['2017new'] = df['2017'].apply(lambda x: 0 if row - lastrow == 0 else x)
But I do not know how to write the row - lastrow part of the code. How can I do this? Thanks in advance!
By using np.where:
df2['New']=np.where(df2['2017'].diff().eq(0),0,df2['2017'])
df2
Out[190]:
2016 2017 New
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0
We can shift the data and fill the values based on a condition using np.where, i.e.
df['new'] = np.where(df['2017']-df['2017'].shift(1)==0,0,df['2017'])
or with df.where, i.e.
df['new'] = df['2017'].where(df['2017']-df['2017'].shift(1)!=0,0)
2016 2017 new
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0
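Since the question's dataframe was only shown as an image, here is a self-contained sketch with assumed values matching the output above:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'2016': [10, 15, 70, 90, 93, 99],
                    '2017': [21, 34, 40, 53, 53, 53]})

# Rows where the cumulative revenue did not change get 0; others keep their value.
df2['New'] = np.where(df2['2017'].diff().eq(0), 0, df2['2017'])
print(df2)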

Average formula using number of blank rows above

I'm working on a spreadsheet with logged flows that are not at uniform periods.
I'm looking for a formula for Col G that will average the values in Col A logged over the previous 10 minutes.
Here's the spreadsheet data:
Flow Time min sec sec 10_min Average
187.29 06:10:09 10 9 609
202.90 06:11:21 11 21 681
280.94 06:12:37 12 37 757
218.51 06:13:43 13 43 823
187.29 06:15:13 15 13 913
124.86 06:16:26 16 26 986
109.25 06:18:52 18 52 1132
109.25 06:20:00 20 0 1200 1 177.54
202.90 06:22:30 22 30 1350
265.33 06:23:36 23 36 1416
280.94 06:24:42 24 42 1482
249.73 06:25:58 25 58 1558
218.51 06:27:39 27 39 1659
421.41 06:28:47 28 47 1727
421.41 06:30:00 30 0 1800 1 294.32
Use AVERAGEIFS and construct the criteria with the TEXT function, modifying one criterion by ten minutes.
=AVERAGEIFS(A:A,B:B, TEXT(B9-TIME(0, 10, 0), "\>0.0###############"),B:B, TEXT(B9, "\<\=0.0###############"))
Note that times can also be resolved as decimal numbers, which I have used here. My second average came up slightly different from yours. You may wish to change the \>\= to \>.
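As an aside for anyone doing the same calculation in pandas rather than in the spreadsheet itself, here is a hedged sketch of an equivalent trailing 10-minute average (column names and values assumed from the table above; a time-based rolling window is open on the left and closed on the right by default, matching the > / <= criteria in the formula):

import pandas as pd

flows = pd.DataFrame({
    'Time': pd.to_datetime(['06:10:09', '06:11:21', '06:12:37', '06:13:43',
                            '06:15:13', '06:16:26', '06:18:52', '06:20:00']),
    'Flow': [187.29, 202.90, 280.94, 218.51, 187.29, 124.86, 109.25, 109.25],
})

# Trailing 10-minute mean including the current reading; for offset windows
# rolling defaults to closed='right', i.e. the interval (t-10min, t].
flows['Average'] = flows.rolling('10min', on='Time')['Flow'].mean()
print(flows)  # the 06:20:00 row averages to about 177.54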
