Here is my raw data:
Raw Data
Here is the data (including types) after I add the column 'Date_2wks_Ago' in pandas:
I would like to add on a new column 'Rainfall_Last7Days' that calculates, for each day, the total amount of rainfall for the last week.
So (ignoring the other columns that aren't relevant) it would look a little like this...
Ideal Dataset
Anyone know how to do this in Pandas?
My data is about 1000 observations long, so not huge.
I think what you are looking for is the rolling() function.
This section recreates a simplified version of your table.
import pandas as pd
import numpy as np
# Create lists of sample data
rainfall_from_9am = [4.6, 0.4, 3.6, 3.5, 3.2, 5.5, 2.2, 1.3,
                     0, 0, 0.04, 0, 0, 0, 0.04, 0.4]
date = ['2019-02-03', '2019-02-04', '2019-02-05', '2019-02-06',
        '2019-02-07', '2019-02-08', '2019-02-09', '2019-02-10',
        '2019-02-11', '2019-02-12', '2019-02-13', '2019-02-14',
        '2019-02-15', '2019-02-16', '2019-02-17', '2019-02-18']
# Create df from lists
df = pd.DataFrame({'rainfall_from_9am': rainfall_from_9am,
                   'date': date})
This part calculates the rolling sum of rainfall for the current and previous 6 records.
df['rain_last7days']=df['rainfall_from_9am'].rolling(7).sum()
print(df)
Output:
date rainfall_from_9am rain_last7days
0 2019-02-03 4.60 NaN
1 2019-02-04 0.40 NaN
2 2019-02-05 3.60 NaN
3 2019-02-06 3.50 NaN
4 2019-02-07 3.20 NaN
5 2019-02-08 5.50 NaN
6 2019-02-09 2.20 23.00
7 2019-02-10 1.30 19.70
8 2019-02-11 0.00 19.30
9 2019-02-12 0.00 15.70
10 2019-02-13 0.04 12.24
11 2019-02-14 0.00 9.04
12 2019-02-15 0.00 3.54
13 2019-02-16 0.00 1.34
14 2019-02-17 0.04 0.08
15 2019-02-18 0.40 0.48
Note that this output does not exactly match the example in your original question. Can you please verify the exact logic you are after?
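Separately, if your real data can have missing dates, a time-based window may fit better than a fixed 7-row window; assuming the 'date' column parses to datetimes, a minimal sketch:

```python
import pandas as pd

# a shorter stand-in for the table above, with a proper DatetimeIndex
rainfall = [4.6, 0.4, 3.6, 3.5, 3.2, 5.5, 2.2, 1.3]
dates = pd.date_range('2019-02-03', periods=8, freq='D')
df = pd.DataFrame({'rainfall_from_9am': rainfall}, index=dates)

# time-based window: sums whatever rows fall within the trailing 7 calendar
# days, so gaps in the dates are handled correctly
df['rain_last7days'] = df['rainfall_from_9am'].rolling('7D').sum()
print(df)
```

With an offset window like '7D', min_periods defaults to 1, so the early rows get partial sums instead of NaN.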
I have the following dataframe
recycling 1 metric tonne (1000 kilogram) per waste type Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 1 barrel oil is approximately 159 litres of oil NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 material Plastic Glass Ferrous Metal Non-Ferrous Metal Paper
3 energy_saved 5774 Kwh 42 Kwh 642 Kwh 14000 Kwh 4000 kWh
4 crude_oil saved 16 barrels NaN 1.8 barrels 40 barrels 1.7 barrels
For reference look at the image:
What I want to do is get rows 2, 3, 4 into columns in a new dataframe. It should look something like this:
material energy_saved crude_oil saved
plastic 5774Kwh 16 barrels
Glass 42 Kwh NaN
... ... ...
I tried using .melt but it did not work.
If you notice, each column name and its values sit in a single row. I just want them in a new dataframe as a column and its values.
IIUC, is it just:
out = df.loc[[2,3,4],:].T.reset_index(drop=True)
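If you also want the first transposed row ('material', 'energy_saved', 'crude_oil saved') promoted to the header, a small follow-up works. The frame below is a hypothetical reconstruction of part of the one shown above:

```python
import pandas as pd

# hypothetical reconstruction of a slice of the frame in the question
df = pd.DataFrame({
    'recycling': ['1 barrel oil is approximately 159 litres of oil', None,
                  'material', 'energy_saved', 'crude_oil saved'],
    'Unnamed: 1': [None, None, 'Plastic', '5774 Kwh', '16 barrels'],
    'Unnamed: 2': [None, None, 'Glass', '42 Kwh', None],
})

out = df.loc[[2, 3, 4], :].T.reset_index(drop=True)
out.columns = out.iloc[0]               # first transposed row holds the headers
out = out.drop(0).reset_index(drop=True)
print(out)
```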
Hi, the following is my dataset:
MD Incl. Azi.
0 0.00 0.00 350.00
1 161.00 0.00 350.00
2 261.00 0.00 350.00
3 361.00 0.00 350.00
4 461.00 0.00 350.00
I would like to perform a calculation, create a new column, and in this column add each row's value to the previous row's value, and so on.
import pandas as pd
import numpy as np

# open the Excel file ('data.xlsx' is a placeholder; the real path was not shown)
df = pd.read_excel('data.xlsx')
print(df)

# vectorised across all rows, so no explicit loop is needed
incl = np.deg2rad(df['Incl.'])
df['TVD_diff'] = ((df['MD'] - df['MD'].shift()) / 2) * (np.cos(incl).shift() + np.cos(incl))
print(df)
MD Incl. Azi. TVD_diff
0 0.00 0.00 350.00 NaN
1 161.00 0.00 350.00 161.000000
2 261.00 0.00 350.00 100.000000
3 361.00 0.00 350.00 100.000000
4 461.00 0.00 350.00 100.000000
I would like the TVD column to be
TVD
NaN
161
261
361
461
and so on, adding each value to the cumulative value before it.
Use cumsum:
df['TVD_diff'] = df['TVD_diff'].cumsum()
Example:
import pandas as pd
import numpy as np
from io import StringIO
txt = """ MD Incl. Azi.
0 0.00 0.00 350.00
1 161.00 0.00 350.00
2 261.00 0.00 350.00
3 361.00 0.00 350.00
4 461.00 0.00 350.00"""
df = pd.read_csv(StringIO(txt), sep=r'\s\s+', engine='python')
incl = np.deg2rad(df['Incl.'])
df['TVD_diff'] = ((df['MD'] - df['MD'].shift()) / 2) * (np.cos(incl).shift() + np.cos(incl))
df['TVD_diff'] = df['TVD_diff'].cumsum()
print(df)
Output:
MD Incl. Azi. TVD_diff
0 0.0 0.0 350.0 NaN
1 161.0 0.0 350.0 161.0
2 261.0 0.0 350.0 261.0
3 361.0 0.0 350.0 361.0
4 461.0 0.0 350.0 461.0
I parsed a table from a website using Selenium (by xpath), then used pd.read_html on the table element, and now I'm left with what looks like a list that makes up the table. It looks like this:
[Empty DataFrame
Columns: [Symbol, Expiration, Strike, Last, Open, High, Low, Change, Volume]
Index: [], Symbol Expiration Strike Last Open High Low Change Volume
0 XPEV Dec20 12/18/2020 46.5 3.40 3.00 5.05 2.49 1.08 696.0
1 XPEV Dec20 12/18/2020 47.0 3.15 3.10 4.80 2.00 1.02 2359.0
2 XPEV Dec20 12/18/2020 47.5 2.80 2.67 4.50 1.89 0.91 2231.0
3 XPEV Dec20 12/18/2020 48.0 2.51 2.50 4.29 1.66 0.85 3887.0
4 XPEV Dec20 12/18/2020 48.5 2.22 2.34 3.80 1.51 0.72 2862.0
5 XPEV Dec20 12/18/2020 49.0 1.84 2.00 3.55 1.34 0.49 4382.0
6 XPEV Dec20 12/18/2020 50.0 1.36 1.76 3.10 1.02 0.30 14578.0
7 XPEV Dec20 12/18/2020 51.0 1.14 1.26 2.62 0.78 0.31 4429.0
8 XPEV Dec20 12/18/2020 52.0 0.85 0.95 2.20 0.62 0.19 2775.0
9 XPEV Dec20 12/18/2020 53.0 0.63 0.79 1.85 0.50 0.13 1542.0]
How do I turn this into an actual dataframe, with the "Symbol, Expiration, etc..." as the header, and the far left column as the index?
I've been trying several different things, but to no avail. Where I left off was trying:
# From reading the html of the table step
dfs = pd.read_html(table.get_attribute('outerHTML'))
dfs = pd.DataFrame(dfs)
... and when I print the new dfs, I get this:
0 Empty DataFrame
Columns: [Symbol, Expiration, ...
1 Symbol Expiration Strike Last Open ...
Per pandas.read_html docs,
This function will always return a list of DataFrame or it will fail, e.g., it will not return an empty list.
According to your list output, the non-empty dataframe is the second element of that list, so retrieve it by indexing (remember Python indexing starts at zero). Note that you can work with DataFrames while they are stored in lists or dicts.
dfs[1].head()
dfs[1].tail()
dfs[1].describe()
...
single_df = dfs[1].copy()
del dfs
Or index on the same call:
single_df = pd.read_html(...)[1]
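A minimal sketch of the indexing step, with a hypothetical two-element list standing in for pd.read_html's return value (the live page isn't available here):

```python
import pandas as pd

# stand-in for what pd.read_html returned: an empty frame plus the real table
dfs = [pd.DataFrame(columns=['Symbol', 'Strike', 'Last']),
       pd.DataFrame({'Symbol': ['XPEV Dec20', 'XPEV Dec20'],
                     'Strike': [46.5, 47.0],
                     'Last': [3.40, 3.15]})]

df = dfs[1]   # the non-empty DataFrame
# 'Symbol', 'Strike', ... are already the header, and the far-left
# 0..n column is simply the default RangeIndex
print(df)
```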
I have two pandas data-frames, which I wanted to merge. The data-frames have different columns and overlapping indices. I want to merge them, keeping the order of indices intact.
Dataframe (d1)
Dec 16 Dec 15
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN <-- 'all Nan' row
Other Long Term Liabilities 8.37 9.30
Long Term Provisions 13.53 12.74 <-- Not present in d2
Total Non-Current Liabilities 21.90 22.04
CURRENT LIABILITIES NaN NaN <-- 'all Nan' row
Trade Payables 32.49 24.26
Dataframe (d2)
Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN
Deferred Tax Liabilities [Net] 0.00 7.40 <-- Not present in d1
Other Long Term Liabilities 14.13 0.00
Total Non-Current Liabilities 14.13 7.40
CURRENT LIABILITIES NaN NaN
Trade Payables 77.35 60.40
I tried the following ways to merge these data-frames, but none of them worked.
d1.merge(d2, how='left', left_index=True,right_index=True)
d1.merge(d2, how='outer', left_index=True,right_index=True)
pd.merge_ordered(d1,d2,left_on=['Dec 16'],right_on=['Dec 11'])
pd.concat([d1.merge(d2, how='left', left_index=True,right_index=True),d1.merge(d2, how='right', left_index=True,right_index=True)]).drop_duplicates(subset='Dec 16',keep='last')
I am expecting the resulting dataframe to look like this
Dec 16 Dec 15 Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN NaN NaN
Deferred Tax Liabilities [Net] NaN NaN 0.00 7.40 <-- from d2
Other Long Term Liabilities 8.37 9.30 14.13 0.00 <-- d1+d2 merged
Long Term Provisions 13.53 12.74 NaN NaN <-- from d1
Total Non-Current Liabilities 21.90 22.04 14.13 7.40 <-- d1+d2 merged
CURRENT LIABILITIES NaN NaN NaN NaN
Trade Payables 32.49 24.26 77.35 60.40
Note that the overall order matters (e.g all NaN rows need to be in same order), but not the order of merged indices between the 'all NaN' rows. Also the columns of d1 should come prior to d2 columns.
Use how='outer' with merge, then reindex with a custom order:
In [1424]: order_index = ['NON-CURRENT LIABILITIES', 'Deferred Tax Liabilities [Net]',
'Other Long Term Liabilities', 'Long Term Provisions',
'Total Non-Current Liabilities', 'CURRENT LIABILITIES',
'Trade Payables']
In [1425]: df1.merge(df2,how='outer',left_index=True,right_index=True).reindex(order_index)
Out[1425]:
Dec 16 Dec 15 Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN NaN NaN
Deferred Tax Liabilities [Net] NaN NaN 0.00 7.4
Other Long Term Liabilities 8.37 9.30 14.13 0.0
Long Term Provisions 13.53 12.74 NaN NaN
Total Non-Current Liabilities 21.90 22.04 14.13 7.4
CURRENT LIABILITIES NaN NaN NaN NaN
Trade Payables 32.49 24.26 77.35 60.4
Also, join works
In [1426]: df1.join(df2, how='outer').reindex(order_index)
Out[1426]:
Dec 16 Dec 15 Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN NaN NaN
Deferred Tax Liabilities [Net] NaN NaN 0.00 7.4
Other Long Term Liabilities 8.37 9.30 14.13 0.0
Long Term Provisions 13.53 12.74 NaN NaN
Total Non-Current Liabilities 21.90 22.04 14.13 7.4
CURRENT LIABILITIES NaN NaN NaN NaN
Trade Payables 32.49 24.26 77.35 60.4
Details
In [1417]: df1
Out[1417]:
Dec 16 Dec 15
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN
Other Long Term Liabilities 8.37 9.30
Long Term Provisions 13.53 12.74
Total Non-Current Liabilities 21.90 22.04
CURRENT LIABILITIES NaN NaN
Trade Payables 32.49 24.26
In [1418]: df2
Out[1418]:
Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN
Deferred Tax Liabilities [Net] 0.00 7.4
Other Long Term Liabilities 14.13 0.0
Total Non-Current Liabilities 14.13 7.4
CURRENT LIABILITIES NaN NaN
Trade Payables 77.35 60.4
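If you'd rather not type order_index by hand, one option is to union the two indexes without sorting. Be aware this only approximates the order above: labels unique to df2 get appended at the end rather than slotted between sections. A sketch on trimmed-down frames:

```python
import pandas as pd

d1 = pd.DataFrame({'Dec 16': [None, 8.37], 'Dec 15': [None, 9.30]},
                  index=['NON-CURRENT LIABILITIES', 'Other Long Term Liabilities'])
d2 = pd.DataFrame({'Dec 11': [None, 0.00], 'Dec 10': [None, 7.40]},
                  index=['NON-CURRENT LIABILITIES', 'Deferred Tax Liabilities [Net]'])

# d1's order first, then any d2-only labels appended in their own order
order = d1.index.union(d2.index, sort=False)
out = d1.join(d2, how='outer').reindex(order)
print(out)
```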
So I have two dataframes: one is a single dataframe from a dictionary of dataframes, stocks['OPK'], and the other is just a simple pandas dataframe df.
Here is a slice of df, df.loc['2010-01-04':, 'Open'], that I'm interested in comparing with the other dataframe.
Date Open
2010-01-04 1.80
2010-01-05 1.64
2010-01-06 1.90
2010-01-07 1.79
2010-01-08 1.92
2010-01-11 1.90
2010-01-12 1.89
2010-01-13 1.82
2010-01-14 1.84
2010-01-15 1.85
2010-01-19 1.77
This is the other dataframe stocks['OPK'].Open
2010-01-04 1.80
2010-01-05 1.64
2010-01-06 NaN
2010-01-07 1.79
2010-01-08 NaN
2010-01-11 1.90
2010-01-12 1.89
2010-01-13 1.82
2010-01-14 NaN
2010-01-15 1.85
2010-01-19 NaN
As you can see, the second dataframe has missing values.
Since both indexes are in datetime format, I want to be able to compare stocks['OPK'].Open to df.loc['2010-01-04':, 'Open'] and fill in the missing values with the values from the first dataframe, df.
I can do a boolean filter with this code, but I don't know how to proceed from there:
stocks['OPK'].Open == df.loc['2010-01-04':, 'Open']
The problem with pd.merge and its respective options is that it seems to add extra columns. I just want to fill in the missing values (if there are any) through comparison with another dataframe that holds those values.
Thank you.
You can use fillna():
df2 = df2.fillna(df1)
Another, faster way is combine_first:
df2 = df2.combine_first(df1)
Both will return
Date Open
0 2010-01-04 1.80
1 2010-01-05 1.64
2 2010-01-06 1.90
3 2010-01-07 1.79
4 2010-01-08 1.92
5 2010-01-11 1.90
6 2010-01-12 1.89
7 2010-01-13 1.82
8 2010-01-14 1.84
9 2010-01-15 1.85
10 2010-01-19 1.77
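A self-contained check of the combine_first approach (Series standing in for the two Open columns; the names are illustrative):

```python
import pandas as pd

idx = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06'])
complete = pd.Series([1.80, 1.64, 1.90], index=idx, name='Open')    # df
with_gaps = pd.Series([1.80, 1.64, None], index=idx, name='Open')   # stocks['OPK'].Open

# take with_gaps where present, fall back to complete where NaN
filled = with_gaps.combine_first(complete)   # or: with_gaps.fillna(complete)
print(filled)
```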