Comparing two similar dataframes and filling in missing values of one dataframe - python-3.x

So I have two dataframes: one is a single dataframe taken from a dictionary of dataframes, stocks['OPK'], and the other is just a plain Pandas dataframe, df.
Here is the slice of df, df.loc['2010-01-04':, 'Open'], that I'm interested in comparing with the other dataframe.
Date Open
2010-01-04 1.80
2010-01-05 1.64
2010-01-06 1.90
2010-01-07 1.79
2010-01-08 1.92
2010-01-11 1.90
2010-01-12 1.89
2010-01-13 1.82
2010-01-14 1.84
2010-01-15 1.85
2010-01-19 1.77
This is the other dataframe, stocks['OPK'].Open:
2010-01-04 1.80
2010-01-05 1.64
2010-01-06 NaN
2010-01-07 1.79
2010-01-08 NaN
2010-01-11 1.90
2010-01-12 1.89
2010-01-13 1.82
2010-01-14 NaN
2010-01-15 1.85
2010-01-19 NaN
As you can see, the second dataframe has missing values.
Since both indexes are in datetime format, I want to be able to compare stocks['OPK'].Open to df.loc['2010-01-04':, 'Open'] and fill in the missing values with the values from the first dataframe, df.
I can do a boolean filter with this code, but I don't know how to proceed from there:
stocks['OPK'].Open == df.loc['2010-01-04':, 'Open']
The problem with pd.merge and its respective options is that it seems to add extra columns. I just want to fill in the missing values (if there are any) through comparison with the other dataframe.
Thank you.

You can use fillna():
df2 = df2.fillna(df1)
Another, faster way is combine_first:
df2 = df2.combine_first(df1)
Both will return
Date Open
0 2010-01-04 1.80
1 2010-01-05 1.64
2 2010-01-06 1.90
3 2010-01-07 1.79
4 2010-01-08 1.92
5 2010-01-11 1.90
6 2010-01-12 1.89
7 2010-01-13 1.82
8 2010-01-14 1.84
9 2010-01-15 1.85
10 2010-01-19 1.77
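A minimal, self-contained sketch of both options, using toy Series as stand-ins for the two dataframes (with a datetime index as in the question):

```python
import pandas as pd

# Toy stand-ins: df1 is complete, df2 has a gap (as in the question)
idx = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06'])
df1 = pd.Series([1.80, 1.64, 1.90], index=idx, name='Open')
df2 = pd.Series([1.80, 1.64, None], index=idx, name='Open')

# Keep df2's values where present, fall back to df1 where missing
filled = df2.fillna(df1)            # or: df2.combine_first(df1)
print(filled)
```

Because the indexes align by date, the NaN on 2010-01-06 is replaced by df1's value; both methods give the same result here.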

Related

How to convert a defaultdict with a nested multidimensional list collection to a pandas DataFrame

I have a nested dictionary with multidimensional lists in Python. Here is the structure of the dictionary:
{183431:[Mon_yy,qtr_year,season,cal_year],197623:[Mon_yy,qtr_year,season,cal_year]}
defaultdict(list,
{'183431': [mon_yy
Oct22 1.73
Nov22 1.80
Dec22 1.83
Name: MEAN, Length: 134, dtype: float64,
qtr_year
Q4 22 1.79
Q1 23 1.81
Q2 23 1.68
Name: MEAN, dtype: float64,
season
Win22 1.80
Sum23 1.61
Win23 1.29
Name: MEAN, dtype: float64,
cal_year
Cal22 1.79
Cal23 1.60
Cal24 1.03
Name: MEAN, dtype: float64],
'197623': [mon_yy
Oct22 1.47
Nov22 1.65
Dec22 1.70
Name: MEAN, Length: 130, dtype: float64,
qtr_year
Q4 22 1.61
Q1 23 1.74
Q2 23 1.70
Name: MEAN, dtype: float64,
season
Win22 1.68
Sum23 1.63
Win23 1.28
Name: MEAN, dtype: float64,
cal_year
Cal22 1.61
Cal23 1.59
Cal24 1.01
Name: MEAN, dtype: float64]})
And I am trying to convert this structure into pandas dataframe,
mon_yy 183431 197623
Oct-22 1.73 1.47
Nov-22 1.8 1.65
Dec-22 1.83 1.7
qtr_year
Q4 22 1.79 1.61
Q1 23 1.81 1.74
Q2 23 1.68 1.7
season
Win22 1.8 1.68
Sum23 1.61 1.63
Win23 1.29 1.28
cal_year
Cal22 1.79 1.61
Cal23 1.6 1.59
Cal24 1.03 1.01
I can convert up to one level but can't achieve the exact structure.
Here is what I have tried so far.
1st try:
s = pd.Series(price_diff_dict).explode()
dict_to_df = pd.DataFrame(s.to_list(), index=s.index, columns=['183431','197623'])
2nd try:
dd = pd.concat({k: pd.DataFrame(v).T for k, v in price_diff_dict.items()}, axis=0)
Output:
MEAN MEAN MEAN MEAN
183431 Oct22 1.73 NaN NaN NaN
Nov22 1.80 NaN NaN NaN
Dec22 1.83 NaN NaN NaN
Jan23 1.85 NaN NaN NaN
Feb23 1.83 NaN NaN NaN
... ... ... ... ...
197623 Cal29 NaN NaN NaN 0.00
Cal30 NaN NaN NaN -0.16
Cal31 NaN NaN NaN -2.06
Cal32 NaN NaN NaN -2.26
Cal33 NaN NaN NaN -2.42
Can anyone please help me solve this? Any suggestions would be really useful.
Thanks,
Prabha.
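One possible approach, sketched on a toy version of the structure (Series names and index labels taken from the question, values truncated): stack each ID's list of Series into one long Series, then align the IDs as columns.

```python
import pandas as pd
from collections import defaultdict

# Toy version of the structure: each value is a list of named Series
d = defaultdict(list, {
    '183431': [
        pd.Series({'Oct22': 1.73, 'Nov22': 1.80, 'Dec22': 1.83},
                  name='MEAN').rename_axis('mon_yy'),
        pd.Series({'Q4 22': 1.79, 'Q1 23': 1.81},
                  name='MEAN').rename_axis('qtr_year'),
    ],
    '197623': [
        pd.Series({'Oct22': 1.47, 'Nov22': 1.65, 'Dec22': 1.70},
                  name='MEAN').rename_axis('mon_yy'),
        pd.Series({'Q4 22': 1.61, 'Q1 23': 1.74},
                  name='MEAN').rename_axis('qtr_year'),
    ],
})

# Stack each list of Series into one long Series per key,
# then align the keys ('183431', '197623') as columns
out = pd.concat({k: pd.concat(v) for k, v in d.items()}, axis=1)
print(out)
```

This gives one column per ID with the mon_yy/qtr_year/... blocks stacked vertically, which is close to the desired layout; the per-level group labels could be kept by passing keys to the inner concat instead.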

How to subtract X rows in a dataframe with first value from another dataframe?

I am using pandas for this work.
I have a 2 datasets. The first dataset has approximately 6 million rows and 6 columns. For example the first data set looks something like this:
Date
Time
U
V
W
T
2020-12-30
2:34
3
4
5
7
2020-12-30
2:35
2
3
6
5
2020-12-30
2:36
1
5
8
5
2020-12-30
2:37
2
3
0
8
2020-12-30
2:38
4
4
5
7
2020-12-30
2:39
5
6
5
9
this is just the raw data collected from the machine.
The second is the average values of three rows at a time from each column (U,V,W,T).
U
V
W
T
2
4
6.33
5.67
3.66
4.33
3.33
8
What I am trying to do is calculate the perturbation for each column per second.
U(perturbation)=U(raw)-U(avg)
U(raw)= dataset 1
U(avg)= dataset 2
Basically take the first three rows from the first column of the first dataset and individually subtract them by the first value in the first column of the second dataset, then take the next three values from the first column of the first data set and individually subtract them by second value in the first column of the second dataset. Do the same for all three columns.
The desired final output should be as the following:
Date
Time
U
V
W
T
2020-12-30
2:34
1
0
-1.33
1.33
2020-12-30
2:35
0
-1
-0.33
-0.67
2020-12-30
2:36
-1
1
1.67
-0.67
2020-12-30
2:37
-1.66
-1.33
-3.33
0
2020-12-30
2:38
0.34
-0.33
1.67
-1
2020-12-30
2:39
1.34
1.67
1.67
1
I am new to pandas and do not know how to approach this.
I hope it makes sense.
a = df1.assign(index = df1.index // 3).merge(df2.reset_index(), on='index')
b = a.filter(regex = '_x', axis=1) - a.filter(regex = '_y', axis = 1).to_numpy()
pd.concat([a.filter(regex='^[^_]+$', axis = 1), b], axis = 1)
Date Time index U_x V_x W_x T_x
0 2020-12-30 2:34 0 1.00 0.00 -1.33 1.33
1 2020-12-30 2:35 0 0.00 -1.00 -0.33 -0.67
2 2020-12-30 2:36 0 -1.00 1.00 1.67 -0.67
3 2020-12-30 2:37 1 -1.66 -1.33 -3.33 0.00
4 2020-12-30 2:38 1 0.34 -0.33 1.67 -1.00
5 2020-12-30 2:39 1 1.34 1.67 1.67 1.00
You can use numpy:
import numpy as np
df1[df2.columns] -= np.repeat(df2.to_numpy(), 3, axis=0)
NB: this modifies df1 in place; if you want, you can make a copy first (df_final = df1.copy()) and apply the subtraction to that copy.
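A self-contained version of the numpy approach, using the question's sample data (column names taken from the tables above):

```python
import numpy as np
import pandas as pd

# Dataset 1: raw readings (6 rows = 2 groups of 3)
df1 = pd.DataFrame({
    'Date': ['2020-12-30'] * 6,
    'Time': ['2:34', '2:35', '2:36', '2:37', '2:38', '2:39'],
    'U': [3, 2, 1, 2, 4, 5],
    'V': [4, 3, 5, 3, 4, 6],
    'W': [5, 6, 8, 0, 5, 5],
    'T': [7, 5, 5, 8, 7, 9],
})
# Dataset 2: one row of averages per group of 3 raw rows
df2 = pd.DataFrame({
    'U': [2, 3.66], 'V': [4, 4.33], 'W': [6.33, 3.33], 'T': [5.67, 8],
})

# Repeat each average row 3 times so it lines up with the raw rows,
# then subtract to get the perturbations
pert = df1.copy()
pert[df2.columns] = pert[df2.columns] - np.repeat(df2.to_numpy(), 3, axis=0)
print(pert)
```

Working on a copy keeps df1 intact, which matches the note above.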

How to sum the last 7 days in Pandas between two dates

Here is my raw data
Raw Data
Here is the data (including types) after I add on the column 'Date_2wks_Ago' within Pandas
I would like to add on a new column 'Rainfall_Last7Days' that calculates, for each day, the total amount of rainfall for the last week.
So (ignoring the other columns that aren't relevant) it would look a little like this...
Ideal Dataset
Anyone know how to do this in Pandas?
My data is about 1000 observations long, so not huge.
I think what you are looking for is the rolling() function.
This section recreates a simplified version of your table:
import pandas as pd

# Create df from lists
rainfall_from_9am = [4.6, 0.4, 3.6, 3.5, 3.2, 5.5, 2.2, 1.3,
                     0, 0, 0.04, 0, 0, 0, 0.04, 0.4]
date = ['2019-02-03', '2019-02-04', '2019-02-05', '2019-02-06',
        '2019-02-07', '2019-02-08', '2019-02-09', '2019-02-10',
        '2019-02-11', '2019-02-12', '2019-02-13', '2019-02-14',
        '2019-02-15', '2019-02-16', '2019-02-17', '2019-02-18']
df = pd.DataFrame({'rainfall_from_9am': rainfall_from_9am, 'date': date})
This part calculates the rolling sum of rainfall for the current and previous 6 records.
df['rain_last7days'] = df['rainfall_from_9am'].rolling(7).sum()
print(df)
Output:
date rainfall_from_9am rain_last7days
0 2019-02-03 4.60 NaN
1 2019-02-04 0.40 NaN
2 2019-02-05 3.60 NaN
3 2019-02-06 3.50 NaN
4 2019-02-07 3.20 NaN
5 2019-02-08 5.50 NaN
6 2019-02-09 2.20 23.00
7 2019-02-10 1.30 19.70
8 2019-02-11 0.00 19.30
9 2019-02-12 0.00 15.70
10 2019-02-13 0.04 12.24
11 2019-02-14 0.00 9.04
12 2019-02-15 0.00 3.54
13 2019-02-16 0.00 1.34
14 2019-02-17 0.04 0.08
15 2019-02-18 0.40 0.48
I'm conscious that this output does not exactly match the example in your original question. Can you please verify the exact logic you are after?
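One caveat worth noting: if your dates can have gaps, a fixed 7-row window silently spans more than a week. With a DatetimeIndex you can use a time-based window instead; a sketch on the first rows of the data above:

```python
import pandas as pd

# Time-based window: '7D' sums everything within the last 7 calendar
# days of each row, so missing dates are handled correctly
df = pd.DataFrame(
    {'rainfall_from_9am': [4.6, 0.4, 3.6, 3.5, 3.2, 5.5, 2.2, 1.3]},
    index=pd.date_range('2019-02-03', periods=8, freq='D'),
)
df['rain_last7days'] = df['rainfall_from_9am'].rolling('7D').sum()
print(df)
```

Note that offset-based windows default to min_periods=1, so the first rows get partial sums rather than NaN.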

Convert "Empty Dataframe" / List Items to Dataframe?

I parsed a table from a website using Selenium (by xpath), then used pd.read_html on the table element, and now I'm left with what looks like a list that makes up the table. It looks like this:
[Empty DataFrame
Columns: [Symbol, Expiration, Strike, Last, Open, High, Low, Change, Volume]
Index: [], Symbol Expiration Strike Last Open High Low Change Volume
0 XPEV Dec20 12/18/2020 46.5 3.40 3.00 5.05 2.49 1.08 696.0
1 XPEV Dec20 12/18/2020 47.0 3.15 3.10 4.80 2.00 1.02 2359.0
2 XPEV Dec20 12/18/2020 47.5 2.80 2.67 4.50 1.89 0.91 2231.0
3 XPEV Dec20 12/18/2020 48.0 2.51 2.50 4.29 1.66 0.85 3887.0
4 XPEV Dec20 12/18/2020 48.5 2.22 2.34 3.80 1.51 0.72 2862.0
5 XPEV Dec20 12/18/2020 49.0 1.84 2.00 3.55 1.34 0.49 4382.0
6 XPEV Dec20 12/18/2020 50.0 1.36 1.76 3.10 1.02 0.30 14578.0
7 XPEV Dec20 12/18/2020 51.0 1.14 1.26 2.62 0.78 0.31 4429.0
8 XPEV Dec20 12/18/2020 52.0 0.85 0.95 2.20 0.62 0.19 2775.0
9 XPEV Dec20 12/18/2020 53.0 0.63 0.79 1.85 0.50 0.13 1542.0]
How do I turn this into an actual dataframe, with the "Symbol, Expiration, etc..." as the header, and the far left column as the index?
I've been trying several different things, but to no avail. Where I left off was trying:
# From reading the html of the table step
dfs = pd.read_html(table.get_attribute('outerHTML'))
dfs = pd.DataFrame(dfs)
... and when I print the new dfs, I get this:
0 Empty DataFrame
Columns: [Symbol, Expiration, ...
1 Symbol Expiration Strike Last Open ...
Per pandas.read_html docs,
This function will always return a list of DataFrame or it will fail, e.g., it will not return an empty list.
According to your list output, the non-empty dataframe is the second element in that list, so retrieve it by indexing (remember Python uses zero as the first index of iterables). Note that you can work with data frames stored in lists or dicts directly.
dfs[1].head()
dfs[1].tail()
dfs[1].describe()
...
single_df = dfs[1].copy()
del dfs
Or index on the same call:
single_df = pd.read_html(...)[1]
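For example (the list of tables is simulated here with toy frames, since pd.read_html needs live HTML; the pattern is the same):

```python
import pandas as pd

# pd.read_html returns a list of DataFrames; as in the question,
# the first element is an empty frame and the second is the real table
dfs = [
    pd.DataFrame(columns=['Symbol', 'Expiration', 'Strike']),
    pd.DataFrame({'Symbol': ['XPEV', 'XPEV'],
                  'Expiration': ['12/18/2020'] * 2,
                  'Strike': [46.5, 47.0]}),
]

# Keep only the non-empty tables, or index the one you want directly
non_empty = [df for df in dfs if not df.empty]
single_df = non_empty[0]
print(single_df)
```

Filtering on .empty is handy when you don't know in advance which position the real table occupies in the list.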

How to add a new column into a dataframe based on rows of an other dataframe?

I have two Dataframes :
DF1(That i've just resampled):
Mi_pollution.head():
Sensor_ID Time_Instant Measurement
0 10273 2013-11-01 00:00:00 46
1 10273 2013-11-01 01:00:00 51
2 10273 2013-11-01 02:00:00 39
3 10273 2013-11-01 03:00:00 30
4 10273 2013-11-01 04:00:00 37
And I have the DF2 :
Pollutants.head():
Sensor_ID Sensor_Street_Name Sensor_Lat Sensor_Long Sensor_Type UOM Time_Instant
0 20020 Milano -via Carlo Pascal 45.478452 9.235016 Ammonia µg/m YYYY/MM/DD
1 17127 Milano - viale Marche 45.496067 9.193023 Benzene µg/m YYYY/MM/DD HH24:MI
2 17126 Milano -via Carlo Pascal 45.478452 9.235016 Benzene µg/m YYYY/MM/DD HH24:MI
3 6057 Milano - via Senato 45.470780 9.197180 Benzene µg/m YYYY/MM/DD HH24:MI
4 6062 Milano - P.zza Zavattari 45.476089 9.143509 Benzene µg/m YYYY/MM/DD HH24:MI
And what I'm trying to do is create new columns based on the pollutants, add them to DF1, and assign each measurement based on the sensor, like this:
Sensor_ID Time_Instant Ammonia Benzene Nitrogene …...
0 20020 2013-12-01 00:00:00 4.8 NaN NaN
1 20020 2013-12-01 01:00:00 5.3 NaN NaN
2 20020 2013-12-01 02:00:00 3.0 NaN NaN
.
.
56 14330 2013-11-01 00:00:00 NaN 6.3 NaN
57 14330 2013-11-01 01:00:00 NaN 5.3 NaN
.
.
Any suggestion would be much appreciated. Thank you all.
Assuming you're joining on Sensor_ID (there aren't any common Sensor_IDs between the two dataframes in the small example you gave), you could merge the dfs on Sensor_ID (and possibly Time_Instant?).
Then you can use pivot_table to transpose the row values (Sensor_Type) to column headings, then fill in the row values with Measurement.
For example:
df3 = df1.merge(df2, on='Sensor_ID', how='left')\
.pivot_table(index=['Sensor_ID','Sensor_Street_Name','Other columns'],
values='Measurement',
columns='Sensor_Type')\
.reset_index()
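A runnable toy version of that merge-then-pivot chain (the overlapping Sensor_IDs here are hypothetical, since the posted samples share none, and the extra index columns are dropped for brevity):

```python
import pandas as pd

# DF1-like: measurements per sensor and timestamp
df1 = pd.DataFrame({
    'Sensor_ID': [20020, 20020, 17127],
    'Time_Instant': ['2013-12-01 00:00:00', '2013-12-01 01:00:00',
                     '2013-12-01 00:00:00'],
    'Measurement': [4.8, 5.3, 6.3],
})
# DF2-like: which pollutant each sensor measures
df2 = pd.DataFrame({
    'Sensor_ID': [20020, 17127],
    'Sensor_Type': ['Ammonia', 'Benzene'],
})

# Attach the pollutant to each measurement, then spread Sensor_Type
# into one column per pollutant, filled with Measurement
df3 = (df1.merge(df2, on='Sensor_ID', how='left')
          .pivot_table(index=['Sensor_ID', 'Time_Instant'],
                       values='Measurement',
                       columns='Sensor_Type')
          .reset_index())
print(df3)
```

Each sensor's rows end up with its own pollutant column populated and NaN in the others, matching the desired layout.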
