How to multiply values with a group of data from a pandas Series without loop iteration - python-3.x

I have two pandas time series with different lengths and indexes, plus a Boolean series. series_1 holds the last data point of each month, indexed by the last day of the month; series_2 holds daily data with a daily index; the Boolean series is True on the last day of each month and False otherwise.
I want to multiply a value from series_1 (s1[0]) by the data from series_2 (s2[1:n]) that makes up one month of daily data. Is there a way to do this without a loop?
series_1 = 2010-06-30 1
2010-07-30 2
2010-08-31 5
2010-09-30 7
series_2 = 2010-07-01 2
2010-07-02 3
2010-07-03 5
2010-07-04 6
.....
2010-07-30 7
2010-08-01 6
2010-08-02 7
2010-08-03 5
.....
2010-08-31 6
Boolean = False
False
....
True
False
False
....
True
(True only on the last day of each month)
I want to get a series as a result, s = series_1[i] * series_2[j:j+n] (the n data points from the same month).
How can I do this?
Thanks in advance.

Not sure if I got your question completely right, but this should get you there:
import pandas as pd

series_1 = pd.Series({
    '2010-07-30': 2,
    '2010-08-31': 5
})
series_2 = pd.Series({
    '2010-07-01': 2,
    '2010-07-02': 3,
    '2010-07-03': 5,
    '2010-07-04': 6,
    '2010-07-30': 7,
    '2010-08-01': 6,
    '2010-08-02': 7,
    '2010-08-03': 5,
    '2010-08-31': 6
})
Make the series datetime-aware and resample them to daily frequency:
series_1.index = pd.DatetimeIndex(series_1.index)
series_1 = series_1.resample('1D').asfreq()
series_2.index = pd.DatetimeIndex(series_2.index)
series_2 = series_2.resample('1D').asfreq()
Put them in a dataframe and perform basic multiplication:
df = pd.DataFrame()
df['1'] = series_1
df['2'] = series_2
df['product'] = df['1'] * df['2']
Result:
>>> df
              1    2  product
2010-07-30  2.0  7.0     14.0
2010-07-31  NaN  NaN      NaN
2010-08-01  NaN  6.0      NaN
2010-08-02  NaN  7.0      NaN
2010-08-03  NaN  5.0      NaN
[...]
2010-08-27  NaN  NaN      NaN
2010-08-28  NaN  NaN      NaN
2010-08-29  NaN  NaN      NaN
2010-08-30  NaN  NaN      NaN
2010-08-31  5.0  6.0     30.0
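If the goal is instead to keep the full daily index and scale each day in series_2 by a monthly value from series_1, here is a loop-free sketch, starting from the original series as defined above (before the resample step). It assumes each month-end value applies to the same month's daily data; the s1[0] * s2[1:n] notation in the question may instead mean the previous month-end value, in which case shift the monthly index forward first.
import pandas as pd

s1 = series_1.copy()
s1.index = pd.DatetimeIndex(s1.index).to_period('M')
# s1.index = s1.index + 1   # uncomment to apply the previous month-end value instead

s2 = series_2.copy()
s2.index = pd.DatetimeIndex(s2.index)

# look up each day's monthly factor by converting its timestamp to a
# month period, then multiply elementwise -- no explicit Python loop
factors = s1.reindex(s2.index.to_period('M'))
result = s2 * factors.to_numpy()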

Related

Concatenate 2 dataframes. I would like to combine duplicate columns

The following code can be used as an example of the problem I'm having:
import pandas as pd

dic = {'A':['1','2','3'], 'B':['10','11','12']}
df1 = pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
df2 = pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df3 = pd.concat([df1, df2], axis=1)
print(df3)
The result I get from this concatenation is:
     B    B
1   10  NaN
2   11  NaN
3   12  NaN
4  NaN   10
5  NaN   11
6  NaN   12
I would like to have:
    B
1  10
2  11
3  12
4  10
5  11
6  12
I know that I can concatenate along axis=0. Unfortunately, that only solves the problem for this little example. The actual code I'm working with is more complex. Concatenating along axis=0 causes the index to be duplicated. I don't want that either.
EDIT:
People have asked me for a more complex example showing why simply removing 'axis=1' doesn't work. Here is one, first with axis=1 INCLUDED:
dic = {'A':['1','2','3'], 'B':['10','11','12']}
df1 = pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
df2 = pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df = pd.concat([df1, df2], axis=1)
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
df3 = pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4 = pd.concat([df, df3], axis=1)
print(df4)
This gives me:
     B    B    C
1   10  NaN   20
2   11  NaN   21
3   12  NaN   22
4  NaN   10  NaN
5  NaN   11  NaN
6  NaN   12  NaN
I would like to have:
    B    C
1  10   20
2  11   21
3  12   22
4  10  NaN
5  11  NaN
6  12  NaN
Now here is an example with axis=1 REMOVED:
dic = {'A':['1','2','3'], 'B':['10','11','12']}
df1 = pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
df2 = pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df = pd.concat([df1, df2])
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
df3 = pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4 = pd.concat([df, df3])
print(df4)
This gives me:
     B    C
A
1   10  NaN
2   11  NaN
3   12  NaN
4   10  NaN
5   11  NaN
6   12  NaN
1  NaN   20
2  NaN   21
3  NaN   22
I would like to have:
    B    C
1  10   20
2  11   21
3  12   22
4  10  NaN
5  11  NaN
6  12  NaN
Sorry it wasn't very clear. I hope this helps.
Here is a two-step process for the example provided after the 'EDIT' point. Start by creating the dictionaries:
import pandas as pd
dic = {'A':['1','2','3'], 'B':['10','11','12']}
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
Step 1: convert each dictionary to a data frame, with index 'A', and concatenate (along axis=0):
t = pd.concat([pd.DataFrame(dic).set_index('A'),
               pd.DataFrame(dic2).set_index('A'),
               pd.DataFrame(dic3).set_index('A')])
Step 2: concatenate the non-null elements of col 'B' with the non-null elements of col 'C' (you could put this in a list comprehension if there are more than two columns; a sketch follows the output below). Now we concatenate along axis=1:
result = pd.concat([
    t.loc[t['B'].notna(), 'B'],
    t.loc[t['C'].notna(), 'C'],
], axis=1)
print(result)
    B    C
1  10   20
2  11   21
3  12   22
4  10  NaN
5  11  NaN
6  12  NaN
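As noted above, step 2 generalizes to any number of columns with a list comprehension. A minimal sketch, assuming every column should keep only its non-null entries:
result = pd.concat(
    [t.loc[t[col].notna(), col] for col in t.columns],
    axis=1,
)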
Edited:
If two objects are concatenated along axis=1, the new columns are appended; with axis=0 (the default), the same column is extended with new values.
See the solution below:
import pandas as pd

dic = {'A':['1','2','3'], 'B':['10','11','12']}
df1 = pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
df2 = pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df = pd.concat([df1, df2])
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
df3 = pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4 = pd.concat([df, df3], axis=1)  # C is a new column here, so axis=1 is needed
print(df4)
Output:
    B    C
1  10   20
2  11   21
3  12   22
4  10  NaN
5  11  NaN
6  12  NaN

Slicing xarray dataset with coordinate dependent variable

I built an xarray dataset in Python 3 with coordinates (time, levels) to identify all cloud bases and cloud tops during one day of observations. The variable levels is the dimension for the cloud bases/tops that can be identified at a given time; it stores the cloud base/top height values for each time.
Now I want to select all the cloud bases and tops that are located within a given range of heights that changes in time. The height range is identified by the arrays bottom_mod and top_mod. These arrays have a time dimension and contain the edges of the range of heights to be selected.
The xarray dataset is cloudStandard_mod_reshaped:
Dimensions:     (levels: 8, time: 9600)
Coordinates:
  * levels      (levels) int64 0 1 2 3 4 5 6 7
  * time        (time) datetime64[ns] 2013-04-14 ... 2013-04-14T23:59:51
Data variables:
    cloudTop    (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
    cloudThick  (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
    cloudBase   (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
I tried to select the heights in the range identified by the top and bottom arrays as follows:
PBLclouds = cloudStandard_mod_reshaped.sel(levels=slice(bottom_mod[:], top_mod[:]))
but this instruction accepts only scalar values for the slice.
Do you know how to slice with values that are coordinate-dependent?
You can use the .where() method.
The line providing the solution is under step 2.
1. First, create some data like yours:
The dataset:
import numpy as np
import xarray as xr

nlevels, ntime = 8, 50
ds = xr.Dataset(
    coords=dict(levels=np.arange(nlevels), time=np.arange(ntime)),
    data_vars=dict(
        cloudTop=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudThick=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudBase=(("levels", "time"), np.random.randn(nlevels, ntime)),
    ),
)
output of print(ds):
<xarray.Dataset>
Dimensions:     (levels: 8, time: 50)
Coordinates:
  * levels      (levels) int64 0 1 2 3 4 5 6 7
  * time        (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
    cloudTop    (levels, time) float64 0.08375 0.04721 0.9379 ... 0.04877 2.339
    cloudThick  (levels, time) float64 -0.6441 -0.8338 -1.586 ... -1.026 -0.5652
    cloudBase   (levels, time) float64 -0.05004 -0.1729 0.7154 ... 0.06507 1.601
For the top and bottom levels, I'll make the bottom level random and just add an offset to construct the top level.
offset = 3
bot_mod = xr.DataArray(
    dims=("time"),
    coords=dict(time=np.arange(ntime)),
    data=np.random.randint(0, nlevels - offset, ntime),
    name="bot_mod",
)
top_mod = (bot_mod + offset).rename("top_mod")
output of print(bot_mod):
<xarray.DataArray 'bot_mod' (time: 50)>
array([0, 1, 2, 2, 3, 1, 2, 1, 0, 2, 1, 3, 2, 0, 2, 4, 3, 3, 2, 1, 2, 0,
2, 2, 0, 1, 1, 4, 1, 3, 0, 4, 0, 4, 4, 0, 4, 4, 1, 0, 3, 4, 4, 3,
3, 0, 1, 2, 4, 0])
2. Then, select the range of levels where clouds are:
Use the .where() method to select the dataset variables that are between the bottom level and the top level:
ds_clouds = ds.where((ds.levels > bot_mod) & (ds.levels < top_mod))
output of print(ds_clouds):
<xarray.Dataset>
Dimensions:     (levels: 8, time: 50)
Coordinates:
  * levels      (levels) int64 0 1 2 3 4 5 6 7
  * time        (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
    cloudTop    (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
    cloudThick  (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
    cloudBase   (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
It puts NaN where the condition is not satisfied; you can use the .dropna() method to get rid of those, as sketched below.
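For example, to drop the time steps where everything is NaN (one possible cleanup, sketched under that assumption):
# keep a time step only if at least one value survived the .where() mask
ds_clouds_clean = ds_clouds.dropna(dim="time", how="all")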
3. Check for success:
Plot cloudBase variable of the dataset before and after processing:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2)
ds.cloudBase.plot.imshow(ax=axes[0])
ds_clouds.cloudBase.plot.imshow(ax=axes[1])
plt.show()
I'm not yet allowed to embed images, so here is a link:
Original data vs. selected data

Replace values with NaN based on required quarters

As part of a model requirement, I am stuck on a weird spot where I need to replace actual values with NaN for the extra quarters.
In the example below, ID 1 should have NaN in column Q4, ID 2 should have no NaN, and ID 3 should have both Q3 and Q4 as NaN.
import pandas as pd

d = {'ID': [1, 2, 3], 'QTR_req': [3, 4, 2],
     'Q1': [1, 1, 1], 'Q2': [2, 2, 2], 'Q3': [3, 3, 3], 'Q4': [4, 4, 4]}
df2 = pd.DataFrame(data=d)
I have gotten as far as accessing QTR_req using df.loc, but I am stuck on how to set the specific quarters to NaN. Could you suggest what I am looking for here?
Maybe this:
import numpy as np

cols_needed = ['Q1', 'Q2', 'Q3', 'Q4']  # the quarter columns
df2[cols_needed] = (df2[cols_needed]
                    .where(df2['QTR_req'].values[:, None] > np.arange(len(cols_needed))))
Output:
   ID  QTR_req  Q1  Q2   Q3   Q4
0   1        3   1   2  3.0  NaN
1   2        4   1   2  3.0  4.0
2   3        2   1   2  NaN  NaN
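Why this works, as a quick sketch: the comparison broadcasts each row's QTR_req against the column positions 0..3, so a cell is kept only while its quarter index is below the required count.
import numpy as np

qtr_req = np.array([3, 4, 2])            # df2['QTR_req'].values
mask = qtr_req[:, None] > np.arange(4)   # compare against positions 0..3
print(mask)
# [[ True  True  True False]    -> ID 1 loses Q4
#  [ True  True  True  True]    -> ID 2 keeps everything
#  [ True  True False False]]   -> ID 3 loses Q3 and Q4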

Pandas append returns DF with NaN values

I'm appending data from a list to a pandas df, and I keep getting NaN in my entries. Based on what I've read, I think I might have to specify the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected results, something like this for one row:
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
Actual, for all rows:
  danceability  energy  ...  duration_ms  time_signature
0          NaN     NaN  ...          NaN             NaN
1          NaN     NaN  ...          NaN             NaN
2          NaN     NaN  ...          NaN             NaN
3          NaN     NaN  ...          NaN             NaN
4          NaN     NaN  ...          NaN             NaN
5          NaN     NaN  ...          NaN             NaN
The append() strategy in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc to specify an insertion point; the DataFrame index should be used.
For example:
import pandas as pd

df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
   n
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

real    0m13.178s
user    0m12.287s
sys     0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
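A sketch of that collect-then-construct pattern applied to the original question, assuming sp.audio_features returns a list of dicts keyed by feature name (which is what the expected output suggests):
import pandas as pd

rows = []
for start in range(0, len(ids), 50):
    # each call returns up to 50 dicts of feature -> value
    rows.extend(sp.audio_features(ids[start:start + 50]))

# build the DataFrame once at the end; the dict keys become the column
# names, so the values line up and no NaN-filled rows appear
features_df = pd.DataFrame(rows)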

DataFrame difference between rows based on multiple columns

I am trying to calculate the difference between rows based on multiple columns. The data set is very large, so I am pasting dummy data below that describes the problem:
I want to calculate the daily difference in weight at a pet+name level. So far I have only come up with the solution of concatenating these columns and creating a multiindex based on the new column and the date column, but I think there should be a better way. In the real dataset I have more than 3 columns that I am using to calculate the row difference.
df['pet_name'] = df.pet + df.name
df.set_index(['pet_name', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    df.diffs[idx] = df.weight[idx].diff()
Based on your description, you can try groupby:
df['pet_name'] = df.pet + df.name
df.groupby('pet_name')['weight'].diff()
Use groupby with 2 columns:
df.groupby(['pet', 'name'])['weight'].diff()
All together:
# convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
# sorting
df = df.sort_values(['pet', 'name', 'date'])
# get differences per group
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
                   'name': np.random.choice(L, N),
                   'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
                            .sample(N, replace=True),
                   'weight': np.random.rand(N)})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name', 'date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name', 'date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print(df.head(20))
         date name pet    weight     diffs pet_name    diffs1
1  2015-01-02    a   a  0.105446       NaN       aa       NaN
2  2015-01-03    a   a  0.845533       NaN       aa       NaN
2  2015-01-03    a   a  0.980582  0.135049       aa  0.135049
2  2015-01-03    a   a  0.443368 -0.537214       aa -0.537214
3  2015-01-04    a   a  0.375186       NaN       aa       NaN
6  2015-01-07    a   a  0.715601       NaN       aa       NaN
7  2015-01-08    a   a  0.047340       NaN       aa       NaN
9  2015-01-10    a   a  0.236600       NaN       aa       NaN
0  2015-01-01    b   a  0.777162       NaN       ab       NaN
2  2015-01-03    b   a  0.871683       NaN       ab       NaN
3  2015-01-04    b   a  0.988329       NaN       ab       NaN
4  2015-01-05    b   a  0.918397       NaN       ab       NaN
4  2015-01-05    b   a  0.016119 -0.902279       ab -0.902279
5  2015-01-06    b   a  0.095530       NaN       ab       NaN
5  2015-01-06    b   a  0.894978  0.799449       ab  0.799449
5  2015-01-06    b   a  0.365719 -0.529259       ab -0.529259
5  2015-01-06    b   a  0.887593  0.521874       ab  0.521874
7  2015-01-08    b   a  0.792299       NaN       ab       NaN
7  2015-01-08    b   a  0.313669 -0.478630       ab -0.478630
7  2015-01-08    b   a  0.281235 -0.032434       ab -0.032434
