Slicing an xarray dataset with a coordinate-dependent variable

I built an xarray dataset in Python 3 with coordinates (time, levels) to identify all cloud bases and cloud tops during one day of observations. The coordinate levels is the dimension for the cloud bases/tops that can be identified at a given time; it stores cloud base/top height values for each time.
Now I want to select all the cloud bases and tops that are located within a given range of heights that changes in time. The height range is identified by the arrays bottom_mod and top_mod. These arrays have a time dimension and contain the edges of the range of heights to be selected.
The xarray dataset is cloudStandard_mod_reshaped:
Dimensions: (levels: 8, time: 9600)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) datetime64[ns] 2013-04-14 ... 2013-04-14T23:59:51
Data variables:
cloudTop (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
I tried to select the heights in the range identified by the top and bottom arrays as follows:
PBLclouds = cloudStandard_mod_reshaped.sel(levels=slice(bottom_mod[:], top_mod[:]))
but this does not work: slice only accepts scalar values.
Do you know how to slice with values that are coordinate-dependent?

You can use the .where() method.
The line providing the solution is under point 2 below.
1. First, create some data like yours:
The dataset:
import numpy as np
import xarray as xr

nlevels, ntime = 8, 50
ds = xr.Dataset(
    coords=dict(levels=np.arange(nlevels), time=np.arange(ntime)),
    data_vars=dict(
        cloudTop=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudThick=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudBase=(("levels", "time"), np.random.randn(nlevels, ntime)),
    ),
)
output of print(ds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 0.08375 0.04721 0.9379 ... 0.04877 2.339
cloudThick (levels, time) float64 -0.6441 -0.8338 -1.586 ... -1.026 -0.5652
cloudBase (levels, time) float64 -0.05004 -0.1729 0.7154 ... 0.06507 1.601
For the top and bottom levels, I'll make the bottom level random and just add an offset to construct the top level.
offset = 3
bot_mod = xr.DataArray(
    dims=("time"),
    coords=dict(time=np.arange(ntime)),
    data=np.random.randint(0, nlevels - offset, ntime),
    name="bot_mod",
)
top_mod = (bot_mod + offset).rename("top_mod")
output of print(bot_mod):
<xarray.DataArray 'bot_mod' (time: 50)>
array([0, 1, 2, 2, 3, 1, 2, 1, 0, 2, 1, 3, 2, 0, 2, 4, 3, 3, 2, 1, 2, 0,
2, 2, 0, 1, 1, 4, 1, 3, 0, 4, 0, 4, 4, 0, 4, 4, 1, 0, 3, 4, 4, 3,
3, 0, 1, 2, 4, 0])
2. Then, select the range of levels where clouds are:
Use the .where() method to select the dataset variables that are between the bottom level and the top level:
ds_clouds = ds.where((ds.levels > bot_mod) & (ds.levels < top_mod))
output of print(ds_clouds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
It puts NaN where the condition is not satisfied; you can use the .dropna() method to get rid of those.
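For instance, a minimal sketch (assuming you want to discard time steps where no level satisfied the condition; adjust dim and how to your needs):
ds_clouds_trimmed = ds_clouds.dropna(dim="time", how="all")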
3. Check for success:
Plot cloudBase variable of the dataset before and after processing:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2)
ds.cloudBase.plot.imshow(ax=axes[0])
ds_clouds.cloudBase.plot.imshow(ax=axes[1])
plt.show()
I'm not yet allowed to embed images, so here is a link instead:
Original data vs. selected data
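Translated back to the question's variables, the same pattern might look like the sketch below. This is an assumption, not part of the original answer: bottom_mod and top_mod are taken to be time-indexed DataArrays of heights, and the comparison is done against cloudBase values rather than against the integer levels coordinate.
ds_obs = cloudStandard_mod_reshaped
PBLclouds = ds_obs.where((ds_obs.cloudBase >= bottom_mod) & (ds_obs.cloudBase <= top_mod))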

Concatenation axis must match exactly for np.corrcoef

I have 2 numpy arrays. x is a 2-d array with 9 features/columns and 536 rows, and y is a 1-d array with 536 rows, as demonstrated below:
>>> x.shape
(536, 9)
>>> y.shape
(536,)
I am trying to find the correlation coefficients between x and y.
>>> np.corrcoef(x,y)
Here's the error I am seeing.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 5, in corrcoef
File "/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py", line 2683, in corrcoef
c = cov(x, y, rowvar, dtype=dtype)
File "<__array_function__ internals>", line 5, in cov
File "/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py", line 2477, in cov
X = np.concatenate((X, y), axis=0)
File "<__array_function__ internals>", line 5, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 9 and the array at index 1 has size 536
Can't seem to figure out what the shape of these 2 should be.
To get the same shape, you can use numpy.broadcast_to:
y1 = np.broadcast_to(y[:, None], x.shape)
# alternative solution
y1 = np.repeat(y[:, None], x.shape[1], 1)
print(np.corrcoef(x, y1))
Sample:
np.random.seed(1609)
x = np.random.random((5,3))
y = np.random.random((5))
print (x)
[[3.28341891e-01 9.10078695e-01 6.25727436e-01]
[9.52999512e-01 3.54590864e-02 4.19920842e-01]
[2.46229526e-02 3.60903454e-01 9.96143110e-01]
[8.87331773e-01 8.34857105e-04 6.36058323e-01]
[2.91490345e-01 5.01580494e-01 3.23455182e-01]]
print (y)
[0.60437973 0.74687751 0.68819022 0.19104546 0.68420365]
y1 = np.broadcast_to(y[:, None], x.shape)
print(y1)
[[0.60437973 0.60437973 0.60437973]
[0.74687751 0.74687751 0.74687751]
[0.68819022 0.68819022 0.68819022]
[0.19104546 0.19104546 0.19104546]
[0.68420365 0.68420365 0.68420365]]
print (np.corrcoef(x,y1))
[[ 1.00000000e+00 -9.96776982e-01 3.52933703e-01 -9.66910777e-01
9.23044315e-01 nan -3.11624591e-16 nan
nan 3.11624591e-16]
[-9.96776982e-01 1.00000000e+00 -4.26856227e-01 9.43328464e-01
-8.89208247e-01 nan 0.00000000e+00 nan
nan 0.00000000e+00]
[ 3.52933703e-01 -4.26856227e-01 1.00000000e+00 -1.02557684e-01
-3.41645099e-02 nan 9.18680698e-17 nan
nan -9.18680698e-17]
[-9.66910777e-01 9.43328464e-01 -1.02557684e-01 1.00000000e+00
-9.90642527e-01 nan -9.92012638e-17 nan
nan 9.92012638e-17]
[ 9.23044315e-01 -8.89208247e-01 -3.41645099e-02 -9.90642527e-01
1.00000000e+00 nan 6.00580887e-16 nan
nan -6.00580887e-16]
[ nan nan nan nan
nan nan nan nan
nan nan]
[-3.11624591e-16 0.00000000e+00 9.18680698e-17 -9.92012638e-17
6.00580887e-16 nan 1.00000000e+00 nan
nan -1.00000000e+00]
[ nan nan nan nan
nan nan nan nan
nan nan]
[ nan nan nan nan
nan nan nan nan
nan nan]
[ 3.11624591e-16 0.00000000e+00 -9.18680698e-17 9.92012638e-17
-6.00580887e-16 nan -1.00000000e+00 nan
nan 1.00000000e+00]]
@Jezrael pretty much answered my question. An alternative approach would be to create an array of zeros with 9 columns and store the correlation coefficient of each feature in x with y. We do this iteratively for every feature in x.
coeffs = np.zeros(9)
# number of features (columns)
n_features = x.shape[1]
for feature in range(n_features):
    # corrcoef returns a 2-d array of shape (2, 2) with 1s along the diagonal
    # and the coefficient value at [0, 1] and [1, 0]
    coeff = np.corrcoef(x[:, feature], y)[0, 1]
    coeffs[feature] = coeff
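For completeness, here is a compact vectorized sketch (assuming the goal is the correlation of y with each column of x, which is how I read the question): passing rowvar=False makes corrcoef treat columns as variables, so x's 9 features plus y become 10 variables and the last row of the result holds the per-feature correlations with y.
import numpy as np

corr_matrix = np.corrcoef(x, y, rowvar=False)  # shape (10, 10)
per_feature = corr_matrix[-1, :-1]             # correlation of y with each of the 9 features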

Concatenate 2 dataframes. I would like to combine duplicate columns

The following code can be used as an example of the problem I'm having:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df3=pd.concat([df1,df2], axis=1)
print(df3)
The result I get from this concatenation is:
B B
1 10 NaN
2 11 NaN
3 12 NaN
4 NaN 10
5 NaN 11
6 NaN 12
I would like to have:
B
1 10
2 11
3 12
4 10
5 11
6 12
I know that I can concatenate along axis=0. Unfortunately, that only solves the problem for this little example. The actual code I'm working with is more complex. Concatenating along axis=0 causes the index to be duplicated. I don't want that either.
EDIT:
People have asked me to give a more complex example to describe why simply removing 'axis=1' doesn't work. Here is a more complex example, first with axis=1 INCLUDED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2], axis=1)
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3], axis=1)
print(df4)
This gives me:
B B C
1 10 NaN 20
2 11 NaN 21
3 12 NaN 22
4 NaN 10 NaN
5 NaN 11 NaN
6 NaN 12 NaN
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Now here is an example with axis=1 REMOVED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3])
print(df4)
This gives me:
B C
A
1 10 NaN
2 11 NaN
3 12 NaN
4 10 NaN
5 11 NaN
6 12 NaN
1 NaN 20
2 NaN 21
3 NaN 22
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Sorry it wasn't very clear. I hope this helps.
Here is a two-step process for the example provided after the 'EDIT' point. Start by creating the dictionaries:
import pandas as pd
dic = {'A':['1','2','3'], 'B':['10','11','12']}
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
Step 1: convert each dictionary to a data frame, with index 'A', and concatenate (along axis=0):
t = pd.concat([pd.DataFrame(dic).set_index('A'),
               pd.DataFrame(dic2).set_index('A'),
               pd.DataFrame(dic3).set_index('A')])
Step 2: concatenate non-null elements of col 'B' with non-null elements of col 'C' (you could put this in a list comprehension if there are more than two columns; a sketch of that follows after the output below). Now we concatenate along axis=1:
result = pd.concat([
    t.loc[t['B'].notna(), 'B'],
    t.loc[t['C'].notna(), 'C'],
], axis=1)
print(result)
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
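As mentioned in step 2, the same idea generalizes to any number of columns with a list comprehension. A sketch (not part of the original answer), iterating over every column of t:
result = pd.concat(
    [t.loc[t[c].notna(), c] for c in t.columns],  # non-null slice of each column
    axis=1,
)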
Edited:
If two objects are concatenated along axis=1, the new columns are appended side by side. With axis=0 (the default), values are appended below within the same columns.
Refer to the solution below:
import pandas as pd
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3],axis=1)  # C is a new column here, so axis=1 is needed
print(df4)
Output:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
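A more compact alternative, offered as a sketch (it assumes that collapsing duplicated index labels by taking the first non-null value per column is acceptable): concatenate everything along axis=0, group by the index, and keep the first non-null value in each column.
combined = pd.concat([df1, df2, df3]).groupby(level=0).first()
print(combined)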

How to multiply values with a group of data from a pandas series without loop iteration

I have two pandas time series with different lengths and indexes, and a Boolean series. series_1 holds the last value of each month, indexed by the last day of that month; series_2 is daily data with a daily index; the Boolean series is True on the last day of each month and False otherwise.
I want to multiply a value from series_1 (s1[0]) by the data from series_2 (s2[1:n]), which is the daily data from one month. Is there a way to do this without a loop?
series_1 = 2010-06-30 1
2010-07-30 2
2010-08-31 5
2010-09-30 7
series_2 = 2010-07-01 2
2010-07-02 3
2010-07-03 5
2010-07-04 6
.....
2010-07-30 7
2010-08-01 6
2010-08-02 7
2010-08-03 5
.....
2010-08-31 6
Boolean = False
False
....
True
False
False
....
True
(with only the end of each month True)
I want to get a result series s = series_1[i] * series_2[j:j+n] (the n daily values from the same month).
How can I do that?
Thanks in advance
Not sure if I got your question completely right but this should get you there:
import pandas as pd

series_1 = pd.Series({
    '2010-07-30': 2,
    '2010-08-31': 5
})
series_2 = pd.Series({
    '2010-07-01': 2,
    '2010-07-02': 3,
    '2010-07-03': 5,
    '2010-07-04': 6,
    '2010-07-30': 7,
    '2010-08-01': 6,
    '2010-08-02': 7,
    '2010-08-03': 5,
    '2010-08-31': 6
})
Make the series Datetime aware and resample them to daily frequency:
series_1.index = pd.DatetimeIndex(series_1.index)
series_1 = series_1.resample('1D').asfreq()
series_2.index = pd.DatetimeIndex(series_2.index)
series_2 = series_2.resample('1D').asfreq()
Put them in a dataframe and perform basic multiplication:
df = pd.DataFrame()
df['1'] = series_1
df['2'] = series_2
df['product'] = df['1'] * df['2']
Result:
>>> df
1 2 product
2010-07-30 2.0 7.0 14.0
2010-07-31 NaN NaN NaN
2010-08-01 NaN 6.0 NaN
2010-08-02 NaN 7.0 NaN
2010-08-03 NaN 5.0 NaN
[...]
2010-08-27 NaN NaN NaN
2010-08-28 NaN NaN NaN
2010-08-29 NaN NaN NaN
2010-08-30 NaN NaN NaN
2010-08-31 5.0 6.0 30.0
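If the intent is instead to multiply every daily value in series_2 by the month-end value of series_1 from the previous month (my reading of the question, so treat this as an assumption), one sketch is to shift the effective date of each month-end value forward by one day and forward-fill it onto the daily index. It uses the original month-end values from the question, not the resampled series_1 above:
# month-end series from the question; each value takes effect the day after that month end
s1 = pd.Series({'2010-06-30': 1, '2010-07-30': 2, '2010-08-31': 5})
s1.index = pd.DatetimeIndex(s1.index) + pd.Timedelta(days=1)
factor = s1.reindex(series_2.index, method='ffill')  # carry it across the following month
result = factor * series_2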

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna but I do not manage to find the way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
    [1, 2, 3, 4, 5, 6],
    [None, None, 1, 2, 3, 4],
    [None, 1, 2, 3, 4, 5],
    [None, None, 5, 3, 3, 2],
    [None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative, but if possible I would like to use subset and how='all'.
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
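Another way to build that same subset (a sketch, not from the original answer) is to filter the columns by their first level with get_level_values:
# select every column whose first level is 1 or 2, then drop rows that are all-NaN there
cols = df.columns[df.columns.get_level_values(0).isin([1, 2])]
df.dropna(axis=0, how="all", subset=cols)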
If you want to apply the dropna to the columns which have 1 or 2 in the first level, you can do it as follows:
cols = [(c0, c1) for (c0, c1) in df.columns if c0 in [1, 2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last line (index=4) is gone, because all columns under 1 and 2 were NaN for this line. If you instead want to remove every row in which any of those columns is NaN, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6

Python: Extract dimension data from dataframe string column and create columns with values for each of them

Hi,
I have a source file with 2 columns: ID and all_dimensions. all_dimensions is a string with different "key: value" pairs, which are not the same for each ID.
I want to make the keys column headers and parse the respective value, if present, into the right cell.
Example:
ID all_dimensions
12 Height:2 cm,Volume: 4cl,Weight:100g
34 Length: 10cm, Height: 5 cm
56 Depth: 80cm
78 Weight: 2 kg, Length: 7 cm
90 Diameter: 4 cm, Volume: 50 cl
Desired result:
ID Height Volume Weight Length Depth Diameter
12 2 cm 4cl 100g - - -
34 5 cm - - 10cm - -
56 - - - - 80cm -
78 - - 2 kg 7 cm - -
90 - 50 cl - - - 4 cm
I have over 100 dimensions, so ideally I would like to write a for loop or something similar so I don't have to specify each column header (see code examples below).
I am using Python 3.7.3 and pandas 0.24.2.
What I have tried already:
1) I have tried to split the data into separate columns, but wasn't sure how to proceed to get each value assigned under the right header:
df.set_index('ID',inplace=True)
newdf = df["all_dimensions"].str.split(",|:",expand = True)
2) Using the initial df, I used "str.extract" to create new columns (but then I would need to specify each header):
df['Volume']=df.all_dimensions.str.extract(r'Volume:([\w\s.]*)').fillna('')
3) To resolve the problem of 2) with each header, I created a list of all dimension attributes and thought to use the list with a for loop to extract the values:
columns_list = df.all_dimensions.str.extract(r'^([\D]*):', expand=True).drop_duplicates()
columns_list = columns_list[0].str.strip().values.tolist()
for dimension in columns_list:
    df.dimension = df.all_dimensions.str.extract(r'dimension([\w\s.]*)').fillna('')
Here, JupyterNB gives me a UserWarning: "Pandas doesn't allow columns to be created via a new attribute name" and the df looks the same as before.
Option 1: I prefer splitting several times:
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
              )
# split a second time for the individual measurements
new_df = (new_series.str
                    .split(':', expand=True)
                    .reset_index()
          )
# strip off leading/trailing spaces
new_df[0] = new_df[0].str.strip()
new_df[1] = new_df[1].str.strip()
# unstack to get the desired table:
new_df.set_index(['ID', 0])[1].unstack()
Option 2: Use split(',|:') as you tried:
# splitting
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',|:', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
              )
# concat along axis=1 to get a dataframe with two columns
# new_df.columns = ('ID', 0, 1) where 0 is the measurement name
new_df = (pd.concat((new_series[::2].str.strip(),
                     new_series[1::2]), axis=1)
            .reset_index())
new_df.set_index(['ID', 0])[1].unstack()
Output:
Depth Diameter Height Length Volume Weight
ID
12 NaN NaN 2 cm NaN 4cl 100g
34 NaN NaN 5 cm 10cm NaN NaN
56 80cm NaN NaN NaN NaN NaN
78 NaN NaN NaN 7 cm NaN 2 kg
90 NaN 4 cm NaN NaN 50 cl NaN
This is a hard question: your string needs to be split, and each item after splitting needs to be converted to a dict; then we can rebuild those columns with the DataFrame constructor.
from collections import ChainMap

d = [[{y.split(':')[0]: y.split(':')[1]} for y in x.split(',')] for x in df.all_dimensions]
data = list(map(lambda x: dict(ChainMap(*x)), d))
s = pd.DataFrame(data)
df = pd.concat([df, s.groupby(s.columns.str.strip(), axis=1).first()], axis=1)
df
Out[26]:
ID all_dimensions Depth ... Length Volume Weight
0 12 Height:2 cm,Volume: 4cl,Weight:100g NaN ... NaN 4cl 100g
1 34 Length: 10cm, Height: 5 cm NaN ... 10cm NaN NaN
2 56 Depth: 80cm 80cm ... NaN NaN NaN
3 78 Weight: 2 kg, Length: 7 cm NaN ... 7 cm NaN 2 kg
4 90 Diameter: 4 cm, Volume: 50 cl NaN ... NaN 50 cl NaN
[5 rows x 8 columns]
Check the columns
df['Height']
Out[28]:
0 2 cm
1 5 cm
2 NaN
3 NaN
4 NaN
Name: Height, dtype: object
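Another possibility, offered only as a sketch (the regex assumes values never contain commas or colons): pull all the key/value pairs out in one pass with str.extractall and pivot them into columns.
pairs = (df.set_index('ID')['all_dimensions']
           .str.extractall(r'(?P<key>[^,:]+):\s*(?P<value>[^,]+)')  # one row per key/value pair
           .reset_index())
pairs['key'] = pairs['key'].str.strip()
result = pairs.pivot(index='ID', columns='key', values='value')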
