Concatenation axis must match exactly for np.corrcoef - python-3.x

I have two NumPy arrays: x is a 2-d array with 9 features/columns and 536 rows, and y is a 1-d array with 536 rows, as demonstrated below.
>>> x.shape
(536, 9)
>>> y.shape
(536,)
I am trying to find the correlation coefficients between x and y.
>>> np.corrcoef(x,y)
Here's the error I am seeing.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 5, in corrcoef
File "/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py", line 2683, in corrcoef
c = cov(x, y, rowvar, dtype=dtype)
File "<__array_function__ internals>", line 5, in cov
File "/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py", line 2477, in cov
X = np.concatenate((X, y), axis=0)
File "<__array_function__ internals>", line 5, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 9 and the array at index 1 has size 536
I can't seem to figure out what the shapes of these two arrays should be.

To get the same shape, you can use numpy.broadcast_to:
y1 = np.broadcast_to(y[:, None], x.shape)
# alternative solution
y1 = np.repeat(y[:, None], x.shape[1], axis=1)
print(np.corrcoef(x, y1))
Sample:
np.random.seed(1609)
x = np.random.random((5,3))
y = np.random.random((5))
print (x)
[[3.28341891e-01 9.10078695e-01 6.25727436e-01]
[9.52999512e-01 3.54590864e-02 4.19920842e-01]
[2.46229526e-02 3.60903454e-01 9.96143110e-01]
[8.87331773e-01 8.34857105e-04 6.36058323e-01]
[2.91490345e-01 5.01580494e-01 3.23455182e-01]]
print (y)
[0.60437973 0.74687751 0.68819022 0.19104546 0.68420365]
y1 = np.broadcast_to(y[:, None], x.shape)
print(y1)
[[0.60437973 0.60437973 0.60437973]
[0.74687751 0.74687751 0.74687751]
[0.68819022 0.68819022 0.68819022]
[0.19104546 0.19104546 0.19104546]
[0.68420365 0.68420365 0.68420365]]
print (np.corrcoef(x,y1))
[[ 1.00000000e+00 -9.96776982e-01 3.52933703e-01 -9.66910777e-01
9.23044315e-01 nan -3.11624591e-16 nan
nan 3.11624591e-16]
[-9.96776982e-01 1.00000000e+00 -4.26856227e-01 9.43328464e-01
-8.89208247e-01 nan 0.00000000e+00 nan
nan 0.00000000e+00]
[ 3.52933703e-01 -4.26856227e-01 1.00000000e+00 -1.02557684e-01
-3.41645099e-02 nan 9.18680698e-17 nan
nan -9.18680698e-17]
[-9.66910777e-01 9.43328464e-01 -1.02557684e-01 1.00000000e+00
-9.90642527e-01 nan -9.92012638e-17 nan
nan 9.92012638e-17]
[ 9.23044315e-01 -8.89208247e-01 -3.41645099e-02 -9.90642527e-01
1.00000000e+00 nan 6.00580887e-16 nan
nan -6.00580887e-16]
[ nan nan nan nan
nan nan nan nan
nan nan]
[-3.11624591e-16 0.00000000e+00 9.18680698e-17 -9.92012638e-17
6.00580887e-16 nan 1.00000000e+00 nan
nan -1.00000000e+00]
[ nan nan nan nan
nan nan nan nan
nan nan]
[ nan nan nan nan
nan nan nan nan
nan nan]
[ 3.11624591e-16 0.00000000e+00 -9.18680698e-17 9.92012638e-17
-6.00580887e-16 nan -1.00000000e+00 nan
nan 1.00000000e+00]]

@Jezrael pretty much answered my question. An alternative approach is to create an array of zeros with 9 entries and store the correlation coefficient of each feature in x with y, doing this iteratively for every feature in x.
coeffs = np.zeros(9)
# number of features/columns
n_features = x.shape[1]
for feature in range(n_features):
    # corrcoef returns a 2-d array of shape (2, 2) with 1s along the
    # diagonal and the coefficient at [0, 1] and [1, 0]
    coeff = np.corrcoef(x[:, feature], y)[0, 1]
    coeffs[feature] = coeff
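The loop can also be collapsed into a single corrcoef call: stacking y as an extra column and passing rowvar=False yields a (10, 10) matrix whose last row holds the nine feature-vs-y coefficients. A minimal sketch with random stand-in data of the question's shapes:

```python
import numpy as np

# Hypothetical data with the shapes from the question
rng = np.random.default_rng(1609)
x = rng.random((536, 9))
y = rng.random(536)

# Stack y as a 10th column and treat columns as variables (rowvar=False);
# the last row of the matrix then holds corr(feature_i, y).
c = np.corrcoef(np.column_stack([x, y]), rowvar=False)
coeffs = c[-1, :-1]

# Matches the per-feature loop
loop = np.array([np.corrcoef(x[:, f], y)[0, 1] for f in range(x.shape[1])])
assert np.allclose(coeffs, loop)
```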

Related

Slicing xarray dataset with coordinate dependent variable

I built an xarray dataset in python3 with coordinates (time, levels) to identify all cloud bases and cloud tops during one day of observations. The variable levels is the dimension for the cloud base/tops that can be identified at a given time. It stores cloud base/top heights values for each time.
Now I want to select all the cloud bases and tops that are located within a given range of heights that change in time. The height range is identified by the arrays bottom_mod and top_mod. These arrays have a time dimension and contain the edges of the range of heights to be selected.
The xarray dataset is cloudStandard_mod_reshaped:
Dimensions: (levels: 8, time: 9600)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) datetime64[ns] 2013-04-14 ... 2013-04-14T23:59:51
Data variables:
cloudTop (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
I tried to select the heights in the range identified by top and bottom array as follows:
PBLclouds = cloudStandard_mod_reshaped.sel(levels=slice(bottom_mod[:], top_mod[:]))
but slice accepts only scalar values, not arrays.
Do you know how to slice with values that are coordinate-dependent?
You can use the .where() method.
The line providing the solution is under 2.
1. First, create some data like yours:
The dataset:
nlevels, ntime = 8, 50
ds = xr.Dataset(
    coords=dict(levels=np.arange(nlevels), time=np.arange(ntime)),
    data_vars=dict(
        cloudTop=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudThick=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudBase=(("levels", "time"), np.random.randn(nlevels, ntime)),
    ),
)
output of print(ds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 0.08375 0.04721 0.9379 ... 0.04877 2.339
cloudThick (levels, time) float64 -0.6441 -0.8338 -1.586 ... -1.026 -0.5652
cloudBase (levels, time) float64 -0.05004 -0.1729 0.7154 ... 0.06507 1.601
For the top and bottom levels, I'll make the bottom level random and just add an offset to construct the top level.
offset = 3
bot_mod = xr.DataArray(
    dims=("time"),
    coords=dict(time=np.arange(ntime)),
    data=np.random.randint(0, nlevels - offset, ntime),
    name="bot_mod",
)
top_mod = (bot_mod + offset).rename("top_mod")
output of print(bot_mod):
<xarray.DataArray 'bot_mod' (time: 50)>
array([0, 1, 2, 2, 3, 1, 2, 1, 0, 2, 1, 3, 2, 0, 2, 4, 3, 3, 2, 1, 2, 0,
2, 2, 0, 1, 1, 4, 1, 3, 0, 4, 0, 4, 4, 0, 4, 4, 1, 0, 3, 4, 4, 3,
3, 0, 1, 2, 4, 0])
2. Then, select the range of levels where clouds are:
Use the .where() method to keep only the dataset values that lie between the bottom level and the top level:
ds_clouds = ds.where((ds.levels > bot_mod) & (ds.levels < top_mod))
output of print(ds_clouds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
It puts NaN where the condition is not satisfied; you can use the .dropna() method to get rid of those. Note that the comparisons are strict, so the bottom and top levels themselves are excluded; use >= and <= if you want to include them.
3. Check for success:
Plot cloudBase variable of the dataset before and after processing:
fig, axes = plt.subplots(ncols=2)
ds.cloudBase.plot.imshow(ax=axes[0])
ds_clouds.cloudBase.plot.imshow(ax=axes[1])
plt.show()
I'm not yet allowed to embed images, so here's a link:
Original data vs. selected data
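Putting the steps above together, here is a condensed, runnable sketch with made-up data that also applies the dropna() step mentioned earlier to discard levels that end up all-NaN (variable names and the offset of 3 are assumptions for illustration):

```python
import numpy as np
import xarray as xr

nlevels, ntime = 8, 50
rng = np.random.default_rng(0)
ds = xr.Dataset(
    coords=dict(levels=np.arange(nlevels), time=np.arange(ntime)),
    data_vars=dict(
        cloudBase=(("levels", "time"), rng.standard_normal((nlevels, ntime))),
    ),
)
# Time-dependent band edges: a random bottom level plus a fixed offset
bot_mod = xr.DataArray(rng.integers(0, nlevels - 3, ntime),
                       dims="time", coords=dict(time=np.arange(ntime)))
top_mod = bot_mod + 3

# NaN outside the (strict) band, then drop levels that are NaN at every time
ds_clouds = ds.where((ds.levels > bot_mod) & (ds.levels < top_mod))
trimmed = ds_clouds.dropna(dim="levels", how="all")
```

With a strict band of width 3, exactly two levels survive at each time step, which is an easy sanity check on the mask.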

Trying to append a single row of data to a pandas DataFrame, but instead adds rows for each field of input

I am trying to add a row of data to a pandas DataFrame, but it keeps adding a separate row for each piece of data. I feel I am missing something very simple and obvious, but what it is I do not know.
import pandas
colNames = ["ID", "Name", "Gender", "Height", "Weight"]
df1 = pandas.DataFrame(columns = colNames)
df1.set_index("ID", inplace=True, drop=False)
i = df1.shape[0]
person = [{"ID":i},{"Name":"Jack"},{"Gender":"Male"},{"Height":177},{"Weight":75}]
df1 = df1.append(pandas.DataFrame(person, columns=colNames))
print(df1)
Output:
ID Name Gender Height Weight
0 0.0 NaN NaN NaN NaN
1 NaN Jack NaN NaN NaN
2 NaN NaN Male NaN NaN
3 NaN NaN NaN 177.0 NaN
4 NaN NaN NaN NaN 75.0
You are using too many curly braces. All of your data should be inside one pair of curly braces, which creates a single Python dictionary; as written, you build a list of five one-key dictionaries, and each becomes its own row. Change that line to:
person = [{"ID":i,"Name":"Jack","Gender":"Male","Height":177,"Weight":75}]
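With that fix the data lands as a single row. Note that DataFrame.append was removed in pandas 2.0, so on current pandas the same insert is spelled with pd.concat; a sketch (dropping the set_index step for brevity):

```python
import pandas as pd

colNames = ["ID", "Name", "Gender", "Height", "Weight"]
df1 = pd.DataFrame(columns=colNames)

i = df1.shape[0]
person = [{"ID": i, "Name": "Jack", "Gender": "Male", "Height": 177, "Weight": 75}]

# pd.concat replaces the removed DataFrame.append
df1 = pd.concat([df1, pd.DataFrame(person, columns=colNames)], ignore_index=True)
print(df1)
```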

combine row values in all consecutive rows that contain NaN and int values using pandas

I need your help:
I want to merge consecutive rows like this:
Input:
Time ColA ColB Time_for_test[sec]
2020-01-19 08:51:56.461 NaN B NaN
2020-01-19 08:52:15.405 NaN NaN 18.95
2020-01-19 08:52:40.923 A NaN NaN
2020-01-19 08:52:59.589 NaN NaN 18.67
2020-01-19 08:54:07.687 NaN B NaN
Output:
Time ColA ColB Time_for_test[sec]
2020-01-19 08:51:56.461 NaN B NaN
2020-01-19 08:52:15.405 NaN B 18.95
2020-01-19 08:52:40.923 A NaN NaN
2020-01-19 08:52:59.589 A NaN 18.67
2020-01-19 08:54:07.687 NaN B NaN
Of course, I checked whether similar cases had already been published on the site. I tried adding a new column like this:
merge_df = merge_df.fillNa(0)
merge_df['sum'] = merge_df['TableA']+merge_df['Time_for_ST[sec]'].shift(-1)
It did not work.
Thank you for your patience.
stack and unstack are your friends. Assuming your dataframe index is unique:
df[['ColA', 'ColB']].stack() \
    .reset_index(level=1) \
    .reindex(df.index) \
    .ffill() \
    .set_index('level_1', append=True) \
    .unstack() \
    .droplevel(0, axis=1)
Since it's one long operation chain, you can run just line 1, then lines 1-2, then lines 1-3, and so on, to see how it works.
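Here is a runnable sketch of the chain on a toy frame shaped like the question's data. The only addition is an explicit dropna() after stack(), since newer pandas versions no longer drop NaN entries there by default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"ColA": [np.nan, np.nan, "A", np.nan, np.nan],
     "ColB": ["B", np.nan, np.nan, np.nan, "B"]}
)

out = (
    df[["ColA", "ColB"]].stack().dropna()  # one (row, column) entry per value
      .reset_index(level=1)                # column name becomes a regular column
      .reindex(df.index)                   # restore the all-NaN rows
      .ffill()                             # carry last value and column forward
      .set_index("level_1", append=True)
      .unstack()
      .droplevel(0, axis=1)
)
print(out)
```

Rows 1 and 3, which were all-NaN, inherit the value and column of the preceding row, matching the desired output in the question.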

Pandas append returns DF with NaN values

I'm appending data from a list to pandas df. I keep getting NaN in my entries.
Based on what I've read I think I might have to mention the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
Calling append() in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc to specify an insertion point; the DataFrame index should be used.
For example:
import pandas as pd

df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
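The list-then-concatenate approach from the last paragraph looks like this; a sketch, but it avoids the repeated copying that per-row append (or per-row loc insertion) incurs:

```python
import pandas as pd

rows = []
for i in range(100):
    rows.append({"n": i})   # cheap: just building a list of dicts

df = pd.DataFrame(rows)     # one construction at the end
# equivalently: pd.concat([...]) for a list of small DataFrames
print(df.tail())
```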

Replace values in pandas column based on nan in another column

For pairs of columns, I want to replace the values in the second column with NaN wherever the value in the first column is NaN.
I have tried this without success:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['r', np.nan, np.nan, 's'],
                   'b': [0.5, 0.5, 0.2, 0.02],
                   'c': ['n', 'r', np.nan, 's'],
                   'd': [1, 0.5, 0.2, 0.05]})

listA = ['a', 'c']
listB = ['b', 'd']
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] == np.nan
df remains unchanged.
Another test using a function (also failed):
def Test(df):
    if df[color] == np.nan:
        return df[ratio] == np.nan
    else:
        return

for color, ratio in zip(listA, listB):
    df[ratio] = df.apply(Test, axis=1)
Thanks
It seems you have a typo; change == to =:
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] = np.nan

print(df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
Another solution uses mask, which by default replaces values with NaN wherever the mask is True:
for color, ratio in zip(listA, listB):
    df[ratio] = df[ratio].mask(df[color].isnull())

print(df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
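The mask approach also works without the Python loop if you pass the boolean condition as a bare array: mask would otherwise try to align the 'a'/'c' labels against the 'b'/'d' columns and match nothing. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['r', np.nan, np.nan, 's'],
                   'b': [0.5, 0.5, 0.2, 0.02],
                   'c': ['n', 'r', np.nan, 's'],
                   'd': [1, 0.5, 0.2, 0.05]})
listA, listB = ['a', 'c'], ['b', 'd']

# .to_numpy() strips the column labels so the mask applies positionally
df[listB] = df[listB].mask(df[listA].isnull().to_numpy())
print(df)
```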