How to compute the MAE between each column of a pandas dataframe and its last column:
,CPFNN,EN,Blupred,Horvath2,EPM,vMLP,Age
202,4.266596,3.5684403102704,5.2752761330328,5.17705043941232,3.30077613485548,3.412883,4.0
203,5.039452,5.1258136685894,4.40019825995985,5.03563327742846,3.97465334472661,4.140719,4.0
204,5.0227585,5.37207428128756,1.56392554883583,4.41805439337257,4.43779809822224,4.347523,4.0
205,4.796998,5.61052306552109,4.20912233479662,3.57075401779518,3.24902718889411,3.887743,4.0
I have a pandas dataframe and I want to create a list with the MAE value of each column against "Age".
Is there a "pandas" way of doing this instead of a for loop over each column?
from sklearn.metrics import mean_absolute_error as mae
mae(blood_bestpred_df["CPFNN"], blood_bestpred_df['Age'])
I'd like to do this:
mae(blood_bestpred_df[["CPFNN,EN,Blupred,Horvath2,EPM,vMLP"]], blood_bestpred_df['Age'])
But I have a dimension issue.
Looks like sklearn's MAE requires both inputs to be the same shape and doesn't do any broadcasting (I'm not an sklearn expert, there might be another way around this). You can use raw pandas instead:
import pandas as pd
df = pd.read_clipboard(sep=",", index_col=0) # Your df here
out = df.drop(columns="Age").sub(df["Age"], axis=0).abs().mean()
out:
CPFNN 0.781451
EN 1.134993
Blupred 1.080168
Horvath2 0.764996
EPM 0.478335
vMLP 0.296904
dtype: float64
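If you'd rather stay within sklearn, one workaround (a sketch, assuming the column names from your question) is to tile the Age column so y_true matches the shape of the prediction matrix, then request one score per column with multioutput="raw_values":

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error as mae

pred_cols = ["CPFNN", "EN", "Blupred", "Horvath2", "EPM", "vMLP"]
# Repeat "Age" once per prediction column so both inputs share a shape
y_true = np.column_stack([df["Age"]] * len(pred_cols))
out = pd.Series(mae(y_true, df[pred_cols], multioutput="raw_values"),
                index=pred_cols)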
I have a dataset like the one below, and I want to one-hot encode the 'Item' column for logistic regression. There are 313 distinct items in the 'Item' column. I'm getting the error below. Can you please assist me in resolving it?
Here is the code:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))
array(<1126x316 sparse matrix of type '<class 'numpy.float64'>'
with 4493 stored elements in Compressed Sparse Row format>, dtype=object)
Use this code, where df is the name of your dataframe:
import pandas as pd
df = pd.get_dummies(data = df, columns = ['Item'])
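Alternatively, if you want to keep the ColumnTransformer: the odd output above comes from calling np.array() on a sparse matrix, which wraps it in a 0-d object array. Asking the encoder for dense output avoids that (a sketch; the keyword is sparse=False on scikit-learn versions before 1.2 and sparse_output=False from 1.2 onward):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Dense output: no sparse matrix for np.array() to wrap in an object array
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(sparse_output=False), [0])],
    remainder='passthrough')
X = ct.fit_transform(X)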
I am looking for a single vector with the values [(0:400) (-400:-1)] (MATLAB-style ranges).
Can anyone help me write this in Python?
Use NumPy's .arange to generate each range and .concatenate to join them into a single flat vector:
import numpy as np

# 0..400 followed by -400..-1, matching MATLAB's [(0:400) (-400:-1)]
arr = np.concatenate([np.arange(401), np.arange(-400, 0)])
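Equivalently, np.r_ builds the same vector from slice notation (note the end bounds are exclusive, hence 401 and 0):

import numpy as np

# Same result as the concatenate version above
arr = np.r_[0:401, -400:0]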
Convert a 3D numpy array to a pandas dataframe with numpy arrays as elements.
Are there any other solutions? What about speed?
import numpy as np
import pandas as pd
ones = np.ones((2,3,5))
temp = [[np.array(column_elem, dtype=object) for column_elem in row] for row in ones]
df = pd.DataFrame(temp)
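As for speed, one alternative (a sketch) avoids the per-element np.array copies by storing views of the original array; each cell is still a length-5 ndarray:

rows, cols, _ = ones.shape
# Each cell holds the view ones[i, j] of shape (5,); no copies are made
df = pd.DataFrame([[ones[i, j] for j in range(cols)] for i in range(rows)])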
I am looking for the best way to compute many dask delayed objects stored in a dataframe. I am unsure whether the pandas dataframe should be converted to a dask dataframe with the delayed objects inside, or whether compute should be called on all values of the pandas dataframe.
I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed objects across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute

steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]

enr_df = pd.DataFrame()
for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        # Build the task graph lazily; nothing is evaluated yet
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells, e.g. enr_df.applymap(compute) (which, I believe, calls compute on each value individually).
However, if I convert to a dask dataframe, the delayed objects I want to compute are buried inside the dask dataframe structure:
import dask.dataframe as dd

enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
and the computation does not produce the output I expect.
You can pass a list of delayed objects into dask.compute:
import dask

results = dask.compute(*list_of_delayed_objects)
So you need to get a list out of your pandas dataframe, which you can do with normal Python code.
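For example, a sketch using the enr_df built above: flatten the cells into one list, evaluate everything in a single graph, and reshape the results back into a frame of the same shape:

import dask
import numpy as np
import pandas as pd

# One compute call evaluates all tasks in a single shared graph
flat = list(enr_df.to_numpy().ravel())
results = dask.compute(*flat)
computed = pd.DataFrame(np.reshape(results, enr_df.shape),
                        index=enr_df.index, columns=enr_df.columns)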