Getting the MAE of each column in a pandas DataFrame against the last column - python-3.x

How do I compute the MAE between each column of a pandas DataFrame and the last column? Here is the data:
,CPFNN,EN,Blupred,Horvath2,EPM,vMLP,Age
202,4.266596,3.5684403102704,5.2752761330328,5.17705043941232,3.30077613485548,3.412883,4.0
203,5.039452,5.1258136685894,4.40019825995985,5.03563327742846,3.97465334472661,4.140719,4.0
204,5.0227585,5.37207428128756,1.56392554883583,4.41805439337257,4.43779809822224,4.347523,4.0
205,4.796998,5.61052306552109,4.20912233479662,3.57075401779518,3.24902718889411,3.887743,4.0
I have a pandas DataFrame and I want to create a list with the MAE of each prediction column against "Age".
Is there a "pandas" way of doing this instead of just running a for loop over the columns? This works for one column:
from sklearn.metrics import mean_absolute_error as mae
mae(blood_bestpred_df["CPFNN"], blood_bestpred_df['Age'])
I'd like to do this:
mae(blood_bestpred_df[["CPFNN", "EN", "Blupred", "Horvath2", "EPM", "vMLP"]], blood_bestpred_df['Age'])
But I have a dimension issue.

Looks like sklearn's MAE requires both inputs to have the same shape and doesn't do any broadcasting (I'm not an sklearn expert; there might be another way around this). You can use plain pandas instead:
import pandas as pd
df = pd.read_clipboard(sep=",", index_col=0) # Your df here
# Subtract Age from every other column row-wise, take absolute differences,
# then average each column.
out = df.drop(columns="Age").sub(df["Age"], axis=0).abs().mean()
out:
CPFNN 0.781451
EN 1.134993
Blupred 1.080168
Horvath2 0.764996
EPM 0.478335
vMLP 0.296904
dtype: float64
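If you would rather keep sklearn's metric, here is a small sketch of the same computation expressed with .apply — still a per-column loop under the hood, just without writing it out. It produces the same Series as above:
from sklearn.metrics import mean_absolute_error as mae

# Apply sklearn's MAE to each prediction column against "Age".
out = df.drop(columns="Age").apply(lambda col: mae(df["Age"], col))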

Related

one hot encoder for the categorical variables of more than one word

I have a dataset like the one below. I want to one-hot encode the 'Item' column for logistic regression. There are 313 distinct items in the 'Item' column, and I'm getting the error below. Can you please assist with how to resolve it?
Here is the code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))
The result is:
array(<1126x316 sparse matrix of type '<class 'numpy.float64'>'
      with 4493 stored elements in Compressed Sparse Row format>, dtype=object)
Use this code, where df is the name of your dataframe:
import pandas as pd
df = pd.get_dummies(data = df, columns = ['Item'])
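If you want to stay with ColumnTransformer instead, here is a hedged sketch: asking the encoder for dense output avoids ending up with an object array wrapping a sparse matrix. (The parameter is sparse_output from scikit-learn 1.2 onward, sparse in older versions; X is your feature array as in the question.)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# With dense output, fit_transform returns a plain ndarray, so no
# np.array() wrapper around a sparse matrix is needed.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(sparse_output=False), [0])],
    remainder='passthrough')
X = ct.fit_transform(X)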

shuffle Numpy Array and pandas dataframe in unison

I have a pandas dataframe of shape (18837349,2000) and a 3D Numpy Array of shape (18837349,6,601). I want to shuffle the rows of my dataframe and the first dimension of my Numpy Array in unison. I know how to shuffle a dataframe:
df_shuffle = df.sample(frac=1).reset_index(drop=True)
But I don't know how to do it together with a 3D Numpy Array. Insights will be appreciated.
You can shuffle an index array and use it to reorder both objects:
import numpy as np

ix = np.arange(18837349)   # one index entry per row
np.random.shuffle(ix)      # shuffle the index in place
df_shuffle = your_df.iloc[ix].reset_index(drop=True)
array_shuffle = your_array[ix]
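A variation on the same idea with a seeded generator, so the shuffle is reproducible across runs (np.random.permutation returns a fresh shuffled index instead of shuffling in place):
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed: same shuffle every run
ix = rng.permutation(len(your_df))    # shuffled row index, built in one call
df_shuffle = your_df.iloc[ix].reset_index(drop=True)
array_shuffle = your_array[ix]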

log transformation of whole dataframe using numpy

I have a dataframe in Python, created using the following code:
import pandas as pd
df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
Now df1 has 24 columns. I would like to transform the values to their log2 values and make a new dataframe with 24 columns holding the log values of the original dataframe. To do so I used numpy.log, like this:
df2 = (numpy.log(df1))
This code does not return what I would like to get. Do you know how to fix it?
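One likely explanation, offered as a hedged guess: numpy.log is the natural (base-e) logarithm, while the question asks for log2 values, which numpy.log2 computes. A minimal sketch, assuming df1 holds positive numeric values:
import numpy as np

# np.log is base e; np.log2 gives the base-2 values the question asks for.
df2 = np.log2(df1)   # same shape and column labels as df1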

how to create a new column with dense vectors in pyspark table using Pandas UDF?

My table is stored in PySpark on Databricks. The table has two columns, id and text. I am trying to get a dense vector for the text column. I have an ML model that generates a dense representation of the text, to go into a new column called dense_embedding. The model returns a numpy array for an input text and is called like model.encode(text_input). I want to use this model to generate the dense representation for every row of the text column.
Here is what I did:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import *
import pandas as pd

# Use pandas_udf to define a Pandas UDF
@pandas_udf('???', PandasUDFType.SCALAR)
# Input/output are text and dense vector
def embedding(v):
    return Vectors.dense(model.encode([v]))

small.withColumn('dense_embedding', embedding(small.text))
I am not sure what data type I should put into the pandas_udf decorator. Is it correct to convert to a dense vector the way I did?
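A minimal sketch of one possible approach, under stated assumptions rather than a definitive answer: as far as I know, a pandas UDF cannot return VectorUDT directly, so one option is to return an array of doubles and convert to a dense vector afterwards if a Vector is really needed. This assumes model is available on the executors and model.encode returns one numpy row vector per input:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, DoubleType
import pandas as pd

# A scalar pandas UDF receives a pandas Series of strings and must return a
# Series of equal length; a Python list of floats maps to ArrayType(DoubleType()).
@pandas_udf(ArrayType(DoubleType()))
def embedding(texts: pd.Series) -> pd.Series:
    return texts.map(lambda t: model.encode([t])[0].astype(float).tolist())

small = small.withColumn('dense_embedding', embedding(small.text))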

Missing data Prediction

I have the Jester data: it has 100 movies and their ratings, given by 24,983 users, and the data has lots of missing values. My job is to predict them.
I want to start with a decision tree.
My thinking: first I will select the first column of the data (the ratings of the first movie) and then delete that column from the data. Then I will fit the model, and finally I will find the prediction probabilities for the first column (the one deleted from the data).
I'm working in Python:
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier
df = pd.read_excel(input_file, header=None)
matrix = df.as_matrix()   # note: as_matrix() is df.to_numpy() in newer pandas
imp = Imputer(missing_values=99, strategy='mean', axis=0)   # 99 marks a missing rating
imp.fit(matrix)
matrix = imp.transform(matrix)
train_data = matrix[:,:90]   # train data: first 90 columns
test_data = matrix[:,90:]    # test data: last 10 columns (10%)
array2 = train_data.copy()
column = array2[:,0]                  # select the first column (the prediction target)
array2 = np.delete(array2,0,axis=1)   # delete the first column from the features
clf = RandomForestClassifier(n_estimators=25)
clf.fit(array2.astype(int), column.astype(int))
clf_probs = clf.predict_proba(column)
The last line gives this error -> ValueError: Number of features of the model must match the input. Model n_features is 89 and input n_features is 24983
I have to predict the column as described above the code.
What should I do? I really need help.
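A hedged sketch of the immediate fix: the forest was fit on 89 feature columns, so predict_proba must also receive a matrix with those same 89 columns (one row per user), not the 24,983-element target column itself:
# Predict the held-out first column from the other 89 columns: pass the same
# feature matrix the model was trained on (shape: n_users x 89).
clf_probs = clf.predict_proba(array2.astype(int))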
