scikit learn label encoding prints as row instead of column - scikit-learn

I am trying to do label encoding using scikit-learn's built-in LabelEncoder, but why does my result print as a row instead of an additional column?
from sklearn.preprocessing import LabelEncoder
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
y_data['coded_sklearn'] = labelencoder.fit_transform(y_data['coded_sklearn'])
print(y_data)
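There is no answer recorded in the thread, but one likely explanation: fit_transform returns a plain 1-D NumPy array, which prints horizontally on its own, while a DataFrame column always displays vertically. A minimal sketch with hypothetical sample labels:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical sample data, just to illustrate the display difference
y_data = pd.DataFrame({'label': ['cat', 'dog', 'cat']})
labelencoder = LabelEncoder()

encoded = labelencoder.fit_transform(y_data['label'])
print(encoded)        # 1-D array, prints as a row: [0 1 0]

y_data['coded_sklearn'] = encoded
print(y_data)         # the same values now display as a column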

Related

Combine numpy array with TfidfVectorizer as a joint feature matrix in SKLearn

I have a dataset input, which is a list of ~40,000 letters (represented as strings).
With SKLearn, I first used a TfidfVectorizer to create a TF-IDF matrix representation1:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sklearn.pipeline
vectorizer = TfidfVectorizer(lowercase=False)
representation1 = vectorizer.fit_transform(input) # TFIDF representation
Now, I want to manually add one feature representation2 for every letter. This feature should give the ratio of distinct words to all words in a specific letter/string:
count_vectorizer = CountVectorizer()
sum_words = np.sum(count_vectorizer.fit_transform(input).toarray(), axis=-1)
sum_different_words = np.count_nonzero(count_vectorizer.fit_transform(input).toarray(), axis=-1)
representation2 = np.divide(sum_different_words, sum_words) # percentage of different words
The array representation2 now has shape (39077,) (as expected). I now want to combine representation1 and representation2 into one feature matrix representation.
I read about using FeatureUnion to combine two kinds of features in SKLearn, but I am not sure how to correctly use the NumPy array representation2 as a feature here. I tried:
union = sklearn.pipeline.make_union([representation1, representation2])
But now I can't use e.g. union.get_feature_names_out(), since it throws: AttributeError: Transformer list (type list) does not provide get_feature_names_out.
What did I understand incorrectly here?
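No accepted answer is quoted here, but one likely misunderstanding: make_union expects transformer objects that it applies to the raw input, not already-computed arrays. Since both representations are already computed, they can simply be stacked side by side; a sketch, assuming representation1 is the sparse TF-IDF matrix and representation2 the (39077,) ratio array:
import scipy.sparse

# Reshape the 1-D ratio feature into a single column, then stack it
# next to the TF-IDF columns; the result stays a sparse matrix.
representation = scipy.sparse.hstack(
    [representation1, representation2.reshape(-1, 1)])
print(representation.shape)  # (39077, n_tfidf_features + 1)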

one hot encoder for categorical variables of more than one word

I have a dataset with an 'Item' column containing 313 distinct items. I want to do one-hot encoding on the 'Item' column for logistic regression, but I'm getting the error below. Can you please assist with how to resolve it?
Here is the code:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))
array(<1126x316 sparse matrix of type '<class 'numpy.float64'>'
with 4493 stored elements in Compressed Sparse Row format>, dtype=object)
Use this code, where df is the name of your dataframe:
import pandas as pd
df = pd.get_dummies(data = df, columns = ['Item'])
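If you would rather keep the ColumnTransformer, the object-dtype array usually comes from wrapping a sparse matrix in np.array(); a sketch of one way around it (assuming X holds the 'Item' column at position 0):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# sparse=False makes the encoder return a dense array directly, so
# np.array() is no longer needed (newer versions use sparse_output=False).
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(sparse=False), [0])],
    remainder='passthrough')
X = ct.fit_transform(X)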

How to Normalize or standardize specific or selected features of a data set in python

I have a data frame named Table. Table contains 15 features, and I want to normalize only the 3 numeric features, named 'rate', 'cost', and 'Total cost'. Please, how do I fix this?
I tried to extract the required features by filtering them using
Table.loc[:, ['rate', 'cost', 'Total cost']] and passing the result to column_trans:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
column_trans = ColumnTransformer(
[('scaler', StandardScaler(),Table.loc[:,['rate','cost','Totalcost']]
remainder='passthrough')
column_trans.fit_transform(X)
I expected to get values between 0 and 1 for the normalized features.
But I got the following error message.
File "", line 5
remainder='passthrough')
^
SyntaxError: invalid syntax
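The SyntaxError itself most likely comes from the unclosed tuple and list before the remainder argument; ColumnTransformer also expects column names rather than a sliced DataFrame. A corrected sketch, assuming the column names from the question:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Close the (name, transformer, columns) tuple and the list, and pass
# the column names themselves instead of Table.loc[...].
column_trans = ColumnTransformer(
    [('scaler', StandardScaler(), ['rate', 'cost', 'Totalcost'])],
    remainder='passthrough')
scaled = column_trans.fit_transform(Table)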
@Parthasarathy, I noticed that one of the features had NaN values
and another feature was an integer, so I converted the NaN values to 0 and also applied astype to the integer feature. I applied the code below:
from sklearn.preprocessing import normalize
continuous_columns = ['rate','cost','Totalcost']
continuous_data= Telco[continuous_columns]
continuous_data['rate']= continuous_data['rate'].astype(float)
normalized_data = normalize(continuous_data)
You could try this:
import pandas as pd
from sklearn.preprocessing import normalize

continuous_columns = ['rate', 'cost', 'Totalcost']
continuous_data = Table.loc[:, continuous_columns]
continuous_data['rate'] = continuous_data['rate'].astype(float)
continuous_data['cost'] = continuous_data['cost'].astype(float)
continuous_data['Totalcost'] = continuous_data['Totalcost'].astype(float)
normalized_data = normalize(continuous_data)
normalized_data_df = pd.DataFrame(normalized_data, columns=continuous_columns)
Table = Table.drop(continuous_columns, axis=1)
Final_data = pd.concat([Table, normalized_data_df], axis=1)
Now Final_data contains what you are looking for.
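Worth noting: sklearn's normalize scales each sample (row) to unit norm by default, not each feature, so it will not give per-feature values between 0 and 1 as the question expects. MinMaxScaler may be closer to that intent; a minimal sketch, applied to the original Table before the columns are dropped:
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler rescales each column independently to [0, 1],
# matching the "values between 0 and 1" expectation above.
scaler = MinMaxScaler()
Table[continuous_columns] = scaler.fit_transform(
    Table[continuous_columns].astype(float))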

Why am I getting a score of 0.0 when finding the score of test data using Gaussian NB classifier?

I have two different data sets: one for training my classifier and the other for testing. Both datasets are text files with two columns separated by a ",". The first column ('numbers') is the independent variable and the second column ('group') is the dependent variable.
Training data set
(just a few lines as an example; there are no empty lines between rows):
EMI3776438,1
EMI3776438,1
EMI3669492,1
EMI3752004,1
Testing data set
(as you can see, I have picked data from the training set so that the score surely can't be zero):
EMI3776438,1
Code in Python 3.6:
# all the import statements have been ignored to keep the code short
# loading the training data set
training_file_path = r'C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\modified_columns.txt'
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
training_file_data = pandas.read_table(training_file_path,
                                       header=None,
                                       names=['numbers', 'group'],
                                       sep=',')
training_file_data = training_file_data.apply(le.fit_transform)
features = ['numbers']
x = training_file_data[features]
y = training_file_data["group"]
from sklearn.model_selection import train_test_split
training_x, testing_x, training_y, testing_y = train_test_split(x, y,
                                                                random_state=0,
                                                                test_size=0.1)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(training_x, training_y)
# loading the testing data
testing_final_path = r"C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\testing_final.txt"
testing_sample_data = pandas.read_table(testing_final_path,
                                        sep=',',
                                        header=None,
                                        names=['numbers', 'group'])
testing_sample_data = testing_sample_data.apply(le.fit_transform)
category = ["numbers"]
testing_sample_data_x = testing_sample_data[category]
# finding the score of the test data
print(gnb.score(testing_sample_data_x, testing_sample_data["group"]))
First, the data samples above don't show how many classes there are. You need to describe that in more detail.
Secondly, you are calling le.fit_transform again on the test data, which discards all the string-to-number mappings learned from the training samples. The LabelEncoder le will start encoding the test data again from scratch, and the encoding will not match how the training data was mapped. So the input to GaussianNB is incorrect, and hence the results are incorrect.
Change that to:
testing_sample_data = testing_sample_data.apply(le.transform)
UPDATE:
I'm sorry, I overlooked the fact that you had two columns in your data. LabelEncoder only works on a single column of data. For making it work on multiple pandas columns at once, look at the answers to the following question:
Label encoding across multiple columns in scikit-learn
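For reference, one common workaround along those lines keeps a separate encoder per column, so the learned mappings can be reused on the test data; a hedged sketch:
from sklearn.preprocessing import LabelEncoder

# One LabelEncoder per column, kept around so the same mapping can be
# applied to the test data later with .transform().
encoders = {col: LabelEncoder() for col in training_file_data.columns}
for col, col_enc in encoders.items():
    training_file_data[col] = col_enc.fit_transform(training_file_data[col])
for col, col_enc in encoders.items():
    testing_sample_data[col] = col_enc.transform(testing_sample_data[col])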
If you are using the latest version of scikit-learn (0.20) or can update to it, then you would not need any such hacks and can directly use the OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
training_file_data = enc.fit_transform(training_file_data)
And during testing:
testing_sample_data = enc.transform(testing_sample_data)
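A minimal self-contained illustration of the fit-once, transform-later pattern, with values borrowed from the samples above:
from sklearn.preprocessing import OrdinalEncoder

train = [['EMI3776438', '1'], ['EMI3669492', '1'], ['EMI3752004', '1']]
test = [['EMI3776438', '1']]  # drawn from the training data

enc = OrdinalEncoder()
enc.fit(train)               # learn the string-to-number mapping once
print(enc.transform(test))   # reuse the same mapping at test time: [[2. 0.]]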

Missing data Prediction

I have the Jester data, which has ratings of 100 movies given by 24,983 users, and the data has lots of missing values. My job is to predict them.
I want to start with a Decision Tree.
I'm thinking that first I will select the first column of the data (it has the first movie's ratings) and then delete that column from the data. Then I will fit the model, and finally I will find the prediction probabilities for the first column (which was deleted from the data).
I'm working in Python.
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier
df = pd.read_excel(input_file, header=None)
matrix = df.as_matrix()
imp = Imputer(missing_values=99, strategy='mean', axis=0)
imp.fit(matrix)
matrix= imp.transform(matrix)
train_data = matrix[:, :90]  # training data (first 90 columns)
test_data = matrix[:, 90:]   # test data (last 10 columns, ~10%)
array2 = train_data.copy()
column = array2[:, 0]                  # select the first column (the target to predict)
array2 = np.delete(array2, 0, axis=1)  # delete the first column from the features
clf = RandomForestClassifier(n_estimators=25)
clf.fit(array2.astype(int), column.astype(int))
clf_probs = clf.predict_proba(column)
My last line gives this error -> ValueError: Number of features of the model must match the input. Model n_features is 89 and input n_features is 24983
I have to predict the column as I described above the code.
What should I do? I really need help.
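No answer is recorded in the thread, but the error message itself points at the cause: predict_proba received the 1-D target column (24983 values, read as 24983 features) instead of an 89-column feature matrix. A sketch of the likely intent:
# predict_proba expects the same 89 feature columns the model was
# fitted on; the deleted target column is not a valid input.
clf_probs = clf.predict_proba(array2.astype(int))
print(clf_probs.shape)  # (24983, n_classes)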
