ValueError: could not convert string to float in pandas - python-3.x

My code is:
import pandas as pd
data = pd.read_table('train.tsv')
X=data.Phrase
Y=data.Sentiment
from sklearn import cross_validation
X_train,X_test,Y_train,Y_test=cross_validation.train_test_split(X,Y,test_size=0.2,random_state=0)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X,Y)
I get the error: ValueError: could not convert string to float:
What changes can I make so that my code works?

You can't pass raw text data into scikit-learn's MultinomialNB, as stated in its documentation.
None of the algorithms in scikit-learn work directly on text data; you need to preprocess it first. Extract numeric features from the text using techniques such as bag-of-words counts or tokenization, as in the sketch below. Have a look at this link for a better understanding.
You might also want to look at NLTK for use cases like yours.
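For example, a minimal sketch using CountVectorizer (the split here uses the current model_selection module; X and Y are the Phrase and Sentiment columns from the question):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Turn the raw phrases into token-count features that MultinomialNB accepts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(
    X_counts, Y, test_size=0.2, random_state=0)

clf = MultinomialNB()
clf.fit(X_train, Y_train)
print(clf.score(X_test, Y_test))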

ValueError when using Multinomial Naive Bayes classifier
You probably should preprocess your data as shown in the answer above.

Related

Feature importance/selection per class. How?

I followed this scikit-learn guide to find feature importance for a classification problem. Here's the code from the link:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_
The problem is that this isn't actually what I want: I'd like to discover feature importance per class.
One idea that comes to mind is to turn the data into a binary classification problem per class and train a DecisionTree per class.
Is that a good approach? What are common ideas to deal with this problem?
Thanks!
Yes, one-vs-all classification is a common way of dealing with that issue. You could take that approach. While I don't think there is a principled way of obtaining class-specific feature importance for random forests, you could use the SHAP package to get Shapley values empirically.
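As a minimal sketch of the one-vs-rest idea on the iris data from the question (binarize the target per class and read off each model's importances):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_iris(return_X_y=True)

# Train one forest per class on a binarized target; each model's
# feature_importances_ then reflects that single class.
for cls in np.unique(y):
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    clf.fit(X, (y == cls).astype(int))
    print(cls, clf.feature_importances_)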

Python XGBoost prediction discrepancies with DMatrix

I found there are two problems with xgboost predictions. I trained the model with XGBClassifier, then tried to load the model using Booster for prediction, and found:
Predictions are slightly different using xgb.Booster and xgb.XGBClassifier, see below.
Predictions are different between a list and a numpy array when using DMatrix, see below.
Some differences are quite big. I am not sure why this is happening, or which prediction should be the source of truth.
For the second question, your data types can change when you convert a list to a numpy array (depending on the numpy version you're using). For example, on numpy 1.19.5, try converting the list ["1", 1] to a numpy array and see the result.
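A quick illustration of that coercion (the exact string dtype width can vary across numpy versions):
import numpy as np

# numpy promotes the mixed list to a common dtype: everything becomes
# a string, so the numeric 1 silently turns into '1'.
arr = np.array(["1", 1])
print(arr, arr.dtype)  # ['1' '1'] <U21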

Distribution of time series data in TensorFlow

I currently have a kdb+ database with ~1mil rows of financial tick data. Using Python3, TensorFlow, and numpy, what is the best way to break up time-series financial data into train/dev/test sets?
This paper suggests the use of k-fold cross-validation, which partitions the data into complementary subsets. But it's from spring 2014, and after reading it I'm still unclear on how to implement it in practice. Is this the best solution, or is something like hold-out validation more appropriate for financial data?
I'm also interested in learning best practices for importing locally stored time-series data into my TensorFlow model.
Thank you.
One can use qPython to load the data into the Python process and then KFold from sklearn to repeatedly split the data set into training and test parts.
Suppose we have the following table defined on the KDB+ side:
t:([] time:.z.t+til 30;ask:100.0+30?1.0;bid:98+30?1.0)
Then on the Python side you can do the following to produce indices of the train/test splits:
from qpython import qconnection
import pandas as pd
from sklearn.model_selection import KFold

# Pull the table from the kdb+ process as a pandas DataFrame.
with qconnection.QConnection(host='localhost', port=5001, pandas=True) as q:
    X = q.sync('t')

kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
See the KFold documentation for other variants of KFold.
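Since this is ordered financial tick data, scikit-learn's TimeSeriesSplit may be a better fit than plain KFold: each test fold comes strictly after its training fold, so the model never trains on the future. A minimal sketch reusing X from above:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)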

Use gensim Random Projection in sklearn SVM

Is it possible to use a gensim Random Projection to train a SVM in sklearn?
I need to use gensim's tfidf implementation because it's better at dealing with large inputs and then want to put that into a random projection on which I will train my SVM. I'd also be happy to just pass the tfidf model generated by gensim to sklearn and use their random projection, if that makes things easier.
But so far I haven't found a way to get either model out of gensim into sklearn.
I have tried using gensim.matutils.corpus2csc, but of course that doesn't work: neither TfidfModel nor RpModel are corpora, so now I'm clueless about what to try next.
This is now very easy thanks to an awesome gensim contribution from Chinmaya Pancholi (see post here).
Simply import the sklearn wrapper from gensim:
from gensim.sklearn_api import RpTransformer
Then you can use the model in a pipeline as you would any other sklearn transformer:
from sklearn import svm
from sklearn.pipeline import Pipeline

model = RpTransformer(num_topics=2)
clf = svm.SVC()
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(X_train, y_train)
One thing to be aware of when using the gensim models is that you still need to perform the dictionary and corpus steps. So instead of fitting your model on X_train, you'll have to do something along the following lines:
from gensim.corpora import Dictionary

dictionary = Dictionary(X_train)
corpus_train = [dictionary.doc2bow(text) for text in X_train]
corpus_test = [dictionary.doc2bow(text) for text in X_test]
Then fit/predict your model on corpus_train or corpus_test.
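Putting it together, a minimal end-to-end sketch (the token lists here are toy placeholders standing in for your tokenized documents):
from gensim.corpora import Dictionary
from gensim.sklearn_api import RpTransformer
from sklearn import svm
from sklearn.pipeline import Pipeline

# Toy tokenized documents; real data would be lists of tokens per document.
X_train = [["cheap", "flights", "deal"], ["machine", "learning", "course"]]
y_train = [0, 1]

dictionary = Dictionary(X_train)
corpus_train = [dictionary.doc2bow(text) for text in X_train]

pipe = Pipeline([('features', RpTransformer(num_topics=2)),
                 ('classifier', svm.SVC())])
pipe.fit(corpus_train, y_train)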

name 'classification_model' is not defined

I'm trying to build a model in Python 3.5 and am following an example that can be found here.
I have imported all the required libraries from sklearn.
However, I'm getting the following error.
Code:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, loan,predictor_var,outcome_var)
When I run the above code I get the following error:
NameError: name 'classification_model' is not defined
I'm not sure how to resolve this, as I tried importing sklearn and all its sub-libraries.
P.S. I'm new to Python, hence I'm trying to figure out the basic steps.
Depending on the exact details this may not be what you want, but I have never had a problem with:
import sklearn.linear_model as sk

logreg = sk.LogisticRegressionCV()
# Select the feature columns and target column from the question's DataFrame.
logreg.fit(loan[predictor_var], loan[outcome_var])
This means you have to explicitly separate your training and test sets, but having fit to a training set (the final line of my code above), you can then use the methods detailed in the documentation [1].
For example, you can find out what score (the fraction of predictions you got correct) you get on unseen data with the .score method.
[1] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
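For instance, a sketch with a hypothetical hold-out split (loan is the DataFrame from the question):
from sklearn.model_selection import train_test_split
import sklearn.linear_model as sk

# Hold out 20% of the data to score on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    loan[predictor_var], loan[outcome_var], test_size=0.2, random_state=0)

logreg = sk.LogisticRegressionCV()
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))  # fraction of correct predictions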
It appears this code came from this tutorial.
The issue is exactly as the error describes. classification_model is currently undefined. You need to create this function yourself before you can call it. Check out this part of that tutorial so you can see how it's defined. Good luck!
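For reference, here is a rough sketch of what such a helper could look like; this is an illustrative reconstruction under the tutorial's apparent signature, not its exact code, and it assumes a pandas DataFrame input:
from sklearn.model_selection import KFold
from sklearn import metrics

def classification_model(model, data, predictors, outcome):
    # Fit on the full data and report training accuracy.
    model.fit(data[predictors], data[outcome])
    predictions = model.predict(data[predictors])
    print("Accuracy: {:.3f}".format(
        metrics.accuracy_score(data[outcome], predictions)))

    # k-fold cross-validation for a less optimistic estimate.
    kf = KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in kf.split(data):
        model.fit(data[predictors].iloc[train_idx],
                  data[outcome].iloc[train_idx])
        scores.append(model.score(data[predictors].iloc[test_idx],
                                  data[outcome].iloc[test_idx]))
    print("Cross-validation score: {:.3f}".format(sum(scores) / len(scores)))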
