To run a Naive Bayes classifier on about 400 MB of text data, I need to use a vectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(X_data)
But it gives an out-of-memory error. I am using 64-bit Linux and a 64-bit Python build. How do people work through the vectorization process in scikit-learn for large text datasets?
Traceback (most recent call last):
File "ParseData.py", line 234, in <module>
main()
File "ParseData.py", line 211, in main
classifier = MultinomialNB().fit(X_train, y_train)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 313, in fit
Y = labelbin.fit_transform(y)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
neg_label=self.neg_label)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
Y = np.zeros((len(y), len(classes)), dtype=np.int)
Edited (ogrisel): I changed the title from "Out of Memory Error in Scikit Vectorizer" to "Out of Memory Error in Scikit-learn MultinomialNB" to make it more descriptive of the actual problem.
Let me summarize the outcome of the discussion in the comments:
The label preprocessing machinery used internally in many scikit-learn classifiers does not scale well memory-wise with respect to the number of classes. This is a known issue and there is ongoing work to tackle it.
The MultinomialNB class itself will probably not be suitable for classification in a label space with cardinality 43K, even once the label preprocessing limitation is fixed.
To address the large-cardinality classification problem, you could try:
fit binary SGDClassifier(loss='log', penalty='elasticnet') instances independently on the columns of y_train converted to numpy arrays, then call clf.sparsify(), and finally wrap those sparse models as a final one-vs-rest classifier (or rank the predictions of the binary classifiers by probability). Depending on the value of the regularization parameter alpha, you might get sparse models that are small enough to fit in memory. You can also try to do the same with LogisticRegression, that is, something like:
clf_label_i = LogisticRegression(penalty='l1').fit(X_train, y_train[:, label_i].toarray()).sparsify()
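A rough sketch of that per-label loop (the data below is synthetic, alpha is purely illustrative, and note that newer scikit-learn versions spell the loss 'log_loss' rather than 'log'):

import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier

# Synthetic stand-ins for the real data: sparse TF-IDF-like features and
# a dense 0/1 label indicator matrix (only 50 labels here for brevity).
X_train = sp.rand(1000, 5000, density=0.01, format='csr', random_state=0)
rng = np.random.RandomState(0)
Y_train = (rng.rand(1000, 50) < 0.05).astype(int)

label_models = []
for label_i in range(Y_train.shape[1]):
    clf = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-4)
    clf.fit(X_train, Y_train[:, label_i])
    clf.sparsify()  # store coef_ as a sparse matrix to keep the model small
    label_models.append(clf)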
alternatively, try to do a PCA of the target labels y_train, then cast your classification problem as a multi-output regression problem in the reduced label PCA space, and finally decode the regressor's output by looking for the nearest class encoding in the label PCA space.
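A hedged sketch of that idea, reusing the synthetic X_train / Y_train from the snippet above (TruncatedSVD stands in for PCA since it accepts the label indicator matrix directly; the component count is arbitrary):

import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

# Compress the label space, regress into it, then decode predictions.
svd = TruncatedSVD(n_components=10, random_state=0)
Y_reduced = svd.fit_transform(Y_train.astype(float))  # (n_samples, 10)

reg = Ridge().fit(X_train, Y_reduced)  # multi-output regression

X_test = sp.rand(10, 5000, density=0.01, format='csr', random_state=1)
Y_pred_reduced = reg.predict(X_test)   # (n_test, 10)

# Each class's encoding is the projection of its one-hot label vector,
# i.e. a row of svd.components_.T; pick the nearest encoding per sample.
class_encodings = svd.components_.T    # (n_labels, 10)
dists = ((Y_pred_reduced[:, None, :] - class_encodings[None, :, :]) ** 2).sum(-1)
predicted_labels = dists.argmin(axis=1)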
You can also have a look at Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification, implemented in lightning, but I am not sure it is suitable for label cardinality 43K either.
Related
I'm new to the machine learning domain, and I have some doubts about linear regression.
1. While practicing with the scikit-learn linear regression model's predict method, I get the error below.
Code:
sklearn.linear_model.LinearRegression.predict(25)
Error:
"ValueError: Expected 2D array, got scalar array instead: array=25. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
Do I need to pass a 2-D array? I checked the sklearn documentation page and haven't found anything about a version update.
Running my code on Kaggle:
https://www.kaggle.com/aman9d/bikesharingdemand-upx/
2. Is the index of the dataset going to affect the model's score (weights)?
First of all, you should post the code as you actually use it:
# import, instantiate, fit
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)
# use the predict method
linreg.predict(25)
What you posted in the question is not executable as written; predict is not a static method of the LinearRegression class.
When you fit a model, the first step is to establish what kind of data the input will be; in your case it will look like X. That means that if you pass the model something with a shape different from X's, it will raise an error.
In your example, X seems to be a pd.DataFrame instance with only one column. It is replaceable by any 2-dimensional array of shape (number of samples, number of features), so if you try:
linreg.predict([[25]])
should work.
For example, if you were fitting a regression with more than one feature (i.e., column), say temp and humidity, your input would look like this:
linreg.predict([[25, 56]])
I hope this helps; always keep in mind what the shape of your data is.
Documentation: LinearRegression fit
X : array-like or sparse matrix, shape (n_samples, n_features)
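To make the shape requirement concrete, here is a minimal self-contained sketch with made-up numbers (the temp values and targets are invented for illustration):

import pandas as pd
from sklearn.linear_model import LinearRegression

# One feature ('temp'), so X has shape (n_samples, 1).
X = pd.DataFrame({'temp': [10, 15, 20, 30]})
y = [100, 150, 200, 300]

linreg = LinearRegression()
linreg.fit(X, y)

# predict expects a 2-D array: one row per sample, one column per feature.
print(linreg.predict([[25]]))  # -> [250.]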
I have created a binary classifier in TensorFlow that outputs a generator object containing predictions. I extract the predictions (e.g. [0.98, 0.02]) from the object into a list, later converting this into a numpy array. I have the corresponding array of labels for these predictions. Using these two arrays, I believe I should be able to plot a ROC curve via:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(labels, predictions[:,1])
plt.plot(fpr, tpr)
plt.show()
print(fpr)
print(tpr)
print(thr)
where predictions[:,1] gives the positive prediction score. However, running this code produces only a flat line and only three values each for fpr, tpr and thr:
[Figure: flat-line ROC plot and the three-value roc_curve outputs]
The only theory I have as to why this is happening is that my classifier is too sure of its predictions. Many, if not all, of the positive prediction scores are 1.0, or incredibly close to zero:
[[9.9999976e-01 2.8635742e-07]
[3.3693312e-11 1.0000000e+00]
[1.0000000e+00 9.8642090e-09]
...
[1.0106111e-15 1.0000000e+00]
[1.0000000e+00 1.0030269e-09]
[8.6156778e-15 1.0000000e+00]]
According to a few sources including this stackoverflow thread and this stackoverflow thread, the very polar values of my predictions could be creating an issue for roc_curve().
Is my intuition correct? If so, is there anything I can do about it so I can plot my ROC curve?
I've tried to include all the information I think would be relevant to this issue but if you would like any more information about my program please ask.
ROC is generated by changing the threshold on your predictions and finding the sensitivity and specificity for each threshold. This generally means that as you increase the threshold, your sensitivity decreases but your specificity increases and it draws a picture of the overall quality of your predicted probabilities. In your case, since everything is either 0 or 1 (or very close to it) there are no meaningful thresholds to use. That's why the thr value is basically [ 1, 1, 1 ].
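As a small self-contained illustration (the labels and scores below are invented): fully polarized scores leave only one distinct decision threshold, so roc_curve can only return three points.

import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 1, 0, 1, 1, 0])
scores = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])  # fully polarized

fpr, tpr, thr = roc_curve(labels, scores)
print(fpr)  # [0. 0. 1.]
print(tpr)  # [0. 1. 1.]
print(thr)  # three values only; the first is a sentinel above max(scores)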
You can try to pull the values closer to 0.5 artificially, or alternatively implement your own ROC curve calculation with more tolerance for small differences.
On the other hand, you might want to review your network, because such output values often mean there is a problem there; maybe the labels leaked into the network somehow, and it therefore produces perfect results.
I am new to RNNs and Keras.
I am trying to compare the performance of an LSTM against traditional machine learning algorithms (like RF or GBM) on sequential data (not necessarily a time series, but ordered). My data contains 276 predictors and an output (e.g. a stock price together with 276 pieces of information about the stock's firm) with 8564 retrospective observations. Since LSTMs are great at capturing sequential trends, I decided to use a time_step of 300. From the figure below, I believe I have the task of creating a many-to-many network (the last figure from the left). (Pic: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
Each pink box is of size 276 (the number of predictors), and there are 300 (time_steps) such pink boxes in one batch. However, I am struggling to see how to design the blue boxes here, as each blue box should be the output (stock price) of each instance. From other posts on the Keras GitHub forum (#2403 and #2654), I think I have to implement TimeDistributed(Dense()), but I don't know how. This is my code to check if it works (train_idv is the data to predict from and train_dv is the stock price):
train_idv.shape
# (8263, 300, 276)
train_dv.shape
# (8263, 300, 1)

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed

batch_size = 1
time_Steps = 300

model = Sequential()
model.add(LSTM(300,
               batch_input_shape=(batch_size, time_Steps, train_idv.shape[2]),
               stateful=True,
               return_sequences=True))
model.add(Dropout(0.3))
model.add(TimeDistributed(Dense(300)))

# Model compilation
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(train_idv, train_dv, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
Running model.fit gives this error:
Traceback (most recent call last):
File "", line 1, in
File "/home/user/.local/lib/python2.7/site-packages/keras/models.py", line 627, in fit
sample_weight=sample_weight)
File "/home/user/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1052, in fit
batch_size=batch_size)
File "/home/user/.local/lib/python2.7/site-packages/keras/engine/training.py", line 983, in _standardize_user_data
exception_prefix='model target')
File "/home/user/.local/lib/python2.7/site-packages/keras/engine/training.py", line 111, in standardize_input_data
str(array.shape))
Exception: Error when checking model target: expected timedistributed_4 to have shape (1, 300, 300) but got array with shape (8263, 300, 1)
Now, I have successfully run it with time_step=1 and just Dense(1) as the last layer. But I am not sure how I should shape my train_dv (the output in training) or how to use TimeDistributed(Dense()) correctly. Finally, I want to use
trainPredict = model.predict(train_idv,batch_size=1)
to predict scores on any data.
I have posted this question on keras github forum as well.
From your post I understand that you want each LSTM time step to predict a single scalar, correct? Then your TimeDistributed(Dense) layer should have output size 1, not 300 (i.e. TimeDistributed(Dense(1))).
Also, for your reference, there's an example in the Keras repo of using TimeDistributed(Dense).
In this example, one basically wants to train a multi-class classifier (with shared weights) for each time step, where the possible classes are the possible digit characters:
# For each of step of the output sequence, decide which character should be chosen
model.add(TimeDistributed(Dense(len(chars))))
The number of time steps is defined by the preceding recurrent layers.
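Putting that together, a minimal sketch of the corrected model from the question (same shapes and hyperparameters as the original; only the last layer changes, and the accuracy metric is dropped since this is a regression):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed

batch_size = 1
time_steps = 300
n_features = 276

model = Sequential()
model.add(LSTM(300,
               batch_input_shape=(batch_size, time_steps, n_features),
               stateful=True,
               return_sequences=True))
model.add(Dropout(0.3))
# One scalar prediction per time step, matching targets of shape (n, 300, 1):
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_squared_error', optimizer='adam')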
I'm a scikit n00b who's trying out this neural network example:
from sknn.mlp import Regressor, Layer
nn = Regressor(
    layers=[
        Layer("Rectifier", units=100),
        Layer("Linear")],
    learning_rate=0.02,
    n_iter=10)
nn.fit(X_train, y_train)
found on [0].
I have the appropriate (normalized) dataset (X_train and y_train) that I'm using. When I execute the nn.fit command, it works once. But any subsequent attempt to re-run it results in a very annoying:
File "1.py", line 39, in <module>
nn.fit(X, Y.values.ravel())
File "/Library/Python/2.7/site-packages/sknn/mlp.py", line 397, in fit
return super(Classifier, self)._fit(X, yp, w)
File "/Library/Python/2.7/site-packages/sknn/mlp.py", line 248, in _fit
raise e
RuntimeError: Training diverged and returned NaN.
This error doesn't seem to be documented, so I'm at my wits' end. The only way to get it to work again seems to be a restart of my computer. Has anyone seen this before? Does this mean that I need to do some sort of 'cleaning up' once I'm done fitting?
[0] http://scikit-neuralnetwork.readthedocs.io/en/latest/guide_model.html
I met a similar problem, and I tried normalizing my data using scikit-learn's pipeline, as shown in http://scikit-neuralnetwork.readthedocs.io/en/latest/guide_sklearn.html#example-pipeline. I also tried changing the type of the output layer from 'Linear' to 'Softmax'.
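For reference, a minimal sketch of that pipeline approach, modeled on the linked guide (it assumes the X_train / y_train from the question; the scaler range and network hyperparameters are just those from the example above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sknn.mlp import Regressor, Layer

# Scale inputs into [0, 1] before they reach the network; unscaled
# inputs are a common cause of diverging (NaN) training.
pipeline = Pipeline([
    ('min/max scaler', MinMaxScaler(feature_range=(0.0, 1.0))),
    ('neural network', Regressor(layers=[Layer("Rectifier", units=100),
                                         Layer("Linear")],
                                 learning_rate=0.02,
                                 n_iter=10))])
pipeline.fit(X_train, y_train)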
I am still a novice, and I don't know why these methods would help.
Yes, I met the same problem as you, and solved it by changing "Linear" to "Softmax".
In the documentation of scikit-learn, in section 1.9.2.1 (excerpt posted below), why does the implementation of random forests differ from the original paper by Breiman? As far as I'm aware, Breiman opted for a majority vote (mode) for classification and an average for regression when aggregating the ensemble of classifiers (per the paper by Liaw and Wiener, the maintainers of the original R code, cited below).
Why does scikit-learn use probabilistic prediction instead of a majority vote?
Is there any advantage in using probabilistic prediction?
The section in question:
In contrast to the original publication [B2001], the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single
class.
Source: Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R news, 2(3), 18-22.
This question has now been answered on Cross Validated. Included here for reference:
Such questions are always best answered by looking at the code, if
you're fluent in Python.
RandomForestClassifier.predict, at least in the current version 0.16.1, predicts the class with the highest probability estimate, as given by predict_proba (this line).
The documentation for predict_proba says:
The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the
same class in a leaf.
The difference from the original method is probably just so that
predict gives predictions consistent with predict_proba. The
result is sometimes called "soft voting", rather than the "hard"
majority vote used in the original Breiman paper. I couldn't in quick
searching find an appropriate comparison of the performance of the two
methods, but they both seem fairly reasonable in this situation.
The predict documentation is at best quite misleading; I've
submitted a pull
request to
fix it.
If you want to do majority vote prediction instead, here's a function
to do it. Call it like predict_majvote(clf, X) rather than
clf.predict(X). (Based on predict_proba; only lightly tested, but
I think it should work.)
import numpy as np
from scipy.stats import mode
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted

def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunks of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop: collect each tree's hard class predictions
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

    # Reduce: the per-sample mode across trees is the majority vote
    modes, counts = mode(all_preds, axis=0)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes[0], axis=0)
    else:
        n_samples = all_preds[0].shape[0]
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_.dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
        return preds
On the dumb synthetic case I tried, predictions agreed with the
predict method every time.
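A quick usage sketch on synthetic data (hedged: the private imports in the function above only exist in scikit-learn versions around 0.16, so this is version-specific):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

hard = predict_majvote(clf, X)  # majority (hard) vote across trees
soft = clf.predict(X)           # scikit-learn's default soft vote
print((hard == soft).mean())    # fraction of samples where the two agree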
This was studied by Breiman in Bagging Predictors (http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf).
Soft voting gives nearly identical results to hard voting, but with smoother probabilities. Note that if you are using fully grown trees, you won't see any difference.