Spark 1.6.1, Python 3.5.1: building a naive Bayes classifier

My question is based upon this.
Would it be possible to get more detailed comments explaining the code, starting at this line?
tf = HashingTF().transform(training_raw.map(lambda doc: doc["text"], preservesPartitioning=True))
How could I print the confusion matrix?
What does the error below mean, and how can I fix it? The model still gets built and I get predictions.
>>> # Train and check
... model = NaiveBayes.train(training)
[Stage 2:=============================> (2 + 2) / 4]16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
How can I print the prediction for a new observation? I tried and failed:
>>> model.predict("love")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\classification.py", line 594, in predict
x = _convert_to_vector(x)
File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\linalg\__init__.py", line 77, in _convert_to_vector
raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'str'> into Vector

1. HashingTF in Spark is similar to scikit-learn's HashingVectorizer. training_raw is an RDD of text. For a detailed explanation of the available vectorizers in PySpark, see Vectorizers. For a complete example, see this post.
2. BLAS is the Basic Linear Algebra Subprograms library. The messages are logged at WARN level, not as errors, which is consistent with the model still being built. You can check out this page on GitHub for a potential solution.
3. You are trying to call model.predict on a string ("love"). You must first convert the string to a vector. A simple example that parses a line holding a label and space-separated feature values, and returns a LabeledPoint with a dense vector, is:
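from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parseLine(line):
    # Expected input: "<label>,<feature> <feature> ..."
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)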
You are probably looking for a sparse vector. So try Vectors.sparse.
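To make point 3 more concrete, here is a minimal pure-Python sketch of the hashing trick that HashingTF is built on: each token is hashed to a bucket index, and the per-bucket counts form a sparse term-frequency vector. The function name hashing_tf, the bucket count of 16 and the use of zlib.crc32 are assumptions made for illustration; Spark uses its own hash function and a much larger feature dimension.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=16):
    """Map tokens to a sparse {index: count} term-frequency vector.

    Illustrative only: zlib.crc32 stands in for the hash function,
    and num_features=16 is far smaller than a realistic setting.
    """
    counts = Counter()
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

# The same token always lands in the same bucket, so a new document
# is comparable with the vectors the model was trained on.
vec = hashing_tf("love love spark".split())
single = hashing_tf(["love"])
```

The practical consequence is that a raw string such as "love" has to go through the same transformation as the training text before it is handed to model.predict. In PySpark that would presumably mean passing the tokens of the new text to the same HashingTF setup used for training and predicting on the resulting vector.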

Related

BERT NER: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first

I want to train my BERT NER model on Colab, but the following error occurs.
Code:
tr_logits = tr_logits.detach().cpu().numpy()
tr_label_ids = torch.masked_select(b_labels, (preds_mask == 1))
tr_batch_preds = np.argmax(tr_logits[preds_mask.squeeze()], axis=1)
tr_batch_labels = tr_label_ids.to(device).numpy()
tr_preds.extend(tr_batch_preds)
tr_labels.extend(tr_batch_labels)
Error:
Using TensorFlow backend.
Saved standardized data to ./data/en/combined/train_combined.txt.
Saved standardized data to ./data/en/combined/dev_combined.txt.
Saved standardized data to ./data/en/combined/test_combined.txt.
Constructed SentenceGetter with 25650 examples.
Constructed SentenceGetter with 8934 examples.
Loaded training and validation data into DataLoaders.
Initialized model and moved it to cuda.
Initialized optimizer and set hyperparameters.
Epoch: 0% 0/5 [00:00<?, ?it/s]Starting training loop.
Epoch: 0% 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/content/FYP_Presentation/python/main.py", line 102, in <module>
valid_dataloader,
File "/content/FYP_Presentation/python/utils/main_utils.py", line 431, in train_and_save_model
tr_batch_preds = torch.max(tr_logits[preds_mask.squeeze()], axis=1)
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 412, in __array__
return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
How would I solve this issue?
In the first line of your code, tr_logits = tr_logits.detach().cpu().numpy() already turns tr_logits into a numpy array. In the line that raises the error:
tr_batch_preds = torch.max(tr_logits[preds_mask.squeeze()], axis=1)
the first thing the program does is evaluate tr_logits[preds_mask.squeeze()]. Since tr_logits is now a numpy array, its index preds_mask must also be a numpy array, so the program calls preds_mask.numpy() to convert it. However, preds_mask is on the GPU, hence the error.
I'd suggest using either numpy arrays or PyTorch tensors consistently throughout a program rather than switching back and forth between the two.
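As a rough sketch of the "tensors all the way" option, the snippet below does the masking and the argmax while everything is still a PyTorch tensor on one device, and only moves the results to the host at the end. The tensors logits, labels and mask are small hypothetical stand-ins for tr_logits, b_labels and preds_mask, and their shapes are assumed for illustration.

```python
import torch

# Use the GPU when one is present; the snippet also runs on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for one batch: 3 tokens, 2 classes.
logits = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], device=device)
labels = torch.tensor([1, 0, 1], device=device)
mask = torch.tensor([True, False, True], device=device)

# Index and reduce while everything is still a tensor on one device.
batch_preds = logits[mask].argmax(dim=1)
batch_labels = torch.masked_select(labels, mask)

# Cross to the host once, at the end, before extending Python lists.
preds = batch_preds.cpu().tolist()
golds = batch_labels.cpu().tolist()
```

If numpy arrays are needed downstream, the same idea applies: call .cpu() on every tensor, including the mask, before any of them is mixed with a numpy array.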

OCR code written without custom loss function

I am working on an OCR model. My final goal is to convert the OCR code to Core ML and deploy it on iOS.
I have looked at and run a couple of GitHub source codes, namely:
here
here
If you look at them, they all implement the loss as a custom layer using a Lambda layer.
The problem starts when I want to convert this to Core ML.
My code for converting to Core ML:
import coremltools
from coremltools.proto import NeuralNetwork_pb2

def convert_lambda(layer):
    # Only convert this Lambda layer if it is the CTC loss function.
    if layer.function == ctc_lambda_func:
        params = NeuralNetwork_pb2.CustomLayerParams()
        # The name of the Swift or Obj-C class that implements this layer.
        params.className = "x"
        # The description is shown in Xcode's mlmodel viewer.
        params.description = "A fancy new loss"
        return params
    else:
        return None

print("\nConverting the model:")
# Convert the model to Core ML.
coreml_model = coremltools.converters.keras.convert(
    model,
    # 'weightswithoutstnlrchangedbackend.best.hdf5',
    input_names="image",
    image_input_names="image",
    output_names="output",
    add_custom_layers=True,
    custom_conversion_functions={"Lambda": convert_lambda},
)
but it raises an error:
Converting the model:
Traceback (most recent call last):
File "/home/sgnbx/Downloads/projects/CRNN-with-STN-master/CRNN_with_STN.py", line 201, in <module>
custom_conversion_functions={"Lambda": convert_lambda},
File "/home/sgnbx/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/coremltools/converters/keras/_keras_converter.py", line 760, in convert
custom_conversion_functions=custom_conversion_functions)
File "/home/sgnbx/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/coremltools/converters/keras/_keras_converter.py", line 556, in convertToSpec
custom_objects=custom_objects)
File "/home/sgnbx/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/coremltools/converters/keras/_keras2_converter.py", line 255, in _convert
if input_names[idx] in input_name_shape_dict:
IndexError: list index out of range
Input name length mismatch
I am not sure how to resolve this, as I did not find anything relevant to this error.
On the other hand, most OCR code has a custom loss function, so I would probably face the same problem again.
So in the end I have two questions:
Do you know how to resolve this error?
My main question: do you know of any OCR source code in Keras (as I have to convert it to Core ML) that does not have a custom loss function, so that it converts to Core ML without problems?
Thanks in advance :)
Just to make my question thorough,
this is the custom loss function in the source I am working with:
def ctc_lambda_func(args):
    iy_pred, ilabels, iinput_length, ilabel_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    iy_pred = iy_pred[:, 2:, :]  # no such influence
    return backend.ctc_batch_cost(ilabels, iy_pred, iinput_length, ilabel_length)

loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [fc_2, labels, input_length, label_length])
and then use it in compile:
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=sgd)
Core ML doesn't let you train a model, so it doesn't matter whether the model has a loss function. If you only want to use the CRNN as a predictor on iOS, you should just convert base_model from the second link.

Missing method NeuralNet.train_split() in lasagne

I am learning to work with Python and Lasagne. I have the following installed on my PC:
python 3.4.3
theano 0.9.0
lasagne 0.2.dev1
and also six, scipy and numpy. I call net.fit(), and the stacktrace tries to call train_split(X, y, self), which, I guess, should split the samples into training set and validation set (both the inputs X as well as the outputs Y).
But there is no method train_split(X, y, self); there is only a float field train_split, which I assume is the ratio between the training and validation set sizes. Then I get the following error:
Traceback (most recent call last):
  File "...\workspaces\python\cnn\dl_tutorial\lasagne\Test.py", line 72, in
    net = net1.fit(X[0:10,:,:,:],y[0:10])
  File "...\Python34\lib\site-packages\nolearn\lasagne\base.py", line 544, in fit
    self.train_loop(X, y, epochs=epochs)
  File "...\Python34\lib\site-packages\nolearn\lasagne\base.py", line 554, in train_loop
    X_train, X_valid, y_train, y_valid = self.train_split(X, y, self)
TypeError: 'float' object is not callable
What could be wrong or missing? Any suggestions? Thank you very much.
SOLVED
In previous versions, the train_split parameter was a number that was used by the method of the same name. In nolearn 0.6.0 it is a callable object that can implement its own logic for splitting the data. So instead of passing a float to train_split, I have to pass a callable instance (the default one is TrainSplit), which is executed in each training epoch.
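The error message follows directly from that change: the training loop calls self.train_split(X, y, self), and a float cannot be called. Below is a minimal pure-Python sketch of a callable splitter. The class name SimpleSplit and its eval_size argument are hypothetical and only mimic the calling convention seen in the traceback; this is not nolearn's TrainSplit implementation.

```python
class SimpleSplit:
    """Hypothetical callable that holds out the last eval_size fraction."""

    def __init__(self, eval_size=0.2):
        self.eval_size = eval_size

    def __call__(self, X, y, net=None):
        # Same call shape as in the traceback: train_split(X, y, self).
        n_valid = int(len(X) * self.eval_size)
        cut = len(X) - n_valid
        return X[:cut], X[cut:], y[:cut], y[cut:]

X = list(range(10))
y = [v % 2 for v in X]

train_split = SimpleSplit(eval_size=0.2)
X_train, X_valid, y_train, y_valid = train_split(X, y, None)

# A bare float in the same position reproduces the reported error.
ratio = 0.2
try:
    ratio(X, y, None)
except TypeError as exc:
    error_message = str(exc)
```

With nolearn itself, the corresponding change would presumably be to pass an instance such as TrainSplit(eval_size=0.2) as the train_split argument of NeuralNet; the exact signature should be checked against the installed version.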

scikit learn says num samples must be greater than num clusters

Using sklearn.cluster.KMeans. Nearly this exact code worked earlier, all I changed was the way I built my dataset. I have just no idea where even to start... Here's the code:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=20)
for item in dfX:
    if type(item) != type(dfX[0]):
        print(item)
print(len(dfX))
print(dfX[:10])
km.fit(dfX)
print(km.cluster_centers_)
Which outputs the following:
12147
[1.201, 1.237, 1.092, 1.074, 0.979, 0.885, 1.018, 1.083, 1.067, 1.071]
/home/sbendl/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
Traceback (most recent call last):
File "/home/sbendl/PycharmProjects/MLFP/K-means.py", line 20, in <module>
km.fit(dfX)
File "/home/sbendl/anaconda3/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 812, in fit
X = self._check_fit_data(X)
File "/home/sbendl/anaconda3/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 789, in _check_fit_data
X.shape[0], self.n_clusters))
ValueError: n_samples=1 should be >= n_clusters=20
Process finished with exit code 1
As you can see from the output, there are definitely 12147 samples, which is greater than 20 in most counting systems ;). Additionally they're all floats, so it couldn't be having a problem with that. Anyone have any ideas?
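The DeprecationWarning in the output above points at a likely cause: dfX is a flat sequence of floats, and a 1-D input is read as a single sample with 12147 features, hence n_samples=1. Here is a minimal pure-Python sketch of the reshaping the warning suggests, turning each value into its own one-feature row. The helper name to_column and the shortened data are assumptions for illustration.

```python
def to_column(values):
    """Turn a flat sequence into rows of one feature each.

    Pure-Python counterpart of X.reshape(-1, 1) from the warning.
    """
    return [[float(v)] for v in values]

# Shortened, hypothetical stand-in for the 12147 values in dfX.
dfX = [1.201, 1.237, 1.092, 1.074, 0.979]

X = to_column(dfX)
n_samples, n_features = len(X), len(X[0])
```

Passing data shaped like X to km.fit should then give one sample per value, provided the number of values is at least n_clusters.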

How to handle NaNs returned from 'roc_curve' before passing to 'auc'?

I am using 'roc_curve' from the metrics module in scikit-learn. The example shows that 'roc_curve' should be called before 'auc', similar to:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
and then:
metrics.auc(fpr, tpr)
However the following error is returned:
Traceback (most recent call last):
  File "analysis.py", line 207, in <module>
    r = metrics.auc(fpr, tpr)
  File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 66, in auc
    x, y = check_arrays(x, y)
  File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 215, in check_arrays
    _assert_all_finite(array)
  File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.
What does this mean in terms of results, and is there a way to overcome it?
Are you trying to use roc_curve to evaluate a multiclass classifier? If you are using roc_curve on a classification problem that is not binary, it won't work correctly. There is math out there for multidimensional ROC analysis, but the current ROC methods in Python don't implement it.
To evaluate multiclass problems, try using methods like confusion_matrix and classification_report from sklearn, and kappa() from skll.
You state this line:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
which leads to the conclusion that you may have copied the sklearn example which also uses "pos_label=2".
However, in most cases you want the "pos_label" to be 1. So if your code outputs probabilities and they are between 0 and 1, then your pos_label should be 1.
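One way the NaN can arise, assuming the labels are 0 and 1 while pos_label=2 is passed: no sample counts as positive, so the true positive rate is a division by zero. The sketch below computes a single ROC point in pure Python to show this. The function name roc_point and the toy data are hypothetical, and this is not scikit-learn's implementation.

```python
import math

def roc_point(y_true, scores, threshold, pos_label=1):
    """Return (fpr, tpr) at one threshold; NaN when a rate is undefined."""
    tp = fp = pos = neg = 0
    for label, score in zip(y_true, scores):
        predicted_positive = int(score >= threshold)
        if label == pos_label:
            pos += 1
            tp += predicted_positive
        else:
            neg += 1
            fp += predicted_positive
    tpr = tp / pos if pos else math.nan
    fpr = fp / neg if neg else math.nan
    return fpr, tpr

y = [0, 0, 1, 1]
pred = [0.1, 0.4, 0.35, 0.8]

# Labels contain 1, so both rates are defined.
fpr_ok, tpr_ok = roc_point(y, pred, threshold=0.5, pos_label=1)
# No label equals 2, so there are no positives and tpr is NaN.
fpr_bad, tpr_bad = roc_point(y, pred, threshold=0.5, pos_label=2)
```

A quick check before calling roc_curve is therefore whether pos_label actually occurs in y.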

Resources