I am trying to fit a classification model for a multi-label text classification problem.
I have a training set X_train that contains a list of cleaned texts, like
["I am constructing Markov chains with to states and inferring
transition probabilities empirically by simply counting how many
times I saw each transition in my raw data",
"I know the chips only of the players of my table and mine obviously I
also know the total number of chips the max and min amount chips the
players have and the average stackIs it possible to make an
approximation of my probability of winningI have,
...]
and a set of training tags y, with one list of tags for each text in X_train, like
[['hypothesis-testing', 'statistical-significance', 'markov-process'],
['probability', 'normal-distribution', 'games'],
...]
Now I want to fit a model that can predict the tags for a text set X_test that has the same format as X_train.
I have used MultiLabelBinarizer to convert the tags and TfidfVectorizer to convert the cleaned text in the training set.
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(y)
Y = multilabel_binarizer.transform(y)
vectorizer = TfidfVectorizer(stop_words = stopWordList)
vectorizer.fit(X_train)
x_train = vectorizer.transform(X_train)
But when I try to fit the model I always get errors. I have tried OneVsRestClassifier and LogisticRegression.
When I fit a OneVsRestClassifier model I get errors like
Traceback (most recent call last):
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/usr/local/spark/python/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/spark/python/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/spark/python/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/spark/python/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
When I fit a LogisticRegression model I get warnings like
/opt/conda/envs/data3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
Does anyone know where the problem is and how to solve it? Many thanks.
OneVsRestClassifier fits one classifier per class. You need to tell it which type of classifier you want (for example logistic regression).
The following code works for me:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(x_train, Y)
X_test= ["I play with Markov chains"]
x_test = vectorizer.transform(X_test)
classifier.predict(x_test)
output: array([[0, 1, 1, 0, 0, 1]])
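To turn the 0/1 output row back into tag names, the fitted MultiLabelBinarizer can be applied in reverse. The following is a minimal end-to-end sketch, not the asker's actual setup: the texts and tags are made-up stand-ins for X_train and y, and max_iter=1000 is an assumed value (raising max_iter is a common response to the ConvergenceWarning shown in the question).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for X_train and y (hypothetical data, not from the question).
X_train = [
    "markov chains and transition probabilities",
    "poker chips and probability of winning",
    "transition matrix of a markov process",
    "probability of winning a card game",
]
y = [
    ["markov-process"],
    ["probability", "games"],
    ["markov-process"],
    ["probability", "games"],
]

# One 0/1 column per distinct tag; columns follow the sorted tag names.
multilabel_binarizer = MultiLabelBinarizer()
Y = multilabel_binarizer.fit_transform(y)

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(X_train)

# One binary logistic regression per tag column.
classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
classifier.fit(x_train, Y)

x_test = vectorizer.transform(["I play with Markov chains"])
predicted = classifier.predict(x_test)

# Map each 0/1 row back to a tuple of tag names.
predicted_tags = multilabel_binarizer.inverse_transform(predicted)
print(list(multilabel_binarizer.classes_))
print(predicted_tags)
```

Note that the test text must go through the same fitted vectorizer (transform, not fit_transform), otherwise the feature columns will not line up with the ones the classifier was trained on.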
First of all, sorry for any newbie mistakes that I've made. But I couldn't figure out and couldn't find a source specifically for deeppavlov (NER) library. I'm trying to train ner_ontonotes_bert_mult as described here. I guess it can be trained from its checkpoint to make it recognize some specific patterns like;
"Round 23/22; 24,9 x 12,2 x 12,3"
as
[[['Round', '23/22', ';', '24,9 x 12,2 x 12,3']], [['B-PRODUCT', 'I-PRODUCT', 'B-QUANTITY']]]
My questions are (before I dig into details):
Is it possible? And I realized I can't use samples like " Round 23/22; 24,9 x 12,2 x 12,3 ". I need them to be in full sentences.
Where can I find more info about it specifically related to deeppavlov's model(s)?
How can I train pre-trained deeppavlov model to recognize my custom patterns?
I don't even understand if it is possible, but I've decided to give it a go and prepared 3 .txt files named "train.txt", "test.txt" and "validation.txt", as described on the DeepPavlov web page. I put them under the folder '~/.deeppavlov/downloads/ontonotes/ner_ontonotes_bert_mult'. My dataset looks like this:
Round B-PRODUCT
23/22 I-PRODUCT
24,9 x 12,2 x 12,3 B-QUANTITY
Ring B-PRODUCT
HDFAA I-PRODUCT
12,7 x 10 B-QUANTITY
and so on... This is the code I am using to train it:
import os
# Force tensorflow to use CPU instead of GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config
config_dict = parse_config(configs.ner.ner_ontonotes_bert_mult)
print(config_dict['dataset_reader']['data_path'])
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)
But I am getting this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
[[{{node save/Assign_280}}]]
Full traceback:
2019-09-26 15:50:27.63 ERROR in 'deeppavlov.core.common.params'['params'] at line 110: Exception in <class 'deeppavlov.models.bert.bert_ner.BertNerModel'>
Traceback (most recent call last):
File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
[[{{node save/Assign_280}}]]
UPDATE 2:
And I realized I can't use samples like " Round 23/22; 24,9 x 12,2 x 12,3 ". I need them to be in full sentences.
UPDATE:
It seems this is happening because of my dataset. My custom dataset has only 3 tags (B-PRODUCT, I-PRODUCT and B-QUANTITY), but the pre-trained model has 37 of them. All available tags can be found here, under the sentence "The list of available tags and their descriptions are presented below.": 18 main tags (36 with the B- and I- prefixes) plus the O tag ("O" means the absence of an entity). All 37 tags need to be present in the dataset. I was able to get past that error by adding dummy sentences tagged with the missing tags. This is a terrible workaround, since I'm deliberately corrupting my own dataset. I'm still looking for a 'logical' way to train...
PS: Now I am getting this error.
Traceback (most recent call last):
File "/home/custom_user/.PyCharm2019.2/config/scratches/scratch_9.py", line 13, in <module>
ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/__init__.py", line 31, in train_model
train_evaluate_model_from_config(config, download=download, recursive=recursive)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/commands/train.py", line 121, in train_evaluate_model_from_config
trainer.train(iterator)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 294, in train
self.train_on_batches(iterator)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 234, in train_on_batches
self._validate(iterator)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 150, in _validate
metrics = list(report['metrics'].items())
AttributeError: 'NoneType' object has no attribute 'items'
There are at least two problems here:
1. instead of validation.txt there should be a valid.txt file;
2. you are trying to retrain a model that was pretrained on a different dataset with a different set of tags; that is not necessary.
To train your model from scratch you can do something like:
import json
from deeppavlov import configs, build_model, train_model
with configs.ner.ner_ontonotes_bert_mult.open(encoding='utf8') as f:
    ner_config = json.load(f)
ner_config['dataset_reader']['data_path'] = '~/my_data_dir/' # directory with train.txt, valid.txt and test.txt files
ner_config['metadata']['variables']['NER_PATH'] = '~/where_to_save_the_model/'
ner_config['metadata']['download'] = [ner_config['metadata']['download'][-1]] # do not download the pretrained ontonotes model
ner_model = train_model(ner_config, download=True)
The other thing that could go wrong is tokenization: "Round 23/22; 24,9 x 12,2 x 12,3" will be split by the model to ['Round', '23', '/', '22', ';', '24', ',', '9', 'x', '12', ',', '2', 'x', '12', ',', '3'] and not ['Round', '23/22', ';', '24,9 x 12,2 x 12,3'].
But you can tokenize your texts beforehand:
ner_model([['Round', '23/22', ';', '24,9 x 12,2 x 12,3']])
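Tokenization also matters when preparing train.txt, valid.txt and test.txt. The sketch below assumes the dataset reader expects CoNLL-style input: one whitespace-separated token and tag per line, with a blank line between sentences. Under that assumption, a multi-word entry such as "24,9 x 12,2 x 12,3 B-QUANTITY" cannot sit on a single line and has to be split into B-/I- tagged tokens. The helper names to_bio and to_conll_lines are hypothetical, not part of DeepPavlov.

```python
def to_bio(tokens, spans):
    """Expand (start, end, label) token spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def to_conll_lines(sentences):
    """Render [(tokens, tags), ...] as 'token tag' lines, blank line after each sentence."""
    lines = []
    for tokens, tags in sentences:
        for token, tag in zip(tokens, tags):
            if " " in token:
                # A token containing spaces would be misread as several columns.
                raise ValueError(f"token contains whitespace: {token!r}")
            lines.append(f"{token} {tag}")
        lines.append("")
    return "\n".join(lines)


tokens = ["Round", "23/22", ";", "24,9", "x", "12,2", "x", "12,3"]
spans = [(0, 2, "PRODUCT"), (3, 8, "QUANTITY")]  # end index is exclusive
tags = to_bio(tokens, spans)
text = to_conll_lines([(tokens, tags)])
print(text)
```

The tokens here are split on whitespace only, which differs from the finer split the model would apply on its own; whichever tokenization is chosen, the same one should be used for the training files and at prediction time.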
I tried DeepPavlov training and successfully trained the 'ner' model.
I also got the same error at first while training, then overcame it by researching more about it.
Things to know before training:
-> You can find a link to the 'ner_ontonotes_bert_mult.json' config file in the DeepPavlov docs; it gives the dataset path, the pretrained model path, the dataset_reader and the chainer pipe used for training.
-> There is a pretrained model in the directory named in the config. By default the root directory is 'C:/users/{user_name}/.deeppavlov/', and pretrained models are stored in its 'models' subdirectory.
-> When you start training, the already trained model is modified, which means training only tries to improve the pretrained model.
So to train and build your own model from scratch, simply delete the 'models' subdirectory from the '.deeppavlov' path and run the training.
I want to create a GRU model with 3 layers of 32, 16 and 8 units respectively. The model would take analog values as input and produce analog values as output.
I have written the following code:
def getAModelGRU(neuron=(10), look_back=1, numInputs = 1, numOutputs = 1):
    model = Sequential()
    if len(neuron) > 1:
        model.add(GRU(units=neuron[0], input_shape=(look_back,numInputs)))
        for i in range(1,len(neuron)-1):
            model.add(GRU(units=neuron[i]))
        model.add(GRU(units=neuron[-1], input_shape=(look_back,numInputs)))
    else:
    model.add(GRU(units=neuron, input_shape=(look_back,numInputs)))
    model.add(Dense(numOutputs))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
And, I will call this function as:
chkEKF = getAModelGRU(neuron=(32,16,8), look_back=1, numInputs=10, numOutputs=6)
And, I obtained the following:
Traceback (most recent call last):
File "/home/momtaz/Dropbox/QuadCopter/quad_simHierErrorCorrectionEstimator.py", line 695, in <module>
Single_Point2Point()
File "/home/momtaz/Dropbox/QuadCopter/quad_simHierErrorCorrectionEstimator.py", line 74, in Single_Point2Point
chkEKF = getAModelGRU(neuron=(32,16,8), look_back=1, numInputs=10, numOutputs=6)
File "/home/momtaz/Dropbox/QuadCopter/rnnUtilQuad.py", line 72, in getAModelGRU
model.add(GRU(units=neuron[i]))
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/sequential.py", line 181, in add
output_tensor = layer(self.outputs[0])
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/layers/recurrent.py", line 532, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 414, in __call__
self.assert_input_compatibility(inputs)
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 311, in assert_input_compatibility
str(K.ndim(x)))
ValueError: Input 0 is incompatible with layer gru_2: expected ndim=3, found ndim=2
I searched online but did not find any solution for this 'ndim' issue.
Please let me know what I am doing wrong here.
You need to make sure the input_shape parameter is defined only in the first layer, and that every layer has return_sequences=True except possibly the last one (depending on your model).
The code below covers the common case where you stack several layers and only the number of units in each layer changes.
import tensorflow as tf
n_timesteps, n_inputs = 1, 10  # look_back and numInputs from the question
model = tf.keras.Sequential()
gru_options = [dict(units=units,
                    # assuming an L2 penalty was intended; a regularizer object is required here
                    kernel_regularizer=tf.keras.regularizers.l2(0.01),
                    # ... potentially more options
                    return_sequences=True) for units in [32, 16, 8]]
gru_options[0]['input_shape'] = (n_timesteps, n_inputs)
gru_options[-1]['return_sequences'] = False  # optionally disable sequences in the last layer.
                                             # If you want to return sequences in your last
                                             # layer delete this line, however it is necessary
                                             # if you want to connect this to a dense layer
                                             # for example.
for opts in gru_options:
    model.add(tf.keras.layers.GRU(**opts))
model.add(tf.keras.layers.Dense(6))
By the way, there is a bug in your code: the line after the else clause is not indented. Also note that Python objects implementing the iterable protocol (e.g. lists and tuples) can be iterated over directly with for-in syntax; you don't need C-style index loops, and the former is more idiomatic (pythonic).
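To see why return_sequences matters, it can help to trace the shapes by hand. The sketch below is plain Python shape bookkeeping, not Keras code; gru_output_shape and trace_stack are hypothetical helpers, and the batch dimension is left out (Keras counts it, which is why its message says ndim=3 versus ndim=2).

```python
def gru_output_shape(input_shape, units, return_sequences):
    """Shape bookkeeping for one GRU layer; the batch dimension is left out."""
    if len(input_shape) != 2:
        # A recurrent layer needs (timesteps, features) per sample.
        raise ValueError(f"expected (timesteps, features), got {input_shape}")
    timesteps, _features = input_shape
    # With return_sequences=True the time axis is kept, otherwise only the last state is returned.
    return (timesteps, units) if return_sequences else (units,)


def trace_stack(input_shape, layers):
    """Return the output shape after each (units, return_sequences) layer."""
    shapes = []
    shape = input_shape
    for units, return_sequences in layers:
        shape = gru_output_shape(shape, units, return_sequences)
        shapes.append(shape)
    return shapes


# look_back=1 and numInputs=10, as in the question.
good = trace_stack((1, 10), [(32, True), (16, True), (8, False)])
print(good)  # [(1, 32), (1, 16), (8,)]

try:
    # Every layer with return_sequences=False, as in the question's code.
    trace_stack((1, 10), [(32, False), (16, False), (8, False)])
except ValueError as err:
    failure = str(err)
    print("second layer fails:", failure)
```

The first layer without return_sequences=True hands a flat (32,) vector to the second layer, which is the situation the ValueError for gru_2 describes.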
I'm trying to graph a learning curve using the SVC classifier. The dataset is kinda skewed, with class sizes of about 150, 1000, 1000, 1000 and 150. I'm running into a problem with fitting the estimator:
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/learning_curve.py", line 135, in learning_curve
for train, test in cv for n_train_samples in train_sizes_abs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 644, in __call__
self.dispatch(function, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 391, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 129, in __init__
self.results = func(*args, **kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1233, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
X = atleast2d_or_csr(X, dtype=np.float64, order='C')
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 450, in _validate_targets
% len(cls))
ValueError: The number of classes has to be greater than one; got 1
My code
df = pd.read_csv('../resources/problem2_processed_validate.csv')
data, label = preprocess_text(df)
cv = StratifiedKFold(label, 10)
plt = plot_learning_curve(estimator=SVC(), title="Learning curve", X=data, y=label.values, cv
train_sizes, train_scores, test_scores = learning_curve(
estimator, data, y=label, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))
Even though I use stratified sampling, I still run into this error. I believe it's because the learning curve code doesn't perform stratification when incrementing the dataset size, so at one step all the samples have the same class label.
How should I resolve this??
You could use StratifiedShuffleSplit instead of StratifiedKFold, and then write the learning curve loop yourself, creating a new CV object at each iteration. StratifiedShuffleSplit allows you to specify a train_size and a test_size which you can increment as you create your learning curve. As long as you let train_size be greater than the number of classes, it will be able to stratify.
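A minimal sketch of such a hand-written loop follows. It uses the sklearn.model_selection API of current scikit-learn rather than the older cross_validation module seen in the traceback, and the data is synthetic: the class sizes are a scaled-down version of the skew described in the question, and test_size=60 and n_splits=3 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

# Synthetic, skewed 5-class data (hypothetical stand-in for the real dataset).
rng = np.random.RandomState(0)
class_sizes = [15, 100, 100, 100, 15]
y = np.repeat(np.arange(5), class_sizes)
X = rng.normal(size=(y.size, 4)) + y[:, None]

test_size = 60
max_train = y.size - test_size
results = []       # (n_train, mean train score, mean test score)
classes_seen = []  # number of distinct classes in each training split

for frac in np.linspace(0.1, 1.0, 5):
    n_train = int(frac * max_train)
    # A fresh stratified splitter for every training-set size.
    cv = StratifiedShuffleSplit(n_splits=3, train_size=n_train,
                                test_size=test_size, random_state=0)
    train_scores, test_scores = [], []
    for train_idx, test_idx in cv.split(X, y):
        classes_seen.append(len(np.unique(y[train_idx])))
        estimator = SVC().fit(X[train_idx], y[train_idx])
        train_scores.append(estimator.score(X[train_idx], y[train_idx]))
        test_scores.append(estimator.score(X[test_idx], y[test_idx]))
    results.append((n_train, float(np.mean(train_scores)), float(np.mean(test_scores))))

for n_train, train_score, test_score in results:
    print(n_train, round(train_score, 3), round(test_score, 3))
```

Because each split is stratified at its own size, even the smallest training subset contains every class, so SVC never sees a single-class target.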
You are right. learning_curve doesn't perform stratification when creating a smaller data set, it just takes the first bit of the data. Lines 134-136 in learning_curve.py say
train[:n_train_samples] for n_train_samples in train_sizes_abs
You can shuffle your data in advance, so that the slice train[:n_train_samples] may (but is not guaranteed to) include data points from all classes. If you are willing to do some more work, what @eickenberg proposed will work.
PS This sounds like something that should be included in sklearn. If you do end up writing that code, please send a pull request on GitHub.
I am using 'roc_curve' from the metrics module in scikit-learn. The example shows that 'roc_curve' should be called before 'auc', similar to:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
and then:
metrics.auc(fpr, tpr)
However the following error is returned:
Traceback (most recent call last):
File "analysis.py", line 207, in <module>
r = metrics.auc(fpr, tpr)
File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 66, in auc
x, y = check_arrays(x, y)
File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 215, in check_arrays
_assert_all_finite(array)
File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.
What does it mean in terms of results, and is there a way to overcome this?
Are you trying to use roc_curve to evaluate a multiclass classifier? If you are using roc_curve on a classification problem that is not binary, it won't work correctly. There is math out there for multidimensional ROC analysis, but the current ROC methods in python don't implement them.
To evaluate multiclass problems, try using methods like confusion_matrix and classification_report from sklearn, and kappa() from skll.
You state this line:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
which leads to the conclusion that you may have copied the sklearn example which also uses "pos_label=2".
However, in most cases you want the "pos_label" to be 1. So if your code outputs probabilities and they are between 0 and 1, then your pos_label should be 1.
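One plausible way to end up with NaN is a pos_label that never occurs in y: there are then no positive samples, and the true positive rate becomes 0/0. The sketch below is a hand-rolled illustration in plain Python, not scikit-learn's implementation; roc_points is a hypothetical helper and the labels and scores are made up.

```python
import math


def roc_points(y_true, scores, pos_label):
    """Hand-rolled ROC points; a rate with a zero denominator becomes NaN."""
    is_pos = [label == pos_label for label in y_true]
    n_pos = sum(is_pos)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [], []
    # Sweep the decision threshold from the highest score down to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for p, s in zip(is_pos, scores) if s >= threshold and p)
        fp = sum(1 for p, s in zip(is_pos, scores) if s >= threshold and not p)
        tpr.append(tp / n_pos if n_pos else float("nan"))
        fpr.append(fp / n_neg if n_neg else float("nan"))
    return fpr, tpr


y = [0, 0, 1, 1]
pred = [0.1, 0.4, 0.35, 0.8]

fpr_ok, tpr_ok = roc_points(y, pred, pos_label=1)
fpr_bad, tpr_bad = roc_points(y, pred, pos_label=2)  # label 2 never occurs in y

print(tpr_ok)   # [0.5, 0.5, 1.0, 1.0]
print(all(math.isnan(v) for v in tpr_bad))  # True
```

With labels in {0, 1}, pos_label=1 gives finite rates, while pos_label=2 leaves the true positive rate undefined at every threshold, which any later area computation would reject.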
I'm using OneVsRestClassifier for multilabel classification. It works with LinearSVC, but when I apply it to SVC, the following error appears:
classifier = OneVsRestClassifier(SVC(class_weight='balanced'))
classifier.fit(X1, y1)
y2 = classifier.predict(X2)
Traceback (most recent call last):
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 219, in predict
return predict_ovr(self.estimators_, self.label_binarizer_, X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 93, in predict_ovr
Y = np.array([_predict_binary(e, X) for e in estimators])
File "/usr/local/lib/python2.7/dist-packages/sklearn/multiclass.py", line 66, in _predict_binary
score = estimator.predict_proba(X)[:, 1]
File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 490, in predict_proba
"probability estimates must be enabled to use this method")
NotImplementedError: probability estimates must be enabled to use this method
Does anybody know what is going on?
This is a bug. The OneVsRestClassifier calls the predict_proba method when it finds one, but the one on SVC does not actually work unless you construct it with probability=True to get Platt scaling (which I don't actually encourage).
The reason that it works for LinearSVC is that that class does not have a predict_proba, so OvR backs off to the decision_function method.
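The two options mentioned above can be sketched as follows. The data is synthetic and the variable names mirror the question (X1, y1, X2); on recent scikit-learn versions the plain SVC case may already work without probability=True, so this is only an illustration of the workaround for the version in the traceback.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC

# Synthetic multilabel data: two independent 0/1 labels per sample (hypothetical).
rng = np.random.RandomState(0)
X1 = rng.normal(size=(60, 5))
y1 = np.column_stack([(X1[:, 0] > 0).astype(int), (X1[:, 1] > 0).astype(int)])
X2 = rng.normal(size=(10, 5))

# Option 1: enable probability estimates (Platt scaling) so predict_proba works.
svc_ovr = OneVsRestClassifier(SVC(class_weight="balanced", probability=True))
svc_ovr.fit(X1, y1)
y2_svc = svc_ovr.predict(X2)

# Option 2: LinearSVC has no predict_proba, so decision_function is used instead.
linear_ovr = OneVsRestClassifier(LinearSVC(class_weight="balanced"))
linear_ovr.fit(X1, y1)
y2_linear = linear_ovr.predict(X2)

print(y2_svc.shape, y2_linear.shape)
```

Option 1 adds an internal cross-validated calibration step, which makes fitting slower; that cost is one reason to prefer Option 2 when a linear kernel is acceptable.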