Tree classifier to graphviz ERROR - python-3.x

I built a DecisionTreeClassifier named model and tried to use the export_graphviz function like this:
export_graphviz(decision_tree=model,
out_file='NT_model.dot',
feature_names=X_train.columns,
class_names=model.classes_,
leaves_parallel=True,
filled=True,
rotate=False,
rounded=True)
The call raised this exception:
TypeError Traceback (most recent call last)
<ipython-input-298-40fe56bb0c85> in <module>()
6 filled=True,
7 rotate=False,
----> 8 rounded=True)
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in export_graphviz(decision_tree, out_file,
max_depth, feature_names, class_names, label, filled, leaves_parallel,
impurity, node_ids, proportion, rotate, rounded, special_characters)
431 recurse(decision_tree, 0, criterion="impurity")
432 else:
--> 433 recurse(decision_tree.tree_, 0,
criterion=decision_tree.criterion)
434
435 # If required, draw leaf nodes at same depth as each other
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in recurse(tree, node_id, criterion, parent,
depth)
319 out_file.write('%d [label=%s'
320 % (node_id,
--> 321 node_to_str(tree, node_id,
criterion)))
322
323 if filled:
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in node_to_str(tree, node_id, criterion)
289 np.argmax(value),
290 characters[2])
--> 291 node_string += class_name
292
293 # Clean up any trailing newlines
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U90') dtype('<U90') dtype('<U90')
The model's hyperparameters and classes are:
print(model)
DecisionTreeClassifier(class_weight={1.0: 10, 0.0: 1}, criterion='gini',
max_depth=7, max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=50,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=0, splitter='best')
print(model.classes_)
[ 0. , 1. ]
Help would be most appreciated!

As the documentation of export_graphviz specifies, the class_names parameter expects strings, not floats or ints:
class_names : list of strings, bool or None, optional (default=None)
Try converting the model.classes_ to list of strings before passing them in export_graphviz.
Try class_names=['0', '1'] or class_names=['0.0', '1.0'] in the call to export_graphviz().
For a more general solution, use:
class_names=[str(x) for x in model.classes_]
But is there a specific reason you are passing float values as y to model.fit()? That is usually not needed for a classification task. Are these your actual y labels, or are you converting string labels to numbers before fitting the model?
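The conversion suggested above can be tried end to end on a toy model. This is only a minimal sketch: the data and the feature names f0 and f1 are made up, and out_file=None is used so that the DOT source comes back as a string instead of being written to NT_model.dot.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data with float labels, mimicking model.classes_ == [0., 1.] from the question.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Convert the float class labels to strings before handing them to export_graphviz.
class_names = [str(c) for c in model.classes_]

dot_source = export_graphviz(
    decision_tree=model,
    out_file=None,                 # return the DOT text instead of writing a file
    feature_names=["f0", "f1"],    # hypothetical feature names
    class_names=class_names,
    filled=True,
    rounded=True,
)
print(class_names)
```

Passing out_file='NT_model.dot' instead should write the same text to disk, as in the original call.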

Related

How to encode empty string using BERT

I have recently been trying to encode an empty string with CamemBERT (BERT model for French). I wasn't sure on how to do that. If I try to simply encode an empty string,
from transformers import CamembertModel, CamembertTokenizer
import torch
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")
tokenized_sentence = tokenizer.tokenize("")
encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
embeddings = camembert(encoded_sentence)
embeddings.last_hidden_state.squeeze()[0] # embedding of the CLS token
I get the error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-553400f369a8> in <module>
1 # Tokenize in sub-words with SentencePiece
2 tokenized_sentence = tokenizer.tokenize("")
----> 3 encoded_sentence = tokenizer.encode(tokenized_sentence, return_tensors='pt')
4 embeddings = camembert(encoded_sentence)
5 embeddings.last_hidden_state.squeeze()[0] # embeddings.last_hidden_state[0][0]
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
2057 ``convert_tokens_to_ids`` method).
2058 """
-> 2059 encoded_inputs = self.encode_plus(
2060 text,
2061 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2376 )
2377
-> 2378 return self._encode_plus(
2379 text=text,
2380 text_pair=text_pair,
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
459 )
460
--> 461 first_ids = get_input_ids(text)
462 second_ids = get_input_ids(text_pair) if text_pair is not None else None
463
~/anaconda3/envs/r_nlp2/lib/python3.8/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
446 )
447 else:
--> 448 raise ValueError(
449 f"Input {text} is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
450 )
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Which I think is expected behavior. I have tried with spaCy's French transformer model but have also been unsuccessful. Here's the code I used for spaCy :
from transformers import BertTokenizer, BertModel
import spacy
#!python -m spacy download fr_dep_news_trf
trf_fr = spacy.load("fr_dep_news_trf")
example = trf_fr("")
example._.trf_data.tensors[1].flatten() # embedding of the CLS token
And the error is
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-27-c53de04d2e6f> in <module>
1 example = trf_fr("")
----> 2 example._.trf_data.tensors[1].flatten()
IndexError: list index out of range
simply because the model returns [].
I guess that at this point, my question is theoretical: what would be the best or a good way to encode an empty string using CamemBERT or spaCy? Would "forcing" the model to return a vector of 0 be a good thing? Would returning "impossible" values such as a (10,..., 10) be a good possibility? Should I force the tokenizer to create a sequence of [PAD] tokens? In this case, how would I implement that using spaCy and/or CamemBERT?
Thanks!
PS : I'm using
Python 3.8.10
spaCy 3.0.6
transformers 4.6.1
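One of the options raised above, returning a vector of zeros for an empty string, could be sketched as a small guard around whatever embedding call is used. This is only an illustration: embed_or_zeros and embed_fn are hypothetical names, 768 is the hidden size of camembert-base, and whether a zero vector is a sensible representation depends on what consumes the embeddings downstream.

```python
import numpy as np

HIDDEN_SIZE = 768  # hidden size of camembert-base


def embed_or_zeros(text, embed_fn, hidden_size=HIDDEN_SIZE):
    """Return embed_fn(text), or a zero vector when text is empty or whitespace.

    embed_fn is a hypothetical callable that maps a non-empty string to a
    1-D vector of length hidden_size (for example the CLS embedding).
    """
    if not text.strip():
        # Nothing to tokenize: skip the model and return a fixed placeholder.
        return np.zeros(hidden_size, dtype=np.float32)
    return np.asarray(embed_fn(text), dtype=np.float32)


# Usage with a stand-in embedding function (assumption: the real one calls the model).
fake_embed = lambda text: np.ones(HIDDEN_SIZE)
empty_vec = embed_or_zeros("", fake_embed)
word_vec = embed_or_zeros("bonjour", fake_embed)
```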

How do I pass the values to Catboost?

I'm trying to work with CatBoost and I'm stuck on a problem. I have a dataframe with 28 columns, 2 of which are categorical. The numerical columns contain both whole and fractional numbers, as well as some 0.00 values that are genuine zeros (for example, the result of 1-1), not missing values.
I'm trying to run this:
train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
But I get this error:
---------------------------------------------------------------------------
CatBoostError Traceback (most recent call last)
<ipython-input-112-a515b0ab357b> in <module>
1 train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
----> 2 evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in __init__(self, data, label, cat_features, text_features, embedding_features, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count, log_cout, log_cerr)
615 )
616
--> 617 self._init(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
618 super(Pool, self).__init__()
619
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _init(self, data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
1081 if label is not None:
1082 self._check_label_type(label)
-> 1083 self._check_label_empty(label)
1084 label = self._label_if_pandas_to_numpy(label)
1085 if len(np.shape(label)) == 1:
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _check_label_empty(self, label)
723 """
724 if len(label) == 0:
--> 725 raise CatBoostError("Labels variable is empty.")
726
727 def _check_label_shape(self, label, samples_count):
CatBoostError: Labels variable is empty.
I've googled this error but found nothing. My hypothesis is that the 0.00 values are the problem, but I don't know how to solve that, because I can't replace these values with anything else.
Please, help me!
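The traceback shows the failing check is len(label) == 0 on the second Pool, which is what would happen if ret_df has 580000 rows or fewer, since an out-of-range iloc slice returns an empty result rather than raising. Whether that is the case here cannot be told from the post, so the following is only a sketch of how to check it, using a tiny made-up stand-in for ret_df.

```python
import pandas as pd

# Hypothetical stand-in for ret_df: 5 rows, last column is the label.
ret_df = pd.DataFrame(
    {"feature": [0.0, 1.5, 2.0, 0.0, 3.25], "label": [0, 1, 0, 1, 0]}
)

split = 580000  # the hard-coded split index from the question

# Slicing past the end of the frame does not raise; it silently returns nothing.
train_labels = ret_df.iloc[:split, -1]
eval_labels = ret_df.iloc[split:, -1]
print(len(ret_df), len(train_labels), len(eval_labels))

# A split derived from the frame's own length keeps both parts non-empty.
safe_split = len(ret_df) * 4 // 5
safe_train_labels = ret_df.iloc[:safe_split, -1]
safe_eval_labels = ret_df.iloc[safe_split:, -1]
```

Printing len(ret_df) on the real dataframe before building the pools would confirm or rule out this explanation.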

I keep getting "TypeError: only integer scalar arrays can be converted to a scalar index" while using custom-defined metric in KNeighborsClassifier

I am using a custom-defined metric in SKlearn's KNeighborsClassifier. Here's my code:
def chi_squared(x,y):
    return np.divide(np.square(np.subtract(x,y)), np.sum(x,y))
The function above is my implementation of the chi-squared distance. I used NumPy functions because, according to the scikit-learn docs, a metric function takes two one-dimensional NumPy arrays.
I have passed the chi_squared function as an argument to KNeighborsClassifier().
knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
However, I keep getting the following error:
TypeError Traceback (most recent call last)
<ipython-input-29-d2a365ebb538> in <module>
4
5 knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
----> 6 knn.fit(X_train, Y_train)
7 predictions = knn.predict(X_test)
8 print(accuracy_score(Y_test, predictions))
~/.local/lib/python3.8/site-packages/sklearn/neighbors/_classification.py in fit(self, X, y)
177 The fitted k-nearest neighbors classifier.
178 """
--> 179 return self._fit(X, y)
180
181 def predict(self, X):
~/.local/lib/python3.8/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
497
498 if self._fit_method == 'ball_tree':
--> 499 self._tree = BallTree(X, self.leaf_size,
500 metric=self.effective_metric_,
501 **self.effective_metric_params_)
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree.__init__()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree._recursive_build()
sklearn/neighbors/_ball_tree.pyx in sklearn.neighbors._ball_tree.init_node()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree.rdist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.DistanceMetric.rdist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.PyFuncDistance.dist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.PyFuncDistance._dist()
<ipython-input-29-d2a365ebb538> in chi_squared(x, y)
1 def chi_squared(x,y):
----> 2 return np.divide(np.square(np.subtract(x,y)), np.sum(x,y))
3
4
5 knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
<__array_function__ internals> in sum(*args, **kwargs)
~/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py in sum(a, axis, dtype, out, keepdims, initial, where)
2239 return res
2240
-> 2241 return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
2242 initial=initial, where=where)
2243
~/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
85 return reduction(axis=axis, out=out, **passkwargs)
86
---> 87 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
88
89
TypeError: only integer scalar arrays can be converted to a scalar index
I can reproduce your error message with:
In [173]: x=np.arange(3); y=np.array([2,3,4])
In [174]: np.sum(x,y)
Traceback (most recent call last):
File "<ipython-input-174-1a1a267ebd82>", line 1, in <module>
np.sum(x,y)
File "<__array_function__ internals>", line 5, in sum
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 2247, in sum
return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
TypeError: only integer scalar arrays can be converted to a scalar index
Correct use(s) of np.sum:
In [175]: np.sum(x)
Out[175]: 3
In [177]: np.sum(np.arange(6).reshape(2,3), axis=0)
Out[177]: array([3, 5, 7])
In [178]: np.sum(np.arange(6).reshape(2,3), 0)
Out[178]: array([3, 5, 7])
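The second positional argument of np.sum is axis, not another array to add, so np.sum(x, y) tries to use y as an axis. (Re)read the np.sum docs if necessary!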
Using np.add instead of np.sum:
In [179]: np.add(x,y)
Out[179]: array([2, 4, 6])
In [180]: x+y
Out[180]: array([2, 4, 6])
The following should be equivalent:
np.divide(np.square(np.subtract(x,y)), np.add(x,y))
(x-y)**2/(x+y)
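Note that both expressions above return an array of per-element terms, while a callable metric in scikit-learn has to return a single number. A minimal sketch of a scalar version follows. It sums the terms, which is one common definition of the chi-squared distance (some definitions add a factor of 0.5), and the eps guard against division by zero is my own assumption, not part of the question.

```python
import numpy as np


def chi_squared(x, y, eps=1e-12):
    """Chi-squared distance between two 1-D arrays, returned as one float.

    eps is an assumed small constant that avoids 0/0 when x[i] + y[i] == 0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Element-wise (x - y)^2 / (x + y), then reduced to a single number.
    return float(np.sum((x - y) ** 2 / (x + y + eps)))


same = chi_squared([1, 2, 3], [1, 2, 3])
different = chi_squared([0, 1], [1, 0])
print(same, different)
```

A function like this can then be passed as metric=chi_squared when constructing KNeighborsClassifier.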

Loading existing MODFLOW-USG model w/ Voronoi Mesh in FloPy

I am trying to load an existing MODFLOW-USG model with FloPy (Windows environment). The model has a Voronoi mesh, and this seems to trip the "load" function:
m1=flopy.modflow.Modflow.load(model_name+".nam",model_ws=model_dir,verbose=True,check=False,exe_name="mfusg.exe",version='mfusg')
I get the following error, which appears to relate to the fact that FloPy is expecting a structured grid with rows and columns:
TypeError Traceback (most recent call last)
<ipython-input-33-62420c415719> in <module>
6 head_file = os.path.join(model_dir,model_name+'.hds')
7 print(head_file)
----> 8 m1=flopy.modflow.Modflow.load(model_name+".nam",model_ws=model_dir,verbose=True,check=False,exe_name="mfusg.exe",version='mfusg')
9 headobj = bf.HeadUFile(head_file,verbose=True,text='HEADU')
10 headobj.list_records()
~\Anaconda3\lib\site-packages\flopy\modflow\mf.py in load(f, version, exe_name, verbose, model_ws, load_only, forgive, check)
797 item.package.load(item.filehandle, ml,
798 ext_unit_dict=ext_unit_dict,
--> 799 check=False)
800 else:
801 item.package.load(item.filehandle, ml,
~\Anaconda3\lib\site-packages\flopy\modflow\mfrch.py in load(f, model, nper, ext_unit_dict, check)
408 print(txt)
409 t = Util2d.load(f, model, (nrow, ncol), np.float32, 'rech',
--> 410 ext_unit_dict)
411 else:
412 parm_dict = {}
~\Anaconda3\lib\site-packages\flopy\utils\util_array.py in load(f_handle, model, shape, dtype, name, ext_unit_dict, array_free_format, array_format)
2699
2700 elif cr_dict['type'] == 'internal':
-> 2701 data = Util2d.load_txt(shape, f_handle, dtype, cr_dict['fmtin'])
2702 u2d = Util2d(model, shape, dtype, data, name=name,
2703 iprn=cr_dict['iprn'], fmtin="(FREE)",
~\Anaconda3\lib\site-packages\flopy\utils\util_array.py in load_txt(shape, file_in, dtype, fmtin)
2376 elif len(shape) == 2:
2377 nrow, ncol = shape
-> 2378 num_items = nrow * ncol
2379 else:
2380 raise ValueError(
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
I could not find any documentation or Jupyter notebooks with examples of loading an existing model with Voronoi mesh, only creating new triangular meshes or structured / local-grid-refined grids.
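Try loading with forgive=True. As far as I know, this makes FloPy report and skip a package that fails to load (here the RCH package, per the traceback) instead of aborting the whole load:
m1=flopy.modflow.Modflow.load(model_name+".nam", model_ws=model_dir, verbose=True, check=False, exe_name="mfusg.exe", version='mfusg', forgive=True)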

Reconstruct Image from overlapping patches of image

I have used tf.extract_image_patches() to get a tensor of overlapping patches from the image, as described in this link. The answer there suggests using tf.space_to_depth() to reconstruct the image from the patches. That does not give the desired result in my case, and after some research I learned that tf.space_to_depth() does not handle overlapping blocks. My code looks like this:
import tensorflow as tf
import numpy as np
c = 3
height = 3900
width = 6000
ksizes = [1, 150, 150, 1]
strides = [1, 75, 75, 1]
image = tf.zeros([1, height, width, c])  # stand-in for an image of shape [1, height, width, 3]
patches = tf.extract_image_patches(image, ksizes=ksizes, strides=strides, rates=[1, 1, 1, 1], padding='VALID')
patches = tf.reshape(patches, [-1, 150, 150, 3])
reconstructed = tf.reshape(patches, [1, height, width, 3])
rec_new = tf.space_to_depth(reconstructed,75)
rec_new = tf.reshape(rec_new,[height,width,3])
This gives me the following error:
InvalidArgumentError Traceback (most recent call
last)
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\common_shapes.py
in _call_cpp_shape_fn_impl(op, input_tensors_needed,
input_tensors_as_shapes_needed, require_shape_fn)
653 graph_def_version, node_def_str, input_shapes, input_tensors,
--> 654 input_tensors_as_shapes, status)
655 except errors.InvalidArgumentError as err:
D:\AnacondaIDE\lib\contextlib.py in exit(self, type, value,
traceback)
87 try:
---> 88 next(self.gen)
89 except StopIteration:
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\errors_impl.py
in raise_exception_on_not_ok_status()
465 compat.as_text(pywrap_tensorflow.TF_Message(status)),
--> 466 pywrap_tensorflow.TF_GetCode(status))
467 finally:
InvalidArgumentError: Dimension size must be evenly divisible by
70200000 but is 271957500 for 'Reshape_22' (op: 'Reshape') with input
shapes: [4029,150,150,3], [4] and with input tensors computed as
partial shapes: input1 = [?,3900,6000,3].
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call
last) in ()
----> 1 reconstructed = tf.reshape(features, [-1, height, width, channel])
2 rec_new = tf.space_to_depth(reconstructed,75)
3 rec_new = tf.reshape(rec_new,[h,h,c])
D:\AnacondaIDE\lib\site-packages\tensorflow\python\ops\gen_array_ops.py
in reshape(tensor, shape, name) 2617 """ 2618 result =
_op_def_lib.apply_op("Reshape", tensor=tensor, shape=shape,
-> 2619 name=name) 2620 return result 2621
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\op_def_library.py
in apply_op(self, op_type_name, name, **keywords)
765 op = g.create_op(op_type_name, inputs, output_types, name=scope,
766 input_types=input_types, attrs=attr_protos,
--> 767 op_def=op_def)
768 if output_structure:
769 outputs = op.outputs
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\ops.py in
create_op(self, op_type, inputs, dtypes, input_types, name, attrs,
op_def, compute_shapes, compute_device) 2630
original_op=self._default_original_op, op_def=op_def) 2631 if
compute_shapes:
-> 2632 set_shapes_for_outputs(ret) 2633 self._add_op(ret) 2634
self._record_op_seen_by_control_dependencies(ret)
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\ops.py in
set_shapes_for_outputs(op) 1909 shape_func =
_call_cpp_shape_fn_and_require_op 1910
-> 1911 shapes = shape_func(op) 1912 if shapes is None: 1913 raise RuntimeError(
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\ops.py in
call_with_requiring(op) 1859 1860 def
call_with_requiring(op):
-> 1861 return call_cpp_shape_fn(op, require_shape_fn=True) 1862 1863 _call_cpp_shape_fn_and_require_op =
call_with_requiring
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\common_shapes.py
in call_cpp_shape_fn(op, require_shape_fn)
593 res = _call_cpp_shape_fn_impl(op, input_tensors_needed,
594 input_tensors_as_shapes_needed,
--> 595 require_shape_fn)
596 if not isinstance(res, dict):
597 # Handles the case where _call_cpp_shape_fn_impl calls unknown_shape(op).
D:\AnacondaIDE\lib\site-packages\tensorflow\python\framework\common_shapes.py
in _call_cpp_shape_fn_impl(op, input_tensors_needed,
input_tensors_as_shapes_needed, require_shape_fn)
657 missing_shape_fn = True
658 else:
--> 659 raise ValueError(err.message)
660
661 if missing_shape_fn:
ValueError: Dimension size must be evenly divisible by 70200000 but is
271957500 for 'Reshape_22' (op: 'Reshape') with input shapes:
[4029,150,150,3], [4] and with input tensors computed as partial
shapes: input1 = [?,3900,6000,3].
I know this error is due to incompatible dimensions, but it should be that way, right? Please help me solve this.
I guess the problem is that the author of the post you linked uses the same value for strides and ksizes, while your strides are half of your ksizes. That is why the dimensions do not match. You need to reduce the size of the patches before gluing them back together (for instance, by selecting the central square of each patch).
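The numbers in the error message can be checked by hand, and overlapping patches can also be merged by averaging rather than by a plain reshape. The following is a minimal NumPy sketch, not TensorFlow code. extract_patches and reconstruct are hypothetical helpers, the row-major patch order is an assumption about how the patches are laid out, and averaging the overlaps is a different technique from the central-crop idea suggested above.

```python
import numpy as np


def extract_patches(image, ksize, stride):
    """Cut 'VALID' patches out of an [h, w, c] image, in row-major order."""
    h, w, _ = image.shape
    rows = range(0, h - ksize + 1, stride)
    cols = range(0, w - ksize + 1, stride)
    return np.stack([image[r:r + ksize, c:c + ksize] for r in rows for c in cols])


def reconstruct(patches, height, width, ksize, stride):
    """Paste patches back at their positions and average where they overlap."""
    channels = patches.shape[-1]
    total = np.zeros((height, width, channels), dtype=np.float64)
    count = np.zeros((height, width, 1), dtype=np.float64)
    idx = 0
    for r in range(0, height - ksize + 1, stride):
        for c in range(0, width - ksize + 1, stride):
            total[r:r + ksize, c:c + ksize] += patches[idx]
            count[r:r + ksize, c:c + ksize] += 1
            idx += 1
    # Pixels that no patch covers keep the value 0.
    return total / np.maximum(count, 1)


# Patch count for the sizes in the question: 150x150 patches, stride 75.
n_rows = (3900 - 150) // 75 + 1
n_cols = (6000 - 150) // 75 + 1
n_patches = n_rows * n_cols
patch_values = n_patches * 150 * 150 * 3   # values held by all patches
image_values = 3900 * 6000 * 3             # values in one image
print(n_patches, patch_values, image_values)

# Round trip on a small image: 4x4 patches with stride 2.
small = np.arange(6 * 8 * 3, dtype=np.float64).reshape(6, 8, 3)
restored = reconstruct(extract_patches(small, 4, 2), 6, 8, 4, 2)
```

Because each interior pixel lands in several patches, the patches hold more values than the image, which is why a direct reshape to [1, height, width, 3] cannot work.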
