tfidf first time, using it on a Pandas series that has a list per entry

Data looks like this :
data_clean2.head(3)
   text                                                                               target
0  [deed, reason, earthquak, may, allah, forgiv, u]                                   1
1  [forest, fire, near, la, rong, sask, canada]                                       1
2  [resid, ask, shelter, place, notifi, offic, evacu, shelter, place, order, expect]  1
I got this by tokenizing each sentence and then stemming and lemmatizing the tokens (I hope that is right).
Now I want to use:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data_clean2['text'])
It gives me the following error :
AttributeError Traceback (most recent call last)
<ipython-input-140-6f68d1115c5f> in <module>
1 vectorizer = TfidfVectorizer()
----> 2 vectors = vectorizer.fit_transform(data_clean2['text'])
~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1650 """
1651 self._check_params()
-> 1652 X = super().fit_transform(raw_documents)
1653 self._tfidf.fit(X)
1654 # X is already a transformed view of raw_documents so
~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1056
1057 vocabulary, X = self._count_vocab(raw_documents,
-> 1058 self.fixed_vocabulary_)
1059
1060 if self.binary:
~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
968 for doc in raw_documents:
969 feature_counter = {}
--> 970 for feature in analyze(doc):
971 try:
972 feature_idx = vocabulary[feature]
~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
350 tokenize)
351 return lambda doc: self._word_ngrams(
--> 352 tokenize(preprocess(self.decode(doc))), stop_words)
353
354 else:
~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
254
255 if self.lowercase:
--> 256 return lambda x: strip_accents(x.lower())
257 else:
258 return strip_accents
AttributeError: 'list' object has no attribute 'lower'
I understand that I cannot use it on a list directly. What should I do here: convert each list back into a string?

Yes, first convert each list to a string using:
data_clean2['text'] = data_clean2['text'].apply(', '.join)
Then use:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data_clean2['text'])
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
v = pd.DataFrame(vectors.toarray(), columns = vectorizer.get_feature_names())
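An alternative to joining the tokens back into strings is to give the vectorizer an analyzer that returns each list unchanged, so the tokenization and stemming already done are used as-is. The following is a minimal sketch under that assumption; the sample rows and the identity_analyzer helper are illustrative and not part of the original post.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data: each entry is already a list of stemmed tokens.
docs = pd.Series([
    ["forest", "fire", "near", "la", "rong"],
    ["resid", "ask", "shelter", "place", "shelter"],
])

def identity_analyzer(tokens):
    # Hypothetical helper: return the token list as-is, so no lowercasing
    # or re-tokenizing is attempted on a list object.
    return tokens

vectorizer = TfidfVectorizer(analyzer=identity_analyzer)
vectors = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in the matrix.
vocab = sorted(vectorizer.vocabulary_)
print(vectors.shape)
print(vocab)
```

With a callable analyzer, short tokens such as "la" are kept, whereas the default token pattern would still apply its own rules to the joined string.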

Related

'numpy.ndarray' object has no attribute 'iloc' in x_test

I have four inputs and I am trying to predict a future value. First I scaled my input data to the range 0 to 1, then I created x_test.
Before predicting, I need additional code that predicts a value for every hour. For that I want to extract rows from x_test_n, so I used iloc. Unfortunately that did not work because x_test_n is a NumPy array. I then found another approach and tried it, but it also gave me an error. Here is the code I tried:
data10 = pd.read_csv('data.csv', sep=",")
data10 = data10.replace(np.nan, 0)
data10 = pd.DataFrame(data10,columns=['date','x1','x2','x3','x4'])
data10.set_index('date', inplace=True)
data10 = data10.values
X = 1
n_out = 1
x,y=list(),list()
start =0
for _ in range(len(data10)):
    in_end = start+X
    out_end= in_end + n_out
    if out_end < len(data10):
        x_input = data10[start:in_end]
        x.append(x_input)
        y.append(data10[in_end:out_end,0])
    start +=1
x = np.asanyarray(x)
y = np.asanyarray(y)
scaler_x = preprocessing.MinMaxScaler(feature_range =(0, 1))
x = np.array(x).reshape ((len(x),4 ))
x = scaler_x.fit_transform((x))
scaler_y = preprocessing.MinMaxScaler(feature_range =(0, 1))
y = np.array(y).reshape ((len(y), 1))
y = scaler_y.fit_transform(y)
train_end = 150
x_test=x[train_end: ,]
y_test=y[train_end:]
x_test,y_test = np.array(x_test),np.array(y_test)
x_test = np.reshape(x_test,(x_test.shape[0], x_test.shape[1],1))
My x_test then looks like this:
[[[0.0000000e+00 0.0000000e+00 1.4332613e-01 0.0000000e+00]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 6.8191981e-01]]
[[0.0000000e+00 0.0000000e+00 0.0000000e+00 6.8191981e-01]
[0.0000000e+00 1.4034396e-02 0.0000000e+00 0.0000000e+00]]
[[0.0000000e+00 1.4034396e-02 0.0000000e+00 0.0000000e+00]
[0.0000000e+00 0.0000000e+00 6.3639030e-02 0.0000000e+00]]
After that I want to extract rows from x_test_n using iloc:
filtered_3 = x_test_n
new_df = pd.DataFrame(scaler_x.fit_transform(filtered_3), columns=filtered_3.columns, index=df.index)
Then got an error :
ValueError Traceback (most recent call last)
<ipython-input-26-715b662d895d> in <module>()
101
102 filtered_3 = x_test_n
--> 103 new_df = pd.DataFrame(scaler_x.fit_transform(filtered_3), columns=filtered_3.columns, index=df.index)
104 # current_calorie = filtered_3.iloc[:,]
105 # last_calorie_record = 0
~\Anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
515 if y is None:
516 # fit method of arity 1 (unsupervised transformation)
--> 517 return self.fit(X, **fit_params).transform(X)
518 else:
519 # fit method of arity 2 (supervised transformation)
~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit(self, X, y)
306 # Reset internal state before fitting
307 self._reset()
--> 308 return self.partial_fit(X, y)
309
310 def partial_fit(self, X, y=None):
~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in partial_fit(self, X, y)
332
333 X = check_array(X, copy=self.copy, warn_on_dtype=True,
--> 334 estimator=self, dtype=FLOAT_DTYPES)
335
336 data_min = np.min(X, axis=0)
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
--> 451 % (array.ndim, estimator_name))
452 if force_all_finite:
453 _assert_all_finite(array)
ValueError: Found array with dim 3. MinMaxScaler expected <= 2.
Can anyone help me to solve this error?

If x_test.shape has three values (that is, the array is 3-D), you can try the code below; otherwise we can look at other approaches.
nsamples, nx, ny = x_test.shape
x_test_reshape = x_test.reshape((nsamples,nx*ny))
If it works, you can convert the reshaped array to a DataFrame in the same way.
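The reshape suggested above can be sketched end to end as follows. The random array stands in for x_test and its shape (5, 2, 4) is an assumption made only for illustration; the point is that MinMaxScaler accepts the flattened 2-D array, and that wrapping the result in a DataFrame makes iloc available.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for x_test: 5 samples, each a 2 x 4 window (assumed shape).
rng = np.random.default_rng(0)
x_test = rng.random((5, 2, 4))

# Flatten every sample to one row so the scaler sees a 2-D array.
nsamples, nx, ny = x_test.shape
x_flat = x_test.reshape(nsamples, nx * ny)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(x_flat)

# A DataFrame supports iloc; a plain ndarray does not.
df = pd.DataFrame(scaled)
first_two = df.iloc[:2]

# Restore the original 3-D layout if a model expects it.
restored = scaled.reshape(nsamples, nx, ny)
```

Note that fitting a new scaler on test data changes the scaling learned from the training data; reusing the already fitted scaler with transform() is usually what is wanted, provided the column count matches.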

fit_transform error using CountVectorizer

So I have a dataframe X which looks something like this:
X.head()
0 My wife took me here on my birthday for breakf...
1 I have no idea why some people give bad review...
3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4 General Manager Scott Petello is a good egg!!!...
6 Drop what you're doing and drive here. After I...
Name: text, dtype: object
And then,
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)
But I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-61-8ff79b91e317> in <module>()
----> 1 X = cv.fit_transform(X)
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
867
868 vocabulary, X = self._count_vocab(raw_documents,
--> 869 self.fixed_vocabulary_)
870
871 if self.binary:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
790 for doc in raw_documents:
791 feature_counter = {}
--> 792 for feature in analyze(doc):
793 try:
794 feature_idx = vocabulary[feature]
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
264
265 return lambda doc: self._word_ngrams(
--> 266 tokenize(preprocess(self.decode(doc))), stop_words)
267
268 else:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
230
231 if self.lowercase:
--> 232 return lambda x: strip_accents(x.lower())
233 else:
234 return strip_accents
~/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
574 return self.getnnz()
575 else:
--> 576 raise AttributeError(attr + " not found")
577
578 def transpose(self, axes=None, copy=False):
AttributeError: lower not found
No idea why.

You need to specify the column name of the text data, even if the DataFrame has a single column.
X_countMatrix = cv.fit_transform(X['text'])
A CountVectorizer expects an iterable of strings as input, and when you supply a DataFrame as the argument, the only thing that is iterated is the column names. So even if you had not got an error, the result would have been incorrect. You were lucky to get an error and a chance to correct it.
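The difference between iterating a DataFrame and iterating one of its columns can be checked directly. This is a small sketch with made-up review texts, not the asker's data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative frame with a single text column.
X = pd.DataFrame({"text": [
    "My wife took me here",
    "I have no idea why",
    "Drop what you are doing",
]})

# Iterating a DataFrame yields column labels, not rows.
iterated_frame = list(X)
# Iterating a column (a Series) yields the documents themselves.
iterated_column = list(X["text"])

# Fitting on the frame treats the label "text" as the only document.
counts_frame = CountVectorizer().fit_transform(X)

# Fitting on the column gives one row per document.
cv = CountVectorizer()
counts_column = cv.fit_transform(X["text"])

print(iterated_frame, counts_frame.shape, counts_column.shape)
```

Separately, the traceback in the question ends in scipy/sparse/base.py, which may indicate that X had already been overwritten by an earlier run of X = cv.fit_transform(X). Storing the result under a different name, as the answer does with X_countMatrix, avoids that.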

Tensorflow: function.Defun with a while loop in the body is throwing shape error

I am using a while loop to calculate a cost function for memory reasons. When calculating the gradient, TensorFlow stores Nm tensors, where Nm is the number of iterations in my while loop (this causes the same memory issues I had with the original energy functions). I do not want that because I don't have enough memory. So I want to register a new op, along with a gradient function, that both use a while loop. However, I am having issues using function.Defun with a while loop. To simplify things, I have a small test example below:
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import sparse_ops
from tensorflow.python.framework import function
def _run(tensor):
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        res = sess.run(tensor)
    return res
@function.Defun(tf.float32,tf.float32,func_name ='tf_test_log')#,grad_func=tf_test_logGrad)
def tf_test_log(t_x,t_y):
    #N = t_x.shape[0].value
    condition = lambda i,m1: i<N
    def body(index,x):
        #return[(index+1),tf.concat([x, tf.expand_dims(tf.exp( tf.add( t_x[:,index],t_y[:,index]) ),1) ],1 ) ]
        return[(index+1),tf.add(x, tf.exp( tf.add( t_x[:,0],t_y[:,0]) ) ) ]
    i0 = tf.constant(0,dtype=tf.int32)
    m0 = tf.zeros([N,1],dType)
    ijk_0 = [i0,m0]
    L,t_log_x = tf.while_loop(condition,body,ijk_0,
                              shape_invariants=[i0.get_shape(),
                                                tf.TensorShape([N,None])]
                              )
    return t_log_x
dType = tf.float32
N = np.int32(100)
t_N = tf.constant(N,dtype = tf.int32)
t_x = tf.constant(np.random.randn(N,N),dtype = dType)
t_y = tf.constant(np.random.randn(N,N),dtype = dType)
ys = _run(tf_test_log(t_x,t_y))
I then try to test the new op:
I get a ValueError: The shape for while/Merge_1:0 is not an invariant for the loop. It enters the loop with shape (100, ?), but has shape <unknown> after one iteration. Provide shape invariants using either the shape_invariants argument of tf.while_loop or set_shape() on the loop variables.
Note that calling
If I use a concatenate operation (instead of the add operation that is returned by my while loop), I do not get any issues.
However, if I do not set N as a global variable (i.e. if I do N = t_x.shape[0] inside the body of the tf_test_log function), I get a ValueError:
ValueError: Cannot convert a partially known TensorShape to a Tensor: (?, 1)
What is wrong with my code? Any help is greatly appreciated!
I am using Python 3.5 on Ubuntu 16.04 with TensorFlow 1.4.
full output:
ValueError Traceback (most recent call last)
~/Documents/TheEffingPhDHatersGonnaHate/PAM/defun_while.py in <module>()
51 t_x = tf.constant(np.random.randn(N,N),dtype = dType)
52 t_y = tf.constant(np.random.randn(N,N),dtype = dType)
---> 53 ys = _run(tf_test_log(t_x,t_y))
54
55
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in __call__(self, *args, **kwargs)
503
504 def __call__(self, *args, **kwargs):
--> 505 self.add_to_graph(ops.get_default_graph())
506 args = [ops.convert_to_tensor(_) for _ in args] + self._extra_inputs
507 ret, op = _call(self._signature, *args, **kwargs)
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in add_to_graph(self, g)
484 def add_to_graph(self, g):
485 """Adds this function into the graph g."""
--> 486 self._create_definition_if_needed()
487
488 # Adds this function into 'g'.
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in _create_definition_if_needed(self)
319 """Creates the function definition if it's not created yet."""
320 with context.graph_mode():
--> 321 self._create_definition_if_needed_impl()
322
323 def _create_definition_if_needed_impl(self):
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in _create_definition_if_needed_impl(self)
336 # Call func and gather the output tensors.
337 with vs.variable_scope("", custom_getter=temp_graph.getvar):
--> 338 outputs = self._func(*inputs)
339
340 # There is no way of distinguishing between a function not returning
~/Documents/TheEffingPhDHatersGonnaHate/PAM/defun_while.py in tf_test_log(t_x, t_y)
39 L,t_log_x = tf.while_loop(condition,body,ijk_0,
40 shape_invariants=[i0.get_shape(),
---> 41 tf.TensorShape([N,None])]
42 )
43 return t_log_x
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name)
2814 loop_context = WhileContext(parallel_iterations, back_prop, swap_memory) # pylint: disable=redefined-outer-name
2815 ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, loop_context)
-> 2816 result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
2817 return result
2818
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in BuildLoop(self, pred, body, loop_vars, shape_invariants)
2638 self.Enter()
2639 original_body_result, exit_vars = self._BuildLoop(
-> 2640 pred, body, original_loop_vars, loop_vars, shape_invariants)
2641 finally:
2642 self.Exit()
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in _BuildLoop(self, pred, body, original_loop_vars, loop_vars, shape_invariants)
2619 for m_var, n_var in zip(merge_vars, next_vars):
2620 if isinstance(m_var, ops.Tensor):
-> 2621 _EnforceShapeInvariant(m_var, n_var)
2622
2623 # Exit the loop.
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in _EnforceShapeInvariant(merge_var, next_var)
576 "Provide shape invariants using either the `shape_invariants` "
577 "argument of tf.while_loop or set_shape() on the loop variables."
--> 578 % (merge_var.name, m_shape, n_shape))
579 else:
580 if not isinstance(var, (ops.IndexedSlices, sparse_tensor.SparseTensor)):
ValueError: The shape for while/Merge_1:0 is not an invariant for the loop. It enters the loop with shape (100, ?), but has shape <unknown> after one iteration. Provide shape invariants using either the `shape_invariants` argument of tf.while_loop or set_shape() on the loop variables.

Thanks @Alexandre Passos for the suggestion in the comment above!
The following code is a modification of the original, with a set_shape call added inside the loop body.
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import sparse_ops
from tensorflow.python.framework import function
def _run(tensor):
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        res = sess.run(tensor)
    return res
@function.Defun(tf.float32,tf.float32,tf.float32,func_name ='tf_test_logGrad')
def tf_test_logGrad(t_x,t_y,grad):
    return grad
@function.Defun(tf.float32,tf.float32,func_name ='tf_test_log')#,grad_func=tf_test_logGrad)
def tf_test_log(t_x,t_y):
    #N = t_x.shape[0].value
    condition = lambda i,m1: i<N
    def body(index,x):
        #return[(index+1),tf.concat([x, tf.expand_dims(tf.exp( tf.add( t_x[:,index],t_y[:,index]) ),1) ],1 ) ]
        x = tf.add(x, tf.exp( tf.add( t_x[:,0],t_y[:,0]) ) )
        x.set_shape([N])
        return[(index+1), x]
    i0 = tf.constant(0,dtype=tf.int32)
    m0 = tf.zeros([N],dType)
    ijk_0 = [i0,m0]
    L,t_log_x = tf.while_loop(condition,body,ijk_0,
                              shape_invariants=[i0.get_shape(),
                                                tf.TensorShape([N])]
                              )
    return t_log_x
dType = tf.float32
N = np.int32(100)
t_N = tf.constant(N,dtype = tf.int32)
t_x = tf.constant(np.random.randn(N,N),dtype = dType)
t_y = tf.constant(np.random.randn(N,N),dtype = dType)
ys = _run(tf_test_log(t_x,t_y))
The issue with the global N still persists.
You still need to define N, which sets the shape of the loop tensors, as a global variable outside the Defun-decorated function. If you try to get it from the shape of the inputs of the decorated function, you get:
TypeError Traceback (most recent call last)
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py in zeros(shape, dtype, name)
1438 shape = tensor_shape.as_shape(shape)
-> 1439 output = constant(zero, shape=shape, dtype=dtype, name=name)
1440 except (TypeError, ValueError):
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name, verify_shape)
207 tensor_util.make_tensor_proto(
--> 208 value, dtype=dtype, shape=shape, verify_shape=verify_shape))
209 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)
379 # exception when dtype is set to np.int64
--> 380 if shape is not None and np.prod(shape, dtype=np.int64) == 0:
381 nparray = np.empty(shape, dtype=np_dt)
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/numpy/core/fromnumeric.py in prod(a, axis, dtype, out, keepdims)
2517 return _methods._prod(a, axis=axis, dtype=dtype,
-> 2518 out=out, **kwargs)
2519
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/numpy/core/_methods.py in _prod(a, axis, dtype, out, keepdims)
34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):
---> 35 return umr_prod(a, axis, dtype, out, keepdims)
36
TypeError: __int__ returned non-int (type NoneType)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
~/Documents/TheEffingPhDHatersGonnaHate/PAM/defun_while.py in <module>()
52 t_x = tf.constant(np.random.randn(N,N),dtype = dType)
53 t_y = tf.constant(np.random.randn(N,N),dtype = dType)
---> 54 ys = _run(tf_test_log(t_x,t_y))
55
56
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in __call__(self, *args, **kwargs)
503
504 def __call__(self, *args, **kwargs):
--> 505 self.add_to_graph(ops.get_default_graph())
506 args = [ops.convert_to_tensor(_) for _ in args] + self._extra_inputs
507 ret, op = _call(self._signature, *args, **kwargs)
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in add_to_graph(self, g)
484 def add_to_graph(self, g):
485 """Adds this function into the graph g."""
--> 486 self._create_definition_if_needed()
487
488 # Adds this function into 'g'.
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in _create_definition_if_needed(self)
319 """Creates the function definition if it's not created yet."""
320 with context.graph_mode():
--> 321 self._create_definition_if_needed_impl()
322
323 def _create_definition_if_needed_impl(self):
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/function.py in _create_definition_if_needed_impl(self)
336 # Call func and gather the output tensors.
337 with vs.variable_scope("", custom_getter=temp_graph.getvar):
--> 338 outputs = self._func(*inputs)
339
340 # There is no way of distinguishing between a function not returning
~/Documents/TheEffingPhDHatersGonnaHate/PAM/defun_while.py in tf_test_log(t_x, t_y)
33
34 i0 = tf.constant(0,dtype=tf.int32)
---> 35 m0 = tf.zeros([N],dType)
36
37
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py in zeros(shape, dtype, name)
1439 output = constant(zero, shape=shape, dtype=dtype, name=name)
1440 except (TypeError, ValueError):
-> 1441 shape = ops.convert_to_tensor(shape, dtype=dtypes.int32, name="shape")
1442 output = fill(shape, constant(zero, dtype=dtype), name=name)
1443 assert output.dtype.base_dtype == dtype
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype)
834 name=name,
835 preferred_dtype=preferred_dtype,
--> 836 as_ref=False)
837
838
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx)
924
925 if ret is None:
--> 926 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
927
928 if ret is NotImplemented:
~/environments/tf_1_4_gpu/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in _tensor_shape_tensor_conversion_function(s, dtype, name, as_ref)
248 if not s.is_fully_defined():
249 raise ValueError(
--> 250 "Cannot convert a partially known TensorShape to a Tensor: %s" % s)
251 s_list = s.as_list()
252 int64_value = 0
ValueError: Cannot convert a partially known TensorShape to a Tensor: (?,)
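On the change from tf.zeros([N,1]) to tf.zeros([N]) in the modified code above: one plausible contributing factor in the original version is broadcasting. Adding a length-N vector to an (N, 1) accumulator produces an (N, N) result, so the loop variable would not keep its shape from one iteration to the next. The sketch below illustrates this with NumPy rather than TensorFlow, on the assumption that the same broadcasting rules apply to tf.add; it does not reproduce the Defun behaviour itself.

```python
import numpy as np

N = 4  # small stand-in for the N = 100 used in the question

# Accumulators shaped like the two versions of m0.
column = np.zeros((N, 1), dtype=np.float32)  # original: tf.zeros([N, 1])
vector = np.zeros((N,), dtype=np.float32)    # modified: tf.zeros([N])

# Stand-in for tf.exp(t_x[:, 0] + t_y[:, 0]), which has shape (N,).
update = np.exp(np.random.randn(N).astype(np.float32))

# (N, 1) + (N,) broadcasts to (N, N): the shape changes after one step.
after_column = column + update
# (N,) + (N,) stays (N,): the shape is invariant across steps.
after_vector = vector + update

print(after_column.shape, after_vector.shape)
```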

tfidf vectorizer process shows error

I am working on non-English corpus analysis but am facing several problems. One of them involves TfidfVectorizer. After importing the relevant libraries, I ran the following code to get results:
contents = [open(r"D:\test.txt", encoding='utf8').read()]  # raw string so that "\t" is not read as a tab
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
min_df=0.2, stop_words=stopwords,
use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)
After running the above code I got the following error message:
ValueError Traceback (most recent call last)
<ipython-input-144-bbcec8b8c065> in <module>()
5 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))
6
----> 7 get_ipython().magic('time tfidf_matrix = tfidf_vectorizer.fit_transform(contents) #fit the vectorizer to synopses')
8
9 print(tfidf_matrix.shape)
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in magic(self, arg_s)
2156 magic_name, _, magic_arg_s = arg_s.partition(' ')
2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158 return self.run_line_magic(magic_name, magic_arg_s)
2159
2160 #-------------------------------------------------------------------------
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_line_magic(self, magic_name, line)
2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2078 with self.builtin_trap:
-> 2079 result = fn(*args,**kwargs)
2080 return result
2081
<decorator-gen-60> in time(self, line, cell, local_ns)
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in time(self, line, cell, local_ns)
1178 else:
1179 st = clock2()
-> 1180 exec(code, glob, local_ns)
1181 end = clock2()
1182 out = None
<timed exec> in <module>()
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1303 Tf-idf-weighted document-term matrix.
1304 """
-> 1305 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1306 self._tfidf.fit(X)
1307 # X is already a transformed view of raw_documents so
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
836 max_doc_count,
837 min_doc_count,
--> 838 max_features)
839
840 self.vocabulary_ = vocabulary
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _limit_features(self, X, vocabulary, high, low, limit)
731 kept_indices = np.where(mask)[0]
732 if len(kept_indices) == 0:
--> 733 raise ValueError("After pruning, no terms remain. Try a lower"
734 " min_df or a higher max_df.")
735 return X[:, kept_indices], removed_terms
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
If I change then min and max value the error is

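Assuming your tokenizer works as expected, I see two problems with your code. First, TfidfVectorizer expects a list of documents (strings), whereas you are providing a list that contains a single string, so the whole file is treated as one document. With only one document, every term has a document frequency of 100%, which exceeds max_df=0.8, so every term is pruned. Second, min_df=0.2 is quite high: to be included, a term needs to occur in 20% of all documents, which is very unlikely for trigram features.
The following works for me: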
from sklearn.feature_extraction.text import TfidfVectorizer
with open("README.md") as infile:
    contents = infile.readlines() # Note: readlines() instead of read()
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
min_df=2, use_idf=True, ngram_range=(3,3))
# note: minimum of 2 occurrences, rather than 0.2 (20% of all documents)
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)
outputs (155, 28)
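The effect of the number of documents on pruning can be seen in a small experiment. This is a sketch with invented sentences and default unigram features, not the asker's corpus or tokenizer; it only aims to show that a one-document input fails with max_df=0.8 while a multi-document input keeps the terms whose document frequency falls between min_df and max_df.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One document: every term appears in 100% of documents, above max_df=0.8.
single = ["the quick brown fox jumps over the lazy dog"]
try:
    TfidfVectorizer(max_df=0.8, min_df=0.2).fit_transform(single)
    single_doc_error = None
except ValueError as exc:
    single_doc_error = str(exc)

# Several documents: terms are kept if they occur in at least 2 documents
# (min_df=2) and in at most 80% of them (max_df=0.8).
lines = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird flew over the house",
    "the cat chased the dog",
    "birds fly over houses",
]
vectorizer = TfidfVectorizer(max_df=0.8, min_df=2)
matrix = vectorizer.fit_transform(lines)
kept_terms = sorted(vectorizer.vocabulary_)

print(single_doc_error)
print(matrix.shape, kept_terms)
```

An integer min_df is an absolute document count, while a float is a fraction of all documents, which is why min_df=2 and min_df=0.2 behave so differently on a small corpus.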

sklearn RidgeCV with sample_weight

I'm trying to do a weighted Ridge Regression with sklearn. However, the code breaks when I call the fit method. The exception I get is :
Exception: Data must be 1-dimensional
But I'm sure (by checking through print-statements) that the data I'm passing has the right shapes.
print temp1.shape #(781, 21)
print temp2.shape #(781,)
print weights.shape #(781,)
result=RidgeCV(normalize=True).fit(temp1,temp2,sample_weight=weights)
What could be going wrong?
Here's the whole output :
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-65-a5b1eba5d9cf> in <module>()
22
23
---> 24 result=RidgeCV(normalize=True).fit(temp2,temp1, sample_weight=weights)
25
26
/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/ridge.pyc in fit(self, X, y, sample_weight)
868 gcv_mode=self.gcv_mode,
869 store_cv_values=self.store_cv_values)
--> 870 estimator.fit(X, y, sample_weight=sample_weight)
871 self.alpha_ = estimator.alpha_
872 if self.store_cv_values:
/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/ridge.pyc in fit(self, X, y, sample_weight)
793 else alpha)
794 if error:
--> 795 out, c = _errors(weighted_alpha, y, v, Q, QT_y)
796 else:
797 out, c = _values(weighted_alpha, y, v, Q, QT_y)
/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/ridge.pyc in _errors(self, alpha, y, v, Q, QT_y)
685 w = 1.0 / (v + alpha)
686 c = np.dot(Q, self._diag_dot(w, QT_y))
--> 687 G_diag = self._decomp_diag(w, Q)
688 # handle case where y is 2-d
689 if len(y.shape) != 1:
/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/ridge.pyc in _decomp_diag(self, v_prime, Q)
672 def _decomp_diag(self, v_prime, Q):
673 # compute diagonal of the matrix: dot(Q, dot(diag(v_prime), Q^T))
--> 674 return (v_prime * Q ** 2).sum(axis=-1)
675
676 def _diag_dot(self, D, B):
/usr/local/lib/python2.7/dist-packages/pandas/core/ops.pyc in wrapper(left, right, name)
531 return left._constructor(wrap_results(na_op(lvalues, rvalues)),
532 index=left.index, name=left.name,
--> 533 dtype=dtype)
534 return wrapper
535
/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in __init__(self, data, index, dtype, name, copy, fastpath)
209 else:
210 data = _sanitize_array(data, index, dtype, copy,
--> 211 raise_cast_failure=True)
212
213 data = SingleBlockManager(data, index, fastpath=True)
/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
2683 elif subarr.ndim > 1:
2684 if isinstance(data, np.ndarray):
-> 2685 raise Exception('Data must be 1-dimensional')
2686 else:
2687 subarr = _asarray_tuplesafe(data, dtype=dtype)
Exception: Data must be 1-dimensional

The error seems to be due to sample_weight being a pandas Series rather than a NumPy array:
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
temp1 = pd.DataFrame(np.random.rand(781, 21))
temp2 = pd.Series(temp1.sum(1))
weights = pd.Series(1 + 0.1 * np.random.rand(781))
result = RidgeCV(normalize=True).fit(temp1, temp2,
sample_weight=weights)
# Exception: Data must be 1-dimensional
If you use a numpy array instead, the error goes away:
result = RidgeCV(normalize=True).fit(temp1, temp2,
sample_weight=weights.values)
This seems to be a bug; I've opened a scikit-learn issue to report this.
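The workaround can be written so that it does not depend on how the weights were created: converting them with np.asarray gives a plain 1-D array whether they start as a Series, a list, or an array. The sketch below uses random data of the same shapes as in the question. It omits normalize=True, on the assumption that a recent scikit-learn is installed, since that parameter was removed from RidgeCV in version 1.2.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

# Random stand-ins with the shapes from the question: (781, 21) and (781,).
rng = np.random.default_rng(0)
temp1 = pd.DataFrame(rng.random((781, 21)))
temp2 = temp1.sum(axis=1)
weights = pd.Series(1 + 0.1 * rng.random(781))

# Convert the weights to a plain 1-D NumPy array before fitting.
weights_array = np.asarray(weights, dtype=float)

model = RidgeCV().fit(temp1, temp2, sample_weight=weights_array)
print(model.alpha_, model.coef_.shape)
```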
