Train a Word2Vec model using Gensim - python-3.x

This is my code. It reads reviews from an Excel file (the rev column) and builds a list of lists.
xp looks like this:
["['intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one'],['better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', 'investigator', 'thrust', 'murder', 'investigation', 'invisible'],[ 'man', 'alone', 'tell', 'fun', 'flow', 'decent', 'clip', 'need', 'say', 'sequence', 'comedy', 'gold', 'like', 'scene', 'restaurant', 'excellent', 'costello', 'pretending', 'work', 'ball', 'gym', 'final', 'reel']"]
But when I use the list for the model, it gives me the error "TypeError: 'float' object is not iterable". I don't know where my problem is.
Thanks.
xp = []
import gensim
import logging
import pandas as pd

file = r'FileNamelast.xlsx'
df = pd.read_excel(file, sheet_name='FileNamex')
pages = [i for i in range(0, 1000)]
for page in pages:
    text = df.loc[page, ["rev"]]
    xp.append(text[0])

model = gensim.models.Word2Vec(xp, size=150, window=10, min_count=2,
                               workers=10)
model.train(xp, total_examples=len(xp), epochs=10)
This is what I got: TypeError: 'float' object is not iterable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-32-aa34c0e432bf> in <module>()
14
15
---> 16 model = gensim.models.Word2Vec (xp, size=150, window=10, min_count=2, workers=10)
17 model.train(xp,total_examples=len(xp),epochs=10)
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __init__(self, sentences, corpus_file, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, ns_exponent, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words, compute_loss, callbacks, max_final_vocab)
765 callbacks=callbacks, batch_words=batch_words, trim_rule=trim_rule, sg=sg, alpha=alpha, window=window,
766 seed=seed, hs=hs, negative=negative, cbow_mean=cbow_mean, min_alpha=min_alpha, compute_loss=compute_loss,
--> 767 fast_version=FAST_VERSION)
768
769 def _do_train_epoch(self, corpus_file, thread_id, offset, cython_vocab, thread_private_mem, cur_epoch,
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in __init__(self, sentences, corpus_file, workers, vector_size, epochs, callbacks, batch_words, trim_rule, sg, alpha, window, seed, hs, negative, ns_exponent, cbow_mean, min_alpha, compute_loss, fast_version, **kwargs)
757 raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
758
--> 759 self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
760 self.train(
761 sentences=sentences, corpus_file=corpus_file, total_examples=self.corpus_count,
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in build_vocab(self, sentences, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
934 """
935 total_words, corpus_count = self.vocabulary.scan_vocab(
--> 936 sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
937 self.corpus_count = corpus_count
938 self.corpus_total_words = total_words
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\word2vec.py in scan_vocab(self, sentences, corpus_file, progress_per, workers, trim_rule)
1569 sentences = LineSentence(corpus_file)
1570
-> 1571 total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
1572
1573 logger.info(
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\word2vec.py in _scan_vocab(self, sentences, progress_per, trim_rule)
1552 sentence_no, total_words, len(vocab)
1553 )
-> 1554 for word in sentence:
1555 vocab[word] += 1
1556 total_words += len(sentence)
TypeError: 'float' object is not iterable

The sentences corpus argument to Word2Vec should be an iterable sequence of lists-of-word-tokens.
Your reported value for xp is actually a list with one long string in it:
[
"['intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one'],['better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', 'investigator', 'thrust', 'murder', 'investigation', 'invisible'],[ 'man', 'alone', 'tell', 'fun', 'flow', 'decent', 'clip', 'need', 'say', 'sequence', 'comedy', 'gold', 'like', 'scene', 'restaurant', 'excellent', 'costello', 'pretending', 'work', 'ball', 'gym', 'final', 'reel']"
]
I don't see how this would give the error you've reported, but it's definitely wrong, so it should be fixed. You should perhaps print xp just before you instantiate Word2Vec, to be sure you know what it contains.
A true list, with each item being a list-of-string-tokens, would work. So if xp were the following that'd be correct:
[
['intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one'],
['better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', 'investigator', 'thrust', 'murder', 'investigation', 'invisible'],
[ 'man', 'alone', 'tell', 'fun', 'flow', 'decent', 'clip', 'need', 'say', 'sequence', 'comedy', 'gold', 'like', 'scene', 'restaurant', 'excellent', 'costello', 'pretending', 'work', 'ball', 'gym', 'final', 'reel']
]
Note, however:
Word2Vec doesn't do well with toy-sized datasets. So while this tiny setup may be helpful to check for basic syntax/format issues, don't expect realistic results until you're training with many hundreds-of-thousands of words.
You don't need to call train() if you already supplied your corpus at instantiation, as you have. The model will do all steps automatically. (If, on the other hand, you don't supply your corpus, you'd then have to call both build_vocab() and train().) If you enable logging at the INFO level all the steps happening behind the scenes will be clearer.
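For concreteness, here is a minimal sketch of how xp could be built as a list of token-lists. It assumes each cell in the rev column holds either plain text or a single stringified list of tokens (the column, file, and parameter names are taken from the question, and size= follows the gensim 3.x API used there), and it skips non-string cells such as NaN, which pandas stores as floats:
import ast
import logging
import gensim
import pandas as pd

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

df = pd.read_excel(r'FileNamelast.xlsx', sheet_name='FileNamex')

xp = []
for text in df["rev"].head(1000):
    if not isinstance(text, str):      # NaN cells are floats -- skip them
        continue
    if text.startswith("["):           # cell holds a stringified list of tokens
        xp.append([str(tok) for tok in ast.literal_eval(text)])
    else:                              # plain text cell -- naive whitespace split
        xp.append(text.split())

# Supplying the corpus here runs build_vocab() and train() automatically,
# so no separate model.train() call is needed.
model = gensim.models.Word2Vec(xp, size=150, window=10, min_count=2, workers=10)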

Related

How do I pass the values to Catboost?

I'm trying to work with CatBoost and I've got a problem that I'm really stuck on right now. I have a dataframe with 28 columns, 2 of which are categorical. The numerical columns contain both whole and fractional numbers, plus some 0.00 values that are meant to represent actual zeros (like 1-1=0), not missing values.
I'm trying to run this:
train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
But I get this error:
---------------------------------------------------------------------------
CatBoostError Traceback (most recent call last)
<ipython-input-112-a515b0ab357b> in <module>
1 train_cl = cb.Pool(data=ret_df.iloc[:580000, :-1], label=ret_df.iloc[:580000, -1], cat_features=cats)
----> 2 evl_cl = cb.Pool(data=ret_df.iloc[580000:, :-1], label=ret_df.iloc[580000:, -1], cat_features=cats)
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in __init__(self, data, label, cat_features, text_features, embedding_features, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count, log_cout, log_cerr)
615 )
616
--> 617 self._init(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
618 super(Pool, self).__init__()
619
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _init(self, data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
1081 if label is not None:
1082 self._check_label_type(label)
-> 1083 self._check_label_empty(label)
1084 label = self._label_if_pandas_to_numpy(label)
1085 if len(np.shape(label)) == 1:
~\AppData\Local\Programs\Python\Python36\lib\site-packages\catboost\core.py in _check_label_empty(self, label)
723 """
724 if len(label) == 0:
--> 725 raise CatBoostError("Labels variable is empty.")
726
727 def _check_label_shape(self, label, samples_count):
CatBoostError: Labels variable is empty.
I've googled this problem but found nothing. My hypothesis is that the 0.00 values are the problem, but I don't know how to solve it because I literally can't replace these values with anything.
Please help me!
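The traceback shows exactly which check fires: _check_label_empty raises "Labels variable is empty." when len(label) == 0. So one thing worth verifying, as a hedged diagnostic sketch using the names from the question, is whether ret_df actually has more than 580000 rows; if it does not, ret_df.iloc[580000:, -1] is empty and the second Pool gets an empty label:
# cb is the catboost module from the question (import catboost as cb);
# ret_df and cats are assumed to be defined as in the question.
n_rows = len(ret_df)
print("total rows:", n_rows)
print("train labels:", len(ret_df.iloc[:580000, -1]))
print("eval labels:", len(ret_df.iloc[580000:, -1]))   # 0 whenever n_rows <= 580000

# If the eval slice turns out to be empty, a row-count-aware split avoids the error:
split = int(n_rows * 0.8)
train_cl = cb.Pool(data=ret_df.iloc[:split, :-1], label=ret_df.iloc[:split, -1], cat_features=cats)
evl_cl = cb.Pool(data=ret_df.iloc[split:, :-1], label=ret_df.iloc[split:, -1], cat_features=cats)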

How to parallelize classification with Zero Shot Classification by Huggingface?

I have around 70 categories (it could also be 20 or 30) and I want to parallelize the process using Ray, but I get an error:
import pandas as pd
import swifter
import json
import ray
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
labels = ["vegetables", "potato", "bell pepper", "tomato", "onion", "carrot", "broccoli",
"lettuce", "cucumber", "celery", "corn", "garlic", "mashrooms", "cabbage", "spinach",
"beans", "cauliflower", "asparagus", "fruits", "bananas", "apples", "strawberries",
"grapes", "oranges", "lemons", "avocados", "peaches", "blueberries", "pineapple",
"cherries", "pears", "mangoe", "berries", "red meat", "beef", "pork", "mutton",
"veal", "lamb", "venison", "goat", "mince", "white meat", "chicken", "turkey",
"duck", "goose", "pheasant", "rabbit", "Processed meat", "sausages", "bacon",
"ham", "hot dogs", "frankfurters", "tinned meat", "salami", "pâtés", "beef jerky",
"chorizo", "pepperoni", "corned beef", "fish", "catfish", "cod", "pangasius", "pollock",
"tilapia", "tuna", "salmon", "seafood", "shrimp", "squid", "mussels", "scallop",
"octopus", "grains", "rice", "wheat", "bulgur", "corn", "oat", "quinoa", "buckwheat",
"meals", "salad", "soup", "steak", "pizza", "pie", "burger", "backery", "bread", "souce",
"pasta", "sandwich", "waffles", "barbecue", "roll", "wings", "ribs", "cookies"]
ray.init()

@ray.remote
def get_meal_category(seq, labels, n=3):
    res_dict = classifier(seq, labels)
    return list(zip([seq for i in range(n)], res_dict["labels"][0:n], res_dict["scores"][0:n]))

res_list = ray.get([get_meal_category.remote(merged_df["title"][i], labels) for i in range(10)])
Where merged_df is a big dataframe with meal names in its title column, like:
['Cappuccino',
'Stove Top Stuffing Mix For Turkey (Kraft)',
'Stove Top Stuffing Mix For Turkey (Kraft)',
'Roasted Dark Turkey Meat',
'Roasted Dark Turkey Meat',
'Roasted Dark Turkey Meat',
'Cappuccino',
'Low Fat 2% Small Curd Cottage Cheese (Daisy)',
'Rice Cereal (Gerber)',
'Oranges']
Please advise how to avoid ray's error and parallelize the classification.
The error:
2021-02-17 16:54:51,689 WARNING worker.py:1107 -- Warning: The remote function __main__.get_meal_category has size 1630925709 when pickled. It will be stored in Redis, which could cause memory issues. This may mean that its definition uses a large array or other object.
---------------------------------------------------------------------------
ConnectionResetError Traceback (most recent call last)
~/.local/lib/python3.8/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
705 for item in command:
--> 706 sendall(self._sock, item)
707 except socket.timeout:
~/.local/lib/python3.8/site-packages/redis/_compat.py in sendall(sock, *args, **kwargs)
8 def sendall(sock, *args, **kwargs):
----> 9 return sock.sendall(*args, **kwargs)
10
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
<ipython-input-9-1a5345832fba> in <module>
----> 1 res_list = ray.get([get_meal_category.remote(merged_df["title"][i], labels) for i in range(10)])
<ipython-input-9-1a5345832fba> in <listcomp>(.0)
----> 1 res_list = ray.get([get_meal_category.remote(merged_df["title"][i], labels) for i in range(10)])
~/.local/lib/python3.8/site-packages/ray/remote_function.py in _remote_proxy(*args, **kwargs)
99 #wraps(function)
100 def _remote_proxy(*args, **kwargs):
--> 101 return self._remote(args=args, kwargs=kwargs)
102
103 self.remote = _remote_proxy
~/.local/lib/python3.8/site-packages/ray/remote_function.py in _remote(self, args, kwargs, num_returns, num_cpus, num_gpus, memory, object_store_memory, accelerator_type, resources, max_retries, placement_group, placement_group_bundle_index, placement_group_capture_child_tasks, override_environment_variables, name)
205
206 self._last_export_session_and_job = worker.current_session_and_job
--> 207 worker.function_actor_manager.export(self)
208
209 kwargs = {} if kwargs is None else kwargs
~/.local/lib/python3.8/site-packages/ray/function_manager.py in export(self, remote_function)
142 key = (b"RemoteFunction:" + self._worker.current_job_id.binary() + b":"
143 + remote_function._function_descriptor.function_id.binary())
--> 144 self._worker.redis_client.hset(
145 key,
146 mapping={
~/.local/lib/python3.8/site-packages/redis/client.py in hset(self, name, key, value, mapping)
3048 items.extend(pair)
3049
-> 3050 return self.execute_command('HSET', name, *items)
3051
3052 def hsetnx(self, name, key, value):
~/.local/lib/python3.8/site-packages/redis/client.py in execute_command(self, *args, **options)
898 conn = self.connection or pool.get_connection(command_name, **options)
899 try:
--> 900 conn.send_command(*args)
901 return self.parse_response(conn, command_name, **options)
902 except (ConnectionError, TimeoutError) as e:
~/.local/lib/python3.8/site-packages/redis/connection.py in send_command(self, *args, **kwargs)
723 def send_command(self, *args, **kwargs):
724 "Pack and send a command to the Redis server"
--> 725 self.send_packed_command(self.pack_command(*args),
726 check_health=kwargs.get('check_health', True))
727
~/.local/lib/python3.8/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
715 errno = e.args[0]
716 errmsg = e.args[1]
--> 717 raise ConnectionError("Error %s while writing to socket. %s." %
718 (errno, errmsg))
719 except BaseException:
ConnectionError: Error 104 while writing to socket. Connection reset by peer.
This error is happening because large objects are being sent to Redis. merged_df is a large dataframe, and since you are calling get_meal_category 10 times, Ray will attempt to serialize merged_df 10 times. If you instead put merged_df into the Ray object store just once and pass along a reference to the object, this should work.
EDIT: Since the classifier is also large, do something similar for that as well.
Can you try something like this:
ray.init()

df_ref = ray.put(merged_df)
model_ref = ray.put(classifier)

@ray.remote
def get_meal_category(classifier, df, i, labels, n=3):
    seq = df["title"][i]
    res_dict = classifier(seq, labels)
    return list(zip([seq for i in range(n)], res_dict["labels"][0:n], res_dict["scores"][0:n]))

res_list = ray.get([get_meal_category.remote(model_ref, df_ref, i, labels) for i in range(10)])
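Note that Ray resolves ObjectRefs passed as top-level task arguments before the function runs, so inside get_meal_category the classifier and df parameters are the actual objects; each worker fetches them once from the shared object store instead of the driver re-pickling the huge function with every remote call.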

lightgbm || ValueError: Series.dtypes must be int, float or bool

The dataframe has its NA values filled.
The dataset's schema has no object dtype, as specified in the documentation.
df.info()
output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 429 entries, 351 to 559
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 429 non-null category
1 Married 429 non-null category
2 Dependents 429 non-null category
3 Education 429 non-null category
4 Self_Employed 429 non-null category
5 ApplicantIncome 429 non-null int64
6 CoapplicantIncome 429 non-null float64
7 LoanAmount 429 non-null float64
8 Loan_Amount_Term 429 non-null float64
9 Credit_History 429 non-null float64
10 Property_Area 429 non-null category
dtypes: category(6), float64(4), int64(1)
memory usage: 23.3 KB
I am trying to classify the dataset using LightGBM. I have the following code:
import lightgbm as lgb

train_data = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_cols)

# define parameters
params = {'learning_rate': 0.001}

model = lgb.train(params, train_data, 100, categorical_feature=cat_cols)
I'm getting the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-178-aaa91a2d8719> in <module>
6
7
----> 8 model= lgb.train(params, train_data, 100,categorical_feature=cat_cols)
~\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
229 # construct booster
230 try:
--> 231 booster = Booster(params=params, train_set=train_set)
232 if is_valid_contain_train:
233 booster.set_train_data_name(train_data_name)
~\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, model_str, silent)
1981 break
1982 # construct booster object
-> 1983 train_set.construct()
1984 # copy the parameters from train_set
1985 params.update(train_set.get_params())
~\Anaconda3\lib\site-packages\lightgbm\basic.py in construct(self)
1319 else:
1320 # create train
-> 1321 self._lazy_init(self.data, label=self.label,
1322 weight=self.weight, group=self.group,
1323 init_score=self.init_score, predictor=self._predictor,
~\Anaconda3\lib\site-packages\lightgbm\basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
1133 raise TypeError('Cannot initialize Dataset from {}'.format(type(data).__name__))
1134 if label is not None:
-> 1135 self.set_label(label)
1136 if self.get_label() is None:
1137 raise ValueError("Label should not be None")
~\Anaconda3\lib\site-packages\lightgbm\basic.py in set_label(self, label)
1648 self.label = label
1649 if self.handle is not None:
-> 1650 label = list_to_1d_numpy(_label_from_pandas(label), name='label')
1651 self.set_field('label', label)
1652 self.label = self.get_field('label') # original values can be modified at cpp side
~\Anaconda3\lib\site-packages\lightgbm\basic.py in list_to_1d_numpy(data, dtype, name)
88 elif isinstance(data, Series):
89 if _get_bad_pandas_dtypes([data.dtypes]):
---> 90 raise ValueError('Series.dtypes must be int, float or bool')
91 return np.array(data, dtype=dtype, copy=False) # SparseArray should be supported as well
92 else:
ValueError: Series.dtypes must be int, float or bool
Has anyone helped you yet? If not: the answer lies in transforming your variables.
Go to this link: GitHub Discussion lightGBM
The creators of LightGBM were confronted with that same question once.
In the link above they (STRIKER) tell you that you should transform your variables with astype("category") (pandas/scikit) AND that you should label-encode them, because you need an INT value (specifically an int32) in your feature column.
However, label encoding and astype('category') should normally do the same thing:
Encoding
Another useful link is this advanced doc about the categorical feature: Categorical feature (LightGBM homepage), where they tell you that LightGBM can't deal with object (string) dtypes like the ones in your data.
If you are still feeling uncomfortable with this explanation, here is my code snippet from the Kaggle space_race_set. If you are still having problems, just ask away.
cat_feats = ['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country']

labelencoder = LabelEncoder()
for col in cat_feats:
    train_df[col] = labelencoder.fit_transform(train_df[col])

for col in cat_feats:
    train_df[col] = train_df[col].astype('int')

y = train_df[["Status Mission"]]
X = train_df.drop(["Status Mission"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_data = lgb.Dataset(X_train,
                         label=y_train,
                         categorical_feature=['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country'],
                         free_raw_data=False)

test_data = lgb.Dataset(X_test,
                        label=y_test,
                        categorical_feature=['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country'],
                        free_raw_data=False)
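Applied to the question's own columns (names taken from the df.info() output above), an untested minimal sketch of the same idea would be: turn the category columns into integer codes and make sure the label Series is numeric before building the Dataset. x_train and y_train are assumed to still carry the dtypes shown in df.info():
import lightgbm as lgb

cat_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']

# Turn each pandas 'category' column into its integer codes (the INT values the
# GitHub discussion above recommends).
for col in cat_cols:
    x_train[col] = x_train[col].cat.codes

# The traceback fires on the label, so make sure y_train is a plain numeric Series.
y_train = y_train.astype('int32')

train_data = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_cols)
params = {'learning_rate': 0.001}
model = lgb.train(params, train_data, 100, categorical_feature=cat_cols)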
I had the same problem. My y_train was in int64 dtype. This solved my problem:
model_LGB.fit(
    X=X_train,
    y=y_train.astype('int32'))

Tree classifier to graphviz ERROR

I made a tree classifier named model and tried to use the export_graphviz function like this:
export_graphviz(decision_tree=model,
                out_file='NT_model.dot',
                feature_names=X_train.columns,
                class_names=model.classes_,
                leaves_parallel=True,
                filled=True,
                rotate=False,
                rounded=True)
For some reason my run raised this exception:
TypeError Traceback (most recent call last)
<ipython-input-298-40fe56bb0c85> in <module>()
6 filled=True,
7 rotate=False,
----> 8 rounded=True)
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in export_graphviz(decision_tree, out_file,
max_depth, feature_names, class_names, label, filled, leaves_parallel,
impurity, node_ids, proportion, rotate, rounded, special_characters)
431 recurse(decision_tree, 0, criterion="impurity")
432 else:
--> 433 recurse(decision_tree.tree_, 0,
criterion=decision_tree.criterion)
434
435 # If required, draw leaf nodes at same depth as each other
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in recurse(tree, node_id, criterion, parent,
depth)
319 out_file.write('%d [label=%s'
320 % (node_id,
--> 321 node_to_str(tree, node_id,
criterion)))
322
323 if filled:
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-
packages\sklearn\tree\export.py in node_to_str(tree, node_id, criterion)
289 np.argmax(value),
290 characters[2])
--> 291 node_string += class_name
292
293 # Clean up any trailing newlines
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U90') dtype('<U90') dtype('<U90')
These are the hyperparameters of the model I'm visualizing:
print(model)
DecisionTreeClassifier(class_weight={1.0: 10, 0.0: 1}, criterion='gini',
max_depth=7, max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=50,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=0, splitter='best')
print(model.classes_)
[ 0. , 1. ]
Help would be most appreciated!
As specified here in the documentation of export_graphviz, the class_names param works with strings, not float or int.
class_names : list of strings, bool or None, optional (default=None)
Try converting model.classes_ to a list of strings before passing it to export_graphviz.
Try class_names=['0', '1'] or class_names=['0.0', '1.0'] in the call to export_graphviz().
For a more general solution, use:
class_names=[str(x) for x in model.classes_]
But is there a specific reason you are passing float values as y in model.fit()? That is mostly not required for a classification task. Do you actually have y labels like this, or are you converting string labels to numeric before fitting the model?
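Putting that together, a minimal sketch of the corrected call (the arguments are the ones from the question; only class_names is converted to strings):
from sklearn.tree import export_graphviz

export_graphviz(decision_tree=model,
                out_file='NT_model.dot',
                feature_names=X_train.columns,
                class_names=[str(x) for x in model.classes_],   # e.g. ['0.0', '1.0']
                leaves_parallel=True,
                filled=True,
                rotate=False,
                rounded=True)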

tfidf vectorizer process shows error

I am working on non-English corpus analysis but facing several problems. One of those problems is tfidf_vectorizer. After importing the relevant libraries, I ran the following code to get results:
contents = [open("D:\test.txt", encoding='utf8').read()]

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=stopwords,
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)
After running the above code, I got the following error message.
ValueError Traceback (most recent call last)
<ipython-input-144-bbcec8b8c065> in <module>()
5 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))
6
----> 7 get_ipython().magic('time tfidf_matrix = tfidf_vectorizer.fit_transform(contents) #fit the vectorizer to synopses')
8
9 print(tfidf_matrix.shape)
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in magic(self, arg_s)
2156 magic_name, _, magic_arg_s = arg_s.partition(' ')
2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158 return self.run_line_magic(magic_name, magic_arg_s)
2159
2160 #-------------------------------------------------------------------------
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_line_magic(self, magic_name, line)
2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2078 with self.builtin_trap:
-> 2079 result = fn(*args,**kwargs)
2080 return result
2081
<decorator-gen-60> in time(self, line, cell, local_ns)
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in time(self, line, cell, local_ns)
1178 else:
1179 st = clock2()
-> 1180 exec(code, glob, local_ns)
1181 end = clock2()
1182 out = None
<timed exec> in <module>()
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1303 Tf-idf-weighted document-term matrix.
1304 """
-> 1305 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1306 self._tfidf.fit(X)
1307 # X is already a transformed view of raw_documents so
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
836 max_doc_count,
837 min_doc_count,
--> 838 max_features)
839
840 self.vocabulary_ = vocabulary
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _limit_features(self, X, vocabulary, high, low, limit)
731 kept_indices = np.where(mask)[0]
732 if len(kept_indices) == 0:
--> 733 raise ValueError("After pruning, no terms remain. Try a lower"
734 " min_df or a higher max_df.")
735 return X[:, kept_indices], removed_terms
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
If I change the min and max values, I still get an error.
Assuming your tokeniser works as expected, I see two problems with your code. First, TfidfVectorizer expects a list of strings, whereas you are providing a single string. Second, min_df=0.2 is quite high: to be included, a term needs to occur in 20% of all documents, which is very unlikely for trigram features.
The following works for me
from sklearn.feature_extraction.text import TfidfVectorizer

with open("README.md") as infile:
    contents = infile.readlines()  # Note: readlines() instead of read()

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=2, use_idf=True, ngram_range=(3,3))
# note: minimum of 2 occurrences, rather than 0.2 (20% of all documents)

tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)
outputs (155, 28)
