Poor probability results for SVM text classification

I'm fairly new to machine learning and I'm using sklearn's SVC to classify texts by date as part of a project, but I'm getting incredibly low probability scores.
I have a corpus of 13 texts, each authored at a different date ranging from 598 to 1358 (which I use as classes), stored in train and test directories. I use a CountVectorizer and TfidfTransformer to prepare the data and pickle the result for later use:
import os
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Load the pickled word lists of the training texts
src = "../data/datasets/pickledWordLists/"
corpus = []
for filename in os.listdir(src):
    with (open(os.path.join(src, filename), "rb")) as openfile:
        while True:
            try:
                text = pickle.load(openfile)
                text = ' '.join(word for word in text)
                corpus.append(text)
            except EOFError:
                break
# Append the pickled word lists of the test texts to the same corpus
src = "../data/datasets/TestsPickled/"
for filename in os.listdir(src):
    with (open(os.path.join(src, filename), "rb")) as openfile:
        while True:
            try:
                text = pickle.load(openfile)
                text = ' '.join(word for word in text)
                corpus.append(text)
            except EOFError:
                break
# Count terms, then reweight the counts with tf-idf
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
vector = vectorizer.transform(corpus)
tfidf_transformer = TfidfTransformer()
vector = tfidf_transformer.fit_transform(vector)
with open('../data/datasets/labeledData/training/train_matrices/corpus.pickle', 'wb') as handle:
    pickle.dump(vector, handle)
After this I load it back in:
train_sparse = []
with open('../data/datasets/labeledData/training/train_matrices/corpus.pickle', 'rb') as handle:
    train_sparse = pickle.load(handle)
train_set = np.array(train_sparse.toarray()[0:9])
test_set = np.array(train_sparse.toarray()[10:14])
labels = [1028, 1107, 1358, 598, 707, 875, 884, 890, 988]
I then fit a SVC with the training data and class labels:
clf = SVC(kernel='linear', C=100, cache_size=300, class_weight='balanced', coef0=10.0,
          decision_function_shape='ovo', degree=10, gamma='auto',
          max_iter=-1, probability=True, random_state=None, shrinking=False,
          tol=1, verbose=False)
clf.fit(train_set, labels)
Experimenting with the fitted SVC, and to gain some confidence in it, I test it on one of the examples it was actually trained on (clf.predict([train_sparse.toarray()[6]]), class 884), expecting a probability score close to 1.0 for that class. Instead I get a very poor result with an incorrect classification:
Actual Class: 884
Predicted Class: [988]
Class: 1028 probability: 0.13680521863292022 %
Class: 1107 probability: 0.1372151835630488 %
Class: 1358 probability: 0.09314753496099398 %
Class: 598 probability: 0.11216304253012621 %
Class: 707 probability: 0.07705449472997644 %
Class: 875 probability: 0.07702437742491991 %
Class: 884 probability: 0.11694844959109225 %
Class: 890 probability: 0.12739816653603753 %
Class: 988 probability: 0.1222435320308847 %
Other attempts produce similar results:
Actual Class: 890
Predicted Class: [890]
Class: 1028 probability: 0.13682366473176108 %
Class: 1107 probability: 0.1372180833047104 %
Class: 1358 probability: 0.09312179238345174 %
Class: 598 probability: 0.11286567433780788 %
Class: 707 probability: 0.07636519076484871 %
Class: 875 probability: 0.07712682059152805 %
Class: 884 probability: 0.1169118710112273 %
Class: 890 probability: 0.12735823672708854 %
Class: 988 probability: 0.12220866614757639 %
Is there anything I can do to get these probability scores up (or down), rather than having them sit around the 10% to 12% mark for everything I try? I tested this on a separate English-language corpus of 9 texts, all about the same size and ranging in date from 900 to 1600, and it gave me very similar scores. I need a probability score because part of my project is to see whether a text can be roughly dated based on a range of class similarity scores from various dates.
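One thing that may be worth checking: the scikit-learn documentation notes that probability=True calibrates the SVC with internal cross-validation and that predict_proba may be inconsistent with predict. With only one text per class there is very little for that calibration to learn from. The following is only a sketch, not a confirmed fix: it ranks classes by the raw decision_function scores instead. The identity matrix is a made-up stand-in for the real tf-idf matrix, and decision_function_shape="ovr" is an assumption chosen so that there is one score per class.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for the tf-idf matrix: nine documents, one per class.
X = np.eye(9)
labels = [1028, 1107, 1358, 598, 707, 875, 884, 890, 988]

# probability is left at its default (False), so no calibration step is run.
clf = SVC(kernel="linear", C=100, decision_function_shape="ovr")
clf.fit(X, labels)

sample = X[[6]]                            # the training document labelled 884
scores = clf.decision_function(sample)[0]  # one score per class
predicted = clf.predict(sample)[0]

# Rank the classes by decision score, highest first.
ranking = sorted(zip(clf.classes_, scores), key=lambda pair: pair[1], reverse=True)
for year, score in ranking:
    print(year, round(float(score), 3))
```

The scores are not probabilities and do not sum to 1, but their ordering gives a rough picture of which date classes the text sits closest to.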

Related

lightgbm || ValueError: Series.dtypes must be int, float or bool

The DataFrame has had its NA values filled.
The schema of the dataset has no object dtype, as specified in the documentation.
df.info()
output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 429 entries, 351 to 559
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 429 non-null category
1 Married 429 non-null category
2 Dependents 429 non-null category
3 Education 429 non-null category
4 Self_Employed 429 non-null category
5 ApplicantIncome 429 non-null int64
6 CoapplicantIncome 429 non-null float64
7 LoanAmount 429 non-null float64
8 Loan_Amount_Term 429 non-null float64
9 Credit_History 429 non-null float64
10 Property_Area 429 non-null category
dtypes: category(6), float64(4), int64(1)
memory usage: 23.3 KB
I am trying to classify the dataset using lightgbm with the following code:
import lightgbm as lgb
train_data=lgb.Dataset(x_train,label=y_train,categorical_feature=cat_cols)
#define parameters
params = {'learning_rate':0.001}
model= lgb.train(params, train_data, 100,categorical_feature=cat_cols)
I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-178-aaa91a2d8719> in <module>
6
7
----> 8 model= lgb.train(params, train_data, 100,categorical_feature=cat_cols)
~\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
229 # construct booster
230 try:
--> 231 booster = Booster(params=params, train_set=train_set)
232 if is_valid_contain_train:
233 booster.set_train_data_name(train_data_name)
~\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, model_str, silent)
1981 break
1982 # construct booster object
-> 1983 train_set.construct()
1984 # copy the parameters from train_set
1985 params.update(train_set.get_params())
~\Anaconda3\lib\site-packages\lightgbm\basic.py in construct(self)
1319 else:
1320 # create train
-> 1321 self._lazy_init(self.data, label=self.label,
1322 weight=self.weight, group=self.group,
1323 init_score=self.init_score, predictor=self._predictor,
~\Anaconda3\lib\site-packages\lightgbm\basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
1133 raise TypeError('Cannot initialize Dataset from {}'.format(type(data).__name__))
1134 if label is not None:
-> 1135 self.set_label(label)
1136 if self.get_label() is None:
1137 raise ValueError("Label should not be None")
~\Anaconda3\lib\site-packages\lightgbm\basic.py in set_label(self, label)
1648 self.label = label
1649 if self.handle is not None:
-> 1650 label = list_to_1d_numpy(_label_from_pandas(label), name='label')
1651 self.set_field('label', label)
1652 self.label = self.get_field('label') # original values can be modified at cpp side
~\Anaconda3\lib\site-packages\lightgbm\basic.py in list_to_1d_numpy(data, dtype, name)
88 elif isinstance(data, Series):
89 if _get_bad_pandas_dtypes([data.dtypes]):
---> 90 raise ValueError('Series.dtypes must be int, float or bool')
91 return np.array(data, dtype=dtype, copy=False) # SparseArray should be supported as well
92 else:
ValueError: Series.dtypes must be int, float or bool

Did anyone help you yet? If not: the answer lies in transforming your variables.
See this link: GitHub Discussion lightGBM
The creators of LightGBM were confronted with the same question once.
In the link above they (STRIKER) tell you that you should transform your variables with astype("category") (pandas/scikit) AND label-encode them, because you need an integer value in your feature column, specifically an int32.
However, label encoding and astype('category') should normally do the same thing:
Encoding
Another useful link is the advanced doc on categorical features: Categorical feature light gbm homepage, where they tell you that LightGBM can't deal with object (string) dtypes.
If this explanation still leaves you uncomfortable, here is my code snippet from the Kaggle space_race_set. If you are still having problems, just ask away.
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

cat_feats = ['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country']
labelencoder = LabelEncoder()
# Replace each category with an integer code
for col in cat_feats:
    train_df[col] = labelencoder.fit_transform(train_df[col])
# Make the integer dtype explicit
for col in cat_feats:
    train_df[col] = train_df[col].astype('int')
y = train_df[["Status Mission"]]
X = train_df.drop(["Status Mission"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
train_data = lgb.Dataset(X_train,
                         label=y_train,
                         categorical_feature=['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country'],
                         free_raw_data=False)
test_data = lgb.Dataset(X_test,
                        label=y_test,
                        categorical_feature=['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country'],
                        free_raw_data=False)

I had the same problem. My y_train was in int64 dtype. This solved my problem:
model_LGB.fit(
    X = X_train,
    y = y_train.astype('int32'))
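The traceback in the question ends in set_label, which suggests that the label Series, rather than a feature column, is what fails the dtype check. As a rough sketch, assuming the label is stored as a pandas category (the column name "Loan_Status" and the values "Y"/"N" are made up for illustration), the label can be converted to integer codes before building the Dataset:

```python
import pandas as pd

# Hypothetical label column; the name and values are assumptions, not from the question.
y_train = pd.Series(["Y", "N", "Y", "Y"], dtype="category", name="Loan_Status")
print(y_train.dtype)        # category

# Replace each category with its integer code and make the dtype explicit.
y_numeric = y_train.cat.codes.astype("int32")
print(y_numeric.dtype)      # int32
print(y_numeric.tolist())   # [1, 0, 1, 1]
```

Checking y_train.dtype first is a cheap way to confirm whether the label is really the column that triggers the error.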

How to do clustering with k-means algorithm for an imported data set with proper scaling of both axis

I'm new to data science, Python, and Jupyter Notebook, and I'm currently studying how to do k-means clustering on a data set. I came across ways to define the data inline:
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
}
df = DataFrame(Data,columns=['x','y'])
and with make_blobs:
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.6, random_state=50)
but I would like to know how to write proper code that imports a CSV file from my computer and runs k-means with scaling. Thank you in advance; I could not find relevant blogs to help me.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
data=pd.read_csv("C:/Users/Dulangi/Downloads/winequality-red.csv")
data
data["alcohol"]=data["alcohol"]/data["alcohol"].max()
data["quality"]=data["quality"]/data["quality"].max()
plt.scatter(data["alcohol"],data['quality'])
plt.xlabel("alcohol")
plt.ylabel('quality')
plt.show()
x=data.copy()
kmeans=KMeans(2)
kmeans.fit(x)
clusters=x.copy()
clusters['cluster_pred']=kmeans.fit_predict(x)
plt.scatter(clusters["alcohol"],clusters['quality'],c=clusters['cluster_pred'],cmap='rainbow')
plt.xlabel("alcohol")
plt.ylabel('quality')
plt.show()
from sklearn import preprocessing
x_scaled=preprocessing.scale(x)
#x_scaled
wcss=[]
for i in range(1,30):
kmeans=KMeans(i)
kmeans.fit(x_scaled)
wcss.append(kmeans.inertia_)
wcss
plt.plot(range(1,30),wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
This is what I tried.
The error I got:
ValueError Traceback (most recent call last)
<ipython-input-12-d4955ce8615e> in <module>
39
40
---> 41 plt.plot(range(1,30),wcss)
42 plt.xlabel('Number of clusters')
43 plt.ylabel('WCSS')
~\Anaconda3\lib\site-packages\matplotlib\pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
2787 return gca().plot(
2788 *args, scalex=scalex, scaley=scaley, **({"data": data} if data
-> 2789 is not None else {}), **kwargs)
2790
2791
~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in plot(self, scalex, scaley, data, *args, **kwargs)
1664 """
1665 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D._alias_map)
-> 1666 lines = [*self._get_lines(*args, data=data, **kwargs)]
1667 for line in lines:
1668 self.add_line(line)
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in __call__(self, *args, **kwargs)
223 this += args[0],
224 args = args[1:]
--> 225 yield from self._plot_args(this, kwargs)
226
227 def get_next_color(self):
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _plot_args(self, tup, kwargs)
389 x, y = index_of(tup[-1])
390
--> 391 x, y = self._xy_from_xy(x, y)
392
393 if self.command == 'plot':
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _xy_from_xy(self, x, y)
268 if x.shape[0] != y.shape[0]:
269 raise ValueError("x and y must have same first dimension, but "
--> 270 "have shapes {} and {}".format(x.shape, y.shape))
271 if x.ndim > 2 or y.ndim > 2:
272 raise ValueError("x and y can be no greater than 2-D, but have "
ValueError: x and y must have same first dimension, but have shapes (29,) and (1,)

You can easily do this using scikit-learn:
import pandas as pd
data=pd.read_csv('myfile.csv')
df=pd.DataFrame(data,index=None)
df.head()
Check if rows contain any null values
df.isnull().sum()
Drop all the rows with null values if any
df.dropna(inplace=True)
Normalize data
Normalize the data with MinMax scaling provided by sklearn
from sklearn import preprocessing
minmax_processed = preprocessing.MinMaxScaler().fit_transform(df.drop('title',axis=1))
df_numeric_scaled = pd.DataFrame(minmax_processed, index=df.index, columns=df.columns[:-1])
df_numeric_scaled.head()
from sklearn.cluster import KMeans
Apply K-Means Clustering
What k to choose?
Let's fit cluster counts from 1 to 19 on our data and look at the corresponding score values.
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
score = [kmeans[i].fit(df_numeric_scaled).score(df_numeric_scaled) for i in range(len(kmeans))]
These score values are the negative of the within-cluster sum of squared distances (the inertia), so they are always at or below 0, and a value closer to 0 means the observations sit closer to their cluster centers. A large negative value indicates that the cluster centers are far from the observations.
Based on these score values, we plot an elbow curve to decide which number of clusters is optimal. Note that we are dealing with a tradeoff between the number of clusters (and hence the computation required) and the relative accuracy.
import matplotlib.pyplot as pl
pl.plot(Nc,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
Fit K-Means for clustering with k=5
kmeans = KMeans(n_clusters=5)
kmeans.fit(df_numeric_scaled)
df['cluster'] = kmeans.labels_
df.head()
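On the error in the question itself: the traceback reports shapes (29,) and (1,), so wcss held a single value. That would be consistent with fit and append running once after the loop rather than on every iteration, though this is an assumption because the indentation is not preserved above. A minimal sketch of the scaled elbow loop follows, using make_blobs as a stand-in for the CSV and StandardScaler as one possible scaling choice:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV; with a real file, use pd.read_csv(...) instead.
points, _ = make_blobs(n_samples=200, n_features=2, centers=4,
                       cluster_std=1.6, random_state=50)
df = pd.DataFrame(points, columns=["x", "y"])

# Scale every column to zero mean and unit variance before clustering.
x_scaled = StandardScaler().fit_transform(df)

ks = range(1, 10)
wcss = []
for k in ks:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(x_scaled)           # must run inside the loop
    wcss.append(kmeans.inertia_)   # one inertia value per k

print(len(ks), len(wcss))          # both lengths match, so plotting works
```

Because wcss now has one entry per value of k, a call such as plt.plot(list(ks), wcss) receives x and y of the same length.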

skLearn fitting data input fails even though numpy data shape is correct

I am trying to fit some (numpy) data with Python sklearn modules, but I keep getting error messages.
When I use an example data set from sklearn, which I load into a variable named iris as shown below,
from sklearn import datasets
iris = datasets.load_diabetes() # load pseudo test data
print(np.shape(iris.data))
print(np.shape(iris.target))
(442, 10)
(442,)
it works fine. But when I use my own data set, which I convert to a numpy array, it fails. I cannot figure out why, as I've explicitly converted it into the same shape as iris:
fileLoc = 'C:\\Users\\2018_signal.csv'
data = pd.read_csv(fileLoc)
fl_data = data[['signal', 'sig_dig', 'std_prx']].values
fl_target = data[['actual']].actual.values
ml_data = fl_data[0:int(fraction * len(fl_data))]
ml_target = fl_target[0:int(fraction * len(fl_target))]
print(np.shape(ml_data))
print(np.shape(ml_target))
(6663, 3)
(6663,)
The sklearn code is as below:
start_time = time.time()
SKknn_pred = KNeighborsClassifier(n_neighbors=1, algorithm='ball_tree', metric = 'euclidean').fit(ml_data, ml_target).predict(ml_data)
print("knn --- %s seconds ---" % (time.time() - start_time))
print("Number of mislabeled points out of a total %d points : %d" % (fl_data.shape[0],(fl_target != SKknn_pred).sum()))
l_time.append(['knn', 1000 * (time.time() - start_time)])
I get the error message below. Help!
ValueError Traceback (most recent call last)
<ipython-input-96-91e2b93e2580> in <module>()
57
58 start_time = time.time()
---> 59 SKgnb_pred = GaussianNB().fit(ml_data, ml_target).predict(fl_data)
60 print("gnb --- %s seconds ---" % (time.time() - start_time))
61 print("Number of mislabeled points out of a total %d points : %d" % (fl_data.shape[0],(fl_target != SKgnb_pred).sum()))
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\naive_bayes.py in fit(self, X, y, sample_weight)
183 X, y = check_X_y(X, y)
184 return self._partial_fit(X, y, np.unique(y), _refit=True,
--> 185 sample_weight=sample_weight)
186
187 @staticmethod
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\naive_bayes.py in _partial_fit(self, X, y, classes, _refit, sample_weight)
348 self.classes_ = None
349
--> 350 if _check_partial_fit_first_call(self, classes):
351 # This is the first call to partial_fit:
352 # initialize various cumulative counters
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in _check_partial_fit_first_call(clf, classes)
319 else:
320 # This is the first call to partial_fit
--> 321 clf.classes_ = unique_labels(classes)
322 return True
323
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in unique_labels(*ys)
95 _unique_labels = _FN_UNIQUE_LABELS.get(label_type, None)
96 if not _unique_labels:
---> 97 raise ValueError("Unknown label type: %s" % repr(ys))
98
99 ys_labels = set(chain.from_iterable(_unique_labels(y) for y in ys))
ValueError: Unknown label type: (array([-78.375, -67.625, -66.75 , ..., 71.375, 76.75 , 78.1 ]),)

I found a way to correct my own error: encode the target values with a LabelEncoder so that sklearn sees class labels rather than continuous values.
from sklearn import preprocessing
from sklearn import utils
lab_enc = preprocessing.LabelEncoder()
ml_target = lab_enc.fit_transform(ml_target)
print(utils.multiclass.type_of_target(ml_target))
print(utils.multiclass.type_of_target(ml_target.astype('float')))
print(utils.multiclass.type_of_target(ml_target))
The sklearn module fits the data after the transform above.
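To see what the transform changes, here is a small sketch that uses a handful of the values from the error message as a stand-in target. It compares what type_of_target reports before and after encoding; the array y is illustrative, not the real data.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.utils.multiclass import type_of_target

# A few of the float targets shown in the error message, used as toy data.
y = np.array([-78.375, -67.625, -66.75, 71.375, 76.75, 78.1])
print(type_of_target(y))        # continuous

# LabelEncoder maps each distinct value to an integer class index.
lab_enc = preprocessing.LabelEncoder()
y_enc = lab_enc.fit_transform(y)
print(type_of_target(y_enc))    # multiclass
print(y_enc.tolist())           # [0, 1, 2, 3, 4, 5]
```

Note that this turns every distinct float into its own class. If the target is genuinely a continuous quantity, a regressor may be a better fit than a classifier.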

Tree classifier to graphviz ERROR

I built a tree classifier named model and tried to use the export_graphviz function like this:
export_graphviz(decision_tree=model,
out_file='NT_model.dot',
feature_names=X_train.columns,
class_names=model.classes_,
leaves_parallel=True,
filled=True,
rotate=False,
rounded=True)
For some reason my run raised this exception:
TypeError Traceback (most recent call last)
<ipython-input-298-40fe56bb0c85> in <module>()
6 filled=True,
7 rotate=False,
----> 8 rounded=True)
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\tree\export.py in export_graphviz(decision_tree, out_file, max_depth, feature_names, class_names, label, filled, leaves_parallel, impurity, node_ids, proportion, rotate, rounded, special_characters)
431 recurse(decision_tree, 0, criterion="impurity")
432 else:
--> 433 recurse(decision_tree.tree_, 0, criterion=decision_tree.criterion)
434
435 # If required, draw leaf nodes at same depth as each other
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\tree\export.py in recurse(tree, node_id, criterion, parent, depth)
319 out_file.write('%d [label=%s'
320 % (node_id,
--> 321 node_to_str(tree, node_id, criterion)))
322
323 if filled:
C:\Users\yonatanv\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\tree\export.py in node_to_str(tree, node_id, criterion)
289 np.argmax(value),
290 characters[2])
--> 291 node_string += class_name
292
293 # Clean up any trailing newlines
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U90') dtype('<U90') dtype('<U90')
These are the hyperparameters of the model I am visualizing:
print(model)
DecisionTreeClassifier(class_weight={1.0: 10, 0.0: 1}, criterion='gini',
max_depth=7, max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=50,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=0, splitter='best')
print(model.classes_)
[ 0. , 1. ]
Help would be most appreciated!

As specified in the documentation of export_graphviz, the class_names parameter expects strings, not floats or ints:
class_names : list of strings, bool or None, optional (default=None)
Try converting model.classes_ to a list of strings before passing it to export_graphviz.
Try class_names=['0', '1'] or class_names=['0.0', '1.0'] in the call to export_graphviz().
For a more general solution, use:
class_names=[str(x) for x in model.classes_]
But is there a specific reason that you are passing float values as y in model.fit()? That is usually not required for a classification task. Are these your actual y labels, or are you converting string labels to numeric values before fitting the model?
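A small self-contained sketch of the general solution, using a toy tree fitted on float labels. The data and the feature name "x0" are made up for illustration, and out_file=None is used so that the DOT source comes back as a string instead of being written to disk.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data with float class labels, as in the question.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Convert the float classes to strings before handing them to export_graphviz.
class_names = [str(c) for c in model.classes_]

dot_source = export_graphviz(model,
                             out_file=None,          # return the DOT text
                             feature_names=["x0"],   # hypothetical feature name
                             class_names=class_names,
                             filled=True,
                             rounded=True)
print(class_names)
print(dot_source[:60])
```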

tfidf vectorizer process shows error

I am working on non-English corpus analysis but am facing several problems. One of them involves the TfidfVectorizer. After importing the relevant libraries, I ran the following code to get results:
contents = [open("D:\test.txt", encoding='utf8').read()]
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
min_df=0.2, stop_words=stopwords,
use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)
After running the above code I got the following error message:
ValueError Traceback (most recent call last)
<ipython-input-144-bbcec8b8c065> in <module>()
5 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))
6
----> 7 get_ipython().magic('time tfidf_matrix = tfidf_vectorizer.fit_transform(contents) #fit the vectorizer to synopses')
8
9 print(tfidf_matrix.shape)
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in magic(self, arg_s)
2156 magic_name, _, magic_arg_s = arg_s.partition(' ')
2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158 return self.run_line_magic(magic_name, magic_arg_s)
2159
2160 #-------------------------------------------------------------------------
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_line_magic(self, magic_name, line)
2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2078 with self.builtin_trap:
-> 2079 result = fn(*args,**kwargs)
2080 return result
2081
<decorator-gen-60> in time(self, line, cell, local_ns)
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):
C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in time(self, line, cell, local_ns)
1178 else:
1179 st = clock2()
-> 1180 exec(code, glob, local_ns)
1181 end = clock2()
1182 out = None
<timed exec> in <module>()
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1303 Tf-idf-weighted document-term matrix.
1304 """
-> 1305 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1306 self._tfidf.fit(X)
1307 # X is already a transformed view of raw_documents so
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
836 max_doc_count,
837 min_doc_count,
--> 838 max_features)
839
840 self.vocabulary_ = vocabulary
C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _limit_features(self, X, vocabulary, high, low, limit)
731 kept_indices = np.where(mask)[0]
732 if len(kept_indices) == 0:
--> 733 raise ValueError("After pruning, no terms remain. Try a lower"
734 " min_df or a higher max_df.")
735 return X[:, kept_indices], removed_terms
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
If I change the min and max values, the error is

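Assuming your tokeniser works as expected, I see two problems with your code. First, TfidfVectorizer expects a collection of documents (a list of strings), whereas you are providing a list that contains a single string, so the whole file counts as one document. With one document, every term has a document frequency of 100%, which exceeds max_df=0.8, so every term is pruned. Second, min_df=0.2 is quite high: to be included, a term needs to occur in 20% of all documents, which is very unlikely for trigram features.
The following works for me: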
from sklearn.feature_extraction.text import TfidfVectorizer
with open("README.md") as infile:
    contents = infile.readlines() # Note: readlines() instead of read()
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
min_df=2, use_idf=True, ngram_range=(3,3))
# note: minimum of 2 occurrences, rather than 0.2 (20% of all documents)
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)
outputs (155, 28)
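The single-document effect can be reproduced with a few toy sentences. This sketch uses the default tokenizer and unigrams rather than the stemmer and trigrams from the question, so it only illustrates how max_df and min_df interact with the number of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the cat sat on the sofa",
        "the dog sat on the mat"]

# One big string: every term appears in 100% of the (single) documents.
single = [" ".join(docs)]
try:
    TfidfVectorizer(max_df=0.8, min_df=0.2).fit_transform(single)
    pruned_everything = False
except ValueError as err:
    pruned_everything = True
    print(err)

# Three separate documents: only terms present in all three are dropped.
vec = TfidfVectorizer(max_df=0.8, min_df=0.2)
matrix = vec.fit_transform(docs)
print(matrix.shape)              # (3, 4)
print(sorted(vec.vocabulary_))   # ['cat', 'dog', 'mat', 'sofa']
```

With three documents, "the", "sat" and "on" occur in all of them and are removed by max_df=0.8, while the remaining four terms survive.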
