scikit learn says num samples must be greater than num clusters - scikit-learn

Using sklearn.cluster.KMeans. Nearly this exact code worked earlier; all I changed was the way I built my dataset. I have no idea where to even start... Here's the code:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=20)

for item in dfX:
    if type(item) != type(dfX[0]):
        print(item)

print(len(dfX))
print(dfX[:10])

km.fit(dfX)
print(km.cluster_centers_)
Which outputs the following:
12147
[1.201, 1.237, 1.092, 1.074, 0.979, 0.885, 1.018, 1.083, 1.067, 1.071]
/home/sbendl/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
Traceback (most recent call last):
File "/home/sbendl/PycharmProjects/MLFP/K-means.py", line 20, in <module>
km.fit(dfX)
File "/home/sbendl/anaconda3/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 812, in fit
X = self._check_fit_data(X)
File "/home/sbendl/anaconda3/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 789, in _check_fit_data
X.shape[0], self.n_clusters))
ValueError: n_samples=1 should be >= n_clusters=20
Process finished with exit code 1
As you can see from the output, there are definitely 12147 samples, which is greater than 20 in most counting systems ;). Additionally, they're all floats, so it shouldn't be a problem with the data type. Anyone have any ideas?
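The deprecation warning above hints at the likely cause: dfX is being interpreted as a single 1-D sample rather than 12147 one-feature samples. A minimal sketch of the reshape the warning suggests, assuming dfX is a flat sequence of floats as shown in the output:

import numpy as np
from sklearn.cluster import KMeans

# dfX is assumed to be the flat, 1-D sequence of 12147 floats from the question.
X = np.asarray(dfX).reshape(-1, 1)  # one row per sample, one feature column

km = KMeans(n_clusters=20)
km.fit(X)                           # now n_samples=12147 >= n_clusters=20
print(km.cluster_centers_)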

Related

How to recover keras model weights from bytes?

I have extracted the weights of a specific model from a .pb file. It gives me all the weights packed into a variable in bytes format, as below:
weights = b'\n\x1b\n\t\x08\x01\x12\x05model\n\x0e\x08\x02\x12\nsignatures\n\xe2\x01\n\x18\x08\x03\x12\x14layer_with_weights-0\n\x0b\x08\x03\x12\x07layer-0\n\x0b\x08\x04\x12\x07layer-1\n\x18\x08\x05\x12\x14layer_with_weights-1\n\x0b\x08\x05\x12\x07layer-2\n\r\x08\x06\x12\tvariables\n\x17\x08\x07\x12\x13trainable_variables\n\x19\x08\x08\x12\x15regularization_losses\n\r\x08\t\x12\tkeras_api\n\x0e\x08\n\x12\nsignatures\n#\x08\x0b\x12\x1f_self_saveable_object_factories\n\x00\n\x92R\n\x0b\x08\x0c\x12\x07layer-0\n\x0b\x08\r\x12\x07layer-1\n\x18\x08\x0e\x12\x14layer_with_weights-0\n\x0b\x08\x0e\x12\x07layer-2\n\x0b\x08\x0f\x12\x07layer-3\n\x18\x08\x10\x12\x14layer_with_weights-1\n\x0b\x08\x10\x12\x07layer-4\n\x18\x08\x11\x12\x14layer_with_weights-2\n\x0b\x08\x11\x12\x07layer-5\n\x0b\x08\x12\x12\x07layer-6\n\x18\x08\x13\x12\x14layer_with_weights-3\n\x0b\x08\x13\x12\x07layer-7\n\x18\x08\x14\x12\x14layer_with_weights-4\n\x0b\x08\x14\x12\x07layer-8\n\x0b\x08\x15\x12\x07layer-9\n\x0c\x08\x16\x12\x08layer-10\n\x0c\x08\x17\x12\x08layer-11\n\x18\x08\x18\x12\x14layer_with_weights-5\n\x0c\x08\x18\x12\x08layer-12\n\x18\x08\x19\x12\x14layer_with_weights-6\n\x0c\x08\x19\x12\x08layer-13\n\x0c\x08\x1a\x12\x08layer-14\n\x18\x08\x1b\x12\x14layer_with_weights-7\n\x0c\x08\x1b\x12\x08layer-15\n\x18\x08\x1c\x12\x14layer_with_weights-8\n\x0c\x08\x1c\x12\x08layer-16\n\x18\x08\x1d\x12\x14layer_with_weights-9\n\x0c\x08\x1d\x12\x08layer-17\n\x19\x08\x1e\x12\x15layer_with_weights-10\n\x0c\x08\x1e\x12\x08layer-18\n\x0c\x08\x1f\x12\x08layer-19\n\x0c\x08 \x12\x08layer-20\n\x0c\x08!\x12\x08layer-21\n\x19\x08"\x12\x15layer_with_weights-11\n\x0c\x08"\x12\x08layer-22\n\x19\x08#\x12\x15layer_with_weights-12\n\x0c\x08#\x12\x08layer-23\n\x0c\x08$\x12\x08layer-24\n\x19\x08...
I have tried to convert it using array like this:
import array
arr = array.array('f', weights)
However, I get the following error:
Traceback (most recent call last):
File "/tmp/ipykernel_4441/2375324399.py", line 1, in <module>
arr = array.array('f', value)
ValueError: bytes length not a multiple of item size
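For context on that error: array.array('f') interprets the buffer as raw 32-bit floats, so the byte length must be a multiple of the 4-byte item size; the dump above also looks like a serialized object graph (layer names and structure) rather than raw float32 values, which is likely why the length doesn't line up. A minimal sketch illustrating the error, using struct-packed floats as stand-in data rather than the SavedModel bytes above:

import array
import struct

# Pack three 32-bit floats: 12 bytes, a multiple of the 4-byte item size.
good = struct.pack('3f', 1.0, 2.0, 3.0)
print(array.array('f', good))           # array('f', [1.0, 2.0, 3.0])

# A buffer whose length is not a multiple of 4 raises the same ValueError.
bad = good + b'\x00'                    # 13 bytes
try:
    array.array('f', bad)
except ValueError as e:
    print(e)                            # bytes length not a multiple of item size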

MemoryError: Unable to allocate GiB for an array with shape and data type float64 - on a sparse matrix

I am working with textual data and have a document-term matrix, represented in a scipy sparse matrix (for memory efficiency).
I have built a class in which I train a topic model (the outcome of the topic model is the matrix prob_word_given_topic).
Currently, I am doing some post analysis on different models, with the following code:
colnames = ['Model', 'Coherence','SVD_values','Min_c0','Max_c0','Min_c1','Max_c1','Min_sv0','Max_sv0','Min_sv1','Max_sv1', 'PWGT']
analysis_two_factors = pd.DataFrame(columns=colnames)
directory = 'C:~/Images/'

# Experiment with: singular values, number of topics, weighting methods
for i, top in enumerate(range(3, 28, 2)):
    for weighting_method in [2, 3, 4, 5, 1]:
        print(type(top))
        one_round = []
        model = FLSA(input_file=data_list,
                     num_topics=top,
                     num_words=20,
                     word_weighting=weighting_method,
                     svd_factors=2,
                     cluster_method='fcm')
        model.plot_svd_graph_2D(directory)
        model.plot_cluster_datapoints_graph(directory)
        one_round.append(model.setting)
        one_round.append(model.calc_coherence_value)
        one_round.append(model.s)
        one_round.append(min(model.cluster_centers[:,0]))
        one_round.append(max(model.cluster_centers[:,0]))
        one_round.append(min(model.cluster_centers[:,1]))
        one_round.append(min(model.cluster_centers[:,1]))
        one_round.append(min(model.svd_data[:,0]))
        one_round.append(max(model.svd_data[:,0]))
        one_round.append(min(model.svd_data[:,1]))
        one_round.append(min(model.svd_data[:,1]))
        one_round.append(model.prob_word_given_topic)
        analysis_two_factors.loc[i] = one_round
        print('Finished iteration', str(i))
However, while being in top = 19, I suddenly got the following error:
Traceback (most recent call last):
File "<ipython-input-687-fe7cf1e4ea7a>", line 15, in <module>
cluster_method='fcm')
File "<ipython-input-672-e9c098fb0e45>", line 92, in __init__
prob_word_given_doc = np.asarray(self.sparse_weighted_matrix / self.sparse_weighted_matrix.sum(1))
File "c:~\continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 620, in __truediv__
return self._divide(other, true_divide=True)
File "c:~\continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 599, in _divide
return np.true_divide(self.todense(), other)
MemoryError: Unable to allocate 2.87 GiB for an array with shape (4280, 90140) and data type float64
This surprises me, as all previous iterations in the loop completed, and self.sparse_weighted_matrix is a sparse matrix (dok_matrix), so I don't expect such high memory requirements here. Can somebody explain why I get this error? And what can I do to overcome the problem? The offending line is:
prob_word_given_doc = np.asarray(self.sparse_weighted_matrix / self.sparse_weighted_matrix.sum(1))
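For context: dividing a scipy sparse matrix by the dense np.matrix returned by .sum(1) falls back to todense(), which is where the 4280 x 90140 float64 allocation comes from. A minimal sketch of one common way to row-normalize while staying sparse, using a small random matrix as a stand-in for the sparse_weighted_matrix in the question:

import numpy as np
import scipy.sparse as sp

# Small stand-in for the document-term matrix from the question.
dtm = sp.random(5, 8, density=0.3, format='csr')

# Row sums as a flat array; guard against empty rows to avoid division by zero.
row_sums = np.asarray(dtm.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0

# Multiply by a sparse diagonal of 1/row_sum instead of dividing by a dense matrix.
prob_word_given_doc = sp.diags(1.0 / row_sums) @ dtm   # result stays sparse

print(type(prob_word_given_doc), prob_word_given_doc.shape)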

UFuncTypeError: ufunc 'gcd' did not contain a loop with signature matching types (dtype('float64'), dtype('float64')) -> dtype('float64')

I am trying to find the GCD to obtain the desired array length, so array_length = sub_array_length * number_sub_arrays.
I want the GCD of array_length and sub_array_length so I can determine number_sub_arrays.
I am using a Gaussian distribution to give me 100 potential sub_array_lengths, where hopefully one of them provides a large GCD value.
Here is my code so far (I added the values of lp_fill_size and pdx for clarity):
self.lp_fill_size = 7958
self.pdx = 8.296138303238362e-05
if self.lp_fill_size != 0:
    dx_sub_arr_size = np.round(np.random.normal(0.01, 0.003, 100) / self.pdx)
    num_sub_arr_gcd = np.gcd(dx_sub_arr_size, self.lp_fill_size)
The error I get is:
Traceback (most recent call last):
File "process_exp_nmr.py", line 157, in <module>
main()
File "process_exp_nmr.py", line 127, in main
dataFormatter.evaluateData()
File "/workspaces/NMars/NMars/DataStructures/MLFormatter/ExpMLFormatter.py", line 27, in evaluateData
self._evaluateProton()
File "/workspaces/NMars/NMars/DataStructures/MLFormatter/ExpMLFormatter.py", line 58, in _evaluateProton
num_sub_arr_gcd = np.gcd(dx_sub_arr_size, self.lp_fill_size)
numpy.core._exceptions.UFuncTypeError: ufunc 'gcd' did not contain a loop with signature matching types (dtype('float64'), dtype('float64')) -> dtype('float64')
Previous similar Stack Overflow questions suggest matching the data types. I have tried changing the types of both arguments to np.float64, lists, and integers. I keep getting the same error. I was originally using numpy 1.17 and updated to numpy 1.19.1 (with conda) and am still getting the same error.
I do not know what else to do. Any help would be greatly appreciated.
Here is the python code that got it to work:
import numpy as np

lp_fill_size = 7958
pdx = 8.296138303238362e-05
if lp_fill_size != 0:
    dx_sub_arr_size = np.round(np.random.normal(0.01, 0.003, 100) / pdx)
    num_sub_arr_gcd = np.gcd(dx_sub_arr_size.astype(np.int32), lp_fill_size)
It worked with casting to np.int32, and also with a regular Python int. np.gcd is only defined for integer dtypes, so the float64 array produced by np.round has to be cast before calling it.

Scikit-learn Incremental PCA - ValueError: array must not contain infs or NaNs

I'm trying to use IncrementalPCA from scikit-learn. I really need the incremental version of the algorithm because of the online nature of my application. My code couldn't really be simpler:
from sklearn.decomposition import IncrementalPCA
import pandas as pd

with open('C:/My/File/Path/file.csv', 'r') as fp:
    data = pd.read_csv(fp)

ipca = IncrementalPCA(n_components=4)
ipca.fit(data)
but this is how it finishes when launched:
C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:293: RuntimeWarning: overflow encountered in long_scalars
np.sqrt((self.n_samples_seen_ * n_samples) /
C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:293: RuntimeWarning: invalid value encountered in sqrt
np.sqrt((self.n_samples_seen_ * n_samples) /
Traceback (most recent call last):
File "C:/Users/myuser/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_9.py", line 6, in <module>
ipca.fit(data)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py", line 215, in fit
self.partial_fit(X_batch, check_input=False)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py", line 298, in partial_fit
U, S, V = linalg.svd(X, full_matrices=False)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\scipy\linalg\decomp_svd.py", line 106, in svd
a1 = _asarray_validated(a, check_finite=check_finite)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\scipy\_lib\_util.py", line 263, in _asarray_validated
a = toarray(a)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\numpy\lib\function_base.py", line 498, in asarray_chkfinite
raise ValueError(
ValueError: array must not contain infs or NaNs
Process finished with exit code 1
My data is 243 columns of only 0s and 1s. I already checked:
There is no NaN anywhere in my data
There is no inf anywhere in my data
I had scikit-learn v0.22.2.post1, I updated to 0.23.1, no difference
If I use PCA instead of IncrementalPCA leaving everything else the same, everything works fine, no warnings, no errors, all good
There were similar issues in previous versions, but they refer to versions around 0.16/0.17, most were with more complex code and all were fixed around those versions
If anyone could help me I would be most grateful
Edit:
My data, exactly as I feed them to the above code
https://drive.google.com/file/d/1JBIliADt9TViTk8qjnmIS3RFEO934dY6/view?usp=sharing
Edit 2:
Tried using both
data = pd.read_csv(fp, dtype = 'Int64')
and
data = pd.read_csv(fp, dtype = np.float64)
with no difference in results.
Edit 3:
Seems like the issue is related to the dataset size. If I fit on a smaller portion, everything works fine, until I get to around 1800000 rows; that's where the error starts showing.
I opened an issue with scikit-learn and they identified it quickly. This happens because numpy arrays default to int32 on Windows, which causes the RuntimeWarning at the top of the traceback and escalates into NaNs being passed to partial_fit(). I'm temporarily moving to Linux while waiting for it to be fixed.
I'm leaving this here for anyone having similar problems, so they can track its resolution in the future.
tl;dr: check the above link to see if the issue is resolved. If it is not, use a batch_size such that batch_size * n_samples < 2^31 - 1. If that's not possible for you, move to Linux.
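As a sketch of that workaround: IncrementalPCA accepts a batch_size argument, so the batch size can be capped explicitly. The value 1000 below is only an illustration chosen for a dataset of roughly 1.8 million rows (1000 * 1.8e6 is under 2^31 - 1, per the rule of thumb above); adjust it for your own row count.

from sklearn.decomposition import IncrementalPCA
import pandas as pd

data = pd.read_csv('C:/My/File/Path/file.csv')   # path taken from the question

# Cap the batch size so batch_size * n_samples stays below 2**31 - 1 on Windows.
ipca = IncrementalPCA(n_components=4, batch_size=1000)
ipca.fit(data.values)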
Something is wrong with your data.
Here is a 100% working example using some artificial data (n=2000000 and d=243).
To help more, upload a sample of your data that results in the error.
from sklearn.decomposition import IncrementalPCA
import pandas as pd, numpy as np
n=2000000
d=243
data = pd.DataFrame(np.ones((n,d)))
ipca = IncrementalPCA(n_components=4)
ipca.fit(data.values)

spark 1.6.1 python 3.5.1 building naive bayes classifier

My question is based upon this.
Would it be possible to provide more detailed comments/explanation of the code, starting at the line
tf = HashingTF().transform(training_raw.map(lambda doc: doc["text"], preservesPartitioning=True))
How could I print the confusion matrix?
What does the error below mean? How can I fix it? The model still gets built and I get predictions.
>>> # Train and check
... model = NaiveBayes.train(training)
[Stage 2:=============================> (2 + 2) / 4]16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
How could I print results for a new observation? I tried and failed:
>>> model.predict("love")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\classification.py", line 594, in predict
x = _convert_to_vector(x)
File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\linalg\__init__.py", line 77, in _convert_to_vector
raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'str'> into Vector
1. HashingTF in Spark is similar to scikit-learn's HashingVectorizer. training_raw is an RDD of text. For a detailed explanation of the available vectorizers in PySpark, see Vectorizers. For a complete example, see this post.
2. BLAS is the Basic Linear Algebra Subprograms library. You can check out this page on GitHub for a potential solution.
3. You are trying to use model.predict on a string ("love"). You must first convert the string to a vector. A simple example that takes a dense vector string and outputs a dense vector with a label is:
def parseLine(line):
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)
You are probably looking for a sparse vector. So try Vectors.sparse.
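Since the model in the question was trained on HashingTF features, a new document has to go through the same transformation before calling predict. A minimal sketch under that assumption (the HashingTF must use the same numFeatures as at training time, and model is the NaiveBayesModel built in the question):

from pyspark.mllib.feature import HashingTF

# Hash the new document into the same feature space used for training
# (default numFeatures is 2**20; it must match the training-time setting).
htf = HashingTF()
new_doc = "love"
features = htf.transform(new_doc.split())   # list of terms -> SparseVector

print(model.predict(features))              # model: the trained NaiveBayesModel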
