OneHotEncoder failing after combining dataframes - python-3.x

I have a model that runs successfully.
When I tried to predict with it, it failed because, after one-hot encoding, the test set had more columns than the training set.
After some reading I found that I need to concatenate the two DataFrames first, one-hot encode, then split them apart again.
Added a 'temp' column to the train data set with value 'train'.
Added a 'temp' column to the test data set with value 'test'.
This is so that I can split the df apart later using boolean indexing like this:
X = temp_df[temp_df['temp'] == 'train']
X2 = temp_df[temp_df['temp'] == 'test']
Vertically concatenated the two DataFrames.
Verified the shape of the new combined DataFrame.
Changed all columns to type 'category' except 'temp', which stays as object (a minimal sketch of that conversion follows the dtype listing below):
basin category
region category
lga category
extraction_type_class category
management category
quality_group category
quantity category
source category
waterpoint_type category
cluster category
temp object
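A minimal sketch of how that conversion might look, assuming the DataFrame is named temp_df as in the later snippets:
# Convert every column except 'temp' to the pandas 'category' dtype
for col in temp_df.columns.drop('temp'):
    temp_df[col] = temp_df[col].astype('category')
print(temp_df.dtypes)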
Now I am simply trying to one-hot encode like I did before. I select only the categorical columns:
cat_ix = temp_df.select_dtypes(include=['category']).columns
And I try to apply with:
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
temp_df = ct.fit_transform(temp_df)
It fails on the temp_df = ct.fit_transform(temp_df) line.
These identical steps worked perfectly before I added the temp column and concatenated the two DataFrames.
The exact error:
Traceback (most recent call last):
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 778, in _hstack
converted_Xs = [
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 779, in <listcomp>
check_array(X, accept_sparse=True, force_all_finite=False)
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\validation.py", line 738, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'train'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 783, in _hstack
raise ValueError(
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
Why is it complaining about 'train'? That is in the 'temp' column which is being excluded.

Note that the traceback doesn't reference OneHotEncoder at all; it is entirely about the ColumnTransformer. You are passing through the temp column, which gets stacked onto the one-hot-encoded sparse matrix in the _hstack method, and the second error message is the more relevant one: a string-typed array cannot be stacked onto a numeric sparse array (which is what triggers the first error message).
If the sparse matrix isn't too large, you can force a dense output by setting sparse_threshold=0 on the ColumnTransformer or sparse=False on the OneHotEncoder. If it is too large for memory (or you'd prefer to keep sparse matrices), you could use a 0/1 indicator for the train/test split instead of the strings "train" and "test".
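A minimal sketch of both options, reusing temp_df and cat_ix from the question:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Option 1: force a dense result so the string 'temp' column can be stacked
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)],
                       remainder='passthrough',
                       sparse_threshold=0)
encoded = ct.fit_transform(temp_df)

# Option 2: keep the output sparse by replacing the string marker
# with a numeric 0/1 indicator before transforming
temp_df['temp'] = (temp_df['temp'] == 'train').astype(int)
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
encoded_sparse = ct.fit_transform(temp_df)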

Related

MemoryError: Unable to allocate GiB for an array with shape and data type float64 - on a sparse matrix

I am working with textual data and have a document-term matrix, represented in a scipy sparse matrix (for memory efficiency).
I have built a class in which I train a topic model (the outcome of the topic model is the matrix prob_word_given_topic).
Currently, I am doing some post analysis on different models, with the following code:
colnames = ['Model', 'Coherence', 'SVD_values', 'Min_c0', 'Max_c0', 'Min_c1', 'Max_c1', 'Min_sv0', 'Max_sv0', 'Min_sv1', 'Max_sv1', 'PWGT']
analysis_two_factors = pd.DataFrame(columns=colnames)
directory = 'C:~/Images/'

# Experiment with: singular values, number of topics, weighting methods
for i, top in enumerate(range(3, 28, 2)):
    for weighting_method in [2, 3, 4, 5, 1]:
        print(type(top))
        one_round = []
        model = FLSA(input_file=data_list,
                     num_topics=top,
                     num_words=20,
                     word_weighting=weighting_method,
                     svd_factors=2,
                     cluster_method='fcm')
        model.plot_svd_graph_2D(directory)
        model.plot_cluster_datapoints_graph(directory)
        one_round.append(model.setting)
        one_round.append(model.calc_coherence_value)
        one_round.append(model.s)
        one_round.append(min(model.cluster_centers[:,0]))
        one_round.append(max(model.cluster_centers[:,0]))
        one_round.append(min(model.cluster_centers[:,1]))
        one_round.append(min(model.cluster_centers[:,1]))
        one_round.append(min(model.svd_data[:,0]))
        one_round.append(max(model.svd_data[:,0]))
        one_round.append(min(model.svd_data[:,1]))
        one_round.append(min(model.svd_data[:,1]))
        one_round.append(model.prob_word_given_topic)
        analysis_two_factors.loc[i] = one_round
        print('Finished iteration', str(i))
However, at top = 19, I suddenly got the following error:
Traceback (most recent call last):
File "<ipython-input-687-fe7cf1e4ea7a>", line 15, in <module>
cluster_method='fcm')
File "<ipython-input-672-e9c098fb0e45>", line 92, in __init__
prob_word_given_doc = np.asarray(self.sparse_weighted_matrix / self.sparse_weighted_matrix.sum(1))
File "c:~\continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 620, in __truediv__
return self._divide(other, true_divide=True)
File "c:~\continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 599, in _divide
return np.true_divide(self.todense(), other)
MemoryError: Unable to allocate 2.87 GiB for an array with shape (4280, 90140) and data type float64
This surprises me, as all previous iterations in the loop completed fine and self.sparse_weighted_matrix is a sparse matrix (dok_matrix), so I don't expect such high memory requirements here. Can somebody explain why I get this error? And what can I do to overcome the problem? The offending line is:
prob_word_given_doc = np.asarray(self.sparse_weighted_matrix / self.sparse_weighted_matrix.sum(1))
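The traceback hints at the cause: scipy's sparse division falls back to self.todense() before dividing, which materialises the full (4280, 90140) matrix in memory. A minimal sketch of a row normalisation that stays sparse, with illustrative names standing in for the question's attributes:
import numpy as np
from scipy.sparse import csr_matrix, diags

# `weighted` stands in for self.sparse_weighted_matrix
weighted = csr_matrix(np.random.rand(5, 10))

row_sums = np.asarray(weighted.sum(axis=1)).ravel()     # dense vector of row totals
row_sums[row_sums == 0] = 1.0                           # guard against division by zero
prob_word_given_doc = diags(1.0 / row_sums) @ weighted  # result is still sparse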

problem saving pre-trained fasttext vectors in "word2vec" format with _save_word2vec_format()

For a list of words I want to get their fasttext vectors and save them to a file in the same "word2vec" .txt format (word+space+vector in txt format).
This is what I did:
dict = open("word_list.txt", "r")  # the list of words I have
path = "cc.en.300.bin"
model = load_facebook_model(path)
vectors = []
words = []
for word in dict:
    vectors.append(model[word])
    words.append(word)
vectors_array = np.array(vectors)
I want to take the list "words" and the ndarray "vectors_array" and save them in the original .txt format.
I tried to use the gensim function _save_word2vec_format:
def _save_word2vec_format(fname, vocab, vectors, fvocab=None, binary=False, total_vec=None):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    fvocab : str, optional
        File path used to save the vocabulary.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).

    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    if fvocab is not None:
        logger.info("storing vocabulary in %s", fvocab)
        with utils.open(fvocab, 'wb') as vout:
            for word, vocab_ in sorted(iteritems(vocab), key=lambda item: -item[1].count):
                vout.write(utils.to_utf8("%s %s\n" % (word, vocab_.count)))
    logger.info("storing %sx%s projection weights into %s", total_vec, vector_size, fname)
    assert (len(vocab), vector_size) == vectors.shape
    with utils.open(fname, 'wb') as fout:
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, vocab_ in sorted(iteritems(vocab), key=lambda item: -item[1].count):
            row = vectors[vocab_.index]
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tostring())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
but I get the error:
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from cc.en.300.bin
trials.py:42: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
vectors.append(model[word])
INFO:__main__:storing 8664x300 projection weights into arrays_to_txt_oct3.txt
loading the model for: en
finish loading the model for: en
len(vectors): 8664
len(words): 8664
shape of vectors_array (8664, 300)
mission launched!
Traceback (most recent call last):
File "trials.py", line 102, in <module>
_save_word2vec_format(YOUR_VEC_FILE_PATH, words, vectors_array, fvocab=None, binary=False, total_vec=None)
File "trials.py", line 89, in _save_word2vec_format
for word, vocab_ in sorted(iteritems(vocab), key=lambda item: -item[1].count):
File "/cs/snapless/oabend/tailin/transdiv/lib/python3.7/site-packages/six.py", line 589, in iteritems
return iter(d.items(**kw))
AttributeError: 'list' object has no attribute 'items'
I understand that it has to do with the second argument of the function, but I don't understand how I should turn a list of words into a dictionary object.
I tried doing that with:
#convert list of words into a dictionary
words_dict = {i:x for i,x in enumerate(words)}
But still got the error message:
Traceback (most recent call last):
File "trials.py", line 99, in <module>
_save_word2vec_format(YOUR_VEC_FILE_PATH, dict, vectors_array, fvocab=None, binary=False, total_vec=None)
File "trials.py", line 77, in _save_word2vec_format
total_vec = len(vocab)
TypeError: object of type '_io.TextIOWrapper' has no len()
I don't understand how to insert the word list in the right format...
You can directly import & re-use the Gensim KeyedVectors class to assemble your own (sub)set of word-vectors as one instance of KeyedVectors, then use its .save_word2vec_format() method.
For example, roughly this should work:
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_model

words_file = open("word_list.txt", "r")              # your word-list as a text file
words_list = [line.strip() for line in words_file]   # read each line, dropping the trailing newline

fasttext_path = "cc.en.300.bin"
model = load_facebook_model(fasttext_path)

kv = KeyedVectors(vector_size=model.wv.vector_size)  # new empty KV object

vectors = []
for word in words_list:
    vectors.append(model.wv[word])                   # vectors for words_list, in same order

kv.add(words_list, vectors)                          # adds those keys (words) & vectors as a batch
kv.save_word2vec_format('my_kv.vec', binary=False)
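As a quick sanity check, the saved file can be read back with the standard loader (file name as above):
from gensim.models import KeyedVectors

reloaded = KeyedVectors.load_word2vec_format('my_kv.vec', binary=False)
print(reloaded.vector_size)  # should be 300 for vectors taken from cc.en.300.bin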

Combining a list of Dataframes

I have a folder with several .csv-files. Each contains data on Time, High, Low, Open, Volumefrom, Volumeto, Close of a cryptocurrency.
I managed to load the .csv files into a list of dataframes and drop the columns Open, High, Low, Volumefrom, Volumeto, which I don't need, leaving me with Time and Close for each dataframe.
Now I want to combine the list of dataframes into one dataframe, where the index starts with the timestamp of the youngest coin, which would be iota in this example.
This is the code I wrote so far:
import pandas as pd
import os

# Path to my folder
PATH_COINS = r"C:\Users\...\Coins"

# creating a path for each of the .csv-files and saving it into a list
namelist = [name for name in os.listdir(PATH_COINS)]
path_lists = [os.path.join(PATH_COINS, path) for path in namelist]

# creating the dataframes and saving them into a list
dfs = [pd.read_csv(k, index_col=0) for k in path_lists]

# dropping unwanted columns
for num, i in enumerate(dfs):
    i.drop(columns=["Open", "High", "Low", "Volumefrom", "Volumeto"], inplace=True)

# combining the list of dataframes into one dataframe
pd.concat(dfs, join="inner", axis=1)
However I am getting an error message and can't figure out how to achieve my goal:
Traceback (most recent call last):
  File "C:/Users/Jonas/PycharmProjects/Pandas/main.py", line 16, in <module>
    pd.concat(dfs, join="inner", axis=1)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 226, in concat
    return op.get_result()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 423, in get_result
    copy=self.copy)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 5425, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3282, in __init__
    self._verify_integrity()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3493, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 4843, in construction_error
    passed, implied))
ValueError: Shape of passed values is (5, 8514), indices imply (5, 8490)
The join="inner" should work.
Check for duplicate index values, as concat doesn't know how to align duplicate indexes across multiple DataFrames (check with df.index.is_unique).
Removing the duplicate index values (e.g., df.drop_duplicates(inplace=True)) or one of the methods here should resolve it, as sketched below.
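A minimal sketch of de-duplicating each index before concatenating, reusing the dfs list from the question (keep='first' is an assumption about which duplicate row to keep):
import pandas as pd

# Report which DataFrames carry duplicate index values
for num, df in enumerate(dfs):
    if not df.index.is_unique:
        print(f"DataFrame {num} has duplicate index values")

# Keep only the first row for each duplicated index value, then concatenate
dfs = [df[~df.index.duplicated(keep='first')] for df in dfs]
combined = pd.concat(dfs, join="inner", axis=1)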

TypeError: 'Data argument can't be an iterator...'

I'm trying to use the zip function to bring together the column names and np.transpose to bring together the coefficients of the log_model I created.
My code:
# Create LogisticRegression model object
log_model = LogisticRegression()
# Fit our data into that object
log_model.fit(X,Y)
# Check your accuracy
log_model.score(X,Y)
This code worked just fine as I was able to check the accuracy of my model.
However, the following code is where I get my error.
Erroneous code:
coeff_df = DataFrame(zip(X.columns,np.transpose(log_model.coef_)))
Error message:
TypeError Traceback (most recent call last)
<ipython-input-147-a4e0ad234518> in <module>()
1 # Use zip to bring the column names and the np.transpose function to bring together the coefficients from the model
----> 2 coeff_df = DataFrame(zip(X.columns,np.transpose(log_model.coef_)))
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
387 mgr = self._init_dict({}, index, columns, dtype=dtype)
388 elif isinstance(data, collections.Iterator):
--> 389 raise TypeError("data argument can't be an iterator")
390 else:
391 try:
TypeError: data argument can't be an iterator
What am I doing wrong? Sorry, newbie here. I'm following a Udemy data visualization with Python tutorial. My lecturer is using Python 2, but I've been able to manage with Python 3 by making and researching the conversions needed to keep my code working. Any suggestions would be greatly appreciated.
For binary classification, only one linear model is used. In the case of multi-class classification, there will be one linear model for each class. Assuming your X is a pd.DataFrame, you can proceed as follows:
output = pd.DataFrame(my_model.coef_, columns=X.columns)
Rows represent linear models for different classes, columns represent coefficients.
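Alternatively, the original zip-based line can be kept: pandas refuses a raw iterator as the data argument, so materialising it with list() is enough. A sketch reusing the question's names (the column labels are illustrative):
import numpy as np
import pandas as pd

# zip() returns an iterator in Python 3; wrap it in list() before handing it to DataFrame
coeff_df = pd.DataFrame(list(zip(X.columns, np.transpose(log_model.coef_))),
                        columns=['Feature', 'Coefficient'])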

slicing error in numpy array

I am trying to run the following code
fs = 1000
data = np.loadtxt("trainingdataset.txt", delimiter=",")
data1 = data[:,2]
data2 = data1.astype(int)
X,Y = data2['521']
but it gives me the following error:
Traceback (most recent call last):
File "C:\Users\hadeer.elziaat\Desktop\testspec.py", line 58, in <module>
X,Y = data2['521']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
My dataset:
1,4,6,10
2,100,125,10
3,100,7216,254
4,100,527,263
5,100,954,13
6,100,954,23
You're using the string '521' rather than the number 521 for indexing. Try X,Y = data2[521] instead.
If you are only given the string, you could cast it to an int first: X,Y = data2[int('521')], but this might result in some errors and/or unexpected behaviour.
Next problem: you require two variables, one for X and one for Y, yet the data2[521] selection only provides a single value (the number in the 3rd column, 522nd row).
You say you want all the data in the 3rd column.
I assume you also want some kind of x-axis, since you are attempting to do X, Y = .... How about using the first column for that? Then your code would be:
import numpy as np
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
What remains unclear from your question is why you tried to index your data with the string '521', which failed because you cannot use strings as indices on plain NumPy arrays.
