Problem in reshaping numpy array of large size - python-3.x

I have a numpy array with the following dimensions:
(1611216, 2)
I tried reshaping it to (804, 2004) using:
df = np.reshape(df, (804, 2004))
but it gives an error:
Traceback (most recent call last):
File "Z:/Seismic/Geophysical/99_Personal/Abhishake/RMS_Machine_learning/RMS_data_analysis.py", line 19, in <module>
df = np.reshape(df, (804, 2004))
File "C:\python36\lib\site-packages\numpy\core\fromnumeric.py", line 232, in reshape
return _wrapfunc(a, 'reshape', newshape, order=order)
File "C:\python36\lib\site-packages\numpy\core\fromnumeric.py", line 57, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 3222432 into shape (804,2004)

You cannot reshape a (1611216, 2) numpy array into (804, 2004).
A reshape must keep the total number of elements unchanged: 1611216 x 2 = 3,222,432, whereas 804 x 2004 = 1,611,216, which is exactly half as many. You have to come up with another set of dimensions for your numpy array, and which one depends on how you want to use the array.
Hint: (1608, 2004) will be a valid reshape, since 1608 x 2004 = 3,222,432.
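The size check described above can be sketched as follows. This is a minimal illustration, not the asker's actual data: the array is a zero-filled stand-in with the same shape, and can_reshape is a hypothetical helper name introduced only for this example.

```python
import numpy as np

# Stand-in for the asker's data: same shape, tiny dtype to keep memory low.
arr = np.zeros((1611216, 2), dtype=np.int8)


def can_reshape(array, shape):
    """Return True when `shape` holds exactly as many elements as `array`."""
    return array.size == int(np.prod(shape))


print(arr.size)                        # 3222432
print(can_reshape(arr, (804, 2004)))   # False
print(can_reshape(arr, (1608, 2004)))  # True

# A valid target shape works directly.
reshaped = arr.reshape(1608, 2004)

# Passing -1 lets numpy infer the one dimension that is left open.
inferred = arr.reshape(-1, 2004)
print(inferred.shape)                  # (1608, 2004)
```

Using -1 for one dimension is convenient when only the row length is known, because numpy raises the same ValueError if the total size is not divisible by it.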

Related

How to apply label encoding to text data(list of list)

I am a novice Python data analyst trying to preprocess text data (jsonl format) before it goes into a neural network for topic modelling (VAE). I was able to clean the data and turn it into a numpy array. Next I wanted to apply label encoding to the cleaned text data, but I failed to do so. How can one apply label encoding to data in list-of-lists format? The input to the label encoding is a list of lists, and the output has to be in the same format.
numpy array format (type: <class 'numpy.ndarray'>)
[array([1131, 713, 857, 1130..........])
array([ 142, 1346, 1918, 1893, 61, 62, 1922,967......]) ])
array([135, 148, 14, 104, 154, 159, 136, 94, 149, 135, 117, 62, 130....])
array([135, 148, 14, 104, 154, 159, 136......])...................................]
This is the code after cleaning (process_texts is a list of lists of strings):
import gensim
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
dictionary = gensim.corpora.Dictionary(process_texts) # create a dictionary
label_covid_data =[list(filter(lambda x: x != -1, dictionary.doc2idx(doc))) for doc in process_texts] # convert tokens to numeric ids according to the dictionary
covid_train_data,covid_test_data = train_test_split(label_covid_data, test_size=0.2, random_state = 3456) # split into train and test data
covid_train_narray = np.array([np.array(i) for i in covid_train_data]) # convert to numpy array format
label = preprocessing.LabelEncoder() # apply label encoding
covid_data_labels = label.fit_transform([label.fit_transform(i) for i in covid_train_narray])
The error I am getting:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
111 try:
--> 112 res = _encode_python(values, uniques, encode)
113 except TypeError:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
59 if uniques is None:
---> 60 uniques = sorted(set(values))
61 uniques = np.array(uniques, dtype=values.dtype)
TypeError: unhashable type: 'numpy.ndarray'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-217-ebce4e37aad8> in <module>
4 label = preprocessing.LabelEncoder()
5 #movie_line_labels = label.fit_transform(covid_train_narray[0])
----> 6 covid_data_labels = label.fit_transform([label.fit_transform(i) for i in covid_train_narray])
7 covid_data_labels
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in fit_transform(self, y)
250 """
251 y = column_or_1d(y, warn=True)
--> 252 self.classes_, y = _encode(y, encode=True)
253 return y
254
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
112 res = _encode_python(values, uniques, encode)
113 except TypeError:
--> 114 raise TypeError("argument must be a string or number")
115 return res
116 else:
TypeError: argument must be a string or number
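The traceback shows the outer fit_transform receiving a list of arrays, whereas a label encoder works on one flat, 1-D sequence of values. One possible approach, sketched here under assumptions, is to flatten all documents, build a single mapping, and then cut the encoded vector back into per-document lists. The sketch uses np.unique as a numpy-only stand-in for LabelEncoder (both map values to the index of the value in the sorted set of unique values); docs is hypothetical sample data, not the asker's corpus.

```python
import numpy as np

# Hypothetical stand-in for covid_train_data: ragged lists of token ids.
docs = [[1131, 713, 857], [142, 1346, 713, 61], [135, 148, 1131]]

# Remember each document's length so the flat result can be split again.
lengths = [len(doc) for doc in docs]
flat = np.concatenate([np.asarray(doc) for doc in docs])

# classes: sorted unique ids; codes: index of every token within classes.
classes, codes = np.unique(flat, return_inverse=True)

# Cut the flat code vector back into one list per document.
split_points = np.cumsum(lengths)[:-1]
encoded_docs = [part.tolist() for part in np.split(codes, split_points)]

print(encoded_docs)  # [[6, 4, 5], [2, 7, 4, 0], [1, 3, 6]]
```

With scikit-learn, the analogous pattern would be to call fit once on the flattened array and then transform each document separately, so that every document shares the same mapping.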

Why am I getting ValueError: too many values to unpack (expected 2)?

My df 'td' is time series data with months as the index and one column holding a performance value between 0 and 1. I'm trying to run a stationarity test:
def stationarity_test(timeseries):
    """Augmented Dickey-Fuller Test
    Test for Stationarity"""
    from statsmodels.tsa.stattools import adfuller
    print("Results of Dickey-Fuller Test:")
    df_test = adfuller(timeseries, autolag="AIC")
    df_output = pd.Series(df_test[0:4],
                          index=["Test Statistic", "p-value", "#Lags Used",
                                 "Number of Observations Used"])
    print(df_output)

stationarity_test(td)
Error received:
Results of Dickey-Fuller Test:
Traceback (most recent call last):
File "<ipython-input-143-69a68080d36a>", line 12, in <module>
stationarity_test(td)
File "<ipython-input-143-69a68080d36a>", line 6, in stationarity_test
df_test = adfuller(timeseries, autolag = "AIC")
File "C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\stattools.py", line 221, in adfuller
xdall = lagmat(xdiff[:, None], maxlag, trim='both', original='in')
File "C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py", line 397, in lagmat
nobs, nvar = xa.shape
ValueError: too many values to unpack (expected 2)
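The last frames of the traceback show an extra axis being added with [:, None] and the resulting shape being unpacked into exactly two names. Assuming td is a one-column DataFrame (so its values are already 2-D), that extra axis makes the array 3-D and the unpacking fails. The sketch below reproduces only that shape logic with numpy; column is a hypothetical stand-in for td.values.

```python
import numpy as np

# Hypothetical stand-in for td.values: one column, 24 monthly values in [0, 1].
column = np.linspace(0.1, 0.9, 24).reshape(-1, 1)  # shape (24, 1)

# Adding an axis to 2-D data gives shape (24, 1, 1): three values, two names.
try:
    nobs, nvar = column[:, None].shape
    message = None
except ValueError as exc:
    message = str(exc)
print(message)  # begins with "too many values to unpack"

# A 1-D series gains exactly one axis, so the same unpacking succeeds.
series = column.ravel()                            # shape (24,)
nobs, nvar = series[:, None].shape
print(nobs, nvar)                                  # 24 1
```

If that assumption holds, passing a single column instead of the whole frame, for example stationarity_test(td.iloc[:, 0]) or stationarity_test(td.squeeze()), would give adfuller the 1-D input it expects.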

Getting 5 random crops - TypeError: pic should be PIL Image or ndarray. Got <type 'tuple'>

I apply the transformations below to images; this works with RandomCrop. The code is from this dataloader script: https://github.com/jeffreyhuang1/two-stream-action-recognition/blob/master/dataloader/motion_dataloader.py
def train(self):
    training_set = motion_dataset(dic=self.dic_video_train, in_channel=self.in_channel, root_dir=self.data_path,
                                  mode='train',
                                  transform=transforms.Compose([
                                      transforms.Resize([256, 256]),
                                      transforms.FiveCrop([224, 224]),
                                      #transforms.RandomCrop([224, 224]),
                                      transforms.ToTensor(),
                                      #transforms.Normalize([0.5], [0.5])
                                  ]))
    print '==> Training data :', len(training_set), ' videos', training_set[1][0].size()
    train_loader = DataLoader(
        dataset=training_set,
        batch_size=self.BATCH_SIZE,
        shuffle=True,
        num_workers=self.num_workers,
        pin_memory=True
    )
    return train_loader
But when I try to get five crops with FiveCrop, I get this error:
Traceback (most recent call last):
File "motion_cnn.py", line 267, in <module>
main()
File "motion_cnn.py", line 51, in main
train_loader,test_loader, test_video = data_loader.run()
File "/media/d/DATA_2/two-stream-action-recognition-master/dataloader/motion_dataloader.py", line 120, in run
train_loader = self.train()
File "/media/d/DATA_2/two-stream-action-recognition-master/dataloader/motion_dataloader.py", line 156, in train
print '==> Training data :',len(training_set),' videos',training_set[1][0].size()
File "/media/d/DATA_2/two-stream-action-recognition-master/dataloader/motion_dataloader.py", line 77, in __getitem__
data = self.stackopf()
File "/media/d/DATA_2/two-stream-action-recognition-master/dataloader/motion_dataloader.py", line 51, in stackopf
H = self.transform(imgH)
File "/media/d/DATA_2/two-stream-action-recognition-master/venv/local/lib/python2.7/site-packages/torchvision/transforms/transforms.py", line 60, in __call__
img = t(img)
File "/media/d/DATA_2/two-stream-action-recognition-master/venv/local/lib/python2.7/site-packages/torchvision/transforms/transforms.py", line 91, in __call__
return F.to_tensor(pic)
File "/media/d/DATA_2/two-stream-action-recognition-master/venv/local/lib/python2.7/site-packages/torchvision/transforms/functional.py", line 50, in to_tensor
raise TypeError('pic should be PIL Image or ndarray. Got {}'.format(type(pic)))
TypeError: pic should be PIL Image or ndarray. Got <type 'tuple'>
Since FiveCrop returns a tuple of images instead of a single PIL image, I use Lambda to handle the tuple, but then I get this error at line 55, in stackopf:
flow[2*(j),:,:] = H
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[224, 224]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (4)
When I try to set flow = torch.FloatTensor(5, 2*self.in_channel,self.img_rows,self.img_cols), I get this error in motion_dataloader.py, line 55, in stackopf:
flow[:,2*(j),:,:] = H
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[5, 224, 224]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (4)
When I multiply the train batch size by 5 to account for the five crops that are returned, I get the same error.
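The second error message says that H has four dimensions, [5, 1, 224, 224], while the slice flow[:, 2*j, :, :] has only three, [5, 224, 224]. A possible way around it, sketched here with numpy arrays as stand-ins for the torch tensors so the example runs without PyTorch, is to drop the singleton channel axis of H before assigning. The value of in_channel is a hypothetical small number chosen for illustration.

```python
import numpy as np

in_channel, img_rows, img_cols = 2, 224, 224  # in_channel is hypothetical
n_crops = 5

# Stand-in for H after stacking the five crops: (crops, channel, rows, cols).
H = np.ones((n_crops, 1, img_rows, img_cols), dtype=np.float32)

# One stacked-flow volume per crop.
flow = np.zeros((n_crops, 2 * in_channel, img_rows, img_cols), dtype=np.float32)

j = 0
# The target slice is 3-D (5, 224, 224), so drop H's singleton channel axis.
flow[:, 2 * j, :, :] = H[:, 0, :, :]

print(flow[:, 2 * j].shape)  # (5, 224, 224)
```

In PyTorch the same indexing H[:, 0] works on tensors, and H.squeeze(1) is an equivalent way to remove that axis. Note that the dataset then returns one extra leading crop dimension per sample, which the training loop would have to fold into the batch dimension.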

ValueError after MinMaxScaler and Transform

I am having difficulty with the following code, which raises a ValueError. (I have tried solutions found online, but to no avail.)
Here is my original code, which fails with a string-to-float conversion error
(ValueError: could not convert string to float: '3,1,0,0,0,1,0,1,89874,49.99'):
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
training_data_df = pd.read_csv('./data/sales_data_training.csv')
scaler = MinMaxScaler(feature_range=(0,1))
scaled_training= scaler.fit_transform(training_data_df)
scaled_training_df = pd.DataFrame(scaled_training,columns= training_data_df.columns.values)
My CSV Data:
"critic_rating,is_action,is_exclusive_to_us,is_portable,is_role_playing,is_sequel,is_sports,suitable_for_kids,total_earnings,unit_price"
"3.5,1,0,1,0,1,0,0,132717,59.99"
"4.5,0,0,0,0,1,1,0,83407,49.99"...
'3,1,0,0,0,1,0,1,89874,49.99'
I have 9 columns of data across 1000 rows (~9999 data, with first row being the header).
Regards,
Yuki
The full error is as follows:
Traceback (most recent call last):
File "C:/Users/YukiKawaii/PycharmProjects/PandasTest/module2_NN/test.py", line 6, in <module>
scaled_training= scaler.fit_transform(training_data_df)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\base.py", line 517, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\preprocessing\data.py", line 308, in fit
return self.partial_fit(X, y)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\preprocessing\data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '3,1,0,0,0,1,0,1,89874,49.99'
You should remove the "" and '' wrapped around each line in the csv file.
By default pd.read_csv() treats text wrapped in double quotes as a single field instead of splitting it at the commas, so each row is read as one string such as '3,1,0,0,0,1,0,1,89874,49.99', which cannot be converted to a float.
So the csv file should look as follows:
critic_rating,is_action,is_exclusive_to_us,is_portable,is_role_playing,is_sequel,is_sports,suitable_for_kids,total_earnings,unit_price
3.5,1,0,1,0,1,0,0,132717,59.99
4.5,0,0,0,0,1,1,0,83407,49.99
3,1,0,0,0,1,0,1,89874,49.99
I just verified by running your code after making the above change.
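If editing the file by hand is impractical, the quotes can also be stripped in code before parsing. The following is a minimal sketch using only the standard library; raw_lines is a hypothetical in-memory copy of the rows quoted in the question rather than the real file.

```python
import csv
import io

# Hypothetical in-memory copy of the quoted rows shown in the question.
raw_lines = [
    '"critic_rating,is_action,is_exclusive_to_us,is_portable,is_role_playing,'
    'is_sequel,is_sports,suitable_for_kids,total_earnings,unit_price"',
    '"3.5,1,0,1,0,1,0,0,132717,59.99"',
    '"4.5,0,0,0,0,1,1,0,83407,49.99"',
    "'3,1,0,0,0,1,0,1,89874,49.99'",
]

# Strip surrounding whitespace and any wrapping double or single quotes.
cleaned = "\n".join(line.strip().strip("\"'") for line in raw_lines)

# With the quotes gone, every line splits into separate fields at the commas.
rows = list(csv.reader(io.StringIO(cleaned)))
header = rows[0]
records = [[float(value) for value in row] for row in rows[1:]]

print(len(header))   # 10
print(records[2])    # [3.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 89874.0, 49.99]
```

The cleaned text can also be handed to pandas directly with pd.read_csv(io.StringIO(cleaned)), after which the MinMaxScaler code from the question receives numeric columns.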

Tensorflow Greater Operator Giving an Error

I am stuck on a simple-looking problem in TensorFlow.
Traceback (most recent call last):
File op_def_library.py, line 510, in _apply_op_helper
preferred_dtype=default_dtype)
File ops.py, line 1040, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File ops.py, line 883, in _TensorTensorConversionFunction
(dtype.name, t.dtype.name, str(t)))
ValueError: Tensor conversion requested dtype int64 for Tensor with dtype float32: 'Tensor("sequence_sparse_softmax_cross_entropy/zeros_like:0", shape=(?, ?, 10004), dtype=float32)'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "red.py", line 281, in <module>
main()
File "red.py", line 99, in main
sequence_length=lengths)
File loss.py, line 225, in sequence_sparse_softmax_cross_entropy
losses = xloss(labels=labels, logits=logits)
File loss.py, line 48, in loss
post = array_ops.where(target_tensor > zeros, target_tensor - sigmoid_p, zeros)
gen_math_ops.py, line 2924, in greater
"Greater", x=x, y=y, name=name)
op_def_library.py, line 546, in _apply_op_helper
inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'Greater' Op has type float32 that does not match type int64 of argument 'x'
Using astype also does not work.
I defined my own loss function and tried to use it. What should I do to make it work? I just want to define a function that takes tensors as input, just like the TensorFlow cross-entropy function. Please suggest how to do that.
In particular, how can I resolve this error?
