ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)

I am working on the housing dataset, and when fitting a linear regression model and predicting with it I get the error shown below. The complete code follows.
I am not sure where the code goes wrong. I copied it as-is from the reference book.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
ERROR: ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)
What am I doing wrong here?

Explanation
Hi, I guess you are following the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. I ran into the same problem.
In the following part of the code you select the first 5 instances of the data set. One of the attributes, ocean_proximity, has dtype object, and for the linear regression model to work with it, it must be converted to numbers, which the book does with one-hot encoding.
One-hot encoding works by collecting all the categories the attribute can take, in this case 5 ('<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'), and then creating a vector of that length for each instance, with zeros everywhere except at the position of that instance's category, which is set to 1 (or another value). For example:
If ocean_proximity equals '<1H OCEAN' the conversion would be [1, 0, 0, 0, 0]
In this piece of code you select the first five instances of the data set, but that does not guarantee that every category of ocean_proximity appears among them. It could happen that only 3 of them appear, or just 1. Therefore, if the one-hot encoding is derived from those five rows alone and only 3 categories appear (for example just 'INLAND', 'ISLAND' and 'NEAR BAY'), the vectors it creates will have length 3.
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
The error is telling you just that: since the one-hot conversion of some_data produced vectors shorter than 5, some_data_prepared has 14 columns in total, fewer than the 16 columns of housing_prepared, so the model cannot predict the prices.
If you convert both some_data_prepared and housing_prepared into DataFrames and then call .head(), you will see the problem.
pd.DataFrame(some_data_prepared).head()
pd.DataFrame(housing_prepared).head()
Solution
To solve the problem, you must create the columns missing from some_data_prepared: build a zero-filled NumPy array of shape (5, x), where 5 is the number of rows and x the number of missing columns, and concatenate it to some_data_prepared so that it matches the number of columns in housing_prepared.
import numpy as np
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.fit_transform(some_data)
# number of columns missing relative to the training matrix
n_missing = housing_prepared.shape[1] - some_data_prepared.shape[1]
dummy_array = np.zeros((5, n_missing))
some_data_prepared = np.c_[some_data_prepared, dummy_array]
predictions = lin_reg.predict(some_data_prepared)
print("Predictions: ", predictions)
print("Labels: ", some_labels.values)

The issue is that some category values (of ocean_proximity in this case) present in housing_prepared are missing from some_data.
housing_prepared.shape gives (16512, 16), but some_data_prepared.shape gives (5, 14), so add zeros for the missing columns:
dummy_array = np.zeros((5,2))
some_data_prepared = np.c_[some_data_prepared,dummy_array]
The 2 in np.zeros is the difference in the number of columns (16 - 14).

I initially ran into the same issue with this piece of code. After exploring the issues of the handson-ml repository, I think I have understood the subtlety that causes the error here.
My guess is that (as in my case) closing the notebook caused what was in memory (the trained model in particular) to be lost. In my case, I could get the result and avoid the error by rerunning the notebook from the beginning.
More fundamentally, you should never call fit() or fit_transform() on data that is not training data (e.g. on some_data). Here, running fit_transform(some_data) and then stacking the dummy array onto some_data_prepared works, but it refits the preprocessing pipeline on some_data instead of reusing what was learned from the training set, which is not what you want.
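To illustrate the point about fit_transform() versus transform(), here is a minimal sketch with a made-up toy column (the data and variable names are assumptions for illustration, not the book's pipeline). An encoder fitted on the full training data keeps all five category columns when it transforms a five-row subset, while a fresh encoder fitted on the subset only produces columns for the categories it happens to see.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy stand-in for the training data (one categorical column)
train = pd.DataFrame({"ocean_proximity": [
    "NEAR BAY", "NEAR BAY", "INLAND", "NEAR BAY", "INLAND",
    "<1H OCEAN", "NEAR OCEAN", "ISLAND", "INLAND", "<1H OCEAN",
]})
some_data = train.iloc[:5]  # only 'NEAR BAY' and 'INLAND' appear here

# Fit once on the training data, then only transform the subset
encoder = OneHotEncoder()
encoder.fit(train)
encoded_with_training_fit = encoder.transform(some_data)

# Refitting on the subset learns only the categories present in it
encoded_with_refit = OneHotEncoder().fit_transform(some_data)

print(encoded_with_training_fit.shape)  # one column per training category
print(encoded_with_refit.shape)         # one column per category in the subset
```

In the book's setting this corresponds to calling full_pipeline.transform(some_data) with a pipeline that was fitted on the training set in the same session.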

Related

ValueError: Found array with dim 3. Estimator expected <= 2 during RandomUndersampling

For one of my datasets, I have a data imbalance problem: the minority class has very few samples compared to the majority class. So I want to balance the data by undersampling the majority class. When I try to use RandomUnderSampler from the imblearn package on a 3D array, I get an error:
ValueError: Found array with dim 3. Estimator expected <= 2.
The features in the data are in 3D format:
train['X'].shape
(276216, 101, 4)
The input labels
train['y'].shape
(276216, 1)
When I try to randomly undersample the data by running this:
from imblearn.under_sampling import RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
X_train_under, y_train_under = undersample.fit(train['X'], train['y'])
I get the above error. Any help would be appreciated.
The function expects 2D arrays to be passed as arguments. Reshape your data and you'll be fine. Also, you will have to call fit_resample as per docs.
X = train['X'].reshape(train['X'].shape[0], -1)
X_train_under, y_train_under = undersample.fit_resample(X, train['y'])
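After resampling, the features are still flattened to 2D, so they usually need to be reshaped back to (n, 101, 4). The sketch below shows that round trip on small made-up arrays. To keep it free of the imblearn dependency, the undersampling step is done by hand with NumPy as a stand-in for undersample.fit_resample; the array sizes and the binary labels are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small stand-in for train['X'] and train['y']
X = rng.normal(size=(20, 101, 4))
y = np.array([0] * 16 + [1] * 4)  # class 0 is the majority

n_samples, n_steps, n_channels = X.shape
X_2d = X.reshape(n_samples, -1)   # (20, 404): one flat row per sample

# Stand-in for undersample.fit_resample(X_2d, y):
# keep every minority sample and an equally sized random draw of majority samples
minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0), size=minority_idx.size, replace=False)
keep = np.sort(np.concatenate([majority_idx, minority_idx]))
X_under_2d, y_under = X_2d[keep], y[keep]

# Restore the original trailing dimensions
X_under = X_under_2d.reshape(-1, n_steps, n_channels)
print(X_under.shape, y_under.shape)
```

Because reshape only changes the view of each row, every resampled sample is identical to the corresponding 3D sample in the original array.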

Flat-field correction on hyperspectral data

I am working on a hyperspectral data set using the Spectral Python library. I started using Python for the first time on Monday, so everything is taking me a long time.
My data is in ENVI format, and I believe I have successfully read it in and converted it to NumPy arrays.
I am attempting a flat-field correction using this code:
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr), np.subtract(white_nparr, dark_nparr))
ValueError: operands could not be broadcast together with shapes (1367,384,288) (100,384,288)
This doesn't work because my white reference and dark reference are a different size from the data capture.
print(white_nparr.shape)
(297, 384, 288)
print(dark_nparr.shape)
(100, 384, 288)
print(data_nparr.shape)
(1367, 384, 288)
So, I understand why I am getting the error: the original white and dark references were captured with different image sizes from the dataset. My problem is creating a correction for the dataset while only having access to references of different sizes.
Has anyone handled this before? What approach did you use?
By the way, the data I am using is mineral hyperspectral data captured from drill core; there is a huge dataset held by Geological Survey Ireland that is free upon request.
So, I received an extremely helpful answer, which actually sparked a further question.
# created these files to broadcast as they are a horizontal line of spectra,
#a 2D array which captures the variation
white_nparr_horiz = white_nparr[-2]
dark_nparr_horiz = dark_nparr[-2]
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr_horiz), np.subtract(white_nparr_horiz, dark_nparr_horiz))
white_nparr_horiz.shape
Out[28]: (384, 288)
dark_nparr_horiz.shape
Out[29]: (384, 288)
So the shapes of these arrays are broadcastable across the data array, and I have tested on a few different indices that the result is what I expect, using this:
a = white_nparr_horiz[150, 144]
b = dark_nparr_horiz[150, 144]
c = data_nparr[500, 150, 144]
d = (c - b)/(a-b)
test = d == corrected_nparr[500, 150, 144]
print(test)
The output from this looks much more as I would expect reflectance data for this material to look, so I believe I am on the right path.
What I would like to do now is make white_nparr_horiz the mean of each band along the original first axis of the white reference (297, 384, 288), returned as an array of shape (384, 288), as opposed to a single value, as I believe it is now. I am sure this is possible, but I cannot figure out how.
As I said above, I am very new to Python, NumPy and image analysis, so apologies if this is obvious or if I am going in the wrong direction.
The problem is that your white and dark references should each be a single spectrum (1D array with 288 values), whereas yours are both 3-dimensional arrays (likely corresponding to image regions). To convert them to 1D, you can compute the mean, max, or min of each array, as appropriate. For example, to take the min of the dark reference and max of the white reference, you could convert them as follows:
dark_nparr = np.min(dark_nparr.reshape(-1, dark_nparr.shape[-1]), axis=0)
white_nparr = np.max(white_nparr.reshape(-1, white_nparr.shape[-1]), axis=0)
The lines above reshape the arrays to 2 dimensions and compute the max (or min) of the reshaped arrays.
If you prefer to use the spectral mean of each array instead, just replace np.max and np.min above with np.mean.
If you want each array to be reduced over its first dimension only (i.e., to have shape (384, 288)), then just don't reshape the arrays when doing the reduction; as before, swap in np.mean if you want the average:
dark_nparr = np.min(dark_nparr, axis=0)
white_nparr = np.max(white_nparr, axis=0)
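For the mean variant that the question asks about, a minimal sketch on small random arrays might look like the following. The array sizes and value ranges are made up so that the snippet runs quickly; with the real cubes the shapes would be (100, 384, 288), (297, 384, 288) and (1367, 384, 288).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small stand-ins for the real cubes, laid out as (lines, samples, bands)
dark_nparr = rng.uniform(0.0, 0.1, size=(10, 4, 3))
white_nparr = rng.uniform(0.9, 1.0, size=(30, 4, 3))
data_nparr = rng.uniform(0.1, 0.9, size=(50, 4, 3))

# Average each reference over its first axis: one value per (sample, band)
dark_mean = dark_nparr.mean(axis=0)    # shape (4, 3)
white_mean = white_nparr.mean(axis=0)  # shape (4, 3)

# The 2D references broadcast across every line of the 3D data cube
corrected_nparr = (data_nparr - dark_mean) / (white_mean - dark_mean)
print(corrected_nparr.shape)
```

When spot-checking individual pixels as in the question, np.isclose is a safer comparison than == for floating-point values.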

Python Surprise package gives different predictions for predict method vs manual compute using latent factors

I am using the surprise package for matrix factorization. Below is the code from the tutorial:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
algo.predict(str(196), str(301))
Out:
Prediction(uid='196', iid='301', r_ui=4, est=3.0740854315737174, details={'was_impossible': False})
However, when I use the SVD equation from its documentation and source code to manually compute the r_hat (r prediction):
algo.trainset.global_mean + algo.bi[301] + algo.bu[196] + np.dot(algo.qi[301], algo.pu[196])
Out:
2.817335384596893
The predictions do not match at all. Am I doing anything wrong or missing something?
I managed to figure it out. There's a difference between raw users/items and inner users/items. The former refers to the actual names of the users and items (e.g., user = John or a number like 10; items = Avengers or a number like 20) while the latter I assume to be the label encoded values given to the original users/items.
The trainset has four hidden attributes, _inner2raw_id_items, _inner2raw_id_users, _raw2inner_id_items and _raw2inner_id_users, which are dicts containing the conversion from one kind of id to the other.
If we call trainset._raw2inner_id_users and trainset._raw2inner_id_items, we get:
_raw2inner_id_users
{'196': 0,
'186': 1,
'22': 2, ...}
_raw2inner_id_items
{'242': 0,
'302': 1,
'377': 2, ...
'301': 404, ...}
Therefore, when we call:
algo.predict(str(196), str(301))
Out:
# different from original post as the prediction changes from run to run
Prediction(uid='196', iid='301', r_ui=None, est=3.2072618383879736, details={'was_impossible': False})
Here the raw user '196' maps to inner user 0 and the raw item '301' maps to inner item 404. So when we do the manual computation from the latent factors, biases, and global mean according to the SVD equation, we should use these inner ids instead:
algo.trainset.global_mean + algo.bi[404] + algo.bu[0] + np.dot(algo.qi[404], algo.pu[0])
Output:
3.2072618383879736
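The raw-to-inner lookup can be sketched without Surprise itself, using plain dictionaries and random NumPy arrays as stand-ins for the fitted model. All values below are made up for illustration; only the shape of the computation follows the SVD equation quoted above. In Surprise, the trainset also offers the public methods to_inner_uid and to_inner_iid for the same lookup, which avoids touching the hidden dicts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw -> inner id maps, in the spirit of the trainset dicts
raw2inner_users = {"196": 0, "186": 1, "22": 2}
raw2inner_items = {"242": 0, "302": 1, "377": 2}

# Made-up stand-ins for algo.trainset.global_mean, algo.bu, algo.bi, algo.pu, algo.qi
global_mean = 3.5
bu = rng.normal(scale=0.1, size=3)
bi = rng.normal(scale=0.1, size=3)
pu = rng.normal(scale=0.1, size=(3, 4))
qi = rng.normal(scale=0.1, size=(3, 4))

def estimate(raw_uid, raw_iid):
    """Translate raw ids to inner ids first, then apply the SVD equation."""
    u = raw2inner_users[raw_uid]
    i = raw2inner_items[raw_iid]
    return global_mean + bu[u] + bi[i] + np.dot(qi[i], pu[u])

print(estimate("196", "302"))
```

Indexing the arrays directly with the raw numbers (for example bu[196]) would either fail or silently pick a different user, which is what happened in the question.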

Why is ColumnTransformer producing a different output using the same code but different .csv files?

I am trying to finish this course tooth and nail with the hopes of being able to do this kind of stuff entry level by Spring time. This is my first post here on this incredible resource, and will do my best to conform to posting format. As a potential way to enforce my learning and commit to long term memory, I'm trying the same things on my own dataset of > 500 entries containing data more relevant to me as opposed to dummy data.
I'm learning about the data preprocessing phase where you fill in missing values and separate the columns into their respective X and Y to be fed into the models later on, if I understand correctly.
So in the course example, it's the top left dataset of countries. Then the bottom left is my own database of data I've been keeping for about a year on a multiplayer game I play. It has 100 or so characters you can choose from who are played between 5 different categorical roles.
[Image: course data set (top left) and personal dataset (bottom left)]
[Image: personal dataset, column-transformed results]
What's up with the different outputs that are produced, with the only difference being the dataset (.csv file)? The course's dataset looks right; that first column of countries (textual categories) gets turned into binary vectors in the output no? Why is the output on my data set omitting columns, and producing these bizarre looking tuples followed by what looks like a random number? I've tried removing the np.array function, I've tried printing each output at each level, unable to see what's causing the difference. I expected on my dataset it would transform the characters' names into binary vectors (combinations of 1s/0s?) so the computer can understand the difference and map them to the appropriate results. Instead I'm getting that weird looking output I've never seen before.
EDIT: It turns out these bizarre number combinations are what's called a "sparse matrix." I had to do some research, starting with type(), which yielded csr_array. If I understood what I read correctly, all the stuff inside takes up one column, so I just tried all rows/columns using [:] and I didn't get an error.
Really appreciate your time and assistance.
EDIT: Thanks to this thread I was able to make my way to the end of this data preprocessing/import/cleaning/ phase exercise, to feature scaling using my own dataset of ~ 550 rows.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# IMPORT RAW DATA // ASSIGN X AND Y RAW
df = pd.read_csv('datasets/winpredictor.csv')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# TRANSFORM CATEGORICAL DATA
ct = ColumnTransformer(transformers=\
[('encoder', OneHotEncoder(), [0, 1])], remainder='passthrough')
le = LabelEncoder()
X = ct.fit_transform(X)
y = le.fit_transform(y)
# SPLIT THE DATA INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(\
X, y, train_size=.8, test_size=.2, random_state=1)
# FEATURE SCALING
sc = StandardScaler(with_mean=False)
X_train[:, :] = sc.fit_transform(X_train[:, :])
X_test[:, :] = sc.transform(X_test[:, :])
First of all, I encourage you to keep working through this course; I'm sure you will be a fine data scientist in a few weeks.
Let's talk about your problem. It seems that you only have a display problem, caused by the large number of distinct "Hero" values (I think you have 37 unique values).
Let me explain the results you printed. The program only shows you the entries of each sample that are different from 0:
(0,10)=1 --> 0 refers to the first sample, and 10 is the column index of the entry that is equal to 1.
(0,37)=5 --> 0 refers to the first sample, and 37 is the column index of the entry that is equal to 5.
etc.
So your first sample will be something like:
[0,0,0,0,0,0,0,0,0,0,1,.........., 5, 980,-30, 1000, 6023]
which is the encoded form of the first sample, "Jakiro":
["Jakiro",5, 980,-30, 1000, 6023]
To sum up, the first 37 values come from your OneHotEncoder, and the last 5 are your original numerical values.
So the output seems to be correct; it is just displayed differently because the categorical variable has so many classes.
You can try reducing the number of X rows (to 4, for example) and repeating the process. Then you will get output similar to the course's.
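The switch between the two display styles can be reproduced on a small made-up table. ColumnTransformer returns a sparse matrix when the stacked result is mostly zeros (its sparse_threshold parameter defaults to 0.3) and a dense array otherwise. The sketch below, with hypothetical column names and values, shows the sparse result, its dense equivalent via toarray(), and sparse_threshold=0, which always returns a dense array.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical table: one categorical column with many distinct values, one numeric column
df = pd.DataFrame({
    "hero": [f"hero_{i}" for i in range(20)],
    "kills": range(20),
})

# Default settings: 20 one-hot columns that are mostly zeros, so the output is sparse
ct_default = ColumnTransformer(
    [("encoder", OneHotEncoder(), ["hero"])], remainder="passthrough")
X_sparse = ct_default.fit_transform(df)
print(type(X_sparse))            # a SciPy sparse type, printed as (row, col) value entries
print(X_sparse.toarray().shape)  # the same data as an ordinary 2D array

# sparse_threshold=0 asks ColumnTransformer to always return a dense array
ct_dense = ColumnTransformer(
    [("encoder", OneHotEncoder(), ["hero"])],
    remainder="passthrough", sparse_threshold=0)
X_dense = ct_dense.fit_transform(df)
print(type(X_dense))
```

Both results hold the same numbers; only the storage and the printed form differ.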

How to preprocess a dataset with many types of missing data

I'm trying to do the beginner machine learning project Big Mart Sales.
The data set of this project contains many types of missing values (NaN), and values that need to be changed (lf -> Low Fat, reg -> Regular, etc.)
My current approach to preprocessing this data is to create an imputer for every type of value that needs to be fixed:
from sklearn.impute import SimpleImputer as Imputer
# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])
# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.
Here are some rows of my data:
There is a python package which can do this for you in a simple way, ctrl4ai
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system memory if the dataset is large
Returns: Dataframe [with separate column for each categorical values]
"However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?"
If you have a numerical column, you can use some approaches to fill the missing data:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Let's see how it works with the mean of one column. One method is to use fillna from pandas:
X['Name'] = X['Name'].fillna(X['Name'].mean())
For categorical data please have a look here: Impute categorical missing values in scikit-learn
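Putting the pieces together, the whole clean-up can be sketched in pandas on a tiny made-up frame. The column names and values are assumptions modelled on the question, not the real data: replace() normalises the inconsistent labels in one call, and fillna() handles the numerical and categorical gaps column by column, with no 2D slicing needed.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data set (column names are assumptions)
df = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "LF", "low fat", "reg", "Regular"],
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan],
    "Outlet_Size": ["Medium", np.nan, "Medium", "High", "Small"],
})

# Make the labels consistent with a single mapping instead of one imputer per spelling
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"})

# Numerical column: fill missing values with the column mean
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Categorical column: fill missing values with the most frequent value
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

print(df)
```

If the statistics are meant to feed a model, they should be computed on the training split only and then reused for the test split.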

Resources