DataFrame has no reshape attribute - python-3.x

I'm trying to draw a scatter plot under the following conditions, but it fails to produce a graph. First it gave me the error that the X and Y sizes are not equal. Then, when I tried to reshape the dimensions, which are (13 rows, 4 columns), it gave me another error: no attribute reshape. I need your help.
df.reshape((df.shape[0], df.shape[1], 1))
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_test, y_pred, color = 'green')
plt.title(' Test_Result vs Salary')
plt.xlabel('test_score')
plt.ylabel('salary')
plt.show()

Assuming that you want to plot X_train and y_train in a 2D plot, make sure the numbers of elements are equal. The first error you encountered means that the shapes of X_train and y_train are inconsistent, i.e. len(X_train) != len(y_train), or equivalently X_train.shape[0] != y_train.shape[0].
Besides, if your code is exactly what you posted, I don't see any relevance between df.reshape((df.shape[0], df.shape[1], 1)) and plt.scatter(X_train, y_train, color='red'). Also note that a pandas DataFrame has no reshape method, so you need a trick: reshape the underlying NumPy array instead.
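A minimal sketch of that trick, using dummy data in place of the original df (the column choices here are assumptions, not from the question):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Dummy 13x4 DataFrame standing in for the asker's data
df = pd.DataFrame(np.random.rand(13, 4))

# A DataFrame has no .reshape; reshape its underlying NumPy array instead
arr = df.to_numpy().reshape((df.shape[0], df.shape[1], 1))

# For a scatter plot, X and Y must contain the same number of elements
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color='red')
plt.xlabel('test_score')
plt.ylabel('salary')
plt.show()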

Related

I get ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (1000,) and requested shape (1000,1)

# In order to use the 3D plot, the objects should have a certain shape, so we reshape the targets.
# The proper method to use is reshape; it takes as arguments the dimensions in which we want to fit the object.
targets = targets.reshape(observations,)
# Plotting according to the conventional matplotlib.pyplot syntax
# Declare the figure
fig = plt.figure()
# A method allowing us to create the 3D plot
ax = fig.add_subplot(111, projection='3d')
# Choose the axes.
ax.plot(xs, zs, targets)
# Set labels
ax.set_xlabel('xs')
ax.set_ylabel('zs')
ax.set_zlabel('Targets')
# You can fiddle with the azim parameter to plot the data from different angles. Just change the value of azim=100
# to azim = 0 ; azim = 200, or whatever. Check and see what happens.
ax.view_init(azim=100)
# So far we were just describing the plot. This method actually shows the plot.
plt.show()
# We reshape the targets back to the shape that they were in before plotting.
# This reshaping is a side-effect of the 3D plot. Sorry for that.
targets = targets.reshape(observations,1)
The shape of xs and zs is (1000, 1); the shape of targets is (1000,). Even if I keep its shape as (1000, 1), I get a new error like this:
AttributeError: 'Line3D' object has no attribute '_verts3d'
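One thing worth trying (a sketch with dummy data, not the original setup): flattening all three arrays to 1-D of equal length avoids the broadcast mismatch between (1000,) and (1000, 1), and in many reported cases also avoids the _verts3d failure, which tends to be shape- or version-related.
import numpy as np
import matplotlib.pyplot as plt

# Dummy stand-ins for the (1000, 1) inputs and (1000,) targets
xs = np.random.uniform(-10, 10, (1000, 1))
zs = np.random.uniform(-10, 10, (1000, 1))
targets = 2 * xs - 3 * zs + np.random.randn(1000, 1)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Flatten everything so all three arguments have shape (1000,)
ax.plot(xs.flatten(), zs.flatten(), targets.flatten())
ax.set_xlabel('xs')
ax.set_ylabel('zs')
ax.set_zlabel('Targets')
plt.show()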

ValueError: y should be a 1d array, got an array of shape (74216, 2) instead

I am trying to apply a logistic regression model to text data.
I vectorized my data with TF-IDF:
vectorizer = TfidfVectorizer(max_features=1500)
x = vectorizer.fit_transform(df['text_column'])
vectorizer_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names())
df.drop('text_column', axis=1, inplace=True)
result = pd.concat([df, vectorizer_df], axis=1)
I split my data:
x = result.drop('target', axis=1)
y = result['target']
and finally:
x_raw_train, x_raw_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
I build a classifier:
classifier = Pipeline([('clf', LogisticRegression(solver="liblinear"))])
classifier.fit(x_raw_train, y_train)
And I get this error:
ValueError: y should be a 1d array, got an array of shape (74216, 2) instead.
This is strange, because with max_features=1000 it works fine, but with max_features=1500 I get the error.
Can someone help me, please?
Basically, the text_column column in df contains at least one occurrence of the word target. This word becomes a column name when you convert the TF-IDF feature matrix to a dataframe with the parameter columns=vectorizer.get_feature_names(). So when you concatenate df with vectorizer_df, you end up with both target columns in the final dataframe.
Therefore, result['target'] returns two columns instead of one, because there are effectively two target columns in the result dataframe. This naturally leads to a ValueError: as the error message says, you need a 1d target array to fit your estimator, whereas your target array has two columns.
The reason you only encounter this with a higher max_features threshold is simply that with the lower threshold the word target doesn't make the cut, which allows the process to run as it should.
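A quick way to rule out this collision (a sketch based on the code above) is to prefix the TF-IDF columns before concatenating, so no vocabulary word can shadow the label column:
vectorizer_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names())  # newer scikit-learn: get_feature_names_out()
# Prefix feature columns so a word like 'target' cannot collide with the label
vectorizer_df = vectorizer_df.add_prefix('tfidf_')
result = pd.concat([df.drop('text_column', axis=1), vectorizer_df], axis=1)
assert not result.columns.duplicated().any()  # no duplicate column names left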
Unless you have a reason to vectorize separately, the best solution for this is to combine all your steps in a pipeline. It's as simple as:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver="liblinear")),
])
pipeline.fit(x_train, y_train)
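For completeness, a sketch of the full flow with this pipeline, splitting the raw text column directly so the TF-IDF step is fitted only on the training fold:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

x_train, x_test, y_train, y_test = train_test_split(
    df['text_column'], df['target'], test_size=0.3, random_state=0)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver='liblinear')),
])
pipeline.fit(x_train, y_train)
print(pipeline.score(x_test, y_test))  # accuracy on the held-out fold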

Scaling row-wise with MinMaxScaler from Sklearn

By default, scalers from sklearn work column-wise. But I need my data to be scaled row-wise, so I did the following:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
# %% Generating sample data
x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])
#%% Train/Test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train.T).T # scaling line-wise
x_test = scaler.transform(x_test)  # <-- Error here
But I am getting the following error:
ValueError: X has 3 features, but MinMaxScaler is expecting 2 features as input.
I don't understand what's wrong here. Why does it say it is expecting 2 features, when all of my arrays (x, x_train, and x_test) have 3 features? How can I fix this?
Scalers such as MinMaxScaler are stateful: when you fit one, it calculates and saves per-column statistics (the minimum and maximum for MinMaxScaler; means and standard deviations for StandardScaler); when transforming (train or test sets), it reuses those saved statistics. Your transpose trick doesn't work with that: the statistics get saved per row of the original data, and your test set doesn't have the same rows, so transform cannot work correctly (it throws an error if the number of rows differs, or silently mis-scales if it happens to match).
What you want isn't stateful: test sets should be transformed completely independently of the training set. Indeed, every row should be transformed independently of every other. So you could just do this kind of transformation before splitting, or use fit_transform on the test set('s transpose) as well.
For L2 normalization of rows there's a built-in, Normalizer (see the docs). I don't think there's an analogue for min-max normalization, but you could write a FunctionTransformer to do it, as sketched below.
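A minimal sketch of that FunctionTransformer idea; it is stateless, so it applies the same row-wise rule to the question's x_train and x_test independently:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def minmax_rows(X):
    # Scale each row independently to [0, 1]
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=1, keepdims=True)
    maxs = X.max(axis=1, keepdims=True)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant rows
    return (X - mins) / span

row_scaler = FunctionTransformer(minmax_rows)
x_train = row_scaler.fit_transform(x_train)  # fit is a no-op here
x_test = row_scaler.transform(x_test)        # no saved state to mismatch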
This is possible to do, and I can think of a scenario where it would be useful. Normally, MinMaxScaler scales each of x, y, and z with respect to the other observations of that feature; that's the column-wise ("series") scaling. Now imagine that instead you wanted to map each point under the constraint x+y+z = 1. I think this is what the OP is asking for. I have done this in the past, and I will describe how.
You need to treat your individual observations as a column multi-index, like a higher-dimensional feature. Then you build a pipeline within which the observations are transformed from column-wise to row-wise, after which you do the min/max scaling. This gets you to x+y+z = 1, but you still need to get back to the original shape of the data, for which you will need to track the index of each observation. Within the pipeline you can use something like the DataframeFunctionTransformer below (reproduced from a version I have seen online), so that you can use pandas functions to reshape the data and merge back in with the indices.
class DataframeFunctionTransformer:
    def __init__(self, func):
        self.func = func

    def transform(self, input_df, **transform_params):
        return self.func(input_df)

    def fit(self, X, y=None, **fit_params):
        return self
Regarding the statefulness of MinMaxScaler: I think in a scenario such as this, the state of MinMaxScaler doesn't get used. It is purely acting as a transformer that maps these points to a different space, meeting the constraint that x, y, and z are scaled such that they add up to 1.
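A hedged usage sketch of that class with a pandas row-wise min-max function (the function name and data here are illustrative, not from the original post):
import pandas as pd

def rowwise_minmax(df):
    # Scale each row of the DataFrame independently to [0, 1]
    lo = df.min(axis=1)
    hi = df.max(axis=1)
    return df.sub(lo, axis=0).div(hi - lo, axis=0)

scaler = DataframeFunctionTransformer(rowwise_minmax)
df = pd.DataFrame([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]], columns=list('xyz'))
scaled = scaler.fit(df).transform(df)  # indices and columns are preserved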
@Murilo, hope this gets you started with a solution. Must be an interesting problem.

plot_decision_regions with error "Filler values must be provided when X has more than 2 training features."

I am plotting a 2D plot for SVC Bernoulli output:
converted the text to vectors with average word2vec and standardised the data,
split the data into train and test,
and through grid search found the best C and gamma (RBF).
clf = SVC(C=100,gamma=0.0001)
clf.fit(X_train1,y_train)
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
I receive the error:
ValueError: y must be a NumPy array. Found
I also tried to convert y to a NumPy array. Then it prompts the error:
ValueError: y must be an integer array. Found object. Try passing the array as y.astype(np.integer)
Finally I converted it to an integer array.
Now it prompts the error:
ValueError: Filler values must be provided when X has more than 2 training features.
You can use PCA to reduce your multi-dimensional data to two-dimensional data. Then pass the obtained result to plot_decision_regions and there will be no need for filler values.
from sklearn.decomposition import PCA
from mlxtend.plotting import plot_decision_regions
clf = SVC(C=100,gamma=0.0001)
pca = PCA(n_components = 2)
X_train2 = pca.fit_transform(X_train)
clf.fit(X_train2, y_train)
plot_decision_regions(X_train2, y_train, clf=clf, legend=2)
plt.xlabel('PC 1', size=14)  # the axes are principal components now, not original columns
plt.ylabel('PC 2', size=14)
plt.title('SVM Decision Region Boundary', size=16)
I've spent some time with this too, as plot_decision_regions was then complaining ValueError: Column(s) [2] need to be accounted for in either feature_index or filler_feature_values, and there's one more parameter needed to avoid this.
So, say, you have 4 features and they come unnamed:
X_train_std.shape[1] == 4
We can refer to each feature by its index: 0, 1, 2, 3. You can only plot 2 features at a time; say you want 0 and 2.
You'll need to specify one additional parameter (on top of those specified in @sos.cott's answer), feature_index, and fill the rest with fillers:
value = 1.5
width = 0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
                            feature_index=[0, 2],                        # these will be plotted
                            filler_feature_values={1: value, 3: value},  # these will be ignored
                            filler_feature_ranges={1: width, 3: width})
For the NumPy array problem, you can just do the following (assuming X_train and y_train are still pandas DataFrames):
plot_decision_regions(X_train.values, y_train.values, clf=clf, legend=2)
For the filler_feature issue, you have to account for every feature, so you do the following:
value = 1.5
width = 0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
                            filler_feature_values={2: value, 3: value, 4: value},
                            filler_feature_ranges={2: width, 3: width, 4: width},
                            legend=2, ax=ax)
You need to add one filler entry for each feature that is not among the two being plotted.

Plotting residuals of masked values with `statsmodels`

I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
    '''
    Takes as arguments two arrays, and a string for the array name.
    Uses Ordinary Least Squares to compute the statistical parameters for the
    array against log(z), and determines the equation for the line of best fit.
    Returns the results summary, residuals, statistical parameters in a list,
    and the best-fit equation.
    '''
    # Mask NaN values in both axes
    mask = ~np.isnan(y) & ~np.isnan(x)
    # Compute model parameters
    model = sm.OLS(y, sm.add_constant(x), missing='drop')
    results = model.fit()
    residuals = results.resid
    # Compute fit parameters
    params = stats.linregress(x[mask], y[mask])
    fit = params[0]*x + params[1]
    fitEquation = r'$(%s)=(%.4g \pm %.4g) \times redshift+%.4g$' % (
        yName,
        params[0],  # slope
        params[4],  # stderr in slope
        params[1])  # y-intercept
    return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?
Hi @jim421616. Since statsmodels drops the missing values, you should use the model's exog attribute to plot the scatter, as shown below.
plt.scatter(model.model.exog[:,1], model.resid)
For reference, a complete dummy example:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Generate data
x = np.random.rand(1000)
y = np.sin(x * 25) + 0.1 * np.random.rand(1000)
# Make some values NaN
y[np.random.choice(np.arange(1000), size=100)] = np.nan
x[np.random.choice(np.arange(1000), size=80)] = np.nan
# Fit model
model = sm.OLS(y, sm.add_constant(x), missing='drop').fit()
print(model.summary())
# Plot residuals against the retained x-values
plt.scatter(model.model.exog[:, 1], model.resid)
plt.show()
