After using an Embedding layer in Keras, why do I get: Input to reshape is a tensor with 2 values, but the requested shape has 4 [Op:Reshape]

I created the sample polars data frame below. I want to use Keras's Normalization and Embedding layers to preprocess my data. sum_cost and sum_gmv are my numerical columns, and I normalize each column individually with a Normalization layer. category is my categorical column, and I want to use an Embedding layer to get an embedding vector for each category.
import numpy as np
import polars as pl
import tensorflow as tf
df = pl.DataFrame(
    {'sum_cost': [1., 4., 7., 3., 2.],
     'category': [311, 210, 450, 311, 567],
     'sum_gmv': [-4., -2., 0., 2., 4.],
    }
)
numeric_col = ['sum_cost', 'sum_gmv']
categorical_col = ['category']
all_inputs = []
inputs = {}
for col in numeric_col + categorical_col:
    if col in numeric_col:
        # one scalar per example; the layer learns mean and variance from the column
        inputs[col] = tf.keras.Input(shape=(), name=col, dtype=tf.float32)
        normalizer = tf.keras.layers.Normalization(axis=None)
        normalizer.adapt(df[col].to_numpy())
        all_inputs.append(normalizer(inputs[col])[:, tf.newaxis])
    elif col in categorical_col:
        # one integer id per example, mapped to a 2-dimensional embedding vector
        inputs[col] = tf.keras.Input(shape=(), name=col, dtype=tf.int32)
        embedding = tf.keras.layers.Embedding(
            567 + 1,
            output_dim=2,
            name='cat_embedding')(inputs[col])
        em_model = tf.keras.layers.Reshape((2,))(embedding)
        all_inputs.append(em_model)
outputs = tf.keras.layers.Concatenate(axis=1)(all_inputs)
model = tf.keras.Model([inputs[col] for col in numeric_col + categorical_col], outputs)
When I test the preprocessing model on a single data point with model(dict(df.to_pandas().iloc[1,:])), I get the error quoted in the title.
On the other hand, when I pass this input:
model({'sum_cost':1.,'sum_gmv':1,'category':np.array([[1.]])})
it works. I don't understand why I have to provide an array for category but can pass scalars for the numerical columns. In the original dataset they are all scalars. Also, I don't define a shape for my Input tensors. Why is this happening, and how can I solve it?
Thanks!
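The numbers in the error message can be reproduced with a small NumPy imitation of the shape arithmetic. This is only a sketch: it assumes that Reshape keeps axis 0 as the batch axis and reshapes the remaining axes to the target shape, and the names table and keras_like_reshape are hypothetical stand-ins, not Keras API.

```python
import numpy as np

# Stand-in for the Embedding weights: 568 categories, 2 values each.
table = np.arange(568 * 2, dtype=np.float32).reshape(568, 2)

def keras_like_reshape(x, target_shape=(2,)):
    # Assumption: mimics a Reshape layer that keeps axis 0 as the batch
    # axis and reshapes everything after it to target_shape.
    return np.reshape(x, (x.shape[0],) + target_shape)

# Scalar category: the lookup has shape (2,), so axis 0 (size 2) is taken
# for the batch axis and the requested shape becomes (2, 2), i.e. 4 values.
scalar_lookup = table[np.asarray(210)]
try:
    keras_like_reshape(scalar_lookup)
    scalar_ok = True
except ValueError:
    scalar_ok = False

# Category with a batch axis, shape (1,): lookup is (1, 2), reshape succeeds.
batched = keras_like_reshape(table[np.asarray([210])])

# Shape (1, 1), as in the call that worked: lookup is (1, 1, 2) -> (1, 2).
nested = keras_like_reshape(table[np.asarray([[210]])])
```

If the real layers behave like this imitation, giving every value in the row a batch axis (for example np.array([value]) for each column) should let a single data point pass through the model.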

Related

ValueError: y should be a 1d array, got an array of shape (74216, 2) instead

I am trying to apply Logistic Regression Models with text.
I Vectorized my data by TFIDF:
vectorizer = TfidfVectorizer(max_features=1500)
x = vectorizer.fit_transform(df['text_column'])
vectorizer_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names())
df.drop('text_column', axis=1, inplace=True)
result = pd.concat([df, vectorizer_df], axis=1)
I split my data:
x = result.drop('target', 1)
y = result['target']
and finally:
x_raw_train, x_raw_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
I build a classifier:
classifier = Pipeline([('clf', LogisticRegression(solver="liblinear"))])
classifier.fit(x_raw_train, y_train)
And I get this error:
ValueError: y should be a 1d array, got an array of shape (74216, 2) instead.
This is strange, because with max_features=1000 it works, but with max_features=1500 I get the error.
Can someone help me, please?
Basically, the text_column column in df contains at least one occurrence of the word target. This word becomes a column name when you convert the TF-IDF feature matrix to a dataframe with the parameter columns=vectorizer.get_feature_names(). Then, when you concatenate df with vectorizer_df, both target columns end up in the final dataframe.
Therefore, result['target'] returns two columns instead of one, because there are effectively two target columns in the result dataframe. This leads to the ValueError: as the error message says, the estimator needs a 1d target array, whereas your target array has two columns.
You only see this with the higher max_features value because max_features keeps only the most frequent terms: with max_features=1000 the word target does not make the cut, so no duplicate column is created and the process runs as it should.
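The name collision can be reproduced on a toy example. This is a minimal sketch with made-up data: the frames below are hypothetical stand-ins for df and the TF-IDF frame, and prefixing the TF-IDF columns is just one possible workaround, not something the original code does.

```python
import pandas as pd

# Hypothetical stand-in for the original frame: a label column named "target".
df = pd.DataFrame({"target": [0, 1, 0]})

# Hypothetical stand-in for the TF-IDF frame: one vocabulary term is "target".
vectorizer_df = pd.DataFrame({"price": [0.1, 0.0, 0.4],
                              "target": [0.0, 0.7, 0.2]})

result = pd.concat([df, vectorizer_df], axis=1)
y = result["target"]      # two columns share the name, so this is a DataFrame
y_shape = y.shape

# One possible workaround: prefix the TF-IDF columns so names cannot collide.
safe = pd.concat([df, vectorizer_df.add_prefix("tfidf_")], axis=1)
safe_y_shape = safe["target"].shape
```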
Unless you have a reason to vectorize separately, the best solution for this is to combine all your steps in a pipeline. It's as simple as:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver="liblinear")),
])
pipeline.fit(x_train.text_column, y_train.target)

Issue with array dimensions during concatenation in python

I'm trying to preprocess data by scaling numeric data and transform categorical data using one hot encoder.
The function below applies this to train and test data, returning for each dataset concatenation of scaled numeric data and encoded categorical data.
But when I execute it, I keep getting the following error on this line:
trainX = np.hstack([encoded_train, train_numeric_data])
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)
I have not managed to fix this. Would anybody have an idea?
Thanks in advance.
Seb
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
def process_data(train, test):
    # perform min-max scaling for all numeric features (13 columns)
    numeric = ["a","b","c","d","e","f","g","h","i","j","k","l","m"]
    cs = MinMaxScaler()
    train_numeric_data = cs.fit_transform(pd.DataFrame(train, columns=numeric))
    test_numeric_data = cs.transform(pd.DataFrame(test, columns=numeric))
    # one-hot encode categorical data (11 columns)
    categorylist = ["n","o","p","q","r","s","t","u","v","w","x"]
    train_categorical_data = pd.DataFrame(train, columns=categorylist)
    test_categorical_data = pd.DataFrame(test, columns=categorylist)
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoder.fit(train_categorical_data)
    encoded_train = encoder.transform(train_categorical_data)
    encoded_test = encoder.transform(test_categorical_data)
    # construct our training and testing data points by concatenating the categorical features with the numeric features
    trainX = np.hstack([encoded_train, train_numeric_data])
    testX = np.hstack([encoded_test, test_numeric_data])
    # return the concatenated training and testing data
    return (trainX, testX)
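I have actually solved the issue by converting encoded_train to a dense array. OneHotEncoder returns a SciPy sparse matrix by default, which np.hstack cannot stack with a dense 2D array:
encoded_train_array = pd.DataFrame(encoded_train.toarray())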
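Two ways of combining the blocks are sketched below. To keep the example small, a hand-built SciPy sparse matrix stands in for the OneHotEncoder output and a small dense array stands in for the MinMaxScaler output; both are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse

# Stand-in for the OneHotEncoder output (sparse, as it is by default).
encoded_train = sparse.csr_matrix(np.array([[1., 0.], [0., 1.], [1., 0.]]))
# Stand-in for the MinMaxScaler output, a dense 2-D array.
train_numeric_data = np.array([[0.0, 0.5], [1.0, 0.0], [0.5, 1.0]])

# Option 1: densify the encoded block, then stack with NumPy.
trainX_dense = np.hstack([encoded_train.toarray(), train_numeric_data])

# Option 2: keep everything sparse and stack with SciPy instead.
trainX_sparse = sparse.hstack(
    [encoded_train, sparse.csr_matrix(train_numeric_data)]
).tocsr()
```

The sparse route may be preferable when the one-hot block has many columns, since densifying it can use a lot of memory. Whichever option is chosen, the same conversion has to be applied to encoded_test as well.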

Is there a way to add a 'sentiment' column after applying CountVectorizer or TfidfTransformer to a dataframe?

I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
def get_sentiment(s):
    vs = analyzer.polarity_scores(s)
    if vs['compound'] >= 0.5:
        return 1
    elif vs['compound'] <= -0.5:
        return -1
    else:
        return 0
df['sentiment'] = df['review'].apply(get_sentiment)
For simplicity's sake, the data has already been labeled as either class '0' or '1' in the classification column, but I am training the model to classify new instances that have not been labeled yet.
Then, in my train/test split, I do the following:
msg_train, msg_test, label_train, label_test = train_test_split(df.drop('classification', axis=1), df['classification'], test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_review)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
    no_punc = [c for c in sentence if c not in string.punctuation]
    no_punc = ''.join(no_punc)
    no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
    stemmed_words = [ps.stem(w) for w in no_stopwords]
    return stemmed_words
Here stopwords_set is the collection of English stopwords from the nltk library, and ps is a PorterStemmer from the nltk library (for word stemming).
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched this error before, I saw that the likely issue could've been that there is a mismatch in the number of records for each attribute. I've found this not to be the case. All the records that I am using have values for every column.
Can someone else help me interpret what this error could mean?
My end goal is to have a dataframe that has the CountVectorizer and TfidfTransformer applied to the text, but also retains the column for the sentiment of each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure what is causing the error, since I don't know what size your dataframe is supposed to be; I would need more information. On which line is the error thrown?
As for retaining the sentiment column, you could apply CountVectorizer and TfidfTransformer only to the text data (by the way, you could skip a step and apply TfidfVectorizer directly) and then have another transformer in the pipeline that adds the original sentiment column before the features are fed to the classifier.
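One way to sketch that idea is with scikit-learn's ColumnTransformer, which applies a different transformer to each column and concatenates the results. The toy data below is made up, the custom clean_review analyzer is left out for brevity, and one-hot encoding the sentiment is an assumption added here because MultinomialNB rejects negative feature values such as -1. The sketch also shows one plausible reading of the 2 in [2, 505]: iterating over a DataFrame yields its column names, so a text vectorizer that receives the whole two-column frame would see two "documents".

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "review": ["great app love it", "crashes all the time", "okay but slow",
               "love the new design", "terrible update crashes", "works fine"],
    "sentiment": [1, -1, 0, 1, -1, 0],
    "classification": [1, 0, 0, 1, 0, 1],
})
X = df.drop("classification", axis=1)
y = df["classification"]

# Iterating a DataFrame yields column names, not rows.
items_seen_by_iteration = list(X)

features = ColumnTransformer([
    # A string (not a list) selects a 1-D column, which text vectorizers expect.
    ("tfidf", TfidfVectorizer(), "review"),
    # One-hot encode sentiment so that no negative values reach MultinomialNB.
    ("sentiment", OneHotEncoder(handle_unknown="ignore"), ["sentiment"]),
])
pipeline = Pipeline([("features", features), ("classifier", MultinomialNB())])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
```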

How to normalize time series data with multiple features by using sklearn?

For data with the shape (num_samples,features), MinMaxScaler from sklearn.preprocessing can be used to normalize it easily.
However, when using the same method for time series data with the shape (num_samples, time_steps,features), sklearn will give an error.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Making artificial time series data
x1 = np.linspace(0,3,4).reshape(-1,1)
x2 = np.linspace(10,13,4).reshape(-1,1)
X1 = np.concatenate((x1*0.1,x2*0.1),axis=1)
X2 = np.concatenate((x1,x2),axis=1)
X = np.stack((X1,X2))
# Trying to normalize
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)  # <--- error here
ValueError: Found array with dim 3. MinMaxScaler expected <= 2.
This post suggests something like
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
Yet, it only works for data with only 1 feature. Since my data has more than 1 feature, this method doesn't work.
How to normalize time series data with multiple features?
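To normalize a 3D tensor of shape (n_samples, timesteps, n_features) so that each feature is scaled independently, take the minimum and maximum over the sample and time axes:
(timeseries-timeseries.min(axis=(0,1)))/(timeseries.max(axis=(0,1))-timeseries.min(axis=(0,1)))
Reducing over axes 0 and 1 leaves one minimum and one maximum per feature, and these broadcast against the last (feature) axis, so each feature is normalized independently. Reducing with axis=2 instead would take the minimum across the features and return an array of shape (n_samples, timesteps), which in general does not broadcast back against the 3D input.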
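A minimal NumPy sketch of per-feature min-max scaling on the 3D array from the question. It assumes the goal is one minimum and one maximum per feature, so the reduction runs over the sample and time axes, axis=(0, 1), and the result broadcasts along the last axis. The names feat_min and feat_max are introduced here for illustration.

```python
import numpy as np

# Same artificial data as in the question, shape (2, 4, 2).
x1 = np.linspace(0, 3, 4).reshape(-1, 1)
x2 = np.linspace(10, 13, 4).reshape(-1, 1)
X = np.stack((np.concatenate((x1 * 0.1, x2 * 0.1), axis=1),
              np.concatenate((x1, x2), axis=1)))

# One min and one max per feature, each of shape (n_features,).
feat_min = X.min(axis=(0, 1))
feat_max = X.max(axis=(0, 1))

# Shape (n_features,) broadcasts against the last axis of X.
X_norm = (X - feat_min) / (feat_max - feat_min)
```

To stay with MinMaxScaler, the same effect should be obtainable by flattening the first two axes before fitting and restoring them afterwards, for example scaler.fit_transform(X.reshape(-1, X.shape[-1])).reshape(X.shape).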

Apply softmax on a subset of neurons

I'm building a convolutional net in Keras that assigns multiple classes to an image. The image has 9 points of interest, each of which can be classified in one of three ways, so I wanted to add 27 output neurons with a softmax activation that computes a probability distribution over each consecutive triple of neurons.
Is it possible to do that? I know I can simply add a big softmax layer but this would result in a probability distribution over all output neurons which is too broad for my application.
In the most naive implementation, you can reshape your data and you'll get exactly what you described: "probability for each consecutive triplet".
You take the output with 27 classes, shaped like (batch_size,27) and reshape it:
model.add(Reshape((9,3)))
model.add(Activation('softmax'))
Take care to reshape your y_true data as well. Or add yet another reshape in the model to restore the original form:
model.add(Reshape((27,)))
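What the reshape, softmax, reshape sequence computes can be checked outside Keras with a small NumPy sketch. It relies on softmax being applied along the last axis, which is the Keras default; the function grouped_softmax and the random logits are hypothetical and only serve the illustration.

```python
import numpy as np

def grouped_softmax(logits, groups=9, classes=3):
    # (batch, 27) -> (batch, 9, 3): one row of 3 scores per point of interest
    batch = logits.shape[0]
    z = logits.reshape(batch, groups, classes)
    # Softmax over the last axis only, shifted by the max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    probs = e / e.sum(axis=-1, keepdims=True)
    # Back to the flat (batch, 27) layout.
    return probs.reshape(batch, groups * classes)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 27))
probs = grouped_softmax(logits)

# Each consecutive triple is its own probability distribution.
triple_sums = probs.reshape(4, 9, 3).sum(axis=-1)
```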
In more elaborate solutions, you'd probably separate the 9 points of interest according to their locations (if they have roughly static locations) and build parallel paths. For instance, suppose your 9 locations are evenly spaced rectangles, and you want to use the same net and classes for those segments:
inputImage = Input((height,width,channels))
#supposing the width and height are multiples of 3, for easiness in this example
recHeight = height//3
recWidth = width//3
#create layers here without calling them
someConv1 = Conv2D(...)
someConv2 = Conv2D(...)
flatten = Flatten()
classificator = Dense(..., activation='softmax')
outputs = []
for i in range(3):
    for j in range(3):
        fromH = i*recHeight
        toH = fromH + recHeight
        fromW = j*recWidth
        toW = fromW + recWidth
        imagePart = Lambda(
            lambda x: x[:,fromH:toH, fromW:toW,:],
            output_shape=(recHeight,recWidth,channels)
        )(inputImage)
        #using the same net and classes for all segments
        #if this is not true, create new layers here instead of using the same
        output = someConv1(imagePart)
        output = someConv2(output)
        output = flatten(output)
        output = classificator(output)
        outputs.append(output)
outputs = Concatenate()(outputs)
model = Model(inputImage,outputs)
