Tokenize Review wise for sentiment analysis - nlp

In this Amazon dataset I have Product_Description, Product_Type and Sentiment columns, and I want to build a classification model with Product_Description and Product_Type as X and Sentiment as Y. I get an error that I have not been able to solve. I want each whole sentence (review) to be tokenized for TF-IDF, not split into separate words.
amazon.head()
Link to data example
Z = amazon["Product_Description"]
Y = amazon["Sentiment"]
tfidf = TfidfVectorizer()
tf = pd.DataFrame(tfidf.fit_transform(Z),columns = ["Product_Description"])
X = pd.concat((tf,amazon["Product_Type"]),axis = 1)
X.drop(X[X["Product_Description"].isnull()].index, inplace = True)
test_size = 0.2
seed = 45
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
X_train.shape,X_test.shape # [((5081, 2), (1271, 2))] is the output
model = LogisticRegression(max_iter = 500)

rfe = RFE(model, n_features_to_select = 2)
fit = rfe.fit(X, Y)

fit.n_features_
fit.support_
fit.ranking_
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Input In [57], in <cell line: 7>()
4 model = LogisticRegression(max_iter = 500)
6 rfe = RFE(model, n_features_to_select = 2)
----> 7 fit = rfe.fit(X, Y)
9 fit.n_features_
10 fit.support_
ValueError: setting an array element with a sequence.

Related

multiple layer perceptron to classify mnist dataset

I need some help with a project I am working on for a data science course. In this project I classify the digits of the MNIST dataset in three ways:
using the dissimilarity matrices induced by the distances 1, 2 and infinity
using a BallTree
using a neural network.
The first two parts are done, but I am getting an error in the neural network code that I can't solve. This is the code.
#Upload the MNIST dataset
data = load('mnist.npz')
x_train = data['arr_0']
y_train = data['arr_1']
x_test = data['arr_2']
y_test = data['arr_3']
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
The output is
(60000, 28, 28) (60000,)
(10000, 28, 28) (10000,)
Then,
#Setting up the neural network and defining sigmoid function
#self.mtrx holds the neurons in each level
#self.weight, bias, grad hold weight, bias and gradient values between level L and L - 1

class NeuralNetwork:

    def __init__(self, rows, columns=0):
        self.mtrx = np.zeros((rows, 1))
        self.weight = np.random.random((rows, columns)) / columns ** .5
        self.bias = np.random.random((rows, 1)) * -1.0
        self.grad = np.zeros((rows, columns))

    def sigmoid(self):
        return 1 / (1 + np.exp(-self.mtrx))

    def sigmoid_derivative(self):
        return self.sigmoid() * (1.0 - self.sigmoid())
#Initializing neural network levels

lvl_input = NeuralNetwork(784)
lvl_one = NeuralNetwork(200, 784)
lvl_two = NeuralNetwork(200, 200)
lvl_output = NeuralNetwork(10, 200)
#Forward and backward propagation functions

def forward_prop():
    lvl_one.mtrx = lvl_one.weight.dot(lvl_input.mtrx) + lvl_one.bias
    lvl_two.mtrx = lvl_two.weight.dot(lvl_one.sigmoid()) + lvl_two.bias
    lvl_output.mtrx = lvl_output.weight.dot(lvl_two.sigmoid()) + lvl_output.bias


def back_prop(actual):
    val = np.zeros((10, 1))
    val[actual] = 1

    delta_3 = (lvl_output.sigmoid() - val) * lvl_output.sigmoid_derivative()
    delta_2 = np.dot(lvl_output.weight.transpose(), delta_3) * lvl_two.sigmoid_derivative()
    delta_1 = np.dot(lvl_two.weight.transpose(), delta_2) * lvl_one.sigmoid_derivative()

    lvl_output.grad = lvl_two.sigmoid().transpose() * delta_3
    lvl_two.grad = lvl_one.sigmoid().transpose() * delta_2
    lvl_one.grad = lvl_input.sigmoid().transpose() * delta_1
#Storing mnist data into np.array

def make_image(c):
    lvl_input.mtrx = x_train[c]
#Evaluating cost function

def cost(actual):
    val = np.zeros((10, 1))
    val[actual] = 1
    cost_val = (lvl_output.sigmoid() - val) ** 2
    return np.sum(cost_val)
#Subtraction gradients from weights and initializing learning rate

learning_rate = .01

def update():
    lvl_output.weight -= learning_rate * lvl_output.grad
    lvl_two.weight -= learning_rate * lvl_two.grad
    lvl_one.weight -= learning_rate * lvl_one.grad
And finally I train the neural network.
#Training neural network
#iter_1 equals number of batches
#iter_2 equals number of iterations in one batch
iter_1 = 50
iter_2 = 100
for batch_num in range(iter_1):
    update()
    counter=0
    for batches in range(iter_2):
        make_image(counter)
        num = np.argmax(y_train[counter])
        counter += 1
        forward_prop()
        back_prop(num)
        print("actual: ", num, " guess: ", np.argmax(lvl_output.mtrx), " cost", cost(num))
I get the following error and I can't figure out what's wrong with my code.. can anybody help?
ValueError Traceback (most recent call last)
<ipython-input-12-8821054ddd29> in <module>
13 num = np.argmax(y_train[counter])
14 counter += 1
---> 15 forward_prop()
16 back_prop(num)
17 print("actual: ", num, " guess: ", np.argmax(lvl_output.mtrx), " cost", cost(num))
<ipython-input-6-e6875bcd1a03> in forward_prop()
2
3 def forward_prop():
----> 4 lvl_one.mtrx = lvl_one.weight.dot(lvl_input.mtrx) + lvl_one.bias
5 lvl_two.mtrx = lvl_two.weight.dot(lvl_one.sigmoid()) + lvl_two.bias
6 lvl_output.mtrx = lvl_output.weight.dot(lvl_two.sigmoid()) + lvl_output.bias
ValueError: shapes (200,784) and (28,28) not aligned: 784 (dim 1) != 28 (dim 0)
In your code:
def make_image(c):
    lvl_input.mtrx = x_train[c]
Although you initialise lvl_input.mtrx with shape (rows, 1), you later assign data with shape (28, 28) to it. The training images need to be reshaped with reshape() to shape (784, 1) before they are assigned.
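The reshape suggested above could look like the following sketch. It uses a dummy 28 by 28 array in place of x_train[c], takes the (784, 1) target shape from the way lvl_input is initialised in the question, and the helper name to_column is hypothetical.

```python
import numpy as np

def to_column(image):
    """Flatten a 2-D image into a column vector of shape (rows * cols, 1)."""
    return image.reshape(-1, 1)

# Dummy stand-in for one MNIST image x_train[c], which has shape (28, 28).
image = np.zeros((28, 28))
column = to_column(image)

# Same shape as lvl_one.weight in the question: (200, 784).
weight = np.zeros((200, 784))
activation = weight.dot(column)

print(column.shape)      # (784, 1)
print(activation.shape)  # (200, 1)
```

In the question's code this would amount to writing lvl_input.mtrx = x_train[c].reshape(-1, 1) inside make_image, so that the dot product with the (200, 784) weight matrix is aligned.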

what type of error is this in keras, and can you please give me a solution?

dataset_total = concat((training_set['Milk'], test_set['Milk']), axis = 0)
inputs = dataset_total[len(dataset_total) - len(test_set) - 10:].values
inputs = inputs.reshape(-1,1)
inputs = sc.transform(inputs)
X_test = []
for i in range(10, 16):
    X_test.append(inputs[i-10:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
predicted_stock_price = regressor.predict(X_test)
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
Here is my error:
IndexError Traceback (most recent call last)
<ipython-input-26-93b179e798f0> in <module>
----> 1 dataset_total = concat((training_set['Milk'], test_set['Milk']),
axis = 0)
2 inputs = dataset_total[len(dataset_total) - len(test_set) -
10:].values
3 inputs = inputs.reshape(-1,1)
4 inputs = sc.transform(inputs)
5 X_test = []
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis
(`None`) and integer or boolean arrays are valid indices

Python - window size for machine learning model

I am working on a Python task using a logistic regression classifier, and I am trying to set a window size of 2 for the input data before the fitting step. Here is what I have tried:
from itertools import islice
def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result
x_train = list(window(x_train))
y_train = list(window(y_train))
x_test = list(window(x_test))
y_test = list(window(y_test))
seed = 42
##LogisticRegressionCV Classifier
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_pred1=lr.predict(x_test)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(lr, x_train, y_train, cv=kfold)
Here I have used a function to apply a window size of 2, but the following error appears in the fitting step, because after windowing the shape of the dataset becomes, for example, (1150731, 2, 3) instead of (1150731, 3):
ValueError: Found array with dim 3. Estimator expected <= 2.
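One possible way to keep a window of 2 and still hand the estimator a 2-D input is to concatenate the rows of each window into a single feature vector, so that a window of 2 rows with 3 features each becomes one row of 6 features. The sketch below reuses the window function from the question. The helper flatten_windows, the toy data, and the choice to give each window the label of its last row are assumptions for illustration, not something stated in the question.

```python
from itertools import islice

def window(seq, n=2):
    """Sliding window of width n over seq (same logic as in the question)."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def flatten_windows(rows, n=2):
    """Concatenate the n rows of every window into one flat feature list."""
    return [[value for row in win for value in row] for win in window(rows, n)]

# Toy data: 3 samples with 3 features each, and one label per sample.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
labels = [0, 1, 0]

n = 2
x_flat = flatten_windows(rows, n)
# Assumption: each window is labelled with the label of its last row.
y_aligned = labels[n - 1:]

print(x_flat)     # [[1, 2, 3, 4, 5, 6], [4, 5, 6, 7, 8, 9]]
print(y_aligned)  # [1, 0]
```

With this layout the features are a list of equal-length rows (2-D) and the labels stay 1-D, which is the shape scikit-learn estimators expect. The labels are sliced rather than windowed, because windowing them would produce one tuple of labels per sample.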

ValueError: Error when checking : expected dense_1_input to have shape (9,) but got array with shape (1,)

I have this dataset:
   step      pos_x     pos_y      vel_x      vel_y  ship_lander_angle  ship_lander_angular_vel  leg_1_ground_contact  leg_2_ground_contact  action
0     0  -0.004053  0.937387  -0.410560  -0.215127           0.004703                 0.092998                   0.0                   0.0       3
1     1  -0.008040  0.933774  -0.401600  -0.240878           0.007613                 0.058204                   0.0                   0.0       3
2     2  -0.011951  0.929763  -0.392188  -0.267401           0.008632                 0.020372                   0.0                   0.0       3
3     3  -0.015796  0.925359  -0.383742  -0.293582           0.007955                -0.013536                   0.0                   0.0       3
4     4  -0.019576  0.920563  -0.375744  -0.319748           0.005674                -0.045625                   0.0                   0.0       3
I split it as follows:
X = dataset[dataset.columns.difference(["action"])]
Y = dataset["action"]
# Use a range scaling to scale all variables to between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()
cols = X.columns
X = pd.DataFrame(min_max_scaler.fit_transform(X), columns = cols) # Watch out for putting back in columns here
# Perform split to train, validation, test
x_train_plus_valid, x_test, y_train_plus_valid, y_test = train_test_split(X, Y, random_state=0, test_size = 0.30, train_size = 0.7)
x_train, x_valid, y_train, y_valid = train_test_split(x_train_plus_valid, y_train_plus_valid, random_state=0, test_size = 0.199/0.7, train_size = 0.5/0.7)
# convert to numpy arrays
y_train_wide = keras.utils.to_categorical(np.asarray(y_train)) # convert the target classes to binary
y_train_plus_valid_wide = keras.utils.to_categorical(np.asarray(y_train_plus_valid))
y_valid_wide = keras.utils.to_categorical(np.asarray(y_valid))
I then use a neural network to train on my data:
model_mlp = Sequential()
model_mlp.add(Dense(input_dim=9, units=32))
model_mlp.add(Activation('relu'))
model_mlp.add(Dropout(0.2))
model_mlp.add(Dense(32))
model_mlp.add(Activation('relu'))
model_mlp.add(Dropout(0.2))
model_mlp.add(Dense(4))
model_mlp.add(Activation('softmax'))
#model.add(Dense(num_classes, activation='softmax'))
model_mlp.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
model_mlp.fit(np.asfarray(x_train), np.asfarray(y_train_wide), \
epochs=20, batch_size=32, verbose=1, \
validation_data=(np.asfarray(x_valid), np.asfarray(y_valid_wide)))
I got almost 93% accuracy. I save the model as follows:
filepath = "first_model.mod"
model_mlp.save(filepath)
In another file, where I need to load the model and calculate the reward, I get the error mentioned above:
if __name__=="__main__":
    # Load the Lunar Lander environment
    env = LunarLander()
    s = env.reset()
    # Load and initialise the control model
    ROWS = 64
    COLS = 64
    CHANNELS = 1
    model = keras.models.load_model("first_model.mod")
    # Run the game loop
    total_reward = 0
    steps = 0
    while True:
        # Get the model to make a prediction
        a = model.predict_classes(s)
        a = a[0]
        # Step on the game
        s, r, done, info = env.step(a)
        env.render()
        total_reward += r
        if steps % 20 == 0 or done:
            print(["{:+0.2f}".format(x) for x in s])
            print("step {} total_reward {:+0.2f}".format(steps, total_reward))
        steps += 1
        if done: break
The error is at the following line: a = model.predict_classes(s)
The problem is in this line:
X = dataset[dataset.columns.difference(["action"])]
First of all, it includes 9 columns, instead of 8, which makes the network incompatible with gym states returned from env.step. This causes the shape mismatch error.
Next, columns.difference also shuffles the input columns (they become sorted by name). The columns thus become:
Index(['leg_1_ground_contact', 'leg_2_ground_contact', 'pos_x', 'pos_y',
'ship_lander_angle', 'ship_lander_angular_vel', 'step', 'vel_x',
'vel_y'],
dtype='object')
The right way to split X and y is this:
X = dataset.iloc[:,1:-1]
Y = dataset.iloc[:,-1]
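The difference between the two ways of selecting columns can be checked with a small sketch. It assumes a DataFrame that has only the column names from the question and a single row of dummy zeros; the variable names by_difference and by_position are made up for illustration.

```python
import pandas as pd

# Column names as shown in the question; the single row of zeros is dummy data.
columns = [
    "step", "pos_x", "pos_y", "vel_x", "vel_y",
    "ship_lander_angle", "ship_lander_angular_vel",
    "leg_1_ground_contact", "leg_2_ground_contact", "action",
]
dataset = pd.DataFrame([[0] * len(columns)], columns=columns)

# columns.difference keeps "step" and returns the names sorted alphabetically.
by_difference = list(dataset.columns.difference(["action"]))

# Positional slicing drops the first ("step") and last ("action") columns
# and keeps the remaining ones in their original order.
by_position = list(dataset.iloc[:, 1:-1].columns)

print(len(by_difference), by_difference[0])  # 9 leg_1_ground_contact
print(len(by_position), by_position[0])      # 8 pos_x
```

Only the positional selection yields 8 features in the order of the original columns, starting with pos_x.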

For loop and Linear regression

Good evening,
I would like to repeat both a subsetting step and a linear regression for each article code in the same data frame.
#I get the unique codes of the articles
codes = np.unique(data["cod_id"])
#Split
X = data['price']
y = data["quantity"]
accuracy = []
for i in np.nditer(codes):
    data = data.loc[df["cod_id"] == i]
    #Arrange an if statement to avoid 0-element arrays, while splitting (80% train, 20% test)
    if int(len(data)) <= 2:
        X_train = X
        y_train = y
        # Test dataset
        X_test = X
        y_test = y
    else:
        t = 0.8
        t = int(t*len(data))
        #Split
        t = int(t*len(data))
        # Train dataset
        X_train = X[:t]
        y_train = y[:t]
        # Test dataset
        X_test = X[t:]
        y_test = y[t:]
    #Run the Algorithm
    lr = linear_model.LinearRegression()
    lr.fit(X_train, y_train)
    predicted_test_tr = lr.predict(X_test)
    pred_cost = (X_test["price"] * predicted_test_tr).sum()
    real_cost = (X_test["price"] * y_test).sum()
    delta = (pred_cost - owner_cost)/owner_cost
    accuracy.append(delta)
But it produces a list "accuracy" that is as long as "codes", with the same value at every position:
print(accuracy)
5.43234
5.43234
5.43234
...
How can I fix this issue?
Thank you
