ALS model - predicted full_u * v^t * v ratings are very high

I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?
! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
from pyspark.mllib.recommendation import Rating
ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(
        user=int(p[0]),
        product=int(p[1]),
        rating=float(p[2]),
    )).cache()
from pyspark.mllib.recommendation import ALS
rank = 50
numIterations = 20
lambdaParam = 0.1
model = ALS.train(ratingsRDD, rank, numIterations, lambdaParam)
Then extract the product features ...
import json
import numpy as np
pf = model.productFeatures()
pf_vals = pf.sortByKey().values().collect()
pf_keys = pf.sortByKey().keys().collect()
Vt = np.matrix(np.asarray(pf_vals))
full_u = np.zeros(len(pf_keys))
def set_rating(pf_keys, full_u, key, val):
    try:
        idx = pf_keys.index(key)
        full_u.itemset(idx, val)
    except ValueError:
        pass

set_rating(pf_keys, full_u, 260, 9)   # Star Wars (1977)
set_rating(pf_keys, full_u, 1, 8)     # Toy Story (1995)
set_rating(pf_keys, full_u, 16, 7)    # Casino (1995)
set_rating(pf_keys, full_u, 25, 8)    # Leaving Las Vegas (1995)
set_rating(pf_keys, full_u, 32, 9)    # Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
set_rating(pf_keys, full_u, 335, 4)   # Flintstones, The (1994)
set_rating(pf_keys, full_u, 379, 3)   # Timecop (1994)
set_rating(pf_keys, full_u, 296, 7)   # Pulp Fiction (1994)
set_rating(pf_keys, full_u, 858, 10)  # Godfather, The (1972)
set_rating(pf_keys, full_u, 50, 8)    # Usual Suspects, The (1995)
recommendations = full_u*Vt*Vt.T
top_ten_ratings = list(np.sort(recommendations)[:,-10:].flat)
print("predicted rating value", top_ten_ratings)
top_ten_recommended_product_ids = np.where(recommendations >= np.sort(recommendations)[:,-10:].min())[1]
top_ten_recommended_product_ids = list(np.array(top_ten_recommended_product_ids))
print("predict rating prod_id", top_ten_recommended_product_ids)
However the predicted ratings seem way too high:
('predicted rating value', [313.67320347694897, 315.30874327316576, 317.1563289268388, 317.45475214423948, 318.19788673744563, 319.93044594688428, 323.92448427140653, 324.12553531632761, 325.41052886977582, 327.12199687047649])
('predict rating prod_id', [49, 287, 309, 558, 744, 802, 1839, 2117, 2698, 3111])
This appears to be incorrect. Any tips appreciated.

I think the approach mentioned would work if you only care about the ranking of the movies. If you want an actual rating, there seems to be something off in terms of dimension/scaling.
The idea here is to guess the latent representation of your new user. Normally, for a user already in the factorization, user i, you have their latent representation u_i (the ith row in model.userFeatures()), and you get their rating for a given movie j using model.predict, which basically multiplies u_i by the latent representation of the product, v_j. You can get all the predicted ratings at once if you multiply by the whole product matrix: u_i*v.
For a new user you have to guess their latent representation u_new from full_u_new.
Basically, you want 50 coefficients that represent your new user's affinity towards each of the latent product factors.
For simplicity, and since it was enough for my implicit feedback use case, I simply used the dot product, basically projecting the new user onto the product latent factors: full_u_new*V^t gives you 50 coefficients, coefficient i being how much your new user looks like product latent factor i. This works especially well with implicit feedback.
So, using the dot product will give you that, but it won't be scaled, and that explains the high scores you are seeing.
To get usable scores you need a more accurately scaled u_new. I think you could get that using cosine similarity, like they did here: https://github.com/apache/incubator-predictionio/blob/release/0.10.0/examples/scala-parallel-recommendation/custom-query/src/main/scala/ALSAlgorithm.scala
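As a rough sketch of that normalization idea, using the variable names from your code (this is just my reading of the linked Scala example, not a tested port of it):
import numpy as np

V = np.asarray(Vt)                                      # (num_products, rank) product factors
u = np.asarray(full_u).ravel()                          # (num_products,) new user's ratings
u_unit = u / np.linalg.norm(u)                          # unit-length rating vector
V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-length product factors
u_new = u_unit @ V_unit                                 # (rank,) projection of the new user
scores = u_new @ V_unit.T                               # one score per product, on a bounded scale
top_ten = np.argsort(scores)[-10:]                      # indices into pf_keys
The scores are no longer on the original 1-10 rating scale either, but they stay in a comparable, bounded range, which is what the cosine trick buys you.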
The approach mentioned by @ScottEdwards2000 in the comments is interesting too, but rather different. You could indeed look for the most similar user(s) in your training set. If there is more than one, you could take the average. I don't think it would do too badly, but it is a really different approach, and you need the full rating matrix (to find the most similar user(s)). Getting one close user should definitely solve the scaling problem. If you manage to make both approaches work, you could compare the results!

Related

Identify best GridsearchCV scoring metric for food prediction in XGBoost

I am using GridSearchCV to find the best parameters to tune XGBoost for a food prediction algorithm.
I am struggling to identify the scoring metric that would result in the best profit (sales margin minus wastage costs), as this is ultimately what I am looking for. Running the script below and plugging the result into the data (I reserved some data for testing only), I noticed that a better R2 seems to lead to a higher profit than a better RMSE, but I am struggling to find an explanation that would guide me to the best scoring method.
Here is some info on the situation:
It costs me 6 USD to produce the product and I sell it for 9 USD, so my margin is 3 USD. Therefore my wastage is 6 USD multiplied by (production minus sales quantities), whereas my earnings are sales quantities multiplied by 3.
Example: I produce 100, sell 70, and waste 30, so my earnings are 70*3 - 30*6 = 30.
So I have an imbalance between sales and wastage.
Main Question: Which scoring metric puts a higher penalty weight on the over-prediction?
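For concreteness, the profit above could be written as a custom scoring function with scikit-learn's make_scorer; this is just a sketch of the idea (the helper name and the clipping of negative predictions are my own choices, not tested code):
import numpy as np
from sklearn.metrics import make_scorer

def profit_score(y_true, y_pred):
    # predictions are what I produce, true values are what I can actually sell
    produced = np.maximum(y_pred, 0)
    sold = np.minimum(produced, y_true)   # cannot sell more than the demand
    wasted = produced - sold              # over-production
    return np.sum(3 * sold - 6 * wasted)  # 3 USD margin, 6 USD wastage cost

profit_scorer = make_scorer(profit_score, greater_is_better=True)
# this could then be passed as scoring=profit_scorer to GridSearchCV
Because a wasted unit costs 6 USD while a sold unit only earns 3 USD, such a scorer penalizes over-prediction twice as heavily as under-prediction, which is exactly the asymmetry I am asking about.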
My current code:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

X = consumption[feature_names]
y = consumption['Meal1']
data_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'min_child_weight': [1, 2],
    'gamma': [0.05, 0.06],
    'colsample_bytree': [0.22, 0.23],
    'n_estimators': range(28, 29),
    'max_depth': range(3, 8),
    'reg_alpha': range(1, 2),
    'reg_lambda': range(1, 2),
    'subsample': [0.7, 0.8, 0.9],
    'learning_rate': [0.1, 0.2],
}
fixed_params = {'objective': 'reg:squarederror', 'booster': 'gbtree'}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(**fixed_params)
# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, scoring="r2", cv=5, verbose=1)
# Fit grid_mse to the data
grid_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest Score found: ", np.sqrt(np.abs(grid_mse.best_score_)))

SPACY custom NER is not returning any entity

I am trying to train a spaCy model to recognize a few custom NER labels. The training data is given below; it is mostly about recognizing a few server models, dates in the FY format, and types of HDD:
TRAIN_DATA = [
    ('Send me the number of units shipped in FY21 for A566TY server', {'entities': [(39, 42, 'DateParse'), (48, 53, 'server')]}),
    ('Send me the number of units shipped in FY-21 for A5890Y server', {'entities': [(39, 43, 'DateParse'), (49, 53, 'server')]}),
    ('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'), (58, 61, 'server'), (27, 29, 'HDD')]}),
    ('Total revenue in FY20Q2 for 3.5 HDD', {'entities': [(17, 22, 'DateParse'), (28, 30, 'HDD')]}),
    ('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'), (58, 61, 'server'), (27, 29, 'HDD')]}),
    ('Total units shipped in FY2017-FY2021', {'entities': [(23, 28, 'DateParse'), (30, 35, 'DateParse')]}),
    ('Total units shipped in FY 18', {'entities': [(23, 27, 'DateParse')]}),
    ('Total units shipped between FY16 and FY2021', {'entities': [(28, 31, 'DateParse'), (37, 42, 'DateParse')]})
]
import random
import spacy
from spacy.util import minibatch, compounding

def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)
    return nlp
But when running the code, even on the training data, no entities are returned.
prdnlp = train_spacy(TRAIN_DATA, 100)
for text, _ in TRAIN_DATA:
    doc = prdnlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
The output shows an empty entity list for every sentence; none of the annotated spans are recognized, even on the training data.
spaCy can currently only train from entity annotations that line up with token boundaries. The main problem is that your span end characters are one character too short. The character start/end values should work just like string slices of the text:
text = "Send me the number of units shipped in FY21 for A566TY server"
# (39, 42, 'DateParse')
assert text[39:42] == "FY2"
You should have (39, 43, 'DateParse') instead.
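A quick way to check whether a character span lines up with token boundaries is Doc.char_span, which returns None for misaligned spans; a small sketch, using the first training sentence:
import spacy

nlp_check = spacy.blank('en')
doc = nlp_check("Send me the number of units shipped in FY21 for A566TY server")
print(doc.char_span(39, 42, label="DateParse"))  # None - "FY2" does not end on a token boundary
print(doc.char_span(39, 43, label="DateParse"))  # FY21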
A secondary problem is that you may also need to adjust the tokenizer for cases like FY2017-FY2021, because the default English tokenizer treats this as one token, so the annotations [(23, 28, 'DateParse'), (30, 35, 'DateParse')] would be ignored during training.
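If that turns out to be the case, one way to handle it (a spaCy v2-style sketch to match the rest of the code here; the exact infix pattern is my own guess) is to add an infix rule so the hyphen between the two FY values becomes its own token:
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank('en')
infixes = list(nlp.Defaults.infixes) + [r'(?<=[0-9])-(?=FY)']  # split digit-hyphen-FY
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp('Total units shipped in FY2017-FY2021')])
# expected to include 'FY2017', '-', 'FY2021' as separate tokens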
See a more detailed explanation here: https://github.com/explosion/spaCy/issues/4946#issuecomment-580663925

Keras layer for slicing image data into sliding windows

I have a set of images, all of varying widths, but with fixed height set to 100 pixels and 3 channels of depth. My task is to classify if each vertical line in the image is interesting or not. To do that, I look at the line in context of its 10 predecessor and successor lines. Imagine the algorithm sweeping from left to right of the image, detecting vertical lines containing points of interest.
My first attempt at doing this was to manually cut out these sliding windows using numpy before feeding the data into the Keras model. Like this:
# Pad left and right
s = np.repeat(D[:1], 10, axis = 0)
e = np.repeat(D[-1:], 10, axis = 0)
# D now has shape (w + 20, 100, 3)
D = np.concatenate((s, D, e))
# Sliding windows creation trick from SO question
idx = np.arange(21)[None,:] + np.arange(len(D) - 20)[:,None]
windows = D[idx]
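For reference, here is a quick sanity check of the shapes this windowing produces, on a toy image of width 50 (made-up numbers, not my real data):
import numpy as np

D = np.random.random((50, 100, 3))     # one fake image, width 50
s = np.repeat(D[:1], 10, axis=0)
e = np.repeat(D[-1:], 10, axis=0)
D = np.concatenate((s, D, e))          # (70, 100, 3) after padding
idx = np.arange(21)[None, :] + np.arange(len(D) - 20)[:, None]
windows = D[idx]
print(windows.shape)                   # (50, 21, 100, 3): one 21-line window per vertical line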
Then all windows and all ground truth 0/1 values for all vertical lines in all images would be concatenated into two very long arrays.
I have verified that this works, in principle. I fed each window to a Keras layer looking like this:
Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid', ...)
But the windowing causes the memory usage to increase 21 times, so doing it this way becomes impractical. I think my scenario must be very common in machine learning, so there should be some standard method in Keras to do this efficiently? E.g. I would like to feed Keras my raw image data (w, 100, 3), tell it what the sliding window size is, and let it figure out the rest. I have looked at some sample code but I'm an ML noob, so I don't get it.
Unfortunately this isn't an easy problem, because it can involve using a variable-sized input for your Keras model. While I think it is possible to do this with proper use of placeholders, that's certainly no place for a beginner to start. Your other option is a data generator. As with many computationally intensive tasks, there is often a trade-off between compute speed and memory requirements; using a generator is more compute-heavy, and it will be done entirely on your CPU (no GPU acceleration), but it won't cause the memory increase.
The point of a data generator is that it applies the operation to images one at a time to produce a batch, trains on that batch, then frees up the memory - so you only ever keep one batch worth of data in memory at any time. Unfortunately, if the generation is time-consuming, this can seriously affect performance.
The generator will be a Python generator (using the 'yield' keyword) and is expected to produce a single batch of data. Keras is very good at using arbitrary batch sizes, so you can always make one image yield one batch, especially to start.
Here is the keras page on fit_generator - I warn you, this starts to become a lot of work very quickly, consider buying more memory:
https://keras.io/models/model/#fit_generator
Fine I'll do it for you :P
import numpy as np
import pandas as pd
import keras
from keras.models import Model, model_from_json
from keras.layers import Dense, Concatenate, Multiply, Add, Subtract, Input, Dropout, Lambda, Conv1D, Flatten
from tensorflow.python.client import device_lib

# check for my gpu
print(device_lib.list_local_devices())

# make some fake image data
# 1000 random widths
data_widths = np.floor(np.random.random(1000)*100)

# producing 1000 random images with dimensions w x 100 x 3
# and a vector of which vertical lines are interesting
# I assume your data looks like this
images = []
interesting = []
for w in data_widths:
    images.append(np.random.random([int(w), 100, 3]))
    interesting.append(np.random.random(int(w)) > 0.5)

# this is a generator
def image_generator(images, interesting):
    num = 0
    while num < len(images):
        windows = None
        truth = None
        D = images[num]
        # this should look familiar
        # Pad left and right
        s = np.repeat(D[:1], 10, axis=0)
        e = np.repeat(D[-1:], 10, axis=0)
        # D now has shape (w + 20, 100, 3)
        D = np.concatenate((s, D, e))
        # Sliding windows creation trick from SO question
        idx = np.arange(21)[None, :] + np.arange(len(D) - 20)[:, None]
        windows = D[idx]
        truth = np.expand_dims(1*interesting[num], axis=1)
        yield (windows, truth)
        num += 1
        # the generator MUST loop
        if num == len(images):
            num = 0

# basic model - replace with your own
input_layer = Input(shape=(21, 100, 3), name="input_node")
fc = Flatten()(input_layer)
fc = Dense(100, activation='relu', name="fc1")(fc)
fc = Dense(50, activation='relu', name="fc2")(fc)
fc = Dense(10, activation='relu', name="fc3")(fc)
output_layer = Dense(1, activation='sigmoid', name="output")(fc)
model = Model(input_layer, output_layer)
model.compile(optimizer="adam", loss='binary_crossentropy')
model.summary()

# and training
training_history = model.fit_generator(image_generator(images, interesting),
                                        epochs=5,
                                        initial_epoch=0,
                                        steps_per_epoch=len(images),
                                        verbose=1
                                        )

Implementing word dropout in pytorch

I want to add word dropout to my network so that I can have sufficient training examples for training the embedding of the "unk" token. As far as I'm aware, this is standard practice. Let's assume the index of the unk token is 0, and the index for padding is 1 (we can switch them if that's more convenient).
This is a simple CNN network which implements word dropout the way I would have expected it to work:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self, params):
        super(Classifier, self).__init__()
        self.params = params
        self.word_dropout = nn.Dropout(params["word_dropout"])
        self.pad = torch.nn.ConstantPad1d(max(params["window_sizes"])-1, 1)
        self.embedding = nn.Embedding(params["vocab_size"], params["word_dim"], padding_idx=1)
        self.convs = nn.ModuleList([nn.Conv1d(1, params["feature_num"], params["word_dim"] * window_size, stride=params["word_dim"], bias=False) for window_size in params["window_sizes"]])
        self.dropout = nn.Dropout(params["dropout"])
        self.fc = nn.Linear(params["feature_num"] * len(params["window_sizes"]), params["num_classes"])

    def forward(self, x, l):
        x = self.word_dropout(x)
        x = self.pad(x)
        embedded_x = self.embedding(x)
        embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
        features = [F.relu(conv(embedded_x)) for conv in self.convs]
        pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
        pooled = torch.cat(pooled, 1)
        pooled = self.dropout(pooled)
        logit = self.fc(pooled)
        return logit
Don't mind the padding - PyTorch doesn't have an easy way of using non-zero padding in CNNs, much less trainable non-zero padding, so I'm doing it manually. Dropout also doesn't let me drop tokens to a non-zero value, and I want to keep the padding token separate from the unk token. I'm keeping it in my example because it's the reason for this question's existence.
This doesn't work because dropout wants FloatTensors so that it can scale them properly, while my input is LongTensors that don't need to be scaled.
Is there an easy way of doing this in pytorch? I essentially want to use LongTensor-friendly dropout (bonus: better if it will let me specify a dropout constant that isn't 0, so that I could use zero padding).
Actually I would do it outside of your model, before converting your input into a LongTensor.
This would look like this:
import random
import torch

def add_unk(input_token_id, p):
    # random.random() gives you a value between 0 and 1
    # to avoid switching your padding to 0 we add 'input_token_id > 1'
    if random.random() < p and input_token_id > 1:
        return 0
    else:
        return input_token_id

# then you have your input token_id
# for this example I take just a random number, let's say 127
input_token_id = 127

# let p be your probability for UNK
p = 0.01

your_input_tensor = torch.LongTensor([add_unk(input_token_id, p)])
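To apply the same idea to a whole sentence of token ids rather than a single id, a minimal extension would be (the ids here are made up):
sentence_ids = [127, 4, 1, 88, 1]  # 1 = padding, so add_unk never replaces it
your_input_tensor = torch.LongTensor([add_unk(token_id, p=0.01) for token_id in sentence_ids])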
Edit:
So there are two options that come to mind, both of which are actually GPU-friendly. In general, both solutions should be much more efficient than the first version.
Option one - Doing computation directly in forward():
If you're not using torch.utils and don't have plans to use it later, this is probably the way to go.
Instead of doing the computation beforehand, we just do it in the forward() method of the main PyTorch class. However, I see no (simple) way of doing this in torch 0.3.1, so you would need to upgrade to version 0.4.0:
So imagine x is your input vector:
>>> x = torch.tensor(range(10))
>>> x
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
probs is a vector containing uniform probabilities for dropout, so we can check later against our probability for dropout:
>>> probs = torch.empty(10).uniform_(0, 1)
>>> probs
tensor([ 0.9793,  0.1742,  0.0904,  0.8735,  0.4774,  0.2329,  0.0074,  0.5398,  0.4681,  0.5314])
Now we apply the dropout probabilities probs on our input x:
>>> torch.where(probs > 0.2, x, torch.zeros(10, dtype=torch.int64))
tensor([ 0, 0, 0, 3, 4, 5, 0, 7, 8, 9])
Note: To see some effect I chose a dropout probability of 0.2 here. In reality you probably want it to be smaller.
You can pick any token / id you like for this; here is an example with 42 as the unknown token id:
>>> unk_token = 42
>>> torch.where(probs > 0.2, x, torch.empty(10, dtype=torch.int64).fill_(unk_token))
tensor([ 0, 42, 42, 3, 4, 5, 42, 7, 8, 9])
torch.where comes with PyTorch 0.4.0:
https://pytorch.org/docs/master/torch.html#torch.where
I don't know about the shapes of your network, but your forward() should look something like this then (when using mini-batching you need to flatten the input before applying dropout):
def forward_train(self, x, l):
    # probabilities
    probs = torch.empty(x.size(0)).uniform_(0, 1)
    # applying word dropout
    x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))
    # continue like before ...
    x = self.pad(x)
    embedded_x = self.embedding(x)
    embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
    features = [F.relu(conv(embedded_x)) for conv in self.convs]
    pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
    pooled = torch.cat(pooled, 1)
    pooled = self.dropout(pooled)
    logit = self.fc(pooled)
    return logit
Note: I named the function forward_train(), so you should use another forward() without dropout for evaluation / prediction. But you could also gate the dropout with an if condition on self.training.
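For instance, a small helper along those lines (my own wrapping of the torch.where idea above, not code from the question) that drops tokens to the unk id only in training mode and never touches the padding id could look like this:
import torch

def word_dropout(token_ids, p=0.02, unk_id=0, pad_id=1, training=True):
    if not training:
        return token_ids
    probs = torch.empty_like(token_ids, dtype=torch.float).uniform_(0, 1)
    drop = (probs < p) & (token_ids != pad_id)       # never drop padding positions
    return torch.where(drop, torch.full_like(token_ids, unk_id), token_ids)

x = torch.tensor([5, 1, 7, 9, 1, 3])                 # toy batch of token ids, 1 = padding
print(word_dropout(x, p=0.5))                        # padding positions stay 1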
Option two: using torch.utils.data.Dataset:
If you're using the Dataset class provided by torch.utils, it is very easy to do this kind of pre-processing efficiently. Dataset uses strong multi-processing acceleration by default, so the code sample above just has to be executed in the __getitem__ method of your Dataset class.
This could look like this:
def __getitem__(self, index):
    'Generates one sample of data'
    # Select sample
    ID = self.input_tokens[index]
    # Load data and get label
    # using the add_unk function from the code above
    X = torch.LongTensor([add_unk(ID, p=0.01)])
    y = self.targets[index]
    return X, y
This is a bit out of context and doesn't look very elegant, but I think you get the idea. According to this blog post by Shervine Amidi at Stanford, it should be no problem to do more complex pre-processing steps in this function:
Since our code [the Dataset] is designed to be multicore-friendly, note that you can do more complex operations instead (e.g. computations from source files) without worrying that data generation becomes a bottleneck in the training process.
The linked blog post - "A detailed example of how to generate your data in parallel with PyTorch" - also provides a good guide for implementing data generation with Dataset and DataLoader.
I guess you'll prefer option one - only two lines and it should be very efficient. :)
Good luck!

Using predict on new text with kmeans (sklearn)?

I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine, getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of short text definitions
documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the defintions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training and prediction documents end up with a different number of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance
For completeness, I will answer my own question with an answer from here, which doesn't answer that question, but does answer mine.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]
vectorizer = TfidfVectorizer(min_df=0, max_df=0.5, stop_words="english", decode_error="ignore", ngram_range=(1, 3))
vec = vectorizer.fit(list1)        # train vec using list1
vectorized = vec.transform(list1)  # transform list1 using vec
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, verbose=0, random_state=None)
km.fit(vectorized)
list2Vec = vec.transform(list2)    # transform list2 using vec
km.predict(list2Vec)
The credit goes to @IrshadBhat.
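Applied back to my original example, the key point is to fit the vectorizer once on documents_lst and then only transform the new text; a quick sketch along the same lines:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)      # fit once on the training strings
km = KMeans(n_clusters=3, init='k-means++').fit(tfidf_matrix)
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
new_vec = tfidf_vectorizer.transform(predict_doc)                 # transform only - no re-fitting
print(km.predict(new_vec))                                        # cluster label for the new string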
