I'm an sklearn dummy... I'm trying to predict the label for a given string from a RandomForestClassifier() fitted with text, labels.
It's obvious I don't know how to use predict() with a single string. The reason I'm using reshape() is because I got this error some time ago "Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
How can I predict the label of a single text string?
The script:
#!/usr/bin/env python
''' Read a txt file consisting of '<label>: <long string of text>'
to use as a model for predicting the label for a string
'''
from argparse import ArgumentParser
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
def main(args):
'''
args: Arguments obtained by _Get_Args()
'''
print('Loading data...')
# Load data from args.txtfile and split the lines into
# two lists (labels, texts).
data = open(args.txtfile).readlines()
labels, texts = ([], [])
for line in data:
label, text = line.split(': ', 1)
labels.append(label)
texts.append(text)
# Print a list of unique labels
print(json.dumps(list(set(labels)), indent=4))
# Instantiate a CountVectorizer class and git the texts
# and labels into it.
cv = CountVectorizer(
stop_words='english',
strip_accents='unicode',
lowercase=True,
)
matrix = cv.fit_transform(texts)
encoder = LabelEncoder()
labels = encoder.fit_transform(labels)
rf = RandomForestClassifier()
rf.fit(matrix, labels)
# Try to predict the label for args.string.
prediction = Predict_Label(args.string, cv, rf)
print(prediction)
def Predict_Label(string, cv, rf):
'''
string: str() - A string of text
cv: The CountVectorizer class
rf: The RandomForestClassifier class
'''
matrix = cv.fit_transform([string])
matrix = matrix.reshape(1, -1)
try:
prediction = rf.predict(matrix)
except Exception as E:
print(str(E))
else:
return prediction
def _Get_Args():
parser = ArgumentParser(description='Learn labels from text')
parser.add_argument('-t', '--txtfile', required=True)
parser.add_argument('-s', '--string', required=True)
return parser.parse_args()
if __name__ == '__main__':
args = _Get_Args()
main(args)
The actual learning data text file is 43663 lines long but a sample is in small_list.txt which consists of lines each in the format: <label>: <long text string>
The error is noted in the Exception output:
$ ./learn.py -t small_list.txt -s 'This is a string that might have something to do with phishing or fraud'
Loading data...
[
"Vulnerabilities__Unknown",
"Vulnerabilities__MSSQL Browsing Service",
"Fraud__Phishing",
"Fraud__Copyright/Trademark Infringement",
"Attacks and Reconnaissance__Web Attacks",
"Vulnerabilities__Vulnerable SMB",
"Internal Report__SBL Notify",
"Objectionable Content__Russian Federation Objectionable Material",
"Malicious Code/Traffic__Malicious URL",
"Spam__Marketing Spam",
"Attacks and Reconnaissance__Scanning",
"Malicious Code/Traffic__Unknown",
"Attacks and Reconnaissance__SSH Brute Force",
"Spam__URL in Spam",
"Vulnerabilities__Vulnerable Open Memcached",
"Malicious Code/Traffic__Sinkhole",
"Attacks and Reconnaissance__SMTP Brute Force",
"Illegal content__Child Pornography"
]
Number of features of the model must match the input. Model n_features is 2070 and input n_features is 3
None
You need to get the vocabulary of the first CountVectorizer (cv) and use to transform the new single text before predict.
...
cv = CountVectorizer(
stop_words='english',
strip_accents='unicode',
lowercase=True,
)
matrix = cv.fit_transform(texts)
encoder = LabelEncoder()
labels = encoder.fit_transform(labels)
rf = RandomForestClassifier()
rf.fit(matrix, labels)
# Try to predict the label for args.string.
cv_new = CountVectorizer(
stop_words='english',
strip_accents='unicode',
lowercase=True,
vocabulary=cv.vocabulary_
)
prediction = Predict_Label(args.string, cv_new, rf)
print(prediction)
...
Related
I am really struggling to make things work with the new spacy v3 version. The documentation is full. However, I am trying to run a training loop in a script.
(I am also not able to perform text classification training with CLI approach).
Data are publically available here.
import pandas as pd
from spacy.training import Example
import random
TRAIN_DATA = pd.read_json('data.jsonl', lines = True)
nlp = spacy.load('en_core_web_sm')
config = {
"threshold": 0.5,
}
textcat = nlp.add_pipe("textcat", config=config, last=True)
label = TRAIN_DATA['label'].unique()
for label in label:
textcat.add_label(str(label))
nlp = spacy.blank("en")
nlp.begin_training()
# Loop for 10 iterations
for itn in range(100):
# Shuffle the training data
losses = {}
TRAIN_DATA = TRAIN_DATA.sample(frac = 1)
# Batch the examples and iterate over them
for batch in spacy.util.minibatch(TRAIN_DATA.values, size=4):
texts = [nlp.make_doc(text) for text, entities in batch]
annotations = [{"cats": entities} for text, entities in batch]
# uses an example object rather than text/annotation tuple
print(texts)
print(annotations)
examples = [Example.from_dict(a)]
nlp.update(examples, losses=losses)
if itn % 20 == 0:
print(losses)
I am trying to find the best c parameter following the instructions to a task that asks me to ' Define a function, fit_generative_model, that takes as input a training set (train_data, train_labels) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9, where
pi[j]: the frequency of that label
mu[j]: the 784-dimensional mean vector
sigma[j]: the 784x784 covariance matrix
It is important to regularize these matrices. The standard way of doing this is to add cI to them, where c is some constant and I is the 784-dimensional identity matrix. c is now a parameter, and by setting it appropriately, we can improve the performance of the model.
%matplotlib inline
import sys
import matplotlib.pyplot as plt
import gzip, os
import numpy as np
from scipy.stats import multivariate_normal
if sys.version_info[0] == 2:
from urllib import urlretrieve
else:
from urllib.request import urlretrieve
# Downloads the dataset
def download(filename, source='http://yann.lecun.com/exdb/mnist/'):
print("Downloading %s" % filename)
urlretrieve(source + filename, filename)
# Invokes download() if necessary, then reads in images
def load_mnist_images(filename):
if not os.path.exists(filename):
download(filename)
with gzip.open(filename, 'rb') as f:
data = np.frombuffer(f.read(), np.uint8, offset=16)
data = data.reshape(-1,784)
return data
def load_mnist_labels(filename):
if not os.path.exists(filename):
download(filename)
with gzip.open(filename, 'rb') as f:
data = np.frombuffer(f.read(), np.uint8, offset=8)
return data
## Load the training set
train_data = load_mnist_images('train-images-idx3-ubyte.gz')
train_labels = load_mnist_labels('train-labels-idx1-ubyte.gz')
## Load the testing set
test_data = load_mnist_images('t10k-images-idx3-ubyte.gz')
test_labels = load_mnist_labels('t10k-labels-idx1-ubyte.gz')
train_data.shape, train_labels.shape
So I have written this code for three different C-values. they each give me the same error?
def fit_generative_model(x,y):
lst=[]
for c in [20,200, 4000]:
k = 10 # labels 0,1,...,k-1
d = (x.shape)[1] # number of features
mu = np.zeros((k,d))
sigma = np.zeros((k,d,d))
pi = np.zeros(k)
for label in range(0,k):
indices = (y == label)
mu[label] = np.mean(x[indices,:], axis=0)
sigma[label] = np.cov(x[indices,:], rowvar=0, bias=1) + c*np.identity(784) # I define the identity matrix
predictions = np.argmax(score, axis=1)
errors = np.sum(predictions != y)
lst.append(errors)
print(c,"Model makes " + str(errors) + " errors out of 10000", lst)
Then I fit it to the training data and get these same errors:
mu, sigma, pi = fit_generative_model(train_data, train_labels)
20 Model makes 1 errors out of 10000 [1]
200 Model makes 1 errors out of 10000 [1, 1]
4000 Model makes 1 errors out of 10000 [1, 1, 1]
and to the test data:
mu, sigma, pi = fit_generative_model(test_data, test_labels)
20 Model makes 9020 errors out of 10000 [9020]
200 Model makes 9020 errors out of 10000 [9020, 9020]
4000 Model makes 9020 errors out of 10000 [9020, 9020, 9020]
What is it I'm doing wrong? the correct answer is c=4000 which yields an error of ~4.3%.
I was trying out the NER tutorial Token Classification with W-NUT Emerging Entities (https://huggingface.co/transformers/custom_datasets.html#tok-ner) in google colab using the Annotated Corpus for Named Entity Recognition data on Kaggle (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv).
I will outline my process in detail to facilitate an understanding of what I was doing and to let the community help me figure out the source of the indexing assignment error.
To load the data from google drive where I have saved it, I used the following code
# import pandas library
import pandas as pd
# columns to select
cols_to_select = ["Sentence #", "Word", "Tag"]
# google drive data path
data_path = '/content/drive/MyDrive/Colab Notebooks/ner/ner_dataset.csv'
# load the data from google colab
dataset = pd.read_csv(data_path, encoding="latin-1")[cols_to_select].fillna(method = 'ffill')
I run the following code to parse the sentences and tags
class SentenceGetter(object):
def __init__(self, data):
self.n_sent = 1
self.data = data
self.empty = False
agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
s["Tag"].values.tolist())]
self.grouped = self.data.groupby("Sentence #").apply(agg_func)
self.sentences = [s for s in self.grouped]
def retrieve(self):
try:
s = self.grouped["Sentence: {}".format(self.n_sent)]
self.n_sent += 1
return s
except:
return None
# get full data
getter = SentenceGetter(dataset)
# get sentences
sentences = [[s[0] for s in sent] for sent in getter.sentences]
# get tags/labels
tags = [[s[1] for s in sent] for sent in getter.sentences]
# take a look at the data
print(sentences[0][0:5], tags[0][0:5], sep='\n')
I then split the data into train, val, and test sets
# import the sklearn module
from sklearn.model_selection import train_test_split
# split data in to temp and test sets
temp_texts, test_texts, temp_tags, test_tags = train_test_split(sentences,
tags,
test_size=0.20,
random_state=15)
# split data into train and validation sets
train_texts, val_texts, train_tags, val_tags = train_test_split(temp_texts,
temp_tags,
test_size=0.20,
random_state=15)
After splitting the data, I created encodings for tags and the tokens
unique_tags=dataset.Tag.unique()
# create tags to id
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
# create id to tags
id2tag = {id: tag for tag, id in tag2id.items()}
I then installed the transformer library in colab
# install the transformers library
! pip install transformers
Next I imported the small bert model
# import the transformers module
from transformers import BertTokenizerFast
# import the small bert model
model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
I then created the encodings for the tokens
# create train set encodings
train_encodings = tokenizer(train_texts,
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
max_length=128,
truncation=True)
# create validation set encodings
val_encodings = tokenizer(val_texts,
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
max_length=128,
truncation=True)
# create test set encodings
test_encodings = tokenizer(test_texts,
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
max_length=128,
truncation=True)
In the tutorial, it uses offset-mapping to handle the problem that arise with word-piece tokenization, specifically, the mismatch between tokens and labels. It is when running the offset-mapping code in the tutorial that I get the error. Below is the offset mapping function used in the tutorial:
# the offset function
import numpy as np
def encode_tags(tags, encodings):
labels = [[tag2id[tag] for tag in doc] for doc in tags]
encoded_labels = []
for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
# create an empty array of -100
doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
arr_offset = np.array(doc_offset)
# set labels whose first offset position is 0 and the second is not 0
doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
encoded_labels.append(doc_enc_labels.tolist())
return encoded_labels
# return the encoded labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
test_labels = encode_tags(test_tags, test_encodings)
After running the above code, it gives me the following error, and I can't figure out where the source of the error lies. Any help and pointers would be appreciated.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-afdff0186eb3> in <module>()
17
18 # return the encoded labels
---> 19 train_labels = encode_tags(train_tags, train_encodings)
20 val_labels = encode_tags(val_tags, val_encodings)
21 test_labels = encode_tags(test_tags, test_encodings)
<ipython-input-19-afdff0186eb3> in encode_tags(tags, encodings)
11
12 # set labels whose first offset position is 0 and the second is not 0
---> 13 doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
14 encoded_labels.append(doc_enc_labels.tolist())
15
ValueError: NumPy boolean array indexing assignment cannot assign 38 input values to the 37 output values where the mask is true
Looks like the previous paradigm of declaring Fields, Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I don't seem to be able to find an example of the new paradigm for custom datasets (as in, not the ones included in torch.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?
Reference for deprecation:
https://github.com/pytorch/text/releases
It took me a little while to find the solution myself. The new paradigm is like so for prebuilt datasets:
from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)
or like so for custom built datasets:
from torch.utils.data import DataLoader
def collate_fn(batch):
texts, labels = [], []
for label, txt in batch:
texts.append(txt)
labels.append(label)
return texts, labels
dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
print(idx, texts, labels)
I've copied the examples from the Source
Browsing through torchtext's GitHub repo I stumbled over the README in the legacy directory, which is not documented in the official docs. The README links a GitHub issue that explains the rationale behind the change as well as a migration guide.
If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:
# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset
Alternatively, you can import the whole torchtext.legacy module as torchtext as suggested by the README:
import torchtext.legacy as torchtext
There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses the TextClassificationDataset along with the collator and other preprocessing. It reads a txt file and builds a dataset, followed by a model. Inside the post, there is a link to a complete working notebook. The post is at: https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. But you need the 'dev' (or nightly build) of PyTorch for it to work.
From the link above:
After tokenization and building vocabulary, you can build the dataset as follows
def data_to_dataset(data, tokenizer, vocab):
data = [(text, label) for (text, label) in data]
text_transform = sequential_transforms(tokenizer.tokenize,
vocab_func(vocab),
totensor(dtype=torch.long)
)
label_transform = sequential_transforms(lambda x: 1 if x =='1' else (0 if x =='0' else x),
totensor(dtype=torch.long)
)
transforms = (text_transform, label_transform)
dataset = TextClassificationDataset(data, vocab, transforms)
return dataset
The collator is as follows:
def __init__(self, pad_idx):
self.pad_idx = pad_idx
def collate(self, batch):
text, labels = zip(*batch)
labels = torch.LongTensor(labels)
text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
return text, labels
Then, you can build the dataloader with the typical torch.utils.data.DataLoader using the collate_fn argument.
Well it seems like pipeline could be like that:
import torchtext as TT
import torch
from collections import Counter
from torchtext.vocab import Vocab
# read the data
with open('text_data.txt','r') as f:
data = f.readlines()
with open('labels.txt', 'r') as f:
labels = f.readlines()
tokenizer = TT.data.utils.get_tokenizer('spacy', 'en') # can remove 'spacy' and use a simple built-in tokenizer
train_iter = zip(labels, data)
counter = Counter()
for (label, line) in train_iter:
counter.update(tokenizer(line))
vocab = TT.vocab.Vocab(counter, min_freq=1)
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# this is data-specific - adapt for your data
label_pipeline = lambda x: 1 if x == 'positive\n' else 0
class TextData(torch.utils.data.Dataset):
'''
very basic dataset for processing text data
'''
def __init__(self, labels, text):
super(TextData, self).__init__()
self.labels = labels
self.text = text
def __getitem__(self, index):
return self.labels[index], self.text[index]
def __len__(self):
return len(self.labels)
def tokenize_batch(batch, max_len=200):
'''
tokenizer to use in DataLoader
takes a text batch of text dataset and produces a tensor batch, converting text and labels though tokenizer, labeler
tokenizer is a global function text_pipeline
labeler is a global function label_pipeline
max_len is a fixed len size, if text is less than max_len it is padded with ones (pad number)
if text is larger that max_len it is truncated but from the end of the string
'''
labels_list, text_list = [], []
for _label, _text in batch:
labels_list.append(label_pipeline(_label))
text_holder = torch.ones(max_len, dtype=torch.int32) # fixed size tensor of max_len
processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
pos = min(200, len(processed_text))
text_holder[-pos:] = processed_text[-pos:]
text_list.append(text_holder.unsqueeze(dim=0))
return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)
train_dataset = TextData(labels, data)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)
lbl, txt = iter(train_loader).next()
I have created a Gaussian Naive Bayes classifier on a email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided in it train and test sets and then calculated the accuracy, all the features that are present in the sklearn-Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are by spam or not.
For example say I have an email. I want to feed it to my classifier and get the prediction as to whether it is a spam or not. How can I achieve this? Please Help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features
for the training and testing datasets, respectively### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics
for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words(features) and label_data(labels), already largely processed.###These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set(the### remainder go into training)### feature matrices changed to dense representations
for compatibility with### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
""
"
Starter code to process the texts of accuate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"
""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are### thousands of lines of accurate and inaccurate text, so running over all of them### can take a long time### temp_counter helps you only look at the first 200 lines in the list so you### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
for path in from_text: ###only look at first 200 texts when developing### once everything is working, remove this line to run over full dataset
temp_counter = 1
if temp_counter < 200:
path = os.path.join('..', path[: -1])
print(path)
text = open(path, "r")
line = text.readline()
while line: ###use a
function parseOutText to extract the text from the opened text# stem_text = parseOutText(text)
stem_text = text.readline().strip()
print(stem_text)### use str.replace() to remove any instances of the words# stem_text = stem_text.replace("germani", "")### append the text to feature_data
feature_data.append(stem_text)### append a 0 to label_data
if text is from Sara, and 1
if text is from Chris
if (name == "accurate"):
label_data.append("0")
elif(name == "inaccurate"):
label_data.append("1")
line = text.readline()
text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also I want to know whether i can incrementally train the classifier meaning thereby that retrain a created model with newer data for refining the model over time?
I would be really glad if someone can help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print pred.
But perhaps you what to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print pred if you want to view the new label(s).
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.