IndicBART seq2seq model gives entirely blank predictions

I am using IndicBART for sequence-to-sequence prediction on Tamil sentences.
I trained the model on 100,000 samples of Tamil data for 30 epochs, then tested predictions on a few sentences. Below are the model's predictions; I have included both the token IDs output by the model and the strings decoded from those token IDs:
source: 'bullet es'
predicted: ''
predicted token IDs: [64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 2]
gold: 'bullet e s'
source: 'இங்க பழச த்துலேந்து பேசுறங்க'
predicted: ''
predicted token IDs: [64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 64000, 2]
gold: 'இங்க palladam துல இருந்து பேசறங்க'
And so on. All the predictions are blank lines.
The untrained IndicBART model does not produce blank predictions; only the trained checkpoint does.
Is this a known behavior of IndicBART, or am I doing something wrong?
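One quick diagnostic (a sketch, assuming a Hugging Face tokenizer loaded for IndicBART and named tokenizer; not from the original post) is to decode the predicted IDs without dropping special tokens. If the strings are empty only with skip_special_tokens=True, the model has collapsed onto a special token, which the repeated ID 64000 followed by what is likely the end-of-sequence ID 2 suggests:

ids = [64000, 64000, 64000, 2]
print(tokenizer.convert_ids_to_tokens(ids))              # show the raw tokens
print(tokenizer.decode(ids, skip_special_tokens=False))  # keep special tokens visible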

Related

Optimize classifier for multiclass Brier score instead of accuracy

I am more interested in optimizing my multiclass problem with Brier score instead of accuracy. To achieve that, I am evaluating my classifiers with the results of predict_proba() like:
import numpy as np

probs = np.array(
    [[1, 0, 0],
     [0, 1, 0],
     [1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 0, 0],
     [0, 0, 1],
     [0, 0, 1]]
)
targets = np.array(
    [[0.9, 0.05, 0.05],
     [0.1, 0.8, 0.1],
     [0.7, 0.2, 0.1],
     [0.1, 0.9, 0],
     [0, 0, 1],
     [0.5, 0.3, 0.2],
     [0.1, 0.5, 0.4],
     [0.34, 0.33, 0.33]]
)

def brier_multi(targets, probs):
    return np.mean(np.sum((probs - targets) ** 2, axis=1))

brier_multi(targets, probs)
Is it possible to optimize scikit-learn's classifiers directly during training for the multiclass Brier score instead of accuracy?
Edit:
...
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("selector", None),
        ("classifier", model.get("classifier")),
    ]
)

def brier_multi(targets, probs):
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))

brier_multi_loss = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

search = GridSearchCV(
    estimator=pipe,
    param_grid=model.get("param_grid"),
    scoring=brier_multi_loss,
    cv=3,
    n_jobs=-1,
    refit=True,
    verbose=3,
)

search.fit(X_train, y_train)
...
This leads to nan as the score:
/home/andreas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py:969: UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan]
warnings.warn(
You're already aware of the scoring parameter, so you just need to wrap your brier_multi into the format expected by GridSearchCV. There's a utility for that, make_scorer:
from sklearn.metrics import make_scorer

neg_mc_brier_score = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

GridSearchCV(..., scoring=neg_mc_brier_score)
See the User Guide and the docs for make_scorer.
Unfortunately, that won't run, because your version of the scorer expects a one-hot-encoded targets array, whereas sklearn multiclass will send y_true as a 1d array. As a hack to make sure the rest works, you can modify:
def brier_multi(targets, probs):
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))
but I would encourage you to make this more robust (what if the classes aren't just 0, 1, ..., n_classes-1?).
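For instance, here is a sketch (my own, not from the original answer) that builds the one-hot matrix in plain NumPy from whatever labels actually appear; note that a CV fold may still be missing a class, so in practice you may want to fix the full class list up front:

import numpy as np

def brier_multi(targets, probs):
    # derive the class set from the targets instead of assuming 0..n_classes-1;
    # np.unique is sorted, matching predict_proba's column order when all
    # classes are present in the fold
    classes = np.unique(targets)
    ohe = np.zeros_like(probs)
    ohe[np.arange(len(targets)), np.searchsorted(classes, targets)] = 1.0
    return np.mean(np.sum((probs - ohe) ** 2, axis=1))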
For what it's worth, sklearn has a PR in progress to add multiclass Brier score: https://github.com/scikit-learn/scikit-learn/pull/22046 (be sure to see the linked PR18699, as it has the beginning of development and review).

Pytorch, sample given batch logits

Given logits like
# each row is a record of data
logits = np.array([ [0.1, 0.3, 0.5], [0.3, 0.1, 0.5], [0.1, 0.3, 0.0] ])
How can I use PyTorch to sample an index from the logits of each row? The current distribution APIs do not seem to support such a function.
What I want is, for example
distribution = Categorical(logits=logits)
labels = distribution.sample(dim=1)
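For what it's worth, torch.distributions.Categorical already treats the last dimension of logits as the event dimension and all leading dimensions as a batch, so no dim argument is needed. A minimal sketch:

import torch
from torch.distributions import Categorical

logits = torch.tensor([[0.1, 0.3, 0.5],
                       [0.3, 0.1, 0.5],
                       [0.1, 0.3, 0.0]])

distribution = Categorical(logits=logits)  # batch_shape (3,), 3 categories each
labels = distribution.sample()             # shape (3,): one sampled index per row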

Why doesn't this simple neural network converge for XOR?

The code for the network below works okay, but it's too slow. This site implies that the network should get 99% accuracy after 100 epochs with a learning rate of 0.2, while my network never gets past 97% even after 1900 epochs.
Epoch 0, Inputs [0 0], Outputs [-0.83054376], Targets [0]
Epoch 100, Inputs [0 1], Outputs [ 0.72563824], Targets [1]
Epoch 200, Inputs [1 0], Outputs [ 0.87570863], Targets [1]
Epoch 300, Inputs [0 1], Outputs [ 0.90996706], Targets [1]
Epoch 400, Inputs [1 1], Outputs [ 0.00204791], Targets [0]
Epoch 500, Inputs [0 1], Outputs [ 0.93396672], Targets [1]
Epoch 600, Inputs [0 0], Outputs [ 0.00006375], Targets [0]
Epoch 700, Inputs [0 1], Outputs [ 0.94778227], Targets [1]
Epoch 800, Inputs [1 1], Outputs [-0.00149935], Targets [0]
Epoch 900, Inputs [0 0], Outputs [-0.00122716], Targets [0]
Epoch 1000, Inputs [0 0], Outputs [ 0.00457281], Targets [0]
Epoch 1100, Inputs [0 1], Outputs [ 0.95921556], Targets [1]
Epoch 1200, Inputs [0 1], Outputs [ 0.96001748], Targets [1]
Epoch 1300, Inputs [1 0], Outputs [ 0.96071742], Targets [1]
Epoch 1400, Inputs [1 1], Outputs [ 0.00110912], Targets [0]
Epoch 1500, Inputs [0 0], Outputs [-0.00012382], Targets [0]
Epoch 1600, Inputs [1 0], Outputs [ 0.9640324], Targets [1]
Epoch 1700, Inputs [1 0], Outputs [ 0.96431516], Targets [1]
Epoch 1800, Inputs [0 1], Outputs [ 0.97004973], Targets [1]
Epoch 1900, Inputs [1 0], Outputs [ 0.96616225], Targets [1]
The dataset I'm using is:
0 0 0
1 0 1
0 1 1
1 1 1
The training set is read using a function in a helper file, but that isn't relevant to the network.
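For completeness, a hypothetical sketch of what helper might contain, reconstructed purely from how it is used below; the real file may differ:

# hypothetical helper.py, inferred from usage
import numpy as np

def tanh(x):
    return np.tanh(x)

def dtanh(y):
    # derivative of tanh written in terms of its output y = tanh(x),
    # matching how the network applies it to stored activations
    return 1.0 - y ** 2

def readInput(file_name, input_size, output_size):
    data = np.loadtxt(file_name)
    return data[:, :input_size], data[:, input_size:input_size + output_size]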
import numpy as np
import helper

FILE_NAME = 'data.txt'
EPOCHS = 2000
TESTING_FREQ = 5
LEARNING_RATE = 0.2

INPUT_SIZE = 2
HIDDEN_LAYERS = [5]
OUTPUT_SIZE = 1


class Classifier:
    def __init__(self, layer_sizes):
        np.set_printoptions(suppress=True)

        self.activ = helper.tanh
        self.dactiv = helper.dtanh

        network = list()
        for i in range(1, len(layer_sizes)):
            layer = dict()
            layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1])
            layer['biases'] = np.random.randn(layer_sizes[i])
            network.append(layer)

        self.network = network

    def forward_propagate(self, x):
        for i in range(0, len(self.network)):
            self.network[i]['outputs'] = self.network[i]['weights'].dot(x) + self.network[i]['biases']
            if i != len(self.network)-1:
                self.network[i]['outputs'] = x = self.activ(self.network[i]['outputs'])
            else:
                self.network[i]['outputs'] = self.activ(self.network[i]['outputs'])
        return self.network[-1]['outputs']

    def backpropagate_error(self, x, targets):
        self.forward_propagate(x)
        self.network[-1]['deltas'] = (self.network[-1]['outputs'] - targets) * self.dactiv(self.network[-1]['outputs'])
        for i in reversed(range(len(self.network)-1)):
            self.network[i]['deltas'] = self.network[i+1]['deltas'].dot(self.network[i+1]['weights'] * self.dactiv(self.network[i]['outputs']))

    def adjust_weights(self, inputs, learning_rate):
        self.network[0]['weights'] -= learning_rate * np.atleast_2d(self.network[0]['deltas']).T.dot(np.atleast_2d(inputs))
        self.network[0]['biases'] -= learning_rate * self.network[0]['deltas']
        for i in range(1, len(self.network)):
            self.network[i]['weights'] -= learning_rate * np.atleast_2d(self.network[i]['deltas']).T.dot(np.atleast_2d(self.network[i-1]['outputs']))
            self.network[i]['biases'] -= learning_rate * self.network[i]['deltas']

    def train(self, inputs, targets, epochs, testfreq, lrate):
        for epoch in range(epochs):
            i = np.random.randint(0, len(inputs))
            if epoch % testfreq == 0:
                predictions = self.forward_propagate(inputs[i])
                print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
            self.backpropagate_error(inputs[i], targets[i])
            self.adjust_weights(inputs[i], lrate)


inputs, outputs = helper.readInput(FILE_NAME, INPUT_SIZE, OUTPUT_SIZE)
print('Input data: {0}'.format(inputs))
print('Output targets: {0}\n'.format(outputs))

np.random.seed(1)
nn = Classifier([INPUT_SIZE] + HIDDEN_LAYERS + [OUTPUT_SIZE])
nn.train(inputs, outputs, EPOCHS, TESTING_FREQ, LEARNING_RATE)
The main bug is that you are doing the forward pass only 20% of the time, i.e. when epoch % testfreq == 0:
for epoch in range(epochs):
    i = np.random.randint(0, len(inputs))
    if epoch % testfreq == 0:
        predictions = self.forward_propagate(inputs[i])
        print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
    self.backpropagate_error(inputs[i], targets[i])
    self.adjust_weights(inputs[i], lrate)
When I take predictions = self.forward_propagate(inputs[i]) out of the if, I get much better results faster:
Epoch 100, Inputs [0 1], Outputs [ 0.80317447], Targets 1
Epoch 105, Inputs [1 1], Outputs [ 0.96340466], Targets 1
Epoch 110, Inputs [1 1], Outputs [ 0.96057278], Targets 1
Epoch 115, Inputs [1 0], Outputs [ 0.87960599], Targets 1
Epoch 120, Inputs [1 1], Outputs [ 0.97725825], Targets 1
Epoch 125, Inputs [1 0], Outputs [ 0.89433666], Targets 1
Epoch 130, Inputs [0 0], Outputs [ 0.03539024], Targets 0
Epoch 135, Inputs [0 1], Outputs [ 0.92888141], Targets 1
Also, note that the term epoch usually means a single pass over all of your training data, which in your case is 4 samples. So, in fact, you are doing 4 times fewer epochs than the printed count suggests.
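For illustration, a sketch (not the original code) of a training loop where epoch keeps its usual meaning of one full pass over the data:

for epoch in range(epochs):
    # visit every sample once per epoch, in a random order
    for i in np.random.permutation(len(inputs)):
        self.backpropagate_error(inputs[i], targets[i])
        self.adjust_weights(inputs[i], lrate)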
Update
I didn't pay attention to the details at first and, as a result, missed a few subtle yet important points:
the training data in the question represents OR, not XOR, so my results above are for learning the OR operation;
the backward pass executes the forward pass as well (so it's not a bug, rather a surprising implementation detail).
Knowing this, I updated the data and checked the script once again. Running the training for 10000 iterations gave an average error of ~0.001, so the model is learning, just not as fast as it could.
A simple neural network (without an embedded normalization mechanism) is quite sensitive to particular hyperparameters, such as the weight initialization and the learning rate. I tried various values manually, and here's what I got:
# slightly bigger learning rate
LEARNING_RATE = 0.3
...
# slightly bigger init variation of weights
layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1]) * 2.0
This gives the following performance:
...
Epoch 960, Inputs [1 1], Outputs [ 0.01392014], Targets 0
Epoch 970, Inputs [0 0], Outputs [ 0.04342895], Targets 0
Epoch 980, Inputs [1 0], Outputs [ 0.96471654], Targets 1
Epoch 990, Inputs [1 1], Outputs [ 0.00084511], Targets 0
Epoch 1000, Inputs [0 0], Outputs [ 0.01585915], Targets 0
Epoch 1010, Inputs [1 1], Outputs [-0.004097], Targets 0
Epoch 1020, Inputs [1 1], Outputs [ 0.01898956], Targets 0
Epoch 1030, Inputs [0 0], Outputs [ 0.01254217], Targets 0
Epoch 1040, Inputs [1 1], Outputs [ 0.01429213], Targets 0
Epoch 1050, Inputs [0 1], Outputs [ 0.98293925], Targets 1
...
Epoch 1920, Inputs [1 1], Outputs [-0.00043072], Targets 0
Epoch 1930, Inputs [0 1], Outputs [ 0.98544288], Targets 1
Epoch 1940, Inputs [1 0], Outputs [ 0.97682002], Targets 1
Epoch 1950, Inputs [1 0], Outputs [ 0.97684186], Targets 1
Epoch 1960, Inputs [0 0], Outputs [-0.00141565], Targets 0
Epoch 1970, Inputs [0 0], Outputs [-0.00097559], Targets 0
Epoch 1980, Inputs [0 1], Outputs [ 0.98548381], Targets 1
Epoch 1990, Inputs [1 0], Outputs [ 0.97721286], Targets 1
The average accuracy is close to 98.5% after 1000 iterations and 99.1% after 2000 iterations. That's a bit slower than promised, but good enough. I'm sure it can be tuned further, but that's not the goal of this toy exercise. After all, tanh is not the best activation function, and classification problems are better solved with cross-entropy loss rather than L2 loss. So I wouldn't worry too much about the performance of this particular network; move on to logistic regression, which will definitely learn faster.
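To illustrate that last point with a sketch (my own, not part of the script above): with a sigmoid output and cross-entropy loss, the output-layer delta loses the activation-derivative factor that makes tanh + L2 gradients vanish near saturation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# tanh + L2: the delta carries a (1 - output**2) factor that shrinks
# toward zero as the unit saturates, which slows learning
def delta_l2_tanh(output, target):
    return (output - target) * (1.0 - output ** 2)

# sigmoid + cross-entropy: the derivative factor cancels exactly,
# so the gradient stays large while the prediction is wrong
def delta_xent_sigmoid(output, target):
    return output - target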

Make a dictionary of term frequency values

I am trying to write a function that takes a list of words (strings), counts how many times each specific word appears, and returns a dictionary mapping each word to the number of times it appears divided by the total number of words in the list (a term frequency vector).
def makeTermFrequencyVector(wordList):
    '''
    makeTermFrequencyVector takes a list of words as a parameter and returns
    a dictionary representing the term frequency vector of the word list,
    where words are keys and values are the frequency of occurrence of each
    word in the document.
    '''
    tfDict = {}
    for word in wordList:
        for i in range(len(wordList)):
            state = 0
            if wordList[i] == word:
                state += 1
            tfv = state / (len(wordList))
            tfDict[word] = tfv
    return tfDict
If I input:
makeTermFrequencyVector(['cat', 'dog'])
the output should be:
{'cat': 0.5, 'dog': 0.5}
because each word appears once in a list of total length 2.
However, this code returns a dictionary in which only the last word of the input list has the correct tf value; all other words' values are 0. So if I pass the above list to my current code, it returns:
{'dog': 0.5, 'cat': 0.0}
which is not correct.
How can I fix this so the value is computed correctly for each word in the list and not just the last one? I want to keep the fixed code as close to my current code as possible.
It's simpler if we make separate passes instead of nested passes through the words. On the first pass, we tally the word counts. On the second pass, we replace the counts with frequencies:
def makeTermFrequencyVector(wordList):
    '''
    Takes a list of words and returns a dictionary representing
    the term frequency vector of the word list, where words are
    keys and values are the frequency of occurrence.
    '''
    tfDict = dict()

    for word in wordList:
        tfDict[word] = tfDict.get(word, 0) + 1

    word_count = len(wordList)
    for word in tfDict:
        tfDict[word] /= word_count

    return tfDict

print(makeTermFrequencyVector(['cat', 'dog']))

word_list = [
    'Takes', 'a', 'list', 'of', 'words', 'as', 'its', 'sole', 'parameter',
    'and', 'returns', 'a', 'dictionary', 'representing', 'the', 'term',
    'frequency', 'vector', 'of', 'the', 'word', 'list,', 'where', 'words',
    'are', 'keys', 'and', 'values', 'are', 'the', 'frequency', 'of',
    'occurrence', 'of', 'each', 'word', 'in', 'the', 'source', 'document',
]

print(makeTermFrequencyVector(word_list))
OUTPUT
> python3 test.py
{'cat': 0.5, 'dog': 0.5}
{'Takes': 0.025, 'a': 0.05, 'list': 0.025, 'of': 0.1, 'words': 0.05, 'as': 0.025, 'its': 0.025, 'sole': 0.025, 'parameter': 0.025, 'and': 0.05, 'returns': 0.025, 'dictionary': 0.025, 'representing': 0.025, 'the': 0.1, 'term': 0.025, 'frequency': 0.05, 'vector': 0.025, 'word': 0.05, 'list,': 0.025, 'where': 0.025, 'are': 0.05, 'keys': 0.025, 'values': 0.025, 'occurrence': 0.025, 'each': 0.025, 'in': 0.025, 'source': 0.025, 'document': 0.025}
>
cdlane's two-pass approach is the way to go versus nested for loops. Each pass takes O(n) time, where n is the length of the list, so two passes take O(n) + O(n) = O(2n) time; the constant is dropped, giving an O(n) asymptotic run time, whereas the nested loops take O(n²).
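As an aside, the counting pass is exactly what collections.Counter provides, so the same O(n) approach can be sketched in a couple of lines:

from collections import Counter

def makeTermFrequencyVector(wordList):
    n = len(wordList)
    return {word: count / n for word, count in Counter(wordList).items()}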
Part of the reason your code isn't working is that state = 0 is placed in the inner loop, so on each iteration of that loop state is reset to 0 rather than simply being incremented. If you take the line state = 0 and move it out of the inner for loop, the logic should then work, as sketched below.
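A sketch of that minimal fix (state = 0 moved out of the inner loop; storing the frequency once after counting also tidies things up):

def makeTermFrequencyVector(wordList):
    tfDict = {}
    for word in wordList:
        state = 0  # reset once per word, not once per comparison
        for i in range(len(wordList)):
            if wordList[i] == word:
                state += 1
        tfDict[word] = state / len(wordList)
    return tfDict

print(makeTermFrequencyVector(['cat', 'dog']))  # {'cat': 0.5, 'dog': 0.5}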

Preparing Ordinal and Nominal Features for classification using OneHotEncoder in scikit-learn

I want to prepare a dataset that contains continuous, nominal and ordinal features for classification. I have some workaround below, but I am wondering if there is a better way using scikit-learn's encoders?
Let's consider the following example dataset:
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']
df
Now, the class labels can be simply converted by a label encoder (the classifier ignores order in the class labels).
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df['class label'] = class_le.fit_transform(df['class label'].values)
And I would convert the ordinal feature column size like so:
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}
df['size'] = df['size'].apply(lambda x: size_mapping[x])
df
And finally the nominal color feature:
color_mapping = {
    'green': [0, 0, 1],
    'red': [0, 1, 0],
    'blue': [1, 0, 0]}
df['color'] = df['color'].apply(lambda x: color_mapping[x])
df

import numpy as np  # needed for np.apply_along_axis below

y = df['class label'].values
X = df.iloc[:, :-1].values
# flatten the list-valued color column into separate numeric columns
X = np.apply_along_axis(func1d=lambda x: np.array(x[0] + list(x[1:])), axis=1, arr=X)
X
array([[ 0. ,  0. ,  1. ,  1. , 10.1],
       [ 0. ,  1. ,  0. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])
You can use DictVectorizer for the nominal encoding, which makes the process cleaner. Also, you can apply the size_mapping directly with .map():
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']

from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df['class label'] = class_le.fit_transform(df['class label'].values)

size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}
df['size'] = df['size'].map(size_mapping)

feats = df.transpose().to_dict().values()

from sklearn.feature_extraction import DictVectorizer
Dvec = DictVectorizer()
Dvec.fit_transform(feats).toarray()
returns:
array([[ 0. ,  0. ,  1. ,  0. , 10.1,  1. ],
       [ 1. ,  0. ,  0. ,  1. , 13.5,  2. ],
       [ 0. ,  1. ,  0. ,  0. , 15.3,  3. ]])
Get feature names:
Dvec.get_feature_names()
['class label', 'color=blue', 'color=green', 'color=red', 'prize', 'size']
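For what it's worth, on newer scikit-learn versions the same preparation can be sketched with ColumnTransformer, OneHotEncoder, and OrdinalEncoder. This is an illustrative alternative, not part of the original answer; note that OrdinalEncoder encodes M/L/XL as 0/1/2 rather than 1/2/3:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']],
                  columns=['color', 'size', 'prize', 'class label'])

y = LabelEncoder().fit_transform(df['class label'])

ct = ColumnTransformer(
    [('nominal', OneHotEncoder(), ['color']),
     ('ordinal', OrdinalEncoder(categories=[['M', 'L', 'XL']]), ['size'])],
    remainder='passthrough',  # 'prize' passes through unchanged
    sparse_threshold=0,       # always return a dense array
)
X = ct.fit_transform(df[['color', 'size', 'prize']])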
