How can I add a feature using torchtext?

torchtext is able to read a file with several columns, each one corresponding to a field. What if I want to create a new column (which I will use as a feature)? For example, imagine the file has two columns, text and target, and I want to extract some information from the text and generate a new feature (e.g. whether it contains certain words). Can I do this directly with torchtext, or do I need to add it to the file beforehand?
Thanks!

It can be done.
import string
import pandas as pd
import torch
import torchtext

def postprocessing(arr, vocab, pad_token):
    # pad every feature sequence in the batch to the longest one
    max_len = max([len(a) for a in arr])
    l = []
    for a in arr:
        res = max_len - len(a)
        if res > 0:
            a.extend([[pad_token] * len(a[0])] * res)
        l.append(a)
    return l
def featurization(text_list):
    # create character-level features; text_list is a list of characters
    features = []
    for ch in text_list:
        l = []
        l.append(1 if ch.isupper() else 0)
        l.append(1 if ch in string.digits else 0)
        l.append(1 if ch in string.punctuation else 0)
        features.append(l)
    return features
temp_data = pd.read_csv("../data/processed/data.csv")
The step below keeps only the columns we want to process; note that the column order matters.
temp_data.loc[:, ["text", "label"]].to_csv("temp.csv", index=False)
Create the TEXT, FEAT, and LABELS fields. Here I am tokenizing each sentence into characters.
TEXT = torchtext.data.Field(sequential=True, use_vocab=True,
                            tokenize=lambda x: list(x), include_lengths=True,
                            batch_first=True)
LABEL_PAD_TOKEN = -1
FEAT = torchtext.data.LabelField(use_vocab=False, batch_first=True, preprocessing=featurization,
                                 pad_token=None,
                                 postprocessing=lambda x, _: postprocessing(x, _, LABEL_PAD_TOKEN))
LABELS = torchtext.data.Field(use_vocab=False, pad_token=LABEL_PAD_TOKEN, unk_token=None,
                              batch_first=True, dtype=torch.int64, tokenize=lambda x: list(x),
                              preprocessing=lambda x: [eval(i) for i in x])
In the TabularDataset, the fields must be listed in the same order as the columns of temp.csv. The tuple (("text", "feat"), (TEXT, FEAT)) feeds the single text column to both the TEXT and FEAT fields.
train_data = torchtext.data.TabularDataset(path="temp.csv", format="csv", skip_header=True,
                                           fields=[(("text", "feat"), (TEXT, FEAT)),
                                                   ("labels", LABELS)])
TEXT.build_vocab(train_data)
train_data, valid_data = train_data.split()  # create train/val split
Build the iterator (device is the torch.device on which batches are created).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iter, valid_iter = torchtext.data.BucketIterator.splits(
    (train_data, valid_data), batch_size=2, device=device,
    sort_within_batch=True, sort_key=lambda x: len(x.text))
a = next(iter(train_iter))
a.feat.shape, a.text[0].shape # printing the shape
(torch.Size([2, 36, 3]), torch.Size([2, 36]))
Next, pass the text to an embedding layer, whose input is [batch_size, seq_len] and whose output is [batch_size, seq_len, emb_dim].
The features have shape [batch_size, seq_len, 3] because we built 3 features per character.
Concatenate the two along the last dimension, giving [batch_size, seq_len, emb_dim + 3], and pass the result to an LSTM or a CNN, as sketched below.
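A minimal sketch of that last step, using the batch a from above; the embedding layer and emb_dim = 8 are illustrative assumptions, not part of the answer:

import torch
import torch.nn as nn

emb_dim = 8  # illustrative; any embedding size works
emb = nn.Embedding(num_embeddings=len(TEXT.vocab), embedding_dim=emb_dim)

text, lengths = a.text                  # text: [batch_size, seq_len]
embedded = emb(text)                    # [batch_size, seq_len, emb_dim]
combined = torch.cat([embedded, a.feat.float()], dim=-1)
# combined: [batch_size, seq_len, emb_dim + 3], ready for an LSTM/CNN with batch_first=True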

Related

How to efficiently batch-process in Hugging Face?

I am using the Hugging Face transformers library to find whether a sentence is well-formed or not, with a masked language model called XLM-R. I first tokenize my sentence, then mask each word of the sentence one by one, and then process the masked sentences and find the probability that the predicted masked word is right.
import copy
import math
import torch
from tqdm import tqdm

def calculate_scores(sent, model, tokenizer, device, print_pred=False, maskval=False):
    dic = {}
    ls = tokenizer.batch_encode_plus(sent)
    input_list = ls.input_ids
    with torch.no_grad():
        for i in tqdm(range(len(input_list))):
            item = input_list[i]
            real_input = item
            attmask = [1] * len(item)
            seg = [0] * len(item)
            seglist = [seg]
            masked_list = [real_input]
            attlist = [attmask]
            # mask each position (except the special tokens) one at a time
            for j in range(1, len(item) - 1):
                masked_input = copy.deepcopy(real_input)
                masked_input[j] = 50264
                masked_list.append(masked_input)
                attlist.append(attmask)
                seglist.append(seg)
            inid = torch.tensor(masked_list).to(device)
            segtensor = torch.tensor(seglist).to(device)
            atttensor = torch.tensor(attlist).to(device)
            output = model(inid, segtensor)
            predictions_logits = output.logits
            predictions = torch.softmax(predictions_logits, dim=2)
            # row j of the batch is the sentence with position j masked
            ppscore = 0
            for j in range(1, len(item) - 1):
                ppscore = ppscore + math.log(predictions[j, j, item[j]], 2)
            try:
                score = math.pow(2, (-1 / (len(item) - 2)) * ppscore)
                dic[sent[i]] = score
            except:
                print(sent[i])
                dic[sent[i]] = 10000000
    return dic
I will explain my code quickly. calculate_scores takes sent, a list of sentences, as input. I first batch-encode the list, and then, for each encoded sentence, I generate copies in which exactly one token is masked. I feed these generated sentences to the model, read off the probability of the original token at each masked position, and compute the perplexity from those probabilities.
But this is not a good way of utilizing the GPU: each sentence is processed as its own batch. I want to process multiple sentences at once while still getting a perplexity score per sentence. How would I go about doing this?
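One possible direction (not from the original post): build all masked variants for all sentences up front, pad them to a common length, and run them through the model in fixed-size chunks, keeping a (sentence, position) record per row so the per-sentence perplexity can still be assembled. A minimal sketch, assuming a Hugging Face masked-LM model/tokenizer pair as in the question; the chunk size is arbitrary:

import copy
import math
import torch

def batched_scores(sent, model, tokenizer, device, chunk=64):
    mask_id = tokenizer.mask_token_id
    pad_id = tokenizer.pad_token_id
    enc = tokenizer.batch_encode_plus(sent)
    rows, meta = [], []  # meta[k] = (sentence index, masked position)
    for i, ids in enumerate(enc.input_ids):
        for j in range(1, len(ids) - 1):
            masked = copy.deepcopy(ids)
            masked[j] = mask_id
            rows.append(masked)
            meta.append((i, j))
    max_len = max(len(r) for r in rows)
    input_ids = torch.tensor([r + [pad_id] * (max_len - len(r)) for r in rows])
    att = (input_ids != pad_id).long()
    log_probs = [0.0] * len(sent)
    with torch.no_grad():
        for start in range(0, len(rows), chunk):
            batch = input_ids[start:start + chunk].to(device)
            mask = att[start:start + chunk].to(device)
            probs = torch.softmax(model(batch, attention_mask=mask).logits, dim=2)
            for k in range(batch.size(0)):
                i, j = meta[start + k]
                log_probs[i] += math.log(probs[k, j, enc.input_ids[i][j]].item(), 2)
    return {s: math.pow(2, (-1 / (len(enc.input_ids[i]) - 2)) * log_probs[i])
            for i, s in enumerate(sent)}

The chunks now mix masked variants from different sentences, so the GPU batch size stays constant regardless of sentence length.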

PyTorch seq2seq MT model: how do I get the translation results from the output tensor?

I am trying to implement my own MT engine, following the steps in https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb
SRC = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)
TRG = Field(tokenize=tokenize_de,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)
After training the model, the notebook only shows how to evaluate on batches, but I want to translate a single string and get the result; for example, I want my model to translate the input "Boys" and give me the German translation.
savedfilemodelpath = './pretrained_model/2020-09-27en-de.pth'
model.load_state_dict(torch.load(savedfilemodelpath))
model.eval()
inputstring = 'Boys'
processed = SRC.process([SRC.preprocess(inputstring)]).to(device)
output = model(processed, processed)
output_dim = output.shape[-1]
outputs = output[1:].view(-1, output_dim)
for item in outputs:
    print('item shape is {} and item.argmax is {}, and words is {}'.format(
        item.shape, item.argmax(), TRG.vocab.itos[item.argmax()]))
So my question is: is it right to get the translation results this way?
First: convert the string to a tensor.
inputstring = 'Boys'
processed = SRC.process([SRC.preprocess(inputstring)]).to(device)
Second: send the tensor to the model. Since the model takes a trg parameter, I have to pass a tensor; is there a way to avoid passing the TRG tensor?
output = model(processed, processed)
output_dim = output.shape[-1]
outputs = output[1:].view(-1, output_dim)
Third: from the returned tensor, do I use argmax to get the translation results? Is that right, or how can I get the correct translation?
for item in outputs:
    print('item shape is {} and item.argmax is {}, and words is {}'.format(
        item.shape, item.argmax(), TRG.vocab.itos[item.argmax() + 1]))
I got the answer from translate_sentence. Many thanks @Aladdin Persson.
def translate_sentence(model, sentence, SRC, TRG, device, max_length=50):
    # Create tokens using spacy, lower-cased (which is what our vocab is)
    if type(sentence) == str:
        tokens = [token.text.lower() for token in spacy_en(sentence)]
    else:
        tokens = [token.lower() for token in sentence]
    # Add <sos> and <eos> at the beginning and end respectively
    tokens.insert(0, SRC.init_token)
    tokens.append(SRC.eos_token)
    # Go through each English token and convert it to an index
    text_to_indices = [SRC.vocab.stoi[token] for token in tokens]
    # Convert to a tensor of shape [seq_len, 1]
    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)
    # Build encoder hidden and cell states
    with torch.no_grad():
        hidden, cell = model.encoder(sentence_tensor)
    outputs = [TRG.vocab.stoi["<sos>"]]
    for _ in range(max_length):
        previous_word = torch.LongTensor([outputs[-1]]).to(device)
        with torch.no_grad():
            output, hidden, cell = model.decoder(previous_word, hidden, cell)
        best_guess = output.argmax(1).item()
        outputs.append(best_guess)
        # Model predicts it's the end of the sentence
        if best_guess == TRG.vocab.stoi["<eos>"]:
            break
    translated_sentence = [TRG.vocab.itos[idx] for idx in outputs]
    # remove the start token
    return translated_sentence[1:]
Also note the translation is not generated all at once: the decoder produces one token at a time and is called several times in a loop.
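For completeness, a usage sketch under the same setup (spacy_en, the trained model, and the SRC/TRG fields are assumed to be in scope, as in the notebook):

model.load_state_dict(torch.load(savedfilemodelpath))
model.eval()
translation = translate_sentence(model, 'Boys', SRC, TRG, device)
print(' '.join(translation))  # German tokens, typically ending with <eos>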

Replace all indices in tensor within a range with 1s

def generate_mask(data: list, max_seq_len: int):
    """
    Generates a mask for data where each element is expected to be
    max_seq_len long after padding.
    Args:
        data : The data being forwarded through the LSTM after being converted to a tensor
        max_seq_len : The length of the names after being padded
    """
    batch_sz = len(data)
    ret = torch.zeros(1, batch_sz, max_seq_len, dtype=torch.bool)
    for i in range(batch_sz):
        name = data[i]
        for letter_idx in range(len(name)):
            ret[0][i][letter_idx] = 1
    return ret
I have this code for generating a mask, and I really dislike how I'm doing it. Essentially, as you can see, I'm just going through every name and setting each index from 0 to the name's length to 1. I'd prefer a more elegant way to do this.
Well, you can simplify it to something like this:
# [...]
for i in range(batch_sz):
    ret[0, i, :len(data[i])] = 1
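If you want to drop the Python loop entirely, the same boolean mask can be built with one broadcasted comparison (a sketch, assuming data holds variable-length sequences as in the question):

import torch

def generate_mask_vectorized(data, max_seq_len):
    lengths = torch.tensor([len(name) for name in data])   # [batch_sz]
    positions = torch.arange(max_seq_len)                  # [max_seq_len]
    # broadcast [max_seq_len] against [batch_sz, 1] -> [batch_sz, max_seq_len]
    return (positions[None, :] < lengths[:, None]).unsqueeze(0)  # [1, batch_sz, max_seq_len], bool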

How do I extract the x co-ordinate of a point using Python?

I'm trying to build an NMF model for topic extraction. For re-training the model, I have to pass a parameter to the NMF function, for which I need the x co-ordinate of a point that the algorithm returns; here is the code for reference:
no_features = 1000
no_topics = 9
print('Old number of topics: ', no_topics)
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
no_topics = tfidf.shape
print('New number of topics :', no_topics)
# nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
On the third-to-last line, tfidf.shape assigns the tuple (3, 1000) to the variable no_topics; however, I want that variable to hold only the x co-ordinate, i.e. 3.
How can I extract just the x co-ordinate?
You can select the first value with no_topics[0]:
print('New number of topics : {}'.format(no_topics[0]))
You can also slice the matrix tfidf itself, e.g. to take its first row:
topics = tfidf[0, :]
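Since tfidf.shape is an ordinary Python tuple, you can also unpack it, which makes the intent explicit:

n_docs, n_features = tfidf.shape   # (3, 1000) in the example above
no_topics = n_docs
print('New number of topics :', no_topics)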

Convert categorical features to numeric features using one-hot encoding

I want to convert categorical features to numeric features using one-hot encoding:
dataset = pd.read_csv('bank.csv', index_col=0)
X = dataset.iloc[:, :].values
Z = pd.DataFrame(X)
print(Z)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
Z = pd.DataFrame(X)
print(Z)
But this can only convert a single column at a time. How can I convert more columns, like columns 1, 2, 3 together?
I tried changing '0' to '0:', but got "ValueError: bad input shape (11162, 16)".
And if I change X[:,0] to X[:,1,2,3...], I get "IndexError: too many indices for array".
I have a function that can do the job for you:
# Own implementation of one-hot encoding - data transformation
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1
    return df

# One-hot encoding
print("One Hot Encoding categorical data...")
columns_to_convert = [col1, col2]  # enter the column names you want to one-hot encode
for column in df_all.columns:  # or loop over columns_to_convert
    if df_all[column].dtype == 'category':
        df_all = convert_to_binary(df=df_all, column_to_convert=column)
        df_all.drop(column, axis=1, inplace=True)
print("One Hot Encoding categorical data...completed")
Make sure you enter your list of columns in columns_to_convert (and loop over it instead of df_all.columns) if you don't want all categorical variables to be converted.
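As a side note, pandas has a built-in one-hot encoder that handles several columns in one call; a minimal sketch (the column names are placeholders for your categorical columns):

import pandas as pd

dataset = pd.read_csv('bank.csv', index_col=0)
# one-hot encode several categorical columns at once
encoded = pd.get_dummies(dataset, columns=['col1', 'col2', 'col3'])
print(encoded.head())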
