I am new to tokenizers. My understanding is that the truncate attribute just cuts the sentence off, but I need the whole sentence for context.
For example, my sentence is:
"Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri daha sonra 980 yılında nasıl adlandırılmıştır? Ali bin Abbas'ın eseri Rezi'nin hangi isimli eserinden daha özlü ve daha sistematikdir? Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri İbn-i Sina'nın hangi isimli eserinden daha uygulamalı bir biçimde yazılmıştır? Kitab el-Maliki Avrupa'da Constantinus Africanus tarafından hangi dile çevrilmiştir? Kitab el-Maliki'nin ilk bölümünde neye ağırlık verilmiştir?
But when I use max_length=64, truncation=True and pad_to_max_length=True for my encoder (as suggested on the internet), half of the sentence is gone:
['▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '-', 's', '▁Sina', '▁ad', 'lı', '▁es', 'eri', '▁daha', '▁sonra', '▁980', '▁yıl', 'ında', '▁na', 'sıl', '▁adlandır', 'ılmıştır', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁es', 'eri', '▁Rez', 'i', "'", 'nin', '▁', 'hangi', '▁is', 'imli', '▁es', 'erinden', '▁daha', '▁', 'özlü', '▁ve', '▁daha', '▁sistema', 'tik', 'dir', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '</s>']
And when I increase max_length, CUDA runs out of memory, of course. What should be my approach for long texts in the dataset?
My code for encoding:
input_encodings = tokenizer.batch_encode_plus(
    example_batch['context'],
    max_length=512,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
target_encodings = tokenizer.batch_encode_plus(
    example_batch['questions'],
    max_length=64,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
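(As an aside: in recent versions of transformers, pad_to_max_length=True is deprecated in favor of padding='max_length'.)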
Yes, the truncate attribute just keeps the given number of subwords, counting from the left. The workaround depends on the task you are solving and on the data that you use.
Are long sequences frequent in your data? If not, you can safely throw those instances away, because it is unlikely that the model would learn to generalize to long sequences anyway.
If you really need the long context, you have plenty of options:
Decrease the batch size (and perhaps accumulate gradients over several batches before doing an update).
Make the model smaller: use either a smaller dimension or fewer layers.
Use a different architecture: Transformers need memory that is quadratic in the sequence length. Wouldn't an LSTM or CNN do the job? There are also architectures designed for long sequences (e.g., Reformer, Longformer).
If you need to use a pre-trained BERT-like model and there is no model of the size that would fit your needs, you can distill a smaller model or a model with a more suitable architecture yourself.
Perhaps you can split the input. In tasks like answer span selection, you can split the text in which you are looking for an answer into smaller chunks and search the chunks independently, as in the sketch below.
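For the splitting option, a fast tokenizer from the transformers library can produce the chunks for you. A minimal sketch under that assumption (the model name, window size, and stride below are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

enc = tokenizer(
    example_batch['context'],
    max_length=64,                   # window size that fits into GPU memory
    stride=16,                       # overlap so nothing is lost at a chunk boundary
    truncation=True,
    padding='max_length',
    return_overflowing_tokens=True)  # one row per chunk instead of one per text

# Maps each chunk row in enc['input_ids'] back to the text it came from.
print(enc['overflow_to_sample_mapping'])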
I am trying to match the skills as a pattern and apply a groupby function; however, I am receiving TypeError: unhashable type: 'list'.
Tec_skills=['MS SQL Server', 'Oracle','Machine Learning', 'Artificial Intelligence', 'Deep Neural Networks',
            'Convolutional Neural Network', 'Sklearn Libraries', 'Keras','Tensor flow', 'SQL', 'C#','NoSQL',' Docker',
            'Python','Shell','SQL/PLSQL','PLSQL','R','C','C++','AWS','Neural Networks,','CNN','RNN','Linear/Logistic Regression',
            'Ensemble Trees, Gradient','Boosted trees, Bagging, Random forest','Time series','Data Visualization','Sentiment Analysis',
            'Docker & Kubernetes','Classification','clustering','supervised','unsupervised']
def tech_skills(text):
    word_tokens = word_tokenize(text)
    filtered_stop_word = [word for word in word_tokens if word not in stopwords.words('english')]
    all_combinations = ' '.join, everygrams(filtered_stop_word, 2, 3)
    #ext_skills=[]
    ext_skills = re.findall(Tec_skills, all_combinations)
    if ext_skills:
        return (ext_skills.group(0))
    return (ext_skills.group(0))
All_pdf_data_bygroup=All_pdf_data.groupby(All_pdf_data.index)
All_pdf_data_bygroup["text"].apply(lambda x: ' '.join(x)).apply(lambda x:tech_skills(x))
ERROR: TypeError: unhashable type: 'list'
Please suggest how to resolve the issue.
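For what it's worth, re.findall expects a string (or compiled) pattern, not a list, so passing the Tec_skills list is what raises the unhashable-list error (the re module caches patterns in a dict, and a list cannot be a dict key); findall also returns a plain list, which has no .group() method. A minimal sketch of one way to rework this, assuming the goal is to extract which skills occur in each group's combined text:

import re

# Join the escaped skill names into a single alternation pattern.
skills_pattern = re.compile(
    '|'.join(re.escape(skill.strip()) for skill in Tec_skills),
    flags=re.IGNORECASE)

def tech_skills(text):
    # findall already returns the list of matched skills; no .group() needed
    return skills_pattern.findall(text)

matched_skills = (All_pdf_data.groupby(All_pdf_data.index)["text"]
                  .apply(lambda x: ' '.join(x))
                  .apply(tech_skills))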
The Problem is:-
Given a digit string, return all possible letter combinations that the number could represent, according to the buttons on a telephone keypad.
The returned strings must be lexicographically sorted.
Example-1 :-
Input : "23"
Output : ["ad", "ae", "af", "bd", "be", "bf", "cd", "ce", "cf"]
Example-2 :-
Input : "9"
Output : ["w", "x", "y", "z"]
Example-3 :-
Input : "246"
Output : ["agm", "agn", "ago", "ahm", ..., "cho", "cim", "cin", "cio"] {27 elements}
I've racked my brain over this and tried a lot, but I'm not getting past this part. What I've tried is a recursive function that zips the individual letters of each digit with the letters of the other digits and applies itertools.combinations() over them, but I'm unable to complete this function.
What I've tried is :-
times, str_res = 0, ""

def getval(lst, times):
    if times == len(lst) - 1:
        for i in lst[times]:
            yield i
    else:
        for i in lst[times]:
            yield i + getval(lst, times + 1)
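            # note: this adds a string to the generator object returned by
            # getval, which raises a TypeError at runtime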
dct = {"2":("a","b","c"), "3":("d","e","f"), "4":("g","h","i"),
"5":("j","k","l"), "6":("m","n","o"), "7":("p","q","r","s"),
"8":("t","u","v"), "9":("w","x","y","z"), "1":("")}
str1, res = "23", []
if len(str1)==1:
print(dct[str1[0]])
else:
temp = [dct[i] for i in str1]
str_res = getval(temp, times)
print(str_res)
Please suggest your ideas for this problem or for completing the function.
It's not itertools.combinations that you need, it's itertools.product.
from itertools import product

def all_letter_comb(s, dct):
    for p in product(*map(dct.get, s)):
        yield ''.join(p)

dct = {"2": ("a", "b", "c"), "3": ("d", "e", "f"), "4": ("g", "h", "i"),
       "5": ("j", "k", "l"), "6": ("m", "n", "o"), "7": ("p", "q", "r", "s"),
       "8": ("t", "u", "v"), "9": ("w", "x", "y", "z"),
       "1": ("",)}  # ("",) not (""): ("") is just an empty string, and an
                    # empty pool would make product() yield nothing for
                    # numbers containing a 1

for s in ['23', '9', '246']:
    print(s)
    print(list(all_letter_comb(s, dct)))
    print()
Output:
23
['ad', 'ae', 'af', 'bd', 'be', 'bf', 'cd', 'ce', 'cf']
9
['w', 'x', 'y', 'z']
246
['agm', 'agn', 'ago', 'ahm', 'ahn', 'aho', 'aim', 'ain', 'aio', 'bgm', 'bgn', 'bgo', 'bhm', 'bhn', 'bho', 'bim', 'bin', 'bio', 'cgm', 'cgn', 'cgo', 'chm', 'chn', 'cho', 'cim', 'cin', 'cio']
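Note that because each letter tuple in dct is already in alphabetical order, itertools.product emits the combinations in lexicographic order, so the required sorted output comes out for free.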
If I am not wrong, this is a LeetCode problem. You can find multiple answers there.
I'm trying to remove self-defined stop words with TfidfVectorizer, but although I have tried different approaches, the stop words I defined are not removed multiple times; it seems as if they are removed only once.
stop_words = ["er","sie","es","sehr", "geehrte","geehrter","herr","frau","ihre","ihrem","ihren","der","die", "das","viele", "gruesse","gruessen","mit", "von", "auf", "unter","ab", "fuer", "von", "gmbh", "und", "oder","email", "am", "ist","nicht", "wir", "hiermit", "unser", "unsere", "unseren","ohne", "bitten", "uns", "bis", "zur","am","bei", "des", "dessen", "deren", "dem", "nach","zu", "eines", "einen", "einer", "einem", "dies", "des", "den", "dank", "wurde", "wird", "war", "sein","in", "als", "gerne", "gerne", "wieder","welcher", "welche", "welchem","welchen","welches", "hat","hatte","freundlich", "freundliche", "freundlichen", "freundliches", "wenn", "wuerden", "durch"]
vectorizer = TfidfVectorizer(ngram_range=[1,1], stop_words=stop_words)
X_text_set = vectorizer.fit_transform(X_text_set)
These are the results without using stop words:
y_train_text size:(1872,)
y_val_text size:(401,)
y_test_text size:(402,)
X_train_text size:(1872, 35941)
X_val_text size:(401, 35941)
X_test_text size:(402, 35941)
These are the results after using stop words:
y_train_text size:(1872,)
y_val_text size:(401,)
y_test_text size:(402,)
X_train_text size:(1872, 35867)
X_val_text size:(401, 35867)
X_test_text size:(402, 35867)
As you can see, every word is only removed once. Since these are common words, I would rather expect hundreds of occurrences to be removed.
Could anyone please help me?
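One thing to note: the second dimension of these shapes is the vocabulary size, i.e. the number of unique terms, so each stop word can shrink it by at most one column no matter how often it occurs in the corpus; 35941 - 35867 = 74 columns, roughly the number of distinct stop words that actually appear in the texts. A minimal sketch for checking which terms were dropped (assuming scikit-learn >= 1.0 for get_feature_names_out, and that X_text_set holds the raw texts, before the fit_transform above):

from sklearn.feature_extraction.text import TfidfVectorizer

vec_plain = TfidfVectorizer(ngram_range=(1, 1))
vec_stop = TfidfVectorizer(ngram_range=(1, 1), stop_words=stop_words)

# Vocabularies with and without the stop word list.
vocab_plain = set(vec_plain.fit(X_text_set).get_feature_names_out())
vocab_stop = set(vec_stop.fit(X_text_set).get_feature_names_out())

# Exactly those stop words that occur in the corpus should show up here.
print(sorted(vocab_plain - vocab_stop))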
I am reading a text file consisting of Bengali words, but I am unable to print the dependent vowels (as in KA, KI, etc.).
Here is my sample code and output:
import unicodedata
bengali_phoneme_maplist={u'অ':'A',u'আ':'AA',u'ই':'I',u'ঈ':'II',u'উ':'U',u'ঊ ':'UU',u'ঋ ':'R',u'ঌ ':'L',u'এ ':'E',u'ঐ ':'AI',u'ও ':'O',u'ঔ ':'AU',u'ক':'KA',u'খ ':'KHA',u'গ ':'GA',u'ঘ':'GHA',u'ঙ ':'NGA',u'চ ':'CA',u'ছ':'CHA',u'জ ':'JA',u'ঝ':'JHA',u'ঞ':'NYA',u'ট ':'TTA',u'ঠ':'TTHA',u'ড ':'DDA',u'ঢ':'DDHA',u'ণ ':'NNA',u'ত ':'TA',u'ত ':'THA',u'দ':'DA',u'ধ':'DHA',u'ন':'NA',u'প':'PA',u'ফ':'PHA',u'ব':'BA',u'ভ':'BHA',u'ম ':'MA',u'য ':'YA',u'র':'RA',u'ল ':'LA',u'শ ':'SHA',u'ষ':'SSA',u'স ':'SA',u'হ':'ha',u' া ':'AAV',u' ি':'IV',u'ী':'IIV',u'ু':'UV',u'ূ':'UUV',u'ৃ':'RRV',u'ৄ ':'RR',u'ৄ':'EV',u' ৈ':'EV',u'়':'NUKTHA',u'ঽ':'AVAGRAHA'}
bengali_phoneme_maplist_normalise = {unicodedata.normalize('NFKD', k): v
                                     for k, v in bengali_phoneme_maplist.items()}

with open('bengali.txt', 'r') as infile:
    lines = infile.readlines()

for index, line in enumerate(lines):
    print('Phonemes in line{0}.total{1} symbols'.format(index, len(line)))
    unknown = []
    words = line.split()
    for word in words:
        print(word, ':', sep=' ', end='')
        for character in word:
            c = unicodedata.normalize('NFKD', character).casefold()
            try:
                print(bengali_phoneme_maplist_normalise[c], sep='', end='')
            except KeyError:
                print('_', sep='', end='')
                if c not in unknown:
                    unknown.append(c)
        print()
    if unknown:
        print('Unrecognised symbols:{0},total {1} symbols'.format(','.join(unknown), len(unknown)))
Sample input:
শিল্পাঞ্চলে ঢোকার মুখে, স্ন্যাক্সবারে খাবার কিনছিলেন, বহুজাতিক তথ্যপ্রযুক্তি সংস্থার কর্মী, শুভময় বন্দ্যোপাধ্যায়
Sample output:
Phonemes in line0.total129 symbols
text_000002 :___________
"শিল্পাঞ্চলে :_____PA_NYA____
ঢোকার :DDHA_KA_RA
মুখে, :_UV___
স্ন্যাক্সবারে :__NA___KA__BA_RA_
খাবার :__BA_RA
কিনছিলেন, :KA_NACHA___NA_
Unrecognisedsymbols:t,e,x,_,0,2,",শ,ি,ল,্,া,চ,ে,ো,ম,খ,,,স,য,জ,ত,থ,ং,য়,),
(Note that I know nothing about Bengali. :)
There are a few problems in your code:
There are many extra SPACE characters in the bengali_phoneme_maplist definition. For example, u'ঊ ' should be u'ঊ'. And since it seems hard to type characters like u'া' in a text editor, I suggest you put the Unicode escapes directly in the code, like '\u09be':'AAV'. (Actually, I'd suggest you use '\uxxxx' for all the characters and write the real characters in comments; see the sketch at the end.)
u'ত':'TA',u'ত':'THA' should be changed to u'ত':'TA',u'থ':'THA'.
The characters in bengali_phoneme_maplist are not complete. For example, there are no entries for ো, ৌ, ্ and ং.
After fixing these errors you will get the correct result.
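For illustration, a partial sketch of that '\uxxxx' style (the codepoints are from the Unicode Bengali block; the labels for the previously missing entries are assumptions that follow the existing AAV/IV naming pattern):

bengali_phoneme_maplist = {
    '\u0989': 'U',         # উ BENGALI LETTER U (no trailing space)
    '\u098A': 'UU',        # ঊ BENGALI LETTER UU
    '\u09BE': 'AAV',       # া BENGALI VOWEL SIGN AA
    '\u09BF': 'IV',        # ি BENGALI VOWEL SIGN I
    '\u09CB': 'OV',        # ো BENGALI VOWEL SIGN O (was missing)
    '\u09CC': 'AUV',       # ৌ BENGALI VOWEL SIGN AU (was missing)
    '\u09CD': 'HASANT',    # ্ BENGALI SIGN VIRAMA (was missing)
    '\u0982': 'ANUSVARA',  # ং BENGALI SIGN ANUSVARA (was missing)
    # ... remaining entries in the same style ...
}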