I am new to tokenizers. My understanding is that the truncate attribute just cuts the sentence off, but I need the whole sentence for context.
For example, my sentence is:
"Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri daha sonra 980 yılında nasıl adlandırılmıştır? Ali bin Abbas'ın eseri Rezi'nin hangi isimli eserinden daha özlü ve daha sistematikdir? Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri İbn-i Sina'nın hangi isimli eserinden daha uygulamalı bir biçimde yazılmıştır? Kitab el-Maliki Avrupa'da Constantinus Africanus tarafından hangi dile çevrilmiştir? Kitab el-Maliki'nin ilk bölümünde neye ağırlık verilmiştir?
But when I use max_length=64, truncation=True and pad_to_max_length=True for my encoder (as suggested on the internet), half of the sentence is gone:
['▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '-', 's', '▁Sina', '▁ad', 'lı', '▁es', 'eri', '▁daha', '▁sonra', '▁980', '▁yıl', 'ında', '▁na', 'sıl', '▁adlandır', 'ılmıştır', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁es', 'eri', '▁Rez', 'i', "'", 'nin', '▁', 'hangi', '▁is', 'imli', '▁es', 'erinden', '▁daha', '▁', 'özlü', '▁ve', '▁daha', '▁sistema', 'tik', 'dir', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '</s>']
And when I increase max_length, CUDA runs out of memory, of course. What should be my approach for long texts in the dataset?
My code for encoding:
input_encodings = tokenizer.batch_encode_plus(
    example_batch['context'],
    max_length=512,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
target_encodings = tokenizer.batch_encode_plus(
    example_batch['questions'],
    max_length=64,
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
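(As an aside: in recent versions of transformers, pad_to_max_length=True is deprecated in favor of padding='max_length'.)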
Yes, the truncate attribute just keeps the given number of subwords, counting from the left. The workaround depends on the task you are solving and on the data that you use.
Are long sequences frequent in your data? If not, you can safely throw those instances away, because it is unlikely that the model would learn to generalize to long sequences anyway.
If you really need the long context, you have plenty of options:
Decrease the batch size (and perhaps accumulate gradients over several batches before doing an update).
Make the model smaller: use either a smaller dimension or fewer layers.
Use a different architecture: Transformers need memory that is quadratic in the sequence length. Wouldn't an LSTM or CNN do the job? There are also architectures designed for long sequences (e.g., Reformer, Longformer).
If you need to use a pre-trained BERT-like model and there is no model of the size that would fit your needs, you can distill a smaller model or a model with a more suitable architecture yourself.
Perhaps you can split the input. In tasks like answer span selection, you can split the text in which you are looking for an answer into smaller chunks and search the chunks independently, as in the sketch below.
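For the splitting option, a fast tokenizer from the transformers library can produce the chunks for you. A minimal sketch under that assumption (the model name, window size, and stride below are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

enc = tokenizer(
    example_batch['context'],
    max_length=64,                   # window size that fits into GPU memory
    stride=16,                       # overlap so nothing is lost at a chunk boundary
    truncation=True,
    padding='max_length',
    return_overflowing_tokens=True)  # one row per chunk instead of one per text

# Maps each chunk row in enc['input_ids'] back to the text it came from.
print(enc['overflow_to_sample_mapping'])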
I am trying to match the skills as a pattern and apply a groupby function; however, I am receiving TypeError: unhashable type: 'list'.
Tec_skills=['MS SQL Server', 'Oracle','Machine Learning', 'Artificial Intelligence', 'Deep Neural Networks',
            'Convolutional Neural Network', 'Sklearn Libraries', 'Keras','Tensor flow', 'SQL', 'C#','NoSQL',' Docker',
            'Python','Shell','SQL/PLSQL','PLSQL','R','C','C++','AWS','Neural Networks,','CNN','RNN','Linear/Logistic Regression',
            'Ensemble Trees, Gradient','Boosted trees, Bagging, Random forest','Time series','Data Visualization','Sentiment Analysis',
            'Docker & Kubernetes','Classification','clustering','supervised','unsupervised']
def tech_skills(text):
    word_tokens = word_tokenize(text)
    filtered_stop_word = [word for word in word_tokens if word not in stopwords.words('english')]
    all_combinations = ' '.join, everygrams(filtered_stop_word, 2, 3)
    #ext_skills=[]
    ext_skills = re.findall(Tec_skills, all_combinations)
    if ext_skills:
        return (ext_skills.group(0))
    return (ext_skills.group(0))
All_pdf_data_bygroup=All_pdf_data.groupby(All_pdf_data.index)
All_pdf_data_bygroup["text"].apply(lambda x: ' '.join(x)).apply(lambda x:tech_skills(x))
ERROR: TypeError: unhashable type: 'list'
Please suggest how to resolve the issue.
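For what it's worth, re.findall expects a string (or compiled) pattern, not a list, so passing the Tec_skills list is what raises the unhashable-list error (the re module caches patterns in a dict, and a list cannot be a dict key); findall also returns a plain list, which has no .group() method. A minimal sketch of one way to rework this, assuming the goal is to extract which skills occur in each group's combined text:

import re

# Join the escaped skill names into a single alternation pattern.
skills_pattern = re.compile(
    '|'.join(re.escape(skill.strip()) for skill in Tec_skills),
    flags=re.IGNORECASE)

def tech_skills(text):
    # findall already returns the list of matched skills; no .group() needed
    return skills_pattern.findall(text)

matched_skills = (All_pdf_data.groupby(All_pdf_data.index)["text"]
                  .apply(lambda x: ' '.join(x))
                  .apply(tech_skills))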
The Problem is:-
Given a digit string, return all possible letter combinations that the number could represent, according to the buttons on a telephone keypad.
The returned strings must be lexicographically sorted.
Example-1 :-
Input : "23"
Output : ["ad", "ae", "af", "bd", "be", "bf", "cd", "ce", "cf"]
Example-2 :-
Input : "9"
Output : ["w", "x", "y", "z"]
Example-3 :-
Input : "246"
Output : ["agm", "agn", "ago", "ahm", ..., "cho", "cim", "cin", "cio"] {27 elements}
I've racked my brain over this and tried a lot, but I'm not getting past this part. What I've tried is a recursive function that zips the individual letters of each digit with the letters of the other digits and applies itertools.combinations() over them, but I'm unable to complete this function.
What I've tried is :-
times, str_res = 0, ""

def getval(lst, times):
    if times == len(lst) - 1:
        for i in lst[times]:
            yield i
    else:
        for i in lst[times]:
            yield i + getval(lst, times + 1)
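            # note: this adds a string to the generator object returned by
            # getval, which raises a TypeError at runtime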
dct = {"2":("a","b","c"), "3":("d","e","f"), "4":("g","h","i"),
"5":("j","k","l"), "6":("m","n","o"), "7":("p","q","r","s"),
"8":("t","u","v"), "9":("w","x","y","z"), "1":("")}
str1, res = "23", []
if len(str1)==1:
print(dct[str1[0]])
else:
temp = [dct[i] for i in str1]
str_res = getval(temp, times)
print(str_res)
Please suggest your ideas for this problem or for completing the function.
It's not itertools.combinations that you need, it's itertools.product.
from itertools import product

def all_letter_comb(s, dct):
    for p in product(*map(dct.get, s)):
        yield ''.join(p)

dct = {"2": ("a", "b", "c"), "3": ("d", "e", "f"), "4": ("g", "h", "i"),
       "5": ("j", "k", "l"), "6": ("m", "n", "o"), "7": ("p", "q", "r", "s"),
       "8": ("t", "u", "v"), "9": ("w", "x", "y", "z"),
       "1": ("",)}  # ("",) not (""): ("") is just an empty string, and an
                    # empty pool would make product() yield nothing for
                    # numbers containing a 1

for s in ['23', '9', '246']:
    print(s)
    print(list(all_letter_comb(s, dct)))
    print()
Output:
23
['ad', 'ae', 'af', 'bd', 'be', 'bf', 'cd', 'ce', 'cf']
9
['w', 'x', 'y', 'z']
246
['agm', 'agn', 'ago', 'ahm', 'ahn', 'aho', 'aim', 'ain', 'aio', 'bgm', 'bgn', 'bgo', 'bhm', 'bhn', 'bho', 'bim', 'bin', 'bio', 'cgm', 'cgn', 'cgo', 'chm', 'chn', 'cho', 'cim', 'cin', 'cio']
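Note that because each letter tuple in dct is already in alphabetical order, itertools.product emits the combinations in lexicographic order, so the required sorted output comes out for free.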
If I am not wrong, this is a LeetCode problem. You can find multiple answers there.
I'm trying to remove self-defined stop words with TfidfVectorizer, but although I have tried different approaches, the stop words I defined are not removed multiple times; it seems as if they are removed only once.
stop_words = ["er","sie","es","sehr", "geehrte","geehrter","herr","frau","ihre","ihrem","ihren","der","die", "das","viele", "gruesse","gruessen","mit", "von", "auf", "unter","ab", "fuer", "von", "gmbh", "und", "oder","email", "am", "ist","nicht", "wir", "hiermit", "unser", "unsere", "unseren","ohne", "bitten", "uns", "bis", "zur","am","bei", "des", "dessen", "deren", "dem", "nach","zu", "eines", "einen", "einer", "einem", "dies", "des", "den", "dank", "wurde", "wird", "war", "sein","in", "als", "gerne", "gerne", "wieder","welcher", "welche", "welchem","welchen","welches", "hat","hatte","freundlich", "freundliche", "freundlichen", "freundliches", "wenn", "wuerden", "durch"]
vectorizer = TfidfVectorizer(ngram_range=[1,1], stop_words=stop_words)
X_text_set = vectorizer.fit_transform(X_text_set)
These are the results without using stop words:
y_train_text size:(1872,)
y_val_text size:(401,)
y_test_text size:(402,)
X_train_text size:(1872, 35941)
X_val_text size:(401, 35941)
X_test_text size:(402, 35941)
These are the results after using stop words:
y_train_text size:(1872,)
y_val_text size:(401,)
y_test_text size:(402,)
X_train_text size:(1872, 35867)
X_val_text size:(401, 35867)
X_test_text size:(402, 35867)
As you can see, every word is only removed once. Since these are common words, I would rather expect hundreds of occurrences to be removed.
Could anyone please help me?
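One thing to note: the second dimension of these shapes is the vocabulary size, i.e. the number of unique terms, so each stop word can shrink it by at most one column no matter how often it occurs in the corpus; 35941 - 35867 = 74 columns, roughly the number of distinct stop words that actually appear in the texts. A minimal sketch for checking which terms were dropped (assuming scikit-learn >= 1.0 for get_feature_names_out, and that X_text_set holds the raw texts, before the fit_transform above):

from sklearn.feature_extraction.text import TfidfVectorizer

vec_plain = TfidfVectorizer(ngram_range=(1, 1))
vec_stop = TfidfVectorizer(ngram_range=(1, 1), stop_words=stop_words)

# Vocabularies with and without the stop word list.
vocab_plain = set(vec_plain.fit(X_text_set).get_feature_names_out())
vocab_stop = set(vec_stop.fit(X_text_set).get_feature_names_out())

# Exactly those stop words that occur in the corpus should show up here.
print(sorted(vocab_plain - vocab_stop))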
I am reading a text file consisting of Bengali words, but I am unable to print the dependent vowels (as in KA, KI, etc.).
Here is my sample code and output:
import unicodedata
bengali_phoneme_maplist={u'অ':'A',u'আ':'AA',u'ই':'I',u'ঈ':'II',u'উ':'U',u'ঊ ':'UU',u'ঋ ':'R',u'ঌ ':'L',u'এ ':'E',u'ঐ ':'AI',u'ও ':'O',u'ঔ ':'AU',u'ক':'KA',u'খ ':'KHA',u'গ ':'GA',u'ঘ':'GHA',u'ঙ ':'NGA',u'চ ':'CA',u'ছ':'CHA',u'জ ':'JA',u'ঝ':'JHA',u'ঞ':'NYA',u'ট ':'TTA',u'ঠ':'TTHA',u'ড ':'DDA',u'ঢ':'DDHA',u'ণ ':'NNA',u'ত ':'TA',u'ত ':'THA',u'দ':'DA',u'ধ':'DHA',u'ন':'NA',u'প':'PA',u'ফ':'PHA',u'ব':'BA',u'ভ':'BHA',u'ম ':'MA',u'য ':'YA',u'র':'RA',u'ল ':'LA',u'শ ':'SHA',u'ষ':'SSA',u'স ':'SA',u'হ':'ha',u' া ':'AAV',u' ি':'IV',u'ী':'IIV',u'ু':'UV',u'ূ':'UUV',u'ৃ':'RRV',u'ৄ ':'RR',u'ৄ':'EV',u' ৈ':'EV',u'়':'NUKTHA',u'ঽ':'AVAGRAHA'}
bengali_phoneme_maplist_normalise = {unicodedata.normalize('NFKD', k): v
                                     for k, v in bengali_phoneme_maplist.items()}

with open('bengali.txt', 'r') as infile:
    lines = infile.readlines()

for index, line in enumerate(lines):
    print('Phonemes in line{0}.total{1} symbols'.format(index, len(line)))
    unknown = []
    words = line.split()
    for word in words:
        print(word, ':', sep=' ', end='')
        for character in word:
            c = unicodedata.normalize('NFKD', character).casefold()
            try:
                print(bengali_phoneme_maplist_normalise[c], sep='', end='')
            except KeyError:
                print('_', sep='', end='')
                if c not in unknown:
                    unknown.append(c)
        print()
    if unknown:
        print('Unrecognised symbols:{0},total {1} symbols'.format(','.join(unknown), len(unknown)))
Sample input:
শিল্পাঞ্চলে ঢোকার মুখে, স্ন্যাক্সবারে খাবার কিনছিলেন, বহুজাতিক তথ্যপ্রযুক্তি সংস্থার কর্মী, শুভময় বন্দ্যোপাধ্যায়
Sample output:
Phonemes in line0.total129 symbols
text_000002 :___________
"শিল্পাঞ্চলে :_____PA_NYA____
ঢোকার :DDHA_KA_RA
মুখে, :_UV___
স্ন্যাক্সবারে :__NA___KA__BA_RA_
খাবার :__BA_RA
কিনছিলেন, :KA_NACHA___NA_
Unrecognisedsymbols:t,e,x,_,0,2,",শ,ি,ল,্,া,চ,ে,ো,ম,খ,,,স,য,জ,ত,থ,ং,য়,),
(Note that I know nothing about Bengali. :)
There are a few problems in your code:
There are many extra SPACE characters in the bengali_phoneme_maplist definition. For example, u'ঊ ' should be u'ঊ'. And since it seems hard to type characters like u'া' in a text editor, I suggest you put the Unicode escapes directly in the code, like '\u09be':'AAV'. (Actually, I'd suggest you use '\uxxxx' for all the characters and write the real characters in comments; see the sketch at the end.)
u'ত':'TA',u'ত':'THA' should be changed to u'ত':'TA',u'থ':'THA'.
The characters in bengali_phoneme_maplist are not complete. For example, there are no entries for ো, ৌ, ্ and ং.
After fixing these errors you will get the correct result.
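For illustration, a partial sketch of that '\uxxxx' style (the codepoints are from the Unicode Bengali block; the labels for the previously missing entries are assumptions that follow the existing AAV/IV naming pattern):

bengali_phoneme_maplist = {
    '\u0989': 'U',         # উ BENGALI LETTER U (no trailing space)
    '\u098A': 'UU',        # ঊ BENGALI LETTER UU
    '\u09BE': 'AAV',       # া BENGALI VOWEL SIGN AA
    '\u09BF': 'IV',        # ি BENGALI VOWEL SIGN I
    '\u09CB': 'OV',        # ো BENGALI VOWEL SIGN O (was missing)
    '\u09CC': 'AUV',       # ৌ BENGALI VOWEL SIGN AU (was missing)
    '\u09CD': 'HASANT',    # ্ BENGALI SIGN VIRAMA (was missing)
    '\u0982': 'ANUSVARA',  # ং BENGALI SIGN ANUSVARA (was missing)
    # ... remaining entries in the same style ...
}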