Transformer models have maximum token limits. If I want to substring my text to fit within that limit, what is the generally accepted way?
Because of how the tokenizer handles special characters and subwords, its tokens don't map one-to-one onto whitespace-separated words, so there is no clean word count to loop over. Naively:
subst = " ".join(mytext.split(" ")[0:MAX_LEN])
would let me loop through chunks with something like:
START = 0
substr = []
while START + MAX_LEN < len(mytext.split(" ")):
    substr.append(" ".join(mytext.split(" ")[START:START + MAX_LEN]))
    START = START + MAX_LEN
tokens = [tokenizer(chunk) for chunk in substr]
However, " ".join(mytext.split(" ")[0:MAX_LEN]) is not equal to the length given by tokenizer(text).
You can see the difference below:
>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
>>> mytext = "This is a long sentence. " * 2000 # about 10k words, ~12k tokens
>>> len(mytext.split(" "))
10001
>>> encoded_input = tokenizer(mytext)
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
Is there a function argument to tokenizer for this, or, failing that, what is the generally accepted iteration procedure for longer documents?
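For reference, here is a minimal sketch of one common pattern (an assumption about typical usage, not an official recipe): tokenize the whole document once, slice the token ids into chunks of at most the model limit, and add the special tokens per chunk. MAX_LEN here is a hypothetical budget that leaves room for the special tokens.
from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
mytext = "This is a long sentence. " * 2000

MAX_LEN = 4096 - 2  # leave room for the special tokens added to each chunk

# Tokenize once, then slice the token ids rather than the raw words.
ids = tokenizer.encode(mytext, add_special_tokens=False)
chunks = [ids[i:i + MAX_LEN] for i in range(0, len(ids), MAX_LEN)]

# Wrap each chunk with the model's special tokens before feeding it to the model.
encoded_chunks = [tokenizer.build_inputs_with_special_tokens(chunk) for chunk in chunks]
Recent tokenizer versions also accept truncation=True and max_length, and fast tokenizers can return the overflow as extra chunks via return_overflowing_tokens=True with a stride, which may spare you the manual loop.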
The dataset contains 2.14M words
The following is my code.
uni = get_unique(ds)  # to get all unique words
c = Counter(uni)      # using Counter from collections to create a dictionary
v = list(c.values())  # dict values
ky = list(c.keys())   # dict keys
junk = []  # indexes of rare words (words that appear fewer than 20 times)
num = 0    # the number of words that appear 20 times or more
for i in range(len(v)):
    if v[i] >= 20:
        num += 1
    else:
        junk.append(i)
rare_words = []
for i in junk:
    rare_words.append(ky[i])  # selecting the rare words from the keys
A function to remove the rare words
def remove_jnk(dataset, rare_words):
    ds = []
    for i in dataset:
        repl_wrd = " "
        res = " ".join([repl_wrd if idx in rare_words else idx for idx in i[0].split()])
        ds.append([res])
    return ds
ds = remove_jnk(ds, rare_words)
This is too slow; it's taking hours to run.
Maybe try importing a library such as NLTK and doing something like:
import nltk
tokens = [] #your word list
freq_dist = nltk.FreqDist(tokens)
FreqDist() gives the frequency distribution of the terms in the corpus; the rarest ones can then be selected into a list:
rarewords = [word for word, count in freq_dist.most_common()[-5:]]
after_rare_words = [word for word in tokens if word not in rarewords]
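As a side note, the main slowdown in the original remove_jnk is probably the idx in rare_words test against a list, which is linear in the number of rare words. A minimal sketch of the same filtering with a set (assuming the dataset keeps the [["text"], ...] shape used in the question) could look like this:
from collections import Counter

def remove_rare(dataset, min_count=20, replacement=" "):
    # Count every occurrence of every word across the dataset.
    counts = Counter(word for row in dataset for word in row[0].split())
    # A set makes the membership test O(1) instead of O(len(rare_words)).
    rare = {word for word, count in counts.items() if count < min_count}
    cleaned = []
    for row in dataset:
        res = " ".join(replacement if word in rare else word for word in row[0].split())
        cleaned.append([res])
    return cleaned

# ds = remove_rare(ds)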
I have tried to encrypt a string using the XOR operator and mapped the output back to letters. Now when I try to decrypt it I'm not getting the original string back.
Encryption code:
string= "Onions"
keyword = "MELLON"
def xor(string, key):
st=[]
ke=[]
xored=[]
for i in string:
asc= (ord(i))
st.append(int(asc))
print(st)
for i in key:
asc= (ord(i))
ke.append(int(asc))
print(ke)
for i in range(len(string)):
s1=st[i]
k1=ke[i]
abc = s1^k1
le = ord('A')+abc
ch = chr(le)
if le> 90:
le= le-26
ch = chr(le)
print(s1,k1)
print('XOR =',abc)
print(ch)
xored.append(ch)
print(xored)
return("" . join(xored))
Need help!!
The algorithm does not perform a pure XOR, but maps values conditionally to another value, leading to a relation that is no longer bijective.
To illustrate this point, see what this script outputs:
keyword = "MELLON"
print(xor("Onions", keyword) == xor("OTGEHs", keyword))
It will output True!
So this means you have two words that are encrypted to the same string. This also means that if you need to do the reverse, there is no way to know which of these is the real original word.
If you want decryption to be possible, make sure to only use operations that lead to a bijective mapping. For instance, if you only use XOR, without conditionally adding or subtracting values, it will be OK.
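As an illustration of that point (a minimal sketch, not part of the original answer): a plain byte-wise XOR with a repeating key is its own inverse, so applying the same function twice returns the original string.
def xor_bytes(data, key):
    # XOR each byte with the corresponding key byte; the key repeats if shorter.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

cipher = xor_bytes(b"Onions", b"MELLON")
print(xor_bytes(cipher, b"MELLON"))  # b'Onions'
The ciphertext may contain non-printable bytes, which is why the approach below restricts both arguments to letters instead.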
Here is an approach where only lower and uppercase letters of the Latin alphabet are allowed (for both arguments):
def togglecrypt(string, key):
    mapper = "gUMtuAqhaEDcsGjBbreSNJYdFTiOmHKwnXWxzClQLRVyvIkfPpoZ"
    res = []
    for i, ch in enumerate(string):
        shift = mapper.index(key[i % len(key)]) % 26
        i = mapper.index(ch)
        if i < 26:
            j = 26 + (i + shift) % 26
        else:
            j = (i - shift) % 26
        res.append(mapper[j])
    return "".join(res)
keyword = "MELLON"
encoded = togglecrypt("Onions", keyword)
print(encoded) # TdsDAn
print(togglecrypt(encoded, keyword)) # Onions
I have used tf.data.experimental.CsvDataset to read CSV data. The CSV has two columns of text, one per language, for the transformer model.
train_examples = tf.data.experimental.CsvDataset("./Data/training.csv", [tf.string, tf.string], header=True)
#printing 'train_examples'
<CsvDatasetV2 shapes: ((), ()), types: (tf.string, tf.string)>
I am trying to preprocess the text in each column before training the transformer model. How would I map a function like the one below over the 2 columns of the data? What structure is the output from tf.data.experimental.CsvDataset?
def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
    sentence = sentence.strip()
    # adding a start and an end token to the sentence
    return sentence
If I apply the above function, the CsvDataset object cannot handle any operations.
AttributeError: 'CsvDatasetV2' object has no attribute 'lower'
What structure is the output from tf.data.experimental.CsvDataset?
CsvDataset returns a tensorflow dataset which is a custom object representing an arbitrarily large dataset.
If I apply the above function, the CsvDataset object cannot handle any operations
That's because datasets are evaluated lazily by default (with good reason: as mentioned above, they can represent huge, even infinite, datasets), so mapping operations need to be expressed as tensor operations.
Usefully, however, there is a tensorflow operation that allows you to call python code from tf so you could do something like this:
pre_processed_dataset = my_dataset.map(lambda x, y: tf.py_function(preprocess_sentence, [x, y], [tf.string, tf.string]))
(though you should make sure preprocess_sentence actually takes and returns two sentences, since your dataset yields string pairs).
Having said that, it would be much better if you could translate your preprocessing function into tensor operations. Maybe something like this:
def preprocess(sentence1, sentence2):
    def preprocess_sentence(sentence):
        ret = tf.strings.lower(sentence)
        ret = tf.strings.strip(ret)
        ret = tf.strings.regex_replace(ret, "([?.!,])", r" \1 ")
        ret = tf.strings.regex_replace(ret, '[" "]+', " ")
        ret = tf.strings.regex_replace(ret, "[^a-zA-Z?.!,]+", " ")
        ret = tf.strings.strip(ret)
        return ret
    return preprocess_sentence(sentence1), preprocess_sentence(sentence2)
then you can map your dataset like this:
my_preprocessed_dataset = my_dataset.map(preprocess)
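To sanity-check the mapping without the CSV file, a small sketch with an in-memory dataset of string pairs (the sample sentences here are made up) behaves the same way:
import tensorflow as tf

# Stand-in for the CsvDataset: a dataset of (string, string) pairs.
my_dataset = tf.data.Dataset.from_tensor_slices(
    (["He is a boy.", "How are you?"],
     ["Er ist ein Junge!", "Wie geht es dir?"]))

my_preprocessed_dataset = my_dataset.map(preprocess)

for s1, s2 in my_preprocessed_dataset:
    print(s1.numpy().decode(), "|", s2.numpy().decode())
# he is a boy . | er ist ein junge !
# how are you ? | wie geht es dir ?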
I've got a list of strings like
Foobar
Foobaron
Foot
barstool
barfoo
footloose
I want to find the set of shortest possible sub-sequences that are unique to each string in the set; the characters in each sub-sequence do not need to be adjacent, just in the same order as they appear in the original string. For the example above, that would be (among other possibilities)
Fb (as unique to Foobar as it gets; collision with Foobaron unavoidable)
Fn (unique to Foobaron, no other ...F...n...)
Ft (Foot)
bs (barstool)
bf (barfoo)
e (footloose)
Is there an efficient way to mine such sequences and minimize the number of colliding strings (when collisions can't be avoided, e.g. when strings are substrings of other strings) from a given array of strings? More precisely: choosing a length N, what is the set of sub-sequences of up to N characters each that identifies the original strings with the fewest collisions?
I wouldn't really call this 'efficient', but you can do better than the totally naive approach:
words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose']
N = 2
n = len(words)
L = max([len(word) for word in words])

def generate_substrings(word, max_length=None):
    if max_length is None:
        max_length = len(word)
    set_substrings = set()
    set_substrings.add('')
    for charac in word:
        new_substr_list = []
        for substr in set_substrings:
            new_substr = substr + charac
            if len(new_substr) <= max_length:
                new_substr_list.append(new_substr)
        set_substrings.update(new_substr_list)
    return set_substrings

def get_best_substring_for_each(string_list=words, max_length=N):
    all_substrings = {}
    best = {}
    for word in string_list:
        for substring in generate_substrings(word, max_length=max_length):
            if substring not in all_substrings:
                all_substrings[substring] = 0
            all_substrings[substring] = all_substrings[substring] + 1
    for word in string_list:
        best_score = len(string_list) + 1
        best[word] = ''
        for substring in generate_substrings(word=word, max_length=max_length):
            if all_substrings[substring] < best_score:
                best[word] = substring
                best_score = all_substrings[substring]
    return best

print(get_best_substring_for_each(words, N))
This program prints the solution:
{'barfoo': 'af', 'Foobar': 'Fr', 'Foobaron': 'n', 'footloose': 'os', 'barstool': 'al', 'Foot': 'Ft'}
This can still be improved by a constant factor, for instance by storing the results of generate_substrings instead of computing them twice.
The complexity is O(n*C(L+N, N)), where n is the number of words, L is the maximum length of a word, and C(a, k) is the number of combinations of k elements out of a.
I don't think (though I'm not sure) that you can do much better in the worst case, because it seems hard to avoid enumerating all possible subsequences (the last one to be evaluated could be the only one with no redundancy...). Maybe on average you can do better.
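A sketch of the caching improvement mentioned above (reusing generate_substrings, words and N from the code above; same algorithm, each word's subsequence set is generated only once, and ties are broken arbitrarily):
from collections import Counter

def get_best_substring_cached(string_list=words, max_length=N):
    # Generate each word's subsequence set only once.
    subs_per_word = {word: generate_substrings(word, max_length=max_length)
                     for word in string_list}
    # Count in how many words each subsequence occurs.
    all_substrings = Counter()
    for subs in subs_per_word.values():
        all_substrings.update(subs)
    # For each word, pick a subsequence shared with as few other words as possible.
    return {word: min(subs, key=lambda s: all_substrings[s])
            for word, subs in subs_per_word.items()}

print(get_best_substring_cached(words, N))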
You could use a modification to the longest common subsequence algorithm. In this case you are seeking the shortest unique subsequence. Shown below is part of a dynamic programming solution which is more efficient than a recursive solution. The modifications to the longest common subsequence algorithm are described in the comments below:
for (int i = 1; i <= string1.Length; i++)
    for (int j = 1; j <= string2.Length; j++)
        if (string1[i-1] != string2[j-1])              // find characters in the strings that are distinct
            SUS[i][j] = SUS[i-1][j-1] + 1;             // SUS: Shortest Unique Subsequence
        else
            SUS[i][j] = min(SUS[i-1][j], SUS[i][j-1]); // find minimum size of distinct strings
You can then put this code in a function and call this function for each string in your set to find the length of the shortest unique subsequence in the set.
Once you have the length of the shortest unique subsequence, you can backtrack to print the subsequence.
You could use a modified trie structure, inserting the strings into the trie along these lines:
Foo-bar-on
-t
bar-stool
-foo
The rest is straightforward: just choose the correct compressed node's first character (node[0]).
A radix tree should help here.
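For illustration only (my sketch, not the answerer's code): a minimal nested-dict trie shows how the strings group by shared prefixes; picking the distinguishing characters, and handling non-adjacent subsequences, would still need extra logic on top of this.
def build_trie(words):
    # Nested-dict trie; the key '$' marks the end of a word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

trie = build_trie(['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose'])
# 'Foobar', 'Foobaron' and 'Foot' share the 'Foo' branch; the first divergence
# ('t' vs 'b', then 'on' vs end-of-word) is where the distinguishing characters live.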
I have a string of characters of length 50, say, representing a sequence like abbcda..., over the alphabet A = {a, b, c, d}.
I want to count how many times b is followed by another b (n-grams with n = 2).
Similarly, how many times a particular character is repeated three times consecutively (n = 3); for example, in the input string abbbcbbb the number of times b occurs in a run of 3 letters is 2.
To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
To find the number of occurrences of an arbitrary sequence, put the first element in the first parentheses and the rest inside the lookahead. To find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))
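As a side note (not part of the original MATLAB answer), the same lookahead trick for counting overlapping matches works in Python's re module:
import re

s = "abbbcbbb"
print(len(re.findall(r"(?=bb)", s)))   # overlapping 2-grams of 'b' -> 4
print(len(re.findall(r"(?=bbb)", s)))  # overlapping 3-grams of 'b' -> 2
print(len(re.findall(r"b(?=da)", s)))  # occurrences of 'bda' -> 0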
Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);
You may try this piece of code that uses ismember (doc).
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%digram case
index_digram = ismember(str, 'aa');
%trigram case
index_trigram = ismember(str, 'aaa');
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);
This will find all n-grams and count them:
numberOfGrams = 5;
s = char(floor(rand(1,1000)*4)+double('a'));

ngrams = cell(1);
for n = 2:numberOfGrams
    strLength = size(s,2)-n+1;
    indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
    grams = s(indices);
    gramNumbers = (double(grams)-double('a'))*((ones(1,n)*n).^(0:n-1))';
    [uniqueGrams, gramInd] = unique(gramNumbers);
    count = hist(gramNumbers,uniqueGrams);
    ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
Edit: the result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears