How to change tokenization (huggingface)? - nlp

In an NER task we want to classify sentence tokens using different tagging schemes (BIO, for example). But we cannot rejoin subtokens when the tokenizer splits the sentence more finely than our labels.
I would like to classify the sentence 'weight 40.5 px' with a custom tokenization (split on spaces in this example).
But after tokenization
tokenizer.convert_ids_to_tokens(tokenizer(['weight', '40.5', 'px'], is_split_into_words=True)['input_ids'])
I got
['[CLS]', 'weight', '40', '.', '5', 'p', '##x', '[SEP]']
where '40.5' is split into separate tokens '40', '.', '5'. This is a problem for me, because I want to classify 3 tokens ('weight', '40.5', 'px'), but they are not merged back automatically, since '40', '.', '5' does not look like '40', '##.', '##5'.
What can I do to solve this problem?

You can recover the relation between the raw text and the tokenized tokens through “offset_mapping”.
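For pre-split input like the example above, a fast tokenizer also exposes word_ids(), which maps every subtoken back to the index of the original word, so the three word-level labels can be aligned with the subtokens. A minimal sketch (the checkpoint name bert-base-uncased is only an example):

from transformers import AutoTokenizer

# word_ids() requires a *fast* tokenizer; AutoTokenizer returns one by default here
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

words = ['weight', '40.5', 'px']
encoding = tokenizer(words, is_split_into_words=True)

tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
word_ids = encoding.word_ids()  # e.g. [None, 0, 1, 1, 1, 2, 2, None]

# every subtoken maps back to the index of the original space-separated word
for token, word_id in zip(tokens, word_ids):
    print(word_id, token)

With return_offsets_mapping=True you get character offsets instead, which serve the same purpose when the input is raw text rather than a word list.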

Related

Identifying phrases which contrast two corpora

I would like to identify compound phrases in one corpus (e.g. (w_1, w_2) in Corpus 1) which not only appear significantly more often than their constituents (e.g. (w_1),(w_2) in Corpus 1) within the corpus but also more than they do in a second corpus (e.g. (w_1, w_2) in Corpus 2). Consider the following informal example. I have the two corpora each consisting of a set of documents:
[['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy'], ...]
[['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic'], ...].
In this case, I would like new_york to be detected as a compound phrase. However, when corpus 2 is replaced by
[['i', 'go', 'to', 'new', 'york'], ['i', 'like', 'new', 'york'], ...],
I would like new_york to be relatively disregarded.
I could just use a ratio of n-gram scores between corresponding phrases in the two corpora, but I don't see how to scale that to general n. Normally, phrase detection for n-grams with n > 2 is done by recursing on n and gradually inserting compound phrases into the documents by thresholding a score function. This ensures that at step n, if you want to score the n-gram (w_1, ..., w_n), you can always normalize by the constituent m-grams for m < n. But with a different corpus, these are not guaranteed to appear.
A reference to the literature or a relevant hack will be appreciated.
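For concreteness, the recursive procedure described above can be sketched with gensim's Phrases (this only covers the single-corpus part, not the contrastive scoring; the min_count and threshold values are illustrative):

from gensim.models.phrases import Phrases

corpus1 = [['i', 'live', 'in', 'new', 'york'],
           ['new', 'york', 'is', 'busy']]

# pass 1: score bigrams against their constituent unigrams and merge those above a threshold
bigram = Phrases(corpus1, min_count=1, threshold=1.0)
corpus1_bigrams = [bigram[doc] for doc in corpus1]  # e.g. 'new york' -> 'new_york'

# pass 2 (the recursion on n): run the same detector over the bigram-merged corpus
trigram = Phrases(corpus1_bigrams, min_count=1, threshold=1.0)
corpus1_trigrams = [trigram[doc] for doc in corpus1_bigrams]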

What is the input format of fastText and why doesn't my model give me a meaningful similarity output?

My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".
I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.
Here is a sample of my document data:
[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
 ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you', 'while', 'you', 'sleep', 'do'],
 ['NOT', 'open', 'your', 'eyes'],
 ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
 ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show', 'Bird', 'Box', 'from', 'Netflix']]
I simply train data like this:
from gensim.models.fasttext import FastText
model = FastText(sentences_cleaned)
Consequently, I want to find the similarity between say, "rule" and this document.
model.wv.most_similar("rule")
However, fastText gives me this:
[('the', 0.1334390938282013),
('they', 0.12790171802043915),
('in', 0.12731242179870605),
('not', 0.12656228244304657),
('and', 0.11071767657995224),
('of', 0.08563747256994247),
('I', 0.06609072536230087),
('that', 0.05195673555135727),
('The', 0.002402491867542267),
('my', -0.009009800851345062)]
Surely it should have "rule" as the top similarity, since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but that doesn't work either.
Was my input format correct? I've seen lots of documents use the .cor or .bin format and I don't know what those are.
Thanks for any reply!
model.wv.most_similar('rule') asks that model's set of word vectors (.wv) to return the words most similar to 'rule'. That is, you've provided no document (multiple words) as a query, nor is there any way for the FastText model to return either a document itself or the name of any document. Only words, as it has done.
While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.
Those results may not look much like 'rule', but you'll only get good results from FastText (and similar word2vec algorithms) if you train them with lots of varied data showing many subtly contrasting, realistic uses of the relevant words.
How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)
In any real FastText/Word2Vec/etc. model, trained with adequate data/parameters, no single sentence (like your first sentence) can tell you much about what the results "should" be. Those only emerge from the full, rich dataset.
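If the actual goal is a word-to-document similarity score, one common approach (a sketch, not part of the answer above; it assumes gensim 4.x parameter names and enough training data to be meaningful) is to compare the word against the document's tokens with n_similarity, which takes the cosine between the mean vectors of the two word sets:

from gensim.models.fasttext import FastText

# sentences_cleaned is the tokenized corpus from the question; the parameter values are illustrative
model = FastText(sentences_cleaned, vector_size=100, epochs=10, min_count=1)

document = ['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household']

# cosine similarity between the mean vector of ['rule'] and the mean vector of the document's words
score = model.wv.n_similarity(['rule'], document)
print(score)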

How to sort list of strings without using any pre-defined function?

I am new to Python and I am stuck finding a solution to one problem.
I have a list like ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again'] which I want to sort without using any pre-defined function.
I have thought about it a lot but am not able to solve it properly.
Is there any short and elegant way to sort such a list of strings without using pre-defined functions?
Which algorithm is best suited to sorting a list of strings?
Thanks.
This sounds like you're learning about sorting algorithms. One of the simplest sorting methods is bubblesort. Basically, it's just making passes through the list and looking at each neighboring pair of values. If they're not in the right order, we swap them. Then we keep making passes through the list until there are no more swaps to make, then we're done. This is not the most efficient sort, but it is very simple to code and understand:
values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
def bubblesort(values):
    '''Sort a list of values using bubblesort.'''
    sorted = False
    while not sorted:
        sorted = True
        # take a pass through every pair of values in the list
        for index in range(0, len(values) - 1):
            if values[index] > values[index + 1]:
                # if the left value is greater than the right value, swap them
                values[index], values[index + 1] = values[index + 1], values[index]
                # also, this means the list was NOT fully sorted during this pass
                sorted = False

print(f'Original: {values}')
bubblesort(values)
print(f'Sorted: {values}')
## OUTPUT ##
# Original: ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
# Sorted: ['again', 'and', 'hello', 'makes', 'perfect', 'practice', 'world']
There are lots more sorting algorithms to learn about, and they each have different strengths and weaknesses - some are faster than others, some take up more memory, etc. It's fascinating stuff and worth it to learn more about Computer Science topics. But if you're a developer working on a project, unless you have very specific needs, you should probably just use the built-in Python sorting algorithms and move on:
values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
print(f'Original: {values}')
values.sort()
print(f'Sorted: {values}')
## OUTPUT ##
# Original: ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
# Sorted: ['again', 'and', 'hello', 'makes', 'perfect', 'practice', 'world']
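For comparison, another simple algorithm in the same spirit is insertion sort, which grows a sorted prefix by shifting each new value to the left until it fits (a sketch, not part of the original answer):

def insertion_sort(values):
    '''Sort a list of values in place using insertion sort.'''
    for index in range(1, len(values)):
        current = values[index]
        position = index
        # shift larger values one slot to the right until current fits
        while position > 0 and values[position - 1] > current:
            values[position] = values[position - 1]
            position -= 1
        values[position] = current

values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
insertion_sort(values)
print(values)
# ['again', 'and', 'hello', 'makes', 'perfect', 'practice', 'world']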

ArangoDB delimiter Analyzer

I want to create an analyzer that tokenizes characters instead of words.
For example, Foo would be tokenized to ['F', 'o', 'o'], so that the TF-IDF search is based on the frequency of characters instead of words.
I tried the following, but it doesn't seem to work:
a.save('emailAnalyzer1', 'delimiter', {local : 'en.UTF-8', case: 'upper', delimiter: '' , stopwords: ['#','+','.']})
Any help is much appreciated.
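One possible direction (not from the thread, and worth checking against the ArangoDB documentation): the delimiter analyzer only splits on a non-empty delimiter string, so an empty delimiter will not produce per-character tokens; an ngram analyzer with min = max = 1 is a closer fit. A sketch using python-arango's create_analyzer, with hypothetical connection details and analyzer name:

from arango import ArangoClient

client = ArangoClient(hosts='http://localhost:8529')
db = client.db('_system', username='root', password='')  # hypothetical credentials

# ngram analyzer emitting single-character tokens, e.g. 'Foo' -> 'F', 'o', 'o'
db.create_analyzer(
    name='charAnalyzer',  # hypothetical name
    analyzer_type='ngram',
    properties={'min': 1, 'max': 1, 'preserveOriginal': False, 'streamType': 'utf8'},
    features=['frequency', 'norm', 'position'],
)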

Convert string in dictionary format to dictionary using python

I have a string in dictionary format like:
{'fossils': [{Synset('dodo.n.01'): {'of', 'is', 'out', 'someone', 'fashion',
'whose', 'style'},Synset('fossil.n.02'): {'age', 'that', 'an', 'has', 'past',
'from', 'excavated', 'plant', '(',')', 'and', 'animal', 'in', 'remains',
'geological', 'soil', 'existed', 'impression', 'of', 'or', 'been', 'the', 'a'},
Synset('fossil.a.01'): {'of', 'a', 'fossil', 'characteristic'}}],
'disturbing': [{Synset('disturb.v.01'): {'me', 'This', 'upset', 'thought','book',
'deeply','move', 'troubling', 'A'}, Synset('agitate.v.06'): {'of', 'or',
'arrangement', 'the', 'position', 'change'}, Synset('touch.v.11'): {'my', '!',
'touch', 'Do', 'tamper', 'with', 'CDs', "n't"}, Synset('interrupt.v.02'): {'of',
'or', 'peace', 'tranquility', 'destroy', 'the', 'I', 'me', 'interrupt', 'Do',
'reading', 'when', "'m", "n't"}}]}
I want to convert this into a dictionary. The format of the dictionary is
{key: list of dictionaries as value}
Please help me to sort this out
Thanks
You have "objects" (called Synset) inside the string. Usually, you can do this sort of conversion with json.loads(str) to get the dictionary back.
Because you have the objects, you need to fix this manually (pre-process the string prior to json.loads) and then post-process to get the objects back.
Edit: moreover, you have multiple types of parentheses, which might hamper things when loading the string.
Edit 2: there is of course another option (if your Synset class is known and you trust its source):
import Synset # for swift conversion
dict_val = eval(dict_like_str)
I'll also add these lines to my original answer.
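A concrete variant of the eval route (a sketch, not from the answer above; it assumes the Synset(...) entries are NLTK WordNet identifiers, which names like 'dodo.n.01' suggest, and that you trust the string's source, since eval executes arbitrary code): bind the name Synset to NLTK's lookup function while evaluating, so each Synset('dodo.n.01') in the string resolves to a real Synset object.

from nltk.corpus import wordnet as wn

# map the name Synset onto nltk's lookup, so eval turns
# Synset('dodo.n.01') in the string into wn.synset('dodo.n.01')
parsed = eval(dict_like_str, {'Synset': wn.synset})

print(type(parsed))         # <class 'dict'>
print(list(parsed.keys()))  # ['fossils', 'disturbing']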
