Separate the words in a sentence for a text classification problem - python-3.x

I am solving a text classification problem, and while annotating my data I found very long "words" that are actually whole sentences with no spaces between the words.
One example I found while annotating a data point is:
Throughnumerousacquisitionsandtransitions,Anacompstillexiststodaywithagreaterfocusondocumentmanagement
Desired output:
Through numerous acquisitions and transitions, Anacomp still exists today with a greater focus on document management.
I have looked at various frameworks such as Keras and PyTorch to see if they provide any functionality to solve this issue, but I couldn't find anything.

The problem that you are trying to solve is text/word segmentation. It is possible to approach this with machine learning, using a sequence model (such as an LSTM) on top of word embeddings (such as BERT).
This link details such an approach for Chinese. Chinese does not use spaces, so this kind of segmentation is a necessary preprocessing step in Chinese NLP tasks.
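For illustration only, here is a toy sketch of that idea (my own assumption of a setup, not taken from the linked work): treat segmentation as per-character tagging, where a bidirectional LSTM predicts for every character whether it starts a new word. The training data and training loop are omitted.
import torch
import torch.nn as nn

class CharSegmenter(nn.Module):
    def __init__(self, n_chars, char_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)   # per character: "starts a word" vs. "continuation"

    def forward(self, char_ids):              # char_ids: [batch, seq_len] of character indices
        hidden_states, _ = self.lstm(self.embed(char_ids))
        return self.out(hidden_states)        # logits: [batch, seq_len, 2]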
I would like to describe an automaton-based approach using the Aho-Corasick algorithm.
First, install the package: pip install pyahocorasick
For the sake of demonstration I'm using only the words in your input string as the dictionary. In a real-world scenario you could use a large dictionary of words from something like WordNet.
import ahocorasick

automaton = ahocorasick.Automaton()

input = 'Throughnumerousacquisitionsandtransitions, Anacompstillexiststodaywithagreaterfocusondocumentmanagement'

# Replace this with a large dictionary of words
word_dictionary = ['Through', 'numerous', 'acquisition', 'acquisitions', 'and', 'transitions', 'Anacomp', 'still',
                   'exists', 'today', 'with', 'a', 'greater', 'focus', 'on', 'document', 'management']

# add dictionary words to the automaton
for idx, key in enumerate(word_dictionary):
    automaton.add_word(key, (idx, key))

# Build the Aho-Corasick automaton for searching
automaton.make_automaton()

# to check for ambiguity: if there is a longer match then prefer that
previous_rng = range(0, 0)
previous_rs = set(previous_rng)

# Holds the end result dictionary
result = {}

# search the input using the automaton
for end_index, (insert_order, original_value) in automaton.iter(input):
    start_index = end_index - len(original_value) + 1
    current_rng = range(start_index, end_index)
    current_rs = set(current_rng)
    # ignore previous as there is a longer match available
    if previous_rs.issubset(current_rs):
        # remove the ambiguous short entry in favour of the longer entry
        if previous_rng in result:
            del result[previous_rng]
        result[current_rng] = (insert_order, original_value)
        previous_rng = current_rng
        previous_rs = current_rs
    # if there is no overlap of indices, then it's a new token; add it to result
    elif previous_rs.isdisjoint(current_rs):
        previous_rng = current_rng
        previous_rs = current_rs
        result[current_rng] = (insert_order, original_value)
    # ignore current as it is a subset of previous
    else:
        continue
    assert input[start_index:start_index + len(original_value)] == original_value

for x in result:
    print(x, result[x])
This produces the following results:
range(0, 6) (0, 'Through')
range(7, 14) (1, 'numerous')
range(15, 26) (3, 'acquisitions')
range(27, 29) (4, 'and')
range(30, 40) (5, 'transitions')
range(43, 49) (6, 'Anacomp')
range(50, 54) (7, 'still')
range(55, 60) (8, 'exists')
range(61, 65) (9, 'today')
range(66, 69) (10, 'with')
range(71, 77) (12, 'greater')
range(78, 82) (13, 'focus')
range(83, 84) (14, 'on')
range(85, 92) (15, 'document')
range(93, 102) (16, 'management')
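If you want the segmented sentence itself rather than the ranges, a small follow-up sketch (working on the result dictionary built above) can sort the matches by their start index and join them. Note that punctuation (and, as the results above show, the clobbered single-letter word 'a') does not survive:
tokens = [value for rng, (order, value) in sorted(result.items(), key=lambda kv: kv[0].start)]
print(' '.join(tokens))
# Through numerous acquisitions and transitions Anacomp still exists today with greater focus on document management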


Getting similarity score with spacy and a transformer model

I've been using spacy's en_core_web_lg and wanted to try out en_core_web_trf (the transformer model), but I'm having some trouble wrapping my head around the differences in model/pipeline usage.
My use case looks like the following:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
s1 = nlp("Running for president is probably hard.")
s2 = nlp("Space aliens lurk in the night time.")
s1.similarity(s2)
Output:
The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements.
(0.0, Space aliens lurk in the night time.)
Looking at this post, the transformer model does not have word vectors in the same way en_core_web_lg does, but you can get the embeddings via s1._.trf_data.tensors, which look like:
s1._.trf_data.tensors[0].shape
(1, 9, 768)
s1._.trf_data.tensors[1].shape
(1, 768)
So I tried to manually take the cosine similarity (using this post as ref):
def similarity(obj1, obj2):
    (v1, t1), (v2, t2) = obj1._.trf_data.tensors, obj2._.trf_data.tensors
    try:
        return ((1 - cosine(v1, v2)) + (1 - cosine(t1, t2))) / 2
    except:
        return 0.0
But this does not work.
As @polm23 mentioned, using sentence-transformers is a better approach for getting sentence similarity.
First install the package: pip install sentence-transformers
Then use this code:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Running for president is probably hard.","Space aliens lurk in the night time."]
embedded_list = model.encode(sentences)
similarity = cos_sim(embedded_list[0],embedded_list[1])
But if you are determined to use spaCy for sentence similarity, be aware that the reason your code does not work is that v1 and v2 don't have the same shape, as you can see:
s1._.trf_data.tensors[0].shape --> (1, 9, 768)
s2._.trf_data.tensors[0].shape --> (1, 11, 768)
So it's not possible to get similarity between these two arrays.
s1._.trf_data.tensors is a tuple consisting of two arrays:
s1._.trf_data.tensors[0] gives an array of shape (1, 9, 768), which contains one 768-dimensional vector for each of the 9 tokens.
s1._.trf_data.tensors[1] gives an array of shape (1, 768) for the whole sentence.
So you can get a similarity score from the sentence-level tensors. Note that scipy's cosine() is a distance measure and expects 1-D vectors, so:
similarity = 1 - cosine(s1._.trf_data.tensors[1].flatten(), s2._.trf_data.tensors[1].flatten())
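For completeness, a minimal self-contained sketch (assuming scipy is available; again, scipy's cosine() returns a distance, so the similarity is one minus it):
import spacy
from scipy.spatial.distance import cosine

nlp = spacy.load("en_core_web_trf")
s1 = nlp("Running for president is probably hard.")
s2 = nlp("Space aliens lurk in the night time.")

# tensors[1] holds one pooled (1, 768) vector per text; flatten it to 1-D for scipy
v1 = s1._.trf_data.tensors[1].flatten()
v2 = s2._.trf_data.tensors[1].flatten()
print(1 - cosine(v1, v2))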

Is there a way to combine two 2D arrays using a stored key?

I'm writing a python script to read in a netCDF file in Ugrid format. This requires reading in two 2D arrays:
x_coordinate = [[0,0],[1,200],[2,400],[3,600],[4,800]...]
y_coordinate = [[0,0],[1,5],[2,10],[3,15],[4,20]...]
and outputting an array:
coordinates = [[0,0],[200,5],[400,10],[600,15],[800,20]...]
so that I can then display it through mpl. Is there a way to do this efficiently without iterating through with comparative if statements?
In my opinion, zip() with a list comprehension should solve your problem.
Example as follows:
>>> list(zip([el[1] for el in x_coordinate], [el[1] for el in y_coordinate]))
[(0, 0), (200, 5), (400, 10), (600, 15), (800, 20)]
Can you try:
[[x[0][1], x[1][1]] for x in zip(x_coordinate, y_coordinate)]
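Since the data comes from a netCDF file, the arrays are often already NumPy arrays. Assuming the two arrays are aligned row by row on the key column (as both answers above assume), a vectorized sketch avoids Python-level loops entirely:
import numpy as np

x_coordinate = np.array([[0, 0], [1, 200], [2, 400], [3, 600], [4, 800]])
y_coordinate = np.array([[0, 0], [1, 5], [2, 10], [3, 15], [4, 20]])

# take the value column of each array and stack them side by side
coordinates = np.column_stack((x_coordinate[:, 1], y_coordinate[:, 1]))
print(coordinates)
# [[  0   0]
#  [200   5]
#  [400  10]
#  [600  15]
#  [800  20]]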

Implementing word dropout in pytorch

I want to add word dropout to my network so that I can have sufficient training examples for training the embedding of the "unk" token. As far as I'm aware, this is standard practice. Let's assume the index of the unk token is 0, and the index for padding is 1 (we can switch them if that's more convenient).
This is a simple CNN network which implements word dropout the way I would have expected it to work:
class Classifier(nn.Module):
    def __init__(self, params):
        super(Classifier, self).__init__()
        self.params = params
        self.word_dropout = nn.Dropout(params["word_dropout"])
        self.pad = torch.nn.ConstantPad1d(max(params["window_sizes"]) - 1, 1)
        self.embedding = nn.Embedding(params["vocab_size"], params["word_dim"], padding_idx=1)
        self.convs = nn.ModuleList([nn.Conv1d(1, params["feature_num"], params["word_dim"] * window_size,
                                              stride=params["word_dim"], bias=False)
                                    for window_size in params["window_sizes"]])
        self.dropout = nn.Dropout(params["dropout"])
        self.fc = nn.Linear(params["feature_num"] * len(params["window_sizes"]), params["num_classes"])

    def forward(self, x, l):
        x = self.word_dropout(x)
        x = self.pad(x)
        embedded_x = self.embedding(x)
        embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
        features = [F.relu(conv(embedded_x)) for conv in self.convs]
        pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
        pooled = torch.cat(pooled, 1)
        pooled = self.dropout(pooled)
        logit = self.fc(pooled)
        return logit
Don't mind the padding - PyTorch doesn't have an easy way of using non-zero padding in CNNs, much less trainable non-zero padding, so I'm doing it manually. Dropout also doesn't let me fill dropped positions with anything other than zero, and I want to keep the padding token separate from the unk token. I'm keeping it in my example because it's the reason this question exists.
This doesn't work because dropout wants FloatTensors so that it can scale them properly, while my input is LongTensors that don't need to be scaled.
Is there an easy way of doing this in PyTorch? I essentially want to use LongTensor-friendly dropout (bonus: even better if it lets me specify a dropout constant that isn't 0, so that I could use zero padding).
Actually I would do it outside of your model, before converting your input into a LongTensor.
This would look like this:
import random

def add_unk(input_token_id, p):
    # random.random() gives you a value between 0 and 1
    # to avoid switching your padding to 0 we add 'input_token_id > 1'
    if random.random() < p and input_token_id > 1:
        return 0
    else:
        return input_token_id

# then you have your input token_id
# for this example I just take a random number, let's say 127
input_token_id = 127

# let p be your probability for UNK
p = 0.01

your_input_tensor = torch.LongTensor([add_unk(input_token_id, p)])
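Applied to a whole (made-up) list of token ids, that would look like:
token_ids = [5, 127, 73, 1, 1]   # made-up ids; 1 is the padding index
your_input_tensor = torch.LongTensor([add_unk(t, p=0.01) for t in token_ids])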
Edit:
So there are two options that come to my mind, both of which are actually GPU-friendly. In general, both should be much more efficient than the approach above.
Option one - doing the computation directly in forward():
If you're not using torch.utils and don't plan to use it later, this is probably the way to go.
Instead of doing the computation beforehand, we just do it in the forward() method of the main PyTorch class. However, I see no (simple) way of doing this in torch 0.3.1, so you would need to upgrade to version 0.4.0:
So imagine x is your input vector:
>>> x = torch.tensor(range(10))
>>> x
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
probs is a vector of uniform random values, which we can later check against our dropout probability:
>>> probs = torch.empty(10).uniform_(0, 1)
>>> probs
tensor([ 0.9793, 0.1742, 0.0904, 0.8735, 0.4774, 0.2329, 0.0074,
0.5398, 0.4681, 0.5314])
Now we apply the dropout probabilities probs on our input x:
>>> torch.where(probs > 0.2, x, torch.zeros(10, dtype=torch.int64))
tensor([ 0, 0, 0, 3, 4, 5, 0, 7, 8, 9])
Note: To see some effect I chose a dropout probability of 0.2 here. In reality you probably want it to be smaller.
You can pick any token/id you like for this; here is an example with 42 as the unknown token id:
>>> unk_token = 42
>>> torch.where(probs > 0.2, x, torch.empty(10, dtype=torch.int64).fill_(unk_token))
tensor([ 0, 42, 42, 3, 4, 5, 42, 7, 8, 9])
torch.where comes with PyTorch 0.4.0:
https://pytorch.org/docs/master/torch.html#torch.where
I don't know about the shapes of your network, but your forward() should look something like this then (when using mini-batching you need to flatten the input before applying dropout):
def forward_train(self, x, l):
    # probabilities
    probs = torch.empty(x.size(0)).uniform_(0, 1)
    # applying word dropout
    x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))
    # continue like before ...
    x = self.pad(x)
    embedded_x = self.embedding(x)
    embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
    features = [F.relu(conv(embedded_x)) for conv in self.convs]
    pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
    pooled = torch.cat(pooled, 1)
    pooled = self.dropout(pooled)
    logit = self.fc(pooled)
    return logit
Note: I named the function forward_train(), so you should use another forward() without dropout for evaluation/prediction. But you could also gate the dropout on the model's training mode inside a single forward(), as in the sketch below.
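As a rough sketch of that variant (the "word_dropout" parameter key and the unk/padding indices 0 and 1 follow the question's conventions; this is an untested assumption, not a drop-in implementation):
def forward(self, x, l):
    if self.training:
        # one uniform random number per token
        probs = torch.empty_like(x, dtype=torch.float).uniform_(0, 1)
        # replace dropped ids with the unk index 0, but never touch the padding index 1
        x = torch.where((probs < self.params["word_dropout"]) & (x != 1),
                        torch.zeros_like(x), x)
    # ... continue with padding, embedding, convolutions and pooling as before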
Option two: using torch.utils.data.Dataset:
If you're using the Dataset class provided by torch.utils, this kind of pre-processing is very easy to do efficiently: the DataLoader that wraps a Dataset can load samples in multiple worker processes, so the code sample above just has to be executed in the __getitem__ method of your Dataset class.
This could look like this:
def __getitem__(self, index):
    'Generates one sample of data'
    # Select sample
    ID = self.input_tokens[index]

    # Load data and get label,
    # using the add_unk function from the code above
    X = torch.LongTensor([add_unk(ID, p=0.01)])
    y = self.targets[index]

    return X, y
This is a bit out of context and doesn't look very elegant, but I think you get the idea. According to this blog post by Shervine Amidi of Stanford, it should be no problem to do more complex pre-processing steps in this function:
Since our code [Dataset is meant] is designed to be multicore-friendly, note that you can do more complex operations instead (e.g. computations from source files) without worrying that data generation becomes a bottleneck in the training process.
The linked blog post, "A detailed example of how to generate your data in parallel with PyTorch", also provides a good guide for implementing data generation with Dataset and DataLoader.
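A typical (hypothetical) usage then just wraps the Dataset in a DataLoader, which handles batching and worker processes:
from torch.utils.data import DataLoader

# my_dataset is assumed to be an instance of the Dataset class sketched above
loader = DataLoader(my_dataset, batch_size=32, shuffle=True, num_workers=4)
for X_batch, y_batch in loader:
    pass  # run your usual training step here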
I guess you'll prefer option one - only two lines and it should be very efficient. :)
Good luck!

What is the tfidf matrix giving, ideally?

When I ran tf-idf on a set of documents, it returned a tf-idf matrix that looked like this:
(1, 12) 0.656240233446
(1, 11) 0.754552023393
(2, 6) 1.0
(3, 13) 1.0
(4, 2) 1.0
(7, 9) 1.0
(9, 4) 0.742540927053
(9, 5) 0.66980069547
(11, 19) 0.735138466738
(11, 7) 0.677916982176
(12, 18) 1.0
(13, 14) 0.697455191865
(13, 11) 0.716628394177
(14, 5) 1.0
(15, 8) 1.0
(16, 17) 1.0
(18, 1) 1.0
(19, 17) 1.0
(22, 13) 1.0
(23, 3) 1.0
(25, 6) 1.0
(26, 19) 0.476648253537
(26, 7) 0.879094103268
(28, 10) 0.532672175403
(28, 7) 0.523456282204
I want to know what this is; I am not able to understand how it is laid out.
When I was in debug mode I came across indices, indptr and data... these things somehow correlate with the data given. What are they?
The numbers are confusing: assuming the first element in the parentheses is the document index, I don't see the 0th, 5th or 6th documents at all.
Kindly help me figure out how this works. I know the general idea of tf-idf from the wiki (taking the log of the inverse document frequency and so on); I just want to know what these three different kinds of numbers are and what they refer to.
The source code is:
# This contains the list of file names
_filenames = []
# This contains the list of contents/text in the files
_contents = []
# This is a dict of filename: content
_file_contents = {}

class KmeansClustering():
    def kmeansClusters(self):
        global _report
        self.num_clusters = 5
        km = KMeans(n_clusters=self.num_clusters)
        vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
        self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
        km.fit(self.tfidf_matrix)
        self.clusters = km.labels_.tolist()
        joblib.dump(km, 'doc_cluster2.pkl')
        km = joblib.load('doc_cluster2.pkl')

class TokenizingAndPanda():
    def tokenize_only(self, text):
        '''
        This function tokenizes the text
        :param text: the text that you want to tokenize
        :return: the filtered tokens
        '''
        # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens

    def tokenize_and_stem(self, text):
        # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        stems = [_stemmer.stem(t) for t in filtered_tokens]
        return stems

    def getFilnames(self):
        '''
        :return:
        '''
        global _path
        global _filenames
        path = _path
        _filenames = FileAccess().read_all_file_names(path)

    def getContentsForFilenames(self):
        global _contents
        global _file_contents
        for filename in _filenames:
            content = FileAccess().read_the_contents_from_files(_path, filename)
            _contents.append(content)
            _file_contents[filename] = content

    def createPandaVocabFrame(self):
        global _totalvocab_stemmed
        global _totalvocab_tokenized
        # Enable this if you want to load the filenames and contents from a file structure.
        # self.getFilnames()
        # self.getContentsForFilenames()
        # for name, i in _file_contents.items():
        #     print(name)
        #     print(i)
        for i in _contents:
            allwords_stemmed = self.tokenize_and_stem(i)
            _totalvocab_stemmed.extend(allwords_stemmed)
            allwords_tokenized = self.tokenize_only(i)
            _totalvocab_tokenized.extend(allwords_tokenized)
        vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
        print(vocab_frame)
        return vocab_frame

class TfidfProcessing():
    def getTfidFPropertyData(self):
        tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
                                           min_df=0.02, stop_words='english',
                                           use_idf=True, tokenizer=TokenizingAndPanda().tokenize_and_stem,
                                           ngram_range=(1, 1))
        # print(_contents)
        tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
        terms = tfidf_vectorizer.get_feature_names()
        dist = 1 - cosine_similarity(tfidf_matrix)
        return tfidf_matrix, terms, dist
The result of applying tf-idf to data is a 2D matrix A, where A_ij is the normalized frequency (tf-idf weight) of the j-th term (word) in the i-th document. What you see in your output is the sparse representation of this matrix; in other words, only the non-zero elements are printed out, so:
(1, 12) 0.656240233446
means that the 12th word (according to the vocabulary built by sklearn) has a normalized frequency of 0.656240233446 in document 1. The "missing" bits are zero, meaning, for example, that the 3rd word cannot be found in document 1 (since there is no (1, 3)), and so on.
The fact that some documents appear to be missing entirely comes from your particular code and data: several TfidfVectorizer settings can drop terms (a hand-set vocabulary, max_features, min_df, max_df), and a document whose terms are all dropped has no non-zero entries to print. In your code, min_df=0.02 removes very rare words and max_df=0.4 removes very common ones, which is a likely cause.
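To make the sparse layout concrete, here is a small self-contained sketch (with a made-up three-document corpus, not your data) that prints the sparse view, the dense matrix, and the CSR attributes (data, indices, indptr) you saw in the debugger:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs say nothing"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)      # a scipy CSR sparse matrix

print(tfidf_matrix)                                # sparse view: (doc_index, term_index)  tfidf_value
print(tfidf_matrix.toarray())                      # dense n_documents x n_terms matrix
print(vectorizer.get_feature_names_out())          # term_index -> word (get_feature_names() in older sklearn)

# CSR internals: data holds the non-zero values, indices the term index of each value,
# and indptr the offset where each document's entries start in data/indices
print(tfidf_matrix.data, tfidf_matrix.indices, tfidf_matrix.indptr)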

python3 use map to convert list of strings to tuples

I am converting a user-input list of strings into tuples. The user inputs a list of fractions, i.e. (please, no "import fractions" suggestions):
fractions = ["1/2","3/5","4/3","3/8","1/9","4/7"]
I would normally use the following code, which works:
user_input = 0
list_frac = []
print('Enter fractions into a list until you type "stop" in lower case:')
while user_input != 'stop':
    user_input = input('Enter a fraction ie: "1/2" >>>')
    list_frac.append(user_input)
list_frac.pop()  # pop "stop" off the list

result = []
for i in list_frac:
    result.append(tuple(i.split('/')))
print(result)
The result is a list of tuples:
fractions = [('1','2'),('3','5'),('4','3'),('3','8'),('1','9'),('4','7')]
I want to change the values in the tuples to integers as well, and I don't know how.
I also wish to learn lambda functions, so I am practicing on simple code like this. This is my attempt at the same code using lambda syntax:
tup_result = tuple(map(lambda i: result.append(i.split('/')), result))
But the result is an empty list, with no errors to help me.
The question: how do I change the strings in the list of tuples to ints, and then how do I accomplish all of this with a one-line lambda expression?
Any suggestions? I have the general concept of a lambda function down, but actually implementing it is a little confusing. Thanks for the help, folks!
I used comprehensions to solve the task:
fractions = ["1/2","3/5","4/3","3/8","1/9","4/7"]
print([(int(x),int(y)) for (x,y) in [k.split('/') for k in fractions]])
>>>[(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]
I started with Python not long ago myself and was confused about how to use lambda in the beginning as well. Then I read that Guido van Rossum had suggested that lambda forms would disappear in Python 3.0 (AlternateLambdaSyntax); since then I have hardly used lambda and have had no problem with that at all. You have to understand how it works when you see it in code, but you can almost always write more readable code without using lambda (though I may be wrong). I hope this helps.
Update
There is a solution with map() and lambda, though I wouldn't wish it on my worst enemy:
print([(int(x),int(y)) for [x,y] in list(map(lambda frac: frac.split('/'),fractions))])
>>>[(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]
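For what it's worth, a more direct map/lambda one-liner (a sketch of the same idea: convert each fraction string straight to a tuple of ints) would be:
result = list(map(lambda frac: tuple(map(int, frac.split('/'))), fractions))
print(result)
# [(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]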
