Multiprocessing error for NLP application - python-3.x

I'm working on an NLP project with a massive dataset of 180 million words. Before I begin training I want to correct the spelling of the words, using TextBlob's spell correction. Because TextBlob is slow to begin with, correcting the spelling of 180 million words sequentially would take an extremely long time. So here is my approach (code follows below):
1. Load the corpus.
2. Split the corpus into a list of sentences using the NLTK sentence tokenizer.
3. Multiprocessing: apply the correction function to every item of the list generated in step 2.
Here is my code:
import codecs
import multiprocessing
import nltk
from textblob import TextBlob
from nltk.tokenize import sent_tokenize

class SpellCorrect():

    def __init__(self):
        pass

    def load_data(self, path):
        with codecs.open(path, "r", "utf-8") as file:
            data = file.read()
        return sent_tokenize(data)

    def correct_spelling(self, data):
        data = TextBlob(data)
        return str(data.correct())

    def merge_cleaned_corpus(self, result, path):
        result = " ".join(temp for temp in result)
        with codecs.open(path, "a", "utf-8") as file:
            file.write(result)

if __name__ == "__main__":
    SpellCorrect = SpellCorrect()
    data = SpellCorrect.load_data(path)
    correct_spelling = SpellCorrect.correct_spelling
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    result = pool.apply_async(correct_spelling, (data, ))
    result = result.get()
    SpellCorrect.merge_cleaned_corpus(tuple(result), path)
When I run this, I get the following error:
_pickle.PicklingError: Can't pickle <class '__main__.SpellCorrect'>: it's not the same object as __main__.SpellCorrect
This error is generated at the line in my code that says result = result.get()
My (probably wrong) guess is that the parallel processing step completed successfully and applied my clean-up to every sentence, but that I'm unable to retrieve the results.
Can someone tell me why this error is raised, and what I can do to fix it? Thanks in advance!
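For reference, here is a minimal sketch of one common way to avoid this kind of pickling error: keep the worker function at module level (so it pickles cleanly), avoid shadowing the class name with an instance, and let pool.map distribute the sentences instead of sending the whole list as a single task. The file names are placeholders:

import multiprocessing
from textblob import TextBlob
from nltk.tokenize import sent_tokenize

def correct_spelling(sentence):
    # Module-level functions pickle cleanly, unlike methods of a class
    # whose name has been shadowed by an instance.
    return str(TextBlob(sentence).correct())

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:   # placeholder path
        sentences = sent_tokenize(f.read())

    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # map() splits the sentence list across the workers, one sentence per task
        corrected = pool.map(correct_spelling, sentences)

    with open("corrected.txt", "w", encoding="utf-8") as f:  # placeholder path
        f.write(" ".join(corrected))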

Related

How to create a custom parallel corpus for machine translation with recent versions of pytorch and torchtext?

I am trying to train a model for NMT on a custom dataset. I found this great tutorial on YouTube, along with the accompanying repo, but it uses an old version of PyTorch and torchtext. More recent versions of torchtext have removed the Field and BucketIterator classes.
I looked for more recent tutorials. The closest thing I could find was this Medium post (again with accompanying code), which works with a custom dataset for text classification. I tried to replicate the code for my problem and got this far:
from os import PathLike
from torch.utils.data import Dataset
from torchtext.vocab import Vocab
import pandas as pd
from .create_vocab import tokenizer

class ParallelCorpus(Dataset):
    """A parallel corpus for training a machine translation model"""

    def __init__(self,
                 corpus_path: str | PathLike,
                 source_vocab: Vocab,
                 target_vocab: Vocab
                 ):
        super().__init__()
        self.corpus = pd.read_csv(corpus_path)
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab

    def __len__(self):
        return len(self.corpus)

    def __getitem__(self, index: int):
        source_sentence = self.corpus.iloc[index, 0]
        source = [self.source_vocab["<sos>"]]
        source.extend(
            self.source_vocab.lookup_indices(tokenizer(source_sentence))
        )
        source.append(self.source_vocab["<eos>"])

        target_sentence = self.corpus.iloc[index, 1]
        target = [self.target_vocab["<sos>"]]
        target.extend(
            self.target_vocab.lookup_indices(tokenizer(target_sentence))
        )
        target.append(self.target_vocab["<eos>"])
        return source, target
My question is: is this the correct way to implement a parallel corpus for PyTorch? And where can I find more information about this, since the documentation wasn't much help?
Thank you in advance, and sorry if this is against the rules.
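I can't say whether this is the canonical approach, but as a sketch of how such a Dataset is typically consumed: pad the variable-length source/target index lists in a collate function and hand the dataset to a DataLoader. This assumes source_vocab and target_vocab are already built and contain a <pad> token; the CSV path and batch size below are placeholders:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def make_collate_fn(source_vocab, target_vocab):
    src_pad = source_vocab["<pad>"]   # assumes both vocabs contain a <pad> token
    tgt_pad = target_vocab["<pad>"]

    def collate(batch):
        sources, targets = zip(*batch)
        # Convert the index lists to tensors and pad to the longest sequence in the batch.
        sources = pad_sequence([torch.tensor(s) for s in sources], padding_value=src_pad)
        targets = pad_sequence([torch.tensor(t) for t in targets], padding_value=tgt_pad)
        return sources, targets

    return collate

dataset = ParallelCorpus("corpus.csv", source_vocab, target_vocab)  # class defined above
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    collate_fn=make_collate_fn(source_vocab, target_vocab))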

Deal with Out of vocabulary word with Gensim pretrained GloVe

I am working on an NLP assignment and loaded the GloVe vectors provided by Gensim:
import gensim.downloader
glove_vectors = gensim.downloader.load('glove-twitter-25')
I am trying to get the word embedding for each word in a sentence, but some of them are not in the vocabulary.
What is the best way to deal with it working with the Gensim API?
Thanks!
Load the model:
import gensim.downloader as api
model = api.load("glove-twitter-25") # load glove vectors
# model.most_similar("cat") # show words that similar to word 'cat'
There is a very simple way to find out whether a word exists in the model's vocabulary (the KeyedVectors returned by api.load expose key_to_index in gensim 4, replacing the old vocab dict):
print('Word exists' if word in model.key_to_index else 'Word does not exist')
Apart from that, I have used the following logic to create a sentence embedding (25-dim) from N tokens:
import numpy as np
from Levenshtein import ratio as lev_ratio

def vocab_check(model, word):
    # Find the neighbour that looks most like the word (by Levenshtein ratio)
    # and use the model's similarity score for that neighbour as a weight.
    similar_words = model.most_similar(word)
    match_ratio = 0.
    match_word = ''
    for sim_word, sim_score in similar_words:
        ratio = lev_ratio(word, sim_word)
        if ratio > match_ratio:
            match_ratio = ratio
            match_word = sim_word
    if match_word == '':
        return similar_words[0][1]
    return model.similarity(word, match_word)

def sentence2vector(model, sent, dim=25):
    words = sent.split(' ')
    # Note: model[w] assumes the (stripped) token can be looked up;
    # a completely unknown key would raise a KeyError here.
    emb = [model[w.strip()] for w in words]
    # In-vocabulary words get weight 1, the rest are down-weighted.
    weights = [1. if w in model.key_to_index else vocab_check(model, w)
               for w in words]
    if len(emb) == 0:
        sent_vec = np.zeros(dim, dtype=np.float16)
    else:
        sent_vec = np.dot(weights, emb)
    sent_vec = sent_vec.astype("float16")
    return sent_vec
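If fuzzy matching is more than you need, a simpler common pattern is to skip OOV tokens and average whatever in-vocabulary vectors remain. A minimal sketch, assuming gensim 4's KeyedVectors API:

import numpy as np
import gensim.downloader as api

model = api.load("glove-twitter-25")

def sentence_vector(model, sentence):
    # Keep only tokens the model actually knows; key_to_index is the
    # gensim 4 replacement for the old .vocab dict.
    vectors = [model[w] for w in sentence.split() if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)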

How to pre-process data before pandas.read_csv()

I have a slightly broken CSV file that I want to pre-process before reading it with pandas.read_csv(), i.e. do some search/replace on it.
I tried to open the file and do the pre-processing in a generator that I then hand over to read_csv():
import re
import pandas as pd

def in_stream():
    with open("some.csv") as csvfile:
        for line in csvfile:
            l = re.sub(r'","', r',', line)
            yield l

df = pd.read_csv(in_stream())
Sadly, this just throws a
ValueError: Invalid file path or buffer object type: <class 'generator'>
Although, when looking at Pandas' source, I'd expect it to be able to work on iterators, and thus generators.
I only found this article (Using a custom object in pandas.read_csv()), outlining how to wrap a generator in a file-like object, but it seems to only work with files in byte mode.
So in the end I'm looking for a pattern to build a pipeline that opens a file, reads it line-by-line, allows pre-processing and then feeds it into e.g. pandas.read_csv().
After further investigation of Pandas' source, it became apparent that it doesn't simply require an iterable, but also wants it to be a file, expressed by having a read method (is_file_like() in inference.py).
So, I built a generator the old way:
import re

class InFile(object):
    def __init__(self, infile):
        self.infile = open(infile)

    def __next__(self):
        return self.next()

    def __iter__(self):
        return self

    def read(self, *args, **kwargs):
        return self.__next__()

    def next(self):
        try:
            line: str = self.infile.readline()
            line = re.sub(r'","', r',', line)  # do some fixing
            return line
        except:
            self.infile.close()
            raise StopIteration
This works in pandas.read_csv():
df = pd.read_csv(InFile("some.csv"))
To me this looks super complicated and I wonder if there is any better (→ more elegant) solution.
Here's a solution that will work for smaller CSV files. All lines are first read into memory, processed, and concatenated. This will probably perform badly for larger files.
import re
from io import StringIO
import pandas as pd

with open('file.csv') as file:
    lines = [re.sub(r'","', r',', line) for line in file]

# each line already ends with '\n', so join without inserting extra newlines
df = pd.read_csv(StringIO(''.join(lines)))
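For larger files, an untested sketch of a streaming variant: a small file-like wrapper whose read() honours the requested size, so pandas can pull the fixed-up data chunk by chunk instead of materialising everything first. The regex fix and file name are the same placeholders as above:

import re
import pandas as pd

class FixedCSV:
    """Minimal file-like wrapper: fixes each line lazily as pandas reads it."""

    def __init__(self, path):
        self._lines = (re.sub(r'","', r',', line) for line in open(path))
        self._buffer = ""

    def __iter__(self):
        return self._lines

    def read(self, size=-1):
        if size is None or size < 0:
            return self._buffer + "".join(self._lines)  # drain everything
        # Fill the buffer until we can hand back `size` characters (or hit EOF).
        while len(self._buffer) < size:
            try:
                self._buffer += next(self._lines)
            except StopIteration:
                break
        chunk, self._buffer = self._buffer[:size], self._buffer[size:]
        return chunk

df = pd.read_csv(FixedCSV("some.csv"))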

Type Error when Lemmatizing words using NLTK

I have parsed 30 Excel files and created a pandas dataframe. I have tokenized the words, removed stop words and made bigrams. However, when I try to lemmatize, I get this error: TypeError: unhashable type: 'list'
Here's my code:
# Use simple pre-process to clean up data and tokenize
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

# Define function for removing stopwords
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

# Define function for bigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Remove stop words
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Define function for lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

# Lemmatize words
data_lemmatized = get_lemma(data_words_bigrams)
This is exactly where I get the error. How should I adjust my code to resolve this issue? Thank you in advance
As suggested, the first few lines of the dataframe, from df.head():
[screenshot of the df.head() output]
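For what it's worth, the TypeError comes from passing the whole nested list to lemmatize(), which expects a single string. A sketch of the usual fix is to map the lemmatizer over the individual tokens:

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# data_words_bigrams is a list of token lists, so lemmatize word by word.
data_lemmatized = [[lemmatizer.lemmatize(word) for word in doc]
                   for doc in data_words_bigrams]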

Python Unittest for big arrays

I am trying to put together a unittest to test whether my function, which reads in big data files, produces the correct result in the shape of a numpy array. However, these files and arrays are huge and cannot be typed in by hand. I believe I need to save the input and output files and test using them. This is what my test module looks like:
import numpy as np
from myFunctions import fun1
import unittest

class TestMyFunctions(unittest.TestCase):

    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"

    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        R = np.genfromtxt(self.outputFile)
        self.assertEqual(fun1(m1, m2), R)

if __name__ == '__main__':
    unittest.main(exit=False)
I'm not sure if there is a better/neater way of testing huge results.
Edit:
Also, I'm now getting an attribute error:
AttributeError: 'TestMyFunctions' object has no attribute '_testMethodName'
Update: the AttributeError is solved. Overriding __init__() on the TestCase was the problem; changing it to def setUp() fixed it!
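One possibly neater pattern, sketched below with placeholder file names: store the fixtures as binary .npy files and compare them with numpy.testing, since assertEqual on whole arrays is ambiguous (taking the truth value of an element-wise comparison raises a ValueError):

import unittest
import numpy as np
from myFunctions import fun1

class TestMyFunctions(unittest.TestCase):

    def test_fun1(self):
        # Binary .npy fixtures load faster than text files and preserve dtype/shape.
        m1 = np.load("input1.npy")
        m2 = np.load("input2.npy")
        expected = np.load("output.npy")
        # Element-wise comparison with a tolerance; reports the mismatching entries.
        np.testing.assert_allclose(fun1(m1, m2), expected, rtol=1e-7)

if __name__ == '__main__':
    unittest.main(exit=False)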
