Cannot pickle dateparser using cloudpickle - python-3.x

I'm using the dateparser library to parse some strings and return potential dates. I need to use cloudpickle for distributed use but am receiving an error:
import cloudpickle
import dateparser
class DateParser:
    def __init__(self,
                 threshold: float = 0.5,
                 pos_label: str = 'date'):
        self.threshold = threshold
        self.pos_label = pos_label
    def __call__(self):
        dateparser.parse('20/12/2022')
date_parser = DateParser()
with open('/path/parser.cloudpickle', 'wb+') as fout:
    cloudpickle.dump(date_parser, fout, protocol=4)
TypeError: can't pickle _thread.lock objects
However, when I use plain pickle it works just fine:
import pickle
with open('/path/parser.pickle', 'wb+') as fout:
    pickle.dump(date_parser, fout, protocol=4)
# also loads just fine:
with open('/path/parser.pickle', 'rb+') as fin:
    pickle.load(fin)
I can get around this issue by importing dateparser inside the __init__ of DateParser, but I'm not sure why this should be the fix.
class DateParser:
    def __init__(self,
                 threshold: float = 0.5,
                 pos_label: str = 'date'):
        import dateparser
        self.threshold = threshold
        self.pos_label = pos_label
I looked online, and it seems this thread-lock complaint is most common with multiprocessing calls, but as far as I can tell nothing like that happens in the underlying dateparser library. And shouldn't this have broken plain pickling as well?
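One possible explanation, which nothing in this thread confirms: plain pickle stores an instance as a reference to its class (module name plus class name) together with the instance attributes, and never inspects the globals that the methods use. cloudpickle, as far as I know, serializes classes defined in __main__ by value, so it also has to walk the method code and the globals it references. The sketch below only demonstrates the plain-pickle half of that. Holder and shared_lock are hypothetical stand-ins, not part of dateparser or of the original code.

```python
import pickle
import threading

# Hypothetical module-level global that cannot be pickled. It stands in for
# whatever lock-holding object the failing cloudpickle call may have reached.
shared_lock = threading.Lock()


class Holder:
    """Hypothetical stand-in for DateParser: plain attributes, method uses a global."""

    def __init__(self, threshold=0.5, pos_label="date"):
        self.threshold = threshold
        self.pos_label = pos_label

    def __call__(self):
        with shared_lock:
            return self.pos_label


# Plain pickle writes the class's import path and the instance attributes only,
# so the global lock used inside __call__ is never touched.
payload = pickle.dumps(Holder(), protocol=4)
restored = pickle.loads(payload)

# The lock itself is not picklable: serializing it directly fails.
try:
    pickle.dumps(shared_lock)
    lock_error = None
except TypeError as exc:
    lock_error = str(exc)
```

If this is what happens in the original code, it would also fit the observation that moving the import into __init__ helps, since the module is then no longer a global referenced by the class.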

Related

torch dataloader for large csv file - incremental loading

I am trying to write a custom torch data loader so that large CSV files can be loaded incrementally (by chunks).
I have a rough idea of how to do that. However, I keep getting some PyTorch error that I do not know how to solve.
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
# Create dummy csv data
nb_samples = 110
a = np.arange(nb_samples)
df = pd.DataFrame(a, columns=['data'])
df.to_csv('data.csv', index=False)
# Create Dataset
class CSVDataset(Dataset):
    def __init__(self, path, chunksize, nb_samples):
        self.path = path
        self.chunksize = chunksize
        self.len = nb_samples / self.chunksize
    def __getitem__(self, index):
        x = next(
            pd.read_csv(
                self.path,
                skiprows=index * self.chunksize + 1,  # +1, since we skip the header
                chunksize=self.chunksize,
                names=['data']))
        x = torch.from_numpy(x.data.values)
        return x
    def __len__(self):
        return self.len
dataset = CSVDataset('data.csv', chunksize=10, nb_samples=nb_samples)
loader = DataLoader(dataset, batch_size=10, num_workers=1, shuffle=False)
for batch_idx, data in enumerate(loader):
    print('batch: {}\tdata: {}'.format(batch_idx, data))
I get the error: 'float' object cannot be interpreted as an integer
The error is caused by this line:
self.len = nb_samples / self.chunksize
Dividing with / always produces a float, but __len__() must return an integer. Round self.len and/or convert it to an integer, for example like this:
self.len = nb_samples // self.chunksize
The double slash (//) is floor division: it rounds down and, for two integer operands, returns an int.
Edit:
You actually CAN return a float from __len__(), but the error occurs as soon as len(dataset) is called. So I guess len(dataset) is called somewhere inside the DataLoader class.
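The behaviour described above can be reproduced without torch. The following is a minimal sketch; ChunkIndex is a hypothetical stand-in for CSVDataset that keeps only the length logic. It also shows a variation that is not part of the answer: floor division silently drops a trailing partial chunk, whereas math.ceil keeps it, which may or may not be what you want.

```python
import math


class ChunkIndex:
    """Hypothetical stand-in for CSVDataset; only the length logic is modelled."""

    def __init__(self, nb_samples, chunksize, as_float=False):
        self.nb_samples = nb_samples
        self.chunksize = chunksize
        self.as_float = as_float

    def __len__(self):
        if self.as_float:
            # Mirrors the original bug: true division returns a float.
            return self.nb_samples / self.chunksize
        # Round up so that a final, shorter chunk is still counted.
        return math.ceil(self.nb_samples / self.chunksize)


exact = len(ChunkIndex(110, 10))      # 110 rows split evenly into chunks of 10
partial = len(ChunkIndex(115, 10))    # the last chunk holds only 5 rows
floor_partial = 115 // 10             # floor division ignores those 5 rows

# Calling len() on an object whose __len__ returns a float raises TypeError.
try:
    len(ChunkIndex(110, 10, as_float=True))
    float_error = None
except TypeError as exc:
    float_error = str(exc)
```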

Attribute Error : pickle.load() Seldon Deployment

I am doing a Seldon deployment. I have created custom pipelines using sklearn, and they live in MyPipelines/CustomPipelines.py. The main code, my_prediction.py, is the file Seldon executes by default (based on my configuration). In this file I import the custom pipelines. If I run my_prediction.py locally (PyCharm) it executes fine, but if I deploy it using Seldon I get the error: AttributeError: Can't get attribute 'MyEncoder'
It is unable to load the classes defined in CustomPipelines.py. I tried all the solutions from "Unable to load files using pickle and multiple modules"; none of them worked.
MyPipelines/CustomPipelines.py
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
class MyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        df = X
        vars_cat = [var for var in df.columns if df[var].dtypes == 'O']
        cat_with_na = [var for var in vars_cat if df[var].isnull().sum() > 0]
        df[cat_with_na] = df[cat_with_na].fillna('Missing')
        return df
my_prediction.py
import pickle
import pandas as pd
import dill
from MyPipelines.CustomPipelines import MyEncoder
from MyPipelines.CustomPipelines import *
import MyPipelines.CustomPipelines
class my_prediction:
    def __init__(self):
        file_name = 'model.sav'
        with open(file_name, 'rb') as model_file:
            self.model = pickle.load(model_file)
    def predict(self, request):
        data = request.get('ndarray')
        columns = request.get('names')
        X = pd.DataFrame(data, columns = columns)
        predictions = self.model.predict(X)
        return predictions
Error:
File microservice/my_prediction.py in __init__
    self.model = pickle.load(model_file)
AttributeError: Can't get attribute 'MyEncoder' on <module '__main__' from 'opt/conda/bin/seldon-core-microservice'>
One of the constraints of the pickle module is that it expects the same classes (under the same module) to be available in the environment where the artifact gets unpickled. In this case, it seems your my_prediction class is trying to unpickle a MyEncoder artifact, but that class is not available in that environment under the module name recorded in the pickle.
As a quick workaround, you could make your MyEncoder class available in the environment where my_prediction runs (i.e. have the same folders / files present there as well). Otherwise, you could look at alternatives to pickle, like cloudpickle or dill, which can serialise your custom code as well (although these come with their own set of caveats).
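The constraint described above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: training_script_demo is a hypothetical module name standing in for wherever MyEncoder lived when the model was saved, and the class body is a placeholder rather than the real sklearn transformer. The pickle records only the module and class name, so loading fails when that name cannot be resolved and succeeds once it can.

```python
import pickle
import sys
import types

# Hypothetical module standing in for the place the class was defined at save time.
demo_module = types.ModuleType("training_script_demo")


class MyEncoder:
    """Placeholder for the real transformer; only used to show the lookup."""

    def __init__(self):
        self.fill_value = "Missing"


# Register the class under the hypothetical module so pickle records that path.
MyEncoder.__module__ = "training_script_demo"
MyEncoder.__qualname__ = "MyEncoder"
demo_module.MyEncoder = MyEncoder
sys.modules["training_script_demo"] = demo_module

payload = pickle.dumps(MyEncoder())

# Simulate the deployment environment: the module exists, the class does not.
del demo_module.MyEncoder
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = str(exc)

# Making the class available under the recorded module name fixes the load.
demo_module.MyEncoder = MyEncoder
restored = pickle.loads(payload)
```

The traceback in the question names the module '__main__', which suggests the model was saved from a script in which MyEncoder was defined or resolved as __main__.MyEncoder. If so, saving the model from code that imports the class from MyPipelines.CustomPipelines would record that importable path instead.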

Multiprocessing error for NLP application

I'm working on an NLP project. I have a massive dataset of 180 million words. Before I begin training I want to correct the spelling of words. To do this I use TextBlob's spell correct. Because TextBlob is slow to begin with, correcting the spelling of 180 million words sequentially would take an insanely long time. So here is my approach (code follows below):
1. Load the corpus
2. Split the corpus into a list of sentences using the nltk tokenizer
3. Multiprocessing: apply the function to every item of the list generated in step 2
Here is my code:
import codecs
import multiprocessing
import nltk
from textblob import TextBlob
from nltk.tokenize import sent_tokenize
class SpellCorrect():
    def __init__(self):
        pass
    def load_data(self, path):
        with codecs.open(path, "r", "utf-8") as file:
            data = file.read()
        return sent_tokenize(data)
    def correct_spelling(self, data):
        data = TextBlob(data)
        return str(data.correct())
    def merge_cleaned_corpus(self, result, path):
        result = " ".join(temp for temp in result)
        with codecs.open(path, "a", "utf-8") as file:
            file.write(result)
if __name__ == "__main__":
    SpellCorrect = SpellCorrect()
    data = SpellCorrect.load_data(path)
    correct_spelling = SpellCorrect.correct_spelling
    pool = multiprocessing.Pool(processes = multiprocessing.cpu_count())
    result = pool.apply_async(correct_spelling, (data, ))
    result = result.get()
    SpellCorrect.merge_cleaned_corpus(tuple(result), path)
When I run this, I get the following error:
_pickle.PicklingError: Can't pickle <class '__main__.SpellCorrect'>: it's not the same object as __main__.SpellCorrect
This error is raised at the line in my code that says result = result.get()
My guess, which is probably wrong, is that the parallel processing step completed successfully and applied my clean-up to every sentence, but that I'm unable to retrieve the results.
Can someone tell me why this error is being generated, and what I can do to fix it? Thanks in advance!
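One likely cause, offered as a sketch rather than a confirmed diagnosis: the line SpellCorrect = SpellCorrect() rebinds the class name to an instance. To send the bound method to a worker, multiprocessing pickles it, pickle then looks up the class by name, finds the instance instead, and reports that it is "not the same object". The example below reproduces this without multiprocessing; SpellChecker is a hypothetical stand-in for the SpellCorrect class.

```python
import pickle


class SpellChecker:
    """Hypothetical stand-in for SpellCorrect from the question."""

    def correct(self, text):
        return text.strip()


original_class = SpellChecker

# Same pattern as in the question: the class name now refers to an instance.
SpellChecker = SpellChecker()

# Pickling the bound method needs the class, which is looked up by name.
# The name resolves to the instance, so pickle refuses.
try:
    pickle.dumps(SpellChecker.correct)
    shadow_error = None
except pickle.PicklingError as exc:
    shadow_error = str(exc)

# Restoring the class binding and using a separate name for the instance works.
SpellChecker = original_class
checker = SpellChecker()
round_tripped = pickle.loads(pickle.dumps(checker.correct))
cleaned = round_tripped("  teh cat  ")
```

Separately, apply_async as written submits the whole sentence list as a single task, so step 3 of the plan would probably need something like pool.map over the list instead; that part is not shown here.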

Python Unittest for big arrays

I am trying to put together a unit test to check whether my function, which reads in big data files, produces the correct result in the form of a NumPy array. However, these files and arrays are huge and cannot be typed in by hand. I believe I need to save input and output files and test against them. This is what my test module looks like:
import numpy as np
from myFunctions import fun1
import unittest
class TestMyFunctions(unittest.TestCase):
    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"
    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        R = np.genfromtxt(self.outputFile)
        # assertEqual needs a single bool, which comparing multi-element arrays
        # cannot provide; compare element-wise with a tolerance instead
        np.testing.assert_allclose(fun1(m1, m2), R)
if __name__ == '__main__':
    unittest.main(exit=False)
I'm not sure if there is a better/neater way of testing huge results.
Edit:
Also getting an attribute error now:
AttributeError: TestMyFunctions object has no attribute '_testMethodName'
Update: AttributeError solved. Defining def __init__() on the test class (without calling the parent constructor) is what caused it; I replaced it with def setUp().

Computing precision and recall for two sets of keywords in NLTK and Scikit for sets of different sizes

I am trying to compute precision and recall for two sets of keywords. The gold_standard has 823 terms and the test has 1497 terms.
With nltk.metrics's precision and recall I can pass the two sets just fine, but doing the same with scikit-learn throws an error:
ValueError: Found arrays with inconsistent numbers of samples: [ 823 1497]
How do I resolve this?
#!/usr/bin/python3
from nltk.metrics import precision, recall
from sklearn.metrics import precision_score
from sys import argv
from time import time
import numpy
import csv
def readCSVFile(filename):
    termList = set()
    with open(filename, 'rt', encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:
            termList.update(row)
    return termList
def readDocuments(gs_file, fileToProcess):
    print("Reading CSV files...")
    gold_standard = readCSVFile(gs_file)
    test = readCSVFile(fileToProcess)
    print("All files successfully read!")
    return gold_standard, test
def calcPrecisionScipy(gs, test):
    gs = numpy.array(list(gs))
    test = numpy.array(list(test))
    print("Precision Scipy: ", precision_score(gs, test, average=None))
def process(dataset):
    print("Processing input...")
    gs, test = dataset
    print("Precision: ", precision(gs, test))
    calcPrecisionScipy(gs, test)
def usage():
    print("Usage: python3 generate_stats.py gold_standard.csv termlist_to_process.csv")
if __name__ == '__main__':
    if len(argv) != 3:
        usage()
        exit(-1)
    t0 = time()
    process(readDocuments(argv[1], argv[2]))
    print("Total runtime: %0.3fs" % (time() - t0))
I referred to the following pages for coding:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
=================================Update===================================
Okay, so I tried to add 'non-sensical' data to the list to make them equal length:
def calcPrecisionScipy(gs, test):
    if len(gs) < len(test):
        gs.update(list(range(len(test)-len(gs))))
    gs = numpy.array(list(gs))
    test = numpy.array(list(test))
    print("Precision Scipy: ", precision_score(gs, test, average=None))
Now I have another error:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
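scikit-learn's precision_score compares two label arrays element by element (true label against predicted label for the same sample), so it cannot work on two arrays of different lengths. nltk.metrics.precision instead treats its arguments as sets and measures how many of the test items also appear in the reference, which is why it accepts sets of different sizes.
If you still want to use scikit-learn, you have to bring both inputs to the same length yourself, for example by truncating them in your script. Note that the score is only meaningful if the two arrays are aligned sample by sample.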
import numpy as np
import sklearn.metrics
set1 = [True, True]
set2 = [True, False, False]
length = np.amin([len(set1), len(set2)])
set1 = set1[:length]
set2 = set2[:length]
print(sklearn.metrics.precision_score(set1, set2))
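For two keyword sets, precision and recall can be defined through set intersection, which does not require the sets to have the same size. The sketch below uses only built-in sets; as far as I know it mirrors the definition behind nltk.metrics precision(reference, test) and recall(reference, test). The function names and the sample terms are made up for illustration.

```python
def set_precision(reference, test):
    """Fraction of test items that also occur in the reference set."""
    if not test:
        return None  # undefined when nothing was extracted
    return len(reference & test) / len(test)


def set_recall(reference, test):
    """Fraction of reference items that were found in the test set."""
    if not reference:
        return None  # undefined when there is no gold standard
    return len(reference & test) / len(reference)


# Illustrative data: 4 gold-standard terms, 6 extracted terms, 3 in common.
gold_standard = {"neural network", "gradient", "tokenizer", "corpus"}
extracted = {"gradient", "corpus", "stopword", "lemma", "parser", "tokenizer"}

p = set_precision(gold_standard, extracted)  # 3 shared / 6 extracted
r = set_recall(gold_standard, extracted)     # 3 shared / 4 gold-standard terms
```

With 823 gold-standard terms and 1497 test terms the same two functions apply unchanged; no padding or truncation is needed.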
