Parallelizing fastText.get_sentence_vector with dask gives pickling error - python-3.x

I was trying to get fastText sentence embeddings for 80 Million English tweets using the parallelizing mechanism using dask as described in this answer: How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?
Here is my full code:
import dask.dataframe as dd
from dask.multiprocessing import get
import fasttext
import fasttext.util
import pandas as pd
print('starting langage: ' + 'en')
lang_output = pd.DataFrame()
lang_input = full_input.loc[full_input.name == 'en'] # 80 Million English tweets
ddata = dd.from_pandas(lang_input, npartitions = 96)
print('number of lines to compute: ' + str(len(lang_input)))
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.'+'en'+'.300.bin')
fasttext.util.reduce_model(ft, 20)
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
print('finished en')
This is the get_fasttext_sentence_embedding function:
def get_fasttext_sentence_embedding(row, ft):
if pd.isna(row):
return np.zeros(20)
return ft.get_sentence_vector(row)
But, I get a pickling error on this line:
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
This is the error I get:
TypeError: can't pickle fasttext_pybind.fasttext objects
Is there a way to parallelize fastText model get_sentence_vector with dask (or anything else)? I need to parallelize because getting sentence embeddings for 80 Million tweets takes two much time and one row of my data frame is completely independent of the other.

The problem here is that fasttext objects apparently can't be pickled, and Dask doesn't know how to serialize and deserialize this data structure without pickling.
The simplest way to use Dask here (but likely not the most efficient), would be to have each process define the ft model itself, which would avoid the need to transfer it (and thus avoid the attempted pickling). Something like the following would work. Notice that ft is defined inside the function being mapped across partitions.
First, some example data.
import dask.dataframe as dd
import fasttext
import pandas as pd
import dask
import numpy as np
df = pd.DataFrame({"text":['this is a test sentence', None, 'this is another one.', 'one more']})
ddf = dd.from_pandas(df, npartitions=2)
ddf
Dask DataFrame Structure:
text
npartitions=2
0 object
2 ...
3 ...
Dask Name: from_pandas, 2 tasks
Next, we can tweak your functions to define ft within each process. This duplicates effort, but avoids the need to transfer the model. With that, we can smoothly run it via map_partitions.
def get_embeddings(sent, model):
return model.get_sentence_vector(sent) if not pd.isna(sent) else np.zeros(10)
def func(df):
ft = fasttext.load_model("amazon_review_polarity.bin") # arbitrary model
res = df['text'].apply(lambda x: get_embeddings(x, model=ft))
return res
ddf['sentence_vector'] = ddf.map_partitions(func)
ddf.compute(scheduler='processes')
text sentence_vector
0 this is a test sentence [-0.01934033, 0.03729743, -0.04679677, -0.0603...
1 None [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 this is another one. [-0.0025579212, 0.0353713, -0.027139299, -0.05...
3 one more [-0.014522496, 0.10396308, -0.13107553, -0.198...
Note that this nested data structure (list in a column) is probably not the optimal way to handle these vectors, but it will depend on your use case. Also, there is probably a way to do this computation in batches using fastext rather than one row at a time (in Python), but I'm not well versed in the nuances of fastext.

I had the same problem, but I found a solution using Multiprocessing - Python's Standard Library.
First step - wrap
model = fasttext.load_model(file_name_model)
def get_vec(txt):
'''
First tried to put model.get_sentence_vector into map (look below), but it resulted in pickle error.
This works, lol.
'''
return model.get_sentence_vector(txt)
Then, I'm doing this:
from multiprocessing import Pool
text = ["How to sell drugs (fast)", "House of Cards", "The Crown"]
with Pool(40) as p: # I have 40 cores
result = p.map(get_vec, text)
With 40 cores processing 10M short texts took me ~80s.

Related

Dask memory usage exploding even for simple computations

I have a parquet folder created with dask containing multiple files of about 100MB each. When I load the dataframe with df = dask.dataframe.read_parquet(path_to_parquet_folder), and run any sort of computation (such as df.describe().compute()), my kernel crashes.
Things I have noticed:
CPU usage (about 100%) indicates that multithreading is not used
memory usage shoots way past the size of a single file
the kernel crashes after system memory usage approaches 100%
EDIT:
I tried to create a reproducible example, without success, but I discovered some other oddities, seemingly all related to the newer pandas dtypes that I'm using:
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 10000000
test = pd.DataFrame({
1:pd.Series(['a', pd.NA]*n, dtype = pd.StringDtype()),
2:pd.Series([1, pd.NA]*n, dtype = pd.Int64Dtype()),
3:pd.Series([0.56, pd.NA]*n, dtype = pd.Float64Dtype())
})
dd_df = dd.from_pandas(test, npartitions = 2) # convert to dask df
dd_df.to_parquet('test.parquet') # save as parquet directory
dd_df = dd.read_parquet('test.parquet') # load files back
dd_df.mean().compute() # compute something
dd_df.describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something
Output, respectively:
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
Kernel appears to have died.
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
It seems that the dtypes are preserved even throughout the parquet IO, but dask has some trouble actually doing anything with these columns.
Python version: 3.9.7
dask version: 2021.11.2
It seems the main error is due to NAType which is not yet fully supported by numpy (version 1.21.4):
~/some_env/python3.8/site-packages/numpy/core/_methods.py in _var(a, axis, dtype, out, ddof, keepdims, where)
240 # numbers and complex types with non-native byteorder
241 else:
--> 242 x = um.multiply(x, um.conjugate(x), out=x).real
243
244 ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)
TypeError: loop of ufunc does not support argument 0 of type NAType which has no callable conjugate method
As a workaround, casting columns to float will compute the descriptives. Note that to avoid KeyError the column names are given as strings rather than int.
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 1000
# note that column names are changed to strings rather than ints
test = pd.DataFrame(
{
"1": pd.Series(["a", pd.NA] * n, dtype=pd.StringDtype()),
"2": pd.Series([1, pd.NA] * n, dtype=pd.Int64Dtype()),
"3": pd.Series([0.56, pd.NA] * n, dtype=pd.Float64Dtype()),
}
)
dd_df = dd.from_pandas(test, npartitions=2) # convert to dask df
dd_df.to_parquet("test.parquet", engine="fastparquet") # save as parquet directory
dd_df = dd.read_parquet("test.parquet", engine="fastparquet") # load files back
dd_df.mean().compute() # compute something
dd_df.astype({"2": "float"}).describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something

Multiprocessing getting stuck with ARMAX while refitting

I am trying to train multiple time series models using the below code in Jupyter Notebook.
import statsmodels.api as sm
import multiprocessing
import tqdm
train_dict = dict() # A dictionary of dataframes
test_dict = dict() # A dictionary of dataframes
def train_arma(key):
endog = list(train_dict[key].endog)
exog = list(train_dict[key].exog)
fut_endog = list(train_dict[key].endog)
fut_exog = list(test_dict[key].exog)
model = sm.tsa.arima.ARIMA(endog, order=(2, 0, 2), exog=exog,
enforce_stationarity=False,
enforce_invertibility=False).fit()
predictions = list()
yhat = model.forecast(exog=[fut_exog[0]])[0]
predictions.append(yhat)
for i in tqdm.tqdm_notebook(range(len(fut_vol))[:-1]):
model = model.append([fut_vol[i]], exog=[fut_exog[i]], refit=True) #code gets stuck here
predictions.append(model.forecast(exog=[fut_exog[i+1]])
return predictions
secs = list(train_dict.keys())
p = multiprocessing.Pool(10)
output = p.map(train_arma, secs)
p.terminate()
When len(endog) == 1006, the code keeps getting stuck on the 17th iteration in the for loop. If I decrease the endog by 20, then it gets stuck on 37th iteration.
There are some other things I have tried already:
Passing dataframes directly instead of letting the function acess train_dict and test_dict from outer scope.
Reducing the number of maximum processes in multiprocessing.
Shuffling my input list.
Defining a new class instance in the for loop while appending the values from fut_endog and fut_exog lists in endog and exog lists respectively.
I did a top in my linux terminal and the observed the cpu usage while processes were getting created and executed. Initially when the processes spawn, they use up cpu and when the processes gets stuck %CPU allocation becomes 0.
There are some instances when the code does work:
When I call the function directly, without multiprocessing, it works. But using multiprocessing even with processes = 1 makes the code stop.
When I don't pass any exogenous variable and train a simple ARMA model it works.
I am using statsmodels v0.12.1 and python version is 3.7.3. Thanks
This issue must be due to usage of tqdm alongside multiprocessing.
https://github.com/tqdm/tqdm/issues/461 addresses this issue.
I resolved it by using
from tqdm import tqdm
tqdm.get_lock().locks = []

Computing dask delayed objects stored in dataframe

I am looking for the best way to compute many dask delayed obejcts stored in a dataframe. I am unsure if the pandas dataframe should be converted to a dask dataframe with delayed objects within, or if the compute call should be called on all values of the pandas dataframe.
I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed object across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute
steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()
for N in sample:
enr = []
for i in range(20):
k = np.random.randint(1, 200)
enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells like so: enr_df.applymap(compute) (which I believe calls compute on each value individually).
However if I convert to a dask dataframe the delayed objects I want to compute are layered in the dask dataframe structure:
enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
And the computation output I expect does not proceed.
You can pass a list of delayed objects into dask.compute
results = dask.compute(*list_of_delayed_objects)
So you need to get a list from your Pandas dataframe. This is something you can do with normal Python code.

Key Error: nan when applying KNN classifier

I am trying to test my KNN classifier against some data that I sourced from UCI's Machine Learning Repository. When running the classifier I keep getting the same KeyError
train_set[i[-1]].append(i[:-1])
KeyError: NaN
I am not sure why this keeps happening because if I comment out the classifier and just print the first 10 lines or so, the data shows up fine with no corruption or duplication of any kind.
Here is a link to the data that I am using, I just simply downloaded it and added the column ID's (note: in this link the column ID's have not been added)
Here is what some of the code looks like:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import warnings
from math import sqrt
from collections import Counter
import pandas as pd
import random
style.use('fivethirtyeight')
def k_nearest_neighbors(data, predict, k=3):
if len(data) >= k:
warnings.warn('K is set to a value less than total voting groups!')
distances = []
for group in data:
for features in data[group]:
euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
distances.append([euclidean_distance,group])
votes = [i[1] for i in sorted(distances)[:k]]
vote_result = Counter(votes).most_common(1)[0][0]
return vote_result
df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?',-99999, inplace=True)
df.drop(['id'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
test_size = 0.2
train_set = {2:[], 4:[]}
test_set = {2:[], 4:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
train_set[i[-1]].append(i[:-1])
for i in test_data:
test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
for data in test_set[group]:
vote = k_nearest_neighbors(train_set, data, k=5)
if group == vote:
correct += 1
total += 1
print('Accuracy:', correct/total)
I am completely stumped as to why this KeyError keeps showing up, (it also happens on the
test_set[i[-1]].append(i[:-1]) line as well.
I tried looking for people who experienced similar issues but have since found nobody with the same issue as me. As always any assistance is greatly appreciated, thank you.
I figured out that the error was caused by a spacing issue. When typing in the classes for the data after I downloaded it, I forgot to input the classes on their own line. I instead typed my classes right in front of the first data point causing the error to occur.

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing and I have been looking into Dask but I am having trouble actually coding it.
I have had a look at their examples and documentation and I think dask.delayed will work best. I attempted to wrap my functions with the delayed(function_name), or add an #delayed decorator, but I can't seem to get it working properly. I preferred Dask over other methods since it is made in python and for its (supposed) simplicity. I know dask doesn't work on the for loop, but they say it can work inside a loop.
My code passes files through a function that contains inputs to other functions and looks like this:
from dask import delayed
filenames = ['1.csv', '2.csv', '3.csv', etc. etc. ]
for count, name in enumerate(filenames)"
name = name.split('.')[0]
....
then do some pre-processing ex:
preprocess1, preprocess2 = delayed(read_files_and_do_some_stuff)(name)
then I call a constructor and pass the pre_results in to the function calls:
fc = FunctionCalls()
Daily = delayed(fc.function_runs)(filename=name, stringinput='Daily',
input_data=pre_result1, model1=pre_result2)
What i do here is I pass the file into the for loop, do some pre-processing and then pass the file into two models.
Thoughts or tips on how to do parallelize this? I began getting odd errors and I had no idea how to fix the code. The code does work as is. I use a bunch of pandas dataframes, series, and numpy arrays, and I would prefer not to go back and change everything to work with dask.dataframes etc.
The code in my comment may be difficult to read. Here it is in a more formatted way.
In the code below, when I type print(mean_squared_error) I just get: Delayed('mean_squared_error-3009ec00-7ff5-4865-8338-1fec3f9ed138')
from dask import delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = ['file1.csv']
for count, name in enumerate(filenames):
file1 = pd.read_csv(name)
df = pd.DataFrame(file1)
prediction = df['Close'][:-1]
observed = df['Close'][1:]
mean_squared_error = delayed(mse)(observed, prediction)
You need to call dask.compute to eventually compute the result. See dask.delayed documentation.
Sequential code
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
results = []
for count, name in enumerate(filenames):
file1 = pd.read_csv(name)
df = pd.DataFrame(file1) # isn't this already a dataframe?
prediction = df['Close'][:-1]
observed = df['Close'][1:]
mean_squared_error = mse(observed, prediction)
results.append(mean_squared_error)
Parallel code
import dask
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
delayed_results = []
for count, name in enumerate(filenames):
df = dask.delayed(pd.read_csv)(name)
prediction = df['Close'][:-1]
observed = df['Close'][1:]
mean_squared_error = dask.delayed(mse)(observed, prediction)
delayed_results.append(mean_squared_error)
results = dask.compute(*delayed_results)
A much clearer solution, IMO, than the accepted answer is this snippet.
from dask import compute, delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
def compute_mse(file_name):
df = pd.read_csv(file_name)
prediction = df['Close'][:-1]
observed = df['Close'][1:]
return mse(observed, prediction)
delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]
mean_squared_errors = compute(*delayed_results, scheduler="processes")

Resources