Dask memory usage exploding even for simple computations - python-3.x

I have a parquet folder created with dask containing multiple files of about 100MB each. When I load the dataframe with df = dask.dataframe.read_parquet(path_to_parquet_folder), and run any sort of computation (such as df.describe().compute()), my kernel crashes.
Things I have noticed:
CPU usage (about 100%) indicates that multithreading is not used
memory usage shoots way past the size of a single file
the kernel crashes after system memory usage approaches 100%
EDIT:
I tried to create a reproducible example, without success, but I discovered some other oddities, seemingly all related to the newer pandas dtypes that I'm using:
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 10000000
test = pd.DataFrame({
    1: pd.Series(['a', pd.NA]*n, dtype=pd.StringDtype()),
    2: pd.Series([1, pd.NA]*n, dtype=pd.Int64Dtype()),
    3: pd.Series([0.56, pd.NA]*n, dtype=pd.Float64Dtype())
})
dd_df = dd.from_pandas(test, npartitions = 2) # convert to dask df
dd_df.to_parquet('test.parquet') # save as parquet directory
dd_df = dd.read_parquet('test.parquet') # load files back
dd_df.mean().compute() # compute something
dd_df.describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something
Output, respectively:
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
Kernel appears to have died.
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
It seems that the dtypes are preserved even throughout the parquet IO, but dask has some trouble actually doing anything with these columns.
Python version: 3.9.7
dask version: 2021.11.2

It seems the main error is due to NAType, which is not yet fully supported by numpy (version 1.21.4):
~/some_env/python3.8/site-packages/numpy/core/_methods.py in _var(a, axis, dtype, out, ddof, keepdims, where)
240 # numbers and complex types with non-native byteorder
241 else:
--> 242 x = um.multiply(x, um.conjugate(x), out=x).real
243
244 ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)
TypeError: loop of ufunc does not support argument 0 of type NAType which has no callable conjugate method
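To see the incompatibility outside of dask, here is a quick illustrative sketch (the toy series is not taken from the question): converting a nullable-dtype column to a NumPy array yields an object array containing pd.NA, which NumPy reductions cannot handle, whereas casting to float first turns pd.NA into NaN:
import numpy as np
import pandas as pd

s = pd.Series([1, pd.NA] * 3, dtype=pd.Int64Dtype())
arr = s.to_numpy()                        # object array containing <NA>
# np.var(arr)                             # raises a TypeError like the one above
np.nanvar(s.astype("float").to_numpy())   # works: pd.NA becomes np.nan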
As a workaround, casting the columns to float allows the descriptives to be computed. Note that, to avoid the KeyError, the column names are given as strings rather than ints.
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 1000
# note that column names are changed to strings rather than ints
test = pd.DataFrame(
    {
        "1": pd.Series(["a", pd.NA] * n, dtype=pd.StringDtype()),
        "2": pd.Series([1, pd.NA] * n, dtype=pd.Int64Dtype()),
        "3": pd.Series([0.56, pd.NA] * n, dtype=pd.Float64Dtype()),
    }
)
dd_df = dd.from_pandas(test, npartitions=2) # convert to dask df
dd_df.to_parquet("test.parquet", engine="fastparquet") # save as parquet directory
dd_df = dd.read_parquet("test.parquet", engine="fastparquet") # load files back
dd_df.mean().compute() # compute something
dd_df.astype({"2": "float"}).describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something

Related

How to use python-multiprocessing to concat many files/dataframes?

I'm relatively new to Python and programming and just use it for the analysis of simulation data.
I have a directory "result_1/" with over 150,000 CSV files of simulation data that I want to concatenate into one pandas DataFrame. To avoid problems with readdir() only reading 32K directory entries at a time, I prepared "files.csv", which lists all the files in the directory.
("sim", "det", and "run" are pieces of information I read from the filenames and insert as Series into the DataFrame. For better readability, I took their definitions out of the concat call.)
My problem is as follows:
The program takes too much time to run, and I would like to use multiprocessing/multithreading to speed up the for-loop, but since I have never used them before, I don't even know if or how they can be used here.
Thank you in advance and have a great day!
import numpy as np
import pandas as pd
import os
import multiprocessing as mp
df = pd.DataFrame()
path = 'result_1/'
list = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()
for file in list:
    dftemp = pd.read_csv(r'{}'.format(os.path.join(path, file)), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    dftemp = pd.concat([sim, det, run, dftemp], axis=1)
    df = pd.concat([df, dftemp], axis=0)
df.rename({0:'sim', 1:'det', 2:'run', 3:'x', 4:'dos'}, axis=1).to_csv(r'df.csv')
The CSV files are named like "193Nr6_Run_0038.csv", for example, and their contents look like this:
#(8 lines of things I don't need.)
0, 0, 0, 4.621046656438921e-09
1, 0, 0, 4.600856584602298e-09
(... 300 lines of data [x, y, z, dose])
Processing DataFrames in parallel can be difficult due to CPU and RAM limitations. I don't know the specs of your hardware nor the details of your DataFrames. However, I would use multiprocessing to "parse/make" the DataFrames, and then concatenate them afterwards. Here is an example:
import numpy as np
import pandas as pd
import os
from multiprocessing import Pool
path = 'result_1/'
list_of_files = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()
#make a function to replace the for-loop:
def my_custom_func(file):
    dftemp = pd.read_csv(r'{}'.format(os.path.join(path, file)), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    return pd.concat([sim, det, run, dftemp], axis=1)
#use multiprocessing to process multiple files at once
with Pool(8) as p:  #8 processes simultaneously. Avoid using more processes than cores in your CPU
    dataframes = p.map(my_custom_func, list_of_files)
#Finally, concatenate them all
df = pd.concat(dataframes)
df.rename({0:'sim', 1:'det', 2:'run', 3:'x', 4:'dos'}, axis=1).to_csv(r'df.csv')
Have a look at multiprocessing.Pool() for more info.

Parallelizing fastText.get_sentence_vector with dask gives pickling error

I was trying to get fastText sentence embeddings for 80 million English tweets using the dask parallelization mechanism described in this answer: How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?
Here is my full code:
import dask.dataframe as dd
from dask.multiprocessing import get
import fasttext
import fasttext.util
import pandas as pd
print('starting langage: ' + 'en')
lang_output = pd.DataFrame()
lang_input = full_input.loc[full_input.name == 'en'] # 80 Million English tweets
ddata = dd.from_pandas(lang_input, npartitions = 96)
print('number of lines to compute: ' + str(len(lang_input)))
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.'+'en'+'.300.bin')
fasttext.util.reduce_model(ft, 20)
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
print('finished en')
This is the get_fasttext_sentence_embedding function:
def get_fasttext_sentence_embedding(row, ft):
    if pd.isna(row):
        return np.zeros(20)
    return ft.get_sentence_vector(row)
But, I get a pickling error on this line:
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
This is the error I get:
TypeError: can't pickle fasttext_pybind.fasttext objects
Is there a way to parallelize fastText's get_sentence_vector with dask (or anything else)? I need to parallelize because getting sentence embeddings for 80 million tweets takes too much time, and each row of my data frame is completely independent of the others.
The problem here is that fasttext objects apparently can't be pickled, and Dask doesn't know how to serialize and deserialize this data structure without pickling.
The simplest way to use Dask here (but likely not the most efficient), would be to have each process define the ft model itself, which would avoid the need to transfer it (and thus avoid the attempted pickling). Something like the following would work. Notice that ft is defined inside the function being mapped across partitions.
First, some example data.
import dask.dataframe as dd
import fasttext
import pandas as pd
import dask
import numpy as np
df = pd.DataFrame({"text":['this is a test sentence', None, 'this is another one.', 'one more']})
ddf = dd.from_pandas(df, npartitions=2)
ddf
Dask DataFrame Structure:
                 text
npartitions=2
0              object
2                 ...
3                 ...
Dask Name: from_pandas, 2 tasks
Next, we can tweak your functions to define ft within each process. This duplicates effort, but avoids the need to transfer the model. With that, we can smoothly run it via map_partitions.
def get_embeddings(sent, model):
    return model.get_sentence_vector(sent) if not pd.isna(sent) else np.zeros(10)

def func(df):
    ft = fasttext.load_model("amazon_review_polarity.bin")  # arbitrary model
    res = df['text'].apply(lambda x: get_embeddings(x, model=ft))
    return res
ddf['sentence_vector'] = ddf.map_partitions(func)
ddf.compute(scheduler='processes')
text sentence_vector
0 this is a test sentence [-0.01934033, 0.03729743, -0.04679677, -0.0603...
1 None [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 this is another one. [-0.0025579212, 0.0353713, -0.027139299, -0.05...
3 one more [-0.014522496, 0.10396308, -0.13107553, -0.198...
Note that this nested data structure (a list in a column) is probably not the optimal way to handle these vectors, but that will depend on your use case. Also, there is probably a way to do this computation in batches using fastText rather than one row at a time (in Python), but I'm not well versed in the nuances of fastText.
I had the same problem, but I found a solution using multiprocessing from Python's standard library.
First step: wrap the model call in a plain function.
model = fasttext.load_model(file_name_model)

def get_vec(txt):
    '''
    First tried to put model.get_sentence_vector into map (look below), but it resulted in pickle error.
    This works, lol.
    '''
    return model.get_sentence_vector(txt)
Then, I'm doing this:
from multiprocessing import Pool
text = ["How to sell drugs (fast)", "House of Cards", "The Crown"]
with Pool(40) as p:  # I have 40 cores
    result = p.map(get_vec, text)
With 40 cores, processing 10M short texts took me about 80 seconds.

Dask client.persist returns AssertionError when I try to use HashingVectorizer

I am trying to vectorize a dask.dataframe with the dask-ml HashingVectorizer. I want the vectorization results to stay in the cluster (distributed system). That's why I am using client.persist when I try to transform the data. But for some reason, I am getting the error below.
Traceback (most recent call last):
File "/home/dodzilla/my_project/components_with_adapter/vectorizers/base_vectorizer.py", line 112, in hybrid_feature_vectorizer
CLUSTERING_FEATURES=self.clustering_features)
File "/home/dodzilla/my_project/components_with_adapter/vectorizers/text_vectorizer.py", line 143, in vectorize
X = self.client.persist(fitted_vectorizer.transform, combined_data)
File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 2860, in persist
assert all(map(dask.is_dask_collection, collections))
AssertionError
I can't share the data but all of the necessary information about the data is as below:
>>>type(combined_data)
<class 'dask.dataframe.core.Series'>
>>>type(combined_data.compute())
<class 'pandas.core.series.Series'>
>>>combined_data.compute().shape
12
A minimal working example can be found below. In the code snippet, combined_data holds the merged columns, meaning all of the columns are merged into one column. The data has 12 rows, and all of the values are strings. This is the code where I am getting the error:
from stop_words import get_stop_words
from dask_ml.feature_extraction.text import HashingVectorizer as daskHashingVectorizer
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client
def convert_dataframe_to_single_text(documents):
    """
    Combine all of the columns into 1 column.
    """
    if type(documents) is dask.dataframe.core.DataFrame:
        cols = documents.columns
        documents['combined'] = documents[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1,
                                                      meta=('str'))
        document_texts = documents.drop(cols, axis=1)
    else:
        raise TypeError('Wrong type of data. Expected Pandas DF or Dask DF but received ', type(documents))
    return document_texts
# Init the client.
client = Client('localhost:8786')
# Get stopwords
stopwords = get_stop_words(language="english")
# Create dask dataframe from pandas dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':["twenty", "twentyone", "nineteen", "eighteen"]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)
# Init the vectorizer
vectorizer = daskHashingVectorizer(stop_words=stopwords, alternate_sign=False,
                                   norm=None, binary=False,
                                   n_features=10000)
# Combine all of to columns into 1 column.
combined_data = convert_dataframe_to_single_text(df)
# Fit the vectorizer.
fitted_vectorizer = client.persist(vectorizer.fit(combined_data))
# Transform the data.
X = client.persist(fitted_vectorizer.transform, combined_data)
I hope the information is enough.
Important note: I am not getting any kind of error when I use client.compute, but from what I understand this doesn't run on the cluster of machines and instead runs on the local machine. It also returns a csr matrix instead of a lazily evaluated dask.array.
This is not how client.persist is supposed to be used. The functions I was looking for are client.submit and client.map... In my case, client.submit solved my issue.
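For reference, a minimal sketch of what that could look like with the snippet above (this is a hypothetical illustration, not the confirmed final code):
# Hypothetical sketch: submit the transform as a task; the Future's result
# stays on a worker until it is gathered.
future = client.submit(fitted_vectorizer.transform, combined_data.compute())
X = future.result()  # fetch the transformed (sparse) result when needed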

While creating dummy variables getting memory error

I am working on a project, and while getting the dummy values I was getting a memory exception.
I have tried using .astype(np.int8), and I have also tried writing exception-handling code by importing psutil.
I am using the code below:
dummy_cols = ['emp_title','grade','home_ownership','verification_status','addr_state','pub_rec','application_type']
df_dummies = pd.get_dummies(df[dummy_cols], drop_first = True)
It's not working and keeps throwing a memory error.
pandas.get_dummies creates a dense representation of the dummy variables, which may require a lot of memory depending on the number of levels in the categorical features.
I would prefer sklearn.preprocessing.OneHotEncoder, which outputs sparse matrices.
The code would look like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create a fake dataframe
df = pd.DataFrame(
    {
        "df1": np.random.choice(["a", "b"], 100),
        "df2": np.random.choice(["c", "d"], 100)
    }
)
dummy_cols = ["df1", "df2"]
# LabelEncode categoricals
for f in dummy_cols:
    df[f] = LabelEncoder().fit_transform(df[f])
# Transform to dummies in sparse representation (csr_matrix)
df_dummies = OneHotEncoder().fit_transform(df[dummy_cols])
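The result df_dummies is a scipy sparse matrix rather than a DataFrame. If a pandas object is still needed downstream, a sparse-backed DataFrame avoids densifying it (a sketch, assuming a reasonably recent pandas version):
# df_dummies is a scipy sparse matrix; keep it sparse to save memory.
print(type(df_dummies), df_dummies.shape)
# If a pandas object is required, avoid .toarray() and wrap the sparse matrix instead:
sparse_df = pd.DataFrame.sparse.from_spmatrix(df_dummies)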

PySpark: Invalid returnType with scalar Pandas UDFs

I'm trying to return a specific structure from a pandas_udf. It worked on one cluster but fails on another.
I am trying to run a UDF on groups, which requires the return type to be a data frame.
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np
from pyspark.sql.types import *
schema = StructType([
    StructField("Distance", FloatType()),
    StructField("CarId", IntegerType())
])
def haversine(lon1, lat1, lon2, lat2):
    # Calculate distance, return scalar
    return 3.5  # Removed logic to facilitate reading

@pandas_udf(schema)
def totalDistance(oneCar):
    dist = haversine(oneCar.Longtitude.shift(1),
                     oneCar.Latitude.shift(1),
                     oneCar.loc[1:, 'Longitude'],
                     oneCar.loc[1:, 'Latitude'])
    return pd.DataFrame({"CarId": oneCar['CarId'].iloc[0], "Distance": np.sum(dist)}, index=[0])
## Calculate the overall distance made by each car
distancePerCar= df.groupBy('CarId').apply(totalDistance)
This is the exception I'm getting:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
114 try:
--> 115 to_arrow_type(self._returnType_placeholder)
116 except TypeError:
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type
TypeError: Unsupported type in conversion to Arrow: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
During handling of the above exception, another exception occurred:
NotImplementedError Traceback (most recent call last)
<ipython-input-35-4f2194cfb998> in <module>()
18 km = 6367 * c
19 return km
---> 20 @pandas_udf("CarId: int, Distance: float")
21 def totalDistance(oneUser):
22 dist = haversine(oneUser.Longtitude.shift(1), oneUser.Latitude.shift(1),
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _create_udf(f, returnType, evalType)
62 udf_obj = UserDefinedFunction(
63 f, returnType=returnType, name=None, evalType=evalType, deterministic=True)
---> 64 return udf_obj._wrapped()
65
66
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _wrapped(self)
184
185 wrapper.func = self.func
--> 186 wrapper.returnType = self.returnType
187 wrapper.evalType = self.evalType
188 wrapper.deterministic = self.deterministic
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
117 raise NotImplementedError(
118 "Invalid returnType with scalar Pandas UDFs: %s is "
--> 119 "not supported" % str(self._returnType_placeholder))
120 elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
121 if isinstance(self._returnType_placeholder, StructType):
NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true))) is not supported
I've also tried changing the schema to
#pandas_udf("<CarId:int,Distance:float>")
and
#pandas_udf("CarId:int,Distance:float")
but get the same exception. I suspect it has to do with my pyarrow version, which isn't compatible with my pyspark version.
Any help would be appreciated. Thanks!
As reported in the error message ("Invalid returnType with scalar Pandas UDFs"), you are trying to create a SCALAR vectorized pandas UDF, but using a StructType schema and returning a pandas DataFrame.
You should rather declare your function as a GROUPED MAP pandas UDF, i.e.:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
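For reference, here is a minimal sketch of how the function from the question could be declared with that decorator (it reuses the schema and function body from the question, which I have not verified, and adds the required PandasUDFType import):
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def totalDistance(oneCar):
    # oneCar arrives as a pandas DataFrame holding all rows for one CarId
    dist = haversine(oneCar.Longtitude.shift(1),
                     oneCar.Latitude.shift(1),
                     oneCar.loc[1:, 'Longitude'],
                     oneCar.loc[1:, 'Latitude'])
    return pd.DataFrame({"CarId": oneCar['CarId'].iloc[0],
                         "Distance": np.sum(dist)}, index=[0])

distancePerCar = df.groupBy('CarId').apply(totalDistance)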
The difference between scalar and grouped vectorized UDFs is explained in the pyspark docs: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf.
A scalar UDF defines a transformation: one or more pandas.Series -> a pandas.Series. The returnType should be a primitive data type, e.g., DoubleType(). The length of the returned pandas.Series must be the same as that of the input pandas.Series.
To summarize, a scalar pandas UDF processes a column at a time (a pandas Series), leading to better performance than traditional UDFs that process one row element at a time. Note that the performance improvement is due to efficient python serialization using PyArrow.
A grouped map UDF defines a transformation: a pandas.DataFrame -> a pandas.DataFrame. The returnType should be a StructType describing the schema of the returned pandas.DataFrame. The length of the returned pandas.DataFrame can be arbitrary, and the columns must be indexed so that their position matches the corresponding field in the schema.
A grouped pandas UDF processes multiple rows and columns at a time (using a pandas DataFrame, not to be confused with a Spark DataFrame), and is extremely useful and efficient for multivariate operations (especially when using local python numerical analysis and machine learning libraries like numpy, scipy, scikit-learn etc.). In this case, the output is a single-row DataFrame with several columns.
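For contrast, a scalar pandas UDF declares a primitive return type and returns a Series of the same length as its input. A purely illustrative sketch (the column name is hypothetical, not from the question):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def km_to_miles(distance):
    # distance is a pandas Series; the result must be a Series of equal length
    return distance * 0.621371

# usage: df.withColumn("DistanceMiles", km_to_miles(df["Distance"]))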
Note that I did not check the internal logic of the code, only the methodology.
