How to use python-multiprocessing to concat many files/dataframes? - python-3.x

I'm relatively new to python and programming and just use it for the analysis of simulation data.
I have a directory "result_1/" with over 150000 CSV files of simulation data that I want to concatenate into one pandas DataFrame. To avoid problems with readdir() only reading 32K of directory entries at a time, I prepared "files.csv", listing all the files in the directory.
("sim", "det", and "run" are pieces of information I read from the filenames and insert as Series into the DataFrame. For a better overview, I took their definition out of the concat.)
My problem is as follows:
The program takes too much time to run, and I would like to use multiprocessing/multithreading to speed up the for-loop, but as I have never used mp/mt before, I don't even know if or how it can be used here.
Thank you in advance and have a great day!
import numpy as np
import pandas as pd
import os
import multiprocessing as mp
df = pd.DataFrame()
path = 'result_1/'
list_of_files = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()  # avoid naming this 'list', which shadows the builtin
for file in list_of_files:
    dftemp = pd.read_csv(os.path.join(path, file), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    dftemp = pd.concat([sim, det, run, dftemp], axis=1)
    df = pd.concat([df, dftemp], axis=0)
df.rename({0:'sim', 1:'det', 2:'run', 3:'x', 4:'dos'}, axis=1).to_csv(r'df.csv')
The CSV files are named like "193Nr6_Run_0038.csv" and look like this:
#(8 lines of things I don't need.)
0, 0, 0, 4.621046656438921e-09
1, 0, 0, 4.600856584602298e-09
(... 300 lines of data [x, y, z, dose])

Processing DataFrames in parallel can be difficult due to CPU and RAM limitations. I don't know the specs of your hardware nor the details of your DataFrames. However, I would use multiprocessing to "parse/make" the DataFrames, and then concatenate them afterwards. Here is an example:
import numpy as np
import pandas as pd
import os
from multiprocessing import Pool
path = 'result_1/'
list_of_files = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()
#make a function to replace the for-loop:
def my_custom_func(file):
    dftemp = pd.read_csv(os.path.join(path, file), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    return pd.concat([sim, det, run, dftemp], axis=1)
#use multiprocessing to process multiple files at once
with Pool(8) as p: #8 processes simultaneously. Avoid using more processes than your CPU has cores
    dataframes = p.map(my_custom_func, list_of_files)
#Finally, concatenate them all
df = pd.concat(dataframes)
df.rename({0:'sim', 1:'det', 2:'run', 3:'x', 4:'dos'}, axis=1).to_csv(r'df.csv')
Have a look at multiprocessing.Pool() for more info.
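With roughly 150000 small tasks, scheduling overhead can add up; Pool.map accepts a chunksize argument that batches tasks per worker. A minimal tweak of the call above (64 is an illustrative guess, not a tuned value):
with Pool(8) as p:
    # each worker receives batches of 64 files, cutting inter-process overhead
    dataframes = p.map(my_custom_func, list_of_files, chunksize=64)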

Related

Pandas pandarallel parallel_apply

Here is a simple program that works in parallel, but it has an issue when I want to use a previous result in the apply.
import pandas as pd
import numpy as np
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8) # nb_workers=NUMBER_OF_CPU_CORES
def dummy_fit(x, y_hint=0.5):
    # Imagine quite a complicated code here
    # y_hint is a previous fit. When it is not given, use the default
    y = (x.mean() + y_hint) / 2
    return y
df = pd.DataFrame(np.random.random((10, 3)), columns=list("ABC"))
print("data:\n", df)
result = df.parallel_apply(dummy_fit, axis=1)
print(result)
We could use a global variable for the previous result, but there is only one of it while there are several workers.
How can I make this work in parallel?
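One way to sidestep the shared global, sketched under the assumption that the previous fit can be computed before the parallel pass: bind it as an explicit argument with functools.partial, so every worker receives its own copy.
from functools import partial

previous_fit = 0.7  # hypothetical previous result; in real code this comes from an earlier fit
# a partial of a top-level function stays picklable, so the workers can receive it
result = df.parallel_apply(partial(dummy_fit, y_hint=previous_fit), axis=1)
print(result)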

Dask memory usage exploding even for simple computations

I have a parquet folder created with dask containing multiple files of about 100MB each. When I load the dataframe with df = dask.dataframe.read_parquet(path_to_parquet_folder), and run any sort of computation (such as df.describe().compute()), my kernel crashes.
Things I have noticed:
CPU usage (about 100%) indicates that multithreading is not used
memory usage shoots way past the size of a single file
the kernel crashes after system memory usage approaches 100%
EDIT:
I tried to create a reproducible example, without success, but I discovered some other oddities, seemingly all related to the newer pandas dtypes that I'm using:
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 10000000
test = pd.DataFrame({
    1: pd.Series(['a', pd.NA]*n, dtype=pd.StringDtype()),
    2: pd.Series([1, pd.NA]*n, dtype=pd.Int64Dtype()),
    3: pd.Series([0.56, pd.NA]*n, dtype=pd.Float64Dtype())
})
dd_df = dd.from_pandas(test, npartitions = 2) # convert to dask df
dd_df.to_parquet('test.parquet') # save as parquet directory
dd_df = dd.read_parquet('test.parquet') # load files back
dd_df.mean().compute() # compute something
dd_df.describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something
Output, respectively:
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
Kernel appears to have died.
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
It seems that the dtypes are preserved even throughout the parquet IO, but dask has some trouble actually doing anything with these columns.
Python version: 3.9.7
dask version: 2021.11.2
It seems the main error is due to NAType which is not yet fully supported by numpy (version 1.21.4):
~/some_env/python3.8/site-packages/numpy/core/_methods.py in _var(a, axis, dtype, out, ddof, keepdims, where)
240 # numbers and complex types with non-native byteorder
241 else:
--> 242 x = um.multiply(x, um.conjugate(x), out=x).real
243
244 ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)
TypeError: loop of ufunc does not support argument 0 of type NAType which has no callable conjugate method
As a workaround, casting the columns to float will compute the descriptives. Note that, to avoid the KeyError, the column names are given as strings rather than ints.
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 1000
# note that column names are changed to strings rather than ints
test = pd.DataFrame(
    {
        "1": pd.Series(["a", pd.NA] * n, dtype=pd.StringDtype()),
        "2": pd.Series([1, pd.NA] * n, dtype=pd.Int64Dtype()),
        "3": pd.Series([0.56, pd.NA] * n, dtype=pd.Float64Dtype()),
    }
)
dd_df = dd.from_pandas(test, npartitions=2) # convert to dask df
dd_df.to_parquet("test.parquet", engine="fastparquet") # save as parquet directory
dd_df = dd.read_parquet("test.parquet", engine="fastparquet") # load files back
dd_df.mean().compute() # compute something
dd_df.astype({"2": "float"}).describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something

kernel dies when computing DBSCAN in scikit-learn after dimensionality reduction

I have some data after using ColumnTransformer() that looks like this:
>>> X_trans
<197431x6040 sparse matrix of type '<class 'numpy.float64'>'
        with 3553758 stored elements in Compressed Sparse Row format>
I transform the data using TruncatedSVD(), which seems to work:
from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=3, random_state=0)
>>> X_trans_svd = svd.fit_transform(X_trans)
>>> X_trans_svd
array([[ 1.72326526,  1.85499833, -1.41848742],
       [ 1.67802434,  1.81705149, -1.25959756],
       [ 1.70251936,  1.82621935, -1.33124505],
       ...,
       [ 1.5607798 ,  0.07638707, -1.11972714],
       [ 1.56077981,  0.07638652, -1.11972728],
       [ 1.91659627, -0.12081577, -0.84551125]])
Now I want to run DBSCAN on the transformed data:
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(X_trans_svd)
but my kernel crashes.
I also tried converting it back to a DataFrame and passing that to DBSCAN:
>>> d = {'1st_component': X_trans_svd[:, 0],
...      '2nd_component': X_trans_svd[:, 1],
...      '3rd_component': X_trans_svd[:, 2]}
>>> df = pd.DataFrame(data=d)
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(df)
But the kernel keeps crashing. Any idea why that is? I'd appreciate a hint.
EDIT: If I use just part of my 197431x3 array, it works up to X_trans_svd[0:170000] and starts crashing at X_trans_svd[0:180000]. Furthermore, the size of the array is:
>>> X_trans_svd.nbytes
4738344
EDIT2: Sorry for not providing this earlier. Here's an example to reproduce. I tried two machines with 16 GB and 64 GB of RAM. Data is here: original data
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.cluster import DBSCAN
s = np.loadtxt('data.txt', dtype='float')
elapsed = datetime.now()
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(s)
elapsed = datetime.now() - elapsed
print(elapsed)
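One plausible culprit, given that X_trans_svd.nbytes is only ~4.7 MB: scikit-learn's DBSCAN bulk-computes the neighborhood of every sample, so memory grows with the total number of neighbor pairs rather than with the input size, and the scikit-learn documentation points to OPTICS as a lower-memory alternative. A sketch of that substitution, with parameters carried over from the question rather than tuned:
import numpy as np
from sklearn.cluster import OPTICS

s = np.loadtxt('data.txt', dtype='float')
# max_eps/min_samples mirror the DBSCAN settings above; the labels will be
# similar but not identical to DBSCAN at eps=0.5
optics = OPTICS(min_samples=5, max_eps=0.5)
clusters = optics.fit_predict(s)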

Parallelizing fastText.get_sentence_vector with dask gives pickling error

I was trying to get fastText sentence embeddings for 80 million English tweets using the dask-based parallelization described in this answer: How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?
Here is my full code:
import dask.dataframe as dd
from dask.multiprocessing import get
import fasttext
import fasttext.util
import numpy as np  # needed by get_fasttext_sentence_embedding below
import pandas as pd
print('starting language: ' + 'en')
lang_output = pd.DataFrame()
lang_input = full_input.loc[full_input.name == 'en'] # 80 Million English tweets
ddata = dd.from_pandas(lang_input, npartitions = 96)
print('number of lines to compute: ' + str(len(lang_input)))
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.'+'en'+'.300.bin')
fasttext.util.reduce_model(ft, 20)
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
print('finished en')
This is the get_fasttext_sentence_embedding function:
def get_fasttext_sentence_embedding(row, ft):
    if pd.isna(row):
        return np.zeros(20)
    return ft.get_sentence_vector(row)
But, I get a pickling error on this line:
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
This is the error I get:
TypeError: can't pickle fasttext_pybind.fasttext objects
Is there a way to parallelize fastText's get_sentence_vector with dask (or anything else)? I need to parallelize because getting sentence embeddings for 80 million tweets takes too much time, and each row of my data frame is completely independent of the others.
The problem here is that fasttext objects apparently can't be pickled, and Dask doesn't know how to serialize and deserialize this data structure without pickling.
The simplest way to use Dask here (but likely not the most efficient), would be to have each process define the ft model itself, which would avoid the need to transfer it (and thus avoid the attempted pickling). Something like the following would work. Notice that ft is defined inside the function being mapped across partitions.
First, some example data.
import dask.dataframe as dd
import fasttext
import pandas as pd
import dask
import numpy as np
df = pd.DataFrame({"text":['this is a test sentence', None, 'this is another one.', 'one more']})
ddf = dd.from_pandas(df, npartitions=2)
ddf
Dask DataFrame Structure:
                 text
npartitions=2
0              object
2                 ...
3                 ...
Dask Name: from_pandas, 2 tasks
Next, we can tweak your functions to define ft within each process. This duplicates effort, but avoids the need to transfer the model. With that, we can smoothly run it via map_partitions.
def get_embeddings(sent, model):
    return model.get_sentence_vector(sent) if not pd.isna(sent) else np.zeros(10)

def func(df):
    ft = fasttext.load_model("amazon_review_polarity.bin") # arbitrary model
    res = df['text'].apply(lambda x: get_embeddings(x, model=ft))
    return res
ddf['sentence_vector'] = ddf.map_partitions(func)
ddf.compute(scheduler='processes')
text sentence_vector
0 this is a test sentence [-0.01934033, 0.03729743, -0.04679677, -0.0603...
1 None [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 this is another one. [-0.0025579212, 0.0353713, -0.027139299, -0.05...
3 one more [-0.014522496, 0.10396308, -0.13107553, -0.198...
Note that this nested data structure (a list in a column) is probably not the optimal way to handle these vectors, but that will depend on your use case. Also, there is probably a way to do this computation in batches using fasttext rather than one row at a time (in Python), but I'm not well versed in the nuances of fasttext.
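If the nested column becomes awkward downstream, one option (a sketch, assuming every vector has the 10 dimensions that np.zeros(10) above implies) is to expand it into one float column per dimension after computing:
result = ddf.compute(scheduler='processes')
# build one column per vector dimension, aligned on the original index
vec_cols = pd.DataFrame(result['sentence_vector'].tolist(),
                        index=result.index,
                        columns=[f'dim_{i}' for i in range(10)])
flat = result.drop(columns='sentence_vector').join(vec_cols)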
I had the same problem, but I found a solution using multiprocessing from Python's standard library.
First step: wrap the model call in a top-level function.
model = fasttext.load_model(file_name_model)

def get_vec(txt):
    '''
    First tried to put model.get_sentence_vector into map (look below),
    but it resulted in a pickle error. This works, lol.
    '''
    return model.get_sentence_vector(txt)
Then, I'm doing this:
from multiprocessing import Pool
text = ["How to sell drugs (fast)", "House of Cards", "The Crown"]
with Pool(40) as p: # I have 40 cores
    result = p.map(get_vec, text)
With 40 cores, processing 10M short texts took me ~80 s.
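For what it's worth, this likely works because on Linux the default multiprocessing start method is fork, so the workers inherit the already-loaded global model and nothing has to be pickled. On spawn-based platforms (Windows, and macOS by default on newer Pythons), a hedged equivalent loads the model once per worker via a Pool initializer (file_name_model as in the snippet above):
from multiprocessing import Pool
import fasttext

def init_worker():
    # each worker loads its own copy of the model; only the texts and the
    # returned vectors cross process boundaries
    global model
    model = fasttext.load_model(file_name_model)

def get_vec(txt):
    return model.get_sentence_vector(txt)

if __name__ == '__main__':
    text = ["How to sell drugs (fast)", "House of Cards", "The Crown"]
    with Pool(40, initializer=init_worker) as p:
        result = p.map(get_vec, text)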

How to iteratively add rows to an initially empty pandas DataFrame?

I have to iteratively add rows to a pandas DataFrame and find this quite hard to achieve. Also, performance-wise, I'm not sure this is the best approach.
So from time to time I get data from a server, and each new dataset will become a new row in my pandas DataFrame.
import pandas as pd
import datetime
df = pd.DataFrame([], columns=['Timestamp', 'Value'])
# as this df will grow over time, is this a costly copy (df = df.append), or does pandas do some optimization there, or is there a better way to achieve this?
# ignore_index, as I want the index to automatically increment
df = df.append({'Timestamp': datetime.datetime.now()}, ignore_index=True)
print(df)
After one day the DataFrame will be deleted, but during that time a new row of data will be added roughly 100k times.
The goal is still to achieve this in a very efficient way, runtime-wise (memory doesn't matter too much, as enough RAM is present).
I tried this to compare the speed of "append" with that of "loc":
import timeit
code = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df = df.append({'A' : 3, 'B' : 4}, ignore_index = True)
"""
code2 = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df.loc[df.index.max()+1, :] = [3, 4]
"""
elapsed_time1 = timeit.timeit(code, number = 1000)/1000
elapsed_time2 = timeit.timeit(code2, number = 1000)/1000
print('With "append" :',elapsed_time1)
print('With "loc" :' , elapsed_time2)
On my machine, I obtained these results:
With "append" : 0.001502693824000744
With "loc" : 0.0010836279180002747
Using "loc" seems to be faster.
