select a part of dataframe every time in parallel - python-3.x

I want to create dictionaries in a loop.
Since in every iteration I take only a part of the initial dataframe (df_train = df[df['CLASS'] == oneClass]), I want to parallelize the loop.
My code is:
import pandas as pd
import numpy as np
from multiprocessing import Pool

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8], 'CLASS': ['A', 'B', 'C']})

def make_dataframes(df, oneClass):
    new_df = {}
    df_train = df[df['CLASS'] == oneClass]
    numeric_only_data_cols = df_train.select_dtypes(include=np.number).columns.difference(['CLASS'])
    numeric_only_data = df_train[numeric_only_data_cols]
    X = numeric_only_data.values
    x = X * 100
    orig_columns = numeric_only_data.loc[:, numeric_only_data.columns != 'CLASS'].columns
    new_df[oneClass] = pd.DataFrame(x, columns=orig_columns)
    new_df[oneClass]['CLASS'] = df_train['CLASS']
    return new_df

new_df = {}
classes = np.unique(df['CLASS'])
with Pool(4) as pool:
    for new_dataframe in pool.map(make_dataframes, classes):
        new_df['new_dataframe'] = new_dataframe
    pool.close()
    pool.join()
In the function I omitted the for loop that the serial version had:

new_df = {}
for oneClass in classes:
    df_train = df[df['GROUP_DESC'] == oneClass]
    ...
Now I am receiving:
make_dataframes() missing 1 required positional argument: 'oneClass'
I am not sure how to pass the function's arguments, or whether classes is a valid argument for map.

Are you planning on executing your code inside a cluster? If not, then you're probably better off executing your code in the old-fashioned single-process way. There's a great talk on the subject by Raymond Hettinger that I find pretty interesting and recommend checking out: Raymond Hettinger, Keynote on Concurrency, PyBay 2017.
Having said that, one easy fix to your implementation would be to give make_dataframes a single parameter that represents a tuple of both df and oneClass:
import pandas as pd
import numpy as np
from multiprocessing import Pool

def make_dataframes(args):
    new_df = {}
    df = args[0]         # <--- Unpacking values
    oneClass = args[-1]  # <--- Unpacking values
    df_train = df[df['CLASS'] == oneClass]
    numeric_only_data = df_train.select_dtypes(include=np.number).loc[:, lambda xdf: xdf.columns.difference(['CLASS'])]
    X = numeric_only_data.values
    x = X * 100
    orig_columns = numeric_only_data.loc[:, numeric_only_data.columns != 'CLASS'].columns
    new_df[oneClass] = pd.DataFrame(x, columns=orig_columns)
    new_df[oneClass]['CLASS'] = df_train['CLASS']
    return new_df

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8], 'CLASS': ['A', 'B', 'C']})
new_df = {}
classes = np.unique(df["CLASS"])
with Pool(4) as pool:
    for new_dataframe in pool.map(make_dataframes, zip([df] * len(classes), classes)):
        new_df[list(new_dataframe.keys())[0]] = list(new_dataframe.values())[0]
    pool.close()
    pool.join()
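If manually packing and unpacking the tuple feels clunky, Pool.starmap (standard library, Python 3.3+) unpacks each tuple into positional arguments for you, so the original two-argument signature can stay. A minimal sketch, reusing the make_dataframes(df, oneClass), df, and classes defined in the question:

from multiprocessing import Pool

# Each (df, c) tuple is unpacked into make_dataframes(df, c) by starmap
with Pool(4) as pool:
    results = pool.starmap(make_dataframes, [(df, c) for c in classes])

# Merge the one-entry dicts ({class: frame}) returned by each worker
new_df = {}
for result in results:
    new_df.update(result)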
A second approach would be to use the Joblib package instead of multiprocessing, like so:
import pandas as pd
import numpy as np
from joblib import Parallel, delayed

def make_dataframes(df, oneClass):
    new_df = {}
    df_train = df[df["CLASS"] == oneClass]
    numeric_only_data = df_train.select_dtypes(include=np.number).loc[
        :, lambda xdf: xdf.columns.difference(["CLASS"])
    ]
    X = numeric_only_data.values
    x = X * 100
    orig_columns = numeric_only_data.loc[
        :, numeric_only_data.columns != "CLASS"
    ].columns
    new_df[oneClass] = pd.DataFrame(x, columns=orig_columns)
    new_df[oneClass]["CLASS"] = df_train["CLASS"]
    return new_df

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8], 'CLASS': ['A', 'B', 'C']})
classes = np.unique(df["CLASS"])
new_df = {
    key: value
    for parallel in Parallel(n_jobs=4)(
        delayed(make_dataframes)(df, i) for i in classes
    )
    for key, value in parallel.items()
}
Finally, here is the approach I recommend if you're not planning on running this code inside a power-hungry cluster and don't need to extract every last drop of performance from it:
import pandas as pd
import numpy as np
from joblib import Parallel, delayed

def make_dataframes(df, oneClass):
    new_df = {}
    df_train = df[df["CLASS"] == oneClass]
    numeric_only_data = df_train.select_dtypes(include=np.number).loc[
        :, lambda xdf: xdf.columns.difference(["CLASS"])
    ]
    X = numeric_only_data.values
    x = X * 100
    orig_columns = numeric_only_data.loc[
        :, numeric_only_data.columns != "CLASS"
    ].columns
    new_df[oneClass] = pd.DataFrame(x, columns=orig_columns)
    new_df[oneClass]["CLASS"] = df_train["CLASS"]
    return new_df

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8], 'CLASS': ['A', 'B', 'C']})
classes = np.unique(df["CLASS"])
new_df = {c: make_dataframes(df, c)[c] for c in classes}
For comparison, I've recorded each approach's execution time:
multiprocessing: CPU times: user 13.6 ms, sys: 41.1 ms, total: 54.7 ms; Wall time: 158 ms
joblib: CPU times: user 14.3 ms, sys: 0 ns, total: 14.3 ms; Wall time: 16.5 ms
Serial processing: CPU times: user 14.1 ms, sys: 797 µs, total: 14.9 ms; Wall time: 14.9 ms
Running things in parallel incurs a lot of overhead in communication between the different processes. Besides, it's an intrinsically more complex task than running things serially, so developing and maintaining the code becomes considerably harder and more expensive. If running things in parallel is the number one priority, I would recommend first ditching Pandas and using PySpark or Dask instead.
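For illustration only (a sketch, not part of the original answer), here is what the same scaling of the numeric columns might look like in Dask; the variable names and npartitions=4 are arbitrary choices:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8],
                   'CLASS': ['A', 'B', 'C']})

# Split the pandas frame into partitions that Dask can process in parallel
ddf = dd.from_pandas(df, npartitions=4)

# Scale every numeric column by 100; the work stays lazy until .compute()
numeric_cols = ['a', 'b', 'c']
ddf = ddf.assign(**{col: ddf[col] * 100 for col in numeric_cols})

# compute() materializes the result back into a regular pandas DataFrame
scaled = ddf.compute()
print(scaled)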

Related

Python Way - Processing multiple files & add the sum of all the results in Parallel

I have a list of 500 JSON files. The contents of the files are as follows:
{'minute': '2022-11-16T02:29:00.000+00:00', 'mycount': [[0, 0], [1, 32], [2, 3456], [3, 446], [4, 534534], [5, 474], [6, 448], [7, 529], [8, 507], [9, 515], [10, 477], [11, 486], [12, 491], [13, 474], [14, 528], [15, 23]]}
I want to achieve the following using parallel processing (maybe processing 100 files in parallel):
For each file, find the sum of the second element of each item in mycount (0 + 32 + 3456 + 446 + 534534 + ...). Let's call it sum1.
Calculate sum1 for all the files and return the total sum = sum1 + sum2 + sum3 + ...
How can I achieve this using multithreading in Python?
If you don't mind using multiprocessing instead of multithreading, you can use the multiprocessing library and the json decoder to parse the content of your files:
import multiprocessing as mp
import json
# Other libraries
import os
import warnings

def compute_file_sum(f):
    """Compute the sum for a file"""
    try:
        # Read the whole content
        with open(f, 'r') as ff:
            file_content = ff.readlines()
        # Load as a JSON (mind the change of ' into ")
        file_content = json.loads('\n'.join(file_content).replace("'", '"'))
        # Compute the sum of second items of each element in 'mycount'
        return sum(
            c[1] for c in file_content['mycount']
        )
    except Exception as e:
        # Handle exceptions
        warnings.warn(f"Issues with file {f}, {str(e)}")
        return 0

def get_filepaths(root_dir):
    """Get an iterator with the paths of the files of interest"""
    return map(
        lambda y: os.path.join(root_dir, y),
        filter(
            # Filter only files whose names match some conditions
            lambda x: os.path.splitext(x)[0].startswith('bbb') and os.path.splitext(x)[1] == '.txt',
            next(os.walk(root_dir))[2]
        )
    )

if __name__ == '__main__':
    # Get the path of the folder with the files of interest
    # Here it is the folder with the python script
    root_dir = os.path.dirname(__file__)
    # Compute in parallel the sum for each file
    with mp.Pool(processes=mp.cpu_count() - 1) as a:
        file_sums = a.imap_unordered(compute_file_sum, get_filepaths(root_dir))
        # Get the total sum (consume the lazy iterator while the pool is still open)
        total_sum = sum(file_sums)
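Since the question asked about multithreading specifically, here is a minimal sketch of the same idea with concurrent.futures.ThreadPoolExecutor, reusing the compute_file_sum and get_filepaths functions above (threads work fine here because the per-file work is mostly I/O; max_workers=100 mirrors the "100 files in parallel" from the question):

import os
from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    root_dir = os.path.dirname(__file__)
    # executor.map runs compute_file_sum over the paths in a pool of worker threads
    with ThreadPoolExecutor(max_workers=100) as executor:
        total_sum = sum(executor.map(compute_file_sum, get_filepaths(root_dir)))
    print(total_sum)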

Python Pandas How to get rid of groupings with only 1 row?

In my dataset, I am trying to get the margin between two values. The code below runs perfectly if the fourth race is not included. After grouping based on a column, it seems that sometimes there will be only 1 value, and therefore no other value to get a margin out of. I want to ignore such groupings in that case. Here is my current code:
import pandas as pd

data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
        'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)
print(df)

def winning_margin(times):
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner

winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']

winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')

avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins
How about returning a NaN if times does not have enough elements:
import numpy as np

def winning_margin(times):
    if len(times) <= 1:  # New code
        return np.NaN    # New code
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
Your code runs with this change and seems to produce sensible results. You can furthermore remove the NaNs later if you want, e.g. in this line:

winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin).dropna()  # note the addition of .dropna()
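Another option, sketched here as a suggestion rather than part of the original answer: drop the single-row races up front with groupby().filter, and the rest of the original code can stay unchanged:

import pandas as pd

data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
        'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)

# Keep only races with at least two entrants; race 4 (a single row) is dropped
df = df.groupby('RaceNumber').filter(lambda g: len(g) > 1)
print(df)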
You could get the winner and margin in one step:
def get_margin(x):
    if len(x) < 2:
        return np.NaN
    i = x['TimeRanInSec'].idxmin()
    nl = x['TimeRanInSec'].nsmallest(2)
    margin = nl.max() - nl.min()
    return [x['Name'].loc[i], margin]
Then:
df.groupby('RaceNumber').apply(get_margin).dropna()
RaceNumber
1 [B, 2]
2 [C, 6]
3 [C, 5]
(Note that in the sample data the 'First' indicator corresponds to the slower time.)

Generate a list with two unique elements with specific length [duplicate]

Simple question here:
I'm trying to get an array that alternates values (1, -1, 1, -1, ...) for a given length. np.repeat just gives me (1, 1, 1, 1, -1, -1, -1, -1). Thoughts?
I like #Benjamin's solution. An alternative though is:
import numpy as np
a = np.empty((15,))
a[::2] = 1
a[1::2] = -1
This also allows for odd-length lists.
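For example, a quick check (length 7 chosen arbitrarily to show the odd-length case):

import numpy as np

a = np.empty((7,))
a[::2] = 1
a[1::2] = -1
print(a)  # [ 1. -1.  1. -1.  1. -1.  1.]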
EDIT: Also, just to note speeds for an array of 10000 elements:
import numpy as np
from timeit import Timer

if __name__ == '__main__':

    setupstr = """
import itertools
import numpy as np
N = 10000
"""

    method1 = """
a = np.empty((N,), int)
a[::2] = 1
a[1::2] = -1
"""

    method2 = """
a = np.tile([1, -1], N)
"""

    method3 = """
a = np.array([1, -1] * N)
"""

    method4 = """
a = np.array(list(itertools.islice(itertools.cycle((1, -1)), N)))
"""

    nl = 1000
    t1 = Timer(method1, setupstr).timeit(nl)
    t2 = Timer(method2, setupstr).timeit(nl)
    t3 = Timer(method3, setupstr).timeit(nl)
    t4 = Timer(method4, setupstr).timeit(nl)

    print('method1', t1)
    print('method2', t2)
    print('method3', t3)
    print('method4', t4)
Results in timings of:
method1 0.0130500793457
method2 0.114426136017
method3 4.30518102646
method4 2.84446692467
If N = 100, things start to even out, but starting with the empty numpy array is still significantly faster (nl changed to 10000):
method1 0.05735206604
method2 0.323992013931
method3 0.556654930115
method4 0.46702003479
Numpy arrays are special awesome objects and should not be treated like python lists.
use resize():
In [38]: np.resize([1,-1], 10) # 10 is the length of result array
Out[38]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
it can produce odd-length array:
In [39]: np.resize([1,-1], 11)
Out[39]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1])
Use numpy.tile!
import numpy
a = numpy.tile([1,-1], 15)
use multiplication:
[1,-1] * n
If you want a memory-efficient solution, try this:

def alternator(n):
    for i in range(n):
        if i % 2 == 0:
            yield 1
        else:
            yield -1

Then you can iterate over the values like so:

for i in alternator(n):
    # do something with i
Maybe you're looking for itertools.cycle?

import itertools

list_ = (1, -1, 2, -2)  # ,3,-3, ...
for n, item in enumerate(itertools.cycle(list_)):
    if n == 30:
        break
    print(item)
I'll just throw these out there because they could be more useful in some circumstances.
If you just want to alternate between positive and negative:
[(-1)**i for i in range(n)]
or for a more general solution
nums = [1, -1, 2]
[nums[i % len(nums)] for i in range(n)]
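As a further illustration (not from any of the answers above), the same general pattern can be built in numpy by tiling it enough times and trimming to the requested length; a minimal sketch:

import numpy as np

def repeat_pattern(nums, n):
    """Tile the pattern ceil(n / len(nums)) times, then trim to exactly n elements."""
    reps = -(-n // len(nums))  # ceiling division
    return np.tile(nums, reps)[:n]

print(repeat_pattern([1, -1], 7))     # [ 1 -1  1 -1  1 -1  1]
print(repeat_pattern([1, -1, 2], 5))  # [ 1 -1  2  1 -1]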

Splitting a matrix using an array of indices

I have a matrix that I want to split up into two. The two new matrices are sort of tangled together, but I do have "start" and "stop" arrays indicating which rows belong to each new matrix.
I have given a small example below, including my own solution, which I do not find satisfying.
Is there a smarter way of splitting the matrix?
Note that there is a certain periodicity in this example, which is not the case in the real matrix.
import numpy as np

np.random.seed(1)
a = np.random.normal(size=[20, 2])
print(a)

b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])

b = a[b_start[0]:b_stop[0], :]
c = a[c_start[0]:c_stop[0], :]
for i in range(1, len(b_start)):
    b = np.append(b, a[b_start[i]:b_stop[i], :], axis=0)
    c = np.append(c, a[c_start[i]:c_stop[i], :], axis=0)

print(b)
print(c)
You can use the fancy indexing functionality of numpy:
index_b = np.array([np.arange(b_start[i], b_stop[i]) for i in range(b_start.size)])
index_c = np.array([np.arange(c_start[i], c_stop[i]) for i in range(c_start.size)])
b = a[index_b].reshape(-1, a.shape[1])
c = a[index_c].reshape(-1, a.shape[1])
This will give you the same output.
Test run:
import numpy as np
np.random.seed(1)
a = np.random.normal(size=[20,2])
print(a)
b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])
index_b = np.array([np.arange(b_start[i], b_stop[i]) for i in range(b_start.size)])
index_c = np.array([np.arange(c_start[i], c_stop[i]) for i in range(c_start.size)])
b = a[index_b].reshape(-1, a.shape[1])
c = a[index_c].reshape(-1, a.shape[1])
print(b)
print(c)
Output:
[[ 1.62434536 -0.61175641]
[-0.52817175 -1.07296862]
[ 1.46210794 -2.06014071]
[-0.3224172 -0.38405435]
[-1.10061918 1.14472371]
[ 0.90159072 0.50249434]
[-0.69166075 -0.39675353]
[-0.6871727 -0.84520564]]
[[ 0.86540763 -2.3015387 ]
[ 1.74481176 -0.7612069 ]
[ 0.3190391 -0.24937038]
[ 1.13376944 -1.09989127]
[-0.17242821 -0.87785842]
[ 0.04221375 0.58281521]
[ 0.90085595 -0.68372786]
[-0.12289023 -0.93576943]
[-0.26788808 0.53035547]
[-0.67124613 -0.0126646 ]
[-1.11731035 0.2344157 ]
[ 1.65980218 0.74204416]]
I did 100 runs of the two approaches; the running times are:
0.008551359176635742  # python for loop
0.0034341812133789062  # fancy indexing
And 10000 runs:
0.18994426727294922  # python for loop
0.26583170890808105  # fancy indexing
Congratulations on using np.append correctly. A lot of posters have problems with it.
But it is faster to collect the values in a list and do a single concatenate at the end. np.append makes a whole new array each time; list append just adds a pointer to the list in place.
b = []
c = []
for i in range(len(b_start)):  # start at 0 here, since b and c begin empty
    b.append(a[b_start[i]:b_stop[i], :])
    c.append(a[c_start[i]:c_stop[i], :])
b = np.concatenate(b, axis=0)
c = np.concatenate(c, axis=0)
or even
b = np.concatenate([a[i:j,:] for i,j in zip(b_start, b_stop)], axis=0)
The other answer does
idx = np.hstack([np.arange(i,j) for i,j in zip(b_start, b_stop)])
a[idx,:]
Based on previous SO questions I expect the two approaches to have about the same speed.
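For illustration only (not part of either answer), a boolean-mask variant that skips building index arrays; it assumes, as in the example, that the start/stop intervals within each set do not overlap:

import numpy as np

np.random.seed(1)
a = np.random.normal(size=[20, 2])
b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])

rows = np.arange(len(a))
# A row belongs to b if it falls inside any half-open [b_start, b_stop) interval
b_mask = ((rows[:, None] >= b_start) & (rows[:, None] < b_stop)).any(axis=1)
c_mask = ((rows[:, None] >= c_start) & (rows[:, None] < c_stop)).any(axis=1)

b = a[b_mask]
c = a[c_mask]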

ARMA model order selection using arma_order_select_ic from statsmodel

I am using arma_order_select_ic from the statsmodels library to calculate the (p, q) order for the ARMA model, and I am using a for loop to loop over the different companies, which are in the columns of the dataframe. The code is as follows:
import pandas as pd
from statsmodels.tsa.stattools import arma_order_select_ic

df = pd.read_csv("Adjusted_Log_Returns.csv", index_col='Date').dropna()
main_df = pd.DataFrame()

for i in range(146):
    order_selection = arma_order_select_ic(df.iloc[i].values, max_ar=4,
                                           max_ma=2, ic="aic")
    ticker = [df.columns[i]]
    df_aic_min = pd.DataFrame([order_selection["aic_min_order"]], index=ticker)
    main_df = main_df.append(df_aic_min)

main_df.to_csv("aic_min_orders.csv")
The code runs fine and I get all the results in the CSV file at the end, but what's confusing me is that when I compute the (p, q) order outside the for loop for a single company, I get different results:
order_selection = arma_order_select_ic(df["ABL"].values, max_ar=4,
                                       max_ma=2, ic="aic")
The order for the company ABL is (1, 1) when computed in the for loop, while it's (4, 1) when computed outside of it.
So my question is: what am I doing wrong, or why is it like this? Any help would be appreciated.
Thanks in advance.
It's pretty clear from your code that you're trying to find the parameters for an ARMA model on each column's data, but that's not what the code is doing: inside the loop you're finding the parameters for the rows.
Consider this:
import pandas as pd
df = pd.DataFrame({'a': [3, 4]})
>>> df.iloc[0]
a 3
Name: 0, dtype: int64
>>> df['a']
0 3
1 4
Name: a, dtype: int64
You should probably change your code to:

for c in df.columns:
    order_selection = arma_order_select_ic(df[c].values, max_ar=4,
                                           max_ma=2, ic="aic")
    ticker = [c]
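For completeness, a minimal sketch of the full column-wise loop under the question's assumptions (the CSV file name and the 'Date' index column come from the question; pd.concat is used here instead of the older DataFrame.append):

import pandas as pd
from statsmodels.tsa.stattools import arma_order_select_ic

df = pd.read_csv("Adjusted_Log_Returns.csv", index_col='Date').dropna()

rows = []
for c in df.columns:
    # Select the order on the column's series (one company), not on a row
    order_selection = arma_order_select_ic(df[c].values, max_ar=4,
                                           max_ma=2, ic="aic")
    rows.append(pd.DataFrame([order_selection["aic_min_order"]], index=[c]))

main_df = pd.concat(rows)
main_df.to_csv("aic_min_orders.csv")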
