pandas group by in parallel - python-3.x

I'm having some trouble splitting the aggregation step of a group-by operation across multiple cores. I have the following working code, and would like to apply it over several processors:
import pandas as pd
import numpy as np
from multiprocessing import Pool, cpu_count
mydf = pd.DataFrame({'v1':[1,2,3,4]*6,'v2':['a','b','c']*8,'v3':np.arange(20,44)})
Which I can then apply the following GroupBy operation:
(the step I wish to do in parallel)
pd.groupby(mydf,by=['v1','v2']).apply(lambda x: np.percentile(x['v3'],[20,30]))
yielding the series:
1 a [22.4, 23.6]
b [26.4, 27.6]
c [30.4, 31.6]
2 a [31.4, 32.6]
b [23.4, 24.6]
c [27.4, 28.6]
I Tried the following, with reference to:parallel groupby
def applyParallel(dfGrouped, func):
with Pool(1) as p:
ret_list = p.map(func, [group for name, group in dfGrouped])
return pd.concat(ret_list)
def myfunc(df):
df['pct1'] = df.loc[:,['v3']].apply(np.percentile,args=([20],))
df['pct2'] = df.loc[:,['v3']].apply(np.percentile,args=([80],))
return(df)
grouped = pd.groupby(mydf,by=['v1','v2'])
applyParallel(grouped,myfunc)
But I'm losing the index structure and getting duplicates. I could probably solve this step with a further group by operation, but I think it shouldn't be too difficult to avoid it entirely. Any suggestions?

Not that I'm still looking for an answer, but It'd probably be better to use a library that handles parallel manipulations of pandas DataFrames, rather than trying to do so manually.
Dask is one option which is intended to scale Pandas operations with little code modification.
Another option (but is maybe a little more difficult to set up) is PySpark

Related

Trying to use pandas to group all data in Column B [duplicate]

I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
When you want to count the frequency of categorical data in a column in pandas dataFrame use: df['Column_Name'].value_counts()
-Source.
Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (default is True)
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
Other possible approaches to count occurrences could be to use (i) Counter from collections module, (ii) unique from numpy library and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if it is dtype int, then the fastest method is numpy.unique:
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
s = pd.Series(np.random.randint(2*n, size=n))
return s.astype(str) if dt=='obj' else s
def plot_perfplot(datatype):
perfplot.show(
setup = lambda n: creator(n, datatype),
kernels = [lambda s: s.value_counts(),
lambda s: pd.Series(Counter(s)),
lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index = ic[0]),
lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
],
labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
n_range = [2 ** k for k in range(5, 25)],
equality_check = lambda *x: (d:= pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
xlabel = '~len(s)',
title = f'dtype {datatype}'
)
plot_perfplot('obj')
plot_perfplot('int')

Network bound transformation and threading

I am trying to use a REST API to enrich data I have in a spark dataframe. The REST API isn't built by me and requires a single input at a time (no batch option). Unfortunately the REST API latency is slower than I would like so my spark applications seem to spend a lot of time waiting for the API to iterate over each row. Although my REST API has higher latency, it does have very high throughput/capacity which does not seem to get fully used by my spark application.
Since my application appears to be network bound, I was wondering if it would make sense to use threading to help improve the speed of my application. Does spark already capable of doing this internally? If using threads does make sense, is there an easy way to accomplish this? Has anybody successfully done this?
I’ve encountered the same problem when fetching data from a blob storage.
Below is a small self-contained dummy example that I think you can easily modify for your needs.
In the example you should be able to register that it takes a lot longer to construct df_slow vs constructing df_fast.
It works by making each worker process a list of rows in parallel, instead of processing one row at a time sequentially.
You might be able to just swap the slowAdd function with your own Row transforming function. The slowAdd function simulates network latency by sleeping 0.1 seconds.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row
# Just some dataframe with numbers
data = [(i,) for i in range(0, 1000)]
df = spark.createDataFrame(data, ["Data"], T.IntegerType())
# Get an rdd that contains 'list of Rows' instead of 'Row'
standardRdd = df.rdd # contains [row1, row3, row3,...]
number_of_partitions = 10
repartionedRdd = standardRdd.repartition(number_of_partitions) # contains [row1, row2, row3,...] but repartioned to increase parallelism
glomRdd = repartionedRdd.glom() # contains roughly [[row1, row2, row3,..., row100], [row101, row102, row103, ...], ...]
# where the number of sublists corresponds to the number of partitions
# Define a transformation function with an artificial delay.
# Substitute this with your own transformation function.
import time
def slowAdd(r):
d = r.asDict()
d["Data"] = d["Data"] + 100
time.sleep(0.1)
return Row(**d)
# Define a function that maps the slowAdd function from 'list of Rows' to 'list of Rows' in parallel
import concurrent.futures
def slowAdd_with_thread_pool(list_of_rows):
thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=100)
return [result for result in thread_pool.map(slowAdd, list_of_rows)]
# Perform a fast mapping from 'list of Rows' to 'Rows'.
transformed_fast_rdd = glomRdd.flatMap(slowAdd_with_thread_pool)
# For reference, perform a slow mapping from 'Rows' to 'Rows'
transformed_slow_rdd = repartionedRdd.map(slowAdd)
# Convert the rdds back to dataframes from the rdd's
df_fast = spark.createDataFrame(transformed_fast_rdd)
#This sum operation will be fast (~100 threads sleeping in parallel on each worker)
df_fast.agg(F.sum(F.col("Data"))).show()
df_slow = spark.createDataFrame(transformed_slow_rdd)
#This sum operation will be slow (1 thread sleeping in parallel on each worker)
df_slow.agg(F.sum(F.col("Data"))).show()

Equivalent of pandas apply in pyspark?

I really want to be able to run complex functions over a whole column of a spark dataframe, as i would do in Pandas with the apply function.
For example, in Pandas I have an apply function that takes a messy domain like sub-subdomain.subdomain.facebook.co.nz/somequerystring and just outputs facebook.com.
How would I do that in Spark?
I have looked at UDF's but I am not clear how I would run it on a single column.
Let's say I have a simple function like below where I extract different bits of a date from the column of the pandas DF:
def format_date(row):
year = int(row['Contract_Renewal'][7:])
month = int(row['Contract_Renewal'][4:6])
day = int(row['Contract_Renewal'][:3])
date = datetime.date(year, month, day)
return date-now
In Pandas I would call it like:
df['days_until'] = df.apply(format_date, axis=1)
Can I achieve the same in Pyspark?
In this scenario, you may be able to use some combination of regexp_extract (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.regexp_extract), regexp_replace (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.regexp_replace), and split (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.split) to reformat the date the strings.
It's not as clean as defining your own function and using apply like Pandas, but it should be more performant than defining a Pandas/Spark UDF.
Good luck!
The latest version of PySpark provides a way to run apply() function by leveraging pandas. You can find the example at PySpark apply Function to Column
# Imports
import pyspark.pandas as ps
import numpy as np
technologies = ({
'Fee' :[20000,25000,30000,22000,np.NaN],
'Discount':[1000,2500,1500,1200,3000]
})
# Create a DataFrame
psdf = ps.DataFrame(technologies)
print(psdf)
def add(row):
return row[0]+row[1]
addDF = psdf.apply(add,axis=1)
print(addDF)

Using loops to call multiple pandas dataframe columns

I am new to python3 and trying to do chisquared tests on columns in a pandas dataframe. My columns are in pairs: observed_count_column_1, expected count_column_1, observed_count_column_2, expected_count_column_2 and so on. I would like to make a loop to get all column pairs done at once.
I succeed doing this if I specify column index integers or column names manually.
This works
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
for n in [0,2,4,6,8,10]:
chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]
The loop code does not seem to run at all and I get no error but no output either.
I was wondering why this is happening and how can I actually approach this?
Thank you,
Dan
Consider building a data frame of chi-square results from list of tuples, then assign column names as indicators for observed and expected frequencies (subsetting even/odd columns by indexed notation):
# CREATE DATA FRAME FROM LIST IF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]) \
for n in range(0,11,2)],
columns = ['chi_sq_stat', 'p_value'])
.assign(obs_freq = df.columns[::2],
exp_freq = df.columns[1::2])
)
chisquare() function returns two values so you can try this:
for n in range(0, 11, 2):
chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]
print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Thank you for the suggestions. Using the information from Parfait comment, that loops don't print I managed to find a solution, although not as elegant as their own solution above.
for n in range(0, 11, 2):
print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan

Poor performance of multiple aggregations with windowing in Pandas

I need to calculate in Pandas a lot of aggregations by Dataframe index and with taking in mind windowing by time (column MONTH). Something like:
# t is my DataFrame
grouped=t.groupby(t.index)
def f(g):
g1=g[g.MONTH<=1]
g2=g[g.MONTH<=5]
agrs=[]
index=[]
for c in cat_columns:
index.append(c+'_EOP')
agrs.append(g.iloc[0][c])
for c in cont_columns:
index.append(c+'_MEAN_2')
mean2=g1[c].mean()
agrs.append(mean2)
index.append(c+'_MEAN_6')
mean6=g2[c].mean()
agrs.append(mean6)
index.append(c+'_MEDIAN_2')
agrs.append(g1[c].median())
index.append(c+'_MEDIAN_6')
agrs.append(g2[c].median())
index.append(c+'_MIN_2')
agrs.append(g1[c].min())
index.append(c+'_MIN_6')
agrs.append(g2[c].min())
index.append(c+'_MAX_2')
agrs.append(g1[c].max())
index.append(c+'_MAX_6')
agrs.append(g2[c].max())
index.append(c+'_MEAN_CHNG')
agrs.append((mean2-mean6)/mean6)
return pd.Series(agrs, index=index)
aggrs=grouped.apply(f)
I have 100-120 attributes in each list: cat_columns and cont_columns and about 1.5 million of rows.
The performance is very slow (I'm waiting already 15 hours). How to speed up it?
Probably there exactly two questions:
1. Can I speed up performance with tuning this code with use of Pandas only?
2. Is it possible to calculate the same aggregations in Dask (I read it is multi-core wrapper over Pandas)? I already tried to parallelize work with help of joblib. Something like (I also added cont_columns to the prototype of f):
def tt(grouped, cont_columns):
return grouped.apply(f, cont_columns)
r = Parallel(n_jobs=4, verbose=True)([delayed(tt)(grouped, cont_columns[:16]),
delayed(tt)(grouped, cont_columns[16:32]),
delayed(tt)(grouped, cont_columns[32:48]),
delayed(tt)(grouped, cont_columns[48:])]
)
But got unlimited recursion error in Pandas groupby.
Pandas experts, please advise!
Thanks!
Sergey.

Resources