Optimizing Pyspark UDF on large data - apache-spark

I am trying to optimize this code, which creates a dummy column when a PySpark DataFrame column's value is in categories.
On 100K rows it takes about 30 seconds to run. My data has around 20M rows, so it will take far longer.
from pyspark.sql.functions import col, UserDefinedFunction
from pyspark.sql.types import IntegerType

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    udf = UserDefinedFunction(lambda x: 1 if x in categories else 0, IntegerType())
    dframe = dframe.withColumn(str(top_name), udf(col(col_name))).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe
In other words, how do I optimize this function and cut the total time down given the volume of my data? And how can I make sure that this function does not iterate row by row over my data?
I appreciate your suggestions. Thanks.

You don't need a UDF for encoding the categories. You can use isin:
import pyspark.sql.functions as F

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    dframe = dframe.withColumn(str(top_name), F.col(col_name).isin(categories).cast("int")).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe
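For reference, here is a minimal usage sketch; the SparkSession setup, the example rows, and the "id"/"fruit"/"is_fruit" names are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data.
df = spark.createDataFrame(
    [(1, "apple"), (2, "banana"), (3, "carrot")],
    ["id", "fruit"],
)

result = create_dummy(df, "fruit", "is_fruit", ["apple", "banana"], lst_tmp_col=["id"])
result.show()
# Expected: is_fruit is 1 for apple and banana, 0 for carrot.

Because isin compiles to a native Catalyst expression, Spark can optimize it and avoid the per-row Python serialization overhead that a UDF incurs.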

Related

How to apply a function quickly to a list of DataFrames in Python?

I have a list of DataFrames with the same numbers of columns and rows but different values, such as
data = [df1, df2, df3, ..., dfn].
How can I apply a function to each DataFrame in the list data? I used the following code, but it does not work:
import numpy as np

data = [df1, df2, df3, ..., dfn]

def maxloc(data):
    data['loc_max'] = np.zeros(len(data))
    for i in range(1, len(data) - 1):  # from the second value on
        if data['q_value'][i] >= data['q_value'][i - 1] and data['q_value'][i] >= data['q_value'][i + 1]:
            data['loc_max'][i] = 1
    return data

df_list = [df.pipe(maxloc) for df in data]
It seems to me that the problem is in your maxloc() function, as the following code works.
I also added the maximum value to the return of maxloc.
from random import randrange
import pandas as pd

def maxloc(data_frame):
    max_index = data_frame['Value'].idxmax(0)
    maximum = data_frame['Value'][max_index]
    return max_index, maximum

# create a test list of data frames
data = []
for i in range(5):
    temp = []
    for j in range(10):
        temp.append(randrange(100))
    df = pd.DataFrame({'Value': temp}, index=range(10))
    data.append(df)

df_list = [df.pipe(maxloc) for df in data]

for i, (index, value) in enumerate(df_list):
    print(f"Data-frame {i:02d}: maximum = {value} at position {index}")

Performing a calculation on every item in a DataFrame

I have a large pandas DataFrame of 1M rows. I want to perform a calculation on every item and create a new DataFrame from it.
The way I'm currently doing it is painfully slow. Any thoughts on how I might improve the efficiency?
# Create some random data in a DataFrame
import pandas as pd
import numpy as np

dfData = pd.DataFrame(np.random.randint(0, 1000, size=(100, 10)), columns=list('ABCDEFGHIJ'))

# Key values
colTotals = dfData.sum(axis=0)
rowTotals = dfData.sum(axis=1)
total = dfData.values.sum()

dfIdx = pd.DataFrame()
for respId, row in dfData.iterrows():
    for colId, score in row.iteritems():
        # Do the calculation
        idx = (score / colTotals[colId]) * (total / rowTotals[respId]) * 100
        dfIdx.loc[respId, colId] = idx
I think this one-liner reproduces the logic of your code:
dfData.div(colTotals).mul((total / rowTotals) * 100, 0)
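As a quick sanity check, you can compare the vectorized expression against the loop-built result on the sample data above (a sketch reusing the dfData, colTotals, rowTotals, total and dfIdx names already defined in the question):

import numpy as np

dfVectorized = dfData.div(colTotals).mul((total / rowTotals) * 100, 0)

# Should print True: both approaches give the same values up to floating-point noise.
print(np.allclose(dfIdx, dfVectorized))

The vectorized version runs as a handful of whole-array operations instead of one Python-level .loc assignment per cell.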

Improvement of a Python script | Performance

I wrote some code, but it is very slow. The goal is to look for matches; they do not have to be one-to-one matches.
I have a data frame with about 3,600,000 entries --> "SingleDff"
I have a data frame with about 110,000 entries --> "dfnumbers"
The code tries to find out which of the 110,000 entries can be found in the 3,600,000.
I added a counter to see how "fast" it is. After 24 hours I only had 11,000 entries, 10% of the total.
I am now looking for ways and/or ideas to improve the performance of the code.
The Code:
import os
import glob
import numpy as np
import pandas as pd

# Preparation
pathfiles = 'C:\\Python\\Data\\Input\\'
df_Files = glob.glob(pathfiles + "*.csv")
df_Files = [pd.read_csv(f, encoding='utf-8', sep=';', low_memory=False) for f in df_Files]
SingleDff = pd.concat(df_Files, ignore_index=True, sort=True)

dfnumbers = pd.read_excel('C:\\Python\\Data\\Input\\UniqueNumbers.xlsx')

# Output
outputDf = pd.DataFrame()
SingleDff['isRelevant'] = np.nan

count = 0
max = len(dfnumbers['Korrigierter Wert'])
arrayVal = dfnumbers['Korrigierter Wert']

for txt in arrayVal:
    outputDf = outputDf.append(SingleDff[SingleDff['name'].str.contains(txt)], ignore_index=True)
    outputDf['isRelevant'] = np.where(outputDf['isRelevant'].isnull(), txt, outputDf['isRelevant'])
    count += 1

outputDf.to_csv('output_match.csv')
Edit:
Example of the data
In the 110,000-row data frame I have entries like this:
ABCD-12345-1245-T1
ACDB-98765-001 AHHX800.0-3
In the huge data frame I have entries like:
AHSG200-B0097小样图.dwg
MUDI-070097-0-05-00.dwg
ABCD-12345-1245.xlsx
ABCD-12345-1245.pdf
ABCD-12345.xlsx
Now I try to find matches, i.e. for which numbers we can find documents.
Thank you for your input.
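No answer is shown here, but a common way to speed up this kind of search is to collapse the 110,000 search values into a single regular expression and scan the big frame once with str.extract, instead of appending inside a loop. A rough sketch under that assumption, reusing the SingleDff and dfnumbers names from the question:

import re

# Build one alternation pattern from all search values.
# re.escape keeps characters such as '.' and '-' from being treated as regex syntax.
pattern = '|'.join(re.escape(str(v)) for v in dfnumbers['Korrigierter Wert'].dropna())

# One pass over the big frame: extract the first matching value (NaN where nothing matches).
SingleDff['isRelevant'] = SingleDff['name'].str.extract(f'({pattern})', expand=False)

outputDf = SingleDff[SingleDff['isRelevant'].notna()]
outputDf.to_csv('output_match.csv')

Whether this is fast enough depends on the data; if the IDs always sit at the start of the file name, extracting that prefix and merging on it would likely be faster still.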

Resample time series after removing top x percentile data

I have hourly time series data (say df, with date/time and value columns) where I want to:
Step 1: Remove the top 5 percent of each day (values above the daily 95th percentile)
Step 2: Get the max (of Step 1) for each day
Step 3: Get the mean (of Step 2) for each month
Here is what I have tried to implement the above logic:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
Even though I do not get any code error, the generated output is different from the expected result based on the above 3 steps (I always get a constant value).
Any help will be appreciated.
You are almost there. Your step_1 is a series of booleans with the same index as the original data, so you can use it to filter your DataFrame:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = df[step_1].resample('D').max()
step_3 = step_2.resample('M').mean()
Your first step is a boolean mask, so you need to add an additional step:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000),
                  index=pd.date_range(start='1/1/2019', periods=1000, freq='H'),
                  columns=['my_data'])

mask = df.resample('D').apply(lambda x: x < x.quantile(.95))
step_1 = df[mask]
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
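If the resample-based mask gives you index-alignment trouble, an equivalent sketch using groupby/transform (same made-up hourly test data as above) keeps the per-day 95th percentile on the original hourly index:

daily_q95 = df['my_data'].groupby(df.index.date).transform(lambda s: s.quantile(0.95))

step_1 = df[df['my_data'] < daily_q95]   # drop the top 5% within each day
step_2 = step_1.resample('D').max()      # daily max of what remains
step_3 = step_2.resample('M').mean()     # monthly mean of the daily maxima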

In Python pandas, why can't I add more levels to a MultiIndex?

I would like to create DataFrames whose columns have three levels. Why does the following function not work when applied twice?
import numpy as np
import pandas as pd

def superGroup(dataframe=None, multi_index_name=None):
    out_dataframe = pd.DataFrame(dataframe.values,
                                 index=dataframe.index,
                                 columns=pd.MultiIndex.from_product([[multi_index_name], dataframe.columns]))
    return out_dataframe

ran = pd.DataFrame(np.random.rand(3), columns=["Random"])
ran2 = superGroup(ran, "Hello")
superGroup(ran2, "World")  # Does not work
>>>[Out]: NotImplementedError: isnull is not defined for MultiIndex
Here is a solution I figured out after spending way too much time on this problem. I hope it helps anyone out there who has had the same problem.
def superGroup(dataframe=None, new_level=None):
    """Return the dataframe entered, but with its columns wrapped in a MultiIndex named new_level.

    Parameters
    ----------
    dataframe : DataFrame
    new_level : str

    Returns
    -------
    out_df : DataFrame
    """
    if isinstance(dataframe.columns, pd.MultiIndex):
        # Columns are already a MultiIndex: prepend the new level to the existing ones.
        levels = [list(i.values) for i in dataframe.columns.levels]
        levels = [[new_level]] + levels
        out_df = pd.DataFrame(dataframe.values,
                              index=dataframe.index,
                              columns=pd.MultiIndex.from_product(levels))
        return out_df
    else:
        # Plain Index: wrap the existing columns under the new level.
        out_df = pd.DataFrame(dataframe.values,
                              index=dataframe.index,
                              columns=pd.MultiIndex.from_product([[new_level], dataframe.columns]))
        return out_df
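A quick check of this version, reusing ran from the question, shows that it now stacks levels correctly when applied twice:

ran = pd.DataFrame(np.random.rand(3), columns=["Random"])
ran2 = superGroup(ran, "Hello")
ran3 = superGroup(ran2, "World")   # works the second time as well
print(ran3.columns.nlevels)        # prints 3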
