Pyspark FP growth implementation running slow - apache-spark

I am using the pyspark.ml.fpm (FP Growth) implementation of association rule mining on Spark v2.3.
The Spark UI shows that the tasks at the end run very slowly. This seems to be a common problem and might be related to data skew.
Is this the real reason? Is there any solution for this?
I don't want to change the minSupport or minConfidence thresholds because that would affect my results. Removing the columns isn't a solution either.

I was facing a similar issue. One solution you might try is setting a threshold on the number of products in a transaction. If there are a couple of transactions that have far more products than the average, the tree computed by FP Growth blows up. This causes the runtime to increase significantly and the risk of memory errors is much higher.
Hence, doing outlier removal on the transactions with a disproportionate number of products might do the trick.
Hope this helps you out a bit :)

Late answer, but I also had an issue with long FPGrowth wait times, and the above answer really helped. I implemented it as follows to filter out any basket whose size is more than one standard deviation above the mean (this is after the transactions have been grouped):
from pyspark.sql.functions import col, size
from pyspark.sql.functions import mean as _mean, stddev as _stddev

def clean_transactions(df):
    # add the basket size as a column
    transactions_init = df.withColumn("basket_size", size("basket"))
    print('---collecting stats')
    df_stats = transactions_init.select(
        _mean(col('basket_size')).alias('mean'),
        _stddev(col('basket_size')).alias('std')
    ).collect()
    mean = df_stats[0]['mean']
    std = df_stats[0]['std']
    max_ct = mean + std
    print("--filtering out outliers")
    # keep only baskets whose size is at most one standard deviation above the mean
    transactions_cleaned = transactions_init.filter(transactions_init.basket_size <= max_ct)
    return transactions_cleaned
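For completeness, a minimal sketch of how the cleaned transactions might feed into FP-Growth; the column name "basket", the variable grouped_df, and the threshold values here are illustrative, not from the original post:

from pyspark.ml.fpm import FPGrowth

# grouped_df is assumed to have an array column named "basket" (one row per transaction)
transactions = clean_transactions(grouped_df)
fp = FPGrowth(itemsCol="basket", minSupport=0.01, minConfidence=0.2)
model = fp.fit(transactions)
model.associationRules.show()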

Related

Another problem with: PerformanceWarning: DataFrame is highly fragmented

Since I am still learning Python, I am running into some optimisation issues here.
I keep getting the warning
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
and it takes a while to run for what I am doing now.
Here is my code:
import numpy as np
import pandas as pd

def Monte_Carlo_for_Tracking_Error(N, S, K, Ru, Rd, r, I, a):
    ldv = []
    lhp = []
    lsp = []
    lod = []
    Tracking_Error_df = pd.DataFrame()
    # Go through different time steps of rebalancing
    for y in range(1, I + 1):
        i = 0
        # repeat the same step `a` times
        while i < a:
            Sample_Stock_Prices = []
            Sample_Hedging_Portfolio = []
            Hedging_Portfolio_Value = np.zeros(N)  # Initialize hedging portfolio
            New_Path = Portfolio_specification(N, S, K, Ru, Rd, r)  # Get a new sample path
            Sample_Stock_Prices.append(New_Path[0])
            Sample_Hedging_Portfolio.append(Changing_Rebalancing_Rythm(New_Path, y))
            Call_Option_Value = []
            Call_Option_Value.append(New_Path[1])
            Differences = np.zeros(N)
            for x in range(N):
                Hedging_Portfolio_Value[x] = Sample_Stock_Prices[0][x] * Sample_Hedging_Portfolio[0][x]
            for z in range(N):
                Differences[z] = Call_Option_Value[0][z] - Hedging_Portfolio_Value[z]
            lhp.append(Hedging_Portfolio_Value)
            lsp.append(np.asarray(Sample_Stock_Prices))
            ldv.append(np.asarray(Sample_Hedging_Portfolio))
            lod.append(np.asarray(Differences))
            # this per-column insert is what triggers the PerformanceWarning
            Tracking_Error_df[f'Index{i + (y - 1) * 200}'] = Differences
            i = i + 1
    return (Tracking_Error_df, lod, lsp, lhp, ldv)
Code starts to give me warnings when I try to run:
Simulation=MCTE(100,100,104,1.05,0.95,0,10,200)
Small part of the warning:
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
I am using a Jupyter notebook for this. If somebody could help me optimise it, I would appreciate it.
I have tested the code and I am hoping for a more performance-oriented version of it.
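For what it's worth, a minimal sketch of the fix the warning itself suggests: collect each Differences array under its intended column name and build the DataFrame once at the end, rather than inserting one column per loop iteration. The helper name and the dictionary are illustrative, not part of the original code:

import pandas as pd

def build_tracking_error_frame(column_arrays):
    # column_arrays: dict mapping column name -> 1-D numpy array of Differences
    # Building the frame in a single call avoids the repeated frame.insert
    # that triggers the fragmentation PerformanceWarning.
    return pd.DataFrame(column_arrays)

# inside the simulation loops (sketch):
#     tracking_columns[f'Index{i + (y - 1) * 200}'] = Differences
# after the loops:
#     Tracking_Error_df = build_tracking_error_frame(tracking_columns)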

How to benchmark pyspark queries?

I have got a simple pyspark script and I would like to benchmark each section.
# section 1: prepare data
df = spark.read.option(...).csv(...)
df.registerTempTable("MyData")
# section 2: Dataframe API
avg_earnings = df.agg({"earnings": "avg"}).show()
# section 3: SQL
avg_earnings = spark.sql("""SELECT AVG(earnings)
FROM MyData""").show()
To generate reliable measurements, one would need to run each section multiple times. My solution using the Python time module looks like this.
import time

for _ in range(iterations):
    t1 = time.time()
    df = spark.read.option(...).csv(...)
    df.registerTempTable("MyData")
    t2 = time.time()
    avg_earnings = df.agg({"earnings": "avg"}).show()
    t3 = time.time()
    avg_earnings = spark.sql("""SELECT AVG(earnings)
                                FROM MyData""").show()
    t4 = time.time()
    write_to_csv(t1, t2, t3, t4)
My question is: how would one benchmark each section? Would you use the time module as well? How would one disable caching for pyspark?
Edit:
Plotting the first 5 iterations of the benchmark shows that pyspark is doing some form of caching.
How can I disable this behaviour?
First, you can't benchmark using show(); it only computes and returns the top 20 rows.
Second, in general, the PySpark DataFrame API and Spark SQL share the same Catalyst optimizer behind the scenes, so what you are doing (.agg vs AVG()) is essentially the same and won't show much of a difference.
Third, benchmarking is usually only meaningful if your data is really big or your operation takes much longer than expected. Other than that, if the runtime difference is only a couple of minutes, it doesn't really matter.
Anyway, to answer your questions:
Yes, there is nothing wrong with using time.time() to measure.
You should use count() instead of show(). count() will actually compute over your entire dataset.
You don't have to worry about caching if you don't call it: Spark won't cache unless you ask it to. In fact, you shouldn't cache at all when benchmarking.
You should also use static allocation instead of dynamic allocation, or, if you're using Databricks or EMR, use a fixed number of workers and don't auto-scale.
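Putting those points together, a minimal sketch of the benchmark loop might look like this; the file path, read options, and iteration count are placeholders, write_to_csv is the question's own helper, and createOrReplaceTempView is the non-deprecated equivalent of registerTempTable:

import time

for _ in range(iterations):
    t1 = time.time()
    df = spark.read.option("header", "true").csv("data.csv")
    df.createOrReplaceTempView("MyData")

    t2 = time.time()
    df.agg({"earnings": "avg"}).count()   # count() forces full evaluation instead of just the top 20 rows

    t3 = time.time()
    spark.sql("SELECT AVG(earnings) FROM MyData").count()

    t4 = time.time()
    write_to_csv(t1, t2, t3, t4)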

Featuretools Deep Feature Synthesis (DFS) extremely high overhead

The execution of both ft.dfs(...) and ft.calculate_feature_matrix(...) on some time series, to extract the day, month and year from a very small dataframe (<1k rows), takes about 800ms. Even when I compute no features at all, it still takes about 750ms. What is causing this overhead and how can I reduce it?
I've tested different combinations of features as well as running it on a bunch of small dataframes, and the execution time is pretty constant at 700-800ms.
I've also tested it on much larger dataframes with >1 million rows. The execution time without any actual features (primitives) is pretty comparable to that with all the date features, at around 80-90 seconds. So it seems like the computation time depends on the number of rows but not on the features?
I'm running with n_jobs=1 to avoid any weirdness with parallelism. It seems to me like featuretools is doing some configuration or setup for the dask back-end every time, and that is causing all of the overhead.
es = ft.EntitySet(id="testing")
es = es.entity_from_dataframe(
    entity_id="time_series",
    make_index=True,
    dataframe=df_series[[
        "date",
        "flag_1",
        "flag_2",
        "flag_3",
        "flag_4"
    ]],
    variable_types={},
    index="id",
    time_index="date"
)
print(len(data))
features = ft.dfs(entityset=es, target_entity="sales", agg_primitives=[], trans_primitives=[])
The actual output seems to be correct; I am just surprised that Featuretools takes 800ms to compute nothing on a small dataframe. Is the solution simply to avoid small dataframes and compute everything with a custom primitive on a large dataframe to mitigate the overhead? Or is there a smarter/more correct way of using ft.dfs(...) or ft.calculate_feature_matrix(...)?

For Scikit-Learn's RandomForestRegressor, can I specify a different n_jobs for predictions?

Scikit-Learn's RandomForestRegressor has an n_jobs instance attribute which, from the documentation:
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If
-1, then the number of jobs is set to the number of cores.
Training the Random Forest model with more than one core is obviously more performant than on a single core. But I have noticed that predictions are a lot slower (approximately 10 times slower) - this is probably because I am using .predict() on an observation-by-observation basis.
Therefore, I would like to train the random forest model on, say, 4 cores, but run the prediction on a single core. (The model is pickled and used in a separate process.)
Is it possible to configure the RandomForestRegressor() in this way?
Oh sure you can, I use a similar strategy for stored-models.
Just set <_aRFRegressorModel_>.n_jobs = 1 on the pickle.load()-ed model, before using the .predict() method.
Nota bene:
The amount of work in a .predict() task is pretty "lightweight" compared to .fit(), so one may wonder what the core motivation for tweaking this is. Memory can be the issue: large-scale forests may need to be scanned in n_jobs-"many" replicas (which, due to the nature of joblib, re-instantiate the whole Python process state into that many full-scale replicas), and the new, overhead-strict re-formulation of Amdahl's Law shows what a bad idea that is -- paying far more than is finally earned (performance-wise). This is not an issue for .fit(), where the concurrent processes can easily amortise the setup overheads (in my models ~ 4:00:00+ hrs runtime per process), but precisely because of this cost/benefit "imbalance" it can be a killer factor for a "lightweight" .predict(), where not much work is to be done, so the process setup/termination costs cannot be masked (and you pay far more than you get).
BTW, do you pickle.dump() the object(s) from the top-level namespace? I ran into issues when I didn't, and the stored object(s) did not reconstruct correctly. (I spent ages on this issue.)
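A minimal sketch of the load-then-predict pattern described above; X_train, y_train, X_new, and the file name are placeholders:

import pickle
from sklearn.ensemble import RandomForestRegressor

# train with several cores, then persist the fitted model
model = RandomForestRegressor(n_estimators=500, n_jobs=4)
model.fit(X_train, y_train)
with open("rf_model.pkl", "wb") as fh:
    pickle.dump(model, fh)

# in the serving process: load and force single-core prediction
with open("rf_model.pkl", "rb") as fh:
    model = pickle.load(fh)
model.n_jobs = 1   # avoids spinning up joblib workers for each lightweight .predict() call
y_hat = model.predict(X_new)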

Spark: rdd.countApprox() vs rdd.count()

Could someone please explain the difference between RDD countApprox() and count(), and if possible, which is faster? It would be of great help. We have a requirement where count() is very slow, taking about 30 minutes. We tried countApprox(); it was fast for the first run (about 1.2 minutes) and then slowed to 30 minutes as well.
This is how we used it; I am not sure whether it's the best way:
rdd.countApprox(timeout=800, confidence=0.5)
count() - Returns the number of elements in the RDD.
countApprox() - Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
countApprox(timeout: Long, confidence: Double)
Default: confidence = 0.95
Note: As per the spark source code, support for countApprox is marked 'Experimental'.
With timeout=800 (milliseconds), you should have seen an approximate count in well under a minute.
Are you sure nothing else is causing this 30-minute slowdown?
Share your code/code-snippet to get more accurate inputs from other members.
Not my answer, but there is a very useful and important answer here.
In short, countApprox.getFinalValue blocks even if this takes longer than the timeout.
getInitialValue does not block, so you will get a response within the timeout.
BUT, as I learned from painful experience, even if you use getInitialValue the job will continue on to the final value.
If you are repeating this in a loop, the getFinalValue computations will keep running for multiple RDDs long after you have retrieved the result from getInitialValue. This can then lead to OOM conditions and broadcast errors that are difficult to diagnose.
rdd.count() is an action, which is an eager operation.
This means that all the other transformations you had written before it will start executing now, because of Spark's lazy evaluation. So essentially it's not only the count() operation that's taking all the time, but all the other operations that were waiting to get executed.
Now coming back to the question of count() vs countApprox().
count() is just like doing a SELECT COUNT(*) FROM table. countApprox() takes a timeout and a confidence level and returns a result that is approximately correct, a number you can live with.
We should use countApprox() when we are more interested in knowing an approximate number and saving time, for example in a streaming application.
count() should be used when you need the exact count, for example to log something or for auditing.
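For reference, a minimal PySpark sketch comparing the two calls; the RDD contents and sizes are illustrative, and the timeout is in milliseconds:

import time

rdd = spark.sparkContext.parallelize(range(10_000_000), 100)

t0 = time.time()
exact = rdd.count()                                        # exact count, blocks until all tasks finish
t1 = time.time()
approx = rdd.countApprox(timeout=800, confidence=0.95)     # best estimate available within ~800 ms
t2 = time.time()

print(f"count()       -> {exact} in {t1 - t0:.2f}s")
print(f"countApprox() -> {approx} in {t2 - t1:.2f}s")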
