Featuretools Deep Feature Synthesis (DFS) extremely high overhead - python-3.x

The execution of both ft.dfs(...) and ft.calculate_feature_matrix(...) on some time series to extract the day month and year from a very small dataframe (<1k rows) takes about 800ms. When I compute no features at all, it still takes about 750ms. What is causing this overhead and how can I reduce it?
I've testing different combinations of features as well as testing it on a bunch of small dataframes, and the execution time is pretty constant at 700-800ms.
I've also tested it on much larger dataframes with >1million rows. The execution time without any actual features (primitives) is pretty comparable at around to that with all the date features at around 80-90 seconds. So it seems like the computation time depends on the number of rows but not on the features?
I'm running with a n_jobs=1 to avoid any weirdness with parallelism. It seems to me like featuretools is doing some configuration or setup for the dask back-end every time and that is causing all of the overhead.
es = ft.EntitySet(id="testing")
es = es.entity_from_dataframe(
entity_id="time_series",
make_index=True,
dataframe=df_series[[
"date",
"flag_1",
"flag_2",
"flag_3",
"flag_4"
]],
variable_types={},
index="id",
time_index="date"
)
print(len(data))
features = ft.dfs(entityset=es, target_entity="sales", agg_primitives=[], trans_primitives=[])
The actual output seems to be correct, I am just surprised that FeatureTools would take 800ms to compute nothing on a small dataframe. Is the solution simply to avoid small dataframes and compute everything with a custom primitive on a large dataframe to mitigate the overhead? Or is there a smarter/more correct way of using ft.dfs(...) or ft.compute_feature_matrix.

Related

Pyspark FP growth implementation running slow

I am using the pyspark.ml.fpm (FP Growth) implementation of association rule mining on Spark v2.3.
The spark UI shows that the tasks as the end run very slowly. This seems to be a common problem and might be related to data skew.
Is this the real reason? Is there any solution for this?
I don't want to change the minSupport or minConfidence thresholds because that would effect by results. Removing the columns isn't a solution either.
I was facing a similar issue. One solution you might try is setting a threshold on the amount of products in a transaction. If there are a couple of transactions that have way more products than the average, the tree computed by FP Growth blows up. This causes the runtime increases significantly and the risk for memory errors is much higher.
Hence, doing outlier removal on the transactions with disproportional amount of products might do the trick.
Hope this helps you out a bit :)
Late answer but I also had an issue with long FPGrowth wait times, and the above answer really helped. Implemented as such to filter out anything that's above one standard deviation (this is after the transactions have been grouped):
def clean_transactions(df):
transactions_init = df.withColumn("basket_size", size("basket"))
print('---collecting stats')
df_stats = transactions_init.select(
_mean(col('basket_size')).alias('mean'),
_stddev(col('basket_size')).alias('std')
).collect()
mean = df_stats[0]['mean']
std = df_stats[0]['std']
max_ct = mean + std
print("--filtering out outliers")
transactions_cleaned = transactions_init.filter(transactions_init.basket_size <= max_ct)
return transactions_cleaned

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex software which performs really complex SQL queries (well not queries, Spark plans you know). <-- The plans are dynamic, they change based on user input so I can't "cache" them.
I've got a phase in which spark takes 1.5-2min building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY" and it takes 1minute 20 seconds for the explain to execute.
I've trying breaking the lineage but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (already did, but this task can't be overlapped with anything else).
Any ideas/guide on how to improve the plan builder in Spark? (like for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can run that in parallel and then execute it)
I've tried disabling all possible optimizer rules, setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution is longer so :).
Thanks!,
PS: The plan does contain multiple unions (<20, but quite complex plans inside each union) which are the cause for the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no-one provides more insights).
As I couldn't manage to reduce optimizer times (and well, not sure if reducing optimizer times would be good, as I may lose execution time).
One of the latest parts of my plan was scanning two big tables and getting one column from each one of them (using windows, aggregations etc...).
So I splitted my code in two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates two big tables (cached)
And added one more part:
3- Left Join/enrich the big plan with the output of "2" (this takes like 10seconds, the dataset is not so big) and finish the remainder computation.
Now I launch both actions (1,2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames and then wait+ afterwards perform 3.
With this, while Spark driver (thread 1) is calculating the big plan (~2minutes) the executors will be executing part "2" (which has a small plan, but big scans/shuffles) and then both get "mixed" in like 10-15seconds, which a good improvement in execution time over the 1:30 I save while calculating the plan.
Comparing times:
Before I would have
1:30 Spark optimizing time + 6 minutes execution time
Now I have
max
(
1:30 Spark Optimizing time + 4 minutes execution time,
0:02 Spark Optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
Not so much, but quite a few "expensive" people will be waiting for it to finish :)

Faster way to count values greater than 0 in Spark DataFrame?

I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall clock time. With Python Pandas, it takes a fraction of second. I am aware that for small data sets and local operation, Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Your both attempts create number of jobs proportional to the number of columns. Computing the execution plan and scheduling the job alone are expensive and add significant overhead depending on the amount of data.
Furthermore, data might be loaded from disk and / or parsed each time the job is executed, unless data is fully cached with significant memory safety margin which ensures that the cached data will not be evicted.
This means that in the worst case scenario nested-loop-like structure you use can roughly quadratic in terms of the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
The problem with your approach is that the file is scanned for every column (unless you have cached it in memory). The fastet way with a single FileScan should be:
import org.apache.spark.sql.functions.{explode,array}
val cnt: Long = df
.select(
explode(
array(df.columns.head,df.columns.tail:_*)
).as("cell")
)
.where($"cell">0).count
Still I think it will be slower than with Pandas, as Spark has a certain overhead due to the parallelization engine

For Scikit-Learn's RandomForestRegressor, can I specify a different n_jobs for predictions?

Scikit-Learn's RandomForestRegressor has an n_jobs instance attribute, that, from the documentation:
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If
-1, then the number of jobs is set to the number of cores.
Training the Random Forest model with more than one core is obviously more performant than on a single core. But I have noticed that predictions are a lot slower (approximately 10 times slower) - this is probably because I am using .predict() on an observation-by-observation basis.
Therefore, I would like to train the random forest model on, say, 4 cores, but run the prediction on a single core. (The model is pickled and used in a separate process.)
Is it possible to configure the RandomForestRegressor() in this way?
Oh sure you can, I use a similar strategy for stored-models.
Just set <_aRFRegressorModel_>.n_jobs = 1 upon pickle.load()-ed, before using a .predict() method.
Nota bene:
the amount of work on .predict()-task is pretty "lightweight" if compared to .fit(), so in doubts, what are is core-motivation for tweaking this. Memory could be the issue, once large-scale forests may get a need to get scanned in n_jobs-"many" replicas ( which due to joblib nature re-instate all the python process-state into that many full-scale replicas ... and the new, overhead-strict Amdahl's Law re-fomulation shows one, what a bad idea that was -- to pay a way more than finally earned ( performancewise ) ). This is not an issue for .fit(), where concurrent processes can well adjust the setup overheads ( in my models ~ 4:00:00+ hrs runtime per process ), but right due to this cost/benefit "imbalance", it could be a killer-factor for "lightweight"-.predict(), where not much work is to be done, so masking the process setup/termination costs cannot be done ( and you pay way more than get ).
BTW, do you pickle.dump() object(s) from the top-level namespace? I got issues if not and the stored object(s) did not reconstruct correctly. ( Spent ages on this issue )

Bad performance with window function in streaming job

I use Spark 2.0.2, Kafka 0.10.1 and the spark-streaming-kafka-0-8 integration. I want to do the following:
I extract features in a streaming job out of NetFlow connections and than apply the records to a k-means model. Some of the features are simple ones which are calculated directly from the record. But I also have more complex features which depend on records from a specified time window before. They count how many connections in the last second were to the same host or service as the current one. I decided to use the SQL window functions for this.
So I build window specifications:
val hostCountWindow = Window.partitionBy("plainrecord.ip_dst").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
val serviceCountWindow = Window.partitionBy("service").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
And a function which is called to extract this features on every batch:
def extractTrafficFeatures(dataset: Dataset[Row]) = {
dataset
.withColumn("host_count", count(dataset("plainrecord.ip_dst")).over(hostCountWindow))
.withColumn("srv_count", count(dataset("service")).over(serviceCountWindow))
}
And use this function as follows
stream.map(...).map(...).foreachRDD { rdd =>
val dataframe = rdd.toDF(featureHeaders: _*).transform(extractTrafficFeatures(_))
...
}
The problem is that this has a very bad performance. A batch needs between 1 and 3 seconds for a average input rate of less than 100 records per second. I guess it comes from the partitioning, which produces a lot of shuffling?
I tried to use the RDD API and countByValueAndWindow(). This seems to be much faster, but the code looks way nicer and cleaner with the DataFrame API.
Is there a better way to calculate these features on the streaming data? Or am I doing something wrong here?
Relatively low performance is to be expected here. Your code has to shuffle and sort data twice, once for:
Window
.partitionBy("plainrecord.ip_dst")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
and once for:
Window
.partitionBy("service")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
This will have a huge impact on the runtime and if these are the hard requirements you won't be able to do much better.

Resources