scikit-learn ExtraTreesClassifier hanging - scikit-learn

I'm running scikit-learn on some rather large training datasets, ~1,600,000,000 rows with ~500 features. The platform is Ubuntu Server 14.04; the hardware has 100 GB of RAM and 20 CPU cores.
The test datasets are about half as many rows.
I set n_jobs = 10 and forest_size = 3 * number_of_features, so about 1,700 trees.
If I reduce the number of features to about 350 it works fine, but it never completes the training phase with the full feature set of 500+. The process is still executing and using about 20 GB of RAM, but 0% of the CPU. I have also successfully trained on datasets with ~400,000 rows but twice as many features, which completes after only about an hour.
I am being careful to delete any arrays/objects that are not in use.
Does anyone have any ideas I might try?
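For reference, a minimal sketch of the kind of setup described above, with placeholder data shapes (the real dataset obviously does not fit in a toy array like this):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Placeholder data standing in for the real (much larger) training set.
n_features = 500
X_train = np.random.rand(100_000, n_features).astype(np.float32)
y_train = np.random.randint(0, 2, size=100_000)

# forest_size = 3 * number_of_features, trained on 10 of the 20 cores.
clf = ExtraTreesClassifier(n_estimators=3 * n_features, n_jobs=10)
clf.fit(X_train, y_train)
```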

Installing the current master branch version as suggested by ogrisel worked for me. I did have to do a "make clean" as described here.
The new version seems to be a really big improvement. I hope it is released soon.
Many thanks to ogrisel and the other contributors for such a great piece of software!

Related

Invisible Delays between Spark Jobs

There are 4 major actions (JDBC writes) in the application, plus a few counts, which in total take around 4-5 minutes to complete.
But the total uptime of the application is around 12-13 minutes.
I see there are certain jobs named "run at ThreadPoolExecutor.java:1149". The long invisible delays occur just before these jobs show up in the Spark UI.
I want to know the possible causes of these delays.
My application reads 8-10 CSV files and 5-6 views from tables. There are around 59 joins, a few groupBy with agg(sum), and 3 unions.
I am not able to reproduce the issue in the DEV/UAT environments since there isn't that much data there.
It only happens in production, where my manager runs the app for me.
If anyone has come across such delays in their jobs, please share your experience of what the potential cause could be. Currently I am working around the unions, i.e. caching the associated DataFrames and calling count so that the coming union benefits from the cache (sketched just below; yet to test whether the unions are the reason for the delays).
Similarly, I tried breaking the long chain of transformations with a cache and count in between, to cut the long lineage.
The time dropped from the initial 18 minutes to 12 minutes, but the invisible delays still persist.
Thanks in advance
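A minimal PySpark sketch of that cache-and-count workaround (file paths and DataFrame names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs standing in for the real CSV files / views.
df_a = spark.read.csv("input_a.csv", header=True)
df_b = spark.read.csv("input_b.csv", header=True)

# Materialize each branch before the union so the union reads cached data
# instead of re-computing (and re-planning) the full upstream lineage.
df_a = df_a.cache()
df_a.count()   # action that populates the cache
df_b = df_b.cache()
df_b.count()

combined = df_a.union(df_b)
```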
I assume you don't have CPU- or IO-heavy code between your Spark jobs.
If it really is Spark, then 99% it is query-planning delay.
You can use spark.listenerManager.register(QueryExecutionListener) to check different metrics of query-planning performance.

CUDA out of memory when running Bert with Pytorch (Previously worked)

I am building a BERT binary classification on SageMaker using Pytorch.
Previously when I ran the model, I set the batch size to 16 and the model ran successfully. However, after I stopped SageMaker yesterday and restarted it this morning, I can't run the model with a batch size of 16 any more. I am able to run the model with a batch size of 8.
However, the model is not producing the same result (of course). I didn't change anything else in between; all other settings are the same. (Except that I changed the SageMaker volume from 30 GB to 200 GB.)
Does anyone know what may cause this problem? I really want to reproduce the result with batch size 16.
Any answers will help and thank you in advance!
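A small diagnostic sketch for this kind of situation, assuming PyTorch >= 1.10 and a visible CUDA device: printing the GPU model and free/total memory before training is the first thing to compare across the SageMaker restart, since a different instance GPU (or leftover allocations) would explain why batch size 16 no longer fits.

```python
import torch

# Assumes the (restarted) SageMaker instance exposes a CUDA device.
device = torch.device("cuda")
print("GPU:", torch.cuda.get_device_name(device))

free_bytes, total_bytes = torch.cuda.mem_get_info(device)  # PyTorch >= 1.10
print(f"free:  {free_bytes / 1024**3:.1f} GiB")
print(f"total: {total_bytes / 1024**3:.1f} GiB")

# Rough intuition only: a batch of 16 needs roughly twice the activation
# memory of a batch of 8, so even a modest drop in free memory can push
# batch size 16 over the limit while 8 still fits.
```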

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex piece of software which builds really complex SQL queries (well, not queries, Spark plans, you know). The plans are dynamic; they change based on user input, so I can't "cache" them.
I've got a phase in which Spark takes 1.5-2 minutes just building the plan. To make sure, I added a "logXXX" line, then explain(true), then a "logYYY" line, and the explain alone takes 1 minute 20 seconds to execute (sketched below, after the question).
I've tried breaking the lineage, but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (already did, but this task can't be overlapped with anything else).
Any ideas/guide on how to improve the plan builder in Spark? (like for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can run that in parallel and then execute it)
I've tried disabling all possible optimizer rules, setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but then the execution takes longer, so :).
Thanks!
PS: The plan does contain multiple unions (<20, but with quite complex plans inside each union) which are the cause of the time, but splitting them apart also affects execution time.
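A minimal PySpark sketch of the timing check described above, i.e. bracketing explain(true) with log lines to isolate plan-building time (the DataFrame here is a trivial stand-in for the real dynamic plan):

```python
import logging
import time

from pyspark.sql import SparkSession

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

spark = SparkSession.builder.getOrCreate()

# Stand-in for the dynamically built, complex plan from the question.
df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")

log.info("logXXX: before explain")
start = time.monotonic()
df.explain(True)   # forces analysis + optimization of the logical plan
log.info("logYYY: explain took %.1f s", time.monotonic() - start)
```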
Just in case it helps someone (and if no one provides more insights):
Since I couldn't manage to reduce the optimizer time (and, well, I'm not sure reducing it would even be good, as I might lose execution time), I worked around it instead.
One of the last parts of my plan was scanning two big tables and getting one column from each of them (using windows, aggregations, etc.).
So I split my code into two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates the two big tables (cached)
And added one more part:
3- Left join/enrich the big plan with the output of "2" (this takes about 10 seconds, the dataset is not that big) and finish the remaining computation.
Now I launch both actions (1, 2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames, wait for both, and afterwards perform 3 (a sketch of this is at the end of this answer).
With this, while the Spark driver (thread 1) is calculating the big plan (~2 minutes), the executors are already executing part "2" (which has a small plan, but big scans/shuffles), and then both get "mixed" in about 10-15 seconds, which is a good improvement in execution time on top of the 1:30 I save while calculating the plan.
Comparing times:
Before I would have:
1:30 Spark optimizing time + 6 minutes execution time
Now I have:
max(
  1:30 Spark optimizing time + 4 minutes execution time,
  0:02 Spark optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
Not a huge gain, but quite a few "expensive" people are waiting for it to finish :)
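A rough PySpark sketch of the restructuring described in this answer: two cached branches materialized in parallel from driver threads, then the left join of part 3 (all plans here are trivial placeholders for the real ones):

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholders for the real plans.
big_plan = spark.range(10_000_000).withColumn("key", F.col("id") % 1000)   # part 1
small_plan = (spark.range(10_000_000)                                      # part 2
              .withColumn("key", F.col("id") % 1000)
              .groupBy("key")
              .agg(F.sum("id").alias("agg_value")))

def materialize(df):
    # cache + count forces planning and execution of this branch now,
    # on the driver thread that calls it.
    cached = df.cache()
    cached.count()
    return cached

# Parts 1 and 2 are planned/executed concurrently: while the driver thread
# for part 1 spends its ~1:30 in the optimizer, the executors are already
# busy with part 2's scans and shuffles.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_big = pool.submit(materialize, big_plan)
    fut_small = pool.submit(materialize, small_plan)
    big_cached, small_cached = fut_big.result(), fut_small.result()

# Part 3: enrich the big plan with the small aggregated output (~10-15 s).
result = big_cached.join(small_cached, on="key", how="left")
result.count()
```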

For Scikit-Learn's RandomForestRegressor, can I specify a different n_jobs for predictions?

Scikit-Learn's RandomForestRegressor has an n_jobs instance attribute that the documentation describes as:
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
Training the random forest model with more than one core is obviously more performant than training on a single core. But I have noticed that predictions are a lot slower (approximately 10 times slower); this is probably because I am using .predict() on an observation-by-observation basis.
Therefore, I would like to train the random forest model on, say, 4 cores, but run the prediction on a single core. (The model is pickled and used in a separate process.)
Is it possible to configure the RandomForestRegressor() in this way?
Oh sure you can, I use a similar strategy for stored models.
Just set <_aRFRegressorModel_>.n_jobs = 1 on the pickle.load()-ed object, before calling the .predict() method.
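A minimal sketch of that sequence (the file name and feature width are placeholders):

```python
import pickle

import numpy as np

# Load the forest that was trained elsewhere with n_jobs=4 and pickled.
with open("rf_model.pkl", "rb") as f:   # hypothetical file name
    model = pickle.load(f)

# n_jobs is a plain attribute, so it can be changed after unpickling;
# this forces single-core prediction in this process.
model.n_jobs = 1

# Observation-by-observation prediction, as described in the question.
observation = np.zeros((1, 10))   # width must match the training features
print(model.predict(observation))
```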
Nota bene:
The amount of work in a .predict() task is pretty "lightweight" compared to .fit(), so in case of doubt, ask what the core motivation for tweaking this is. Memory could be the issue: large-scale forests may need to be scanned in n_jobs-"many" replicas (which, due to the nature of joblib, re-instantiate the whole Python process state into that many full-scale replicas), and the overhead-strict re-formulation of Amdahl's Law shows what a bad idea that is: paying far more than is finally earned (performance-wise). This is not an issue for .fit(), where the concurrent processes can easily amortise the setup overheads (in my models, ~4:00:00+ hrs of runtime per process), but precisely because of this cost/benefit "imbalance" it can be a killer factor for the "lightweight" .predict(), where not much work is to be done, so the process setup/termination costs cannot be masked (and you pay far more than you get back).
BTW, do you pickle.dump() the object(s) from the top-level namespace? I got issues when I did not, and the stored object(s) did not reconstruct correctly. (Spent ages on this issue.)

How much load on a node is too much?

We have a 10-node Cassandra cluster and would like to monitor load (CPU).
OpsCenter shows the load on the nodes ranging from 0 to the 20s, and sometimes it goes into the 50s and 60s, but normally it's below 25.
We just want to understand what level should be considered alarming.
It's OK as long as you don't have performance issues/timeouts.
Going higher than 80% is not recommended as you don't have time to react. Also consider checking your 60% spikes.
Depends on your hardware and installation.
I recommend stress-testing your system, because we don't know how many writes and reads you are doing. You need to reserve one core for Linux and keep the load under 80%.
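The standard tool for that is the cassandra-stress utility that ships with Cassandra. As a very rough Python-side alternative (not a substitute for cassandra-stress), a crude write/read probe with the DataStax driver could look like this; the contact point, keyspace, and table are hypothetical:

```python
import time

from cassandra.cluster import Cluster  # pip install cassandra-driver

# Hypothetical contact point and schema; adjust to the real cluster.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect("stress_ks")

insert = session.prepare("INSERT INTO probe (id, payload) VALUES (?, ?)")
select = session.prepare("SELECT payload FROM probe WHERE id = ?")

start = time.monotonic()
for i in range(10_000):
    session.execute(insert, (i, "x" * 100))
    session.execute(select, (i,))
elapsed = time.monotonic() - start

# Watch node CPU/load (e.g. in OpsCenter) while this runs.
print(f"{20_000 / elapsed:.0f} requests/s")

cluster.shutdown()
```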
