What cluster settings for the best Python performance in Databricks - databricks

I have a few Python Notebooks in Databricks, however running Python is very slow in Databricks from my experience, in fact, my local machine (m1 Mac) zips through the same queries significantly faster.
I've tried messing around with the Cluster, but nothing seems to help with the efficiency, is there any particular setting I should focus on to speed up the Python queries?
I've tried increasing memory, worker count, worker type, driver type, etc... nothing seems to work. Also, just to note, it's not a matter of loading in the data, that takes seconds -- it's once I start qoing through my algo that pretty much consists of Pandas transformations is where it takes forever, as if there is not enough processing power.
Also, probably not fully understanding how to optimize for this type of work, so any guidance appreciated.

Related

Confusion about the data location when applying Scikit-learn on cluster (Dask)

I'm currently working on implementing machine learning (Scikit-Learn) from a single machine to a Slurm cluster via dask. According to some tutorials (e.g. https://examples.dask.org/machine-learning/scale-scikit-learn.html), it's quite simple by using job_lib.parallel_backend('dask'). However, the location of the read in data confuses me and none of the tutorials mention it. Should I use dask.dataframe to read in data to make sure it is passed to the cluster or it doesn't matter if I just read in it using pd.dataframe (then the data is stored in the RAM of which machine I run the Jupiter notebook)?
Thank you very much.
If your data is small enough (which it is in the tutorial), and preprocessing steps are rather trivial, then it is okay to read in with pandas. This will read the data in to your local session, not yet any of the dask workers. Once you call with joblib.parallel_backend('dask'), the data will be copied to each worker process and the scikit work will be done there.
If your data is large or you have intensive preprocessing steps its best to "load" the data with dask, and then use dask's built-in preprocessing and grid search where possible. In this case the data will actually be loaded directly from the workers, because of dask's lazy execution paradigm. Dask's grid search will also cache repeated steps of the cross validation and can speed up computation immensely. More can be found here: https://ml.dask.org/hyper-parameter-search.html

Speeding up dataframe.to_excel operations by a GPU

I was working on extracting some data wherein I constantly need to manipulate some part of fetched data and then append it to another dataframe which contains the combined dataset. I constantly save the dataframe using dataframe.to_excel. Since there is a lot of data, it has started to become a time taking operation, reading the previous file, appending and saving it again, inspite of ample of CPU and RAM. I am using GCP, an N1 type 8vCPU along a 30GB memory. Moreover since I am running various instances of the same script for various projects together, would using a GPU speed these things up ?
I never did it by myself but I think this is possible by using some Pandas alternative.
I found this thread which users seems to provide some solutions to a similar question.
I too have not tried this. I could offer couple of suggestions
rather than to_excel try to use to_csv probably there might be small gains.
you can try this library https://github.com/modin-project/modin, this library seems to make the read and operations faster, but i am not sure able to the write operations.
or you could move it to to_excel line to a different function and perform that operation by spinning out a new thread.

How big can batches in Flink respectively Spark get?

I am currently working on a framework for analysis application of an large scale experiment. The experiment contains about 40 instruments each generating about a GB/s with ns timestamps. The data is intended to be analysed in time chunks.
For the implemetation I would like to know how big such a "chunk" aka batch can get before Flink or Spark stop processing the data. I think it goes with out saying that I intend to recollect the processed data.
For live data analysis
In general, there is no hard limit on how much data you can process with the systems. It all depends on how many nodes you have and what kind of a query you have.
As it sounds as you would mainly want to aggregate per instrument on a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then, the question arises on how big your time chunks are/how complex the aggregations become. Assuming that your aggregation requires all data of a window to be present, then the system needs to hold 1 GB per second. So if you window is one hour, the system needs to hold at least 3.6 TB of data.
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you rather need to calculate small values (like sums, averages), main memory shouldn't become an issue.
For old data analysis
When analysis old data, the system can do batch processing and have much more options to handle the volume including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystem and the suit for you. Which languages do you want to use? It feels like using Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.

Is it better to create many small Spark clusters or a smaller number of very large clusters

I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as csv and json files. The directory structure of my Google bucket will look something like this:
root_dir
web_logs
\input (subdirectory)
\output (subdirectory)
network_logs (same subdirectories as web_logs)
system_logs (same subdirectories as web_logs)
The directory structure under the \input directories is arbitrary. Spark jobs pick up all of their data from the \input directory and place it in the \output directory. There is an arbitrary number of *_logs directories.
My current plan is to split the entire wrangling task into about 2000 jobs and use the cloud dataproc api to spin up a cluster, do the job, and close down. Another option would be to create a smaller number of very large clusters and just send jobs to the larger clusters instead.
The first approach is being considered because each individual job is taking about an hour to complete. Simply waiting for one job to finish before starting the other will take too much time.
My questions are: 1) besides the cluster startup costs, are there any downside to taking the first approach? and 2) is there a better alternative?
Thanks so much in advance!
Besides startup overhead, the main other consideration when using single-use clusters per job is that some jobs might be more prone to "stragglers" where data skew leads to a small number of tasks taking much longer than other tasks, so that the cluster isn't efficiently utilized near the end of the job. In some cases this can be mitigated by explicitly downscaling, combined with the help of graceful decommissioning, but if a job is shaped such that many "map" partitions produce shuffle output across all the nodes but there are "reduce" stragglers, then you can't safely downscale nodes that are still responsible for serving shuffle data.
That said, in many cases, simply tuning the size/number of partitions to occur in several "waves" (i.e. if you have 100 cores working, carving the work into something like 1000 to 10,000 partitions) helps mitigate the straggler problem even in the presence of data skew, and the downside is on par with startup overhead.
Despite the overhead of startup and stragglers, though, usually the pros of using new ephemeral clusters per-job vastly outweigh the cons; maintaining perfect utilization of a large shared cluster isn't easy either, and the benefits of using ephemeral clusters includes vastly improved agility and scalability, letting you optionally adopt new software versions, switch regions, switch machine types, incorporate brand-new hardware features (like GPUs) if they become needed, etc. Here's a blog post by Thumbtack discussing the benefits of such "job-scoped clusters" on Dataproc.
A slightly different architecture if your jobs are very short (i.e. if each one only runs a couple minutes and thus amplify the downside of startup overhead) or the straggler problem is unsolveable, is to use "pools" of clusters. This blog post touches on using "labels" to easily maintain pools of larger clusters where you still teardown/create clusters regularly to ensure agility of version updates, adopting new hardware, etc.
You might want to explore my solution for Autoscaling Google Dataproc Clusters
The source code can be found here

Why is Spark Mllib KMeans algorithm extremely slow?

I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 Million rows, 100 cols. I'm using Mllib KMeans also and it is extremely slow. The job never finishes in fact and I have to kill it. I am running this on Google cloud (dataproc). It runs if I ask for a smaller number of clusters (k=1000), but still take more than 35 minutes. I need it to run for k~5000. I have no idea why is it so slow. The data is properly partitioned given the number of workers/nodes and SVD on a 1 million x ~300,000 col matrix takes ~3 minutes, but when it comes to KMeans it just goes into a black hole. I am now trying a lower number of iterations (2 instead of 100), but I feel something is wrong somewhere.
KMeansModel Cs = KMeans.train(datamatrix, k, 100);//100 iteration, changed to 2 now. # of clusters k=1000 or 5000
It looks like the reason is relatively simple. You use quite large k and combine it with an expensive initialization algorithm.
By default Spark is using as distributed variant of K-means++ called K-means|| (see What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?). Distributed version is roughly O(k) so with larger k you can expect slower start. This should explain why you see no improvement when you reduce number of iterations.
Using large K is also expensive when model is trained. Spark is using a variant of Lloyds which is roughly O(nkdi).
If you expect complex structure of the data there most likely a better algorithms out there to handle this than K-Means but if you really want to stick with it you start with using random initialization.
Please try other implementations of k-means. Some like the variants in ELKI are way better than Spark, even on only a single CPU. You will be surprised how much performance you can get out of a single node, without going to a cluster! From my experiments, you would need at least a 100 node cluster to beat good local implementations, unfortunately.
I read that these C++ versions are multi-core (but single-node) and probably the fastest K-means you can find right now, but I have not yet tried that myself yet (for all my needs, the ELKI versions were bazingly fast, finishing in a few seconds on my largest data sets).

Resources