Whenever I start an Azure Databricks AutoML run, it samples the dataset, using only around 66% of the rows.
I currently have 40,000 rows, each with 600 features.
Is there a way to force AutoML to use all the rows? I have tried increasing the memory of the compute I am using, but that does not appear to help.
Although AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node.
AutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary.
In Databricks Runtime 9.1 LTS ML through Databricks Runtime 10.5 ML, the sampling fraction does not depend on the cluster’s node type or the amount of memory on each node.
In Databricks Runtime 11.0 ML and above:
• The sampling fraction increases for worker nodes with more memory.
• You can increase the sample size by choosing a Memory optimized worker type when you create the cluster.
• You can also increase the sample size by choosing a larger value for spark.task.cpus in the Spark configuration for the cluster. The default setting is 1; the maximum value is the number of CPUs in the worker. When you increase this value, the sample size is larger, but fewer trials run in parallel. For example, in a machine with 4 cores and 64GB total RAM, the default spark.task.cpus=1 runs 4 trials per worker with each trial limited to 16GB RAM. If you set spark.task.cpus=4, each worker runs only one trial but that trial can use 64GB RAM.
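For illustration, here is a minimal sketch of how this plays out from a notebook; the table name and target column are hypothetical, and spark.task.cpus has to be set in the cluster's Spark config (e.g. the line `spark.task.cpus 4`) when the cluster is created, not from the notebook:

```python
from databricks import automl

# Hypothetical training table -- replace with your own 40,000 x 600 dataset.
df = spark.table("main.training.features_40k")

# Read-only check of the effective setting; it must be configured at cluster
# creation time in the cluster's Spark config.
print("spark.task.cpus =", spark.conf.get("spark.task.cpus", "1"))

# With more memory available per trial (memory-optimized workers or a higher
# spark.task.cpus), AutoML samples less of the dataset, or not at all.
summary = automl.classify(
    dataset=df,
    target_col="label",   # hypothetical target column
    timeout_minutes=120,
)
print(summary.best_trial.model_path)
```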
Reference: https://learn.microsoft.com/en-us/azure/databricks/applications/machine-learning/automl#--sampling-large-datasets
Related
I need a recommendation for cluster design in Databricks; we have an ETL batch load running every 20 minutes.
There are 25+ notebooks doing straight merges into silver-layer tables (facts/dimensions).
The cluster config is as follows:
instance type: F64, compute optimized
worker nodes: 3, each with 128 GB memory x 64 cores
driver node: 1, memory optimized, 64 GB x 8 cores
We need to minimize execution time and increase parallelism.
I am attaching a snapshot from the Spark UI of the Databricks cluster (executor page) for your reference.
[1]: https://i.stack.imgur.com/qMFyf.png
I see red flags for GC time and shuffle read; GC time turns out to be more than 10% of total task time.
How can we bring this down? We are missing our SLA for the load cycle.
Thanks
Try increasing the number of executors and nodes, and give more memory per executor. You can also see which line of your code is taking long by looking at the logs.
You need to ensure that you are not performing actions that involve a lot of data shuffling.
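For example, a quick way to spot shuffles and avoid one with a broadcast hint (a sketch; the silver-layer table and join-key names below are hypothetical, and spark is the notebook's SparkSession):

```python
from pyspark.sql import functions as F

# Hypothetical silver-layer tables and join key -- adjust to your pipeline.
fact = spark.table("silver.fact_orders")
dim = spark.table("silver.dim_customer")

# explain() prints the physical plan; every "Exchange" node is a shuffle.
fact.join(dim, "customer_id").explain()

# Broadcasting the small dimension table turns the shuffle join into a
# broadcast hash join, which reduces shuffle read and GC pressure.
fact.join(F.broadcast(dim), "customer_id").explain()
```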
I am about to train on 5 million rows of data containing 7 categorical variables (string), but will soon be training on 31 million rows.
I am wondering what the maximum number of worker nodes we can use in a cluster is, because even if I type something like 2,000,000, it doesn't show any indication of an error.
Another question: what would be the best way to determine how many worker nodes are needed?
Thank you in advance!
Max cluster size
Dataproc does not limit the number of nodes in a cluster, but other software can have limitations. For example, there are known YARN deployments with around 10k nodes, so going beyond that may not work for the Spark on YARN that Dataproc runs.
Also, you need to take into account GCE limitations like the various quotas (CPU, RAM, disk, external IPs, etc.) and QPS limits, and make sure that you have enough of these for such a large cluster.
I think that 1k nodes is a reasonable size to start from for a large Dataproc cluster if you need it, and you can scale it up further to add more nodes as necessary after cluster creation.
Cluster size estimation
You should determine how many nodes you need based on your workload and the VM size that you want to use. For your use case, it seems you need to find a guide on how to estimate cluster size for ML training.
Alternatively, you can just do a binary search until you are satisfied with the training time. For example, you can start with a cluster of 500 8-core nodes; if training time is too long, increase the cluster size to 600-750 nodes and see whether training time decreases as you expect. Repeat this until you are satisfied with the training time or until it no longer scales/improves.
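If you go the resize route, a rough sketch with the Dataproc Python client (the project, region, and cluster name are placeholders, and the exact call shape may differ between client library versions):

```python
from google.cloud import dataproc_v1

# Placeholders -- substitute your own project, region, and cluster name.
project_id, region, cluster_name = "my-project", "us-central1", "ml-training"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Scale the primary worker count (e.g. 500 -> 600 nodes) and wait for the
# operation to complete before re-running the training job.
operation = client.update_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": cluster_name,
        "cluster": {"config": {"worker_config": {"num_instances": 600}}},
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()
```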
I have a processing pipeline that is built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) in order to produce the functional output. These operations are quite numerous (more than 100), which means I am running around 50 to 60 Spark SQL queries in a single pipeline. While the application completes successfully without any issues, my focus has shifted to optimizing the overall process. I have been able to speed up the execution by tuning spark.sql.shuffle.partitions, changing the executor memory, and reducing spark.memory.fraction from the default 0.6 to 0.2. With all these changes, the overall execution time dropped from 20-25 minutes to around 10 minutes. Data volume is around 100k rows (source side).
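For reference, the tuning above is applied roughly like this (a sketch; the concrete values are illustrative rather than the exact ones I used, and spark.memory.fraction has to be set before the session starts):

```python
from pyspark.sql import SparkSession

# Sketch of the tuning described above; the values are illustrative only.
spark = (
    SparkSession.builder
    .appName("hive-sql-pipeline")
    .enableHiveSupport()
    # Fewer shuffle partitions than the default 200, given ~100k source rows.
    .config("spark.sql.shuffle.partitions", "50")
    # Reduced from the default 0.6; must be set before the session starts.
    .config("spark.memory.fraction", "0.2")
    # Executor memory is adjusted as well (illustrative value).
    .config("spark.executor.memory", "6g")
    .getOrCreate()
)

# The pipeline then chains Spark SQL steps against Hive, e.g.:
df = spark.sql("SELECT * FROM source_db.source_table")  # hypothetical table
df.createOrReplaceTempView("step_1")
```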
The observations that I have from the Cluster are:
- The number of jobs triggered as part of the application ID is 235.
- The total number of stages across all the jobs is around 600.
- 8 executors are used in a two-node cluster (64 GB RAM in total with 10 cores).
- The YARN Resource Manager UI (for an application ID) becomes very slow to retrieve the details of jobs/stages.
In one of the videos on Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum and keep the DAG small. What are the guidelines for doing this? How do I find the number of shuffles that are happening (my SQL queries have many joins and GROUP BY clauses)?
I would like suggestions on the above scenario: what can I do to improve performance and handle data skew in SQL queries that are JOIN/GROUP BY heavy?
Thanks
My use case is to merge two tables, where one table contains 30 million records with 200 columns and the other contains 1 million records with 200 columns. I am using a broadcast join for the small table. I am loading both tables as DataFrames from Hive managed tables on HDFS.
I need the values to set for driver memory, executor memory, and the other related parameters for this use case.
I have the following hardware configuration for my YARN cluster:
Spark Version 2.0.0
HDP version 2.5.3.0-37
1) YARN clients: 20
2) Max virtual cores allocated per container (yarn.scheduler.maximum-allocation-vcores): 19
3) Max memory allocated per YARN container: 216 GB
4) Cluster memory available: 3.1 TB
I can provide any other info you need about this cluster.
I have to decrease the time it takes to complete this process.
I have been using some configurations but I think they are wrong; it took 4.5 minutes to complete, and I think Spark is capable of bringing this time down.
There are mainly two things to look at when you want to speed up your Spark application.
Caching/persistence:
This is not a direct way to speed up the processing. It is useful when you have multiple actions (reduce, join, etc.) and you want to avoid recomputing the RDDs they share (as well as recomputation after failures), which in turn decreases the application run duration.
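A minimal sketch (with a hypothetical input table) of caching a result that feeds several actions:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("caching-sketch").enableHiveSupport().getOrCreate()

# Hypothetical intermediate result reused by several actions.
enriched = spark.table("db.events").filter(F.col("status") == "ok")

# Persist (memory, spilling to disk) so the later actions reuse the
# materialized data instead of recomputing the whole lineage.
enriched.persist(StorageLevel.MEMORY_AND_DISK)

print(enriched.count())                                     # action 1: fills the cache
enriched.groupBy("country").count().show()                  # action 2: served from cache
enriched.write.mode("overwrite").parquet("/tmp/enriched")   # action 3

enriched.unpersist()  # release the cached blocks when done
```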
Increasing the parallelism:
This is the actual solution to speed up your Spark application, and it is achieved by increasing the number of partitions. Depending on the use case, you can increase the partitions in two places:
- Whenever you create your DataFrames/RDDs: this is the better way to increase the partitions, as you don't have to trigger a costly shuffle operation to do so.
- By calling repartition: this will trigger a shuffle operation.
Note: once you increase the number of partitions, also increase the number of executors (possibly a large number of small containers with a few vcores and a few GB of memory each).
Increasing the parallelism inside each executor:
By adding more cores to each executor, you can increase the parallelism at the partition level. This will also speed up the processing.
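A sketch of both levers (the numbers are placeholders, not recommendations):

```python
from pyspark.sql import SparkSession

# Submit-time knobs (placeholder numbers):
#   spark-submit --num-executors 10 --executor-cores 4 --executor-memory 4g my_job.py
spark = SparkSession.builder.appName("my_job").getOrCreate()

# 1) Set the partition count when the data is created -- no shuffle needed.
df = spark.range(0, 10_000_000, numPartitions=200)            # placeholder count
print("at creation:", df.rdd.getNumPartitions())

# 2) ... or repartition an existing DataFrame, which does trigger a shuffle.
existing = spark.read.parquet("/data/input")                  # hypothetical path
existing = existing.repartition(200)
print("after repartition:", existing.rdd.getNumPartitions())
```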
To get a better understanding of the configurations, please refer to this post.
How to decide the number of workers in Spark standalone cluster mode?
The duration decreased when I added workers in standalone cluster mode.
For example, for my 3.5 GB of input data, WordCount took 3.8 min. However, it took 2.6 min after I added one worker with 4 GB of memory.
Is it fine to add workers to tune Spark? I am thinking about the risks of doing that.
My environment settings were as below:
Memory 128 GB, 16 CPUs, for 9 VMs
CentOS
Hadoop 2.5.0-cdh5.2.0
Spark 1.1.0
Input data information
3.5 GB data file from HDFS
You can tune both the executors (the number of JVMs and their memory) as well as the number of tasks. If what you're doing can benefit from parallelism, you can spin up more executors via configuration and increase the number of tasks (by calling repartition/coalesce etc. in your code).
When you set the parallelism, take into account whether you're doing mostly I/O or computation, etc. Generally speaking, the Spark recommendation is 2-3 tasks per CPU core.
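As a rough sketch on a standalone cluster (the master URL, paths, and numbers are placeholders, chosen with the 2-3 tasks per core rule of thumb in mind):

```python
from pyspark import SparkConf, SparkContext

# Placeholder master URL, executor sizing, and parallelism (~2-3 tasks per core).
conf = (
    SparkConf()
    .setMaster("spark://master-host:7077")
    .setAppName("wordcount-tuning")
    .set("spark.executor.memory", "4g")
    .set("spark.executor.cores", "4")
    .set("spark.default.parallelism", "96")
)
sc = SparkContext(conf=conf)

# WordCount over the HDFS input; minPartitions nudges the task count upward.
lines = sc.textFile("hdfs:///data/input.txt", minPartitions=96)
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.saveAsTextFile("hdfs:///data/wordcount-output")
```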