As far as I understand, the data size will vary with the window interval and slide interval, and large intervals such as a week or more (though a monthly interval is not allowed) might affect performance, since the actual data is stored in the RDDs of a DStream.
Do the window and slide intervals affect Spark Streaming application performance? If yes, what are the ways to fine-tune performance and the intervals?
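For context, here is a minimal sketch of the kind of windowed DStream I mean; the socket source and the durations are purely illustrative:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="windowed-counts")
    ssc = StreamingContext(sc, batchDuration=10)   # 10-second batches
    ssc.checkpoint("/tmp/checkpoints")             # required when using an inverse reduce function

    lines = ssc.socketTextStream("localhost", 9999)

    # windowDuration=60s of data, recomputed every slideDuration=20s; a larger
    # window keeps more raw data in the RDDs of the DStream, which is the
    # performance concern in my question.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKeyAndWindow(lambda a, b: a + b,   # add new values
                                         lambda a, b: a - b,   # subtract values leaving the window
                                         windowDuration=60,
                                         slideDuration=20))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()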
I have an application that runs the same job on the same set of columns (not necessarily the same row values) every day. Is there a way I can save the Spark execution plan without having Spark recompute it every time?
My application requires thousands of transformations and there is significant time involved in building the lineage graph and optimization plan.
Is there a way I can save the Spark execution plan without having Spark recompute it every time?
I have never come across such a possibility, so with a large dose of confidence I can say that it's not an option.
What you can do instead is optimize the data that is the input to Spark - optimal partitioning, compression, and a format that supports predicate pushdown are probably the places to look for some time savings.
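A minimal sketch of that input-side idea, assuming a hypothetical raw JSON source and a date column to partition by - rewriting it as partitioned, compressed Parquet gives later runs partition pruning and predicate pushdown:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical raw input; the point is the layout of the rewritten copy.
    raw = spark.read.json("/data/raw/events")

    (raw.write
        .mode("overwrite")
        .option("compression", "snappy")     # compression
        .partitionBy("event_date")           # optimal partitioning
        .parquet("/data/optimized/events"))  # columnar format -> predicate pushdown

    # Subsequent jobs read only the partitions and columns they actually need.
    subset = (spark.read.parquet("/data/optimized/events")
                   .where("event_date = '2022-07-15'")
                   .select("user_id", "amount"))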
I have a Delta table which is partitioned by multiple keys, one of which is a date truncated to the hour (no minute detail), for example - Fri, 15 Jul 2022 07.
Now, with data continuously being ingested via batch and streaming ingestion workflows, what would be the best strategy for evaluating the number of executors needed to read all the data from the Delta table?
A very naive way could be to just let Spark autoscale, but we may still need to play with shuffle partitions etc. Looking for hints or best practices around this. Thanks!
If you want to "read all the data from the Delta table", it does not really matter whether the table is partitioned or not, since the query reads all the data and hence loads the whole table.
This is the worst possible query - the dreaded full scan. If it's inevitable, just know that this is the kind of query where Spark SQL shines brightest, utilising the full power of a Spark cluster. You've been warned :)
Executors are simply processes with CPU cores and memory. You're probably more interested in the number of CPU cores for all the tasks that load the delta table.
I'd start this calculation with the number of files for a given version of the delta table. Files are of different sizes and (I might be wrong here) they are usually chunked (I don't want to use the overloaded term "partitioned" here, but that's what springs to mind) into 512 MB splits.
The number of splits (512 MB blocks) across all the files of a given version of the delta table would be the number of tasks. That would give you the number of CPU cores and hence their "containers", i.e. Spark executors (to evenly saturate the available physical resources for the best performance).
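A rough sketch of that calculation, assuming Delta Lake's DESCRIBE DETAIL output and the 512 MB split size mentioned above (the table path is hypothetical):

    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    SPLIT_BYTES = 512 * 1024 * 1024   # assumed split size, per the reasoning above

    # DESCRIBE DETAIL reports numFiles and sizeInBytes for the current table version.
    detail = spark.sql("DESCRIBE DETAIL delta.`/mnt/delta/events`").first()

    estimated_tasks = math.ceil(detail["sizeInBytes"] / SPLIT_BYTES)
    print(f"{detail['numFiles']} files, roughly {estimated_tasks} tasks; "
          "size the executors to provide about that many CPU cores")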
I am new to Big Data and Spark. I have to work on real-time data and on old data from the past 2 years. There are around a million rows for each day. I am using PySpark and Databricks. The data is partitioned on the created date. I have to perform some transformations and load the result into a database.
For the real-time data, I will be using Spark Structured Streaming (readStream to read, perform the transformations, and then writeStream).
How do I work with the data from the past 2 years? I tried filtering 30 days of data and got good throughput. Should I be running the process on all 2 years of data at once, or should I be doing it in batches? If I do it in batches, does Spark provide a way to batch it, or do I do it in Python? Also, do I run these batches in parallel or in sequence?
It is kind of open-ended, but let me try to address your concerns.
How do I work with the data from the past 2 years? I tried filtering 30 days of data and got good throughput. Should I be running the process on all 2 years of data at once, or should I be doing it in batches?
Since you are new to Spark, do it in batches and start by running 1 day at a time, then 1 week, and so on. Get your program to run successfully and optimize it. As you increase the batch size you can increase your cluster size, provided you use PySpark DataFrames (not pandas). If your job is verified and efficient, you can run monthly, bi-monthly or larger batches (smaller jobs are better in your case).
If I do it in batches, does Spark provide a way to batch it, or do I do it in Python? Also, do I run these batches in parallel or in sequence?
You can pass the date range as parameters to your Databricks job and use Databricks to schedule your jobs to run back to back. Sure, you can run them in parallel on different clusters, but the whole idea with Spark is to use Spark's distributed capability and run your job on as many worker nodes as your job requires. Again, get one small job to work and validate your results, then validate a larger set and so on. If you feel confident, start a large cluster (many and fat workers) and run a large date range.
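A hedged sketch of one such parameterized batch; the table name, partition column and JDBC target below are placeholders rather than anything from the question:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    def run_batch(start_date: str, end_date: str) -> None:
        # The created-date partitioning means this filter prunes down to the batch window.
        df = (spark.read.table("raw.events")
                .where(F.col("created_date").between(start_date, end_date)))

        transformed = df.withColumn("processed_at", F.current_timestamp())  # placeholder transform

        (transformed.write
            .mode("append")
            .format("jdbc")
            .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
            .option("dbtable", "public.events_out")
            .option("user", "etl")
            .option("password", "...")
            .save())

    # Start with a single day, validate, then widen the range run by run.
    run_batch("2021-01-01", "2021-01-01")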
It is not an easy task for a newbie but should be a lot of fun. Best wishes.
I have a processing pipeline that is built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) in order to achieve the functional output. Now, these operations are quite numerous (more than 100), which means I am running around 50 to 60 Spark SQL queries in a single pipeline. While the application completes successfully without any issues, my focus has shifted to optimizing the overall process. I have been able to speed up the execution by tuning spark.sql.shuffle.partitions, changing the executor memory and reducing spark.memory.fraction from the default 0.6 to 0.2. I got great benefits from all these changes, and the overall execution time dropped from 20-25 mins to around 10 mins. The data volume is around 100k rows (source side).
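For reference, a hedged illustration of those knobs (the values here are examples, not the actual settings used in this pipeline):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Runtime-settable; the default of 200 shuffle partitions is far more than ~100k rows need.
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    # spark.memory.fraction and executor memory are static settings, so they go on
    # spark-submit (or the cluster configuration), for example:
    #   spark-submit --executor-memory 6g --conf spark.memory.fraction=0.2 ...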
The observations I have from the cluster are:
- The number of jobs triggered as part of the application ID is 235.
- The total number of stages across all the jobs is around 600.
- 8 executors are used in a two-node cluster (64 GB RAM in total, with 10 cores).
- The YARN resource manager UI (for an application ID) becomes very slow at retrieving the details of jobs/stages.
In one of the Spark tuning videos, I heard that we should try to reduce the number of stages to a bare minimum and keep the DAG small. What are the guidelines for doing this? How do I find the number of shuffles that are happening (my SQLs have many JOIN and GROUP BY clauses)?
I would like suggestions on the above scenario: what can I do to improve performance and handle the data skew in SQL queries that are JOIN/GROUP BY heavy?
Thanks
Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them.
I think your problem can be solved by Spark Streaming Backpressure.
Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.
By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.
From the Apache Spark Kafka configuration:
spark.streaming.backpressure.enabled:
This enables the Spark Streaming to control the receiving rate based
on the current batch scheduling delays and processing times so that
the system receives only as fast as the system can process.
Internally, this dynamically sets the maximum receiving rate of
receivers. This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see below).
And since you want to control the first batch, or to be more specific - the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate:
spark.streaming.backpressure.initialRate:
This is the initial maximum receiving rate at which each receiver will
receive data for the first batch when the backpressure mechanism is
enabled.
This one is good when your Spark job (or rather, your Spark workers overall) is able to process, let's say, 10,000 messages from Kafka, but the Kafka brokers give your job 100,000 messages.
You may also be interested in checking spark.streaming.kafka.maxRatePerPartition, as well as some research and suggestions for these properties on a real example by Jeroen van Wilgenburg on his blog.
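As a minimal sketch of how these properties might be wired into a job (the rates below are illustrative; the right values depend entirely on what your workers can process):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("throttled-kafka-stream")
            .set("spark.streaming.backpressure.enabled", "true")
            # cap the very first batch, which would otherwise drain the whole backlog
            .set("spark.streaming.backpressure.initialRate", "10000")
            # hard per-partition ceiling that backpressure can never exceed
            .set("spark.streaming.kafka.maxRatePerPartition", "2000"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=5)
    # ... create the Kafka direct stream on ssc and start it as usual ...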
Apart from the above answers: the batch size is the product of 3 parameters (see the worked example after this list):
batchDuration: the time interval at which streaming data will be divided into batches (in seconds).
spark.streaming.kafka.maxRatePerPartition: sets the maximum number of messages per partition per second. Combined with batchDuration, this controls the batch size. You want maxRatePerPartition to be set, and large (otherwise you are effectively throttling your job), and batchDuration to be very small.
Number of partitions in the Kafka topic.
For a better explanation of how this product works when backpressure is enabled/disabled, look at how spark.streaming.kafka.maxRatePerPartition is set for createDirectStream.
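A quick worked example of that product, with made-up numbers:

    batch_duration_sec = 5           # batchDuration
    max_rate_per_partition = 2000    # spark.streaming.kafka.maxRatePerPartition (records/second)
    topic_partitions = 10            # number of partitions in the Kafka topic

    max_records_per_batch = batch_duration_sec * max_rate_per_partition * topic_partitions
    print(max_records_per_batch)     # at most 100000 records per batch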
Limiting the max batch size will greatly help to control the processing time; however, it increases the processing latency of messages.
By setting the properties below, we can control the batch size:
spark.streaming.receiver.maxRate=
spark.streaming.kafka.maxRatePerPartition=
You can even dynamically set the batch size based on processing time by enabling backpressure:
spark.streaming.backpressure.enabled:true
spark.streaming.backpressure.initialRate: