How do I process non-real time data in batches in Spark? - apache-spark

I am new to Big Data and Spark. I have to work with real-time data as well as historical data from the past two years. There are around a million rows for each day. I am using PySpark and Databricks, and the data is partitioned on the created date. I have to perform some transformations and load the result into a database.
For the real-time data, I will use Spark Structured Streaming (readStream to read, apply the transformations, then writeStream).
How do I work with the data from the past two years? I tried filtering 30 days of data and got good throughput. Should I run the process on all two years of data at once, or should I do it in batches? If I process it in batches, does Spark provide a way to batch it, or do I do it in Python? Also, should I run these batches in parallel or in sequence?
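For reference, here is a minimal sketch of the streaming path described above; the Delta source/sink, paths, and the single transformation are placeholder assumptions, not the asker's actual code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical landing area, partitioned by created_date (placeholder path and format).
    stream_df = (spark.readStream
                 .format("delta")
                 .load("/mnt/landing/events"))

    # Placeholder transformation; the real business logic goes here.
    transformed = stream_df.withColumn("load_ts", F.current_timestamp())

    query = (transformed.writeStream
             .format("delta")                                          # or a JDBC sink via foreachBatch
             .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder path
             .outputMode("append")
             .start("/mnt/curated/events"))                            # placeholder path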

It is a kind of open-ended question, but let me try to address your concerns.
How do I work with the data from the past two years? I tried filtering 30 days of data and got good throughput. Should I run the process on all two years of data at once, or should I do it in batches?
Since you are new to Spark, do it in batches: start by running one day at a time, then one week, and so on. Get your program to run successfully first, then optimize. As you increase the batch size, you can increase your cluster size; stick to PySpark DataFrames (not pandas) so the work stays distributed. Once your job is verified and efficient, you can run monthly, bi-monthly, or larger batches (smaller jobs are better in your case).
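A single day's batch might look like the following sketch; the format, paths, column name, and JDBC target are placeholder assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read one day's partition; data is partitioned on created_date.
    one_day = (spark.read
               .format("delta")                       # placeholder format
               .load("/mnt/history/events")           # placeholder path
               .where("created_date = '2021-01-01'"))

    transformed = one_day  # apply the same transformations as the streaming path here

    # Hypothetical JDBC target; substitute the real connection details.
    (transformed.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
     .option("dbtable", "events_curated")
     .option("user", "username")
     .option("password", "password")
     .mode("append")
     .save())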
If I process it in batches, does Spark provide a way to batch it, or do I do it in Python? Also, should I run these batches in parallel or in sequence?
You can pass the date range as parameters to your Databricks job and use Databricks to schedule the jobs to run back to back. You can certainly run them in parallel on different clusters, but the whole idea with Spark is to use its distributed capability and run your job on as many worker nodes as the job requires. Again, get one small job to work and validate your results, then validate a larger set, and so on. If you feel confident, start a large cluster (many fat workers) and run a large date range.
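A minimal sketch of parameterizing the date range in a Databricks notebook job using dbutils widgets (the parameter names, format, and path are illustrative; spark and dbutils are predefined in Databricks notebooks):

    # The Databricks job scheduler passes these as notebook parameters.
    dbutils.widgets.text("start_date", "2020-01-01")
    dbutils.widgets.text("end_date", "2020-02-01")

    start_date = dbutils.widgets.get("start_date")
    end_date = dbutils.widgets.get("end_date")

    batch = (spark.read
             .format("delta")                 # placeholder format
             .load("/mnt/history/events")     # placeholder path
             .where(f"created_date >= '{start_date}' AND created_date < '{end_date}'"))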
It is not an easy task for a newbie but should be a lot of fun. Best wishes.

Related

Spark SQL output multiple small files

We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in multiple small files, each 800 KB to 1.5 MB in size. Because of this, the job is split into many tasks and takes a long time to complete.
We have tried Spark tuning options such as broadcast joins, changing the partition size, and changing the max records per file, but there is no performance improvement with these methods and the issue is not fixed. Using coalesce makes the job get stuck at that stage with no progress.
Please see this link for a Spark UI metrics screenshot: https://i.stack.imgur.com/FfyYy.png
The Spark UI confirms your report of too many small files. You get one file for every Spark partition, and you have 33,479 partitions in the final stage where you write the output. 33k partitions was probably the right number of partitions for your join, but not the right number for your write.
You need to add another stage to your job after the join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that produces 32 MB to ~128 MB files).
Something like a coalesce, or a repartition. Maybe even a sort :(
You want to target ~350 partitions.
This diagram shows what you want to do, either manually or automatically (with Spark on Databricks).
If you're using Databricks, then it's easy: with Delta Lake you can turn on Auto Optimize.
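A minimal sketch of adding that extra stage before the write; the ~350 target comes from the answer above, and the paths, join key, and output format are placeholder assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical inputs standing in for the joins described in the question.
    large_df = spark.read.parquet("/data/large_table")   # placeholder path
    other_df = spark.read.parquet("/data/other_table")   # placeholder path

    joined = large_df.join(other_df, "join_key")          # placeholder join key

    # Extra stage after the join: shrink ~33k shuffle partitions to ~350 output files.
    # coalesce(350) avoids another shuffle; repartition(350) rebalances the data evenly.
    (joined
     .repartition(350)
     .write
     .mode("overwrite")
     .parquet("/output/joined"))                          # placeholder path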

HBase batch loading with speed control cause of slow consumer

We need to load a large portion of data from HBase using Spark.
We then put it into Kafka, where it is read by a consumer, but the consumer is too slow.
At the same time, Kafka does not have enough memory to hold the entire scan result.
Our key contains ...yyyy.MM.dd, and we currently load 30 days in one Spark job using a filter operator.
But we can't split the job into many jobs (30 jobs, each filtering one day), because then each job would have to scan all of HBase, which would make the overall scan too slow.
We currently launch the Spark job with 100 threads, but we cannot slow it down by using fewer threads (for example 7), because Kafka is shared with third-party developers, which sometimes leaves it too busy to accept any data. So we need to control the HBase scan speed, continuously checking whether Kafka has room to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS, but the scan produces many small files; it is a problem to group them by size (or is there a way to do this? if you know one, please tell me how), and storing lots of small files in HDFS is bad. Merging such files is a very expensive operation and takes so long that it makes the total time too slow.
Suggested solutions:
Maybe it is possible to have Spark store the scan result in HDFS (by setting some special flag on the filter operator) and then run 30 Spark jobs that select data from the saved result and push each result to Kafka when possible (see the sketch after this list).
Maybe there is some existing mechanism in Spark to stop and resume launched jobs.
Maybe there is some existing mechanism in Spark to split the result into batches (without the ability to stop and resume loading).
Maybe there is some existing mechanism in Spark to split the result into batches (with the ability to stop and resume loading based on an external condition).
Maybe when Kafka throws an exception (because there is no room to store data), there is some backpressure mechanism in Spark that pauses the scan for a while when exceptions appear during execution (but I guess there is a limited number of retries when re-executing an operator; is it possible to make it retry forever, if this is a real solution?). It would be better, though, to keep some free space in Kafka rather than wait until it is overloaded.
Maybe use a PageFilter in HBase (though I guess that is hard to implement), or some other variant? And I guess there would be too many objects in memory to use PageFilter.
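A minimal sketch of the first suggestion above (stage the scan result in HDFS once, then push it to Kafka one day at a time); the placeholder parquet read stands in for the existing HBase scan (e.g. via SHC), and the paths, column names, broker, and topic are assumptions, not a tested solution:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the DataFrame produced by the existing HBase scan.
    scan_df = spark.read.parquet("/staging/hbase_scan_input")   # placeholder

    # 1) Stage the scan once, partitioned by day, with only a few files per day.
    (scan_df
     .repartition("day")                    # assumes a 'day' column derived from the ...yyyy.MM.dd key
     .write
     .mode("overwrite")
     .partitionBy("day")
     .orc("/staging/hbase_scan_by_day"))

    # 2) Later, push one day at a time to Kafka, only when the consumer has caught up.
    day = "2023-01-15"                      # driven by an external scheduler / condition
    (spark.read.orc("/staging/hbase_scan_by_day")
     .where(f"day = '{day}'")
     .selectExpr("CAST(rowkey AS STRING) AS key", "CAST(payload AS STRING) AS value")  # placeholder columns
     .write
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
     .option("topic", "scan-output")                      # placeholder topic
     .save())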
P.S
This https://github.com/hortonworks-spark/shc/issues/108 will not help; we already use a filter.
Any ideas would be helpful

Apache Spark: is it possible to get dataset counts in a spark job?

Sometimes a Spark job running in our cluster runs too long, not because of bad optimization but because of bad logic in the algorithm. In most cases this is a consequence of unnecessary joins that produce too many rows. Normally we spot such jobs by looking at the Spark execution plan, where we can find these joins by looking at "number of output rows: xxx" in the blue stage labels.
I want to understand: is it possible to streamline this procedure and somehow automatically notify the programmer that the job produced too many rows in some dataset (after execution)?
Maybe we can print this in the logs (without manually counting the dataset's size in code)?
Maybe after running the job we can somehow get the output of the execution plan and save it for further investigation?
No, it's not an option. Spark will do its best to optimize the query plan, so manual interaction with the lower execution level is pretty limited. However, you can "control" the rows for each job/task by changing some configurations (like spark.sql.shuffle.partitions or spark.sql.files.maxPartitionBytes), or by repartitioning the data, which causes it to be shuffled and redistributed nearly equally between executors.
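A minimal sketch of the configuration-level knobs mentioned above; the values, path, and column name are illustrative only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Influence how many rows land in each task by tuning partitioning, not the plan itself.
    spark.conf.set("spark.sql.shuffle.partitions", "400")                    # illustrative value
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

    df = spark.read.parquet("/data/events")        # placeholder path
    balanced = df.repartition(400, "join_key")     # placeholder column; redistributes rows evenly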

High shuffle write and high execution time

I'm using EMR and I'm developing a collaborative-filtering approach with ALS. I have three doubts:
To see the execution time in the Spark UI, I ran several experiments. I noticed that with one master and 4 workers the execution time is lower than with an EMR cluster of one master and six workers. Does anyone know why?
The other thing is shuffle write. With one master and six workers, I have 3.2 GB. That is too high, isn't it? In the code I use RDDs, groupByKey, and two joins. How can I minimize it?
With one master and six workers, the execution time is 7.5 minutes. Considering I'm using the MovieLens dataset with a machine-learning approach, I can't tell whether this execution time is too high or reasonably good.
I attach a picture with the results from the Spark UI.
Thank you in advance.
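Regarding the shuffle-write point in this question: a common way to reduce shuffle volume in RDD code that uses groupByKey is to aggregate map-side with reduceByKey (or aggregateByKey) instead. A minimal sketch with hypothetical keys and values:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Hypothetical (key, rating) pairs standing in for the real RDD.
    ratings = sc.parallelize([(1, 4.0), (1, 5.0), (2, 3.0)])

    # groupByKey ships every value across the network before aggregating.
    sums_grouped = ratings.groupByKey().mapValues(sum)

    # reduceByKey combines values map-side first, so the shuffle write is much smaller.
    sums_reduced = ratings.reduceByKey(lambda a, b: a + b)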

Processing Pipeline using Spark SQL- jobs, stages and DAG sizes

I have a processing pipeline that is built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) to produce the functional output. These operations are quite numerous (more than 100), which means I am running around 50 to 60 Spark SQL queries in a single pipeline. While the application completes successfully without any issues, my focus has shifted to optimizing the overall process. I have been able to speed up the execution using spark.sql.shuffle.partitions, changing the executor memory, and reducing spark.memory.fraction from the default 0.6 to 0.2. I got great benefits from all these changes, and the overall execution time dropped from 20-25 minutes to around 10 minutes. The data volume is around 100k rows (source side).
The observations that I have from the cluster are:
- The number of jobs triggered as part of the application id is 235.
- The total number of stages across all the jobs is around 600.
- 8 executors are used on a two-node cluster (64 GB RAM in total, with 10 cores).
- The YARN Resource Manager UI (for an application id) becomes very slow when retrieving the details of jobs/stages.
In one of the videos on Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum and keep the DAG size small. What are the guidelines for doing this? How do I find the number of shuffles that are happening (my SQL queries have many joins and GROUP BY clauses)?
I would like suggestions on the above scenario: what can I do to improve performance and handle data skew in the SQL queries that are JOIN/GROUP BY heavy?
Thanks
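For reference on this question's tuning and "how many shuffles" points, a minimal sketch of the configuration the asker describes plus explain(), whose printed plan shows one Exchange node per shuffle; the app name, table, column, and values are illustrative assumptions:

    from pyspark.sql import SparkSession

    # Illustrative values only; tune for the actual cluster and data volume.
    spark = (SparkSession.builder
             .appName("pipeline-tuning-sketch")
             .config("spark.sql.shuffle.partitions", "50")   # ~100k rows rarely needs the default 200
             .config("spark.memory.fraction", "0.2")          # value mentioned in the question
             .getOrCreate())

    df = spark.table("source_db.source_table")                # hypothetical Hive table
    result = df.groupBy("key_col").count()                    # hypothetical query

    # Every "Exchange" node in the printed plan is a shuffle, so counting the
    # Exchange operators shows how many shuffles the query triggers.
    result.explain()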
