I am trying to run an Apache Spark SQL job (1.6) in local mode on a 3-node cluster, and I am facing the issues below in production.
Execution time for the duplication layer is increasing day by day after the incremental load at the DL layer.
Nearly 150K records are inserted into each table every day.
We have tried the default as well as the MEMORY_AND_DISK persistence level, but performance is the same in both cases.
If we run the large tables first, the execution time of the other tables is affected.
The Spark job is invoked in the standard way, from a shell script using spark-submit. The SQL query from my Spark job is below.
val result = sqlcontext.sql(
  "CREATE TABLE " + DB + "." + table_name +
  " row format delimited fields terminated by '^'" +
  " STORED as ORC tblproperties(\"orc.compress\"=\"SNAPPY\",\"orc.stripe.size\"='67108864')" +
  " AS select distinct a.* from " + fdl_db + "." + table_name + " a," +
  "(SELECT SRL_NO,MAX(" + INC_COL + ") as incremental_col FROM " + fdl_db + "." + table_name + " group by SRL_NO) b" +
  " where a.SRL_NO=b.SRL_NO and a." + INC_COL + "=b.incremental_col"
).repartition(100);
Please let me know if you need any more info.
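
To make the logic of the query above easier to follow, here is a minimal plain-Python sketch of what it computes: for each SRL_NO, keep the distinct rows whose incremental column equals that key's maximum. The function name `latest_rows` and the sample columns `load_ts` and `val` are hypothetical; this only illustrates the semantics and is not how Spark executes the statement.

```python
from typing import Dict, Hashable, List


def latest_rows(rows: List[Dict[str, Hashable]], key_col: str, inc_col: str) -> List[Dict[str, Hashable]]:
    """Keep, for each key, the distinct rows whose inc_col equals that key's maximum."""
    # Inner query: SELECT key, MAX(inc_col) ... GROUP BY key
    max_per_key = {}
    for row in rows:
        key = row[key_col]
        if key not in max_per_key or row[inc_col] > max_per_key[key]:
            max_per_key[key] = row[inc_col]

    # Join back on (key, inc_col), then apply DISTINCT over whole rows.
    seen = set()
    result = []
    for row in rows:
        if row[inc_col] == max_per_key[row[key_col]]:
            fingerprint = tuple(sorted(row.items()))
            if fingerprint not in seen:
                seen.add(fingerprint)
                result.append(row)
    return result


# Hypothetical sample data.
rows = [
    {"SRL_NO": 1, "load_ts": 1, "val": "old"},
    {"SRL_NO": 1, "load_ts": 2, "val": "new"},
    {"SRL_NO": 1, "load_ts": 2, "val": "new"},  # exact duplicate, dropped by DISTINCT
    {"SRL_NO": 2, "load_ts": 1, "val": "only"},
]
print(latest_rows(rows, "SRL_NO", "load_ts"))
```

As in the query, both passes read every row of the source table. The statement has no filter on the incremental column, so the amount of data scanned grows with the total accumulated rows rather than with the daily increment.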


What is Spark doing after insertInto?

I have a super-simple pyspark script:
Run a query (Hive) and create a dataframe A
Perform aggregates on A which creates dataframe B
Print the number of rows on the aggregates with B.count()
Save the results of B in a Hive table using B.insertInto()
But I noticed something: in the Spark web UI the insertInto is listed as completed, but the client program (notebook) still shows the insert as running. If I run a count directly against the Hive table with a Hive client (no Spark), the row count is not the same as the one printed by B.count(). If I run the query again, the number of rows has increased (but still does not match B.count()); after some minutes, the row count from the Hive query matches B.count().
The question is: if the insertInto() job is already completed (according to the Spark web UI), what is it doing? Given the row-count increase, it seems that it is still running the insertInto, but that does not match the Spark web UI. My guess is that something like a Hive table partition metadata update is running, or something similar.
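
For reference, the four steps listed above can be sketched roughly as follows. This is an assumption-laden outline, not the actual script: the function name, the `group_col` parameter, and the choice of a simple `groupBy(...).count()` as "the aggregates" are hypothetical, and it uses `df.write.insertInto(...)`, the DataFrameWriter form of the `insertInto()` call mentioned in the post. `spark` is expected to be an existing SparkSession or HiveContext.

```python
def aggregate_and_insert(spark, source_query, group_col, target_table):
    """Run the four steps against an existing SparkSession or HiveContext."""
    df_a = spark.sql(source_query)            # 1. run the Hive query -> dataframe A
    df_b = df_a.groupBy(group_col).count()    # 2. aggregate A -> dataframe B
    n_rows = df_b.count()                     # 3. first action: B is computed and counted
    print("rows in B:", n_rows)
    df_b.write.insertInto(target_table)       # 4. second action: B is computed again and written
    return n_rows
```

Note that steps 3 and 4 are separate actions, so unless B is cached it is evaluated twice, once for the count and once for the insert.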

Azure Databricks: Switching from batch to streaming mode

Hello Spark community,
Currently we have a batch pipeline in Azure Databricks that reads from a Delta table. The job runs once per day, every night. We extract its data, save it in our own location as a Delta table again, and then write to an Azure SQL Database table. Everything is partitioned on date like this: Date=2021-01-01, etc.
Things are about to change, since our source Delta table will soon be refreshed every 2-3 minutes, and the requirement is to change the ETL from a nightly batch to streaming mode while still using the same source and target tables as well as the same SQL database table.
Right now this streaming challenge raises several questions:
Our Delta source table is quite huge (30B+ rows). So far, in our nightly batch we extracted only the changed date keys and used MERGE INTO to write the updates/inserts. Once we switch to streaming mode, is it possible to tell Spark to start streaming from a specific point in time, since we do not want to load the whole table again in streaming mode?
How do you write a stream to a target Delta table with MERGE INTO when a huge number of keys has to be checked on both sides? I suppose we can make use of foreachBatch and do that per micro-batch, but how can each new micro-batch check for existing/non-existing keys in our target Delta table in a time-efficient manner (30B+ rows in the current target table)?
Is it possible for a streaming Spark job to write directly to a SQL database (Azure), and would this become a bottleneck and hence not be advisable?
I am really looking forward to some good advice and design decisions, since this would be quite a big issue with the current data size. I appreciate every opinion here!
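
As a starting point for the first two questions, here is a small hedged sketch. It assumes the Delta streaming source's `startingTimestamp` option (which applies when a new streaming query is started) and a MERGE condition that names the date partitions present in the micro-batch, so that the target scan can be limited to those partitions. The key column `id`, the helper names, and the sample dates are hypothetical; the helpers only build the option map and the condition string, and the commented lines show where they might be used.

```python
from typing import Dict, Iterable, List


def delta_stream_options(start_timestamp: str) -> Dict[str, str]:
    """Options for the Delta streaming source so that older commits are skipped."""
    return {"startingTimestamp": start_timestamp}


def merge_condition(key_cols: List[str], partition_col: str, partition_values: Iterable[str]) -> str:
    """ON clause that pins the target to the partitions present in the micro-batch."""
    in_list = ", ".join("'{}'".format(v) for v in sorted(set(partition_values)))
    keys = " AND ".join("t.{0} = s.{0}".format(c) for c in key_cols)
    return "t.{} IN ({}) AND {}".format(partition_col, in_list, keys)


# Intended use inside a foreachBatch handler (not executed here):
#   dates = [r["Date"] for r in batch_df.select("Date").distinct().collect()]
#   DeltaTable.forPath(spark, target_path).alias("t") \
#       .merge(batch_df.alias("s"), merge_condition(["id"], "Date", dates)) \
#       .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
print(delta_stream_options("2021-01-01"))
print(merge_condition(["id"], "Date", ["2021-01-02", "2021-01-01", "2021-01-02"]))
```

Whether this is fast enough at 30B+ rows depends on how many distinct dates each micro-batch touches, which would need to be measured on the real tables.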

In a Spark job, a few tasks take very long (stuck) when using Spark SQL GROUP BY

I have a Spark job (on Azure HDInsight) which runs a SQL query (a standard GROUP BY) and saves the result in Parquet and CSV format.
This job is taking really long. When I check the Spark UI, I see that two tasks are stuck.
I thought this was because of data skew. However, when I check these executors, they have similar sizes of input data, shuffle read and shuffle write, so it doesn't seem to be a data skew issue. What other reasons could cause some tasks to take very long?
Just to give some idea of what I am running, here is a sample query:
SELECT year_month, id, feature,
min(prev_ts) AS timedelta_min,
max(prev_ts) AS timedelta_max,
stddev_pop(prev_ts) AS timedelta_sd,
AVG(prev_ts) AS timedelta_mean,
percentile_approx(prev_ts,0.5) AS timedelta_median,
percentile_approx(prev_ts,0.25) AS timedelta_1st_quartile,
percentile_approx(prev_ts,0.75) AS timedelta_3rd_quartile
FROM table_a
GROUP BY id, feature, year_month
Here prev_ts is a column created using LAG on a timestamp column.
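
To show what the query computes for a single (id, feature, year_month) group, here is a small plain-Python sketch. It assumes that prev_ts holds the difference between each timestamp and the previous one obtained with LAG (the `timedelta_*` aliases suggest this, but the post does not say so explicitly). The function names and the sample timestamps are hypothetical, and the first row of a group, where LAG has no previous value, is simply omitted.

```python
import statistics
from typing import Dict, List


def lag_deltas(timestamps: List[float]) -> List[float]:
    """Difference between each timestamp and the previous one (LAG), after sorting."""
    ordered = sorted(timestamps)
    return [cur - prev for prev, cur in zip(ordered, ordered[1:])]


def summarize(deltas: List[float]) -> Dict[str, float]:
    """Exact counterparts of some of the query's aggregates for one group."""
    return {
        "timedelta_min": min(deltas),
        "timedelta_max": max(deltas),
        "timedelta_sd": statistics.pstdev(deltas),      # population std dev, like stddev_pop
        "timedelta_mean": statistics.mean(deltas),
        "timedelta_median": statistics.median(deltas),  # exact; percentile_approx is approximate
    }


deltas = lag_deltas([10, 13, 19, 20])
print(deltas)
print(summarize(deltas))
```

Unlike min, max, mean and standard deviation, the percentile columns cannot be computed from a few running totals; they need a summary of the values in each group, so they may be worth checking separately when looking for slow tasks.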

Increasing the number of partitions in Spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables using this configuration, with a certain date column as the partition.
Now I want to use Spark SQL to benchmark the same queries on the same data.
I executed the following query:
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatkye group by col1,col2,col3")
In Hive this query takes 90 seconds, but Spark takes 21 minutes for the same query. Looking at the job, I found that Spark creates 2 stages, and the first stage has only 7 tasks, one for each of the 7 blocks of data within the given partition in the ORC file. The blocks are of different sizes (one is 5 MB while another is 45 MB), and because of this the stragglers take more time, which makes the whole job slow.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thereby the number of tasks in stage 1, even though there are only 7 physical blocks for the given range of the query?
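
The imbalance described above can be put into numbers with a small sketch. Only the 5 MB and 45 MB block sizes come from the question; the other five sizes, the function names, and the target of 28 partitions are hypothetical values chosen for illustration.

```python
from typing import List


def straggler_ratio(partition_mb: List[float]) -> float:
    """Largest partition divided by the mean; 1.0 means perfectly even."""
    return max(partition_mb) / (sum(partition_mb) / len(partition_mb))


def even_split_mb(partition_mb: List[float], n_partitions: int) -> float:
    """Size of each partition if the same data were spread evenly over n_partitions."""
    return sum(partition_mb) / n_partitions


# Hypothetical sizes: only the 5 MB and 45 MB blocks are given in the question.
blocks_mb = [5, 45, 20, 20, 20, 20, 10]
print(straggler_ratio(blocks_mb))     # 2.25
print(even_split_mb(blocks_mb, 28))   # 5.0
```

With one task per block, the first stage cannot finish before the task reading the largest block does. Note that calling `repartition` on the DataFrame redistributes rows only after that initial read, so it would even out later stages rather than the 7 read tasks themselves.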

Baseline for measuring Apache Spark jobs execution times

I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.
I use Spark to compute dynamic reports from data stored in a NoSQL database (Cassandra). So far I have created several reports, and they are computed correctly. Inside them I use the DataFrame operations .unionAll(), .join(), .count(), .map(), etc.
I am running a 1.4.1 Spark cluster on my local machine with the following setup:
I have also populated the database with test data which is around 10-12k records per table.
By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.
I have configured Spark to use the KryoSerializer, set the compression codec to lzf, and optimized the jobs as much as my knowledge allows.
This led to the jobs taking 20s-30s to compute (which I think is a good improvement). The problem is that, because this is my first Spark project, I have no baseline to compare the job times against, so I have no idea whether the execution is slow or fast, or whether there is some problem in the code or in the Spark config.
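
For concreteness, the two settings mentioned above might look like this when passed to spark-submit. `spark.serializer` and `spark.io.compression.codec` are standard Spark properties, but that the "lzf" in the post refers to `spark.io.compression.codec` is an assumption, and the helper `to_submit_args` is a hypothetical name.

```python
from typing import Dict, List

# Assumed settings: that "lzf" refers to spark.io.compression.codec is a guess.
TUNING_CONF: Dict[str, str] = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.io.compression.codec": "lzf",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as repeated --conf key=value arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


print(" ".join(to_submit_args(TUNING_CONF)))
```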
What is the best way to proceed? Is there a graph or benchmark that shows how much time an action with N data should take?
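
In the absence of a published benchmark, one option is to build a baseline from your own repeated measurements. The sketch below is a minimal timing harness in plain Python; the name `time_runs`, the default run counts, and the stand-in workload are hypothetical, and in practice `job` would be a function that triggers a Spark action.

```python
import statistics
import time
from typing import Callable, Dict


def time_runs(job: Callable[[], object], runs: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Run job several times and summarize wall-clock seconds."""
    for _ in range(warmup):
        job()  # discard warm-up runs (JIT, caches, connection setup)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        samples.append(time.perf_counter() - start)
    return {"min": min(samples), "median": statistics.median(samples), "max": max(samples)}


# Stand-in workload; in practice `job` would trigger a Spark action such as df.count().
print(time_runs(lambda: sum(range(100000))))
```

Comparing the median before and after a configuration change, on the same data, gives a relative baseline even when no absolute reference exists.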
You have to use Hive. You can put Spark on top of Hive. After doing this, create a temp table in Hive for the Cassandra table; you can then perform all types of aggregation and filtering. After that, you can use a Hive JDBC connection to get the result. It will give fast results.
