Pyspark job re-triggered itself - apache-spark

I submitted my pyspark job from a shell script.
The pyspark job inserts data from Hive tables into SQL Server tables.
It was running fine for 2 hours and was on step 3 of 5 steps in total.
But after 2 hours it started again from the first step. How is that possible for a pyspark job? It looks like the pyspark job restarted itself.
What could have caused this issue?
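For context, a minimal sketch of the kind of Hive-to-SQL Server insert the job performs (the JDBC URL, table names, and credentials below are placeholders, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# read the source Hive table (placeholder name)
hive_df = spark.table('source_db.source_table')

# append it into SQL Server over JDBC (all connection details are placeholders)
(hive_df.write
    .format('jdbc')
    .option('url', 'jdbc:sqlserver://dbhost:1433;databaseName=target_db')
    .option('dbtable', 'dbo.target_table')
    .option('user', 'etl_user')
    .option('password', 'etl_password')
    .mode('append')
    .save())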

Related

What is Spark doing after insertInto?

I have a super-simple pyspark script (sketched below):
Run a query (Hive) and create a dataframe A.
Perform aggregates on A, which creates a dataframe B.
Print the number of rows of the aggregates with B.count().
Save the results of B in a Hive table using B.insertInto().
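A sketch of those steps (the query, the aggregation, and the table names are placeholders, not from the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. run a Hive query and create dataframe A (placeholder query)
A = spark.sql('SELECT * FROM some_db.some_table')

# 2. aggregate A into dataframe B (placeholder aggregation)
B = A.groupBy('some_key').agg(F.count('*').alias('cnt'))

# 3. print the number of aggregated rows
print(B.count())

# 4. save B into an existing Hive table (placeholder target)
B.write.insertInto('some_db.some_agg_table')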
But I noticed something: in the Spark web UI the insertInto is listed as completed, but the client program (notebook) still marks the insert as running. If I run a count directly against the Hive table with a Hive client (no Spark), the row count does not match what B.count() printed. If I run the query again, the number of rows increases (but still does not match B.count()). After some minutes, the Hive row-count query matches B.count(). The question is: if the insertInto() job is already completed (according to the Spark web UI), what is it doing? Given the row-count increase, it seems as if insertInto is still running, but that does not match the Spark web UI. My guess is that something like a Hive table partition metadata update is running, or something similar.

Databricks Streaming scheduled job fails

I have created a scheduled job in Databricks to execute a notebook at regular intervals.
Within the notebook, there are many commands separated into cells. A Spark streaming query is one of the commands in a cell.
The scheduled job fails because the streaming query takes some time to complete. The problem is that the next command starts executing before the streaming query has completed, so the job fails.
How can I make the next command depend on the streaming query? I want it to run only after the streaming query completes.
I am using the DataFrame API with Pyspark. Thanks.
You need to wait for the query to finish. That is usually done with the .awaitTermination function (doc), like this:
query = df.writeStream.....
query.awaitTermination()
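A fuller sketch of the same idea (the streaming source, sink, and paths here are placeholders, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder streaming source; replace with the real one
df = spark.readStream.format('rate').load()

query = (df.writeStream
    .format('parquet')
    .option('path', '/tmp/stream_output')         # placeholder sink path
    .option('checkpointLocation', '/tmp/ckpt')    # checkpointing is required for file sinks
    .trigger(once=True)                           # process available data, then stop
    .start())

query.awaitTermination()   # blocks this cell until the query stops
# the next command goes here and runs only after the stream has finished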

How to understand Spark execution time without access to the Spark history UI?

I have written a pyspark job, and it is running longer than expected. I want to analyze the job execution and fix the part of the code that is causing the slowness. Due to an access issue with the Spark history UI, I cannot analyze the job plan. Hence I have to use some tricks around the code to understand in which section Spark is spending more time.
I have tried running count on the data-frame, but that does not help much in understanding the slowness.
Below are the steps I am doing in my code:
step-1: read from a Cassandra table:
cassandra_data = spark_session.read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table=table, keyspace=keyspace) \
    .load()
return cassandra_data
step-2: add a column to the data-frame read from Cassandra containing the md5 of the entire row.
data_wth_hash = prepare_data_md5(cassandra_data)
data_wth_hash.cache()
data_wth_hash.count()
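prepare_data_md5 is not shown in the question; a minimal sketch of what it might look like, assuming the md5 is computed over the concatenated string form of all columns:

from pyspark.sql import functions as F

def prepare_data_md5(df):
    # hash the whole row: cast every column to string, concatenate, then md5;
    # the separator and null handling are assumptions
    return df.withColumn(
        'row_md5',
        F.md5(F.concat_ws('||', *[F.col(c).cast('string') for c in df.columns]))
    )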
step-3: write into an AWS S3 folder.
The job takes much more time while writing into S3, and I do not have access to the Spark history UI to understand where it is spending that time.
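Without the history UI, one rough trick (not from the question, offered as an assumption) is to time each step from the driver and force an action in between, so the elapsed time can be attributed to a single step:

import time

t0 = time.time()
cassandra_data = read_from_cassandra(spark_session, keyspace, table)   # hypothetical wrapper around step 1
cassandra_data.cache()
cassandra_data.count()            # forces the Cassandra read
print('read from cassandra: %.1fs' % (time.time() - t0))

t0 = time.time()
data_wth_hash = prepare_data_md5(cassandra_data)
data_wth_hash.cache()
data_wth_hash.count()             # forces the md5 computation
print('md5 column: %.1fs' % (time.time() - t0))

t0 = time.time()
data_wth_hash.write.parquet('s3a://some-bucket/some-prefix/')          # placeholder path and format
print('write to s3: %.1fs' % (time.time() - t0))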

Spark performance degrades with incremental loads in local mode

I am trying to run an Apache Spark SQL job (1.6) in local mode over a 3-node cluster, and I face the below issues in production.
The execution time for the duplication layer is increasing day by day after the incremental load at the DL layer.
Nearly 150K records are being inserted into each table every day.
We have tried both the default and the MEMORY_AND_DISK persist mechanisms, but it behaves the same in both cases.
Execution time impacts the other tables if we run the large tables first.
The Spark job is invoked in the standard way from a shell script using spark-submit, and the SQL query from my Spark job is as below.
val result = sqlcontext.sql(
  "CREATE TABLE " + DB + "." + table_name +
  " row format delimited fields terminated by '^'" +
  " STORED as ORC tblproperties(\"orc.compress\"=\"SNAPPY\",\"orc.stripe.size\"='67108864')" +
  " AS select distinct a.* from " + fdl_db + "." + table_name + " a," +
  " (SELECT SRL_NO, MAX(" + INC_COL + ") as incremental_col FROM " + fdl_db + "." + table_name + " group by SRL_NO) b" +
  " where a.SRL_NO = b.SRL_NO and a." + INC_COL + " = b.incremental_col"
).repartition(100)
Please let me know if you need any more info.

How to write functional tests with Spark

I have a Spark batch job that talks to Cassandra. After the batch job completes, I need to verify a few entries in Cassandra, and the cycle continues 2-3 times. How do I know when the batch job ends? I don't want to track the status of the batch job by adding an entry in the db.
How do I write a functional test in Spark?
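For illustration only (an assumed pattern, not from the question): one common approach is to call the job's entry point from the test with a local-mode SparkSession; the call returns only when the batch job has finished, so the Cassandra checks can run immediately afterwards:

from pyspark.sql import SparkSession

def run_batch_job(spark):
    # hypothetical entry point of the batch job under test
    pass

def test_batch_job_writes_expected_rows():
    spark = (SparkSession.builder
        .master('local[2]')               # local mode, no cluster needed
        .appName('functional-test')
        .getOrCreate())
    try:
        run_batch_job(spark)              # returns only after the batch job has finished
        # verify a few entries by reading them back (keyspace/table are placeholders)
        rows = (spark.read
            .format('org.apache.spark.sql.cassandra')
            .options(table='some_table', keyspace='some_keyspace')
            .load())
        assert rows.count() > 0
    finally:
        spark.stop()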
