what is spark doing after insertInto? - apache-spark

I have a super-simple PySpark script that:
Runs a Hive query and creates dataframe A
Performs aggregates on A, which creates dataframe B
Prints the number of rows in the aggregates with B.count()
Saves the results of B to a Hive table using B.insertInto()
But I noticed something: in the Spark web UI the insertInto job is listed as completed, yet the client program (a notebook) still shows the insert as running. If I run a count directly against the Hive table with a Hive client (no Spark), the row count does not match what B.count() printed. If I run the query again, the number of rows increases (but still does not match B.count()). After some minutes, the Hive row-count query matches B.count().
The question is: if the insertInto() job is already completed (according to the Spark web UI), what is it doing? Given the row-count increase, it seems as if the insertInto is still running, but that does not match the Spark web UI. My guess is that something like a Hive table partition metadata update is running, or something similar.

Related

What is the Spark behavior when changing the source before sink completes

We have 2 different pipelines, A and B.
A loads data into a Hive table. When loading, it writes to a new S3 location and then ALTERs the Hive table to point to this new location. This happens every 15 minutes.
B exports the Hive table to MSSQL through Spark and JDBC. When exporting, it writes to a temp table through Spark first, and then does a rename from a SQL Server stored procedure.
We expect the table in SQL Server and the Hive table to be exactly the same, but occasionally we find the SQL Server table has around 1.7 times the Hive table's row count. What is the possible reason?
We also found that while sinking to SQL Server (which needs around 40 minutes), pipeline A runs twice; that is, before the sink completes, the Hive table it reads from has been changed to a new location, yet the Spark job completes successfully. What is the Spark behavior when the source changes before the sink completes?
Thanks

In Pyspark Structured Streaming, how can I discard already generated output before writing to Kafka?

I am trying to do Structured Streaming (Spark 2.4.0) on a Kafka source, where I read the latest data and perform aggregations on a 10-minute window. I am using "update" mode while writing the data.
For example, the data schema is as below:
tx_id, cust_id, product, timestamp
My aim is to find customers who have bought 3 or more products in the last 10 minutes. Let's say prod is the dataframe read from Kafka; then windowed_df is defined as:
windowed_df_1 = prod.groupBy(window("timestamp", "10 minutes"), cust_id).count()
windowed_df = windowed_df_1.filter(col("count")>=3)
Then I am joining this with a master dataframe from hive table "customer_master" to get cust_name:
final_df = windowed_df.join(customer_master, "cust_id")
And finally, write this dataframe to Kafka sink (or console for simplicity)
query = final_df.writeStream.outputMode("update").format("console").option("truncate",False).trigger(processingTime='2 minutes').start()
query.awaitTermination()
Now, since this code runs every 2 minutes, I want the subsequent runs to discard all customers who were already part of my output. I don't want them in my output even if they buy a product again.
Can I write the stream output temporarily somewhere (may be a hive table) and do an "anti-join" for each execution ?
This way I can also have a history maintained in a hive table.
I also read somewhere that we can write the output to a memory sink and then use df.write to save it to HDFS/Hive. But what if we terminate the job and re-run it? The in-memory table will be lost in that case, I suppose.
Please help as I am new to Structured Streaming.
Update:
I also tried the below code to write the output to a Hive table as well as the console (or Kafka sink):
def write_to_hive(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    pass
final_df.writeStream.outputMode("update").format("console").option("truncate", False).start()
final_df.writeStream.outputMode("update").foreachBatch(write_to_hive).start()
But this only performs the first action, i.e. the write to the console.
If I put the "foreachBatch" query first, it saves to the Hive table but does not print to the console.
I want to write to 2 different sinks. Please help.

SPARK Performance degrades with incremental loads in local mode

I am trying to run an Apache Spark SQL job (1.6) in local mode over a 3-node cluster, and I face the below issues in production.
Execution time for the duplication layer is increasing day by day after the incremental load at the DL layer.
Nearly 150K records are being inserted into each table every day.
We have tried with the default as well as the "MEMORY_AND_DISK" persist mechanism, but it works the same in both cases.
Execution time impacts the other tables if we run large tables first.
The Spark job is invoked in a standard format and executed via a shell script using spark-submit, and the SQL query from my Spark job is as below.
val result = sqlcontext.sql(
  "CREATE TABLE " + DB + "." + table_name +
  " row format delimited fields terminated by '^' STORED as ORC" +
  " tblproperties(\"orc.compress\"=\"SNAPPY\",\"orc.stripe.size\"='67108864')" +
  " AS select distinct a.* from " + fdl_db + "." + table_name + " a," +
  " (SELECT SRL_NO, MAX(" + INC_COL + ") as incremental_col" +
  "  FROM " + fdl_db + "." + table_name + " group by SRL_NO) b" +
  " where a.SRL_NO=b.SRL_NO and a." + INC_COL + "=b.incremental_col"
).repartition(100)
please let me know if you need any more info.

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark dataframes by adding Ignite on top of them. The following code is how we currently read a dataframe:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark dataframe from Ignite using the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
An SQL query (like select a, b, c from table where x) on the Ignite dataframe works, but the performance is much slower than Spark alone (i.e. querying the Spark DF directly, without Ignite): a query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100 MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" clause but a smaller result is processed faster. Overall, the Ignite dataframe support seems to be a simple wrapper on top of Spark, and hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark". So I couldn't change any cache configuration in XML (because I need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, your test doesn't seem fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the Ignite DF case you don't use caching, so the data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will degrade significantly once some of the data has to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
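Following the linked documentation, the index for this example would be created with a statement along these lines (the index name is arbitrary):

```sql
-- lets queries filtering on x use the index instead of a full scan
CREATE INDEX person_x_idx ON person (x);
```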

In Spark job, few tasks are taking very long (stuck) when using Spark SQL group by

I have a Spark job (on Azure HDInsight) which runs a SQL query (a standard group-by) and saves the result in Parquet & CSV format.
This job takes really long. When I check the Spark UI, I see two tasks that are stuck.
I thought this was because of data skew. However, when I check these executors,
they have similar sizes of input data, shuffle read, and shuffle write. So it doesn't seem to be a data skew issue. What other reasons could cause some tasks to take very long?
Just to give some idea about what I am running, sample query:
SELECT year_month, id, feature,
min(prev_ts) AS timedelta_min,
max(prev_ts) AS timedelta_max,
stddev_pop(prev_ts) AS timedelta_sd,
AVG(prev_ts) AS timedelta_mean,
percentile_approx(prev_ts,0.5) AS timedelta_median,
percentile_approx(prev_ts,0.25) AS timedelta_1st_quartile,
percentile_approx(prev_ts,0.75) AS timedelta_3rd_quartile
FROM table_a
GROUP BY id, feature, year_month
Here prev_ts is a column created using LAG on the timestamp column.