Last few tasks in Spark job are running very slowly - apache-spark

I am running the Scala Spark code below on Databricks. The last few tasks of the job run very slowly, while the first few tasks complete within seconds. I tried repartitioning the DataFrame, but that didn't help. I also don't understand on what basis the tasks are distributed across the worker nodes.
// Read the Delta files from DBFS and create temp views on top of them.
val cdf_hsc_interim_nobed_main = spark.read.format("delta").option("header", "true").load(s"${Interimpath}/cdf_hsc_interim_nobed")
cdf_hsc_interim_nobed_main.createOrReplaceTempView("cdf_hsc_interim_nobed")
spark.sql(s"REFRESH TABLE cdf_hsc_interim_nobed")

val cdf_hsc_facl_decn_bed_interim_main = spark.read.format("delta").option("header", "true").load(s"${Interimpath}/cdf_hsc_facl_decn_bed_interim")
cdf_hsc_facl_decn_bed_interim_main.createOrReplaceTempView("cdf_hsc_facl_decn_bed_interim")
spark.sql(s"REFRESH TABLE cdf_hsc_facl_decn_bed_interim")
spark.sql(
  s"""
    select
      nb.*,
      facl.id as facl_decn_id,
      facl.seq_nbr as facl_decn_seq_nbr,
      case when facl.id is not null then concat(substr(nb.cse_dttm, 1, 10), ' 00:00:00.000') else cast(null as string) end as eng_dt
    from interim_nobed nb
    left outer join decn_bed_interim facl on
      (nb.id = facl.hsc_id and nb.facl_decn_id = facl.hsc_id)
    where nb.facl_id is not null
    union all
    select
      nb.*,
      cast(null as int) as facl_bed_id,
      cast(null as int) as facl_bed_seq_nbr,
      cast(null as string) as engg_dt
    from interim_nobed nb
    where nb.facl_id is null
  """).write.mode("overwrite").option("header", "true").parquet(s"${Interimpath}/set1_interimdelete")
UPDATE:
My Spark version is 3.1.2.
I have two tables; their details are below.
First table
Second table
I also checked the Spark UI: more than 50% of the total output records come from the last task alone.

Adaptive Query Execution (AQE) is the best practice for speeding up this query.
In Spark 3.0, the AQE framework ships with three features:
coalescing shuffle partitions
switching join strategies
optimizing skew joins
All of the above optimizations are applied dynamically at runtime; a configuration sketch follows the link below.
Check the link below for implementation details.
https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
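On Spark 3.1.2 you can turn AQE and its skew-join handling on through configuration before running the query. The snippet below is a minimal sketch; the configuration keys are standard Spark settings, but the tuning values are illustrative assumptions, not values taken from the question.

// Minimal sketch: enable AQE and skew-join handling (Spark 3.0+).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// Optional knobs controlling when a partition counts as skewed; values are illustrative only.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")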

From the query side, try filtering before joining. Even though Spark is usually smart enough to push filters down on its own, sometimes a nudge from the engineer helps.
select
    nb.*,
    facl.id as facl_decn_id,
    facl.seq_nbr as facl_decn_seq_nbr,
    case when facl.id is not null then concat(substr(nb.cse_dttm, 1, 10), ' 00:00:00.000') else cast(null as string) end as eng_dt
from
    (select *
     from interim_nobed
     where facl_id is not null
    ) nb
left outer join
    decn_bed_interim facl
on
    nb.id = facl.hsc_id and nb.facl_decn_id = facl.hsc_id
union all
select
    nb.*,
    cast(null as int) as facl_bed_id,
    cast(null as int) as facl_bed_seq_nbr,
    cast(null as string) as engg_dt
from
    interim_nobed nb
where
    nb.facl_id is null

Related

Spark partition filter is skipped when table is used in where condition, why?

Maybe someone has observed this behavior and knows why Spark takes this route.
I want to read only a few partitions from a partitioned table.
SELECT *
FROM my_table
WHERE snapshot_date IN('2023-01-06', '2023-01-07')
results in (part of) the physical plan:
-- Location: PreparedDeltaFileIndex [dbfs:/...]
-- PartitionFilters: [cast(snapshot_date#282634 as string) IN (2023-01-06,2023-01-07)]
This is very fast, ~1 s; in the execution plan I can see that the provided dates are used as partition filters.
If I instead provide the filter predicate in the form of a one-column table, Spark does a full table scan, which takes 100x longer.
SELECT *
FROM
my_table
WHERE snapshot_date IN (
SELECT snapshot_date
FROM (VALUES('2023-01-06'), ('2023-01-07')) T(snapshot_date)
)
-- plan
Location: PreparedDeltaFileIndex [dbfs:/...]
PartitionFilters: []
ReadSchema: ...
I was unable to find any query hint that would force Spark to push down this predicate.
One could easily write a for loop in Python that wraps the read logic and loads the desired dates one by one, but I'm not sure that is possible in SQL.
Is there any option/switch I have missed?
I don't think pushing down this kind of predicate is something Spark's Hive metastore client supports today.
So in the first case, the HiveShim.convertFilters(...) method will transform
WHERE snapshot_date IN ('2023-01-06', '2023-01-07')
into a filtering predicate understood by HMS,
snapshot_date="2023-01-06" or snapshot_date="2023-01-07"
but in the second, sub-select, case the condition is skipped altogether.
/**
 * Converts catalyst expression to the format that Hive's getPartitionsByFilter() expects, i.e.
 * a string that represents partition predicates like "str_key=\"value\" and int_key=1 ...".
 *
 * Unsupported predicates are skipped.
 */
def convertFilters(table: Table, filters: Seq[Expression]): String = {
  lazy val dateFormatter = DateFormatter()
  // ...
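If the goal is simply to get partition pruning back, one workaround (my own sketch, not part of the original answer) is to collect the small driving list of dates to the driver and inline them as literals, so the metastore client sees a plain IN list again. This assumes the date list is small and that my_table and snapshot_date exist as in the question.

// Hypothetical workaround sketch: materialize the driving dates, then build a literal IN list.
val dates: Array[String] = spark.sql(
  "SELECT snapshot_date FROM (VALUES ('2023-01-06'), ('2023-01-07')) T(snapshot_date)"
).collect().map(_.getString(0))

val inList = dates.map(d => s"'$d'").mkString(", ")

// The inlined literals can be pushed down as partition filters again.
val pruned = spark.sql(s"SELECT * FROM my_table WHERE snapshot_date IN ($inList)")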

Type 'INTERVAL' is not supported in Spark SQL 2.4.3 - what is the workaround?

EDIT: Apparently Spark 2.4.3 does not support INTERVAL. I cannot upgrade to Spark 3.0.0 for the time being (admin policy). Is there a workaround or alternative approach for INTERVAL at the moment? Thanks.
I am running a SQL query in Spark on Databricks and it shows an error on the INTERVAL line. I am trying to left-join the table to itself on the same user ID, with a one-month difference between activity months.
Error in SQL statement: ParseException:
Literals of type 'INTERVAL' are currently not supported.
Does Spark SQL not support INTERVAL literals?
Here is my try:
%sql
;WITH act_months AS (
SELECT DISTINCT
DATE_TRUNC('month', data_date) ::DATE AS act_month,
user_id
FROM user_sessions)
SELECT
prev.act_month,
prev.user_id,
curr.user_id IS NULL AS churned_next_month
FROM act_months AS prev
LEFT JOIN act_months AS curr
ON prev.user_id = curr.user_id
AND prev.act_month = (curr.act_month - INTERVAL '1 MONTH')
ORDER BY prev.act_month ASC, prev.user_id ASC;
here is my data structure
+----------+----------+
| data_date| user_id|
+----------+----------+
|2020-01-01|22600560aa|
|2020-01-01|17148900ab|
|2020-01-01|21900230aa|
|2020-01-01|35900050ac|
|2020-01-01|22300280ad|
|2020-01-02|19702160ac|
|2020-02-02|17900020aa|
|2020-02-02|16900120aa|
|2020-02-02|11160900aa|
|2020-03-02|16900290aa|
+----------+----------+
(Disclaimer: I am not a Spark user - this is me reposting my comment as an answer):
From my reading of Spark's documentation, INTERVAL literals are only supported in Spark 3.0.0 or later.
You said you're running Spark 2.4.3, so INTERVAL is not supported on your system.
However, you can use ADD_MONTHS (and DATE_ADD), which have been supported since (at least) Spark 2.3.0.
Try this:
;WITH q AS (
    SELECT DISTINCT
        DATE_TRUNC('month', data_date) AS act_year_month, -- DATE_TRUNC('month', $dt) returns a timestamp truncated to the start of the month; all smaller components are zeroed out.
        user_id
    FROM
        user_sessions
)
SELECT
    prev.act_year_month,
    prev.user_id,
    (curr.user_id IS NULL) AS churned_next_month
FROM
    q AS prev
    LEFT JOIN q AS curr ON
        prev.user_id = curr.user_id
        AND prev.act_year_month = ADD_MONTHS(curr.act_year_month, -1)
ORDER BY
    prev.act_year_month,
    prev.user_id;

LAST (not NULL) gives incorrect answers on a handful of records (Apache Spark SparkSQL)

I'm having a pretty troubling problem with the LAST aggregate in SparkSQL in Spark 2.3.1. It seems to give me around 4 bad results -- that is, values that are not LAST by the specified partitioning and order -- in 500,000 (logical SQL, not Spark) partitions, something like 50MM records. Smaller batches are worse -- the number of errors per batch seems pretty consistent, although I don't think I tried anything smaller than 100,000 logical SQL partitions.
I have roughly 66 FIRST or LAST aggregates, a compound (date, integer) logical sql partition key and a compound (string, string) sort key. I tried converting the four-character numeric values into integers, then I combined them into a single integer. Neither of those moves resolved the problem. Even with a single integer sort key, I was getting a few bad values.
Typically, there are fewer than a hundred records in each partition, and a handful of non-NULL values for any field. It never seems to get the second to last value; it's always at least third to last.
I did try to replace the simple aggregate with a windowed aggregate with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The one run I did of that gave me six bad records -- the compound integer key had given me only two, but I didn't do enough runs to really compare the approaches and of course I need zero.
Why do I not seem to be able to rely on LAST()? Here's a test which just illustrates the unwindowed version of the LAST function, although my partitioning and sorting fields are each two fields.
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.scalamock.scalatest.MockFactory
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
import collection.JavaConverters._

class LastTest extends FlatSpec with Matchers with MockFactory with BeforeAndAfterAll {
  implicit val spark: SparkSession = SparkSession.builder().appName("Last Test").master("local[2]").getOrCreate()
  import spark.implicits._

  // TRN_DATE, TRN_NUMBER, TRN_TIMESTAMP, DETAILS, DATE_TIME, QUE_LINE_ID, OPR_INITIALS, ENTRY_TYPE, HIST_NO, SUB_HIST_NO, MSG_INFO
  "LAST" must "work with GROUP BY" in {
    val lastSchema = StructType(Seq(
      StructField("Pfield", IntegerType)   // partition field
      , StructField("Ofield", IntegerType) // order field
      , StructField("Vfield", StringType)  // value field
    ))

    val last: DataFrame = spark.createDataFrame(List[Row](
      Row(0, 1, "Pencil")
      , Row(5, 1, "Aardvark")
      , Row(10, 1, "Monastery")
      , Row(10, 2, "Remediation")
      , Row(15, 1, "Parcifal")
      , Row(20, 1, "Montenegro")
      , Row(20, 2, "Susquehana")
      , Row(20, 3, "Perfidy")
      , Row(20, 4, "Prosody")
    ).asJava
      , lastSchema
    ).repartition(expr("MOD(Pfield, 4)"))

    last.createOrReplaceTempView("last_group_test")

    // apply the unwindowed last
    val unwindowed: DataFrame = spark.sql("SELECT Pfield, LAST(Vfield) AS Vlast FROM (SELECT * FROM last_group_test ORDER BY Pfield, Ofield) GROUP BY Pfield ORDER BY Pfield")
    unwindowed.show(5)

    // apply a windowed last
    val windowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    windowed.show(5)

    // include the partitioning function in the window
    val excessivelyWindowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY MOD(Pfield, 4), Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    excessivelyWindowed.show(5)

    assert(unwindowed.collect() === windowed.collect() && windowed.collect() === excessivelyWindowed.collect())
    assert(windowed.count() == 5)
    assert(windowed.filter("Pfield=20").select($"Vlast").collect()(0)(0) === "Prosody")
  }
}
So, all three datasets are the same here, which is nice. But if I apply this logic to my actual needs -- which involve sixty-odd columns, almost all of which are LAST values -- I get wrong values, roughly 4 times in a batch of 500,000 groups. If I run the dataset 30 times, I get 30 different sets of bad records.
Am I doing something wrong, or is this a defect? Is it a known defect? Is it fixed in 2.4? I didn't see it, but "aggregates simply don't work sometimes" can't be something they released with, right?
I was able to resolve the issue by applying the windowed aggregate to a dataset with the same sorting, sorted in a subquery.
SELECT LAST(VAL) FROM (SELECT * FROM TBL ORDER BY SRT) SRC GROUP BY PRT
was not sufficient, nor was
SELECT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM TBL
I had to do both:
SELECT DISTINCT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (SELECT * FROM TBL ORDER BY SRT) SRC
These datasets had been extracted from an Oracle 12.2 instance over JDBC. I also added SRT to the ORDER BY clause there, which previously had only ORDER BY PRT.
Further -- and I think this may have been most significant -- I used the cacheTable API on the Spark catalog object after extracting the data. I had been doing
.repartition
.cache
.count
in order to load all the records with a relatively small number of database connections, but I suspect that was not enough to get all the data Spark-side before the aggregations took place.
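For illustration only, here is a minimal sketch of the extract-then-cache pattern described above; the JDBC URL, table name, and column names (TBL, PRT, SRT, VAL) are placeholders taken from the pseudo-queries, not a real implementation from the post.

// Hypothetical sketch: read over JDBC with SRT included in the source ORDER BY,
// cache through the catalog API, and force materialization before running LAST().
val extracted = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") // placeholder connection string
  .option("dbtable", "(SELECT * FROM TBL ORDER BY PRT, SRT) SRC")
  .load()

extracted.createOrReplaceTempView("tbl_extract")
spark.catalog.cacheTable("tbl_extract")
spark.table("tbl_extract").count() // materialize the cache before aggregating

val result = spark.sql(
  """SELECT DISTINCT PRT,
    |       LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT
    |                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_val
    |FROM (SELECT * FROM tbl_extract ORDER BY SRT) SRC""".stripMargin)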

Converting the Hive SQL output to an Array[Double]

I am reading some data from a Hive table using a Hive context in Spark, and the output is a Row with only one column. I need to convert this to an array of Double. I have tried all possible ways to do it myself with no success. Can somebody please help with this?
val qRes = hiveContext.sql("""
  SELECT Sum(EQUnit) * Sum(Units)
  FROM pos_Tran_orc T
  INNER JOIN brand_filter B
    ON t.mbbrandid = b.mbbrandid
  INNER JOIN store_filter s
    ON t.msstoreid = s.msstoreid
  GROUP BY Transdate
""")
What next?
You can simply map using the Row.getDouble method:
qRes.map(_.getDouble(0)).collect()
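On the Spark 1.x HiveContext API the line above returns an Array[Double] directly. As a variant of my own (not part of the original answer): on Spark 2.x+, where a Dataset map would need an implicit encoder, going through the underlying RDD sidesteps that.

// Assumed variant for newer Spark: map over the underlying RDD to avoid needing an encoder.
val result: Array[Double] = qRes.rdd.map(_.getDouble(0)).collect()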

How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM).
I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact.
I am sure I am missing something. Below is my final try, which took about 11 minutes to get this simple count, versus only 40 seconds using a JDBC connection through R.
bin/pyspark --driver-memory 40g --executor-memory 40g
df = sqlContext.read.jdbc("jdbc:teradata://......)
df.count()
When I tried with a BIG table (5B records), no results were returned upon completion of the query.
All of the aggregation operations are performed after the whole dataset is retrieved into memory as a DataFrame collection, so doing the count in Spark will never be as efficient as doing it directly in Teradata. Sometimes it's worth pushing some computation into the database by creating views and then mapping those views using the JDBC API.
Every time you use the JDBC driver to access a large table you should specify the partitioning strategy; otherwise you will create a DataFrame/RDD with a single partition and you will overload the single JDBC connection.
Instead, you want to try the following API (available since Spark 1.4.0):
sqlctx.read.jdbc(
url = "<URL>",
table = "<TABLE>",
columnName = "<INTEGRAL_COLUMN_TO_PARTITION>",
lowerBound = minValue,
upperBound = maxValue,
numPartitions = 20,
connectionProperties = new java.util.Properties()
)
There is also an option to push down some filtering.
If you don't have a uniformly distributed integral column, you can create custom partitions by specifying custom predicates (WHERE clauses). For example, suppose you have a timestamp column and want to partition by date ranges:
val predicates =
Array(
"2015-06-20" -> "2015-06-30",
"2015-07-01" -> "2015-07-10",
"2015-07-11" -> "2015-07-20",
"2015-07-21" -> "2015-07-31"
)
.map {
case (start, end) =>
s"cast(DAT_TME as date) >= date '$start' AND cast(DAT_TME as date) <= date '$end'"
}
predicates.foreach(println)
// Below is the result of how predicates were formed
//cast(DAT_TME as date) >= date '2015-06-20' AND cast(DAT_TME as date) <= date '2015-06-30'
//cast(DAT_TME as date) >= date '2015-07-01' AND cast(DAT_TME as date) <= date '2015-07-10'
//cast(DAT_TME as date) >= date '2015-07-11' AND cast(DAT_TME as date) <= date '2015-07-20'
//cast(DAT_TME as date) >= date '2015-07-21' AND cast(DAT_TME as date) <= date '2015-07-31'
sqlctx.read.jdbc(
url = "<URL>",
table = "<TABLE>",
predicates = predicates,
connectionProperties = new java.util.Properties()
)
This will generate a DataFrame where each partition contains the records of the subquery associated with the corresponding predicate.
Check the source code at DataFrameReader.scala.
Does the unserialized table fit into 40 GB? If it starts swapping to disk, performance will decrease dramatically.
Anyway, when you use standard JDBC with ANSI SQL syntax you leverage the DB engine, so if Teradata (I don't know Teradata) holds statistics about your table, a classic "select count(*) from table" will be very fast.
Spark, instead, loads your 100 million rows into memory with something like "select * from table" and then performs a count on the RDD rows. It's a pretty different workload.
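If all you need is the count, one way to exploit that difference (my own sketch, not part of the original answers; the URL and table name are placeholders) is to hand the JDBC reader a subquery, so the database does the aggregation and only a single row crosses the wire:

// Hypothetical sketch: push the COUNT(*) into the database via a subquery "table".
val countDf = sqlContext.read.jdbc(
  "jdbc:teradata://<host>/<db>",                  // placeholder URL
  "(SELECT COUNT(*) AS cnt FROM my_table) t",     // the database computes the count
  new java.util.Properties()
)
countDf.show()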
One solution that differs from the others is to save the data from the database table into Avro files (partitioned into many files) stored on Hadoop.
That way, reading those Avro files with Spark is a piece of cake, since you no longer call the database at all. A rough sketch of this follows.
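The sketch below assumes Spark 1.4-era APIs with the spark-avro package on the classpath; the JDBC URL, table name, and HDFS path are placeholders, not details from the answer.

// Hypothetical sketch: one-time extract to Avro on HDFS, then later reads skip the DB entirely.
val extract = sqlContext.read.jdbc(
  "jdbc:teradata://<host>/<db>",   // placeholder URL
  "my_table",                      // placeholder table name
  new java.util.Properties()
)

extract.write
  .format("com.databricks.spark.avro")   // spark-avro package (assumed available)
  .save("hdfs:///staging/my_table_avro")

// Later jobs read the files instead of calling the database.
val fromAvro = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///staging/my_table_avro")
fromAvro.count()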
