Spark structured streaming: what are the possible usages of queryName() setting? - apache-spark

As per the Structured Streaming Programming Guide,
queryName("myTableName") is used to define the in-memory table name when the output sink is format("memory"):
aggDF
  .writeStream
  .queryName("aggregates")   // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()

spark.sql("select * from aggregates").show()   // interactively query in-memory table
The Spark source code for DataStreamWriter.scala documents queryName() as:
Specifies the name of the [[StreamingQuery]] that can be started with start().
This name must be unique among all the currently active queries in the associated SQLContext.
QUESTION: are there any other possible usages of the queryName() setting? Spark job logs? Details in the progress monitoring of the query?

I came across the following three usages of the queryName:
As mentioned by the OP and documented in the Structured Streaming Guide, it is used to define the in-memory table name when the output sink is of format "memory".
The queryName defines the value of event.progress.name, where the event is a QueryProgressEvent within a StreamingQueryListener (see the sketch below).
It is also used in the Description column of the Spark Web UI (see the screenshot, where I set queryName("StackoverflowTest")).
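As a minimal sketch of the second point: a listener registered on the session sees the value set via queryName() in the progress events. (The println logging here is just illustrative.)

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Minimal listener that logs progress per query name.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Started query: ${event.name} (id=${event.id})")

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    // event.progress.name carries the value set via queryName(...)
    println(s"Query ${event.progress.name} processed ${event.progress.numInputRows} rows")

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Terminated query id=${event.id}")
})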

Adding to @mike's answer, I want to mention that in Databricks (which uses Spark at its core) you can use the defined query name in conjunction with the function untilStreamIsReady().
For example, if you define the streaming query StackoverflowTest, then you can execute untilStreamIsReady('StackoverflowTest') to wait until the query is ready and started (sorry for being Captain Obvious).
I must say I could not find a direct reference for this function in the official documentation, but found it in the following links:
In Spark Streaming, is there a way to detect when a batch has finished?
example of usage: https://youtu.be/KLD10xn4sX8?t=1219
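Since untilStreamIsReady() is Databricks-specific and not part of open-source Spark, here is a rough, hypothetical equivalent using only the public StreamingQueryManager API; the helper name and polling interval are made up:

import org.apache.spark.sql.SparkSession

// Hypothetical helper: block until a query with the given name shows up
// among the active queries of this session.
def waitUntilStreamIsReady(spark: SparkSession, name: String, pollMs: Long = 5000): Unit = {
  while (!spark.streams.active.exists(q => q.name == name && q.isActive)) {
    Thread.sleep(pollMs)
  }
}

// waitUntilStreamIsReady(spark, "StackoverflowTest")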

Related

Is there a way to read data without SQL in Spark?

I am a beginner in Spark and was given an assignment to read data from a CSV file and perform some queries on the data using Spark Core.
However, every online resource that I find uses some form of SQL from the pyspark.sql module.
Is there any way to read data and perform data queries (select, count, group by) using only Spark Core?
The core concept of Spark Core is the RDD. Here you can find more information and examples of processing text files.
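For illustration, a minimal RDD-only sketch; the file path and column layout are made up, and the CSV split is naive (no quoted fields):

// Assumes an existing SparkContext `sc`.
val lines = sc.textFile("/data/people.csv")
val header = lines.first()
val rows = lines
  .filter(_ != header)       // drop the header line
  .map(_.split(","))         // naive CSV split

// "select" two columns: name (index 0) and department (index 2)
val selected = rows.map(cols => (cols(0), cols(2)))

// total row count
val total = rows.count()

// "group by" department with a count per group
val countsByDept = selected
  .map { case (_, dept) => (dept, 1) }
  .reduceByKey(_ + _)

countsByDept.collect().foreach(println)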
It's good practice to use Spark DataFrames instead of Spark RDDs.
Spark DataFrames use the Catalyst optimizer, which automatically rewrites your code internally into the most efficient execution plan to improve performance.
https://blog.bi-geek.com/en/spark-sql-optimizador-catalyst/
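For comparison, the same kind of query with the DataFrame API (path and column names are illustrative), letting Catalyst do the optimization:

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/people.csv")

df.select("name", "department").show()   // select
println(df.count())                      // count
df.groupBy("department").count().show()  // group by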

How does a Spark Structured Streaming job handle a stream-static DataFrame join?

I have a Spark Structured Streaming job which reads a mapping table from Cassandra and Delta Lake and joins it with the streaming DataFrame. I would like to understand the exact mechanism here. Does Spark hit these data sources (Cassandra and Delta Lake) for every micro-batch cycle? If that is the case, why do I see in the Spark web UI that these tables are read only once?
Please help me understand this.
Thanks in advance
"Does spark hit these data sources(cassandra and deltalake) for every cycle of microbatch?"
According to the book "Learning Spark, 2nd edition" from O'Reilly on static-stream joins it is mentioned that the static DataFrame is read in every micro-batch.
To be more precise, I find the following section in the book quite helpful:
Stream-static joins are stateless operations, and therefore do not require any kind of watermarking.
The static DataFrame is read repeatedly while joining with the streaming data of every micro-batch, so you can cache the static DataFrame to speed up reads.
If the underlying data in the data source on which the static DataFrame was defined changes, whether those changes are seen by the streaming query depends on the specific behavior of the data source. For example, if the static DataFrame was defined on files, then changes to those files (e.g. appends) will not be picked up until the streaming query is restarted.
When applying a "static-stream" join it is assumed that the static part is not changing at all or only slowly changing. If you plan to join two rapidly changing data sources it is required to switch to a "stream-stream" join.

Pulling only required columns in Spark from Cassandra without loading all the columns

Using the spark-elasticsearch connector it is possible to directly load only the required columns from ES into Spark. However, there doesn't seem to be such a straightforward option to do the same using the spark-cassandra connector.
Reading data from ES into Spark
-- here only required columns are being brought from ES to Spark :
spark.conf.set('es.nodes', ",".join(ES_CLUSTER))
es_epf_df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("es.read.field.include", "id_,employee_name") \
    .load("employee_0001")
Reading data from Cassandra into Spark
-- here all the columns' data is brought to spark and then select is applied to pull columns of interest :
spark.conf.set('spark.cassandra.connection.host', ','.join(CASSANDRA_CLUSTER))
cass_epf_df = spark.read.format('org.apache.spark.sql.cassandra') \
    .options(keyspace="db_0001", table="employee") \
    .load() \
    .select("id_", "employee_name")
Is it possible to do the same for Cassandra? If yes, then how? If not, why not?
Actually, the connector should do that itself, without the need to set anything explicitly; it's called "predicate pushdown" (and, for column selection, column pruning), and the cassandra-connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to Cassandra. The Datasource will also automatically only select columns from Cassandra which are required to complete the query. This can be monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
The code you have written is already doing that. You have written the select after the load, so you may think that all the columns are pulled first and the selected columns are filtered afterwards, but that is not the case.
Assumption: select * from db_0001.employee;
Actual: select id_, employee_name from db_0001.employee;
Spark will understand which columns you need and query only those from the Cassandra database. This feature is called predicate pushdown (strictly speaking, column pruning for the column selection). It is not limited to Cassandra; many sources support it (it is a feature of Spark's data source API, not of Cassandra specifically).
For more info: https://docs.datastax.com/en/dse/6.7/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html
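To see this for yourself, you can inspect the physical plan; a minimal sketch, reusing the keyspace and table names from the question:

val cassDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "db_0001", "table" -> "employee"))
  .load()
  .select("id_", "employee_name")

// The Cassandra scan node in the plan should list only id_ and employee_name
// as requested columns, plus any pushed filters.
cassDf.explain(true)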

How to specify the filter condition with spark DataFrameReader API for a table?

I was reading the Databricks documentation on Spark: https://docs.databricks.com/data/tables.html#partition-pruning-1
It says:
When the table is scanned, Spark pushes down the filter predicates involving the partitionBy keys. In that case, Spark avoids reading data that doesn't satisfy those predicates. For example, suppose you have a table that is partitioned by <date>. A query such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10' reads only the data files containing tuples whose date value matches the one specified in the query.
How can I specify such a filter condition in the DataFrameReader API while reading a table?
As Spark is lazily evaluated, when you read the data using a DataFrame reader it is just added as a stage in the underlying DAG.
When you then run a SQL query over the data, it is also added as another stage in the DAG.
And when you apply any action on the DataFrame, the DAG is evaluated and all the stages are optimized by the Catalyst optimizer, which in the end generates the most cost-effective physical plan.
At the time of DAG evaluation, the predicate conditions are pushed down and only the required data is read into memory.
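In practice that means you just express the filter on the DataFrame after reading the table; a minimal sketch for a table partitioned by date (the table and column names are illustrative):

import org.apache.spark.sql.functions.{col, max}

// Read the (partitioned) table and filter on the partition column.
val df = spark.read.table("example_data")
  .where(col("date") === "2010-10-10")

// For file-based tables, the scan node of the physical plan shows the
// pruning, e.g. as a PartitionFilters entry.
df.explain(true)

df.agg(max("id")).show()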
DataFrameReader is created (available) exclusively via SparkSession.read.
That means it is created when the following code is executed (example of a CSV file load):
val df = spark.read.csv("path1,path2,path3")
Spark provides a pluggable data provider framework (the Data Source API) to roll out your own data source. Basically, it provides interfaces that can be implemented for reading from and writing to your custom data source. That's where partition pruning and predicate pushdown are generally implemented.
Databricks Spark supports many built-in data sources (along with predicate pushdown and partition pruning capabilities), as per https://docs.databricks.com/data/data-sources/index.html.
So, if the need is to load data from a JDBC table and specify filter conditions, please see the following example:
// Note: The parentheses are required.
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
For more details, please refer to https://docs.databricks.com/data/data-sources/sql-databases.html
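Alternatively, you can read the JDBC table directly and apply the filter on the DataFrame; Spark's JDBC source pushes simple predicates down into the generated SQL. A small sketch, reusing the jdbcUrl and connectionProperties from the example above:

val employees = spark.read
  .jdbc(url = jdbcUrl, table = "employees", properties = connectionProperties)
  .filter("emp_no < 10008")

// The pushed predicate shows up as PushedFilters on the JDBC scan node.
employees.explain(true)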

Cost Based Optimizer (CBO) stats are not used while evaluating query plans in Spark SQL - apache-spark

We are trying to leverage CBO to get better plan results for a few critical queries run through spark-sql or through the Thrift server using the JDBC driver. The following settings were added to spark-defaults.conf:
spark.sql.cbo.enabled true
spark.experimental.extrastrategies intervaljoin
spark.sql.cbo.joinReorder.enabled true
The tables that we are using are not partitioned.
Please let me know if you need further details.
You provide little detail. Please check whether all the steps set out below have been followed.
As of Spark 2.2, when I last looked at this, the following applies (taken from this excellent article: https://www.waitingforcode.com/apache-spark-sql/spark-sql-cost-based-optimizer/read):
Spark SQL implementation
At the time of writing (2.2.0 released), Spark SQL Cost-Based Optimization is disabled by default and can be activated through the spark.sql.cbo.enabled property. When enabled, it applies to filtering, projection, joins and aggregations, as we can see in the corresponding estimation objects from the org.apache.spark.sql.catalyst.plans.logical.statsEstimation package: FilterEstimation, ProjectEstimation, JoinEstimation and AggregateEstimation.
Even if at first glance the use of the estimation objects seems to be conditioned only by the configuration property, that's not always the case. Spark's CBO is applied only when statistics about the manipulated data are known (read more about them in the post devoted to statistics in Spark SQL). This condition is expressed by the EstimationUtils method:
def rowCountsExist(conf: SQLConf, plans: LogicalPlan*): Boolean =
  plans.forall(_.stats(conf).rowCount.isDefined)
Filtering is an exception, because it only checks for the existence of the row count:
if (childStats.rowCount.isEmpty) return None
The statistics can be gathered by executing the ANALYZE TABLE $TABLE_NAME COMPUTE STATISTICS command before the processing is executed. When the ANALYZE command is called, it is executed by
org.apache.spark.sql.execution.command.AnalyzeTableCommand#run(SparkSession), which updates the
org.apache.spark.sql.catalyst.catalog.SessionCatalog statistics of the processed data.
The only problem with the ANALYZE command is that it can be called only for Hive and in-memory data stores.
Also, CBO does not work properly with partitioned Hive Parquet tables; CBO only gives the size and not the estimated number of rows.
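As a concrete starting point, here is a minimal sketch of enabling CBO and collecting statistics for a table (the database, table and column names are illustrative):

// Enable CBO and join reordering for this session
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect table-level and column-level statistics before running the queries
spark.sql("ANALYZE TABLE mydb.orders COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE mydb.orders COMPUTE STATISTICS FOR COLUMNS order_id, customer_id")

// Check that the statistics are actually attached to the table
// (look for the Statistics row in the output)
spark.sql("DESCRIBE EXTENDED mydb.orders").show(100, truncate = false)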
