I'm using Structured Streaming. I need to left join a huge (billions of rows) Cassandra table to know whether the source data in each micro-batch is new or already exists, in terms of the id column. If I do something like:
val src = spark.read.cassandraFormat("src", "ks").load().select("id").as("src")
val query = some_dataset.as("some_dataset")
  .join(src, expr("src.id = some_dataset.id"), "left_outer")
  .withColumn("flag", expr("case when src.id is null then 0 else 1 end"))
  .writeStream
  .outputMode("update")
  .foreach(...)
  .start()
Can the Cassandra connector push down the left join and look up only the join-column values present in the micro-batch? Is there a way to tell whether the pushdown happened or not?
Thanks
Not in the open source version of the Spark Cassandra Connector. There is support for it as DSE Direct Join in DSE Analytics, so if you use DataStax Enterprise, you'll get it. If you're using the OSS connector, you're limited to the RDD API only.
Update, May 2020: an optimized join on DataFrames is supported since SCC 2.5.0, together with other previously commercial-only features. See this blog post for details.
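For the OSS connector, the DataFrame direct join generally needs the connector's Catalyst extensions registered on the session. A minimal Scala sketch under that assumption; the extension class name is the one documented for SCC 2.5.x (verify it for your version), the connection host is a placeholder, and someDataset is a toy stand-in for the question's some_dataset:

// Hedged sketch: enabling the SCC 2.5.0+ DataFrame direct join.
// The extension class name is assumed from the SCC 2.5.x docs; check it for your version.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

val spark = SparkSession.builder()
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .config("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
  .getOrCreate()

val src = spark.read.cassandraFormat("src", "ks").load().select("id")

// Toy left side standing in for the question's micro-batch data.
val someDataset = spark.range(100).toDF("id")

// With the extensions registered, a join keyed on the table's partition key can be
// rewritten into per-key reads instead of a full scan; explain() should show a
// direct-join style node instead of a plain Cassandra scan plus shuffle.
someDataset.join(src, someDataset("id") === src("id"), "left_outer").explain()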
How do I use the storage partitioned join feature in Spark 3.3.0? I've tried it out, and my query plan still shows the expensive ColumnarToRow and Exchange steps. My setup is as follows:
joining two Iceberg tables, both partitioned on hours(ts), bucket(20, id)
join attempted on a.id = b.id AND a.ts = b.ts and on a.id = b.id
tables are large, 100+ partitions used, 100+ GB of data to join
spark: 3.3.0
iceberg: org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
set my spark session config with spark.sql.sources.v2.bucketing.enabled=true
I read through all the docs I could find on the storage partitioned join feature:
tracker
SPIP
PR
Youtube demo
I'm wondering if there are other things I need to configure, if there needs to be something implemented in Iceberg still, or if I've set up something wrong. I'm super excited about this feature. It could really speed up some of our large joins.
The support hasn't been implemented in Iceberg yet. In fact it looks like the work is proceeding as I'm typing: https://github.com/apache/iceberg/issues/430#issuecomment-1283014666
This answer should be updated when there's a release of Iceberg that supports Spark storage-partitioned joins.
Support for storage-partitioned joins (SPJ) has been added to Iceberg in PR #6371 and will be released in 1.2.0. Keep in mind Spark added support for SPJ for v2 sources only in 3.3, so earlier versions can't benefit from this feature.
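For reference, a hedged sketch of the session settings most commonly cited for getting SPJ to kick in with Spark 3.3 and Iceberg 1.2.0+. The exact config keys are my assumption and should be checked against the Spark and Iceberg docs for your versions; a and b stand for the two Iceberg tables from the question:

// Hedged sketch: settings typically needed for storage-partitioned joins.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
// Iceberg-side switch that keeps the partition grouping Spark needs (assumption: Iceberg 1.2.0+).
spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")
// Allow co-partitioned joins even when the join keys don't cover every clustering key.
spark.conf.set("spark.sql.requireAllClusterKeysForCoPartition", "false")

// If SPJ kicks in, the Exchange (shuffle) nodes should disappear from the plan:
spark.sql("SELECT * FROM a JOIN b ON a.id = b.id AND a.ts = b.ts").explain()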
Using the spark-elasticsearch connector, it is possible to load only the required columns from ES into Spark. However, there doesn't seem to be such a straightforward option to do the same with the spark-cassandra connector.
Reading data from ES into Spark
-- here only the required columns are brought from ES to Spark:
spark.conf.set('es.nodes', ",".join(ES_CLUSTER))
es_epf_df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("es.read.field.include", "id_,employee_name") \
    .load("employee_0001")
Reading data from Cassandra into Spark
-- here all the columns' data is brought to Spark and then a select is applied to pull the columns of interest:
spark.conf.set('spark.cassandra.connection.host', ','.join(CASSANDRA_CLUSTER))
cass_epf_df = spark.read.format('org.apache.spark.sql.cassandra') \
    .options(keyspace="db_0001", table="employee") \
    .load() \
    .select("id_", "employee_name")
Is it possible to do the same for Cassandra? If yes, then how? If not, why not?
Actually, the connector should do that by itself, without the need to set anything explicitly. It's called "predicate pushdown" (column selection is pruned the same way), and the cassandra-connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to
Cassandra. The Datasource will also automatically only select columns
from Cassandra which are required to complete the query. This can be
monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
The code you have written is already doing that. Because you wrote the select after the load, you may think all the columns are pulled first and the selected columns are filtered afterwards, but that is not the case.
Assumed query: select * from db_0001.employee;
Actual query sent to Cassandra: select id_, employee_name from db_0001.employee;
Spark will understand which columns you need and query only those from the Cassandra database. This feature is called column pruning and goes hand in hand with predicate pushdown. It is not limited to Cassandra; many sources support it (the optimization is driven by Spark's data source API, not by Cassandra itself).
For more info: https://docs.datastax.com/en/dse/6.7/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html
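As the quoted documentation says, the pushdown can be monitored with explain. A minimal Scala sketch (the PySpark equivalent is simply cass_epf_df.explain()); the filter value is hypothetical, added only so that pushed filters show up in the plan:

import org.apache.spark.sql.functions.col

val cassDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "db_0001", "table" -> "employee"))
  .load()
  .select("id_", "employee_name")
  .filter(col("id_") === "e_1001")   // hypothetical predicate, just to exercise pushdown

cassDf.explain()
// In the physical plan, the Cassandra scan node should list only id_ and employee_name
// as the requested columns and show the pushed filters; the exact wording varies by
// connector version, but a full-table projection would list every column instead.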
Question: essentially, rather than running a join against the C* table for each streaming record, is there any way to run the join once per micro-batch of records in Spark streaming?
We have almost finalized on spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x.
But we have one fundamental question regarding efficiency in the scenario below.
For the streaming records (i.e. streamingDataSet), I need to look up existing records (i.e. cassandraDataset) from a Cassandra (C*) table.
i.e.
Dataset<Row> streamingDataSet = // dataset read from Kafka
Dataset<Row> cassandraDataset = // records loaded earlier from the C* table
To look up the data I need to join the above datasets,
i.e.
Dataset<Row> joinDataSet = streamingDataSet.join(cassandraDataset).where(//some logic)
then process joinDataSet further to implement the business logic ...
In the above scenario, my understanding is that for each record received from the Kafka stream it would query the C* table, i.e. make a database call. Wouldn't that take huge time and network bandwidth if the C* table contains billions of records? What approach/procedure should be followed to improve lookups against the C* table?

What is the best solution in this scenario? I CANNOT load from the C* table once and look up against that, as data keeps being added to the C* table, i.e. new lookups might need newly persisted data.

How to handle this kind of scenario? Any advice please..
If you're using Apache Cassandra, then you have only one possibility for an effective join with data in Cassandra - the RDD API's joinWithCassandraTable. The open source version of the Spark Cassandra Connector (SCC) supports only that, while the DSE version contains code that allows performing an effective join against Cassandra from Spark SQL as well - the so-called DSE Direct Join. If you use a join in Spark SQL against a Cassandra table, Spark will need to read all data from Cassandra and then perform the join - that's very slow.
I don't have an example of doing this join from Spark Structured Streaming with the OSS SCC, but I have some examples of a "normal" join, like this:
// someColumns, mapRowToTuple and mapTupleToRow are static helpers from
// com.datastax.spark.connector.japi.CassandraJavaUtil; trdd (the left-hand RDD of join keys) is created elsewhere.
CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joinedRDD =
    trdd.joinWithCassandraTable("test", "jtest",
        someColumns("id", "v"), someColumns("id"),
        mapRowToTuple(Integer.class, String.class), mapTupleToRow(Integer.class));
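For completeness, here is a rough Scala sketch (my own, not part of the answer above) of how foreachBatch could be combined with joinWithCassandraTable so that only the keys of the current micro-batch are looked up. The Kafka options, the keyspace/table/column names, and the assumption that the Kafka value carries a text id are all placeholders:

// Rough sketch only: per-micro-batch lookup through the RDD API.
import com.datastax.spark.connector._
import org.apache.spark.sql.DataFrame
import spark.implicits._   // assumes `spark` is the active SparkSession

val streamingDataSet = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")  // placeholder
  .option("subscribe", "myTopic")                  // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS id")       // assumption: the Kafka value is the id

streamingDataSet.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Collect the join keys of this micro-batch and look up only those partitions in
    // Cassandra, instead of scanning the whole table on every batch.
    val keys = batchDF.select("id").as[String].rdd.map(Tuple1(_))
    val existing = keys.joinWithCassandraTable("ks", "src", SomeColumns("id"))
    // `existing` now holds only the ids already present in Cassandra;
    // flag / filter the micro-batch accordingly and write out the result.
  }
  .start()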
I'm trying to migrate a query to pyspark and need to join multiple tables in it. All the tables in question are in Redshift and I'm using the jdbc connector to talk to them.
My problem is how to do these joins optimally without reading in too much data (i.e. loading the whole table and joining on the key), and without just blatantly using:
spark.sql("""join table1 on x=y join table2 on y=z""")
Is there a way to pushdown the queries to Redshift but still use the Spark df API for writing the logic and also utilizing df from spark context without saving them to Redshift just for the joins?
Please find below some points to consider:
The connector will push down the specified filters only if there is a filter specified in your Spark code, e.g. select * from tbl where id > 10000. You can confirm that yourself: just check the responsible Scala code. Here is also the corresponding test which demonstrates exactly that. The test test("buildWhereClause with multiple filters") verifies that the variable expectedWhereClause is equal to the whereClause generated by the connector. The generated where clause should be:
"""
|WHERE "test_bool" = true
|AND "test_string" = \'Unicode是樂趣\'
|AND "test_double" > 1000.0
|AND "test_double" < 1.7976931348623157E308
|AND "test_float" >= 1.0
|AND "test_int" <= 43
|AND "test_int" IS NOT NULL
|AND "test_int" IS NULL
"""
which is derived from the Spark filters specified above.
The driver also supports column filtering, meaning it will load only the required columns by pushing down the valid columns to Redshift. You can again verify that from the corresponding Scala tests test("DefaultSource supports simple column filtering") and test("query with pruned and filtered scans").
In your case, though, you haven't specified any filters in your join query, hence Spark cannot leverage the two previous optimisations. If you are aware of such filters, please feel free to apply them.
Last but not least, and as Salim already mentioned, the official Spark connector for Redshift can be found here. The Spark connector is built on top of the Amazon Redshift JDBC Driver and will therefore try to use it anyway, as specified in the connector's code.
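If the goal is for Redshift itself to execute the join and hand Spark only the result, one option (an assumption on my part, not something the tests above cover) is to pass the whole query through the source, either via the connector's query option or the generic JDBC source as sketched below. The URL, driver class, and table/column names are placeholders:

// Hedged sketch: let Redshift run the join and return only the result to Spark.
val joined = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://host:5439/db?user=myUser&password=myPassword")  // placeholder
  .option("driver", "com.amazon.redshift.jdbc42.Driver")  // adjust to the JDBC driver you ship
  .option("query",
    """select t1.*, t2.*
      |from table1 t1
      |join table2 t2 on t1.y = t2.y""".stripMargin)       // placeholder join
  .load()

The trade-off is that only the SQL you hand over runs inside Redshift; DataFrames created purely on the Spark side still have to be joined in Spark.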
I have a Spark Structured Streaming application (listening to Kafka) that is also reading from a persistent table in S3. I am trying to have each micro-batch check for updates to the table. I have tried
var myTable = spark.table("myTable")
and
spark.sql("select * from parquet.`s3n://myFolder/`")
Both do not work in a streaming context. The issue is that the parquet files change at each update, and none of the normal refresh commands help, such as:
spark.catalog.refreshTable("myTable")
spark.sqlContext.clearCache()
I have also tried:
spark.sqlContext.setConf("spark.sql.parquet.cacheMetadata","false")
spark.conf.set("spark.sql.parquet.cacheMetadata",false)
to no avail. There has to be a way to do this. Would it be smarter to use a JDBC connection to a database instead?
Assuming I'm reading you right, I believe the issue is that because DataFrames are immutable, you cannot see changes to your parquet table unless you restart the streaming query and create a new DataFrame. This question has come up on the Spark mailing list before. The definitive answer appears to be that the only way to capture these updates is to restart the streaming query. If your application cannot tolerate 10-second hiccups, you might want to check out this blog post, which summarizes the above conversation and discusses how SnappyData enables mutations on Spark DataFrames.
Disclaimer: I work for SnappyData
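A minimal sketch of the restart approach described above, assuming `spark` is the active session and that the Kafka options, paths, join key, checkpoint location, and sink are all placeholders:

// Sketch only: rebuild the static DataFrame and restart the query to pick up changes.
def startQuery() = {
  val static = spark.read.parquet("s3n://myFolder/")            // re-read the current table state
  val stream = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")             // placeholder
    .option("subscribe", "myTopic")                             // placeholder
    .load()
  stream.join(static, Seq("id"), "left_outer")                  // placeholder join key
    .writeStream
    .option("checkpointLocation", "s3n://myFolder/checkpoints") // placeholder
    .format("console")
    .start()
}

var query = startQuery()
// Whenever the parquet table is known to have changed:
query.stop()           // brief pause while the query restarts
query = startQuery()   // the new query picks up the refreshed static DataFrame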
This will accomplish what I'm looking for.
val df1Schema = spark.read.option("header", "true").csv("test1.csv").schema
val df1Stream = spark.readStream.schema(df1Schema).option("header", "true").csv("/1")
df1Stream.writeStream.format("memory").outputMode("append").queryName("df1").start()
var df1 = spark.sql("select * from df1")
The downside is that it's appending. One way around that is to remove duplicates based on id, keeping the row with the newest date:
val dfOrder = df1.orderBy(col("id"), col("updateTableTimestamp").desc)
val dfMax = dfOrder.groupBy(col("id")).agg(first("name").as("name"),first("updateTableTimestamp").as("updateTableTimestamp"))
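Note that the orderBy-then-groupBy pattern is not guaranteed to keep the newest row, since the sort order is not guaranteed to survive the aggregation. A minimal alternative sketch using a window function, reusing the column names above:

// Deterministic "newest row per id" using a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy(col("id")).orderBy(col("updateTableTimestamp").desc)
val dfLatest = df1
  .withColumn("rn", row_number().over(w))   // rank rows within each id, newest first
  .filter(col("rn") === 1)                  // keep only the newest row per id
  .drop("rn")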