Can't read via Apache Spark Structured Streaming from Hive Table - apache-spark

When I try to read from a Hive table with the following code, I get the error buildReader is not supported for HiveFileFormat from the Spark driver pod.
spark.readStream \
    .table("table_name") \
    .repartition("filename") \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime="10 minutes") \
    .foreachBatch(perBatch) \
    .start()
I have tried every possible combination, including the simplest queries possible. Reading directly from the specified folder via the parquet method works, as does Spark SQL without streaming, but reading with Structured Streaming via readStream does not.
The documentation says the following:
Since Spark 3.1, you can also use DataStreamReader.table() to read tables as streaming DataFrames and use DataStreamWriter.toTable() to write streaming DataFrames as tables:
I'm using the latest Spark 3.2.1. Although reading from a table is not shown in the examples, the paragraph above clearly suggests it should be possible.
Any assistance in getting this working would be really appreciated and would simplify my project a lot.
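For reference, the non-streaming read and the direct parquet read that do work look roughly like this (a minimal sketch; the warehouse path is a placeholder, not my real one):
static_df = spark.sql("SELECT * FROM table_name")  # Spark SQL without streaming works

# Streaming straight from the table's underlying parquet files also works,
# given an explicit schema (placeholder path)
stream_df = spark.readStream \
    .schema(static_df.schema) \
    .parquet("/user/hive/warehouse/table_name/")

query = stream_df.repartition("filename") \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime="10 minutes") \
    .foreachBatch(perBatch) \
    .start()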

Related

Adding / removing columns from schema with Kafka source without restarting the session using PySpark Structured Streaming

I'm using PySpark 3.2.0. I'm pretty new to Structured Streaming and couldn't find an answer to this question.
I want to read JSON data from a Kafka topic using a predefined schema, like the following (code related to initialization/connections is omitted):
# The skeleton schema is defined in 'schema.py'
skeleton_schema = get_skeleton_schema()
df = df.selectExpr("CAST(value AS STRING)") \
    .select(from_json("value", skeleton_schema).alias("data")) \
    .select(col("data.*"))
...
query = df.writeStream \
    .format("console") \
    .outputMode("append") \
    .trigger(processingTime='5 minutes') \
    .start()
query.awaitTermination()
I want to be able to modify the skeleton_schema (e.g. add/remove columns) in the 'schema.py' file and have those changes reflected in future triggers. Is there a way to achieve this? If not, is there a different mechanism to update the schema without restarting the session?
Unless the get_skeleton_schema() function is itself run per batch and not cached (for example, it calls an external REST API or database, or parses some file), which it is not in the code shown, then no, it's not possible to change the schema at runtime.
Keep in mind that there's no guarantee all records in the same batch will have the same schema...
You'd need to consume the columns as bytes and then use a ForEachWriter implementation to handle the parsing yourself, but I'm not familiar enough with PySpark to give a full example; a rough sketch of the idea follows below.
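A rough sketch of the per-batch idea in PySpark (untested). It assumes get_skeleton_schema() genuinely re-resolves the schema from a file, REST API, or database on every call (which it does not in the code shown), and that raw_kafka_df is the raw, unparsed Kafka stream:
from pyspark.sql.functions import col, from_json

def parse_batch(batch_df, batch_id):
    # Re-resolve the schema on every micro-batch so external changes are
    # picked up without restarting the streaming query
    current_schema = get_skeleton_schema()
    parsed = batch_df.selectExpr("CAST(value AS STRING) AS value") \
        .select(from_json(col("value"), current_schema).alias("data")) \
        .select(col("data.*"))
    parsed.show(truncate=False)  # stand-in for the real sink

query = raw_kafka_df.writeStream \
    .foreachBatch(parse_batch) \
    .trigger(processingTime='5 minutes') \
    .start()
query.awaitTermination()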
Depending on where you're actually going to be writing the data (not the console; e.g. Mongo or Snowflake instead), you could look at using Kafka Connect with Avro or Protobuf serialization rather than JSON. Then your producers would decide when to introduce/remove columns in a backwards-compatible manner, enforced by a Schema Registry, and your consumers wouldn't have to change or define any schema themselves.

Using redshift as a JDBC source for readStream in the Structured Streaming API (pyspark)

I'm looking for a package, or a previous implementation, of using Redshift as the source for a Structured Streaming dataframe.
spark.readStream \
.format("io.github.spark_redshift_community.spark.redshift") \
.option('url', redshift_url) \
.option('forward_spark_s3_credentials', 'true') \
.load()
Using the format above, you get errors on the read, such as:
Data source io.github.spark_redshift_community.spark.redshift does not support streamed reading
You get the same error if you downgrade from Spark 3 and use com.databricks.spark.redshift.
Is there a known workaround, or a methodology/pattern I can use, to implement Redshift as a readStream data source in PySpark?
As the error says, this library does not support streaming reads/writes to/from Redshift.
The same can be confirmed from the project source (link): the format does not extend or implement micro-batch/continuous stream readers and writers.
There is no easy, true-streaming way to do this. You may explore the following avenues:
Explore third-party libs; search for "JDBC streaming Spark". Disclaimer: I have not used these and thus do not endorse them.
Create a micro-batching strategy on top of a custom checkpointing mechanism; a sketch of this idea follows below.
Extended note: AFAIK, the Spark JDBC interfaces do not support Structured Streaming.
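A rough sketch of that micro-batching idea in PySpark (untested). It assumes the table has a monotonically increasing column, here hypothetically named updated_at, that a simple local JSON file stands in for the checkpoint, and that process_batch is your own sink logic; tempdir and the other connector options are omitted as in the question.
import json
import os
import time

CHECKPOINT_FILE = "/tmp/redshift_watermark.json"  # hypothetical checkpoint location

def load_watermark():
    # Resume from the last committed high-water mark, or start from epoch
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["watermark"]
    return "1970-01-01 00:00:00"

def save_watermark(watermark):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"watermark": str(watermark)}, f)

while True:
    wm = load_watermark()
    # Batch-read only the rows added since the last successful poll
    batch_df = spark.read \
        .format("io.github.spark_redshift_community.spark.redshift") \
        .option("url", redshift_url) \
        .option("forward_spark_s3_credentials", "true") \
        .option("query", f"SELECT * FROM my_table WHERE updated_at > '{wm}'") \
        .load()

    if batch_df.take(1):             # only act when new rows exist
        process_batch(batch_df)      # hypothetical sink logic
        new_wm = batch_df.agg({"updated_at": "max"}).collect()[0][0]
        save_watermark(new_wm)       # commit the watermark only after the sink succeeds

    time.sleep(600)                  # poll every 10 minutes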

Databricks structured streaming with Snowflake as source?

Is it possible to use a Snowflake table as a source for Spark Structured Streaming in Databricks? When I run the following PySpark code:
options = dict(sfUrl=our_snowflake_url,
               sfUser=user,
               sfPassword=password,
               sfDatabase=database,
               sfSchema=schema,
               sfWarehouse=warehouse)
df = spark.readStream.format("snowflake") \
    .schema(final_struct) \
    .options(**options) \
    .option("dbtable", "BASIC_DATA_TEST") \
    .load()
I get this error:
java.lang.UnsupportedOperationException: Data source snowflake does not support streamed reading
I haven't been able to find anything in the Spark Structured Streaming Docs that explicitly says Snowflake is supported as a source, but I'd like to make sure I'm not missing anything obvious.
Thanks!
The Spark Snowflake connector currently does not support the .writeStream/.readStream calls from Spark Structured Streaming.

How to write a Streaming Structured Stream into Hive directly?

I want to achieve something like this:
df.writeStream
  .format("parquet")
  .option("path", "/user/hive/warehouse/abc/")
  .option("checkpointLocation", "/checkpoint_path")
  .outputMode("append")
  .saveAsTable("dbname.tablename")
I am open to suggestions. I know Kafka Connect could be one option, but how can I achieve this using Spark? A possible workaround may be what I am looking for.
Thanks in advance!
Spark Structured Streaming does not support writing the result of a streaming query to a Hive table directly; you must write to paths.
For 2.4 they say to try foreachBatch, but I have not tried it; a sketch of that approach follows below.
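A minimal PySpark sketch of the foreachBatch approach (untested), assuming Spark 2.4+ with Hive support enabled; df is the streaming DataFrame, and the table name and checkpoint path are taken from the question:
def write_to_hive(batch_df, batch_id):
    # Inside foreachBatch each micro-batch is a normal static DataFrame,
    # so the full batch writer API (including saveAsTable) is available
    batch_df.write \
        .mode("append") \
        .format("parquet") \
        .saveAsTable("dbname.tablename")

query = df.writeStream \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoint_path") \
    .foreachBatch(write_to_hive) \
    .start()

query.awaitTermination()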

How to read specific columns from Cassandra table using Datastax spark-cassandra-connector?

I am using spark-cassandra-connector_2.11 (version 2.0.5) to load data from Cassandra into a Spark cluster. I am using the read API to load the data as follows:
SparkUtil.initSpark()
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table"-><table_name>, "keyspace"-><keyspace>))
.load()
It's working fine; however, in one of the use cases I want to read only a specific column from Cassandra. How can I use the read API to do that?
SparkUtil.initSpark()
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table"-><table_name>, "keyspace"-><keyspace>))
.load()
.select("column_name")
Use select; you can also use case classes.
Another way is to use the following approach, without using the options API:
SparkUtil.initSpark()
.sparkContext
.cassandraTable(<keyspace>, <table_name>)
.select(<column_name>)
A one-line solution for fetching a few columns from a Cassandra table:
val rdd=sc.cassandraTable("keyspace","table_name")
.select("service_date","mobile").persist(StorageLevel.MEMORY_AND_DISK)
