Databricks structured streaming with Snowflake as source? - apache-spark

Is it possible to use a Snowflake table as a source for spark structured streaming in Databricks? When I run the following pyspark code:
options = dict(sfUrl=our_snowflake_url,
               sfUser=user,
               sfPassword=password,
               sfDatabase=database,
               sfSchema=schema,
               sfWarehouse=warehouse)

df = spark.readStream.format("snowflake") \
    .schema(final_struct) \
    .options(**options) \
    .option("dbtable", "BASIC_DATA_TEST") \
    .load()
I get this error:
java.lang.UnsupportedOperationException: Data source snowflake does not support streamed reading
I haven't been able to find anything in the Spark Structured Streaming Docs that explicitly says Snowflake is supported as a source, but I'd like to make sure I'm not missing anything obvious.
Thanks!

The Spark Snowflake connector does not currently support the readStream/writeStream calls from Spark Structured Streaming.
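That said, a plain batch read with the same options does work with the connector; here is a minimal sketch, reusing the placeholder connection variables from the question:

# Sketch only: Snowflake can be read in batch mode, just not via readStream.
options = dict(sfUrl=our_snowflake_url,
               sfUser=user,
               sfPassword=password,
               sfDatabase=database,
               sfSchema=schema,
               sfWarehouse=warehouse)

df = (spark.read                      # batch read, not readStream
      .format("snowflake")
      .options(**options)
      .option("dbtable", "BASIC_DATA_TEST")
      .load())

If near-real-time reads are needed, the usual fallback is to poll with scheduled batch reads, or to stream from whatever source feeds the Snowflake table.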

Related

Can't read via Apache Spark Structured Streaming from Hive Table

When I try to read from a Hive table with the following code, I get the error buildReader is not supported for HiveFileFormat from the Spark driver pod.
spark.readStream \
    .table("table_name") \
    .repartition("filename") \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime="10 minutes") \
    .foreachBatch(perBatch) \
    .start()
I have tried every possible combination, including the simplest possible queries. Reading directly from the specified folder via the parquet method works, as does Spark SQL without streaming, but reading with Structured Streaming via readStream does not.
The documentation says the following:
Since Spark 3.1, you can also use DataStreamReader.table() to read tables as streaming DataFrames and use DataStreamWriter.toTable() to write streaming DataFrames as tables:
I'm using the latest Spark 3.2.1. Although reading from a table is not shown in the examples, the paragraph above clearly suggests it should be possible (a minimal sketch of that pattern is included after this question for reference).
Any assistance to help get this working would be really great and simplify my project a lot.
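For reference, a minimal sketch of the pattern the documentation describes (Spark 3.1+), assuming an active SparkSession named spark; the table names and checkpoint path are placeholders. Note that the source table has to be backed by a streaming-capable data source such as Delta, which a Hive SerDe table is not, and that is consistent with the buildReader error above.

# Hypothetical sketch of DataStreamReader.table() / DataStreamWriter.toTable()
stream_df = spark.readStream.table("source_table")   # must be a streaming-capable table, e.g. Delta

query = (stream_df.writeStream
         .option("checkpointLocation", "/tmp/checkpoints/source_to_target")
         .trigger(processingTime="10 minutes")
         .toTable("target_table"))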

Using redshift as a JDBC source for readStream in the Structured Streaming API (pyspark)

I'm looking for a package, or a previous implementation, that uses Redshift as the source for a Structured Streaming DataFrame.
spark.readStream \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option('url', redshift_url) \
    .option('forward_spark_s3_credentials', 'true') \
    .load()
Using the format above, you get errors on the read, such as:
Data source io.github.spark_redshift_community.spark.redshift does not support streamed reading
The same error occurs if you downgrade from Spark 3 and use com.databricks.spark.redshift.
Is there a known workaround, or a methodology/pattern I can use, to implement Redshift as a readStream data source in pyspark?
As the error says, this library does not support streaming reads/writes to/from Redshift.
The same can be confirmed from the project source at the link: the format does not extend or implement micro-batch or continuous stream readers and writers.
There is no easy way to get true streaming here. You may explore the following avenues:
Explore third-party libraries; search for "JDBC streaming Spark". Disclaimer: I have not used these and thus do not endorse them.
Create a micro-batching strategy on top of a custom check-pointing mechanism (a rough sketch follows below).
Extended note: AFAIK, Spark's JDBC interfaces do not support Structured Streaming.
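A rough sketch of that micro-batching idea in pyspark. Everything here is hypothetical: the source table my_table, the watermark column updated_at, the checkpoint file, the S3 tempdir, the polling interval, and the process() callback are placeholders, not a library API; spark and redshift_url are assumed to exist as in the question.

# Poll Redshift with ordinary batch reads and track a high-water mark ourselves.
import json
import time

CHECKPOINT_FILE = "/tmp/redshift_watermark.json"   # placeholder checkpoint location

def load_watermark():
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_seen"]
    except FileNotFoundError:
        return "1970-01-01 00:00:00"

def save_watermark(value):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_seen": value}, f)

while True:
    last_seen = load_watermark()
    batch = (spark.read                                # batch read, which the library does support
             .format("io.github.spark_redshift_community.spark.redshift")
             .option("url", redshift_url)
             .option("forward_spark_s3_credentials", "true")
             .option("tempdir", "s3a://my-bucket/redshift-tmp/")   # placeholder staging path
             .option("query", f"SELECT * FROM my_table WHERE updated_at > '{last_seen}'")
             .load())

    if batch.take(1):                                  # only act when new rows arrived
        process(batch)                                 # placeholder for your sink logic
        new_max = batch.agg({"updated_at": "max"}).collect()[0][0]
        save_watermark(str(new_max))

    time.sleep(600)                                    # poll every 10 minutes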

How to connect to redshift data using Spark on Amazon EMR cluster

I have an Amazon EMR cluster running. If I do
ls -l /usr/share/aws/redshift/jdbc/
it gives me
RedshiftJDBC41-1.2.7.1003.jar
RedshiftJDBC42-1.2.7.1003.jar
Now, I want to use this jar to connect to my Redshift database in my spark-shell. Here is what I do:
import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)

val df : DataFrame = sqlContext.read
  .option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable","tablename")
  .load()
and I get this error:
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I am not sure if I am specifying the correct format while reading the data. I have also read that the spark-redshift driver is available, but I do not want to run spark-submit with extra JARs.
How do I connect to Redshift data from spark-shell? Is that the correct JAR to configure the connection in Spark?
The error is generated because you are missing .format("jdbc") in your read. It should be:
val df : DataFrame = sqlContext.read
  .format("jdbc")
  .option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable","tablename")
  .load()
By default, Spark assumes sources to be Parquet files, hence the mention of Parquet in the error.
You may still run into issues with the classpath or finding the drivers, but this change should give you more useful error output. I assume the folder location you listed is on the classpath for Spark on EMR, and those driver versions look fairly current, so they should work.
Note that this will only work for reading from Redshift. If you need to write to Redshift, your best bet is the Databricks Redshift data source for Spark: https://github.com/databricks/spark-redshift.
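For completeness, a minimal sketch of a write with that data source, shown in pyspark for brevity (the options are the same from Scala). The target table and the S3 tempdir the library uses as a staging area are placeholders, and df is assumed to be an existing DataFrame.

# Hypothetical example: write a DataFrame to Redshift with spark-redshift,
# which unloads/loads through an S3 staging directory.
(df.write
   .format("com.databricks.spark.redshift")
   .option("url", "jdbc:redshift://host:PORT/DB-name?user=user&password=password")
   .option("dbtable", "target_table")                    # placeholder target table
   .option("tempdir", "s3a://my-bucket/redshift-tmp/")   # placeholder S3 staging path
   .option("forward_spark_s3_credentials", "true")
   .mode("append")
   .save())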

Spark Structured Streaming Kinesis Data source

Is it possible to use Kinesis streams as a data source for Spark structured streaming? I can't find any connector available.
Qubole has a kinesis-sql library for exactly this:
https://github.com/qubole/kinesis-sql
Then you can use the source similar to any other Spark Structured Streaming source:
val source = spark
  .readStream
  .format("kinesis")
  .option("streamName", "spark-source-stream")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("awsAccessKeyId", [YOUR_AWS_ACCESS_KEY_ID])
  .option("awsSecretKey", [YOUR_AWS_SECRET_KEY])
  .option("startingPosition", "TRIM_HORIZON")
  .load

How to write a Streaming Structured Stream into Hive directly?

I want to achieve something like this:
df.writeStream
  .saveAsTable("dbname.tablename")
  .format("parquet")
  .option("path", "/user/hive/warehouse/abc/")
  .option("checkpointLocation", "/checkpoint_path")
  .outputMode("append")
  .start()
I am open to suggestions. I know Kafka Connect could be one option, but how do I achieve this using Spark? A possible workaround may be what I am looking for.
Thanks in advance!
Spark Structured Streaming does not support writing the result of a streaming query directly to a Hive table; you must write to file paths.
For Spark 2.4 they suggest trying foreachBatch, but I have not tried it.
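For what it's worth, a minimal sketch of that foreachBatch approach in pyspark, untested here; df is assumed to be a streaming DataFrame and the session is assumed to have Hive support enabled. The table name and checkpoint path are taken from the question, and the trigger interval is a placeholder.

# Each micro-batch arrives as an ordinary batch DataFrame, so the regular
# batch writer (including saveAsTable) works inside the function.
def write_to_hive(batch_df, batch_id):
    (batch_df.write
             .mode("append")
             .format("parquet")
             .saveAsTable("dbname.tablename"))

(df.writeStream
   .foreachBatch(write_to_hive)
   .option("checkpointLocation", "/checkpoint_path")
   .trigger(processingTime="1 minute")                 # placeholder interval
   .start())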
