Spark Structured Streaming Kinesis Data source - apache-spark

Is it possible to use Kinesis streams as a data source for Spark structured streaming? I can't find any connector available.

Qubole has a kinesis-sql library for exactly this:
https://github.com/qubole/kinesis-sql
Then you can use the source similar to any other Spark Structured Streaming source:
val source = spark
  .readStream
  .format("kinesis")
  .option("streamName", "spark-source-stream")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("awsAccessKeyId", [YOUR_AWS_ACCESS_KEY_ID])
  .option("awsSecretKey", [YOUR_AWS_SECRET_KEY])
  .option("startingPosition", "TRIM_HORIZON")
  .load
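You can then process it like any other streaming DataFrame. A minimal usage sketch, assuming the connector's default schema in which the record payload arrives in a binary data column:

val query = source
  .selectExpr("CAST(data AS STRING) AS payload") // payload column name is per the connector's schema
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()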

Related

Spark streaming source

Is it possible to use hive tables as source/sink for spark structured streaming?
Nowhere in the documentation could I see hive tables listed as a source or sink.

Using Redshift as a JDBC source for readStream in the Structured Streaming API (pyspark)

I'm looking for a package, or a previous implementation, that uses Redshift as the source for a Structured Streaming dataframe.
spark.readStream \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", redshift_url) \
    .option("forward_spark_s3_credentials", "true") \
    .load()
Using the format above, you get errors on the read, such as:
Data source io.github.spark_redshift_community.spark.redshift does not support streamed reading
You get the same error if you downgrade from Spark 3 and use com.databricks.spark.redshift.
Is there a known workaround, or a methodology/pattern I can use to implement Redshift as a readStream data source in pyspark?
As the error says, this library does not support streaming reads/writes to/from Redshift.
The same can be confirmed from the project source: the format does not extend or implement the micro-batch or continuous stream reader and writer interfaces.
There is no easy way to get true streaming here, but you may explore the following avenues:
Explore third-party libraries; search for "JDBC streaming spark". Disclaimer: I have not used these and thus do not endorse them.
Create a micro-batching strategy on top of a custom checkpointing mechanism, as sketched below.
Extended note: AFAIK, Spark's JDBC interfaces do not support Structured Streaming.
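A rough Scala sketch of the micro-batching avenue. Everything named here is an assumption for illustration: the events table with a monotonically increasing id column, and the readCheckpoint/writeCheckpoint/processBatch helpers you would implement over your own checkpoint store:

import org.apache.spark.sql.functions.max

// Load the last processed high-water mark from a checkpoint store you manage
// (a file, a control table, ...). Hypothetical helper.
val lastId: Long = readCheckpoint()

// Pull only the new rows through a plain JDBC batch read with a pushdown subquery.
val batch = spark.read
  .format("jdbc")
  .option("url", redshiftJdbcUrl) // e.g. jdbc:redshift://host:5439/db
  .option("dbtable", s"(SELECT * FROM events WHERE id > $lastId) AS src")
  .option("user", user)
  .option("password", password)
  .load()

if (!batch.isEmpty) {
  processBatch(batch)                                 // hypothetical downstream logic
  val maxId = batch.agg(max("id")).first().getLong(0) // new high-water mark
  writeCheckpoint(maxId)                              // persist it for the next run
}
// Schedule this job on a fixed interval to approximate a stream.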

Reading from Kafka topic using Spark Structured Streaming: Can multi-line JSON published to Kafka topic be parsed by Spark?

Is it possible to parse/read multi-line JSON published to a Kafka topic with Spark Structured Streaming?
If you are using Spark 2.2 or later, the following works for reading multi-line JSON files:
spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

Databricks structured streaming with Snowflake as source?

Is it possible to use a Snowflake table as a source for Spark Structured Streaming in Databricks? When I run the following pyspark code:
options = dict(sfUrl=our_snowflake_url,
               sfUser=user,
               sfPassword=password,
               sfDatabase=database,
               sfSchema=schema,
               sfWarehouse=warehouse)

df = spark.readStream.format("snowflake") \
    .schema(final_struct) \
    .options(**options) \
    .option("dbtable", "BASIC_DATA_TEST") \
    .load()
I get this error:
java.lang.UnsupportedOperationException: Data source snowflake does not support streamed reading
I haven't been able to find anything in the Spark Structured Streaming Docs that explicitly says Snowflake is supported as a source, but I'd like to make sure I'm not missing anything obvious.
Thanks!
The Spark Snowflake connector currently does not support the .readStream/.writeStream calls from Spark Structured Streaming.
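Plain batch reads do work with the connector, so one workaround is to re-run a batch read on a schedule instead of calling readStream. A minimal sketch (in Scala, assuming the same connection options as in the question, held in a Map):

// spark.read instead of spark.readStream; no explicit schema is needed,
// the connector derives it from the Snowflake table.
val df = spark.read
  .format("snowflake")
  .options(options) // Map with sfUrl, sfUser, sfPassword, sfDatabase, sfSchema, sfWarehouse
  .option("dbtable", "BASIC_DATA_TEST")
  .load()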

Not able to convert a Spark Dataset&lt;Row&gt; to an H2OFrame with asH2OFrame if the dataset is a streaming dataset

I already have a Deep Learning model. I am trying to run scoring on streaming data. For this I am reading data from Kafka using the Spark Structured Streaming API. When I try to convert the received dataset to an H2OFrame I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
Code sample:
Dataset&lt;Row&gt; testData = sparkSession.readStream()
        .schema(testSchema)
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9042")
        .option("subscribe", "topicName")
        .load();
H2OFrame h2oTestFrame = h2oContext.asH2OFrame(testData.toDF(), "test_frame");
Is there any example that explains sparkling water using spark structured streaming with streaming source?
There isn't. General-purpose transformations, including conversions to RDDs and external formats, are not supported in Structured Streaming.
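One workaround, not specific to Sparkling Water, is foreachBatch (Spark 2.4+): it hands you a plain, non-streaming DataFrame for each micro-batch, and that static frame can be converted. A Scala sketch mirroring the question's asH2OFrame call; the conversion call itself is taken from the question and not verified against any particular Sparkling Water version:

import org.apache.spark.sql.DataFrame

// Declaring the handler as a typed val sidesteps the Scala 2.12 overload
// ambiguity between the Scala and Java foreachBatch variants.
val scoreBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
  // batchDf is a static DataFrame here, so the conversion is legal.
  val h2oTestFrame = h2oContext.asH2OFrame(batchDf, s"test_frame_$batchId")
  // score the micro-batch with the Deep Learning model here
}

val query = testData.toDF()
  .writeStream
  .foreachBatch(scoreBatch)
  .start()

query.awaitTermination()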
