Is it possible to use Kinesis streams as a data source for Spark structured streaming? I can't find any connector available.
Qubole have a kinesis-sql library for exactly this.
https://github.com/qubole/kinesis-sql
You can then use it like any other Spark Structured Streaming source:
val source = spark
  .readStream
  .format("kinesis")
  .option("streamName", "spark-source-stream")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("awsAccessKeyId", "YOUR_AWS_ACCESS_KEY_ID") // placeholder: substitute your own key
  .option("awsSecretKey", "YOUR_AWS_SECRET_KEY") // placeholder: substitute your own secret
  .option("startingPosition", "TRIM_HORIZON")
  .load()
Related
Is it possible to use Hive tables as a source/sink for Spark Structured Streaming?
I could not find Hive tables listed as a source or sink anywhere in the documentation.
I'm looking for a package, or an existing implementation, that uses Redshift as the source for a Structured Streaming DataFrame.
spark.readStream \
.format("io.github.spark_redshift_community.spark.redshift") \
.option('url', redshift_url) \
.option('forward_spark_s3_credentials', 'true') \
.load()
With the format above, the read fails with errors such as:
Data source io.github.spark_redshift_community.spark.redshift does not support streamed reading
This is the same error if you downgrade from Spark 3 and use: com.databricks.spark.redshift
Is there a known workaround, or a methodology/pattern I can use, to implement Redshift as a readStream data source in PySpark?
As the error says, this library does not support streaming reads from or writes to Redshift.
The same can be confirmed from the project source at link. The format does not extend or implement the micro-batch or continuous stream readers and writers.
There is no easy way to get true streaming here. You could explore the following avenues:
Explore third-party libraries. Search for "JDBC streaming spark". Disclaimer: I have not used these and so cannot endorse them.
Build a micro-batching strategy on top of a custom checkpointing mechanism.
Extended note: AFAIK, Spark's JDBC interfaces do not support Structured Streaming.
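To make the micro-batching idea more concrete, here is a minimal sketch of polling a table with a high-water-mark checkpoint. It is only an illustration under assumptions: sqlite3 stands in for a JDBC connection to Redshift, and the events table, its monotonically increasing id column, the checkpoint file layout, and the function names are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> int:
    """Return the last processed id, or 0 if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())["last_id"]
    return 0


def save_checkpoint(path: Path, last_id: int) -> None:
    """Persist the high-water mark: write a temp file, then rename over the old one."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_id": last_id}))
    tmp.replace(path)


def poll_once(conn, checkpoint: Path, batch_size: int = 100):
    """Fetch rows newer than the checkpoint, then advance the checkpoint."""
    last_id = load_checkpoint(checkpoint)
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    if rows:
        # Process the batch here (for example, hand it to a batch Spark job)
        # before advancing the checkpoint, so a crash replays rows instead of skipping them.
        save_checkpoint(checkpoint, rows[-1][0])
    return rows


def demo():
    # sqlite3 is a stand-in for the real source; the table is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)])
    checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
    first = poll_once(conn, checkpoint, batch_size=2)
    second = poll_once(conn, checkpoint, batch_size=2)
    third = poll_once(conn, checkpoint, batch_size=2)
    return first, second, third
```

Calling poll_once on a schedule gives at-least-once delivery in this sketch, because the checkpoint only moves after a batch has been fetched and handled. It relies on the source having a column that only ever increases, which may not hold for every Redshift table.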
Is it possible to parse/read multi-line JSON published to a Kafka topic with Spark Structured Streaming?
If you are using a Spark version greater than 2.2, the following should work:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
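Note that the snippet above is a batch read from a file. For Kafka, as far as I know, each record value is delivered whole, so line breaks inside a JSON document do not split it the way they would in a line-delimited file. A minimal plain-Python sketch of that idea follows; the payload and the parse_record name are hypothetical, and no Spark or Kafka is involved.

```python
import json

# A Kafka record value arrives as a single byte string, newlines included.
# This payload is a made-up example of a pretty-printed (multi-line) document.
record_value = b'{\n  "user": "alice",\n  "tags": [\n    "a",\n    "b"\n  ]\n}'


def parse_record(value: bytes) -> dict:
    """Decode one record value and parse it as one JSON document."""
    return json.loads(value.decode("utf-8"))


parsed = parse_record(record_value)
```

In Structured Streaming the equivalent step would presumably be casting the Kafka value column to a string and applying from_json with an explicit schema. I have not verified that against a live topic.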
Is it possible to use a Snowflake table as a source for spark structured streaming in Databricks? When I run the following pyspark code:
options = dict(sfUrl=our_snowflake_url,
sfUser=user,
sfPassword=password,
sfDatabase=database,
sfSchema=schema,
sfWarehouse=warehouse)
df = spark.readStream.format("snowflake") \
.schema(final_struct) \
.options(**options) \
.option("dbtable", "BASIC_DATA_TEST") \
.load()
I get this error:
java.lang.UnsupportedOperationException: Data source snowflake does not support streamed reading
I haven't been able to find anything in the Spark Structured Streaming Docs that explicitly says Snowflake is supported as a source, but I'd like to make sure I'm not missing anything obvious.
Thanks!
The Spark Snowflake connector currently does not support the .writeStream/.readStream calls from Spark Structured Streaming.
I already have a deep learning model and am trying to run scoring on streaming data. For this I am reading data from Kafka using the Spark Structured Streaming API. When I try to convert the received dataset to an H2OFrame, I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
Code Sample
Dataset<Row> testData = sparkSession.readStream()
    .schema(testSchema)
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9042")
    .option("subscribe", "topicName")
    .load();
H2OFrame h2oTestFrame = h2oContext.asH2OFrame(testData.toDF(), "test_frame");
Is there any example that shows Sparkling Water used with Spark Structured Streaming and a streaming source?
There isn't. General-purpose transformations, including conversion to RDDs and external formats, are not supported in Structured Streaming.