Converting Kinesis Streams to a PySpark DataFrame and reading the DataFrame using .show() - python-3.x

I have a Kinesis stream connected to a DynamoDB stream. Whenever any sort of operation happens in DynamoDB, it is reflected in my Kinesis stream. The stream is provisioned with only one shard, and I am able to see the JSON records arrive in it.
I'm trying to write a script in a Glue notebook so that when I run the notebook cell, I can access all the JSON from the DynamoDB stream up to that point in time, convert it into a DataFrame, and view it as a PySpark DataFrame. Is this even possible?
I'm trying to do something like this:
kinesisDF = spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", "sample-kinesis") \
    .option("streamARN", "arn:aws:kinesis:us-east-1:589840918737:stream/sample-kinesis") \
    .option("region", "us-east-1") \
    .option("initialPosition", "TRIM_HORIZON") \
    .option("format", "json") \
    .option("inferSchema", "true") \
    .load()
kinesisDF.show()
Ended up with this error:
AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkinesis'
Is there any other way to do this and call .show() on the DataFrame so as to display all the JSON that arrived up until I ran the cell in the Glue notebook?
I also created a Glue Data Catalog table with my Kinesis stream as the source and tried using:
data_frame = glueContext.create_data_frame.from_catalog(database = "my_db", table_name = "sample-kinesis-tbl", transformation_ctx = "DataSource0", additional_options = {"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})
I ended up with the same error, even when this was run as part of a Glue job:
AnalysisException: Queries with streaming sources must be executed with writeStream.start();
Is there any other way to read the Kinesis stream (up until that moment in time), convert it into a DataFrame, and view it?
Kindly help.
EDIT 1: I also tried adding a Kinesis Data Firehose delivery stream that reads from my Kinesis stream (which in turn reads the DynamoDB stream), setting S3 as the delivery stream's destination, and then reading from S3. I was able to read the data as a DataFrame, but I don't know how efficient this method actually is. Any suggestions?
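For reference, a minimal sketch of that Firehose-to-S3 workaround, assuming the delivery stream writes newline-delimited JSON to S3; the bucket and prefix below are placeholders, not values from the original post:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-firehose-output").getOrCreate()

# Once Firehose has delivered the records to S3, they can be read as a
# plain (non-streaming) DataFrame, so .show() works without writeStream.
df = (spark.read
      .option("recursiveFileLookup", "true")   # walk the date-partitioned prefixes
      .json("s3://my-firehose-bucket/sample-kinesis/"))

df.printSchema()
df.show(truncate=False)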

Related

Streaming from Delta Live Tables in Databricks to a Kafka instance

I have the following live table
And I'm looking to write that into a stream to be written back into my Kafka source.
I've seen in the Apache Spark docs that I can use writeStream (I've already used readStream to get the data out of my Kafka stream). But how do I transform the table into the form writeStream needs so I can use this?
I'm fairly new to both Kafka and the data world, so any further explanations are welcome here.
writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "updates")
    .start()
Thanks in Advance,
Ben
As of right now, Delta Live Tables can only write data as a Delta table - it's not possible to write in other formats. You can implement a workaround by creating a Databricks workflow that consists of two tasks (with or without dependencies between them, depending on whether the pipeline is triggered or continuous):
A DLT pipeline that does the actual data processing
A task (easiest to do with a notebook) that reads the table generated by DLT as a stream and writes its content into Kafka, with something like this:
df = spark.readStream.format("delta").table("database.table_name")
(df.writeStream.format("kafka")
    .option("kafka....", "")        # Kafka options (bootstrap servers, topic) and a checkpointLocation go here
    .trigger(availableNow=True)     # if it's not continuous
    .start()
)
P.S. If you have a solution architect or customer success engineer attached to your Databricks account, you can communicate this requirement to them for product prioritization.
The transformation is done after the read stream process is started
read_df = spark.readStream.format('kafka') ...  # other options
processed_df = read_df.withColumn('some column', some_calculation)
processed_df.writeStream.format('parquet') ...  # other options
    .start()
The Spark documentation is helpful and detailed, but some articles are not written for beginners. You can look on YouTube or read introductory articles to help you get started.
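To make the read-transform-write pattern above concrete, here is a hedged, self-contained PySpark sketch; the broker address, topic name, and output paths are placeholders rather than values from the original post:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-transform-sink").getOrCreate()

# Read from Kafka as a streaming DataFrame (placeholder broker and topic)
read_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("subscribe", "source-topic")
    .load())

# Transformations are declared on the streaming DataFrame before the query starts
processed_df = (read_df
    .selectExpr("CAST(value AS STRING) AS value")
    .withColumn("processed_at", F.current_timestamp()))

# Nothing runs until start() is called on the writeStream
query = (processed_df.writeStream
    .format("parquet")
    .option("path", "/tmp/output")                     # placeholder sink path
    .option("checkpointLocation", "/tmp/checkpoints")  # required for the file sink
    .start())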

Can't read via Apache Spark Structured Streaming from Hive Table

When I try to read from a Hive table with the following code, I get the error buildReader is not supported for HiveFileFormat from the Spark driver pod.
spark.readStream \
    .table("table_name") \
    .repartition("filename") \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime="10 minutes") \
    .foreachBatch(perBatch)
I have tried every possible combination, including the simplest queries possible. Reading via the parquet method directly from a specified folder works, as does Spark SQL without streaming, but reading with Structured Streaming via readStream does not.
The documentation says the following...
Since Spark 3.1, you can also use DataStreamReader.table() to read tables as streaming DataFrames and use DataStreamWriter.toTable() to write streaming DataFrames as tables:
I'm using the latest Spark 3.2.1. Although reading from a table is not shown in the examples, the paragraph above clearly suggests it should be possible.
Any assistance to help get this working would be really great and simplify my project a lot.
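For comparison, a minimal sketch of the documented DataStreamReader.table() / DataStreamWriter.toTable() pattern; table names and the checkpoint path are placeholders, and based on the error above the source appears to need to be a data source table (e.g. Delta or parquet) rather than a Hive SerDe table:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-table-to-table").getOrCreate()

# Requires Spark 3.1+; reads the table as a streaming DataFrame
stream_df = spark.readStream.table("source_table")

query = (stream_df.writeStream
    .outputMode("append")
    .trigger(processingTime="10 minutes")
    .option("checkpointLocation", "/tmp/checkpoints/source_table")  # placeholder
    .toTable("sink_table"))  # starts the streaming query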

cleanSource option does not delete any files

I have a Structured Streaming job with Trigger.Once() enabled, which I run every 20 minutes. After each run, I want to remove my processed parquet files from S3, so I enabled the cleanSource delete option, but it does not work and I don't know why!
Before showing my code, I have to comment on it: I'm running multiple structured streaming queries in parallel; I have 5 buckets and I submit these in parallel. The job works perfectly, but it does not delete any processed files.
val tables = Seq("table1", "table2", "table3", "table4", "table5")

tables.par.map(table => {
  new ReplicationTables().run(table)
})

class ReplicationTables {
  def run(table: String): Unit = {
    val dataFrame = spark.readStream
      .option("mergeSchema", "true")
      .schema(dfSchema)
      .option("cleanSource", "delete")
      .parquet(s"s3a://my-bucket/${table}/*")

    // I do some transformations and afterwards write my new dataframe, called df, to S3 in Delta format
    df.writeStream
      .format("delta")
      .outputMode("append")
      .queryName(s"Delta/${table}")
      .trigger(Trigger.Once())
      .option("checkpointLocation", s"s3a://my-bucket/checkpoints/${table}")
      .start(s"s3a://my-bucket/Delta_Tables/${table}/")
      .awaitTermination()
  }
}
PS: Even with the INFO log level I do not get any logs about cleanSource.
PS 2: See the Structured Streaming docs on input sources, which cover cleanSource: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
Try using option("spark.sql.streaming.fileSource.cleaner.numThreads", "10") to speed up the cleanup. If files are being generated faster than they can be cleaned up, Spark may not delete them; increasing the number of cleaner threads may help.
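A hedged PySpark sketch of what that might look like; the paths and schema are placeholders, and applying the thread-count setting as a session configuration is an assumption on my part rather than something from the original answer:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("cleansource-example").getOrCreate()

# More cleaner threads, as suggested above (shown here as a session conf)
spark.conf.set("spark.sql.streaming.fileSource.cleaner.numThreads", "10")

placeholder_schema = StructType([StructField("id", StringType())])  # placeholder schema

stream_df = (spark.readStream
    .schema(placeholder_schema)
    .option("cleanSource", "delete")   # or "archive" together with "sourceArchiveDir"
    .parquet("s3a://my-bucket/table1/"))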

Azure Event Hubs to Databricks, what happens to the dataframes in use

I've been developing a proof of concept on Azure Event Hubs streaming JSON data to an Azure Databricks notebook, using PySpark. From the examples I've seen, I've created my rough code as follows, taking the data from the Event Hub to the Delta table I'll be using as a destination:
from pyspark.sql.functions import col, to_date

connectionString = "My End Point"
ehConf = {'eventhubs.connectionString': connectionString}

df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

readEventStream = df.withColumn("body", df["body"].cast("string")) \
    .withColumn("date_only", to_date(col("enqueuedTime")))

readEventStream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .table("testSink")
After reading around and googling, what happens to the df and readEventStream DataFrames? Will they just get bigger as they retain the data, or will they be emptied during the normal process? Or are they just a temporary store before the data is dumped to the Delta table? Is there a way of setting X amount of items streamed before writing out to the Delta table?
Thanks
I carefully reviewed the description of the APIs you used in the code against the PySpark official documentation for the pyspark.sql module. I think the growing memory usage was caused by the function table(tableName), which is for a DataFrame, not for a streaming DataFrame.
So the table function creates the data structure to fill the streaming data in memory.
I recommend you use start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then get a table from the Delta lake again. There also does not seem to be a way in PySpark to set X amount of items streamed before writing out to the Delta table.
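A hedged sketch of that suggestion, reusing the readEventStream DataFrame and checkpoint path from the question; the Delta sink path is a placeholder:
# Complete the streaming write with start() and a path-based Delta sink
query = (readEventStream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/testSink/streamprocess")
    .start("/delta/testSink"))

# Later, e.g. in another cell, read the sink back as a regular DataFrame
sink_df = spark.read.format("delta").load("/delta/testSink")
sink_df.show()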

Add schema to Spark structured streaming messages in JSON format

I'm implementing a Spark Structured Streaming job where I'm consuming messages coming from Kafka in JSON format.
def setup_input_stream(kafka_brokers, spark, topic_name):
    return spark.readStream.format("kafka") \
        .option("kafka.bootstrap.servers", kafka_brokers) \
        .option("subscribe", topic_name) \
        .load()
I am then able to extract the value field from the Kafka message in the form of a String containing a JSON payload.
deserialized_data = data_stream \
    .selectExpr("CAST (value AS STRING) as json") \
    .select(f.from_json(f.col("json"), schema=JSON_SCHEMA).alias("schemaless_data")) \
    .select("schemaless_data.payload")
Once I have this payload column, I'm struggling to find a way to let Spark automatically infer its schema and convert it to a proper DataFrame.
I know I can hardcode a StructType containing the schema of my payload, but since I want to use this generic implementation to ingest data coming from different RDBMS tables (each table in its separate topic), I don't really want to hardcode the schema of every possible table.
Can message schemas be inferred somehow?
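One approach that is sometimes used (a hedged sketch, not from the original post): infer the schema once from a small batch sample of the topic, then reuse it in from_json for the streaming query. The kafka_brokers, topic_name, and data_stream names are the ones used above; the sample size is arbitrary:
import pyspark.sql.functions as f

# Batch (non-streaming) read of a sample of the topic
sample_df = (spark.read.format("kafka")
    .option("kafka.bootstrap.servers", kafka_brokers)
    .option("subscribe", topic_name)
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .limit(100))

# spark.read.json can infer a schema from an RDD of JSON strings
inferred_schema = spark.read.json(sample_df.rdd.map(lambda row: row.json)).schema

parsed = (data_stream
    .selectExpr("CAST(value AS STRING) AS json")
    .select(f.from_json(f.col("json"), schema=inferred_schema).alias("data"))
    .select("data.*"))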
