Structured Streaming from Kafka to HBase - need to set custom timestamps - apache-spark

I am using:
Spark 2.2 (from HDP 2.6.3)
Kafka 1.0.1
HBase 1.1.2 (from HDP 2.6.3)
shc-core-1.1.2-2.2-s_2.11-SNAPSHOT.jar built manually with an additional Scala class HBaseSinkProvider (link to a GitHub issue)
I want to use this stack to create a kind of real-time ETL:
I have millions of objects of one type (for example, "customers")
These objects have various fields (like Name, Surname, Status, Email, etc.) plus Version and LastUpdateDate fields
When an object changes, its full description with all fields and values is published to a Kafka topic
In Spark I have a Structured Streaming app that consumes the stream of data and saves the objects to HBase
Everything above works well so far; the code looks like this (Python):
from pyspark.sql.functions import col, from_json

# `schema` (the JSON schema of the messages) and `catalog` (the HBase table catalog)
# are defined earlier in the script.
spark \
    .readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', 'worker01:9092,worker02:9092,worker03:9092') \
    .option('subscribe', 'youdo') \
    .option('group.id', 'spark') \
    .option('maxOffsetsPerTrigger', 100) \
    .option('startingOffsets', 'earliest') \
    .load() \
    .withColumn(
        'decoded',
        from_json(
            col('value').cast('string'),
            schema
        )
    ) \
    .select(
        'decoded.body.*',
        'timestamp'
    ) \
    .na.fill('null') \
    .writeStream \
    .outputMode('append') \
    .format('HBase.HBaseSinkProvider') \
    .option('hbasecat', catalog) \
    .option('checkpointLocation', '/tmp/checkpoint') \
    .start() \
    .awaitTermination()
Now my problem is that there is no guarantee that messages in different partitions arrive in the proper order. For example, I may already have version 8 of an object in HBase and then consume a message from another partition carrying version 7 - in that case I must not update the data in HBase.
First I tried to join the input stream with HBase and filter out rows whose version is lower than the one already stored in HBase:
<...>
    .select(
        'decoded.body.*',
        'timestamp'
    ) \
    .join(
        sqlc.read \
            .format('org.apache.spark.sql.execution.datasources.hbase') \
            .options(catalog=catalog_hbase) \
            .load() \
            .select('id', col('hbase_version').cast('integer')),
        ['id'],
        'left'
    ) \
    .na.fill({'hbase_version': 0}) \
    .filter(col('version').cast('integer') > col('hbase_version'))
<...>
Doing it like this I always get a StreamingQueryException: key not found: hbase_version. I think this fails because of shc; joining a streaming DataFrame with its HBase source looks like an unsupported feature.
Another way I see is to set the HBase row timestamp manually for each row (I have the LastUpdateDate attribute along with Version) - how can I do that?
Are there any other ways?
The resulting code must be in Python, but it is OK to compile some jars and pass them to the script with spark-submit.
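For illustration, one possible way around both issues (the unsupported stream-static join through shc and the lack of control over cell timestamps in the sink) is foreachBatch, which is only available from Spark 2.4, combined with a plain HBase client such as happybase. The sketch below is not tested against this setup; the table name, column family, Thrift host and the LastUpdateDate conversion are all assumptions, and parsed_stream stands for the DataFrame produced by the select('decoded.body.*', 'timestamp') step above. Setting the cell timestamp from LastUpdateDate lets HBase itself keep the newest value even when an older message arrives later from another partition.

import happybase  # assumption: an HBase Thrift server is reachable from the driver

HBASE_THRIFT_HOST = 'worker01'   # assumption
HBASE_TABLE = 'customers'        # assumption
CF = 'd'                         # assumption: column family used in the shc catalog

def write_batch(batch_df, epoch_id):
    # foreachBatch hands over a plain (non-streaming) DataFrame, so collect() is allowed here.
    # For large batches, prefer batch_df.foreachPartition(...) with one connection per partition.
    connection = happybase.Connection(HBASE_THRIFT_HOST)
    table = connection.table(HBASE_TABLE)
    for row in batch_df.collect():
        data = row.asDict()
        # Use LastUpdateDate as the cell timestamp (milliseconds); adjust the conversion
        # if LastUpdateDate arrives as a string rather than a timestamp column.
        ts = int(data['LastUpdateDate'].timestamp() * 1000)
        table.put(
            str(data['id']).encode('utf-8'),
            {(CF + ':' + k).encode('utf-8'): str(v).encode('utf-8')
             for k, v in data.items() if k != 'id'},
            timestamp=ts,
        )
    connection.close()

parsed_stream \
    .writeStream \
    .outputMode('append') \
    .foreachBatch(write_batch) \
    .option('checkpointLocation', '/tmp/checkpoint') \
    .start() \
    .awaitTermination()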

Related

run multiple streaming queries with pyspark

I have a job which writes (in streaming) two different dataframes to two different paths (S3).
I tried several solutions but none of them works.
Below is the code I used in PySpark:
query1 = df1.writeStream.format("parquet") \
    .option("checkpointLocation", "checkpoints") \
    .option("path", "output") \
    .partitionBy("year", "month", "day") \
    .outputMode("append") \
    .start()

query2 = df2.writeStream.format("parquet") \
    .option("checkpointLocation", "checkpoints2") \
    .option("path", "output2") \
    .partitionBy("year", "month", "day") \
    .outputMode("append") \
    .start()
First option:
query1.awaitTermination()
query2.awaitTermination()
Second option:
spark.streams.awaitAnyTermination()
But neither option works; only the first dataframe gets written.
Any idea how to fix this, please?
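For reference, the usual multi-sink pattern is essentially what is shown above: start every query with its own checkpoint location, then block on the StreamingQueryManager rather than on a single query. A minimal sketch with placeholder paths follows; if only one sink receives data, it is worth checking in the Spark UI that both queries are actually active and getting executor resources.

query1 = df1.writeStream.format("parquet") \
    .option("path", "s3://<MY_BUCKET>/output1") \
    .option("checkpointLocation", "s3://<MY_BUCKET>/checkpoints1") \
    .partitionBy("year", "month", "day") \
    .outputMode("append") \
    .start()

query2 = df2.writeStream.format("parquet") \
    .option("path", "s3://<MY_BUCKET>/output2") \
    .option("checkpointLocation", "s3://<MY_BUCKET>/checkpoints2") \
    .partitionBy("year", "month", "day") \
    .outputMode("append") \
    .start()

# Wait for any of the active streaming queries to terminate.
spark.streams.awaitAnyTermination()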

how to run map transformation in a structured streaming job in pyspark

I am trying to set up a Structured Streaming job with a map() transformation that makes REST API calls. Here are the details:
(1)
df = spark.readStream.format('delta') \
    .option("maxFilesPerTrigger", 1000) \
    .load(f'{file_location}')
(2)
respData = df.select("resource", "payload").rdd.map(lambda row: put_resource(row[0], row[1])).collect()
respDf = spark.createDataFrame(respData, ["resource", "status_code", "reason"])
(3)
respDf.writeStream \
    .trigger(once=True) \
    .outputMode("append") \
    .format("delta") \
    .option("path", f'{file_location}/Response') \
    .option("checkpointLocation", f'{file_location}/Response/Checkpoints') \
    .start()
However, on step (2) I get the error: Queries with streaming sources must be executed with writeStream.start().
Any help will be appreciated. Thank you.
You have to execute your stream on df as well,
meaning df.writeStream.start().
There is a similar thread here:
Queries with streaming sources must be executed with writeStream.start();
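A common way to do the REST calls in a streaming job is to move step (2) into foreachBatch, where each micro-batch is handed over as a normal (non-streaming) DataFrame, so rdd.map(...).collect() is allowed and the responses can be written out with a batch writer. A rough sketch, assuming put_resource returns (resource, status_code, reason) tuples as in the question:

def call_api_and_write(batch_df, epoch_id):
    # batch_df is a static DataFrame, so RDD transformations are allowed here.
    resp_data = batch_df.select("resource", "payload") \
        .rdd.map(lambda row: put_resource(row[0], row[1])).collect()
    resp_df = spark.createDataFrame(resp_data, ["resource", "status_code", "reason"])
    resp_df.write.format("delta").mode("append").save(f'{file_location}/Response')

df.writeStream \
    .trigger(once=True) \
    .option("checkpointLocation", f'{file_location}/Response/Checkpoints') \
    .foreachBatch(call_api_and_write) \
    .start()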

What is the optimal way to read from multiple Kafka topics and write to different sinks using Spark Structured Streaming?

I am trying to write a Spark Structured Streaming job that reads from multiple Kafka topics (potentially 100s) and writes the results to different locations on S3 depending on the topic name. I've developed this snippet of code that currently reads from multiple topics and outputs the results to the console (based on a loop) and it works as expected. However, I would like to understand what the performance implications are. Would this be the recommended approach? Is it not recommended to have multiple readStream and writeStream operations? If so, what is the recommended approach?
my_topics = ["topic_1", "topic_2"]

for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()
It's certainly reasonable to run a number of concurrent streams per driver node.
Each .start() consumes a certain amount of driver resources in Spark. Your limiting factor will be the load on the driver node and its available resources.
Hundreds of topics running continuously at a high rate would need to be spread across multiple driver nodes [in Databricks there is one driver per cluster]. The advantage of Spark is, as you mention, multiple sinks and also a unified batch & streaming API for transformations.
The other issue will be dealing with the small writes you may end up making to S3 and file consistency. Take a look at delta.io to handle consistent & reliable writes to S3.
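As a concrete illustration of the delta.io suggestion, the console sink from the loop in the question can be swapped for a Delta sink with one output path and one checkpoint per topic. A sketch, assuming the Delta Lake libraries are available on the cluster; the bucket paths are placeholders:

for topic in my_topics:
    spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", topic) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
        .writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("path", "s3://<MY_BUCKET>/delta/{}".format(topic)) \
        .option("checkpointLocation", "s3://<MY_BUCKET>/checkpoints/{}".format(topic)) \
        .start()

spark.streams.awaitAnyTermination()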
Advantages of the approach below (one job per topic):
Generic.
Multiple jobs; each one runs independently.
Code is easy to maintain and support when issues arise.
If one topic fails, there is no impact on the other topics in production; you only have to focus on the failed one.
If you want to re-pull all data for a specific topic, you just have to stop the job for that topic, update or change the config and restart the same job.
Note - the code below is not completely generic; you may need to change or tune it.
topic="" // Get value from input arguments
sink="" // Get value from input arguments
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("subscribePattern", topic) \
.load() \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
output_df = df \
.writeStream \
.format("console") \
.option("truncate", False) \
.outputMode("update") \
.option("checkpointLocation", sink) \
.start()
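To run the snippet above as one job per topic, the two values can simply be taken from the submit arguments, for example (a sketch; the script name and argument order are assumptions):

import sys

# e.g. spark-submit per_topic_job.py my_topic s3://<MY_BUCKET>/checkpoints/my_topic
topic = sys.argv[1]   # Kafka topic (or pattern) handled by this job instance
sink = sys.argv[2]    # checkpoint location for this job instance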
Problems with the approach below (a single job looping over all topics):
If one topic fails, it terminates the complete program.
Limited parallelism.
Code is difficult to maintain, debug and support when issues arise.
If you want to re-pull all data for a specific topic from Kafka, it's not possible without affecting the others, because any config change applies to all topics, which makes it a very costly operation.
my_topics = ["topic_1", "topic_2"]

for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()

How to copy data from one Azure EventHub to another Azure EventHub?

There is no out-of-the-box solution to clone data from one Azure Event Hub to another Event Hub. What are possible options to achieve this?
One simple option for duplicating an Azure Event Hub stream is to write a clone job in PySpark. You just read the stream from your source Event Hub, select the body (and, if relevant for your scenario, also the properties) from the source streaming dataframe, and write this stream to your target Event Hub:
df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehSource) \
    .load() \
    .select("properties", "body") \
    .writeStream \
    .format("eventhubs") \
    .options(**ehTarget) \
    .option("checkpointLocation", checkploc) \
    .start()
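For completeness, ehSource and ehTarget are the option dictionaries for the azure-event-hubs-spark connector, typically built from the two connection strings roughly as follows (a sketch; the connection strings are placeholders, and recent connector versions expect them to be encrypted with the connector's EventHubsUtils helper, while older versions accept the plain string):

# Placeholders - replace with the real connection strings of both hubs.
source_conn = "Endpoint=sb://<source-ns>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=<source-hub>"
target_conn = "Endpoint=sb://<target-ns>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=<target-hub>"

encrypt = spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt

ehSource = {"eventhubs.connectionString": encrypt(source_conn)}
ehTarget = {"eventhubs.connectionString": encrypt(target_conn)}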

How to write streaming Dataset to Cassandra?

So I have a streaming DataFrame df in Python that has all the data I want to place into a Cassandra table with the spark-cassandra-connector. I've tried doing this in two ways:
df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table="myTable", keyspace="myKeySpace") \
    .save()

query = df.writeStream \
    .format("org.apache.spark.sql.cassandra") \
    .outputMode('append') \
    .options(table="myTable", keyspace="myKeySpace") \
    .start()

query.awaitTermination()
However I keep getting these errors, respectively:
pyspark.sql.utils.AnalysisException: "'write' can not be called on streaming Dataset/DataFrame;
and
java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.cassandra does not support streamed writing.
Is there any way I can send my streamed DataFrame into my Cassandra table?
There is currently no streaming Sink for Cassandra in the Spark Cassandra Connector. You will need to implement your own Sink or wait for it to become available.
If you were using Scala or Java you could use the foreach operator and a ForeachWriter, as described in Using Foreach.
I know it's an old post; updating it for future reference.
You can process the streaming data as batches with foreachBatch, like below:
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="table_name", keyspace="keyspacename") \
        .mode("append") \
        .save()

query = sdf3.writeStream \
    .trigger(processingTime="10 seconds") \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .start()
