By default, when your source uses a Hive-style partition directory structure, Auto Loader adds the partition columns to your schema automatically during schema inference (see the cloudFiles.partitionColumns option).
This is the code:
checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
spark
.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", checkpoint_path)
.load("s3://autoloader-source/json-data")
.writeStream
.option("checkpointLocation", checkpoint_path)
.trigger(availableNow=True)
.toTable("dev_catalog.dev_database.dev_table")
)
But is there an option to also create partitions on the target table, as you can with a simple CREATE TABLE? (E.g. if you have the classic structure /year=xxxx/month=xxx/day=xx.)
You can use the .partitionBy() function.
checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
spark
.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", checkpoint_path)
.load("s3://autoloader-source/json-data")
.writeStream
.option("checkpointLocation", checkpoint_path)
.partitionBy("col1", "col2")
.trigger(availableNow=True)
.toTable("dev_catalog.dev_database.dev_table")
)
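For reference, the Hive-style layout mentioned in the question encodes partition values as key=value directory names. The pure-Python sketch below shows which columns such a path implies; these are the names you would pass to .partitionBy(). The helper name parse_hive_partitions and the example path are hypothetical and only for illustration, they are not part of Auto Loader.

```python
def parse_hive_partitions(path: str) -> dict:
    """Return {column: value} for every key=value directory segment in path."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        # Only directory names of the form key=value denote partitions.
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


# Hypothetical file path under the source location used in the question.
example = "s3://autoloader-source/json-data/year=2023/month=07/day=15/part-0000.json"
print(parse_hive_partitions(example))
# prints {'year': '2023', 'month': '07', 'day': '15'}
```

With a layout like this, the write side would use .partitionBy("year", "month", "day").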
Related
I am trying to build the Spark Structured Streaming job below, which reads from Kafka, performs an aggregation (a count over every one-minute window), and stores the result in Cassandra. I am getting an error in update mode.
java.lang.IllegalArgumentException: requirement failed: final_count does not support Update mode.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$.org$apache$spark$sql$execution$datasources$v2$V2Writes$$buildWriteForMicroBatch(V2Writes.scala:121)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:90)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:43)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at
My spark source is
import os
from pyspark.sql.functions import col, from_json, unix_timestamp, window
from pyspark.sql.types import TimestampType

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 pyspark-shell'
# `spark` (the SparkSession) and `schema` (the schema of the JSON payload) are defined elsewhere.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxxx:9092") \
    .option("subscribe", "yyyy") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
    .select(col("parsed_value.country"), col("parsed_value.city"), col("parsed_value.Location").alias("location"), col("parsed_value.TimeStamp")) \
    .withColumn('currenttimestamp', unix_timestamp(col('TimeStamp'), "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
    .withWatermark("currenttimestamp", "1 minutes")
df.printSchema()
# Count events per location in one-minute windows.
df = df.groupBy(window(df.currenttimestamp, "1 minutes"), df.location) \
    .count()
df = df.select(col("location"), col("window.start").alias("starttime"), col("count"))
# This direct write to Cassandra in update mode is what raises the error above.
df.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", '/tmp/check_point/') \
    .option("keyspace", "cccc") \
    .option("table", "bbbb") \
    .option("spark.cassandra.connection.host", "aaaa") \
    .option("spark.cassandra.auth.username", "ffff") \
    .option("spark.cassandra.auth.password", "eee") \
    .start() \
    .awaitTermination()
The schema for the table in Cassandra is as below:
CREATE TABLE final_count (
    starttime TIMESTAMP,
    location TEXT,
    count INT,
    PRIMARY KEY (starttime, location)
);
It works in update mode when printing to the console, but fails with the error above when writing to Cassandra.
Any suggestions?
You need to use foreachBatch, as Cassandra is still not a standard streaming sink.
See https://docs.databricks.com/structured-streaming/examples.html#write-to-cassandra-using-foreachbatch-in-scala
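The linked example is in Scala; a rough PySpark equivalent might look like the sketch below. It assumes the connection settings (for example spark.cassandra.connection.host and the auth options) are set on the SparkSession, and it reuses the placeholder keyspace and table names from the question. The function names write_to_cassandra and start_cassandra_query are hypothetical. Each micro-batch is written with the ordinary batch writer; since Cassandra writes are upserts on the primary key, appending the updated counts overwrites the earlier rows for the same (starttime, location).

```python
def write_to_cassandra(batch_df, batch_id):
    """Write one micro-batch to Cassandra with the batch (non-streaming) writer."""
    (
        batch_df.write
        .format("org.apache.spark.sql.cassandra")
        # Placeholder keyspace/table names taken from the question.
        .options(keyspace="cccc", table="bbbb")
        .mode("append")
        .save()
    )


def start_cassandra_query(aggregated_df, checkpoint_path="/tmp/check_point/"):
    """Start the streaming query; aggregated_df is the windowed count DataFrame."""
    return (
        aggregated_df.writeStream
        .outputMode("update")
        .foreachBatch(write_to_cassandra)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Usage would be start_cassandra_query(df).awaitTermination(), where df is the final DataFrame from the question.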
I have some Kafka topics named as follows:
'ingestion_src_api_iq_BTCUSD_1_json', 'ingestion_src_api_iq_BTCUSD_5_json', 'ingestion_src_api_iq_BTCUSD_60_json'
I'm reading all these topics, which have the same data structure, using the "subscribePattern" option in Spark.
(spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_server)
.option("subscribePattern", "ingestion_src_api.*")
.option("startingOffsets", "latest")
.load()
.select(col("topic").cast("string"), from_json(col("value").cast("string"), schema).alias("value"))
.select(to_json(struct(expr("value.active_id as active_id"), expr("value.size as timeframe"),
expr("cast(value.at / 1000000000 as timestamp) as executed_at"), expr("FROM_UNIXTIME(value.from) as candle_from"),
expr("FROM_UNIXTIME(value.to) as candle_to"), expr("value.id as period"),
"value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume")).alias("value"))
.writeStream.format("kafka").option("kafka.bootstrap.servers", bootstrap_server)
.option("topic", "processed_src_api_iq_data")
.option("checkpointLocation", f"./checkpoint/")
.start()
)
How could I write the transformed data to different topics, like:
'processed_src_api_iq_BTCUSD_1_json', 'processed_src_api_iq_BTCUSD_5_json', 'processed_src_api_iq_BTCUSD_60_json'
With my code I am only able to write to one topic, "processed_src_api_iq_data".
The outgoing dataframe to format("kafka") can include a string column named topic, which determines the topic each row's value (and key, if present) is produced to, instead of using the option, as documented:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
The topic column is required if the “topic” configuration option is not specified
Use withColumn to add the necessary values, based on the other columns that you have.
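As a minimal sketch of that approach, assuming each target topic is the source topic with the ingestion_ prefix replaced by processed_ (as the names in the question suggest): the helper names target_topic and add_target_topic are hypothetical. Note that the final select must keep the topic column next to value, and the .option("topic", ...) line has to be removed, because the option overrides the column.

```python
import re


def target_topic(source_topic: str) -> str:
    """Pure-Python version of the renaming rule, handy for unit tests."""
    return re.sub(r"^ingestion_", "processed_", source_topic)


def add_target_topic(parsed_df):
    """Overwrite the 'topic' column of a Spark DataFrame with the target topic."""
    # Imported lazily so the helper above also works without Spark installed.
    from pyspark.sql.functions import col, regexp_replace

    return parsed_df.withColumn(
        "topic", regexp_replace(col("topic"), "^ingestion_", "processed_")
    )


print(target_topic("ingestion_src_api_iq_BTCUSD_1_json"))
# prints processed_src_api_iq_BTCUSD_1_json
```

The DataFrame passed to writeStream.format("kafka") would then be something like add_target_topic(parsed_df).select("topic", <the to_json(...) value column>).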
Alternatively, create multiple dataframes and call writeStream.format("kafka") with an individual option("topic", ...) setting on each.
# Parentheses allow the method chains to span several lines in Python.
raw_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_server)
    .option("subscribePattern", "ingestion_src_api.*")
    .option("startingOffsets", "latest")
    .load()
)
parsed_df = raw_df.select(col("topic").cast("string"), from_json(col("value").cast("string"), schema).alias("value"))
processed_df = parsed_df.select(
    to_json(struct(
        expr("value.active_id as active_id"),
        expr("value.size as timeframe"),
        expr("cast(value.at / 1000000000 as timestamp) as executed_at"),
        expr("FROM_UNIXTIME(value.from) as candle_from"),
        expr("FROM_UNIXTIME(value.to) as candle_to"),
        expr("value.id as period"),
        "value.open", "value.close", "value.min", "value.max", "value.ask", "value.bid", "value.volume"
    )).alias("value")
)
btc_1 = processed_df.filter(...)  # something to get just this data
btc_5 = processed_df.filter(...)  # etc.
(
    btc_1.writeStream.format("kafka")
    .option("topic", "processed_src_api_iq_BTCUSD_1_json")
    # ...
)
(
    btc_5.writeStream.format("kafka")
    .option("topic", "processed_src_api_iq_BTCUSD_5_json")
    # ...
)
I am using Spark Structured Streaming to read data from Kafka and apply a UDF to the dataset. The code is as below:
calludf = F.udf(lambda x: function_name(x))
dfraw = spark.readStream.format('kafka') \
.option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
.option('subscribe', topic_name) \
.load()
df = dfraw.withColumn("value", F.col('value').cast('string')).withColumn('value', calludf(F.col('value')))
ds = df.selectExpr("CAST(value AS STRING)") \
.writeStream \
.format('console') \
.option('truncate', False) \
.start()
dsf = df.selectExpr("CAST (value AS STRING)") \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
.option("topic", topic_name_two) \
.option("checkpointLocation", checkpoint_location) \
.start()
ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. Two of them fail due to an issue with the data, which is understandable. The console displays the remaining 8 processed dataframes, but only 6 of those 8 are written to the Kafka topic by the dsf streaming query. I have added a checkpoint location to it, but it still does not work.
PS: Do let me know if you have any suggestions regarding the code as well. I am new to Spark Structured Streaming, so maybe there is something wrong with the way I am doing it.
In my scenario, I have a Structured Streaming application that reads from Kafka and writes to HDFS and Kafka using 3 different sinks. The primary sink is the HDFS one and the others are secondary. I want the primary sink to run first and then the secondary sinks. All have a trigger time of 60 seconds. Is there a way to achieve that in Spark Structured Streaming? Adding the code snippet:
val spark = SparkSession
.builder
.master(StreamerConfig.sparkMaster)
.appName(StreamerConfig.sparkAppName)
.getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers",StreamerConfig.kafkaBootstrapServer)
.option("subscribe",StreamerConfig.topicName)
.option("failOnDataLoss", false)
.option("startingOffsets",StreamerConfig.kafkaStartingOffset)
.option("maxOffsetsPerTrigger",StreamerConfig.maxOffsetsPerTrigger)
.load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
.queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
.outputMode("append")
.format("orc")
.option("path",StreamerConfig.hdfsLandingPath)
.option("checkpointLocation",StreamerConfig.checkpointLocation)
.partitionBy("date","hour")
.option("truncate","false")
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase1Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase2Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
PS: I am using Spark 2.3.2.
I am trying to write a Spark Structured Streaming job that reads from a Kafka topic and writes to separate paths (after performing some transformations) via the writeStream operation. However, when I run the following code, only the first writeStream gets executed and the second is getting ignored.
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.load()
write_one = df.writeStream \
.foreachBatch(lambda x, y: transform_and_write_to_zone_one(x,y)) \
.start() \
.awaitTermination()
# transform df to df2
write_two = df2.writeStream \
.foreachBatch(lambda x, y: transform_and_write_to_zone_two(x,y)) \
.start() \
.awaitTermination()
I initially thought that my issue was related to this post, however, after changing my code to the following:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.load()
write_one = df.writeStream \
.foreachBatch(lambda x, y: transform_and_write_to_zone_one(x,y)) \
.start()
# transform df to df2
write_two = df2.writeStream \
.foreachBatch(lambda x, y: transform_and_write_to_zone_two(x,y)) \
.start()
write_one.awaitTermination()
write_two.awaitTermination()
I received the following error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
I am not sure why the additional code between start() and awaitTermination() would cause the error above (but I think this is probably a separate issue, referenced in this answer to the same post above). What is the correct way to call multiple writeStream operations within the same job? Would it be best to have both writes within the function invoked by foreachBatch, or is there a better way to achieve this?
The Spark documentation says that if you need to write to multiple locations, you should use the foreachBatch method.
Your code should look something like:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
batchDF.write.format(...).save(...) // location 1
batchDF.write.format(...).save(...) // location 2
batchDF.unpersist()
}
Note: persist is needed to prevent recomputation.
You can check more: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
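Since the question uses PySpark, the same pattern might be sketched in Python as follows. The output paths, the Parquet format and the function names are assumptions for illustration only; the try/finally simply makes sure the cached batch is released even if one of the writes fails.

```python
# Hypothetical output locations; replace with the real zone paths.
ZONE_ONE_PATH = "/tmp/zone_one"
ZONE_TWO_PATH = "/tmp/zone_two"


def write_to_both_zones(batch_df, batch_id):
    """Write one micro-batch to two locations without recomputing it."""
    batch_df.persist()
    try:
        batch_df.write.mode("append").parquet(ZONE_ONE_PATH)  # location 1
        batch_df.write.mode("append").parquet(ZONE_TWO_PATH)  # location 2
    finally:
        # Release the cached batch whether or not the writes succeeded.
        batch_df.unpersist()


def start_two_zone_query(streaming_df, checkpoint_path):
    """Start a single streaming query that feeds both locations."""
    return (
        streaming_df.writeStream
        .foreachBatch(write_to_both_zones)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Any per-zone transformation can be applied to batch_df inside write_to_both_zones before each write.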
You don't call awaitTermination() on each of your streaming queries; instead, call it once through the Spark session, e.g.:
spark.streams.awaitAnyTermination()
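Put together, this could look like the sketch below. The helper name start_all_and_wait is hypothetical; it takes the SparkSession and a list of configured DataStreamWriter objects, starts all of them first, and only then blocks.

```python
def start_all_and_wait(spark, writers):
    """Start every DataStreamWriter, then block until any one query terminates."""
    # Start all queries before blocking, so none of them is left unstarted.
    queries = [writer.start() for writer in writers]
    spark.streams.awaitAnyTermination()
    return queries
```

For the code in the question, the call would be start_all_and_wait(spark, [df.writeStream.foreachBatch(...), df2.writeStream.foreachBatch(...)]), with the foreachBatch functions filled in as before.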