Using a dynamic index on Spark Structured Streaming with ES Hadoop - apache-spark

I have Elasticsearch 6.4.2 and Spark 2.2.0
Currently I have a working example where I can send data from a Dataset into Elasticsearch via the writeStream (Structured Streaming) API:
ds.writeStream
  .outputMode("append")
  .format("org.elasticsearch.spark.sql")
  .option("checkpointLocation", "hdfs://X.X.X.X:9000/tmp")
  .option("es.resource.write", "index/doc")
  .option("es.nodes", "X.X.X.X")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
However, I am interested in using dynamic index names to create a new index based on the date of the event. Per the documentation, it is supposedly possible to do this using the es.resource.write configuration with a special format:
.option("es.resource.write","index-{myDateField}/doc")
Despite all my efforts, when I try to run the code with the curly braces in place, it immediately crashes, stating that an illegal character '{' was detected.
Does the writeStream API currently support this configuration?

Dynamic index names do work with Structured Streaming when the resource pattern is passed to start() rather than set via es.resource.write, for example:
spark
  .readStream
  .option("sep", ",")
  .schema(userSchema)
  .csv("/data/csv/") // CSV input directory
  .writeStream
  .format("es")
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "file:/data/job/spark/checkpointLocation/example/StructuredEs")
  .start("structured.es.example.{name}.{date|yyyy-MM}") // ES 7+, no document type
  .awaitTermination()
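For the ES 6.4.2 setup in the question, where a document type is still required, the same start()-based approach would look roughly like the sketch below. It is adapted from the answer above and the question's own options, not a tested configuration:
import org.apache.spark.sql.streaming.Trigger

ds.writeStream
  .outputMode("append")
  .format("org.elasticsearch.spark.sql")
  .option("checkpointLocation", "hdfs://X.X.X.X:9000/tmp")
  .option("es.nodes", "X.X.X.X")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start("index-{myDateField}/doc") // dynamic index pattern passed to start() instead of es.resource.write
  .awaitTermination()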

Related

Can't read via Apache Spark Structured Streaming from Hive Table

When I try to read from a Hive table with the following code, I get the error buildReader is not supported for HiveFileFormat from the Spark driver pod.
spark.readStream \
    .table("table_name") \
    .repartition("filename") \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime="10 minutes") \
    .foreachBatch(perBatch) \
    .start()
I have tried every possible combination, including the simplest queries possible. Reading directly from the specified folder via the parquet method works, as does Spark SQL without streaming, but reading with Structured Streaming via readStream does not.
The documentation says the following...
Since Spark 3.1, you can also use DataStreamReader.table() to read tables as streaming DataFrames and use DataStreamWriter.toTable() to write streaming DataFrames as tables:
I'm using the latest Spark 3.2.1. Although reading from a table is not shown in the examples, the paragraph above clearly suggests it should be possible.
Any assistance to help get this working would be really great and simplify my project a lot.
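For reference, the Spark 3.1+ API the quoted paragraph refers to looks roughly like the sketch below (Scala here for illustration, with placeholder table and checkpoint names); it shows DataStreamReader.table() and DataStreamWriter.toTable(), and still requires the table to be backed by a streaming-capable source:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("TableStreamSketch").getOrCreate()

spark.readStream
  .table("source_table") // read a table as a streaming DataFrame
  .writeStream
  .option("checkpointLocation", "/tmp/checkpoints/table-stream") // placeholder path
  .trigger(Trigger.ProcessingTime("10 minutes"))
  .toTable("target_table") // write the stream back to a table, returns a StreamingQuery
  .awaitTermination()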

How to insert processed spark stream into kafka

I am trying to insert a Spark stream into Kafka after it has been processed, using the snippet below:
query = ds1 \
    .selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .foreachBatch(do_something) \
    .format("kafka") \
    .option("topic", "topic-name") \
    .option("kafka.bootstrap.servers", "borkers-IPs") \
    .option("checkpointLocation", "/home/location") \
    .start()
but it seems to insert the original stream, not the processed one.
As you can see, the foreachBatch call has no effect here: the format("kafka") call that follows overrides it as the sink, so Spark will not raise an error and the batch function simply never runs.
Quote from the manuals:
Structured Streaming APIs provide two ways to write the output of a
streaming query to data sources that do not have an existing streaming
sink: foreachBatch() and foreach().
This excellent read is what you are looking for.
https://aseigneurin.github.io/2018/08/14/kafka-tutorial-8-spark-structured-streaming.html
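If the goal is simply to transform the stream before producing it to Kafka, one option is to express the processing as ordinary DataFrame transformations and let the built-in Kafka sink write the result. A minimal Scala sketch, where process() is a hypothetical stand-in for do_something:
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for the processing done in do_something.
def process(df: DataFrame): DataFrame =
  df.selectExpr("upper(value) AS value")

ds1
  .selectExpr("CAST(value AS STRING) AS value")
  .transform(process) // the processing is applied to the streaming DataFrame itself
  .writeStream        // ...and the built-in Kafka sink writes the processed rows
  .format("kafka")
  .option("topic", "topic-name")
  .option("kafka.bootstrap.servers", "borkers-IPs")
  .option("checkpointLocation", "/home/location")
  .start()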

Writing data as JSON array with Spark Structured Streaming

I have to write data from Spark Structured Streaming as a JSON array. I have tried the code below:
df.selectExpr("to_json(struct(*)) AS value").toJSON
which returns a Dataset[String], but I am unable to write it as a JSON array.
Current Output:
{"name":"test","id":"id"}
{"name":"test1","id":"id1"}
Expected Output:
[{"name":"test","id":"id"},{"name":"test1","id":"id1"}]
Edit (moving comments into question):
After using the proposed collect_list method, I get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
Then I tried something like this:
.withColumn("timestamp", unix_timestamp(col("event_epoch"), "MM/dd/yyyy hh:mm:ss aa"))
.withWatermark("event_epoch", "1 minutes")
.groupBy(col("event_epoch"))
.agg(max(col("event_epoch")).alias("timestamp"))
But I don't want to add a new column.
You can use the SQL built-in function collect_list for this. This function collects and returns a set of non-unique elements (compared to collect_set which returns only unique elements).
From the source code for collect_list you will see that this is an aggregation function. Based on the requirements given in the Structured Streaming Programming Guide on Output Modes, only the output modes "complete" and "update" are supported for aggregations without a watermark.
As I understand from your comments, you do not wish to add a watermark or new columns. Also, the error you are facing,
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
reminds you not to use the output mode "append".
In the comments, you have mentioned that you plan to produce the results into a Kafka message: one big JSON array as one Kafka value. The complete code could look like this:
val df = spark.readStream
  .[...] // in my test I am reading from a Kafka source
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "offset", "partition")
  // do not forget to convert your data into a String before writing to Kafka
  .selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value")

df.writeStream
  .format("kafka")
  .outputMode("complete")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .option("checkpointLocation", "/path/to/sparkCheckpoint")
  .trigger(Trigger.ProcessingTime(10000))
  .start()
  .awaitTermination()
Given the key/value pairs (k1,v1), (k2,v2), and (k3,v3) as inputs you will get a value in the Kafka topic that contains all selected data as a JSON Array:
[{"key":"k1","value":"v1","offset":7,"partition":0}, {"key":"k2","value":"v2","offset":8,"partition":0}, {"key":"k3","value":"v3","offset":9,"partition":0}]
Tested with Spark 3.0.1 and Kafka 2.5.0.

Spark Structured Streaming custom partition directory name

I'm porting a streaming job (Kafka topic -> AWS S3 Parquet files) from Kafka Connect to a Spark Structured Streaming job.
I partition my data by year/month/day.
The code is very simple:
df.withColumn("year", functions.date_format(col("createdAt"), "yyyy"))
.withColumn("month", functions.date_format(col("createdAt"), "MM"))
.withColumn("day", functions.date_format(col("createdAt"), "dd"))
.writeStream()
.trigger(processingTime='15 seconds')
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "/some/checkpoint/directory/")
.option("path", "/some/directory/")
.option("truncate", "false")
.partitionBy("year", "month", "day")
.start()
.awaitTermination();
The output files are in the following directory (as expected):
/s3-bucket/some/directory/year=2021/month=01/day=02/
Question:
Is there a way to customize the output directory name? I need it to be
/s3-bucket/some/directory/2021/01/02/
For backward compatibility reasons.
No, there is no way to customize the output directory names into that format from within your Spark Structured Streaming application.
Partitions are based on the values of particular columns, and without the column names in the directory path it would be ambiguous which column each value belongs to. You need to write a separate application that renames those directories into the desired format.
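A minimal sketch of such a separate application, assuming an S3A (or HDFS) filesystem and the year=/month=/day= layout shown above; the bucket and paths are placeholders, and keep in mind that a rename on S3 is a copy under the hood:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FlattenPartitionDirs {
  def main(args: Array[String]): Unit = {
    // Placeholder filesystem and root directory.
    val fs = FileSystem.get(new URI("s3a://s3-bucket"), new Configuration())
    val root = new Path("/some/directory/")

    // Walk year=*/month=*/day=* and drop the "column=" prefixes from the path.
    for {
      year  <- fs.listStatus(root)          if year.isDirectory
      month <- fs.listStatus(year.getPath)  if month.isDirectory
      day   <- fs.listStatus(month.getPath) if day.isDirectory
    } {
      val Array(y, m, d) =
        Array(year, month, day).map(_.getPath.getName.split("=", 2).last)
      val target = new Path(root, s"$y/$m/$d")
      fs.mkdirs(target.getParent)    // create .../2021/01/
      fs.rename(day.getPath, target) // move day=02 to .../2021/01/02
    }
  }
}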

How to write a Streaming Structured Stream into Hive directly?

I want to achieve something like this :
df.writeStream
  .saveAsTable("dbname.tablename")
  .format("parquet")
  .option("path", "/user/hive/warehouse/abc/")
  .option("checkpointLocation", "/checkpoint_path")
  .outputMode("append")
  .start()
I am open to suggestions. I know Kafka Connect could be one of the options, but how can I achieve this using Spark? A possible workaround may be what I am looking for.
Thanks in Advance !!
Spark Structured Streaming does not support writing the result of a streaming query to a Hive table directly; you must write to paths.
For Spark 2.4+ the docs suggest trying foreachBatch, but I have not tried it myself.
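A hedged sketch of that foreachBatch approach (Spark 2.4+), reusing the table, path, and checkpoint names from the question. Inside foreachBatch each micro-batch is an ordinary DataFrame, so the batch writer can target a Hive table even though the streaming writer cannot:
import org.apache.spark.sql.DataFrame

df.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/checkpoint_path")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // The ordinary DataFrameWriter is available per batch, so saveAsTable works here.
    batchDF.write
      .mode("append")
      .format("parquet")
      .option("path", "/user/hive/warehouse/abc/")
      .saveAsTable("dbname.tablename")
  }
  .start()
  .awaitTermination()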
