Unable to count the documents using Spark Structured Streaming - apache-spark

I am trying to use Couchbase as the streaming source for Spark Structured Streaming via the Couchbase Spark connector.
val records = spark.readStream
  .format("com.couchbase.spark.sql")
  .schema(schema)
  .load()
And I run this query:
records
  .groupBy("type")
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
For this query I am not getting the correct output. My query output table looks like this:
Batch: 0
20/04/14 14:28:00 INFO CodeGenerator: Code generated in 10.538654 ms
20/04/14 14:28:00 INFO WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@17fe0ec7 committed.
+--------+-----+
|    type|count|
+--------+-----+
+--------+-----+
However, if I use Couchbase to fetch the documents in a non-streaming way, like
val cdr = spark.read.couchbase(EqualTo("type", "cdr"))
cdr.count()
this gives the correct output (count = 28). The schema is correctly inferred for this non-streaming operation, and I used the same schema for structured streaming as well:
INFO N1QLRelation: Inferred schema is StructType(StructField(META_ID,StringType,true), StructField(_class,StringType,true), StructField(accountId,StringType,true),
Please let me know why this is not working with structured streaming.

This is probably because you are streaming only what has changed from now onward, not past events.
If you would like to stream everything "from the beginning", you need to specify that.
An example is shown in this blog post: https://blog.couchbase.com/couchbase-spark-connector-2-0-0-released/
Basically, in your stream you need to specify the following line:
.couchbaseStream(from = FromBeginning, to = ToInfinity)
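For context, here is a hedged sketch of how that line is typically wired up with the connector's DStream-based API, following the linked blog post; the import and the ssc StreamingContext are assumptions, not a verified structured-streaming equivalent.
import com.couchbase.spark.streaming._

// Hedged sketch: stream all mutations from the beginning of the bucket,
// assuming an already-configured StreamingContext named ssc.
val stream = ssc.couchbaseStream(from = FromBeginning, to = ToInfinity)
stream.foreachRDD(rdd => println(rdd.count()))
ssc.start()
ssc.awaitTermination()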

Related

Can't read via Apache Spark Structured Streaming from Hive Table

When I try to read from a Hive table with the following code, I get the error buildReader is not supported for HiveFileFormat from the Spark driver pod.
spark.readStream \
  .table("table_name") \
  .repartition("filename") \
  .writeStream \
  .outputMode("append") \
  .trigger(processingTime="10 minutes") \
  .foreachBatch(perBatch)
I have tried every possible combination, including the simplest queries possible. Reading via the parquet method directly from a specified folder works, as does Spark SQL without streaming, but reading with Structured Streaming via readStream does not.
The documentation says the following...
Since Spark 3.1, you can also use DataStreamReader.table() to read tables as streaming DataFrames and use DataStreamWriter.toTable() to write streaming DataFrames as tables:
I'm using the latest Spark 3.2.1. Although reading from a table is not shown in the examples, the paragraph above clearly suggests it should be possible.
Any assistance to help get this working would be really great and simplify my project a lot.
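For reference, a minimal Scala sketch of the documented Spark 3.1+ API that the quoted paragraph describes; the table names and checkpoint path are illustrative, and this sketch does not by itself address the HiveFileFormat error.
// Read a table as a streaming DataFrame and write the stream back out as a table.
val streamDf = spark.readStream.table("source_table")

streamDf.writeStream
  .option("checkpointLocation", "/tmp/checkpoints/source_to_target")
  .toTable("target_table")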

How to insert spark structured streaming DataFrame to Hive external table/location?

I have a question on Spark Structured Streaming integration with a Hive table.
I have tried some examples of Spark Structured Streaming.
Here is my example:
val spark = SparkSession.builder().appName("StatsAnalyzer")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/ab.db")
  .getOrCreate()

// Register the dataframe as a Hive table
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///home/su/testdelta")
csvDF.createOrReplaceTempView("updates")
val query = spark.sql("insert into table_abcd select * from updates")
query.writeStream.start()
As you can see, in the last step, while writing the data frame to the HDFS location, the data is not getting inserted into the existing directory (my existing directory has some old data partitioned by "age").
I am getting
spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Can you help me understand why I am not able to insert data into the existing directory in the HDFS location? Or is there another way to do an "insert into" operation on a Hive table?
Looking for a solution.
Spark Structured Streaming does not support writing the result of a streaming query to a Hive table.
scala> println(spark.version)
2.4.0
val sq = spark.readStream.format("rate").load
scala> :type sq
org.apache.spark.sql.DataFrame
scala> assert(sq.isStreaming)
scala> sq.writeStream.format("hive").start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:246)
... 49 elided
If a target system (aka sink) is not supported, you could use the foreach and foreachBatch operations (highlighting mine):
The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
I think foreachBatch is your best bet.
import org.apache.spark.sql.DataFrame

sq.writeStream.foreachBatch { case (ds: DataFrame, batchId: Long) =>
  // do whatever you want with your input DataFrame
  // incl. writing to Hive
  // I simply decided to print out the rows to the console
  ds.show
}.start
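Applied to the csvDF from the question, a hedged sketch of that foreachBatch route (assuming the Hive table table_abcd already exists and Hive support is enabled on the session) could look like this:
import org.apache.spark.sql.DataFrame

// Hedged sketch: append every micro-batch into the existing Hive table.
// insertInto matches columns by position, so the schemas must line up.
csvDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write.mode("append").insertInto("table_abcd")
  }
  .start()
  .awaitTermination()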
There is also the Apache Hive Warehouse Connector, which I've never worked with, but it seems like it may be of some help.
On HDP 3.1 with Spark 2.3.2 and Hive 3.1.0 we have used Hortonworks' spark-llap library to write a structured streaming DataFrame from Spark to Hive. On GitHub you will find some documentation on its usage.
The required library hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar is available on Maven and needs to be passed in the spark-submit command. There are many more recent versions of that library, although I haven't had the chance to test them.
After creating the Hive table manually (e.g. through beeline/Hive shell) you could apply the following code:
import com.hortonworks.hwc.HiveWarehouseSession

val csvDF = spark.readStream.[...].load()

val query = csvDF.writeStream
  .format(HiveWarehouseSession.STREAM_TO_STREAM)
  .option("database", "database_name")
  .option("table", "table_name")
  .option("metastoreUri", spark.conf.get("spark.datasource.hive.warehouse.metastoreUri"))
  .option("checkpointLocation", "/path/to/checkpoint/dir")
  .start()

query.awaitTermination()
Just in case someone actually tried the code from Jacek Laskowski: it does not really compile in Spark 2.4.0 (check my gist, tested on AWS EMR 5.20.0 and vanilla Spark). So I guess that was his idea of how it should work in some future Spark version.
The real code is:
scala> import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Dataset
scala> sq.writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) => batchDs.show).start
res0: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5ebc0bf5

PySpark structured streaming output sink as Kafka giving error

Using Kafka 0.9.0 and Spark 2.1.0, I am using PySpark Structured Streaming to compute results and output them to a Kafka topic. I am referring to the Spark docs for this:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Now when I run the command (output mode is complete, as it is aggregating the streaming data):
(mydataframe.writeStream
    .outputMode("complete")
    .format("kafka")
    .option("kafka.bootstrap.servers", "x.x.x.x:9092")
    .option("topic", "topicname")
    .option("checkpointLocation", "/data/checkpoint/1")
    .start())
It gives me the error below:
ERROR StreamExecution: Query [id = 0686130b-8668-48fa-bdb7-b79b63d82680, runId = b4b7494f-d8b8-416e-ae49-ad8498dfe8f2] terminated with error
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:73)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:73)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.kafka010.KafkaWriter$.validateQuery(KafkaWriter.scala:72)
at org.apache.spark.sql.kafka010.KafkaWriter$.write(KafkaWriter.scala:88)
at org.apache.spark.sql.kafka010.KafkaSink.addBatch(KafkaSink.scala:38)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:502)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:239)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
Not sure what attribute value it expects. Need help in resolving this.
The console output sink produces the correct output on the console, so the code seems to work fine. The issue only occurs when using Kafka as the output sink.
Not sure what attribute value it expects. Need help in resolving this.
Your myDataFrame needs a column value (of either StringType or BinaryType) containing the payload (message) which you want to send to Kafka.
Currently you are trying to write to Kafka, but don't describe which data is to be written.
One way to obtain such a column is to rename an existing column using .withColumnRenamed. If you want to write multiple columns, it's usually a good idea to create a column containing a JSON representation of the dataframe, which can be obtained using the to_json SQL function. But beware of .toJSON!
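Here is a minimal Scala sketch (the PySpark functions are analogous) of building the required value column with to_json; the exact column handling is an illustrative assumption.
import org.apache.spark.sql.functions.{col, struct, to_json}

// Hedged sketch: pack all columns of the aggregated DataFrame into a single
// JSON string column named "value", which the Kafka sink expects.
val kafkaReady = mydataframe
  .select(to_json(struct(mydataframe.columns.map(col): _*)).as("value"))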
Spark 2.1.0 does not support Kafka as an output sink. It was introduced in 2.2.0, as per the documentation.
See also this answer, which links to the commit introducing the feature, and provides an alternate solution, as well as this JIRA, which added the documentation in 2.2.1.

How to convert JSON Dataset to DataFrame in Spark Structured Streaming [duplicate]

This question already has an answer here:
How to read records in JSON format from Kafka using Structured Streaming?
(1 answer)
Closed 5 years ago.
I am using Spark Structured Streaming to process data from Kafka. I transform each message to JSON. However, Spark needs an explicit schema to obtain columns from JSON. Spark Streaming with DStreams allows doing the following:
spark.read.json(spark.createDataset(jsons))
where jsons is RDD[String].
In the case of Spark Structured Streaming, the similar approach
df.sparkSession.read.json(jsons)
(where jsons is a Dataset[String]) results in the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
I assume that read triggers execution instead of start, but is there a way to bypass this?
To stream from JSON on Kafka to a DataFrame you need to do something like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._

case class Colour(red: Int, green: Int, blue: Int)

val colourSchema: StructType = new StructType()
  .add("entity", "int")
  .add("security", "int")
  .add("client", "int")

val streamingColours: DataFrame = spark
  .readStream
  .format("kafka")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .select(from_json($"value", colourSchema))

streamingColours
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
This should create a streaming DataFrame, and show the results of reading from Kafka on the console.
I do not believe it is possible to use "infer schema" with streaming datasets. And this makes sense, since schema inference looks at a large set of data to work out what the types are, etc. With streaming datasets, the schema that could be inferred by processing the first message might be different from the schema of the second message, etc. And Spark needs one schema for all elements of the DataFrame.
What we have done in the past is to process a batch of JSON messages with Spark's batch processing and use schema inference, and then export that schema for use with streaming datasets.
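A minimal sketch of that approach, with an illustrative sample path and an assumed streaming Kafka DataFrame named kafkaStream:
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Infer the schema once from a static sample file (path illustrative)...
val exportedSchema = spark.read.json("/path/to/json/sample").schema

// ...then reuse it to parse the streaming JSON strings.
val parsed = kafkaStream
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", exportedSchema).as("data"))
  .select("data.*")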

Spark structured streaming kafka convert JSON without schema (infer schema)

I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does:
val dataFrame = spark.read.json(rdd.map(_.value()))
dataFrame.printSchema
Here is one possible way to do this:
1. Before you start streaming, get a small batch of the data from Kafka.
2. Infer the schema from the small batch.
3. Start streaming the data using the extracted schema.
The pseudo-code below illustrates this approach.
Step 1:
Extract a small (two-record) batch from Kafka:
val smallBatch = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "node:9092")
  .option("subscribe", "topicName")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", """{"topicName":{"0":2}}""")
  .load()
  .selectExpr("CAST(value AS STRING) as STRING").as[String].toDF()
Step 2:
Write the small batch to a file:
smallBatch.write.mode("overwrite").format("text").save("/batch")
This command writes the small batch into the HDFS directory /batch. The name of the file that it creates is part-xyz*. So you first need to rename the file using Hadoop FileSystem commands (see org.apache.hadoop.fs._ and org.apache.hadoop.conf.Configuration; here's an example: https://stackoverflow.com/a/41990859) and then read the file as JSON:
val smallBatchSchema = spark.read.json("/batch/batchName.txt").schema
Here, batchName.txt is the new name of the file and smallBatchSchema contains the schema inferred from the small batch.
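A hedged sketch of that rename step using the Hadoop FileSystem API (paths are illustrative):
import org.apache.hadoop.fs.{FileSystem, Path}

// Locate the part-* file Spark wrote and rename it to a stable name.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path("/batch/part-*"))(0).getPath
fs.rename(partFile, new Path("/batch/batchName.txt"))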
Finally, you can stream the data as follows (Step 3):
val inputDf = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "node:9092")
  .option("subscribe", "topicName")
  .option("startingOffsets", "earliest")
  .load()

val dataDf = inputDf.selectExpr("CAST(value AS STRING) as json")
  .select(from_json($"json", schema = smallBatchSchema).as("data"))
  .select("data.*")
Hope this helps!
It is possible using this construct:
val myStream = spark.readStream
  .schema(spark.read.json("my_sample_json_file_as_schema.json").schema)
  .json("my_json_file")
How can this be? Well, since spark.read.json("..").schema returns exactly the inferred schema you want, you can use this returned schema as the argument for the mandatory schema parameter of spark.readStream.
What I did was specify a one-line sample JSON file as the input for inferring the schema, so that it does not unnecessarily take up memory. In case your data changes, simply update your sample JSON.
It took me a while to figure out (constructing StructTypes and StructFields by hand was a pain in the ...), so I'll be happy for all upvotes :-)
It is not possible. Structured Streaming supports limited schema inference for development, with spark.sql.streaming.schemaInference set to true:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
but it cannot be used to extract JSON from Kafka messages and DataFrameReader.json doesn't support streaming Datasets as arguments.
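For completeness, a hedged sketch of enabling that file-source schema inference; note it applies to file-based sources only, not to parsing Kafka message values, and the path is illustrative.
// Allow Spark to infer the schema of a file-based streaming source (ad-hoc use only).
spark.conf.set("spark.sql.streaming.schemaInference", "true")

val fileStream = spark.readStream.json("/path/to/json/dir")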
You have to provide the schema manually; see How to read records in JSON format from Kafka using Structured Streaming?
It is possible to convert JSON to a DataFrame without having to manually type the schema, if that is what you meant to ask.
Recently I ran into a situation where I was receiving massively long nested JSON packets via Kafka, and manually typing the schema would have been both cumbersome and error-prone.
With a small sample of the data and some trickery you can provide the schema to Spark2+ as follows:
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val jsonstr = """ copy paste a representative sample of data here"""
val jsondf = spark.read.json(Seq(jsonstr).toDS) // jsondf.schema has the nested JSON structure we need
val event = spark.readStream.format..option...load() // configure your source
val eventWithSchema = event.select($"value" cast "string" as "json")
  .select(from_json($"json", jsondf.schema) as "data").select("data.*")
Now you can do whatever you want with this val, as you would with Direct Streaming: create a temp view, run SQL queries, whatever.
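For instance, a small illustrative follow-up that registers a temporary view and queries it with Spark SQL (the view name and query are assumptions):
// Register the parsed streaming DataFrame and query it with Spark SQL.
eventWithSchema.createOrReplaceTempView("events")

spark.sql("SELECT count(*) FROM events")
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()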
Taking Arnon's solution to the next step (since it's deprecated in Spark's newer versions, and would require iterating over the whole dataframe just for a type cast):
spark.read.json(df.as[String])
Anyway, as of now, it's still experimental.
