Use Spark to Write Kafka Messages Directly to a File - apache-spark

For a class project, I need a Spark Java program to listen as a Kafka consumer and write all of a Kafka topic's received messages to a file (e.g. "/user/zaydh/my_text_file.txt").
I am able to receive the messages as a JavaPairReceiverInputDStream object; I can also convert it to a JavaDStream<String> (this is from the Spark Kafka example).
However, I could not find good Java syntax for writing this data to what is essentially a single log file. I tried using foreachRDD on the JavaDStream object, but I could not find a clean, parallel-safe way to sink it to a single log file.
I understand this approach is non-traditional or non-ideal, but it is a requirement. Any guidance is much appreciated.

When you think of a stream, you have to think of it as something that won't stop giving out data.
Hence, if Spark Streaming had a way to save all the incoming RDDs to a single file, that file would keep growing to a huge size (and the stream isn't supposed to stop, remember? :))
But in this case you can make use of the saveAsTextFile utility of an RDD, which will create many files in your output directory depending on the batch interval that's specified while creating the streaming context:
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1))
You can then merge these file parts into one using something like how-to-merge-all-text-files-in-a-directory-into-one.
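For example, a minimal Java sketch of that approach might look like the following. It assumes the receiver-based KafkaUtils.createStream API from the question's Spark Kafka example; the ZooKeeper address, consumer group, topic name, and output path are placeholders rather than values from the original setup.
import java.util.Collections;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaToTextFiles {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("KafkaToTextFiles");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Placeholder ZooKeeper quorum, consumer group and topic (1 receiver thread).
    JavaPairReceiverInputDStream<String, String> kafkaStream = KafkaUtils.createStream(
        jssc, "localhost:2181", "my-consumer-group",
        Collections.singletonMap("my_topic", 1));

    // Keep only the message values.
    JavaDStream<String> lines = kafkaStream.map(record -> record._2());

    // Each batch gets its own output directory of part files; merge them afterwards.
    lines.foreachRDD((rdd, time) -> {
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("/user/zaydh/output/batch-" + time.milliseconds());
      }
    });

    jssc.start();
    jssc.awaitTermination();
  }
}
Each call to saveAsTextFile writes a directory of part files, which is why the merge step above is still needed.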

Related

How does the Kafka sink support update mode in Structured Streaming?

I have read about the different output modes like:
Complete Mode - The entire updated Result Table will be written to the sink.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage.
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage
At first I thought I understood the above explanations.
Then I came across this:
File sink supported modes: Append
Kafka sink supported modes: Append, Update, Complete
Wait!! What??!!
Why couldn't we just write out the entire result table to file?
How can we update an already existing entry in Kafka? It's a stream; you can't just look for certain messages and change/update them.
This makes no sense at all.
Could you help me understand this? I just don't get how this works technically.
Spark writes one file per partition, often with one file per executor. Executors run in a distributed fashion. Files are local to each executor, so append just makes sense - you cannot fully replace individual files, especially without losing data within the stream. That leaves you with "appending new files to the filesystem", or inserting into existing files.
Kafka has no update functionality... the Kafka Integration Guide doesn't mention any of these modes, so it is unclear what you are referring to. You use write or writeStream, and it will always "append" the "complete" dataframe batch(es) to the end of the Kafka topic. The way Kafka implements something like updates is with compacted topics, but that has nothing to do with Spark.
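To make the modes concrete, here is a hedged Java sketch of a streaming aggregation written to a Kafka sink in update mode. The broker address, topic names, and checkpoint location are placeholders, not taken from the question.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaUpdateModeExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("KafkaUpdateModeExample").getOrCreate();

    // Streaming aggregation over a Kafka source (placeholder broker and topic).
    Dataset<Row> counts = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "input-topic")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
        .groupBy("value")
        .count();

    // In update mode, each trigger emits only the rows whose counts changed since the
    // last trigger; the Kafka sink appends those records to the end of the output topic.
    StreamingQuery query = counts
        .selectExpr("value AS key", "CAST(count AS STRING) AS value")
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "output-topic")
        .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
        .outputMode("update")
        .start();

    query.awaitTermination();
  }
}
A file sink, by contrast, can only append new files, which is why update and complete are not offered for it.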

Kafka broker (0.10.0 or higher) as DStream source for Spark Streaming in Python

Concretely, I am looking for a replacement or work-around for the KafkaUtils.createStream() API call in pyspark.streaming.kafka in Kafka 0.8.0.
Trying to use this (deprecated) function with Kafka 0.10.0 produces an error. I was thinking about creating a custom receiver, but there isn't any pyspark support for that either. It also seems like there is no fix in the works.
Here's an abstract of the application I'm trying to build. The application wants to create a live (aggregated) dashboard from different production line resources, which are fed into Kafka. At the same time, the processed data will go to permanent storage. The goal is to create an anomaly detection system from this permanent data.
I can work around this problem for the permanent storage by batch processing the data before sending it off. But this obviously doesn't work for streaming.
Below you can find the pseudo code of what the script should look like:
sc = SparkContext(appName='abc')
sc.setLogLevel('WARN')
ssc = StreamingContext(sc, 2)
## Create Dstream object from Kafka (This is where I'm stuck)
## Transform and create aggregated windows
ssc.start()
## Catch output and send back to Kafka as producer
All advice and solutions are more than welcome.

How to read streaming datasets from socket?

Below code reads from a socket, but I don't see any input going into the job. I have nc -l 1111 running and dumping data, but I'm not sure why my Spark job is not able to read data from 10.176.110.112:1111.
Dataset<Row> d = sparkSession.readStream().format("socket")
    .option("host", "10.176.110.112")
    .option("port", 1111)
    .load();
Below code reads from a socket, but I don't see any input going into the job.
Well, honestly, you do not read anything from anywhere. You've only described what you are going to do when you start the streaming pipeline.
Since you use Structured Streaming to read datasets from a socket, you should use the start operator to trigger data fetching (and that's only after you define the sink).
start(): StreamingQuery Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. The returned StreamingQuery object can be used to interact with the stream.
Before start you should define where to stream your data. It could be Kafka, files, a custom streaming sink (perhaps using foreach operator) or console.
I use the console sink (aka format) in the following example. I also use Scala and leave rewriting it to Java as a home exercise.
d.writeStream.                                    // <-- this is the most important part
  trigger(Trigger.ProcessingTime("10 seconds")).
  format("console").
  option("truncate", false).
  start                                           // <-- and this
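For reference, a rough Java equivalent of the above might look like this. It is only a sketch; the local master setting and class name are assumptions for testing, while the host, port, and trigger interval come from the question and answer above.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class SocketToConsole {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("SocketToConsole")
        .master("local[2]")                       // assumption: local test run
        .getOrCreate();

    Dataset<Row> d = spark.readStream()
        .format("socket")
        .option("host", "10.176.110.112")
        .option("port", 1111)
        .load();

    StreamingQuery query = d.writeStream()        // <-- this is the most important part
        .trigger(Trigger.ProcessingTime("10 seconds"))
        .format("console")
        .option("truncate", false)
        .start();                                 // <-- and this

    query.awaitTermination();                     // keep the driver alive
  }
}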

Is it possible to Broadcast Spark Context?

I'm working in a scenario where I want to broadcast the Spark context and get it on the other side. Is it possible in any other way? If not, can someone explain why?
Any help is highly appreciated.
final JavaStreamingContext jsc =
    new JavaStreamingContext(conf, Durations.milliseconds(2000));
final JavaSparkContext context = jsc.sc();
final Broadcast<JavaSparkContext> broadcastedFieldNames = context.broadcast(context);
Here's what I'm trying to achieve:
1. We have an XML event that is coming from Kafka.
2. In the XML event we have one HDFS file path (hdfs:localhost//test1.txt).
3. We are using the Spark StreamingContext to create a DStream and fetch the XML. Using a map function we are reading the file path in each XML.
4. Now we need to read the file from HDFS (hdfs:localhost//test1.txt).
To read this I need sc.readfile, so I'm trying to broadcast the Spark context to the executors for a parallel read of the input file.
Currently we are using an HDFS file read, but that will not read in parallel, right?
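For illustration, here is a hedged sketch of what the HDFS read in steps 3-4 might look like inside the DStream transformation, assuming Spark 2.x and the Hadoop FileSystem API; the stream variable xmlEvents and the helper extractPath are placeholders for whatever parsing the map function already does.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.streaming.api.java.JavaDStream;

// xmlEvents is the DStream of XML strings from Kafka (step 3); extractPath(...) stands in
// for whatever XML parsing pulls out the HDFS path (step 2).
JavaDStream<String> fileContents = xmlEvents.mapPartitions(events -> {
    Configuration hadoopConf = new Configuration();
    List<String> contents = new ArrayList<>();
    while (events.hasNext()) {
        String hdfsPath = extractPath(events.next());     // e.g. an hdfs:// URI from the event
        Path path = new Path(hdfsPath);
        FileSystem fs = path.getFileSystem(hadoopConf);   // resolved on the executor via the Hadoop API
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            contents.add(reader.lines().collect(Collectors.joining("\n")));
        }
    }
    return contents.iterator();                           // Spark 2.x mapPartitions expects an Iterator
});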
You can't delete rows using Apache Spark, but if you use Spark as an OLAP engine to run SQL queries, you could also check out Apache CarbonData (incubating): it provides support for updating and deleting records and is built on top of Spark.

Writing same Spark Streaming Output to different destinations

I have a DStream and I want to write each element to a socket and to a Cassandra DB.
I found a solution that uses Apache Kafka and two consumers: one writes to the database and the other writes to the socket.
Is there a way to do that without using this workaround?
I use Java, so please post code in this language.
You just need to apply two different actions to the RDD within the DStream: one to save to Cassandra and one to send the data to whatever other output.
Also, cache the RDD before these actions to improve performance.
(in pseudo code as I don't do Java)
dstream.foreachRDD { rdd =>
  rdd.cache()
  rdd.saveToCassandra(...)
  rdd.foreach(...) // or rdd.foreachPartition(...)
}
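Since the question asks for Java, here is a hedged sketch of the same idea. It assumes the DataStax spark-cassandra-connector Java API (CassandraJavaUtil) and a mapped bean class; the keyspace, table, socket host/port, and the MyEvent class are placeholders.
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.PrintWriter;
import java.net.Socket;

dstream.foreachRDD(rdd -> {
    rdd.cache();                                  // the RDD is consumed by two actions below

    // Action 1: save the batch to Cassandra (placeholder keyspace/table, bean class MyEvent).
    javaFunctions(rdd)
        .writerBuilder("my_keyspace", "my_table", mapToRow(MyEvent.class))
        .saveToCassandra();

    // Action 2: write each partition's elements to a socket (placeholder host/port).
    rdd.foreachPartition(events -> {
        try (Socket socket = new Socket("some-host", 9999);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            while (events.hasNext()) {
                out.println(events.next().toString());
            }
        }
    });

    rdd.unpersist();
});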
