Is it possible to Broadcast Spark Context? - apache-spark

I'm working on a scenario where I want to broadcast the Spark context and use it on the other side (the executors). Is this possible in any way? If not, can someone explain why?
Any help is highly appreciated.
final JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.milliseconds(2000));
final JavaSparkContext context = jsc.sc();
final Broadcast<JavaSparkContext> broadcastedFieldNames = context.broadcast(context);
Here's what I'm trying to achieve.
1. We have an XML event coming from Kafka.
2. In the XML event we have one HDFS file path (hdfs:localhost//test1.txt).
3. We are using the Spark streaming context to create a DStream and fetch the XML. Using a map function we read the file path in each XML event.
4. Now we need to read the file from HDFS (hdfs:localhost//test1.txt).
To read it I need something like sc.readfile, so I'm trying to broadcast the Spark context to the executors for a parallel read of the input file.
Currently we are reading the file through the HDFS client, but that will not read in parallel, right?

You can't delete rows using Apache Spark, but if you use Spark as an OLAP engine to run SQL queries, you can also check out Apache CarbonData (incubating): it provides support for updating and deleting records, and it is built on top of Spark.
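On the original question: a SparkContext is not serializable, so it cannot be broadcast or used inside executor-side code. A common workaround for step 4 above, sketched here under the assumption that pathsStream holds the HDFS paths parsed from the XML events, is to open the files on the executors directly with Hadoop's FileSystem API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Runs on the executors; no SparkContext is needed (or available) here.
val fileContents = pathsStream.mapPartitions { paths =>
  val fs = FileSystem.get(new Configuration()) // executor-local HDFS client
  paths.map { p =>
    val in = fs.open(new Path(p))
    try Source.fromInputStream(in).mkString finally in.close()
  }
}
```

Parallelism then comes from the number of partitions in the stream rather than from a broadcast context.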

Related

spark structured streaming parquet overwrite

I would like to be able to overwrite my output path in parquet format, but overwrite is not among the available output modes (append, complete, update).
Is there another solution here?
val streamDF = sparkSession.readStream.schema(schema).option("header","true").parquet(rawData)
val query = streamDF.writeStream.outputMode("overwrite").format("parquet").option("checkpointLocation",checkpoint).start(target)
query.awaitTermination()
Apache Spark only supports Append mode for the File Sink. Check out the documentation here.
You need to write code to delete the path/folder/files from the file system before writing the data.
Check out this Stack Overflow link on ForeachWriter. It will help you achieve your use case.
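For reference, a ForeachWriter as suggested above might be sketched like this; the open/process/close structure is Spark's API, while the bodies are placeholders you would fill with your own delete-then-write logic:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// Sketch: a sink that could clear the target before writing each batch.
val overwritingSink = new ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    // e.g. delete the old output files here; return true to process this partition
    true
  }
  override def process(row: Row): Unit = {
    // write the row to the target location
  }
  override def close(errorOrNull: Throwable): Unit = ()
}

val query = streamDF.writeStream.foreach(overwritingSink).start()
```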

Spark - jdbc read all happens on driver?

I have Spark reading from a JDBC source (Oracle). I specify lowerBound, upperBound, numPartitions, and partitionColumn, but looking at the web UI, all of the reading happens on the driver, not on the workers/executors. Is that expected?
In the Spark framework, in general, whatever code you write within a transformation such as map or flatMap will be executed on the executors. To invoke a transformation you need an RDD, which is created using the dataset that you are trying to compute on. To materialize the RDD you need to invoke an action, so that the transformations are applied to the data.
I believe in your case you have written a Spark application that reads the JDBC data directly on the driver. If that is the case, it will all be executed on the driver and not on the executors.
If you haven't already, try creating a DataFrame using this API.
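For completeness, a partitioned JDBC read would look roughly like this (the URL, table name, and bounds are illustrative); with these options Spark issues numPartitions parallel queries from the executors instead of a single driver-side read:

```scala
val jdbcDF = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") // placeholder URL
  .option("dbtable", "my_table")                            // placeholder table
  .option("partitionColumn", "id")  // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")     // 8 concurrent executor-side reads
  .load()
```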

What should be the input to setJars() method in Spark Streaming

val conf = new SparkConf(true)
.setAppName("Streaming Example")
.setMaster("spark://127.0.0.1:7077")
.set("spark.cassandra.connection.host","127.0.0.1")
.set("spark.cleaner.ttl","3600")
.setJars(Array("your-app.jar"))
Let's say I am creating a Spark Streaming application.
What should be the content of the "your-app.jar" file? Do I have to create it manually in my local file system and pass the path, or is it a Scala file compiled using sbt?
If it's a Scala file, please help me write the code.
Since I am a beginner, I am just trying to run some sample code.
The setJars method of the SparkConf class takes external JARs that need to be distributed to the cluster: any external drivers such as JDBC, etc.
You do not have to pass your own application JAR here, if that's what you are asking.
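To answer the packaging part of the question: a JAR like "your-app.jar" is usually produced by building the project with sbt, for example with a build.sbt along these lines (the name and version numbers are illustrative):

```scala
// build.sbt (sketch)
name := "streaming-example"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.5.0" % "provided"
```

Running `sbt package` then writes the compiled JAR under target/scala-2.12/, and that path is what you would pass to setJars.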

Spark Streaming : source HBase

Is it possible to set up a Spark Streaming job to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources. But they seem to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation on this. Is it possible to stream from HBase using the Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
Read the streaming data, convert it into HBase Puts, and then add them to an HBase table. Up to this point it is streaming, which means your ingestion process is streaming.
The stats-calculation part, I think, is batch: it uses newAPIHadoopRDD. This method treats the data-reading part as files; in this case, the "files" come from HBase, which is the reason for the following input formats:
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
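For context, the conf passed to newAPIHadoopRDD above is typically an HBase configuration with the source table set, along these lines (the table name is a placeholder):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name
```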
If you want to read the updates in HBase as a stream, then you need a handle on HBase's WAL (write-ahead log) at the back end, and then perform your operations on that. HBase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

Use Spark to Write Kafka Messages Directly to a File

For a class project, I need a Spark Java program to listen as a Kafka consumer and write all of a Kafka topic's received messages to a file (e.g. "/user/zaydh/my_text_file.txt").
I am able to receive the messages as a JavaPairReceiverInputDStream object; I can also convert it to a JavaDStream<String> (this is from the Spark Kafka example).
However, I could not find good Java syntax to write this data to what is essentially a single log file. I tried using foreachRDD on the JavaDStream object, but I could not find a clean, parallel-safe way to sink it to a single log file.
I understand this approach is non-traditional and non-ideal, but it is a requirement. Any guidance is much appreciated.
When you think of a stream, you have to think of it as something that won't stop giving out data.
Hence, if Spark Streaming had a way to save all the incoming RDDs to a single file, that file would keep growing to a huge size (and the stream isn't supposed to stop, remember? :))
In this case, though, you can make use of the saveAsTextFile utility of an RDD, which will create many part files in your output directory, depending on the batch interval specified while creating the streaming context: JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1))
You can then merge these file parts into one, using something like how-to-merge-all-text-files-in-a-directory-into-one
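The suggestion above might be sketched like this, assuming messages is the DStream of strings obtained from Kafka (the output path is illustrative):

```scala
// One directory of part files per batch; merge them afterwards if needed.
messages.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty())
    rdd.saveAsTextFile(s"/user/zaydh/batches/batch-${time.milliseconds}")
}
```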
