I am new to Spark streaming.
I followed the tutorial from this link : https://spark.apache.org/docs/latest/streaming-programming-guide.html
When I ran the code, I could see that lines were being processed, but I could not see any output with a timestamp.
I could only see this log:
14/10/22 15:24:17 INFO scheduler.ReceiverTracker: Stream 0 received 0 blocks
14/10/22 15:24:17 INFO scheduler.JobScheduler: Added jobs for time 1414005857000 ms
.....
I was also trying to save the last DStream with a foreachRDD call, but the data was not being stored.
If anyone can help me with this, it would be a great help.
I met the same problem; here is how I solved it:
change
val conf = new SparkConf().setMaster("local")
to
val conf = new SparkConf().setMaster("local[*]")
It's a mistake to use setMaster("local"): with only a single local thread, the receiver occupies it and nothing is left to run the actual computation.
Hope this is the problem you're hitting.
The print is working as evidenced by the ..... separator, only there's nothing to print: the DStream is empty. The log provided actually shows that: Stream 0 received 0 blocks.
Make sure you're sending data correctly to your Receiver.
val conf = new SparkConf().setMaster("local[*]") works
local[*]: the '*' means create as many worker threads as there are CPU cores.
If you use plain "local", only one thread is available and the receiver takes it, so no worker is left for processing; isn't that a questionable default?
Refer to:
What does setMaster `local[*]` mean in spark?
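To illustrate the point above, here is a minimal sketch of a streaming word count that reserves enough local threads (the host, port and batch interval are placeholders, not taken from the original posts):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]" reserves one thread for the receiver and one for processing;
// "local[*]" uses one thread per CPU core.
val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999) // e.g. fed by `nc -lk 9999`
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()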
Situation: I am producing a delta folder with data from a previous streaming query A, and later reading from it into another DF, as shown here:
DF_OUT.writeStream.format("delta").(...).start("path")
(...)
DF_IN = spark.readStream.format("delta").load("path")
1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up with the exception below.
2 - When I run it in the Scala REPL, however, it runs smoothly.
Not sure what is happening there, but it sure is puzzling.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)
From the Delta Lake Quick Guide - Troubleshooting:
Table schema is not set error
Problem:
When the path of the Delta table does not exist and you try to stream data from it, you will get the following error.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
Solution:
Make sure the path of a Delta table is created.
After reading the error message, I tried to be a good boy and follow the advice, so I made sure there actually IS valid data in the delta folder I am trying to read from BEFORE calling readStream, and voilà!
import java.io.File

// Returns true once the directory exists and contains at least one file.
def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).size > 0
  } else false
}

DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)

// Poll until the writing query has produced something in the delta folder.
while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}
print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)

DF_IN = spark.readStream.format("delta").load(DELTA_DIR)
It ended up working, but I had to make sure to wait long enough for "something to happen" (I don't know what exactly, TBH, but it seems that reading from delta needs some writes to be complete - maybe metadata?).
However, this is still a hack. I wish it were possible to start reading from an empty delta folder and wait for content to start pouring into it.
In my case I couldn't find the absolute path; a simple solution was using this alternative:
spark.readStream.format("delta").table("tableName")
I am not able to create an index and push data to Elasticsearch using the Elasticsearch sink:
df.writeStream
  .outputMode("append")
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")
  .option("checkpointLocation", "/tmp/")
  .option("es.resource", "index/type")
  .start
There are no errors, but unfortunately it is not working.
At times (about 1 out of 10 runs) the snippet above creates a new index but does not push the data of the DataFrame/Dataset into it. The remaining times it does not even create an index. It seems to be something with the Elasticsearch configuration properties.
Try the code below:
df.writeStream
  .outputMode("append")
  .format("es")
  .option("es.nodes", "localhost")
  .option("es.port", "9200")
  .option("checkpointLocation", "/tmp/")
  .start("index/type")
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
conf.set("es.index.auto.create", "true");
Refer to the official documentation for more help:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
From one of the OP's comments:
Solved. The time column used with the withWatermark option should be a real-time streaming value. My test data was a dump taken from production with old timestamps; after correcting the time column to the current timestamp, it worked as expected.
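As an illustration (not the OP's actual code; the column name "eventTime" and the window sizes are made up), replacing a stale event-time column with the current timestamp before applying the watermark would look roughly like this:
import org.apache.spark.sql.functions.{col, current_timestamp, window}

// Stale production timestamps fall outside the watermark immediately;
// overwriting them with the current time keeps the test rows alive.
val fresh = df.withColumn("eventTime", current_timestamp())

val counts = fresh
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()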
You need to add the elasticsearch-hadoop library from the Maven repository in order to do that: Here.
After that, use this:
df.writeStream
  .outputMode("append")
  .format("org.elasticsearch.spark.sql")
  .option("checkpointLocation", "GiveCheckPointingName")
  .option("es.port", "9200")
  .option("es.nodes", "LocalHost")
  .start("indexname/indextype")
  .awaitTermination()
We are running a Spark streaming job that retrieves files from a directory (using textFileStream).
One concern we have is the case where the job is down but files are still being added to the directory.
Once the job starts up again, those files are not picked up (since they are not new or changed while the job is running), but we would like them to be processed.
1) Is there a solution for that? Is there a way to keep track what files have been processed and can we "force" older files to be picked up?
2) Is there a way to delete the processed files?
The article below pretty much covers all your questions.
https://blog.yanchen.ca/2016/06/28/fileinputdstream-in-spark-streaming/
1) Is there a solution for that? Is there a way to keep track what files have been processed and can we "force" older files to be picked up?
The stream reader initiates its batch window using the system clock when the job/application is launched, so all files created before that point are ignored. Try enabling checkpointing.
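As a rough sketch of what enabling checkpointing looks like (the paths, app name and batch interval below are placeholders, not from the question):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder paths; replace with real locations.
val checkpointDir = "hdfs:///tmp/streaming-checkpoint"
val inputDir = "hdfs:///data/incoming"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("file-stream")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)        // enables driver metadata checkpointing
  ssc.textFileStream(inputDir).print() // your real processing goes here
  ssc
}

// Rebuilds the context from the checkpoint on restart, otherwise creates it fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()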
2) Is there a way to delete the processed files?
Deleting files might be unnecessary: if checkpointing works, Spark keeps track of which files have not yet been processed. If for some reason the files must be deleted, implement a custom input format and reader (please refer to the article) to capture the file name, and use this information as appropriate. But I wouldn't recommend this approach.
Is there a way to delete the processed files?
In my experience, I can't get the checkpointing feature to work, so I had to delete/move the processed files that entered each batch.
The way to get those files is a bit tricky, but basically they are the ancestors (dependencies) of the current RDD. What I use, then, is a recursive method that crawls that structure and recovers the names of the RDDs that begin with hdfs.
/**
 * Recursive method to extract the original metadata files involved in this batch.
 * @param rdd Each RDD created for each batch.
 * @return All HDFS files originally read.
 */
def extractSourceHDFSFiles(rdd: RDD[_]): Set[String] = {
  def extractSourceHDFSFilesWithAcc(rdd: List[RDD[_]]): Set[String] = {
    rdd match {
      case Nil => Set()
      case head :: tail => {
        val name = head.toString()
        if (name.startsWith("hdfs")) {
          Set(name.split(" ")(0)) ++
            extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        } else {
          extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        }
      }
    }
  }
  extractSourceHDFSFilesWithAcc(rdd.dependencies.map(_.rdd).toList)
}
So, inside foreachRDD you can easily invoke it:
stream.foreachRDD { rdd =>
  val filesInBatch = extractSourceHDFSFiles(rdd)
  logger.info("Files to be processed:")
  // Process them
  // Delete them when you are done
}
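If you then want to actually delete those files, one possible approach (my own sketch, not part of the original answer; it assumes the recovered names are plain HDFS paths) is to go through the Hadoop FileSystem API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Deletes each recovered file (non-recursively) once the batch has been handled.
def deleteProcessedFiles(paths: Set[String]): Unit = {
  val fs = FileSystem.get(new Configuration())
  paths.foreach(p => fs.delete(new Path(p), false))
}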
The answer to your second question:
It is now possible in Spark 3: you can use the "cleanSource" option of readStream.
Thanks to the documentation https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html and this video https://www.youtube.com/watch?v=EM7T34Uu2Gg.
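A minimal sketch of what using that option might look like (the source format and paths are placeholders, not from the question):
val df = spark.readStream
  .format("text")                                     // placeholder source format
  .option("cleanSource", "archive")                   // or "delete" to remove processed files
  .option("sourceArchiveDir", "hdfs:///data/archive") // required when archiving
  .load("hdfs:///data/incoming")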
After searching for many hours, I finally got the solution.
I am using Hazelcast "3.6.3" on Scala 2.11.8.
I have written this code:
val config = new Config("mycluster")
config.getNetworkConfig.getJoin.getMulticastConfig.setEnabled(false)
config.getNetworkConfig.getJoin.getAwsConfig.setEnabled(false)
config.getNetworkConfig.getJoin.getTcpIpConfig.setMembers(...)
config.getNetworkConfig.getJoin.getTcpIpConfig.setEnabled(true)

val hc = Hazelcast.newHazelcastInstance(config)

hc.getConfig.addMapConfig(new MapConfig()
  .setName("foo")
  .setBackupCount(1)
  .setTimeToLiveSeconds(3600)
  .setAsyncBackupCount(1)
  .setInMemoryFormat(InMemoryFormat.BINARY)
  .setMaxSizeConfig(new MaxSizeConfig(1, MaxSizePolicy.USED_HEAP_SIZE))
)

hc.putValue[(String, Int)]("foo", "1", ("foo", 10))
I notice that when the hour is over, Hazelcast does not remove the items from the cache; the items seem to live in the cache forever.
I don't want sliding expiration; I want absolute expiration, meaning that after 1 hour the item has to be kicked out no matter how many times it was accessed during that hour.
I have done the required googling and I think my code above is correct, but when I look at my server logs, I am pretty sure that nothing is removed from the cache.
Sorry, I am not a Scala guy, but can you explain what hc.addTimeToLiveMapConfig does?
Normally you need to add the TTL config to the Config object before starting Hazelcast.
I believe that in your case you are starting Hazelcast and only then updating the config with the TTL. Please try the reverse order.
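Roughly, that means building the map config first and only then creating the instance; a sketch (the map name and TTL value are taken from the question, the rest is simplified):
import com.hazelcast.config.{Config, MapConfig}
import com.hazelcast.core.Hazelcast

val config = new Config()
config.addMapConfig(
  new MapConfig()
    .setName("foo")
    .setTimeToLiveSeconds(3600) // absolute 1-hour TTL
    .setBackupCount(1)
)
// The map config is registered before the instance is created, so it takes effect.
val hc = Hazelcast.newHazelcastInstance(config)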
If you don't want to add this to the configuration, there is an overloaded map.put method that takes a TTL as input; this way you can specify the TTL per entry.
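A sketch of that per-entry TTL variant, reusing the instance and map name from above:
import java.util.concurrent.TimeUnit

val map = hc.getMap[String, (String, Int)]("foo")
// This particular entry is evicted roughly one hour after the put,
// regardless of how often it is read in the meantime.
map.put("1", ("foo", 10), 1, TimeUnit.HOURS)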
I am running Spark Streaming in local mode, pushing data into the stream by reading files from disk and pushing them into a SynchronizedQueue that belongs to a queueStream.
However, if I use a StreamingListener to catch BatchInfo objects and print the value of numRecords, it always comes out as 0.
This confuses me, because if I print the contents of my stream, using e.g. the print method, I can see that the stream is not actually empty.
Example Output:
Number of Records: 0 //printed by the StreamingListener
-------------------------------------------
Time: 1468180140000 ms
-------------------------------------------
[D@2630210a
[D@2fff9ea2
[D@5b5153cd
[D@3854e691
[D@27185f49
[D@fb2b862
[D@1e6731fb
[D@7c4ab411
[D@25f701b
[D@47b8fdd4
...
Perhaps my understanding of what is meant by a "record" is wrong? Or could there be some bug that prevents this from working correctly in local mode or with queueStreams?