What is the meaning of numRecords in a BatchInfo? - apache-spark

I am running Spark Streaming in local mode, pushing data into the stream by reading files from disk and pushing them into a SynchronizedQueue that belongs to a queueStream.
However, if I use a StreamingListener to catch BatchInfos and print the return value of the numRecords method, it always comes out as 0.
I am confused by this because if I print the contents of my stream, e.g. using the print method, I see that the stream is not actually empty.
Example Output:
Number of Records: 0 //printed by the StreamingListener
-------------------------------------------
Time: 1468180140000 ms
-------------------------------------------
[D@2630210a
[D@2fff9ea2
[D@5b5153cd
[D@3854e691
[D@27185f49
[D@fb2b862
[D@1e6731fb
[D@7c4ab411
[D@25f701b
[D@47b8fdd4
...
Perhaps my understanding of what is meant by a "record" is wrong? Or could there be some bug that prevents this from working correctly in local mode or with queueStreams?

Related

Chaining Delta Streams programmatically raising AnalysisException

Situation : I am producing a delta folder with data from a previous Streaming Query A, and reading later from another DF, as shown here
DF_OUT.writeStream.format("delta").(...).start("path")
(...)
DF_IN = spark.readStream.format("delta").load("path")
1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up with the exception below.
2 - When I run it in the Scala REPL, however, it runs smoothly.
Not sure what is happening there, but it sure is puzzling.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)
From the Delta Lake Quick Guide - Troubleshooting:
Table schema is not set error
Problem:
When the path of Delta table is not existing, and try to stream data from it, you will get the following error.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
Solution:
Make sure the path of a Delta table is created.
After reading the error message, I did try to be a good boy and follow the advice, so I tried to make sure there actually IS valid data in the delta folder I am trying to read from BEFORE calling the readStream, and voila !
import java.io.File
// Returns true if dir exists and contains at least one regular file.
def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.exists(_.isFile)
  } else false
}
DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)
// Block until the first files show up in the Delta folder.
while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}
print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)
DF_IN = spark.readStream.format("delta").load(DELTA_DIR)
It ended up working, but I had to wait long enough for "something to happen" (I don't know what exactly, TBH, but it seems that reading from Delta needs some writes to be complete, maybe metadata?).
However, this is still a hack. I wish it were possible to start reading from an empty Delta folder and wait for content to start pouring into it.
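A slightly less arbitrary version of the wait above might look for the first commit file in the table's `_delta_log` directory instead of sleeping for a fixed 30 seconds. This is only a sketch: it assumes the "something" being waited for is the first Delta commit (a zero-padded `.json` file under `_delta_log`, which carries the table schema) and that the table sits on a local filesystem. `DeltaReady`, `hasDeltaCommit` and `waitForFirstCommit` are hypothetical names, not part of Delta Lake.

```scala
import java.io.File

object DeltaReady {
  // True once `_delta_log` holds at least one commit file such as 00000000000000000000.json.
  // listFiles() returns null when the directory does not exist, hence the Option wrapper.
  def hasDeltaCommit(tableDir: File): Boolean = {
    val logDir = new File(tableDir, "_delta_log")
    Option(logDir.listFiles()).exists(_.exists(f => f.isFile && f.getName.matches("\\d+\\.json")))
  }

  // Polls until a commit exists or the timeout elapses; returns whether one was found.
  def waitForFirstCommit(tableDir: File, timeoutMs: Long, pollMs: Long = 1000L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var found = hasDeltaCommit(tableDir)
    while (!found && System.currentTimeMillis() < deadline) {
      Thread.sleep(pollMs)
      found = hasDeltaCommit(tableDir)
    }
    found
  }
}
```

The readStream call would then go after something like `DeltaReady.waitForFirstCommit(new File(DELTA_DIR), 60000L)` returns true. It is still polling, so it shares the "hack" flavor of the original, but it waits on a concrete condition instead of a guess.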
In my case I couldn't find the absolute path; a simple solution was to use this alternative:
spark.readStream.format("delta").table("tableName")

In Spark Streaming how to process old data and delete processed Data

We are running a Spark streaming job that retrieves files from a directory (using textFileStream).
One concern we are having is the case where the job is down but files are still being added to the directory.
Once the job starts up again, those files are not being picked up (since they are not new or changed while the job is running) but we would like them to be processed.
1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older files to be picked up?
2) Is there a way to delete the processed files?
The article below pretty much covers all your questions.
https://blog.yanchen.ca/2016/06/28/fileinputdstream-in-spark-streaming/
1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older files to be picked up?
The stream reader starts its batch window from the system clock at the moment the job/application is launched, so all files created before that point are ignored. Try enabling checkpointing.
2) Is there a way to delete the processed files?
Deleting files might be unnecessary. If checkpointing works, Spark identifies the files that have not been processed yet. If for some reason the files do have to be deleted, implement a custom input format and reader (please refer to the article) to capture the file name and use this information as appropriate. But I wouldn't recommend this approach.
Is there a way to delete the processed files?
In my experience, I couldn't get the checkpointing feature to work, so I had to delete/move the processed files that entered each batch.
Getting those files is a bit tricky, but basically they are ancestors (dependencies) of the current RDD. What I use, then, is a recursive method that crawls that structure and recovers the names of the RDDs that begin with hdfs.
import org.apache.spark.rdd.RDD
/**
 * Recursive method to extract original metadata files involved in this batch.
 * @param rdd Each RDD created for each batch.
 * @return All HDFS files originally read.
 */
def extractSourceHDFSFiles(rdd: RDD[_]): Set[String] = {
  def extractSourceHDFSFilesWithAcc(rdds: List[RDD[_]]): Set[String] = {
    rdds match {
      case Nil => Set()
      case head :: tail => {
        // The string form of a file-backed RDD starts with its path.
        val name = head.toString()
        if (name.startsWith("hdfs")) {
          Set(name.split(" ")(0)) ++ extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++ extractSourceHDFSFilesWithAcc(tail)
        }
        else {
          extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++ extractSourceHDFSFilesWithAcc(tail)
        }
      }
    }
  }
  extractSourceHDFSFilesWithAcc(rdd.dependencies.map(_.rdd).toList)
}
So, in the foreachRDD method you can easily invoke it:
stream.foreachRDD { rdd =>
  val filesInBatch = extractSourceHDFSFiles(rdd)
  logger.info("Files to be processed:")
  // Process them
  // Delete them when you are done
}
To answer your second question: this is now possible in Spark 3. You can use the "cleanSource" option for readStream.
Thanks to the documentation at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html and this video: https://www.youtube.com/watch?v=EM7T34Uu2Gg.
After searching for many hours, finally got the solution
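For reference, here is a small sketch of the options involved. In the Spark 3 file source, "cleanSource" accepts "archive", "delete" or "off", and "archive" additionally needs "sourceArchiveDir". The `CleanSource` helper below is a hypothetical wrapper that only builds and validates the option map; it is plain Scala and does not touch Spark itself.

```scala
object CleanSource {
  // Builds file-source options for `cleanSource`; "archive" also requires `sourceArchiveDir`.
  def options(mode: String, archiveDir: Option[String] = None): Map[String, String] = {
    require(Set("archive", "delete", "off").contains(mode), s"unsupported cleanSource mode: $mode")
    require(mode != "archive" || archiveDir.isDefined, "archive mode requires an archive directory")
    // Only emit sourceArchiveDir when it is actually used.
    val archive = if (mode == "archive") archiveDir.map(d => "sourceArchiveDir" -> d).toList else Nil
    Map("cleanSource" -> mode) ++ archive
  }
}
```

Assuming a SparkSession named spark and an input path inputDir (both placeholders), the map could be passed as spark.readStream.format("text").options(CleanSource.options("delete")).load(inputDir). Note that this applies to Structured Streaming's readStream, not to the DStream-based textFileStream from the question.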

AWS Lambda Function (Node) - Custom timeout logging

I'm wondering if there is any way of hijacking the standard "Task timed out after 1.00 seconds" log.
Bit of context : I'm streaming lambda function logs into AWS Elasticsearch / Kibana, and one of the things I'm logging is whether or not the function successfully executed (good to know). I've set up a test stream to ES, and I've been able to define a pattern to map what I'm logging to fields in ES.
From the function, I console log something like:
"\"FAIL\"\"Something messed up\"\"0.100 seconds\""
and with the mapping, I get a log structure like:
Status - Message -------------------- Execution Time
FAIL ---- Something messed up --- 0.100 seconds
... Which is lovely. However if a log comes in like:
"Task timed out after 1.00 seconds"
then the mapping will obviously not apply. If it's picked up by ES it will likely dump the whole string into "Status", which is not ideal.
I thought perhaps I could query context.getRemainingTimeInMillis() and, if it gets within maybe 10 ms of the max execution time (which you can't get from the context object??), fire the custom log and ignore the default output. This however feels like a hack.
Does anyone have much experience with logging from AWS Lambda into ES? The key to creating these custom logs with status etc is so that we can monitor the activity of the lambda functions (many), and the default log formats don't allow us to classify the result of the function.
**** EDIT ****
The solution I went with was to modify the lambda function generated by AWS for sending log lines to Elasticsearch. It would be nice if I could interface with AWS's lambda logger to set the log format, however for now this will do!
I'll share a couple key points about this:
The work done for parsing the line and setting the custom fields is done in transform() before the call to buildSource().
The message itself (full log line) is found in logEvent.message.
You don't just reassign the message in the desired format (in fact leaving it be is probably best since the raw line is sent to ES). The key here is to set the custom fields in logEvent.extractedFields. So once I've ripped apart the log line, I set logEvent.extractedFields.status = "FAIL", logEvent.extractedFields.message = "Whoops.", etc etc.
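The field-extraction logic described above can be sketched roughly as follows. The real function is the Node one generated by AWS; this is only the parsing idea, written in Scala as a standalone illustration. The "TIMEOUT" status value, the `executionTime` field name and the `LogFields` object are assumptions of mine, not anything AWS defines.

```scala
object LogFields {
  // One double-quoted segment, e.g. "FAIL".
  private val Quoted = "\"([^\"]*)\"".r
  // The default Lambda timeout line; unanchored because it is prefixed by a timestamp and request id.
  private val Timeout = """Task timed out after ([0-9.]+) seconds""".r.unanchored

  // Maps a raw log line to custom fields; returns None when the line matches neither shape.
  def extract(line: String): Option[Map[String, String]] = line match {
    case Timeout(secs) =>
      Some(Map("status" -> "TIMEOUT", "message" -> "Task timed out", "executionTime" -> s"$secs seconds"))
    case _ =>
      Quoted.findAllMatchIn(line).map(_.group(1)).toList match {
        case status :: message :: time :: Nil =>
          Some(Map("status" -> status, "message" -> message, "executionTime" -> time))
        case _ => None
      }
  }
}
```

In the actual transform(), the equivalent result would be written into logEvent.extractedFields rather than returned, and lines that match neither shape would be left untouched.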

Missing ETW EventSource table in Azure SDK 2.6

I'm trying to use ETW for logging with several custom EventSource classes in Azure SDK 2.6.
When testing locally with the compute/storage emulator, three of my custom WADMyEventXYZ tables show up; however, the final expected table "WADMyDataSets" never seems to be created. How should I determine what is causing this problem? I see no errors from the compute emulator when the debugger is attached and stepping through the code in the debugger shows that WriteEntry on the EventSource is definitely called. The other tables show up in SchemasTable in the developer storage account, but there is no entry there for WADMyDataSets.
I exported WADDiagnosticInfrastructureLogsTable to CSV, examined it in Excel, and found the following messages that reference "MyDataSets":
Validating table MyDataSets; DiskMB:451; RequiredQuota:451 RetentionSeconds:7776000 Pri:2 MinQuotaMB:0 RunningTotal:3757
Table does not exist
table C:\Users\Caleb\AppData\Local\dftmp\Resources\b316f531-f673-4db3-ac1c-e4649e289871\WAD0104\Tables\MyDataSets does not exist, CreationDisposition = 4
Table MyDataSets does not exist, will create a new one
Delaying the creation of table MyDataSets until the schema is known
Later on:
Converted EventSource provider name "MyDataSets" to {74a2b9c9-0bd8-547f-6cad-453da47055be}
Matched task with query id MyDataSetsQuery and regex ^MyDataSets$ to source table MyDataSets
Registering query MyDataSetsQuery_MyDataSets_XTableWadAccount:
Adding standard PkRk (MA) fields to 'MyDataSetsQuery_MyDataSets'
Successfully compiled the query 'MyDataSetsQuery_MyDataSets'
Added task MyDataSetsQuery_MyDataSets_WADMyDataSets_PT1M_XTableWadAccount from MyDataSets - Partitions:-1 Pri:normal TSPolicy:start StoreType:Central Repeat:2147483647 Timeout:3600s Deadline:300s DelayRange:0.00
Later on:
No checkpoint found for task MyDataSetsQuery_MyDataSets_WADMyDataSets_PT1M_XTableWadAccount after time 2015-05-13T00:44:21.000Z; retry time out is 3600 seconds
First scheduled task for MyDataSetsQuery_MyDataSets_WADMyDataSets_PT1M_XTableWadAccount is at 2015-05-13T01:44:00.000Z (plus a delay of 20s)
Later on:
Increasing query delay of task MyDataSetsQuery_MyDataSets_WADMyDataSets_PT1M_XTableWadAccount from 20 to 40 seconds to introduce randomness to the upload schedule
Later on:
Starting scheduled task MyDataSetsQuery_MyDataSets_WADMyDataSets_PT1M_XTableWadAccount from 2015-05-13T01:43:00.000Z to 2015-05-13T01:44:00.000Z; query delay 40 seconds
Table C:\Users\Caleb\AppData\Local\dftmp\Resources\b316f531-f673-4db3-ac1c-e4649e289871\WAD0104\Tables\MyDataSets does not exist
Ending scheduled task MyDataSetsQuery_MyDataSets_WADMyDataSets_PT1M_XTableWadAccount from 2015-05-13T01:43:00.000Z to 2015-05-13T01:44:00.000Z in 1ms
Update
The EventSource in question had one event on it:
[Event(1)]
public void DataSetLoaded(string traceActivityId, string userId, string reportCode, long timeToLoadMs)
Removing the fourth parameter "timeToLoadMs" resulted in the WAD event table showing up as expected. I tried changing the last parameter to a string, and it failed to show up again. Is there a documented limit on the number of parameters for an event method? I'm pretty sure I've seen samples that have four parameters.
I upgraded my web project to .NET 4.5.1 and now the WAD table shows up as expected (I had been running on just .NET 4.5 before this).
It would seem that there might be a bug with having 4 parameters on an EventSource event when using .NET 4.5.0.
As a side note, with 4.5.1, I now have the System.Diagnostics.Tracing.EventSource.SetCurrentThreadActivityId method which will let me get rid of manually including the CorrelationManager.ActivityId in my event output.
This video, released today, says there is now full support for Azure table logging for ETW EventSources: https://channel9.msdn.com/Series/ConnectOn-Demand/240

JavaDStream print() function not printing

I am new to Spark streaming.
I followed the tutorial from this link : https://spark.apache.org/docs/latest/streaming-programming-guide.html
When I ran the code, I could see the line was being processed, but I could not see output with timestamp.
I only could see this log:
14/10/22 15:24:17 INFO scheduler.ReceiverTracker: Stream 0 received 0 blocks
14/10/22 15:24:17 INFO scheduler.JobScheduler: Added jobs for time 1414005857000 ms
.....
I was also trying to save the last DStream with a foreachRDD call, but the data was not being stored.
If anyone can help me with this, it would be a great help.
I ran into the same problem; here is how I solved it:
change
val conf = new SparkConf().setMaster("local")
to
val conf = new SparkConf().setMaster("local[*]")
Using setMaster("local") is a mistake here: it runs Spark with a single thread, the receiver occupies that thread, and nothing is left to actually process the data.
Hope this is the problem you are facing.
The print is working as evidenced by the ..... separator, only there's nothing to print: the DStream is empty. The log provided actually shows that: Stream 0 received 0 blocks.
Make sure you're sending data correctly to your Receiver.
val conf = new SparkConf().setMaster("local[*]") works.
local[*]: '*' means use as many worker threads as there are logical cores on the machine.
With plain "local", only one thread is used, and the receiver takes it, so no thread is free to process the batches.
See also:
What does setMaster `local[*]` mean in spark?
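As a rough illustration of the rule of thumb from the streaming guide (use local[n] with n greater than the number of receivers), the master string can be derived like this. `LocalMaster` and `forReceivers` are hypothetical names, and the sketch is plain Scala with no Spark dependency.

```scala
object LocalMaster {
  // Builds a local master URL with one thread per receiver plus at least one for processing.
  def forReceivers(receivers: Int): String = {
    require(receivers >= 0, "receivers must be non-negative")
    // Number of logical cores, which is what `local[*]` resolves to.
    val cores = Runtime.getRuntime.availableProcessors()
    s"local[${math.max(receivers + 1, cores)}]"
  }
}
```

It would be used as new SparkConf().setMaster(LocalMaster.forReceivers(1)). Unlike local[*], this still leaves a processing thread on a single-core machine.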
